From 23fd5b4a1132e4b12a0addf782e59b5c2d6ca542 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Joshua=20Niemel=C3=A4?=
Date: Sat, 28 Dec 2024 13:04:14 +0100
Subject: [PATCH] Update report

---
 .gitignore                                    |     1 +
 report/assets/kde-mlp2-2-bcelosswlogits.svg   |  1141 +
 ...s.22073192-edc3-46fa-8948-84fb3d1fdb0d.svg |  1228 +
 ...ighbour_classifier_gaussian_activation.svg | 25601 ++++++++++++++++
 ...neighbour_classifier_simple_activation.svg | 25541 +++++++++++++++
 report/main.typ                               |   335 +-
 6 files changed, 53749 insertions(+), 98 deletions(-)
 create mode 100644 report/assets/kde-mlp2-2-bcelosswlogits.svg
 create mode 100644 report/assets/output_layer_mlp2-2-bcelosswlogits.22073192-edc3-46fa-8948-84fb3d1fdb0d.svg
 create mode 100644 report/assets/three_neighbour_classifier_gaussian_activation.svg
 create mode 100644 report/assets/three_neighbour_classifier_simple_activation.svg

diff --git a/.gitignore b/.gitignore
index 832b5ec..08e3f71 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,4 @@ __pycache__/
 datasets/*
 .idea/
 *.svg
+!report/assets/*.svg

diff --git a/report/assets/kde-mlp2-2-bcelosswlogits.svg b/report/assets/kde-mlp2-2-bcelosswlogits.svg
new file mode 100644
index 0000000..9cba9af
--- /dev/null
+++ b/report/assets/kde-mlp2-2-bcelosswlogits.svg
@@ -0,0 +1,1141 @@
[new Matplotlib v3.10.0 SVG figure, created 2024-12-27T12:39:46; 1141 lines of SVG markup omitted]

diff --git a/report/assets/output_layer_mlp2-2-bcelosswlogits.22073192-edc3-46fa-8948-84fb3d1fdb0d.svg b/report/assets/output_layer_mlp2-2-bcelosswlogits.22073192-edc3-46fa-8948-84fb3d1fdb0d.svg
new file mode 100644
index 0000000..e9e3d82
--- /dev/null
+++ b/report/assets/output_layer_mlp2-2-bcelosswlogits.22073192-edc3-46fa-8948-84fb3d1fdb0d.svg
@@ -0,0 +1,1228 @@
[new Matplotlib v3.10.0 SVG figure, created 2024-12-27T12:54:32; 1228 lines of SVG markup omitted]
diff --git a/report/assets/three_neighbour_classifier_gaussian_activation.svg b/report/assets/three_neighbour_classifier_gaussian_activation.svg
new file mode 100644
index 0000000..99198b1
--- /dev/null
+++ b/report/assets/three_neighbour_classifier_gaussian_activation.svg
@@ -0,0 +1,25601 @@
[new Matplotlib v3.9.2 SVG figure, created 2024-12-24T23:56:55; 25601 lines of SVG markup omitted]
diff --git a/report/assets/three_neighbour_classifier_simple_activation.svg b/report/assets/three_neighbour_classifier_simple_activation.svg
new file mode 100644
index 0000000..a9cebbd
--- /dev/null
+++ b/report/assets/three_neighbour_classifier_simple_activation.svg
@@ -0,0 +1,25541 @@
[new Matplotlib v3.9.2 SVG figure, created 2024-12-24T23:55:14; 25541 lines of SVG markup omitted]
diff --git a/report/main.typ b/report/main.typ
index 6f2872a..cff089f 100644
--- a/report/main.typ
+++ b/report/main.typ
@@ -1,10 +1,14 @@
#import "@preview/bamdone-aiaa:0.1.2": *
#import "@preview/fletcher:0.5.3" as fletcher: diagram, node, edge, shapes
+#import "@preview/plotst:0.2.0": *
+
+#set page(numbering: "1 of 1")
+#show link: underline
#show: aiaa.with(
- title: "A unifying framework for quanitifying the data propagation bottlenecks of graph representation learning methods",
+ title: "On the Bottlenecks of Graph Neural Networks and Model Benchmarking",
- bibliography: bibliography("sources.bib", style: "ieee"),
+ bibliography: bibliography("sources.bib", style: "ieee"),
authors-and-affiliations: (
  (
    name:"Mustafa Hekmat Al-Abdelamir",
@@ -24,27 +28,37 @@
  ),
  abstract: [[INSERT ABSTRACT WHEN POSSIBLE]]
)
-#show link: underline
-
#outline(
- depth: 3,
+ depth: 2,
  indent: auto,
)
+#set par(spacing: 1.5em)
+// provisional introduction partly generated by genai for INSPIRATION
+The rapid advancement of artificial intelligence (AI) has been significantly shaped by the widespread adoption of encoder-decoder transformer architectures, which have revolutionized various domains, including natural language processing, computer vision, and beyond. These architectures, characterized by their ability to model complex relationships in data through self-attention mechanisms, have achieved state-of-the-art performance across numerous tasks. However, despite their impressive capabilities, transformer models often require vast amounts of labeled data and extensive computational resources, raising concerns about their scalability and generalization in real-world applications.
+
+In light of these challenges, we want to shift our focus towards architectures that incorporate inductive biases, particularly graph neural networks (GNNs).
GNNs are designed to operate on data structured as graphs, enabling them to effectively capture the relationships and interactions between entities. By leveraging the inherent connectivity of graph-structured data, GNNs can provide a more efficient and interpretable framework for learning, particularly in domains where relationships play a crucial role, such as social networks, biological systems, and knowledge representation. + + +// why GNNS are cool, implicit bias blah blah blah + +// TODO: the paper as of right now has 0 mentions of +// topological deep learning other than the introduction, remove? = Introduction -Topological Deep Learning (TDL) is gaining traction as a novel approach @papamarkou_position:_2024 for Graph Representation Learning (GRL). -Leveraging topology, TDL has shown promising results in alleviating various limitations of Graph Neural Networks (GNNs) @horn_topological_2022 + Two often-cited related @giraldo_trade-off_2023 shortcomings in GNNs are over-smoothing and over-squashing. Over-smoothing occurs when individual node features become washed out and too similar after multiple graph convolutions @li_deeper_2018. -In message-passing, nodes send fixed-size messages to their neighbours, said neighbours aggregate the messages, update their features, then send out new messages and so on. -This process inevitably leads to information loss, as an increasingly large amount of information is compressed into fixed-size vectors. -This is known as over-smoothing @alon_bottleneck_2021. Still, perhaps because of the young nature of the field, there is limited theoretical foundation for quantifying these possible advantages. This poses a problem in making quantitative comparisons between various architectures for GRL. In this paper, we attempt to lay the foundations for various metrics to provide insights on over-squashing and over-smoothing and how they relate to the model's ability to learn. To verify and support our theoretical foundations, we also run benchmarks against other approaches used for learning on graph data @horn_topological_2022. + +// known bottlenecks + +// + = Objectives - Gain a solid understanding of the theory and application of Graph Convolutional Networks (GCNs). @@ -52,6 +66,8 @@ To verify and support our theoretical foundations, we also run benchmarks agains - Gain insights and identify various shortcomings of modern machine learning research. - Shed light on what factors make a paper more or less reproducible. - Establish a framework / template for what should be done to ensure that papers are as reproducible as possible. + +// TODO: remove these? // - Gain insights and understanding about TDL, how topology is leveraged for learning, and how it relates to the aforementioned bottlenecks @horn_topological_2022. // - Construct generalisable metrics to quantify various geometric and topological properties of GNNs and the datasets they are trained on. // - Construct a model, for instance, a transformer or CCNN @tdlbook, that can learn topological features in data and benchmark against non-topological approaches. @@ -74,8 +90,6 @@ We construct a simple problem for our GCN, for which we initially analytically d === Regression problem Our dataset / problem, $cal(D)={(G_1, y_1), ..., (G_n, y_n)}$ is a graph of three nodes $G_n in (V_n, E_n)$, with two children pointing into the root node $x_1$. For a given graph, we let the root node be set to 0, $x_1 = 0$. The children are sampled $~cal(U){0, 999}$. 
The target ($y$) is the sum of the entire graph, $sum_(v in V) x_v$. The readout is done by reading the root node value $x_1$.
-
-// TODO: Node order is flipped dunno why
#let nodes = ($x_1$, $x_2$, $x_3$)
#let edges = (
@@ -177,45 +191,179 @@ $
Since sampling only happens on one half of our dataset, this means that our class distribution is
$ P(y="true") = 1/2 - frac(1,19 dot 2) "and" P(y="false") = 1/2 + frac(1, 19 dot 2) $
We tried with the same GCN model, where our readout method is simply to check the absolute value of $x_1$. Note that we take the absolute value because we expect the model to learn subtraction, which can naturally result in negative numbers, and we want to classify according to whether $x_1 approx 0$ (the case $x_1 = x_2 + x_3$) or not. Here we got an accuracy of $approx 0.4736$. Since this is approximately the same as the distribution of the classes, we can conclude that the model has not managed to learn the objective function.

In fact, we argue that this issue is not merely a matter of data or convergence but represents a fundamental limitation of the model. To determine whether $x_1 = x_2 + x_3$, the model must be able to process interactions between the messages (e.g., by subtracting $x_2 + x_3$ from $x_1$). However, since our GNN computes $z = w (x_1 + x_2 + x_3) + b$ as the node representation of $x_1$, it lacks the capacity to explicitly represent the subtraction or compare $x_1$ against $x_2 + x_3$. The shared weights ensure that each message contributes uniformly, pushing the node representation tensor in the same direction and thereby precluding the necessary relational reasoning.

We can analytically reason that giving $x_1$ a distinct weight $w_1$ and $x_2 + x_3$ a distinct weight $w_2$ and setting $-w_1 = w_2$, the node representation of $x_1$ after the conv layer, $x_1^c$, will be:

$ x_1^c &= w_1 x_1 + w_2 (x_2 + x_3) + b \ &= w_1 x_1 + (-w_1)(x_2 + x_3) + b \ &= w_1 (x_1 - (x_2 + x_3)) + b $

When $x_1 = x_2 + x_3$, this simplifies to $x_1^c = b$, and more generally, the difference between $x_1$ and $x_2 + x_3$ is given by:

$ lr(| frac(x_1^c - b, w_1) |) $

Hence, with the correct choice of $w_1$ and $w_2$, and with $b=0$ (no bias), the hidden representation of $x_1$ directly provides the difference between $x_1$ and $x_2+x_3$. This observation implies that $x_1^c = 0$ should be classified as a positive instance, while $x_1^c eq.not 0$ warrants a negative classification.
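To make the role of the two weights concrete, here is a minimal numerical sketch (our own illustration in plain Python, not code from the project repository) of the single-step update derived above:

```python
# Single "conv" step on the three-node graph with a separate weight for the
# root (self-loop) and for the summed children, as derived above.
def root_representation(x1, x2, x3, w1, w2, b=0.0):
    # x_1^c = w1 * x1 + w2 * (x2 + x3) + b
    return w1 * x1 + w2 * (x2 + x3) + b

x2, x3 = 7.0, 12.0
w1 = 1.0   # any non-zero value works
w2 = -w1   # the analytical choice w2 = -w1, together with b = 0

# Positive instance: x_1 equals the sum of its children, so x_1^c is exactly 0.
print(root_representation(x2 + x3, x2, x3, w1, w2))  # 0.0
# Negative instance: x_1^c is proportional to the difference x_1 - (x_2 + x_3).
print(root_representation(21.0, x2, x3, w1, w2))     # 2.0
```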
-
-// IMPORTED FROM LATEX
//[FORMALLY EXPLAIN THE PROBLEM]
//
//The models fail to learn since we have no non-linearity in the model. We need some behaviour
//that allows the model to learn that the further away from 0 the difference is,
//the more likely it is that the children nodes don't sum to the root node.
//Therefore, we want a function that maps 0 to 1 and values further away from 0 to 0.
//For this, we simply chose the gaussian function \(e^{-x^2}\) which we
//appended to the readout of the model. With this we get perfect results with an
//accuracy of 100\%. On the downside, this model is very likely to not converge to an optimal solution: SHOW PLOT.
//
//We then used the function \(1-x^{2}\) which we composed with the readout. This was also able to get perfect accuracy but it was much more stable and only in rare cases it would result in a suboptimal solution of >95\%.
//accuracy of 100\%.
//
//Even still we notice that our model is really unstable, most of the time, it never converges on a solution. Since it only has two parameters, we can actually plot the loss function in 3D and analytically investigate the issue. As can be seen on figure \ref{fig:loss_surface}, there are two local minima, in the quadrant where the model parameters are both negative and positive.
//
//\begin{figure}[H]
// \centering
// \input{assets/loss_3d.pgf}
// \caption{The loss surface of the three-node graph classification problem.}
// \label{fig:loss_surface}
//\end{figure}

We opted to use the PyTorch implementation of SAGEConv @hamiltonYL17, which is similar to the GCN model but with an additional weight parameter for the self-loop.

Now if we were to train the model, it would still not converge on a solution. This is because our hidden representation $x_1^c$ needs to somehow be mapped to a real number $y in [0,1]$, and our mapping needs to be continuous and reflect that a smaller $|x_1^c|$ is closer to 1 while a larger $|x_1^c|$ is closer to 0. The natural choice here is to use a Gaussian mapping $g(x) = e^(-x^2)$. We find that we sometimes converge on a perfect solution ($w_1=-w_2$), but not always. In fact, the model appears to be unstable because of local minima in the loss landscape, see @threenodesclassgaussianlosslandscape. [TODO: WHY ARE THERE LOCAL MINIMA?]

To tackle this instability, we explored a simpler alternative mapping: $1 - x^2$. Although this mapping does not guarantee that $x_1^c$ is mapped to the interval $[0,1]$, instead providing a mapping to $]-infinity, 1]$, it is suitable for our classification task, since we interpret the model output by assigning it to the numerically closest class. In our case, since we are using classes 0 and 1, this is equivalent to thresholding the interval $[0,1]$ at its midpoint, $0.5$, although this is an entirely arbitrary choice on our end. This simpler mapping generates a concave loss function, leading to a substantial improvement in the model's convergence rate, with a success rate approaching 100%.

#figure(
  image("assets/three_neighbour_classifier_gaussian_activation.svg", width: 75%),
  caption: [Loss surface of the SAGE model on the three-nodes classification data with Gaussian activation]
) <threenodesclassgaussianlosslandscape>

#figure(
  image("assets/three_neighbour_classifier_simple_activation.svg", width: 75%),
  caption: [Loss surface of the SAGE model on the three-nodes classification data with the simpler activation]
)

We note, though, that we have so far constructed an analytical solution to the embedding-decoding problem.
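For completeness, a minimal sketch of this setup (our own code with assumed shapes and names; the report does not list its exact model definition) using PyTorch Geometric's `SAGEConv` together with the Gaussian readout:

```python
import torch
from torch_geometric.nn import SAGEConv

class ThreeNodeClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Separate weights for the self-loop and the aggregated neighbours,
        # scalar node features in and out, no bias (b = 0 in the derivation).
        self.conv = SAGEConv(1, 1, aggr="mean", bias=False)

    def forward(self, x, edge_index):
        h = self.conv(x, edge_index)   # per-node hidden representations
        root = h[0]                    # read out the root node x_1
        # Gaussian readout g(x) = exp(-x^2): close to 1 when |x_1^c| is near 0,
        # close to 0 when |x_1^c| is large.
        return torch.exp(-root ** 2)

# Children x_2 and x_3 point into the root x_1 (node index 0).
edge_index = torch.tensor([[1, 2], [0, 0]])
x = torch.tensor([[19.0], [7.0], [12.0]])  # a positive instance: 19 = 7 + 12
model = ThreeNodeClassifier()
print(model(x, edge_index))                # untrained forward pass
```

Swapping `torch.exp(-root ** 2)` for `1 - root ** 2` gives the simpler mapping discussed above.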
It would be more proper to train the model end-to-end; hence, we should employ a trainable readout step. An MLP with just 1 hidden layer with a ReLU activation and a hidden dimension of 2 should in fact be enough to approximate our mapping. This is because we want to approximate the function in @desired_readout, which consists of just two slopes, each active on its own interval; this maps exactly onto the two hidden ReLU-activated units of an MLP with one hidden layer of dimension 2. Note that the function can be "stretched" and "squeezed" about the x-axis, since we are simply interested in the intersection at $x=plus.minus 0.5$.



#let data = (
  (-1.5, -2), (0, 1), (1.5, -2)
)

#let x_axis = axis(min: -2, max: 2, step: 0.5, helper_lines: true, location: "bottom")
#let y_axis = axis(min: -2, max: 2, step: 0.5, helper_lines: true, location: "left")
#let pl = plot(data: data, axes: (x_axis, y_axis))

#graph_plot(pl, (50%, 25%), caption: "Desired readout function")

#figure(
  image("assets/output_layer_mlp2-2-bcelosswlogits.22073192-edc3-46fa-8948-84fb3d1fdb0d.svg", width: 50%),
  caption: [The readout function trained numerically (an MLP with 2 hidden dimensions and 1 hidden layer, with ReLU activation).]
) <actualtrainedreadoutout>


We train the model, but we do not converge on a correct solution. To make sure that our analytical reasoning is not at fault, we manually inject known good values for the parameters in the GNN and train just the MLP parameters. Although the training is still very unstable, we do sometimes converge on a correct solution. Hence the problem here is in our training setup. We are in fact using the CrossEntropyLoss [This is a problem but I don't know why]. We switched to BCELoss with logits (so that we may avoid compressing our MLP range to $[0,1]$ arbitrarily). This yields good results; the accuracy distribution of our trained models can be seen in @accuracy_results_mlp2-2-bcelosslogit.

When the GNN converges on the correct solution (100% accuracy), the MLP also converges roughly on the expected solution, as can be seen in @actualtrainedreadoutout.


// TODO: placeholder plot 1 MENTION BANDWIDTH
#figure(
  image("assets/kde-mlp2-2-bcelosswlogits.svg", width: 50%),
  caption: [Estimate for the distribution of solution accuracies when training our GNN on the three-nodes classification problem.]
) <accuracy_results_mlp2-2-bcelosslogit>


=== Discussion

The experiments conducted with the three-node GCN provide fertile ground for understanding the limitations, bottlenecks, and issues associated with a simplistic implementation of Graph Convolutional Networks (GCNs). Below, we summarize the key challenges encountered in our exploration.

==== Lack of Relational Reasoning because of shared weights
In the classification problem, the model failed to learn the objective function. The task of determining whether $x_1 = x_2 + x_3$ required the ability to compare $x_1$ against $x_2 + x_3$, which the GCN's shared weights and uniform aggregation mechanisms could not facilitate. This highlights a fundamental limitation: GCNs, as implemented, cannot explicitly represent or process inter-relational reasoning between node features.

==== Dataset and Class Imbalance
The slight imbalance in the class distribution (stemming from the probabilistic nature of $x_1 = x_2 + x_3$) posed an additional challenge in the evaluation of our experiments.
We needed to know the exact distribution of our classes, so that when the model's accuracy matched the class distribution baseline, we could conclusively say that it failed to learn the target task beyond random guessing.

==== Loss Landscape Challenges
We encountered training stability issues: the loss landscape included local minima, which hindered convergence. Switching to a simpler mapping (e.g., $1 - x^2$) improved stability but introduced arbitrary thresholding decisions and limited us in, for example, the choice of loss function (we couldn't use BCE even though it is the obvious best choice). Adding more parameters or activations in general harms the stability of the numerical optimization.

==== The Need for a Proper and Expressive Readout
The classification problem highlighted the necessity for a trainable readout mechanism to map hidden representations to target outputs. While introducing an MLP improved the mapping, its training remained unstable, especially under CrossEntropyLoss. Switching to BCELoss with logits yielded better results, but the instability persisted, indicating a need for further refinement in the training setup.


=== Limitations

==== Contrived Problem Simplicity
The regression task was straightforward, and the GCN successfully learned the optimal parameters. However, this success stemmed from the fact that the task mirrored the inherent aggregation mechanism of the GCN. While useful as a pedagogical exercise, the oversimplification did not stress-test the model's capabilities in complex scenarios.


== Five-nodes regression (under-reaching experiment)
This experiment is similar to the three-node experiment, but we now have five nodes instead of three. The difference is that the problem radius is now 2; hence we will need 2 hidden layers to propagate the information from the leaves to the root.
#pad(
  figure(
    diagram({
      let nodes = ($x_5$, $x_4$, $x_1$, $x_2$, $x_3$)
      let edges = (
        (3, 2),
        (4, 3),
        (1, 2),
        (0, 1),
      )
      for (i, n) in nodes.enumerate() {
        node((i,0), n, stroke: 0.5pt, name: str(i), shape: fletcher.shapes.circle)
      }
      for (from, to) in edges {
        let bend = if (to, from) in edges { 10deg } else { 0deg }
        // refer to nodes by label, e.g., <1>
        edge(label(str(from)), label(str(to)), "-|>", bend: bend)
      }
    }),
    caption: [The five nodes graph]
  )
)

With one hidden layer we do not converge on a solution because of under-reaching, but with two hidden layers, using our previous method, we converge on a correct solution. [TODO: Write more on this if we end up using it].
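To illustrate under-reaching concretely, here is a small sketch (our own, with an assumed node ordering; not taken from the project code) showing that with one message-passing layer a perturbation at a leaf never reaches the root, while with two layers it does:

```python
import torch
from torch_geometric.nn import SAGEConv

# Five-node graph x_3 -> x_2 -> x_1 <- x_4 <- x_5, messages flowing towards the root.
# Assumed node indices: 0 = x_1 (root), 1 = x_2, 2 = x_3, 3 = x_4, 4 = x_5.
edge_index = torch.tensor([[1, 2, 3, 4],
                           [0, 1, 0, 3]])

torch.manual_seed(0)
conv = SAGEConv(1, 1, aggr="mean", bias=False)

def root_after(x, n_layers):
    h = x
    for _ in range(n_layers):
        h = conv(h, edge_index)
    return h[0].item()

x = torch.tensor([[0.0], [3.0], [5.0], [2.0], [7.0]])
x_perturbed = x.clone()
x_perturbed[4] = 100.0  # change the leaf x_5

# One layer: the root only sees its direct children, so the leaf change is invisible.
print(root_after(x, 1) == root_after(x_perturbed, 1))  # True
# Two layers: the leaf information has reached the root.
print(root_after(x, 2) == root_after(x_perturbed, 2))  # False
```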
+ + +#pad( + figure( + diagram({ + let nodes = ($x_5^0$, $x_4^0$, $x_1^0$, $x_2^0$, $x_3^0$, $x_5^1$, $x_4^1$, $x_1^1$, $x_2^1$, $x_3^1$, $x_5^2$, $x_4^2$, $x_1^2$, $x_2^2$, $x_3^2$) + let edges = ( + // (3, 2), + // (4, 3), + // (1, 2), + // (0, 1), + // (3+5, 2+5), + // (4+5, 3+5), + // (1+5, 2+5), + // (0+5, 1+5), + // (3+10, 2+10), + // (4+10, 3+10), + // (1+10, 2+10), + // (0+10, 1+10), + (3, 3+5), + (4, 4+5), + (1, 1+5), + (0, 0+5), + (2, 2+5), + (3+5, 3+5+5), + (4+5, 4+5+5), + (1+5, 1+5+5), + (0+5, 0+5+5), + (2+5, 2+5+5), + (3, 2+5), + (4, 3+5), + (1, 2+5), + (0, 1+5), + (3, 2+5+5), + (4, 3+5+5), + (1, 2+5+5), + (0, 1+5+5), + (3+5, 2+5+5), + (4+5, 3+5+5), + (1+5, 2+5+5), + (0+5, 1+5+5), + ) + for (i, n) in nodes.enumerate() { + let pos = calc.rem(i, 5) + let ypos = calc.floor(i/5) + node((pos,ypos), n, stroke: 0.5pt, name: str(i), shape: fletcher.shapes.circle) + } + for (from, to) in edges { + let bend = if (to, from) in ((0,16),(4,12)) { 10deg } else { 0deg } + // refer to nodes by label, e.g., <1> + edge(label(str(from)), label(str(to)), "-|>", bend: bend) + } + }), + caption: [The five nodes graph 2 GNN layers] + ) +) + + + += "Tree neighbors-match" // TODO: IS THIS NECCESARY TO KEEP // Our tree neighbours using the approximation for birthday problem has almost 0 likelihood for collisions between test and train // For each given depth \(d\), we have \(2^{d}! \cdot 2^{d}\) (\(2^{d}!\) permutations of the bottom layer, \(2^{d}\) possible root labels) possible trees / samples. We notice this means that for \(d=2\) and \(d=3\), we only get \(96\) and \(322560\) unique trees respectively. @@ -226,13 +374,6 @@ This is not much better than the GCN model. [THIS IS DEPENDENT ON EPSILON; EXPL - -== Five-nodes -TODO: blah blah we say how we generalised it to more nodes and it still worked - -= "Tree neighbors-match" - - == Introduction // TODO: mention that they also came up with the concept of oversquashign and that this should be a naive trivial solution that works In the paper by Alon et al. (2021) @alon_bottleneck_2021, the authors investigate the impact of modifying the last layer of Graph Convolutional Networks (GCNs) to be fully adjacent, meaning that it connects all nodes in the graph such that any node can send a message to any other node with just one intermediary node. This modification is posited to enhance the model's ability to capture global information from the graph, thereby improving its performance on various tasks such as node classification and link prediction. The authors provide empirical evidence demonstrating that this architectural change consistently leads to better results across different datasets and benchmarks. As part of this study, they construct a graph in which they claim to demonstrate the issue of over squashing. In this section, we aim to replicate their findings to validate their practical demonstration of over-squashing and the effectiveness of the fully adjacent last layer in GCNs. @@ -243,16 +384,17 @@ We let $A$ represent the adjacency matrix for a given tree in the dataset. Let $ A_(i,j)= cases( 1 "if" i=j, - 1 "if" floor(i/2)=j, 0 "else") + 1 "if" floor(i/2-1)=j, 0 "else") $ + The reason for this is as follows: -For a given node $n$, its parent node is determined by the position of $n$ on the layer. This position is calculated as the index of the node minus the sum of the number of nodes in all previous layers, or $n - sum _(i in [0 ... log_2(n)-1])$. 
To find the corresponding node in the previous layer, which has half as many nodes, we divide this position by 2 and round down (hence the addition of 1/2 and use of the floor function +For a given node $n$, its parent node is determined by the position of $n$ on the layer. This position is calculated as the index of the node minus the sum of the number of nodes in all previous layers, or $n - sum _(i in [0 ... log_2(n)-1]) 2^i$. To find the corresponding node in the previous layer, which has half as many nodes, we divide this position by 2 and round down (hence the addition of 1/2 and use of the floor function // TODO: fix this ugly sum notation $floor(frac(n - sum _(i in [0 ... log_2(n)-1]) 2^i + 1, 2))$). To find the index of this new position, we must add the number of nodes that came before this layer, or $sum _(i in [0 ... log_2(n)-2]) 2^i$. When we write out and simplify the expression, we get: -$ "parent"(n) &= sum _(i in [0 ... log_2(n)-2]) 2^i + floor(frac(n - sum _(i in [0 ... log_2(n)-1]) 2^i + 1, 2)) \ +$ "parent"(n) &= sum_(i in [0 ... log_2(n)-2]) 2^i + floor(frac(n - sum _(i in [0 ... log_2(n)-1]) 2^i + 1, 2)) \ &= floor(frac(sum _(i in [1 ... log_2(n)-1]) 2^i + n - sum _(i in [0 ... log_2(n)-1]) 2^i + 1, 2)) \ @@ -272,7 +414,7 @@ Although in @alon_bottleneck_2021, the authors present their findings on the tre This has led us to implement the fully adjacent last layer and compare the model with and without Last-FA. === Model architecture // TODO: redo this experiment??? -The two node features are embedded in a 32 dimensional space using a linear layer with trainable weights without a bias parameter. We used ReLU as our activation function and mean as our graph convolution aggregator. The models have $d+1$ layers, where $d$ is the depth of the trees in our given dataset. +The two node features are embedded in a 32 dimensional space using a linear layer with trainable weights without a bias parameter. We used `ReLU` as our activation function and mean as our graph convolution aggregator. The models have $d+1$ layers, where $d$ is the depth of the trees in our given dataset. We use normalisation as implemented in PyTorch Geometric. We used Adam and a reduce LR on plateau scheduler with [TODO: PARAMS?]. The last fully adjacent layer connects every single node to the root, we can omit the remaining pairwise connections since the resulting messages don't get propagated to the root before we finish the message passing. Let $r$ be the root node index, we then have: $E_("FA") subset.eq {(i, r) bar i in V}$. @@ -313,6 +455,10 @@ In the #link("https://github.com/tech-srl/bottleneck/")[code that the original a == Introduction // TODO: write +In message-passing, nodes send fixed-size messages to their neighbours, said neighbours aggregate the messages, update their features, then send out new messages and so on. +This process inevitably leads to information loss, as an increasingly large amount of information is compressed into fixed-size vectors. This is known as over-squashing @alon_bottleneck_2021. + + // TODO: find name for this placeholder == placeholder (Fully adjacent mean injection layer) We suspect that over-squashing is caused by the fact that long range interactions are being @@ -326,20 +472,20 @@ We do this by building on top of the SAGE framework @hamiltonYL17 and we simply //TODO source???? 
$ x^(l+1)_i &= w^l_1 x^l_i + w^l_2 dot "AGG"({x^l_j: j in cal(N)(i)})+f^l (x^l_i, x^l_G)+b^l\
x^(l+1)_G &= "AGG"({x^l_j: j in V^l})
$

In this setup, we have two trainable weights: one for the local aggregation and one for the self-loop. Furthermore, we have a trainable bias. Lastly, there is $f^l$, a trainable differentiable function; in our case we let it be a two-layer MLP with a hidden dimension of 512 and `tanh` activation. The method of aggregation used in our experiments was mean aggregation, but any aggregator could be used.
The remaining normal GCN layers were set to have 128 hidden dimensions.
The global node is equivalent to performing a global aggregation of the graph. We can then compute a weighting for information from the global node to the local nodes, which is then added to each of the local nodes.

=== Experiment
To make our experiment as reproducible as possible, we chose to use Bayesian hyperparameter optimisation (BO), first randomly initialising with 5 points in our hyperparameter space, and then searching with BO for an additional 15 new sets of parameters. Each parameter set was then trained once, and the validation loss was used to select the best set. The validation loss was used as our objective, with a tiny added penalty of $4 dot 10^(-7) dot #`epochs`$; the motivation behind this additional penalty term was merely to reduce training times, since very similar performance with significantly fewer epochs would be a preferred solution in this optimisation problem.

The parameters that were searched across were:
$log_10 (#`epochs`) in [2, 3.5[$, $log_10 (#`learning_rate`) in [-4, -1[$ and $log_10 (#`C`) in [-4, -2[$.
`C` was used as the `weight_decay` argument in the PyTorch implementation of Adam; this is equivalent to applying L2 regularisation.

We performed this search across hidden layer depths 1 through 3, both with and without our TODO: placeholder method on the last layer.

Each of these configurations was then trained on and evaluated 15 or more times.
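For reference, the search described above could be set up roughly as follows. This is a sketch using scikit-optimize (our choice for illustration; the report does not state which BO implementation was used), and `train_and_validate` is a hypothetical helper standing in for one training run that returns a validation loss:

```python
from skopt import gp_minimize
from skopt.space import Real

space = [
    Real(2.0, 3.5),    # log10(epochs)
    Real(-4.0, -1.0),  # log10(learning_rate)
    Real(-4.0, -2.0),  # log10(C), used as Adam's weight_decay (L2 regularisation)
]

def objective(params):
    log_epochs, log_lr, log_c = params
    epochs = int(round(10 ** log_epochs))
    val_loss = train_and_validate(epochs=epochs, lr=10 ** log_lr, weight_decay=10 ** log_c)
    # Tiny penalty so that, at similar validation loss, fewer epochs are preferred.
    return val_loss + 4e-7 * epochs

# 5 random initial points followed by 15 BO-guided evaluations (20 calls in total).
result = gp_minimize(objective, space, n_initial_points=5, n_calls=20, random_state=0)
print(result.x, result.fun)
```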
=== Results
We compared the two methods with 1 to 3 hidden layers. One hidden layer performed better across the board, so we only plot that depth for the distributions; the Q-Q plot contains all three layer depths.
The observations were truncated to the 25th to 100th percentile range to remove subpar runs.


// TODO: MENTION BANDWIDTH
#figure(image("assets/combined_plot.svg", width: 90%),
  caption: [Comparison between the TODO: placeholder and base models.]
)
@@ -385,6 +533,7 @@ Bad nodes for the edge $i~j$ are defined as ${i, j} union \#_triangle (i,j)$.
We start off by finding the neighbours of $i$, then remove the bad nodes. These will be our first set of candidate nodes. For each of these candidate nodes, we then found their neighbours, removed the bad nodes again, and then added the nodes that yielded a path to $j$ (i.e., there exists an edge from this neighbouring candidate to $j$). The $gamma_max$ term was found by computing the 4-cycles on $i~j$ and counting how many times a neighbouring node was included in the cycle. This was then also done for $j~i$.
// TODO: maybe the notation above can be explained by citation instead of re-explaining what the SDRF paper said
We chose to not explore this implementation further due to computational constraints and moved on to trying to replicate the results using the authors' optimised implementation.
@@ -401,7 +550,7 @@ The biggest bottleneck in the SDRF algorithm is computing the Ricci or Balanced
$
"Ric"(i, j) &:= 2/d_i + 2/d_j - 2 + 2 (|\#_triangle (i, j)|)/(max{d_i, d_j}) + 2 (|\#_triangle (i, j)|)/(min{d_i, d_j})\
&+ 2 ((gamma_max (i, j))^(-1))/(max{d_i, d_j})(|\#_square (i, j)|+|\#_square (j, i)|)
$
Let $B_r(i)$ be the set of nodes reachable from $i$ within $r$ hops. We need to recompute the BFC for all edges in $B_2(i) union B_2(j)$ every time we add or remove an edge $i ~ j$.
@@ -422,10 +571,36 @@ We chose to not investigate this algorithm further due to the aforementioned iss
== TOGL (TOpological Graph Layer)
// Performance-wise, we notice that the persistence diagram computation runs the same, when single-threaded, on both a low-end CPU with just 3MB of cache and a mid-range CPU core with 16MB of cache.
//This confirms our suspicion that the random ordering of nodes in the filtration process, is maximally not spacially local. This is because of the uniform sampling of the nodes across the datastructure which stores the graph.
//Although the filtration could in principle be computed with high effeciency, in practice, the computation is done from the very slow system RAM and only sequencially, since the algorithm requires information from the previous step to compute the next step. This makes it inherently disadvantagous to algorithm with high spacial locality and algorithms relying on vectorizable operations.
// This confirms our suspicion that the random ordering of nodes in the filtration process is maximally not spatially local. This is because of the uniform sampling of the nodes across the data structure which stores the graph.
// Although the filtration could in principle be computed with high efficiency, in practice the computation is done from the very slow system RAM and only sequentially, since the algorithm requires information from the previous step to compute the next step. This makes it inherently disadvantaged compared to algorithms with high spatial locality and algorithms relying on vectorizable operations.

=== Introduction
Topological Deep Learning (TDL) is gaining traction as a novel approach @papamarkou_position:_2024 for Graph Representation Learning (GRL).

Leveraging topology, TDL has shown promising results in alleviating various limitations of Graph Neural Networks (GNNs) @horn_topological_2022.

In this section, we will utilise the TOGL framework as delineated in the paper @horn_topological_2022 as a foundational reference for our topological embedding experiments. Due to the unavailability of runnable code resulting from a missing dependency, our initial objective will be to reimplement the methodology presented in the original paper, thereby contributing to the discourse on the reproducibility of machine learning research.

Subsequently, we will select a dataset characterized by demonstrably useful topological information to investigate the application of the TOGL layer. Our aim is to assess the impact of the TOGL framework on non-standard benchmarks, specifically focusing on topologically rich datasets. We hypothesize that the TOGL layer may leverage the inherent topological structures within these datasets, potentially yielding more significant results than those reported in the original publication. This exploration will not only enhance our understanding of the TOGL framework but also contribute to the broader conversation surrounding the integration of topological methods in machine learning.

=== TOGL Layer Replication

==== The filtration
The filtration process is described quite well in the original paper; in fact, we can replicate the exact filtration process with only small modifications. We simply use a feed-forward neural network with an input layer of `n_features`, a hidden layer of 32 ReLU-activated neurons, and an output layer of `n_filtrations`. We use batching to compute all `n_filtrations` filtrations in parallel for all nodes in the graph. We then sort each filtration separately and use the indices of the sorted nodes to represent the filtration order. The implementation is quite straightforward and efficient.

==== Dim-0 Persistence Diagram

In the original paper, a comprehensive theoretical description of the procedure is provided. However, the implementation details are somewhat lacking. In our implementation, we aim to optimise the overall asymptotic runtime. Nevertheless, the filtration process is conducted sequentially for each node, as parallel execution would necessitate significantly more memory and the recalculation of connected components at each node.

In contrast, our implementation employs a modified union-find algorithm in which the rank of a node corresponds to its position in the filtration. This approach allows us to maintain a parent node for each connected component, utilizing path compression to enhance efficiency.

==== Dim-1 Persistence Diagram

The original paper provides no description, either theoretical or implementation-related, of the dim-1 persistence diagram calculation. We assume that the authors are utilizing the same implementation employed in the original paper on barcodes; however, this is not explicitly stated.
In our implementation, we adopt the original implementation from the barcode paper. + +==== The embedding layer +The original paper provides limited details regarding the embedding. We have chosen to utilise the DeepSets approach, as it appears to be the primary method referenced in the original work. While the authors indicate that they employ a method based on DeepSets, they do not specify which particular method is used. Consequently, we have opted to implement the original DeepSet layer as described in [Insert Citation]. // TODO: sub-conclusion of this section but not the whole paper == Discussion (takeaways from experiments) @@ -546,47 +721,11 @@ learning replicability: the method used to train the model is not reproducible. data replicability: data is not accesible or not possible to reproduce. methodology replicability == Reproducibility -REMOVE? +TODO: REMOVE? -== NOTES (REMOVE) +== NOTES (TODO: REMOVE) TODO: NOTE ABOUT MEAN INJECTION LAYER: Might be better to do multiple validation trains per searched param to get an average loss before moving on to the next one === bayesian search thoughts Multiple runs within the objective function?: yields an average which goes against our idea of allowing param configurations that might be unstable in training but yield good model performance Multiple bayesian searches: Maybe not neccesary? Cross validation???: Yes no? - -// They done done it Don, then they did damn did it didnt they? - - - -// TODO LIST -// 1. Epsilon explanation on three nodes - Mustafa - -// 2a. Tabular data report for three_nodes_classification - Mustafa -// See summary.jl in three_nodes_regression -// See train.py, rewrite similar `save_runs` function so we get CSV and models saved -// 2b. Reexport loss surface plots (mby plots.jl if we have time) as svg - -// 3. Do the same for five-nodes - Mustafa - -// Tree neighbors: The awful one -// 4a. Redo tree-neighbors-match with bayesian optimisation and the same setup as in -// graph_benchmark folder (main.py & summary.jl) -// optimise on log_epochs, log_learning_rate, weight_decay -// sweep through all combinations of depth (2-5), with/without Last-FA -// - Mustafa - -// 4b. Visualise the data - Josh -// 4c. Repeat 4.a with MLP aggregation -// 4d. Josh visualises stuff (again) - Josh -// 4e. Write discussion for tree neighbours - Josh/Mustafa - -// 5. TOGL - Mustafa - -// 6. what is todo:placeholder: fa to be called - Josh/Mustafa - -// 7. Write discussion for all experiments - -// 8. ML Framework - To be assigned - -// 9. Choice of template (line spacing, numbering etc)