diff --git a/README.md b/README.md
index 9166f18..b5a2988 100755
--- a/README.md
+++ b/README.md
@@ -70,7 +70,6 @@ pip install -e .
Just in case you are not familiar, here is the formulation of scaling law estimation:
-
Variables
@@ -85,14 +84,12 @@ Just in case you are not familiar, here is the formulation of the scaling law es
**Intuition**:
- $E$ corresponds to the **irreducible loss** that can only be attained by an ideal model with infinite compute;
- $A / N ^ \alpha$ accounts for the additional loss coming from insufficiency of model size;
- - $ B / D ^ \beta$, insufficiency of data amount.
+ - $B / D ^ \beta$, insufficiency of data amount.
-
-
Objective
1. Optimize the parameters $\{E, A, B, \alpha, \beta\}$ to better predict losses $L_i$ from $(N_i, D_i)$
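To make the objective concrete, here is a minimal sketch of such a fit using SciPy's L-BFGS-B. This is only an illustration of the formulation above, not `chinchilla`'s own optimizer; the data points, initial guess, and bounds are made up.

```python
# Minimal sketch of fitting L(N, D) = E + A / N**alpha + B / D**beta.
# NOT chinchilla's implementation; all numbers are invented for illustration.
import numpy as np
from scipy.optimize import minimize

def predicted_loss(params, N, D):
    E, A, B, alpha, beta = params
    return E + A / N**alpha + B / D**beta

def objective(params, N, D, L):
    # Mean squared error in log-loss space keeps small and large runs comparable.
    return np.mean((np.log(predicted_loss(params, N, D)) - np.log(L)) ** 2)

# Hypothetical (N_i, D_i, L_i) triplets from small training runs
N = np.array([1.2e7, 8.5e7, 6.4e8])
D = np.array([2.4e8, 1.7e9, 1.3e10])
L = np.array([3.95, 3.22, 2.71])

x0 = np.array([1.7, 400.0, 400.0, 0.34, 0.28])  # rough initial guess
bounds = [(1e-6, None)] * 5                      # keep every parameter positive
fit = minimize(objective, x0, args=(N, D, L), method="L-BFGS-B", bounds=bounds)
E, A, B, alpha, beta = fit.x
```

The log-space error is just one reasonable choice here; the point is only that the five parameters are estimated by minimizing prediction error over observed runs.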
@@ -105,7 +102,7 @@ Just in case you are not familiar, here is the formulation of the scaling law es
### 1. Fitting the scaling law on existing dataset
> [!NOTE]
-> An example of this usage can be found [here](examples/llm/)
+> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb)
First, prepare a CSV looking like this and save it as `df.csv`:
@@ -157,7 +154,7 @@ You can get an estimated compute-optimal allocation of compute to $N$ and $D$.
### 2. Scaling from scratch
> [!NOTE]
-> An example of this usage can be found [here](examples/llm)
+> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/efficientcube.ipynb)
> **Procedure**:
>
diff --git a/docs/TIPS.md b/docs/TIPS.md
index d2ad703..1394ce8 100644
--- a/docs/TIPS.md
+++ b/docs/TIPS.md
@@ -65,28 +65,9 @@ The minima are smoother and more stable, allowing for easier convergence during
As a matter of fact, this technique is so effective that even a naive grid search can work almost as well as L-BFGS:
-
-
-
-
- ➡️
-
-
-
-
-
-## 2. Keep `scaling_factor` moderate
+![Algorithms' performance by initialization quality](imgs/algorithm.comparison.png)
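For intuition, a "naive grid search" here just means scanning a coarse grid of candidate $(E, A, B, \alpha, \beta)$ values and keeping the best-scoring one. The sketch below is a hypothetical illustration; the grids, objective, and data are invented and it is not the search `chinchilla` performs.

```python
# Hypothetical naive grid search over (E, A, B, alpha, beta).
# Grids, objective, and data are invented for the example.
import itertools
import numpy as np

def objective(params, N, D, L):
    E, A, B, alpha, beta = params
    pred = E + A / N**alpha + B / D**beta
    return np.mean((np.log(pred) - np.log(L)) ** 2)

# Made-up observations (N_i, D_i, L_i)
N = np.array([1.2e7, 8.5e7, 6.4e8])
D = np.array([2.4e8, 1.7e9, 1.3e10])
L = np.array([3.95, 3.22, 2.71])

grid = itertools.product(
    np.linspace(0.5, 2.5, 5),  # E candidates
    np.logspace(1, 3, 5),      # A candidates
    np.logspace(1, 3, 5),      # B candidates
    np.linspace(0.2, 0.5, 4),  # alpha candidates
    np.linspace(0.2, 0.5, 4),  # beta candidates
)
best = min(grid, key=lambda p: objective(p, N, D, L))
```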
-Scaling compute according to the loss predictor involves ***extrapolation*** beyond the FLOPs regime used for fitting the predictor.
-To avoid overstepping, it's advisable to:
-
-- **Incrementally scale compute** rather than making large jumps.
-- ***Continuously update*** the scaling law as a new data point becomes available.
-
-As a rule of thumb, I would suggest using`scaling_factor=2.0` as a good starting point.
-This approach balances the compute budget by dedicating roughly half of it to scaling law estimation and the other half to final model training.
-
-## 3. Beware of "failure modes"
+## 2. Beware of "failure modes"
When fitting the loss predictor, several common failure modes may arise. These are often tied to poor configurations, including:
@@ -98,6 +79,17 @@ When fitting the loss predictor, several common failure modes may arise. These a
![Underfitting failure](imgs/optim--underfit.jpg)
+## 3. Keep `scaling_factor` moderate
+
+Scaling compute according to the loss predictor involves ***extrapolation*** beyond the FLOPs regime used for fitting the predictor.
+To avoid overstepping, it's advisable to:
+
+- **Incrementally scale compute** rather than making large jumps.
+- ***Continuously update*** the scaling law as a new data point becomes available.
+
+As a rule of thumb, I would suggest using `scaling_factor=2.0` as a good starting point.
+This approach balances the compute budget by dedicating roughly half of it to scaling law estimation and the other half to final model training.
+
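As a back-of-the-envelope check of that roughly 50/50 split (a made-up seed budget, not library code): with `scaling_factor=2.0` the budgets form a geometric series, so all runs before the last one together cost about as much as the last one.

```python
# Back-of-the-envelope check of the "roughly half" claim (seed budget is made up).
C0, scaling_factor, steps = 1e18, 2.0, 10

budgets = [C0 * scaling_factor**i for i in range(steps + 1)]
final_run = budgets[-1]         # compute spent on the final model
estimation = sum(budgets[:-1])  # compute spent on all earlier, scaling-law runs
print(final_run / (final_run + estimation))  # ~0.50: about half goes to the final run
```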
---
> [!NOTE]
diff --git a/docs/imgs/algorithm.comparison.png b/docs/imgs/algorithm.comparison.png
new file mode 100644
index 0000000..7fa466b
Binary files /dev/null and b/docs/imgs/algorithm.comparison.png differ
diff --git a/docs/imgs/algorithm.init-improved.png b/docs/imgs/algorithm.init-improved.png
deleted file mode 100644
index 160ce51..0000000
Binary files a/docs/imgs/algorithm.init-improved.png and /dev/null differ
diff --git a/docs/imgs/algorithm.init-original.png b/docs/imgs/algorithm.init-original.png
deleted file mode 100644
index 002a9d2..0000000
Binary files a/docs/imgs/algorithm.init-original.png and /dev/null differ
diff --git a/examples/llm/main.ipynb b/examples/llm/main.ipynb
index b7fe6e4..7022026 100644
--- a/examples/llm/main.ipynb
+++ b/examples/llm/main.ipynb
@@ -6,6 +6,9 @@
"id": "XT3xW5kr3dT2"
},
"source": [
+ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb)\n",
+ "[![GitHub Repository](https://img.shields.io/badge/-chinchilla-2dba4e?logo=github)](https://github.com/kyo-takano/chinchilla)\n",
+ "\n",
"# Allocating $10^{24}$ FLOPs to a single LLM\n",
"\n",
"This notebook guides you through **estimating the scaling law for LLMs** (with `vocab_size=32000`) using a subset of Chinchilla training runs (filter: $10^{18} < C \\wedge N < D$).\n",
@@ -18,6 +21,17 @@
"- How the \"20 tokens per parameter\" heuristic compares"
]
},
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\"\"\"Uncomment these lines if not cloning\"\"\"\n",
+ "# %pip install -U chinchilla\n",
+ "# !wget -nc https://github.com/kyo-takano/chinchilla/raw/refs/heads/preview/examples/llm/df.csv"
+ ]
+ },
{
"cell_type": "code",
"execution_count": 1,
diff --git a/examples/llm/simulation--optim.png b/examples/llm/simulation--optim.png
deleted file mode 100644
index 765f598..0000000
Binary files a/examples/llm/simulation--optim.png and /dev/null differ
diff --git a/examples/llm/simulation--parametric_fit.png b/examples/llm/simulation--parametric_fit.png
deleted file mode 100644
index aed5887..0000000
Binary files a/examples/llm/simulation--parametric_fit.png and /dev/null differ