[minor] Fix minor presentation & utility issues
kyo-takano committed Jan 27, 2025
1 parent e1847b3 commit cf858ea
Showing 8 changed files with 30 additions and 27 deletions.
README.md (9 changes: 3 additions & 6 deletions)
@@ -70,7 +70,6 @@ pip install -e .

Just in case you are not familiar, here is the formulation of the scaling law estimation:

-<!-- ### Definitions -->
<details>

<summary style="font-weight: bold;">Variables</summary>
@@ -85,14 +84,12 @@ Just in case you are not familiar, here is the formulation of the scaling law es
**Intuition**:
- $E$ corresponds to the **irreducible loss** that can only be attained by an ideal model with infinite compute;
- $A / N ^ \alpha$ accounts for the additional loss coming from insufficiency of model size;
-- $ B / D ^ \beta$, insufficiency of data amount.
+- $B / D ^ \beta$, insufficiency of data amount.

</details>
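
For reference, the bullets above combine into the parametric loss that gets fitted:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$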

<details>

-<!-- ### Compute-Optimal Allocation -->
-
<summary style="font-weight: bold;">Objective</summary>

1. Optimize the parameters $\{E, A, B, \alpha, \beta\}$ to better predict losses $L_i$ from $(N_i, D_i)$
@@ -105,7 +102,7 @@ Just in case you are not familiar, here is the formulation of the scaling law es
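
For the allocation objective, the closed form implied by the fitted law under the common $C \approx 6ND$ approximation (Hoffmann et al., Approach 3) is worth keeping at hand:

$$N^{*}(C) = G\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad D^{*}(C) = G^{-1}\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}, \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}$$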
### 1. Fitting the scaling law on existing dataset

> [!NOTE]
-> An example of this usage can be found [here](examples/llm/)
+> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb)
First, prepare a CSV looking like this and save it as `df.csv`:

@@ -157,7 +154,7 @@ You can get an estimated compute-optimal allocation of compute to $N$ and $D$.
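
To make the fitting step concrete, here is a minimal, library-independent sketch of what step 1 does under the hood. This is not chinchilla's actual API, and the column names `N`, `D`, `loss` in `df.csv` are assumptions; it follows the Hoffmann et al. recipe of a Huber loss between log-predictions and log-losses, optimized with L-BFGS:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.special import logsumexp

df = pd.read_csv("df.csv")
N, D, L = df["N"].to_numpy(float), df["D"].to_numpy(float), df["loss"].to_numpy(float)

def huber(r, delta=1e-3):
    # Quadratic near zero, linear in the tails (robust to outlier runs)
    return np.where(np.abs(r) <= delta, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

def objective(theta):
    e, a, b, alpha, beta = theta  # e, a, b are log(E), log(A), log(B)
    # log(E + A/N^alpha + B/D^beta), computed stably in log-space
    pred = logsumexp([e * np.ones_like(N), a - alpha * np.log(N), b - beta * np.log(D)], axis=0)
    return huber(pred - np.log(L)).sum()

x0 = np.array([np.log(1.7), np.log(400.0), np.log(400.0), 0.3, 0.3])  # crude seed
res = minimize(objective, x0, method="L-BFGS-B")
E, A, B = np.exp(res.x[:3])
alpha, beta = res.x[3], res.x[4]
print(f"L(N, D) = {E:.3f} + {A:.1f}/N^{alpha:.3f} + {B:.1f}/D^{beta:.3f}")
```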
### 2. Scaling from scratch

> [!NOTE]
-> An example of this usage can be found [here](examples/llm)
+> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/efficientcube.ipynb)
> **Procedure**:
>
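For orientation, here is a hedged sketch of the loop this section describes, assuming the fitting sketch above is wrapped as `fit_scaling_law`; `seed_runs` and `train_and_evaluate` are hypothetical stand-ins for your own code, not chinchilla's API:

```python
def allocate_compute(A, B, alpha, beta, C):
    """Compute-optimal (N, D) for budget C, under C ~= 6*N*D (Hoffmann et al.)."""
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N = G * (C / 6) ** (beta / (alpha + beta))
    return N, (C / 6) / N

records = seed_runs()                          # hypothetical: small (N, D, loss) seed runs
C = 6 * max(r["N"] * r["D"] for r in records)  # largest compute spent so far
scaling_factor = 2.0                           # see docs/TIPS.md
while C < 1e24:                                # target budget
    E, A, B, alpha, beta = fit_scaling_law(records)  # refit on all runs so far
    C *= scaling_factor                        # one modest extrapolation step
    N, D = allocate_compute(A, B, alpha, beta, C)
    records.append(train_and_evaluate(N, D))   # hypothetical training run
```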
docs/TIPS.md (34 changes: 13 additions & 21 deletions)
@@ -65,28 +65,9 @@ The minima are smoother and more stable, allowing for easier convergence during
As a matter of fact, this technique is so effective that even a naive grid search can work almost as well as L-BFGS:
-<div style="display: flex; justify-content: center; gap: 1.5rem; align-items: center; font-size: 1.5rem;">
-  <div>
-    <img src="./imgs/algorithm.init-original.png" alt="Original Algorithm">
-  </div>
-  ➡️
-  <div>
-    <img src="./imgs/algorithm.init-improved.png" alt="Improved Algorithm">
-  </div>
-</div>
-
-## 2. Keep `scaling_factor` moderate
+![Algorithms' performance by initialization quality](imgs/algorithm.comparison.png)
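
A minimal sketch of that grid-search seeding, reusing `objective` from the fitting sketch earlier on this page; the grid bounds are illustrative assumptions:

```python
from itertools import product

import numpy as np
from scipy.optimize import minimize

grid = [
    np.log([1.0, 1.5, 2.0]),  # e = log E
    np.log([1e1, 1e2, 1e3]),  # a = log A
    np.log([1e1, 1e2, 1e3]),  # b = log B
    [0.1, 0.3, 0.5],          # alpha
    [0.1, 0.3, 0.5],          # beta
]
# Evaluate all 3^5 = 243 grid cells, then refine the best one with L-BFGS
x0 = min(product(*grid), key=lambda t: objective(np.array(t)))
res = minimize(objective, np.array(x0), method="L-BFGS-B")
```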
-Scaling compute according to the loss predictor involves ***extrapolation*** beyond the FLOPs regime used for fitting the predictor.
-
-To avoid overstepping, it's advisable to:
-
-- **Incrementally scale compute** rather than making large jumps.
-- ***Continuously update*** the scaling law as a new data point becomes available.
-
-As a rule of thumb, I would suggest using `scaling_factor=2.0` as a good starting point.
-This approach balances the compute budget by dedicating roughly half of it to scaling-law estimation and the other half to final model training.
-## 3. Beware of "failure modes"
+## 2. Beware of "failure modes"
When fitting the loss predictor, several common failure modes may arise. These are often tied to poor configurations, including:
@@ -98,6 +79,17 @@ When fitting the loss predictor, several common failure modes may arise. These are often tied to poor configurations, including:
![Underfitting failure](imgs/optim--underfit.jpg)
+## 3. Keep `scaling_factor` moderate
+Scaling compute according to the loss predictor involves ***extrapolation*** beyond the FLOPs regime used for fitting the predictor.
+
+To avoid overstepping, it's advisable to:
+
+- **Incrementally scale compute** rather than making large jumps.
+- ***Continuously update*** the scaling law as a new data point becomes available.
+
+As a rule of thumb, I would suggest using `scaling_factor=2.0` as a good starting point.
+This approach balances the compute budget by dedicating roughly half of it to scaling-law estimation and the other half to final model training.
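
(The "roughly half" split is a geometric series: if the final run costs $C$ and each earlier run costs half of the next, the earlier runs together cost about $C/2 + C/4 + \cdots \approx C$, so estimation and the final training run each take about half of the $\approx 2C$ total.)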
---
> [!NOTE]
Binary file added docs/imgs/algorithm.comparison.png
Binary file removed docs/imgs/algorithm.init-improved.png
Binary file removed docs/imgs/algorithm.init-original.png
examples/llm/main.ipynb (14 changes: 14 additions & 0 deletions)
@@ -6,6 +6,9 @@
"id": "XT3xW5kr3dT2"
},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb)\n",
"[![GitHub Repository](https://img.shields.io/badge/-chinchilla-2dba4e?logo=github)](https://github.com/kyo-takano/chinchilla)\n",
"\n",
"# Allocating $10^{24}$ FLOPs to a single LLM\n",
"\n",
"This notebook guides you through **estimating the scaling law for LLMs** (with `vocab_size=32000`) using a subset of Chinchilla training runs (filter: $10^{18} < C \\wedge N < D$).\n",
@@ -18,6 +21,17 @@
"- How the \"20 tokens per parameter\" heuristic compares"
]
},
+{
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+    "\"\"\"Uncomment these lines if not cloning\"\"\"\n",
+    "# %pip install -U chinchilla\n",
+    "# !wget -nc https://github.com/kyo-takano/chinchilla/raw/refs/heads/preview/examples/llm/df.csv"
+  ]
+},
{
"cell_type": "code",
"execution_count": 1,
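As an aside, the subset filter mentioned in the notebook's intro cell ($10^{18} < C \wedge N < D$) can be written in pandas roughly as follows; the column names `N` and `D` are assumptions about `df.csv`:

```python
import pandas as pd

df = pd.read_csv("df.csv")
C = 6 * df["N"] * df["D"]                  # standard FLOPs approximation
df = df[(C > 1e18) & (df["N"] < df["D"])]  # keep runs with 1e18 < C and N < D
```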
Binary file removed examples/llm/simulation--optim.png
Binary file removed examples/llm/simulation--parametric_fit.png
