diff --git a/README.md b/README.md
index 9166f18..b5a2988 100755
--- a/README.md
+++ b/README.md
@@ -70,7 +70,6 @@ pip install -e .
 
 Just in case you are not familiar, here is the formulation of the scaling law estimation:
 
-
 
 Variables
@@ -85,14 +84,12 @@ Just in case you are not familiar, here is the formulation of the scaling law es
 **Intuition**:
 - $E$ corresponds to the **irreducible loss** that can only be attained with an ideal model with infinite compute
 - $A / N ^ \alpha$ accounts for the additional loss coming from insufficiency of model size;
-  - $ B / D ^ \beta$, insufficiency of data amount.
+  - $B / D ^ \beta$, insufficiency of data amount.
 
-
-
 Objective
 
 1. Optimize the parameters $\{E, A, B, \alpha, \beta\}$ to better predict losses $L_i$ from $(N_i, D_i)$
@@ -105,7 +102,7 @@ Just in case you are not familiar, here is the formulation of the scaling law es
 ### 1. Fitting the scaling law on existing dataset
 
 > [!NOTE]
-> An example of this usage can be found [here](examples/llm/)
+> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb)
 
 First, prepare a CSV looking like this and save it as `df.csv`:
@@ -157,7 +154,7 @@ You can get an estimatedly compute-optimal allocation of compute to $N$ and $D$.
 ### 2. Scaling from scratch
 
 > [!NOTE]
-> An example of this usage can be found [here](examples/llm)
+> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/efficientcube.ipynb)
 
 > **Procedure**:
 >
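For context on the README hunks above: the law being fitted is $L(N, D) = E + A / N^{\alpha} + B / D^{\beta}$, and the "compute-optimal allocation" the README refers to follows from minimizing that expression subject to $C \approx 6ND$. Below is a minimal NumPy sketch of that arithmetic only; the function names are illustrative, the coefficients are roughly those reported by Hoffmann et al. (2022), and none of this is the `chinchilla` package API.

```python
import numpy as np

def predicted_loss(N, D, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C, A, B, alpha, beta):
    """Split a FLOPs budget C between parameters N and tokens D,
    minimizing the predicted loss subject to C ~= 6 * N * D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    D_opt = (1.0 / G) * (C / 6.0) ** (alpha / (alpha + beta))
    return N_opt, D_opt

# Illustrative coefficients in the ballpark of Hoffmann et al. (2022), not fitted here.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
N_opt, D_opt = optimal_allocation(1e24, A, B, alpha, beta)
print(f"N ~ {N_opt:.3g} params, D ~ {D_opt:.3g} tokens, "
      f"predicted loss ~ {predicted_loss(N_opt, D_opt, E, A, B, alpha, beta):.3f}")
```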
diff --git a/docs/TIPS.md b/docs/TIPS.md
index d2ad703..1394ce8 100644
--- a/docs/TIPS.md
+++ b/docs/TIPS.md
@@ -65,28 +65,9 @@ The minima are smoother and more stable, allowing for easier convergence during
 
 As a matter of fact, this technique is so effective that even a naive grid search can work almost as good as L-BFGS:
 
-
-
-  Original Algorithm
-  ➡️
-  Improved Algorithm
-
-
-## 2. Keep `scaling_factor` moderate
+![Algorithms' performance by initialization quality](imgs/algorithm.comparison.png)
 
-Scaling compute according to the loss predictor involves ***extrapolation*** beyond the FLOPs regime used for fitting the predictor.
-To avoid overstepping, it's advisable to:
-
-- **Incrementally scale compute** rather than making large jumps.
-- ***Continuously update*** the scaling law as a new data point becomes available.
-
-As a rule of thumb, I would suggest using`scaling_factor=2.0` as a good starting point.
-This approach balances the compute budget by dedicating roughly half of it to scaling law estimation and the other half to final model training.
-
-## 3. Beware of "failure modes"
+## 2. Beware of "failure modes"
 
 When fitting the loss predictor, several common failure modes may arise. These are often tied to poor configurations, including:
@@ -98,6 +79,17 @@ When fitting the loss predictor, several common failure modes may arise. These a
 
 ![Underfitting failure](imgs/optim--underfit.jpg)
 
+## 3. Keep `scaling_factor` moderate
+
+Scaling compute according to the loss predictor involves ***extrapolation*** beyond the FLOPs regime used for fitting the predictor.
+To avoid overstepping, it's advisable to:
+
+- **Incrementally scale compute** rather than making large jumps.
+- ***Continuously update*** the scaling law as a new data point becomes available.
+
+As a rule of thumb, I would suggest using `scaling_factor=2.0` as a good starting point.
+This approach balances the compute budget by dedicating roughly half of it to scaling law estimation and the other half to final model training.
+
 ---
 
 > [!NOTE]
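To make the fitting discussion in `docs/TIPS.md` concrete: the standard recipe from Hoffmann et al. (2022) minimizes a Huber loss on log-space residuals of a log-sum-exp parameterization, running L-BFGS from a grid of initializations and keeping the best fit, which is why initialization quality matters about as much as the optimizer itself. The sketch below is a self-contained illustration of that recipe under those assumptions, not the package's internal implementation; `fit_scaling_law` and the particular grids are made up for the example.

```python
import itertools
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber, logsumexp

def fit_scaling_law(N, D, L, delta=1e-3):
    """Fit L(N, D) = E + A / N^alpha + B / D^beta by minimizing a Huber loss
    on log-space residuals (the recipe of Hoffmann et al., 2022).

    N, D, L are 1-D arrays of model sizes, token counts, and final losses.
    Returns the fitted (E, A, B, alpha, beta)."""
    N, D, L = map(np.asarray, (N, D, L))
    logN, logD, logL = np.log(N), np.log(D), np.log(L)

    def objective(x):
        a, b, e, alpha, beta = x  # a = log A, b = log B, e = log E
        # log-sum-exp of the three loss terms, evaluated per data point
        pred = logsumexp(
            [a - alpha * logN, b - beta * logD, np.full_like(logL, e)], axis=0
        )
        return huber(delta, pred - logL).sum()

    # L-BFGS is sensitive to the starting point, so run it from a small grid
    # of initializations and keep the best result.
    best = None
    for x0 in itertools.product(
        [0.0, 5.0, 10.0], [0.0, 5.0, 10.0], [-1.0, 0.0, 1.0], [0.2, 0.5], [0.2, 0.5]
    ):
        res = minimize(objective, x0=list(x0), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res

    a, b, e, alpha, beta = best.x
    return np.exp(e), np.exp(a), np.exp(b), alpha, beta
```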
diff --git a/docs/imgs/algorithm.comparison.png b/docs/imgs/algorithm.comparison.png
new file mode 100644
index 0000000..7fa466b
Binary files /dev/null and b/docs/imgs/algorithm.comparison.png differ
diff --git a/docs/imgs/algorithm.init-improved.png b/docs/imgs/algorithm.init-improved.png
deleted file mode 100644
index 160ce51..0000000
Binary files a/docs/imgs/algorithm.init-improved.png and /dev/null differ
diff --git a/docs/imgs/algorithm.init-original.png b/docs/imgs/algorithm.init-original.png
deleted file mode 100644
index 002a9d2..0000000
Binary files a/docs/imgs/algorithm.init-original.png and /dev/null differ
diff --git a/examples/llm/main.ipynb b/examples/llm/main.ipynb
index b7fe6e4..7022026 100644
--- a/examples/llm/main.ipynb
+++ b/examples/llm/main.ipynb
@@ -6,6 +6,9 @@
     "id": "XT3xW5kr3dT2"
    },
    "source": [
+    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb)\n",
+    "[![GitHub Repository](https://img.shields.io/badge/-chinchilla-2dba4e?logo=github)](https://github.com/kyo-takano/chinchilla)\n",
+    "\n",
     "# Allocating $10^{24}$ FLOPs to a single LLM\n",
     "\n",
     "This notebook guides you through **estimating the scaling law for LLMs** (with `vocab_size=32000`) using a subset of Chinchilla training runs (filter: $10^{18} < C \\wedge N < D$).\n",
@@ -18,6 +21,17 @@
     "- How the \"20 tokens per parameter\" heuristic compares"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\"\"\"Uncomment these lines if not cloning\"\"\"\n",
+    "# %pip install -U chinchilla\n",
+    "# !wget -nc https://github.com/kyo-takano/chinchilla/raw/refs/heads/preview/examples/llm/df.csv"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 1,
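The notebook's closing question, how the allocation compares to the "20 tokens per parameter" heuristic, comes down to a short calculation once the law is fitted. A rough, self-contained illustration follows; the coefficient values are assumptions in the ballpark of Hoffmann et al. (2022), not outputs of this notebook.

```python
# Compare a compute-optimal allocation from the parametric law against the
# "20 tokens per parameter" rule, both under the approximation C ~= 6 * N * D.
A, B, alpha, beta = 406.4, 410.7, 0.34, 0.28  # assumed coefficients, for illustration
C = 1e24  # FLOPs budget

# Closed-form optimum of E + A/N^alpha + B/D^beta subject to C = 6 * N * D.
G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
N_opt = G * (C / 6.0) ** (beta / (alpha + beta))
D_opt = (1.0 / G) * (C / 6.0) ** (alpha / (alpha + beta))

# Heuristic: D = 20 * N, so C = 6 * N * (20 * N) = 120 * N^2.
N_rule = (C / 120.0) ** 0.5
D_rule = 20.0 * N_rule

print(f"parametric fit: N ~ {N_opt:.3g}, D ~ {D_opt:.3g}, D/N ~ {D_opt / N_opt:.0f}")
print(f"20 tok/param:   N ~ {N_rule:.3g}, D ~ {D_rule:.3g}, D/N = 20")
```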
diff --git a/examples/llm/simulation--optim.png b/examples/llm/simulation--optim.png
deleted file mode 100644
index 765f598..0000000
Binary files a/examples/llm/simulation--optim.png and /dev/null differ
diff --git a/examples/llm/simulation--parametric_fit.png b/examples/llm/simulation--parametric_fit.png
deleted file mode 100644
index aed5887..0000000
Binary files a/examples/llm/simulation--parametric_fit.png and /dev/null differ
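On the `scaling_factor` guidance relocated in `docs/TIPS.md` above, the "roughly half" claim is just a geometric series: if every run uses twice the compute of the previous one, the runs used to fit the scaling law sum to about one final-run's worth of FLOPs,

$$\sum_{k=1}^{\infty} \frac{C_{\text{final}}}{2^{k}} = C_{\text{final}},$$

so of the total $\approx 2\,C_{\text{final}}$, about half goes to scaling law estimation and half to the final training run.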