[v0.2.0] Merge branch 'preview'
kyo-takano committed Jan 27, 2025
2 parents 3db6ab5 + cf858ea commit 6e34bdd
Showing 27 changed files with 1,617 additions and 340 deletions.
158 changes: 121 additions & 37 deletions README.md
100644 → 100755
@@ -1,6 +1,15 @@
# `chinchilla`

![Parametric fit on LLM training runs](docs/imgs/parametric_fit.png)

`chinchilla` is a research toolkit designed to estimate scaling laws & train compute-optimal models for various deep learning tasks.

## Features

- **Scaling Law Estimation**: Fit a loss predictor based on multiple training runs.
- **Compute-Optimal Allocation**: Train the best possible model within a given compute budget.
- **Progressive Scaling**: Iteratively update the scaling law estimation and scale up the compute.
- **Simulation Mode**: Test scaling law estimations in hypothetical scenarios.

<table>
<tr>
@@ -11,7 +20,6 @@
</td>
<td>

- Scaling compute for
- Large Language Models (LLM)
- Vision Transformers (ViT)
@@ -20,76 +28,142 @@
- Knowledge distillation
- Evaluating compute efficiencies of new algorithms & architectures
- Researching the neural scaling law itself

</td>
<tr>
<td>

Probably **NOT** For...
</td>
<td>

- Fine-tuning tasks
- Data-scarce domains
- etc.

</td>

</tr>
</table>

> [!IMPORTANT]
> This work builds upon the scaling law formulation proposed in [the original Chinchilla paper](https://deepmind.google/discover/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training/) by DeepMind (2022),
> with some modifications detailed in [./docs/changes.md](https://github.com/kyo-takano/chinchilla/tree/master/docs/changes.md).
## Installation

**From PyPI**

```bash
pip install -U chinchilla
```

**From Source**

```bash
git clone https://github.com/kyo-takano/chinchilla.git
cd chinchilla
pip install -e .
```

## Prerequisite: Chinchilla formulation

Just in case you are not familiar, here is the formulation of the scaling law estimation:

<details>

<summary style="font-weight: bold;">Variables</summary>

- $N$: The number of parameters
- $D$: The number of data samples
- $C$: Total compute in FLOPs ($C\approx 6\ ND$)
- $L(N,\ D) = E + A / N ^ \alpha + B / D ^ \beta$: A loss predictor parameterized by $\{E, A, B, \alpha, \beta\}$

---

**Intuition**:
- $E$ corresponds to the **irreducible loss**, which can only be attained by an ideal model with infinite compute;
- $A / N ^ \alpha$ accounts for the additional loss due to an insufficient model size;
- $B / D ^ \beta$, for an insufficient amount of data.

</details>

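To make these variables concrete, here is a tiny, self-contained sketch (not part of the library) that evaluates the loss predictor and the compute approximation, using roughly the parameter values reported in the original Chinchilla paper purely as an example:

```python
# Standalone illustration of the formulas above; the constants are roughly the
# values fitted in the original Chinchilla paper and serve only as an example.
def predicted_loss(N: float, D: float, E: float, A: float, B: float, alpha: float, beta: float) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta"""
    return E + A / N**alpha + B / D**beta

N, D = 70e9, 1.4e12  # e.g., a Chinchilla-scale run: 70B parameters, 1.4T tokens
C = 6 * N * D        # C ~ 6ND ~ 5.9e23 FLOPs
print(f"C ~ {C:.2e} FLOPs")
print(f"predicted loss ~ {predicted_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):.3f}")
```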
<details>

<summary style="font-weight: bold;">Objective</summary>

1. Optimize the parameters $\{E, A, B, \alpha, \beta\}$ to better predict losses $L_i$ from $(N_i, D_i)$
2. Solve $\underset{N,\ D}{\operatorname{argmin}}\ L(N,\ D\ |\ C)$, whose solution can be derived in closed form from $\{A, B, \alpha, \beta\}$ (shown right after this section)

</details>

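For reference, the second objective above (compute-optimal allocation) has a closed-form solution under the approximation $C = 6ND$, as derived in the original Chinchilla paper:

```math
N_{opt}(C) = G \left( \frac{C}{6} \right)^{\frac{\beta}{\alpha + \beta}}, \qquad
D_{opt}(C) = G^{-1} \left( \frac{C}{6} \right)^{\frac{\alpha}{\alpha + \beta}}, \qquad
\text{where } G = \left( \frac{\alpha A}{\beta B} \right)^{\frac{1}{\alpha + \beta}}
```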
## Usage

### 1. Fitting the scaling law on an existing dataset

> [!WARNING]
>
> `chinchilla` requires Python >= 3.8

> [!NOTE]
> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb).

First, prepare a CSV like the following and save it as `df.csv`:

```csv
C,N,D,loss
1.3972367362937152e+18,73824672,3154403320,3.405928
1.7656304230443515e+18,89818214,3276303602,3.325255
2.0558971596900728e+18,105811837,3238291053,3.300442
...
```

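The columns mirror the variables defined above. As a quick sanity check (a sketch assuming you have `pandas` installed), you can verify that $C \approx 6ND$ holds for each row:

```python
import pandas as pd

df = pd.read_csv("df.csv")
# Each ratio should be close to 1.0 if C was logged as ~6 * N * D
print((df["C"] / (6 * df["N"] * df["D"])).describe())
```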
Second, define a grid of initial parameters for the fit, like so:

```python
import numpy as np
from chinchilla import Chinchilla

cc = Chinchilla(
    "./",  # Assuming `df.csv` is under ./
    param_grid=dict(
        E=np.linspace(1, 2, 5),
        a=np.linspace(1, 10, 5),  # a: log(A)
        b=np.linspace(1, 10, 5),  # b: log(B)
        alpha=np.linspace(0.1, 0.7, 5),
        beta=np.linspace(0.1, 0.7, 5),
    ),
)
```

Finally, call `cc.fit()` and you'll get the parameters fitted to your dataset, which you can access as `cc.params`:

```python
>>> cc.fit()
>>> cc.params
{'E': 1.7004437920205586,
'A': 185.388090185727,
'B': 1627.0012474587165,
'alpha': 0.28923265350161337,
'beta': 0.3556020928031086}
```

By calling `cc.allocate_compute` with a compute budget in FLOPs, like

```python
cc.allocate_compute(C=1e24)
```

you can get an estimated compute-optimal allocation of compute to $N$ and $D$.

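As a cross-check (a sketch using only the fitted values, not the library's internals), the same allocation can also be computed directly from `cc.params` via the closed form shown earlier:

```python
# Hypothetical cross-check: closed-form allocation from the fitted parameters.
p = cc.params
G = (p["alpha"] * p["A"] / (p["beta"] * p["B"])) ** (1 / (p["alpha"] + p["beta"]))
C = 1e24
N_opt = G * (C / 6) ** (p["beta"] / (p["alpha"] + p["beta"]))
D_opt = (1 / G) * (C / 6) ** (p["alpha"] / (p["alpha"] + p["beta"]))
print(f"N ~ {N_opt:.3e} parameters, D ~ {D_opt:.3e} samples")
```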
### 2. Scaling from scratch

> [!NOTE]
> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/efficientcube.ipynb)
> **Procedure** (sketched in code below):
>
> - `seed`: Sample X training runs $(N_i, D_i, L_i)$, referred to as **seeds**
> - For i = 0 to K:
> - `fit`: Optimize the scaling law parameters to fit $L(N,\ D)$ on the training runs
> - `scale`: Configure a new model with a **scaled** compute
> - Evaluate the allocation by training a model
> - `append`: Add the result to the database of training runs
Below is an example to get started with `chinchilla`.

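Before that full example, here is a rough sketch of the loop above. The method names mirror the step names (`seed`, `fit`, `scale`, `append`), but the exact signatures and the `train_and_evaluate` stub are assumptions, not the verbatim API — see the example below and the API reference for the actual interface.

```python
# Hypothetical sketch of the seed -> (fit -> scale -> train -> append) loop.
# Signatures below are assumptions; consult the API reference for the real ones.
import numpy as np
from chinchilla import Chinchilla

cc = Chinchilla(
    "./",
    param_grid=dict(
        E=np.linspace(1, 2, 5),
        a=np.linspace(1, 10, 5),
        b=np.linspace(1, 10, 5),
        alpha=np.linspace(0.1, 0.7, 5),
        beta=np.linspace(0.1, 0.7, 5),
    ),
)

def train_and_evaluate(N: int, D: int) -> float:
    """Placeholder: build a model with ~N parameters, train it on D samples,
    and return the evaluation loss."""
    raise NotImplementedError

cc.seed(10)                          # assumed: sample & record a batch of small seed runs
for _ in range(5):                   # K scaling steps
    cc.fit()                         # re-estimate {E, A, B, alpha, beta} on all runs so far
    N, D = cc.scale()                # assumed: allocate a scaled-up compute budget to N and D
    loss = train_and_evaluate(N, D)  # your training routine
    cc.append(N=N, D=D, loss=loss)   # assumed: add the new run to the database
```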
@@ -143,7 +217,9 @@ Ensure you define functionally equivalent versions of:
- `YourModelClass`: Your model class definition.
- `train_and_evaluate`: Function to train and evaluate your model.

<details>

<summary style="font-size: 1.5rem; font-weight: bold;"> Simulation Mode</summary>

You can also visualize how `chinchilla` would perform under the given setup and a hypothetical scaling law, optionally with a **_noise term_**:

@@ -166,17 +242,25 @@ cc.simulate(
)
```

</details>

## Examples

Find practical applications/examples of `chinchilla` in the [`examples`](https://github.com/kyo-takano/chinchilla/tree/master/examples) directory (more to come):

- [Allocating $10^{24}$ FLOPs to a single LLM](https://github.com/kyo-takano/chinchilla/blob/master/examples/llm) [NEW]

- [Scaling Rubik's Cube Solvers from Scratch](https://github.com/kyo-takano/chinchilla/blob/master/examples/efficientcube.ipynb)

## Documentation

- [API Reference](https://github.com/kyo-takano/chinchilla/tree/master/docs/api-reference.md)

- [Tips](https://github.com/kyo-takano/chinchilla/tree/master/docs/TIPS.md)

- [Math](https://github.com/kyo-takano/chinchilla/tree/master/docs/math.md)

- [Differences from the original Chinchilla](https://github.com/kyo-takano/chinchilla/tree/master/docs/changes.md)

## Contributing

2 changes: 1 addition & 1 deletion chinchilla/_logger.py
@@ -1,5 +1,5 @@
"""
Contains a utility function `get_logger`. This module also filters out noisy debug messages
from `matplotlib` and suppresses redundant warnings from `numpy` and `matplotlib`.
"""

1 change: 1 addition & 0 deletions chinchilla/_metrics.py
@@ -1,4 +1,5 @@
"""A few loss & weight functions you can use on demand."""

from __future__ import annotations # PEP 604 backport

import numpy as np
1 change: 1 addition & 0 deletions chinchilla/_utils.py
@@ -1,4 +1,5 @@
"""Utility functions."""

from __future__ import annotations # PEP 604 backport

import itertools
