
v0.2.0: Improve UX and performance #5

Merged

merged 2 commits on Jan 27, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -6,6 +6,7 @@ __pycache__/
# Distribution / packaging
dist/
*.egg-info/
build/

# Misc.
.pytest_cache/
158 changes: 121 additions & 37 deletions README.md
100644 → 100755
@@ -1,6 +1,15 @@
# `chinchilla`

![Parametric fit on LLM training runs](docs/imgs/parametric_fit.png)

`chinchilla` is a research toolkit designed to estimate scaling laws & train compute-optimal models for various deep learning tasks.

## Features

- **Scaling Law Estimation**: Fit a loss predictor based on multiple training runs.
- **Compute-Optimal Allocation**: Train the best possible model within a given compute budget.
- **Progressive Scaling**: Iteratively update the scaling law estimation and scale up the compute.
- **Simulation Mode**: Test scaling law estimations in hypothetical scenarios.

<table>
<tr>
@@ -11,7 +20,6 @@
</td>
<td>

- Scaling compute for
- Large Language Models (LLM)
- Vision Transformers (ViT)
@@ -20,76 +28,142 @@
- Knowledge distillation
- Evaluating compute efficiencies of new algorithms & architectures
- Researching the neural scaling law itself

</td>
<tr>
<td>

Probably **NOT** For...
</td>
<td>

- Fine-tuning tasks
- Data-scarce domains
- etc.

</td>

</tr>
</table>

> [!IMPORTANT]
> This work builds upon the scaling law formulation proposed in [the original Chinchilla paper](https://deepmind.google/discover/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training/) by DeepMind (2022),
> with some modifications detailed in [./docs/changes.md](https://github.com/kyo-takano/chinchilla/tree/master/docs/changes.md).

## Installation

**From PyPI**

```bash
pip install -U chinchilla
```

**From Source**

```bash
git clone https://github.com/kyo-takano/chinchilla.git
cd chinchilla
pip install -e .
```

## Prerequisite: Chinchilla formulation

In case you are not familiar with it, here is the formulation behind the scaling law estimation:

<details>

<summary style="font-weight: bold;">Variables</summary>

- $N$: The number of parameters
- $D$: The number of data samples
- $C$: Total compute in FLOPs ($C\approx 6\ ND$)
- $L(N,\ D) = E + A / N ^ \alpha + B / D ^ \beta$: A loss predictor parameterized by $\{E, A, B, \alpha, \beta\}$

---

**Intuition**:
- $E$ corresponds to the **irreducible loss** that can only be attained by an ideal model with infinite compute;
- $A / N ^ \alpha$ accounts for the additional loss coming from insufficiency of model size;
- $B / D ^ \beta$, insufficiency of data amount.

</details>
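For concreteness, the predictor above can be written in a few lines of Python. This is a minimal sketch of the formula itself, not `chinchilla`'s internal implementation:

```python
def predicted_loss(N: float, D: float, E: float, A: float, B: float, alpha: float, beta: float) -> float:
    """Chinchilla-style loss predictor: L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / N**alpha + B / D**beta
```

Fitting then amounts to finding the parameters $\{E, A, B, \alpha, \beta\}$ that make `predicted_loss(N_i, D_i, ...)` match the observed losses $L_i$.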

<details>

<summary style="font-weight: bold;">Objective</summary>

1. Optimize the parameters $\{E, A, B, \alpha, \beta\}$ to better predict losses $L_i$ from $(N_i, D_i)$
2. Solve $\underset{N,\ D}{argmin}\ L(N,\ D\ |\ C)$, which can be derived from $\{A, B, \alpha, \beta\}$

</details>
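For reference, under this parameterization the constrained argmin has a closed form (this follows the standard derivation in the Chinchilla paper; `chinchilla`'s solver may differ in implementation details):

$$
N_{\mathrm{opt}}(C) = G\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},\qquad
D_{\mathrm{opt}}(C) = G^{-1}\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}},\qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
$$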

## Usage

### 1. Fitting the scaling law on an existing dataset

> [!WARNING]
> `chinchilla` requires Python >= 3.8

> [!NOTE]
> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/llm/main.ipynb)

First, prepare a CSV looking like this and save it as `df.csv`:

```csv
C,N,D,loss
1.3972367362937152e+18,73824672,3154403320,3.405928
1.7656304230443515e+18,89818214,3276303602,3.325255
2.0558971596900728e+18,105811837,3238291053,3.300442
...
```
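As a quick sanity check (an illustrative snippet, not part of the library), the `C` column is consistent with the $C \approx 6\ ND$ approximation above:

```python
# First row of df.csv
C, N, D = 1.3972367362937152e+18, 73_824_672, 3_154_403_320
print(6 * N * D)      # ~1.397e+18
print(6 * N * D / C)  # ~1.0
```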

Second, define a grid of initial parameters to fit, like:

```python
import numpy as np
from chinchilla import Chinchilla
cc = Chinchilla(
    "./",  # Assuming `df.csv` is under ./
    param_grid=dict(
        E=np.linspace(1, 2, 5),
        a=np.linspace(1, 10, 5),  # a: log(A)
        b=np.linspace(1, 10, 5),  # b: log(B)
        alpha=np.linspace(0.1, 0.7, 5),
        beta=np.linspace(0.1, 0.7, 5),
    ),
)
```
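The grid above only defines the search space for the optimizer's initial guess; with 5 values per parameter it enumerates $5^5 = 3125$ candidate starting points. Note that the lowercase `a` and `b` entries parameterize `A` and `B` in log space, which is why the fitted `A` and `B` reported below are much larger than the 1–10 grid bounds. A small illustration (assuming natural log; check the docs for the exact convention):

```python
import numpy as np

# Hypothetical example: a grid value a = 5.2 would correspond to A = e^5.2
print(np.exp(5.2))  # ~181.3
```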

Finally, call `cc.fit()` and you will get the parameters fitted on your dataset, which you can access as `cc.params`:

```python
>>> cc.fit()
>>> cc.params
{'E': 1.7004437920205586,
'A': 185.388090185727,
'B': 1627.0012474587165,
'alpha': 0.28923265350161337,
'beta': 0.3556020928031086}
```
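These values plug straight into the loss form above. For instance, a hypothetical check with a made-up $(N, D)$ pair (the constants below are the fitted parameters shown above, rounded):

```python
E, A, B = 1.7004, 185.388, 1627.001
alpha, beta = 0.28923, 0.35560

N, D = 1e9, 2e10  # hypothetical model size and number of data samples
print(E + A / N**alpha + B / D**beta)  # ~2.52
```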

Calling `cc.allocate_compute` with the total FLOPs specified, like

```python
cc.allocate_compute(C=1e24)
```

gives you an estimated compute-optimal allocation of the compute to $N$ and $D$.
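For intuition, plugging the fitted parameters into the closed-form allocation given earlier yields roughly the following split for $C = 10^{24}$ FLOPs (a standalone sketch; `allocate_compute` itself may use a different solver and return format):

```python
E, A, B = 1.7004, 185.388, 1627.001
alpha, beta = 0.28923, 0.35560

C = 1e24
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
N_opt = G * (C / 6) ** (beta / (alpha + beta))
D_opt = (C / 6) ** (alpha / (alpha + beta)) / G
print(f"N = {N_opt:.2e}, D = {D_opt:.2e}")  # roughly 1.6e+11 parameters and 1.0e+12 samples
```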

### 2. Scaling from scratch

> [!NOTE]
> An example of this usage can be found [here](https://github.com/kyo-takano/chinchilla/blob/master/examples/efficientcube.ipynb)

> **Procedure**:
>
> - `seed`: Sample X training runs $(N_i, D_i, L_i)$, referred to as **seeds**
> - For i = 0 to K:
> - `fit`: Optimize the scaling law parameters to fit $L(N,\ D)$ on the training runs
> - `scale`: Configure a new model with a **scaled** compute
> - Evaluate the allocation by training a model
> - `append`: Add the result to the database of training runs

Below is an example to get started with `chinchilla`.

@@ -143,7 +217,9 @@ Ensure you define functionally equivalent versions of:
- `YourModelClass`: Your model class definition.
- `train_and_evaluate`: Function to train and evaluate your model.

<details>

<summary style="font-size: 1.5rem; font-weight: bold;"> Simulation Mode</summary>

You can also visualize how `chinchilla` would perform under the given setup and a hypothetical scaling law, optionally with a **_noise term_**:

@@ -166,17 +242,25 @@ cc.simulate(
)
```

</details>

## Examples

Find practical applications/examples of `chinchilla` in the [`examples`](https://github.com/kyo-takano/chinchilla/tree/master/examples) directory (more to come):

- [Allocating $10^{24}$ FLOPs to a single LLM](https://github.com/kyo-takano/chinchilla/blob/master/examples/llm) [NEW]
- [Scaling Rubik's Cube Solvers from Scratch](https://github.com/kyo-takano/chinchilla/blob/master/examples/efficientcube.ipynb)

## Documentation

For a detailed API Reference, tips, differences from the original Chinchilla paper, and more, see [./docs](https://github.com/kyo-takano/chinchilla/tree/master/docs):
- [API Reference](https://github.com/kyo-takano/chinchilla/tree/master/docs/api-reference.md)

- [Tips](https://github.com/kyo-takano/chinchilla/tree/master/docs/TIPS.md)

- [Math](https://github.com/kyo-takano/chinchilla/tree/master/docs/math.md)

- [Differences from the original Chinchilla](https://github.com/kyo-takano/chinchilla/tree/master/docs/changes.md)

## Contributing

2 changes: 1 addition & 1 deletion chinchilla/_logger.py
@@ -1,5 +1,5 @@
"""
Contains a utility function `get_logger`. This module also filters out noisy debug messages
from `matplotlib` and suppresses redundant warnings from `numpy` and `matplotlib`.
"""

1 change: 1 addition & 0 deletions chinchilla/_metrics.py
@@ -1,4 +1,5 @@
"""A few loss & weight functions you can use on demand."""

from __future__ import annotations # PEP 604 backport

import numpy as np
1 change: 1 addition & 0 deletions chinchilla/_utils.py
@@ -1,4 +1,5 @@
"""Utility functions."""

from __future__ import annotations # PEP 604 backport

import itertools