A machine learning pipeline for preprocessing and training models on source code, specifically optimized for Rust codebases.

Requirements:
- Python 3.10 or higher
- CUDA-capable GPU (tested with an RTX 4070; a quick availability check follows this list)
- WSL2 or Linux environment
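One quick way to confirm that a CUDA device is visible to PyTorch (once the dependencies below are installed) is a check like the following. The snippet is illustrative and not part of the package:

```python
# Quick sanity check that PyTorch can see a CUDA device (illustrative, not part of the package).
import torch

if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; training will be very slow or fail on CPU.")
```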
To install:

- Clone the repository:

```bash
git clone https://github.com/argakiig/code-preprocessor.git
cd code-preprocessor
```

- Install the package and its dependencies:

```bash
pip install -e .
```

For development, install the additional tooling as well:

```bash
pip install -e ".[dev]"
```
Process and train on a Rust codebase:
```bash
code-preprocess --config config.yaml --code-path /path/to/rust/code
```
Create a `config.yaml` file with your settings:
```yaml
# Model configuration
model:
  name: "codellama/CodeLlama-7b-hf"
  vocab_size: 50000
  max_sequence_length: 256

# Training configuration
training:
  batch_size: 2
  num_workers: 8
  epochs: 3
  gradient_accumulation_steps: 1
  eval_split: 0.1
  seed: 42

# Paths
paths:
  output_dir: "./output"
  cache_dir: "~/.cache/huggingface"

# Logging
logging:
  level: "INFO"
  file: "logs/training.log"

# Weights & Biases
wandb:
  project: "your-project-name"

# GPU configuration
gpu:
  device: "cuda"
  precision: "fp16"
  memory_efficient: true
```
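For reference, the layout above can be sanity-checked with plain PyYAML. This is only an illustration of the expected structure; the package's own loader lives in `code_preprocessor/config.py` and may behave differently:

```python
# Illustrative only: load the YAML above and check the expected top-level sections.
# The package's real config handling lives in code_preprocessor/config.py and may differ.
import yaml

EXPECTED_SECTIONS = {"model", "training", "paths", "logging", "wandb", "gpu"}

with open("config.yaml", "r", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

missing = EXPECTED_SECTIONS - set(config)
if missing:
    raise ValueError(f"config.yaml is missing sections: {sorted(missing)}")

print(config["model"]["name"])           # "codellama/CodeLlama-7b-hf"
print(config["training"]["batch_size"])  # 2
```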
The tool supports both YAML configuration files and command-line arguments. When both are provided:

- Command-line arguments take precedence over the config file settings
- Any settings not given on the command line fall back to the config file values
- If a setting appears in neither place, built-in defaults are used (see the sketch after the example below)
For example:
```bash
# This will use batch_size=4 from the CLI, but keep the other settings from config.yaml
code-preprocess --config config.yaml --code-path /path/to/code --batch-size 4
```
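The precedence can be pictured as a three-layer merge: built-in defaults, then the config file, then the CLI. The sketch below is a simplified illustration of that behaviour, not the package's actual argument handling:

```python
# Simplified illustration of the CLI > config file > defaults precedence.
# Not the package's actual argument handling.
import argparse
import yaml

DEFAULTS = {"batch_size": 2, "epochs": 3}

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True)
parser.add_argument("--code-path", dest="code_path", required=True)
parser.add_argument("--batch-size", dest="batch_size", type=int, default=None)
parser.add_argument("--epochs", type=int, default=None)
args = parser.parse_args()

with open(args.config, "r", encoding="utf-8") as fh:
    file_cfg = yaml.safe_load(fh).get("training", {})

settings = dict(DEFAULTS)                                                # 3. built-in defaults
settings.update({k: v for k, v in file_cfg.items() if k in DEFAULTS})    # 2. config file values
settings.update({k: v for k, v in vars(args).items()
                 if k in DEFAULTS and v is not None})                    # 1. CLI wins
print(settings)
```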
Available CLI arguments match the configuration file options and can be viewed with:
```bash
code-preprocess --help
```
The project includes several development tools:
- `black`: code formatting
- `isort`: import sorting
- `flake8`: linting
- `mypy`: static type checking
- `pytest`: testing
- `pre-commit`: Git hooks
Set up the pre-commit hooks:

```bash
pre-commit install
```

Run the test suite:

```bash
pytest
```
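As a minimal illustration of the testing style, a pytest case might check that a sample configuration parses into the expected sections. This is a hypothetical example, not a test shipped with the repository:

```python
# Hypothetical pytest example; not an actual test from the repository.
import yaml

SAMPLE_CONFIG = """
model:
  name: "codellama/CodeLlama-7b-hf"
  vocab_size: 50000
training:
  batch_size: 2
  epochs: 3
"""

def test_sample_config_parses_into_expected_sections():
    config = yaml.safe_load(SAMPLE_CONFIG)
    assert {"model", "training"} <= set(config)
    assert config["training"]["batch_size"] == 2
```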
Key features:

- Processes Rust source code for training
- Custom tokenizer trained on Rust code
- Memory-efficient training with LoRA and 4-bit quantization (see the sketch after this list)
- Integrated with Weights & Biases for experiment tracking
- Automatic evaluation during training
- Best model checkpoint saving
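The memory-efficient setup referenced above typically combines 4-bit quantization (bitsandbytes) with LoRA adapters (peft). The sketch below shows one common way to wire this up for CodeLlama-7b; the hyperparameters are assumptions and may differ from what `code_preprocessor/training/` actually uses:

```python
# Typical LoRA + 4-bit quantization setup; hyperparameters are illustrative,
# not necessarily those used by code_preprocessor/training/.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```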
Project layout:

```text
code-preprocessor/
├── code_preprocessor/
│   ├── __init__.py
│   ├── __main__.py
│   ├── config.py
│   ├── processor.py
│   ├── constants.py
│   ├── models/
│   ├── parsers/
│   ├── tokenizers/
│   ├── training/
│   └── utils/
├── tests/
├── .pre-commit-config.yaml
├── pyproject.toml
└── README.md
```
This project is licensed under the MIT License.