An unofficial implementation of "Stealing Part of a Production Language Model"
These attack reimplementations are intended for research and model safety/defense purposes only. No proprietary APIs are queried in the experiments presented here.
Llama 2 7B:
- Recovered the hidden dimension to within $\pm 1$: 4095 (true value: 4096)
- Predicted RMSNorm as the normalization layer
- Reconstructed the last layer with an RMS error of $2 \times 10^{-5}$
With All Logits Available:
- Recovering Hidden Dimensionality (see the SVD sketch after this list)
- Normalization Layer Prediction
- Full Final Layer Extraction (recoverable only up to an unknown $h \times h$ linear transform; same sketch)
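The hidden-dimension and final-layer items above can both be read off an SVD of a stack of full logit vectors. Below is a minimal sketch, assuming an $n \times |\mathcal{V}|$ matrix of logits collected for $n$ random prompts with $n$ comfortably larger than the hidden size; the function name `recover_hidden_dim_and_final_layer` and the spectral-gap heuristic are illustrative choices, not this repo's exact code.

```python
import numpy as np

def recover_hidden_dim_and_final_layer(logit_matrix: np.ndarray):
    """Sketch of the all-logits attack.

    logit_matrix: (n_prompts, vocab_size) array of full logit vectors,
    with n_prompts comfortably larger than the suspected hidden size h.
    Every logit vector is W @ h_state for an h-dimensional hidden state,
    so the stacked matrix has numerical rank h: the number of singular
    values standing clearly above the noise floor reveals h, and the top
    right-singular vectors span the rows of W^T, i.e. they recover the
    final projection layer up to an unknown h x h linear transform."""
    _, s, vt = np.linalg.svd(logit_matrix, full_matrices=False)
    # Heuristic: place the cut at the largest multiplicative gap in the spectrum.
    log_s = np.log(s + 1e-12)
    h = int(np.argmax(log_s[:-1] - log_s[1:])) + 1
    w_transpose_up_to_g = vt[:h, :]  # (h, vocab_size); equals G @ W^T for some invertible G
    return h, w_transpose_up_to_g
```

The $\pm 1$ result on Llama 2 7B corresponds to this spectral cut landing at 4095 instead of 4096, and because the extracted matrix is only defined up to an invertible $h \times h$ transform, the RMS figure above is presumably computed after solving for the best-fit transform against the true weights.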
With Top-K Logits and Logit Bias:
- Recover complete logit vector from:
  - Top-K logits
  - Top-K logprobs (see the sketch after this list)
  - Cost-optimal Top-K logprobs variant
  - Top-1 logprob
- Have not tested logit recovery due to limited resources.
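For the top-K logprob + logit-bias setting, the core trick is that a large bias pushes chosen tokens into the visible top-K while the model's natural argmax stays visible as a shared reference, so each query reveals K-1 logits relative to that reference. The sketch below simulates this against a toy stand-in endpoint; `make_fake_api`, `recover_logits`, and the bias constant are hypothetical names and values for illustration, not this repo's interface.

```python
import numpy as np

def _logprobs(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def make_fake_api(true_logits, k=5):
    """Toy stand-in for a production endpoint that returns only the top-k
    logprobs after an attacker-supplied logit bias has been applied."""
    def api(logit_bias=None):
        biased = true_logits.copy()
        for tok, b in (logit_bias or {}).items():
            biased[tok] += b
        lp = _logprobs(biased)
        top = np.argsort(lp)[::-1][:k]
        return {int(t): float(lp[t]) for t in top}
    return api

def recover_logits(api, vocab_size, k=5, bias=100.0):
    """Recover every logit relative to the unbiased argmax token.

    Biasing k-1 tokens by a large constant puts them in the top-k while the
    natural argmax `ref` still occupies the remaining slot, so
        logprob_i - logprob_ref = (z_i + bias) - z_ref
    and therefore z_i - z_ref = logprob_i - logprob_ref - bias."""
    ref = max(api().items(), key=lambda kv: kv[1])[0]  # unbiased argmax token
    recovered = np.zeros(vocab_size)                   # z_ref pinned to 0
    others = [t for t in range(vocab_size) if t != ref]
    for start in range(0, len(others), k - 1):
        batch = others[start:start + k - 1]
        out = api({t: bias for t in batch})
        for t in batch:
            recovered[t] = out[t] - out[ref] - bias
    return recovered
```

Each query reveals K-1 new logits, so a full vocabulary costs roughly $|\mathcal{V}|/(K-1)$ queries; the cost-optimal and Top-1 variants listed above trade query structure against that cost.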
Logprob-free:
- Recover complete logit vector via:
  - Binary search (see the sketch after this list)
  - Hyperrectangle relaxation center
  - With better queries
    - Bounding methods can be referenced here.
- Have not tested logit recovery due to limited resources.
- Optimized Top-K logprobs method with linear constraint
- Shortest-path formulation of the logprob-free attack
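Of the logprob-free variants above, the plain binary search is the simplest to illustrate: when the API reveals only the temperature-0 (argmax) token but accepts a logit bias, the smallest bias that flips a token into the argmax position equals its logit gap to the top token. The sketch below is a self-contained toy version of that basic variant only; `make_argmax_api`, `recover_logit_gap`, and the search bounds are illustrative assumptions, not this repo's exact interface.

```python
import numpy as np

def make_argmax_api(true_logits):
    """Toy stand-in for a logprob-free endpoint: it applies the logit bias
    and returns only the greedily sampled (argmax) token."""
    def api(logit_bias=None):
        biased = true_logits.copy()
        for tok, b in (logit_bias or {}).items():
            biased[tok] += b
        return int(np.argmax(biased))
    return api

def recover_logit_gap(api, token, hi=50.0, n_steps=20):
    """Binary-search the smallest bias that makes `token` the argmax.

    At the crossover, bias ~= z_top - z_token, so the negated bias is the
    token's logit relative to the unbiased top token. Assumes `hi` is
    large enough that the token wins when given the maximum bias."""
    top = api()                        # unbiased argmax token
    if token == top:
        return 0.0
    lo = 0.0
    for _ in range(n_steps):
        mid = (lo + hi) / 2
        if api({token: mid}) == token:
            hi = mid                   # bias already sufficient: shrink from above
        else:
            lo = mid                   # token still loses: need more bias
    return -hi                         # z_token - z_top, to within the final interval width
```

This basic version spends a full binary search (here 20 queries) on every token, which is why the hyperrectangle-relaxation, improved-query, and shortest-path formulations listed above matter: they share information across queries to cut the per-logit query count.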
The authors published their own supplementary code after this repo was made, so please refer to theirs for any additional clarity. You can find their repository here.
@misc{carlini2024stealing,
      title={Stealing Part of a Production Language Model},
      author={Nicholas Carlini and Daniel Paleka and Krishnamurthy Dj Dvijotham and Thomas Steinke and Jonathan Hayase and A. Feder Cooper and Katherine Lee and Matthew Jagielski and Milad Nasr and Arthur Conmy and Eric Wallace and David Rolnick and Florian Tramèr},
      year={2024},
      eprint={2403.06634},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}