An unofficial implementation of "Stealing Part of a Production Language Model"
These attack reimplementations are intended for research and model safety/defense purposes only. No proprietary APIs are queried in the experiments presented here.
Llama 2 7B:
- Recovered the hidden dimension to within $\pm 1$: 4095 (true value: 4096)
- Predicted RMSNorm as the normalization layer
- Reconstructed the last layer with an RMS error of $2 \times 10^{-5}$
With All Logits Available:
- Recovering Hidden Dimensionality (see the SVD sketch after this list)
- Normalization Layer Prediction
- Full Final Layer Extraction (recoverable only up to an unknown $h \times h$ linear transform; same sketch)
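The hidden-dimension and final-layer items above can both be read off an SVD of a stack of full logit vectors. Below is a minimal sketch, assuming an $n \times |\mathcal{V}|$ matrix of logits collected for $n$ random prompts with $n$ comfortably larger than the hidden size; the function name `recover_hidden_dim_and_final_layer` and the spectral-gap heuristic are illustrative choices, not this repo's exact code.

```python
import numpy as np

def recover_hidden_dim_and_final_layer(logit_matrix: np.ndarray):
    """Sketch of the all-logits attack.

    logit_matrix: (n_prompts, vocab_size) array of full logit vectors,
    with n_prompts comfortably larger than the suspected hidden size h.
    Every logit vector is W @ h_state for an h-dimensional hidden state,
    so the stacked matrix has numerical rank h: the number of singular
    values standing clearly above the noise floor reveals h, and the top
    right-singular vectors span the rows of W^T, i.e. they recover the
    final projection layer up to an unknown h x h linear transform."""
    _, s, vt = np.linalg.svd(logit_matrix, full_matrices=False)
    # Heuristic: place the cut at the largest multiplicative gap in the spectrum.
    log_s = np.log(s + 1e-12)
    h = int(np.argmax(log_s[:-1] - log_s[1:])) + 1
    w_transpose_up_to_g = vt[:h, :]  # (h, vocab_size); equals G @ W^T for some invertible G
    return h, w_transpose_up_to_g
```

The $\pm 1$ result on Llama 2 7B corresponds to this spectral cut landing at 4095 instead of 4096, and because the extracted matrix is only defined up to an invertible $h \times h$ transform, the RMS figure above is presumably computed after solving for the best-fit transform against the true weights.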
With Top-K Logits and Logit Bias:
- Recover complete logit vector from:
  - Top-K logits
  - Top-K logprobs (see the sketch after this list)
  - Cost-optimal Top-K logprobs variant
  - Top-1 logprob
- Have not tested logit recovery due to limited resources.
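For the top-K logprob + logit-bias setting, the core trick is that a large bias pushes chosen tokens into the visible top-K while the model's natural argmax stays visible as a shared reference, so each query reveals K-1 logits relative to that reference. The sketch below simulates this against a toy stand-in endpoint; `make_fake_api`, `recover_logits`, and the bias constant are hypothetical names and values for illustration, not this repo's interface.

```python
import numpy as np

def _logprobs(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def make_fake_api(true_logits, k=5):
    """Toy stand-in for a production endpoint that returns only the top-k
    logprobs after an attacker-supplied logit bias has been applied."""
    def api(logit_bias=None):
        biased = true_logits.copy()
        for tok, b in (logit_bias or {}).items():
            biased[tok] += b
        lp = _logprobs(biased)
        top = np.argsort(lp)[::-1][:k]
        return {int(t): float(lp[t]) for t in top}
    return api

def recover_logits(api, vocab_size, k=5, bias=100.0):
    """Recover every logit relative to the unbiased argmax token.

    Biasing k-1 tokens by a large constant puts them in the top-k while the
    natural argmax `ref` still occupies the remaining slot, so
        logprob_i - logprob_ref = (z_i + bias) - z_ref
    and therefore z_i - z_ref = logprob_i - logprob_ref - bias."""
    ref = max(api().items(), key=lambda kv: kv[1])[0]  # unbiased argmax token
    recovered = np.zeros(vocab_size)                   # z_ref pinned to 0
    others = [t for t in range(vocab_size) if t != ref]
    for start in range(0, len(others), k - 1):
        batch = others[start:start + k - 1]
        out = api({t: bias for t in batch})
        for t in batch:
            recovered[t] = out[t] - out[ref] - bias
    return recovered
```

Each query reveals K-1 new logits, so a full vocabulary costs roughly $|\mathcal{V}|/(K-1)$ queries; the cost-optimal and Top-1 variants listed above trade query structure against that cost.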
Logprob-free:
- Recover complete logit vector via:
  - Binary search (see the sketch after this list)
  - Hyperrectangle relaxation center
  - With better queries
    - Bounding methods can be referenced here.
- Have not tested logit recovery due to limited resources.
- Optimized Top-K logprobs method with linear constraint
- Shortest-path formulation of the logprob-free attack
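Of the logprob-free variants above, the plain binary search is the simplest to illustrate: when the API reveals only the temperature-0 (argmax) token but accepts a logit bias, the smallest bias that flips a token into the argmax position equals its logit gap to the top token. The sketch below is a self-contained toy version of that basic variant only; `make_argmax_api`, `recover_logit_gap`, and the search bounds are illustrative assumptions, not this repo's exact interface.

```python
import numpy as np

def make_argmax_api(true_logits):
    """Toy stand-in for a logprob-free endpoint: it applies the logit bias
    and returns only the greedily sampled (argmax) token."""
    def api(logit_bias=None):
        biased = true_logits.copy()
        for tok, b in (logit_bias or {}).items():
            biased[tok] += b
        return int(np.argmax(biased))
    return api

def recover_logit_gap(api, token, hi=50.0, n_steps=20):
    """Binary-search the smallest bias that makes `token` the argmax.

    At the crossover, bias ~= z_top - z_token, so the negated bias is the
    token's logit relative to the unbiased top token. Assumes `hi` is
    large enough that the token wins when given the maximum bias."""
    top = api()                        # unbiased argmax token
    if token == top:
        return 0.0
    lo = 0.0
    for _ in range(n_steps):
        mid = (lo + hi) / 2
        if api({token: mid}) == token:
            hi = mid                   # bias already sufficient: shrink from above
        else:
            lo = mid                   # token still loses: need more bias
    return -hi                         # z_token - z_top, to within the final interval width
```

This basic version spends a full binary search (here 20 queries) on every token, which is why the hyperrectangle-relaxation, improved-query, and shortest-path formulations listed above matter: they share information across queries to cut the per-logit query count.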
The authors published their own supplementary code after this repo was made, so please refer to theirs for any additional clarity. You can find their repository here.
@misc{carlini2024stealing,
      title={Stealing Part of a Production Language Model},
      author={Nicholas Carlini and Daniel Paleka and Krishnamurthy Dj Dvijotham and Thomas Steinke and Jonathan Hayase and A. Feder Cooper and Katherine Lee and Matthew Jagielski and Milad Nasr and Arthur Conmy and Eric Wallace and David Rolnick and Florian Tramèr},
      year={2024},
      eprint={2403.06634},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}