An unofficial implementation of "Stealing Part of a Production Language Model"

sramshetty/stealing-part-of-an-LM


Stealing Part of a Language Model

An unofficial implementation of "Stealing Part of a Production Language Model"

Details

These attack reimplementations are for research and model safety/defense purposes only. None of the experiments detailed here use a proprietary API.

Llama 2 7B:

  • Recovered hidden dimension, to within $\pm 1$: 4095 (true value: 4096)
  • Normalization layer predicted to be RMSNorm, which is correct for Llama 2
  • Last layer reconstructed with an RMS error of $2 \times 10^{-5}$
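The hidden-dimension result rests on a rank argument: every logit vector is the final projection matrix applied to a hidden state, so a matrix built from more logit vectors than there are hidden dimensions has rank at most the hidden size. The sketch below illustrates this on a toy random model with NumPy. The function name `estimate_hidden_dim`, the toy sizes, and the largest-log-gap heuristic are illustrative assumptions, not this repository's code.

```python
import numpy as np


def estimate_hidden_dim(logits: np.ndarray) -> int:
    """Estimate the hidden size from an (n_queries, vocab) logit matrix.

    Assumes n_queries exceeds the true hidden size, so the matrix is
    rank-deficient and its singular values show a sharp drop.
    """
    s = np.linalg.svd(logits, compute_uv=False)
    # Floor numerically-zero singular values so the log stays finite.
    s = np.maximum(s, np.finfo(s.dtype).eps * s[0])
    log_s = np.log(s)
    # The largest drop between consecutive singular values marks the rank.
    gaps = log_s[:-1] - log_s[1:]
    return int(np.argmax(gaps)) + 1


# Toy stand-in for a model: random hidden states and projection matrix.
rng = np.random.default_rng(0)
hidden, vocab, n_queries = 32, 256, 96
W = rng.normal(size=(vocab, hidden))       # final projection layer
H = rng.normal(size=(n_queries, hidden))   # one hidden state per query
logits = H @ W.T                           # (n_queries, vocab)

print(estimate_hidden_dim(logits))
```

On a real model running in reduced precision the drop is less sharp than in this toy setting, which is one plausible way an off-by-one estimate such as 4095 can arise.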

Methods

With All Logits Available:

  • Recovering Hidden Dimensionality
    • Normalization Layer Prediction
  • Full Final Layer Extraction
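Full final layer extraction can be sketched with the same SVD: the top singular directions of the logit matrix span the same subspace as the final projection matrix, so the layer is recoverable only up to an unknown invertible `hidden x hidden` transformation. The example below is a toy illustration under that assumption. The helper names `extract_final_layer` and `aligned_rms` are hypothetical, and the least-squares alignment is one simple way to score a reconstruction when the true weights are known, not necessarily the evaluation used in this repository.

```python
import numpy as np


def extract_final_layer(logits: np.ndarray, hidden_dim: int) -> np.ndarray:
    """Return a (vocab, hidden_dim) estimate of the final layer.

    The estimate equals the true layer only up to an unknown invertible
    hidden_dim x hidden_dim matrix.
    """
    _, s, vt = np.linalg.svd(logits, full_matrices=False)
    # Keep the top right-singular vectors, scaled by their singular values.
    return vt[:hidden_dim].T * s[:hidden_dim]


def aligned_rms(w_true: np.ndarray, w_est: np.ndarray) -> float:
    """RMS error after solving for the best alignment matrix G."""
    g, *_ = np.linalg.lstsq(w_est, w_true, rcond=None)
    return float(np.sqrt(np.mean((w_est @ g - w_true) ** 2)))


# Toy stand-in for a model: random hidden states and projection matrix.
rng = np.random.default_rng(0)
hidden, vocab, n_queries = 32, 256, 96
W = rng.normal(size=(vocab, hidden))
logits = rng.normal(size=(n_queries, hidden)) @ W.T

W_est = extract_final_layer(logits, hidden)
print(W_est.shape, aligned_rms(W, W_est))
```

The alignment step needs the ground-truth weights, so it applies only to open-weight models, where it serves to validate the attack.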

With Top-K Logits and Logit Bias:

  • Recover complete logit vector
    • Top-K logits
    • Top-K logprobs
    • Cost-optimal Top-K logprobs variant
    • Top-1 Logprob
      • Logit recovery with this variant has not been tested, due to limited resources.
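As a rough illustration of the Top-K logprobs idea, the sketch below biases `k - 1` tokens per query so that they enter the top-K alongside a fixed reference token, then reads off each logit relative to the reference. The API is simulated locally: `make_topk_logprob_api` is a hypothetical stand-in, not a real provider's interface, and the sketch assumes the bias exceeds the spread of the logits. It shows the basic reference-token approach only, not the cost-optimal variant.

```python
import numpy as np


def make_topk_logprob_api(logits: np.ndarray, k: int = 5):
    """Hypothetical API: accepts a logit bias, returns the top-k logprobs."""
    def query(logit_bias: dict) -> dict:
        z = logits.copy()
        for tok, b in logit_bias.items():
            z[tok] += b
        # Numerically stable log-softmax.
        logprobs = z - (z.max() + np.log(np.exp(z - z.max()).sum()))
        top = np.argsort(logprobs)[-k:]
        return {int(t): float(logprobs[t]) for t in top}
    return query


def recover_logits(query, vocab_size: int, k: int = 5, bias: float = 50.0):
    """Recover every logit relative to the unbiased top token."""
    base = query({})
    ref = max(base, key=base.get)          # reference: most likely token
    relative = np.zeros(vocab_size)        # relative[t] = z_t - z_ref
    others = [t for t in range(vocab_size) if t != ref]
    for start in range(0, len(others), k - 1):
        batch = others[start:start + k - 1]
        out = query({t: bias for t in batch})
        for t in batch:
            # logprob differences equal biased-logit differences.
            relative[t] = out[t] - out[ref] - bias
    return ref, relative


rng = np.random.default_rng(0)
true_logits = rng.normal(size=40) * 3.0
api = make_topk_logprob_api(true_logits, k=5)
ref, relative = recover_logits(api, vocab_size=40, k=5)
print(np.max(np.abs(relative - (true_logits - true_logits[ref]))))
```

This sketch uses one unbiased query plus roughly `(vocab_size - 1) / (k - 1)` biased queries per prompt.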

Logprob-free:

  • Recover complete logit vector
    • Binary Search
    • Hyperrectangle Relaxation Center
      • With better queries
      • Bounding methods can be referenced here.
    • Logit recovery with these methods has not been tested, due to limited resources.
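The binary search variant can be sketched as follows: with only the sampled (argmax) token visible, the smallest bias that makes a token win equals its logit gap to the top token. The API here is simulated locally; `make_argmax_api`, `max_bias`, and `tol` are illustrative assumptions, and the sketch searches one token at a time rather than using the better queries of the hyperrectangle method.

```python
import numpy as np


def make_argmax_api(logits: np.ndarray):
    """Hypothetical API: accepts a logit bias, returns only the top token."""
    def query(logit_bias: dict) -> int:
        z = logits.copy()
        for tok, b in logit_bias.items():
            z[tok] += b
        return int(np.argmax(z))
    return query


def binary_search_gap(query, token: int, max_bias: float = 100.0,
                      tol: float = 1e-6) -> float:
    """Estimate z_top - z_token, assuming the gap is below max_bias."""
    lo, hi = 0.0, max_bias
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if query({token: mid}) == token:
            hi = mid   # bias is large enough: the threshold is at or below mid
        else:
            lo = mid   # bias is too small: the threshold is above mid
    return (lo + hi) / 2


def recover_logits_argmax(query, vocab_size: int):
    """Recover every logit relative to the unbiased top token."""
    top = query({})
    relative = np.zeros(vocab_size)        # relative[t] = z_t - z_top
    for t in range(vocab_size):
        if t != top:
            relative[t] = -binary_search_gap(query, t)
    return top, relative


rng = np.random.default_rng(0)
true_logits = rng.normal(size=20) * 3.0
api = make_argmax_api(true_logits)
top, relative = recover_logits_argmax(api, vocab_size=20)
print(np.max(np.abs(relative - (true_logits - true_logits[top]))))
```

Each token costs about `log2(max_bias / tol)` queries in this form, which is why query-efficient alternatives matter for large vocabularies.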

Extras

  • Optimized Top-K logprobs method with linear constraint
  • Shortest path formulation of logprob-free attack

Citation

The paper's authors published their own supplementary code after this repository was created, so please refer to theirs for any further clarification. You can find their repository here.

@misc{carlini2024stealing,
    title={Stealing Part of a Production Language Model}, 
    author={Nicholas Carlini and Daniel Paleka and Krishnamurthy Dj Dvijotham and Thomas Steinke and Jonathan Hayase and A. Feder Cooper and Katherine Lee and Matthew Jagielski and Milad Nasr and Arthur Conmy and Eric Wallace and David Rolnick and Florian Tramèr},
    year={2024},
    eprint={2403.06634},
    archivePrefix={arXiv},
    primaryClass={cs.CR}
}
