Stars
[NeurIPS'24 Spotlight] Observational Scaling Laws
Inspect: A framework for large language model evaluations
An implementation of the Llama architecture, to instruct and delight
A fast + lightweight implementation of the GCG algorithm in PyTorch
AI Logging for Interpretability and Explainability🔬
Decomposing and Editing Predictions by Modeling Model Computation
The nnsight package enables interpreting and manipulating the internals of deep learning models.
Influence Functions with (Eigenvalue-corrected) Kronecker-Factored Approximate Curvature
ML Benchmarks in Algebraic Combinatorics
Universal Neurons in GPT2 Language Models
A benchmark to evaluate language models on questions I've previously asked them to solve.
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
Training Sparse Autoencoders on Language Models
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Public repository for code and results from parts of "Steering Llama 2 via Contrastive Activation Addition" by Rimsky, Gabrieli, Schulz et al.
llmstep: [L]LM proofstep suggestions in Lean 4.
NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Huggingface Transformers
[NeurIPS 2023] Learning Transformer Programs
Full code for the sparse probing paper.
Mech-Interp / PySvelte
Forked from anthropics/PySvelte. A library for bridging Python and HTML/JavaScript (via Svelte) for creating interactive visualizations.
A concise but complete full-attention transformer with a set of promising experimental features from various papers