Finetuned Language Models Are Zero-Shot Learners

Adversarial & Visualization

Transformer Circuits Thread

Adversarial Examples Are Not Bugs, They Are Features - Ilyas et al. - a popular paper, I think

LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples

Distill

A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features'

Visualizing Weights

Zoom In: An Introduction to Circuits

Exploring Neural Networks with Activation Atlases

Interpreting GPT

  • The nodes are the variables and the weights are the program.

  • IOI (Indirect Object Identification)

  • Superposition

  • Causal interventions on the internals of a model.

  • Activation patching - run the model twice; replace a node’s activation in one run with the activation from the other run (see the sketch after this list).
  • It measures counterfactuals: what would the model have output if this node had kept its value from the other run?

  • Path patching

  • Automatic circuit discovery

  • With patching in general, you have to choose your input distribution carefully.

  • Attention heads (?) can refer to other info by value or by pointer. They showed that, in their IOI example, the second John was being referred to by (relative) pointer.
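A minimal activation-patching sketch, assuming GPT-2 small via Hugging Face transformers and PyTorch forward hooks; the IOI-style prompt pair, the choice of layer, and patching the entire MLP output (rather than a single head at a single position) are illustrative assumptions, not the setup from the IOI paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Illustrative IOI-style prompt pair (both tokenize to the same length).
clean   = "When John and Mary went to the store, John gave a drink to"   # -> " Mary"
corrupt = "When John and Mary went to the store, Mary gave a drink to"   # -> " John"

LAYER = 8          # arbitrary choice of which block's MLP output to patch
cache = {}

def save_hook(module, inputs, output):
    cache["mlp"] = output.detach()

def patch_hook(module, inputs, output):
    return cache["mlp"]   # returning a value from a forward hook replaces the output

mlp = model.transformer.h[LAYER].mlp
with torch.no_grad():
    # Run 1: clean prompt, record this node's activation.
    handle = mlp.register_forward_hook(save_hook)
    model(**tok(clean, return_tensors="pt"))
    handle.remove()

    # Run 2: corrupt prompt, with the clean activation patched in.
    handle = mlp.register_forward_hook(patch_hook)
    logits = model(**tok(corrupt, return_tensors="pt")).logits
    handle.remove()

# Counterfactual question: with this node restored to its clean value, does the
# model move back toward the clean answer?
print(tok.decode(logits[0, -1].argmax().item()))
```

Path patching and automatic circuit discovery build on the same replace-and-rerun primitive, restricting or automating which nodes and paths get patched.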

A Mechanism for Solving Relational Tasks in Transformer Language Models

in an in-context learning setting

reverse-engineering the data structures and algorithms that are implicitly encoded in the model’s weights

LLMs implement a basic vector-addition mechanism which plays an important role in a number of in-context-learning tasks

a distinct processing signature in the forward pass which characterizes this mechanism

if models need to perform the get_capital(x) function, which takes an argument x and yields an answer y, they first surface the argument x in earlier layers, which enables them to apply the function and yield y as the final output

the vector arithmetic mechanism is often implemented by mid-to-late layer feedforward networks (FFNs) in a way that is modular and supports intervention
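A rough sketch of that kind of intervention, assuming GPT-2 small and a simple zero-ablation of one mid-layer MLP's additive write; the layer, prompt, and ablation style are assumptions for illustration, not the paper's protocol.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ids = tok("The capital of Poland is", return_tensors="pt")

def top_token():
    with torch.no_grad():
        return tok.decode(model(**ids).logits[0, -1].argmax().item())

print("baseline:", top_token())

# Zero out the write of one mid-to-late-layer FFN and re-check the prediction.
LAYER = 9
handle = model.transformer.h[LAYER].mlp.register_forward_hook(
    lambda module, inputs, output: torch.zeros_like(output)
)
print(f"MLP {LAYER} ablated:", top_token())
handle.remove()
```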

From start to end, x is only updated by additive updates, forming a residual stream. Thus, the token representation x_i represents all of the additions made into the residual stream up to layer i
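The additive structure is just the standard pre-LN transformer block; a minimal PyTorch sketch (dimensions arbitrary):

```python
import torch.nn as nn

class Block(nn.Module):
    """One pre-LN transformer block: x is only ever modified by additive
    writes, so the representation entering layer i is the embedding plus
    every attention/FFN output added to the stream so far."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)
        x = x + a                              # additive write from attention
        x = x + self.ffn(self.ln2(x))          # additive write from the FFN
        return x
```

Loosely, every intermediate x lives in the same space as the final one, which is why decoding the stream mid-way (as in the probe further down) is meaningful.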

LMs incrementally update the token representation x to build and refine an encoding of the vocabulary distribution

Finally, the model enters Saturation, where it recognizes it has solved the next token and ceases updating the token representation for the remaining layers

Saturation events are described in Geva et al. (2022a), where detection of such events is used to “early-exit” out of the forward pass
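A rough, logit-lens-style way to watch for this: decode every layer's residual stream through the final LayerNorm and unembedding, and note the depth at which the predicted token stops changing. This treats the logit lens as a proxy for the paper's per-layer analysis; GPT-2 small and the prompt are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of Poland is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# hidden_states[0] is the embedding output; the rest are the block outputs.
for layer, h in enumerate(out.hidden_states):
    resid = h[0, -1]                                  # last-token residual stream
    logits = model.lm_head(model.transformer.ln_f(resid))
    print(layer, repr(tok.decode(logits.argmax().item())))
# If the mechanism is at work, the argument (here, "Poland"-like tokens) tends
# to surface before the answer, and once the answer appears the top token
# stops changing -- the saturation point.
```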

Our findings invite future work aimed at understanding why, and under what conditions, LMs learn to use this mechanism when they are capable of solving such tasks using, e.g., ad-hoc memorization

IOI links

A Mathematical Framework for Transformer Circuits

In-context Learning and Induction Heads

LLM Visualization

https://arxiv.org/pdf/2310.16028

https://openreview.net/pdf?id=ZmzLrl8nTa