Intro
From writing poetry to giving medical advice, large language models are amazingly general, giving us a first hint of the G in artificial general intelligence. They are also surprisingly simple, and yet deeply opaque.
Weaknesses
As Dan Piponi noted:
more has been written about what ChatGPT can’t do than has been written about what any other tool can’t do. It’s all very strange.
and:
“The problem with citations about ChatGPT is that their authenticity can be difficult to discern” – Plato
Anyway, here are some weaknesses:
- They don’t know when they are telling the truth - they never know when to shut up!
- They can’t do anything requiring precision (unlike, say, a database).
- They need a huge amount of training data (a trillion words, unlike a ten-year-old child, who has seen or heard about a hundred million words).
- They only correspond to a small part of the human brain, and so they may not be as smart as we think.

Wolfram highlights another major gap:
Try to give it rules for an actual “deep” computation that involves many potentially computationally irreducible steps and it just won’t work. (Remember that at each step it’s always just “feeding data forward” in its network, never looping except by virtue of generating new tokens.)
Of course, the network can learn the answer to specific “irreducible” computations. But as soon as there are combinatorial numbers of possibilities, no such “table-lookup-style” approach will work.
Or in summary:
Cases that a human “can solve in a glance” the neural net can solve too. But cases that require doing something “more algorithmic” the neural net tends to somehow be “too computationally shallow” to reliably do.
Prompt injection: What’s the worst that can happen?
Hallucinations
Looking for an interpretability paper
Chain-of-Verification Reduces Hallucination in Large Language Models
Scale
The GPT-3 paper, Language Models are Few-Shot Learners, has this:

The last row shows that the full GPT-3 has 175 billion parameters, arranged in 96 blocks where each block has an attention layer with 4 * 12288 * 12288 parameters and a couple of feed forward layers with a further 8 * 12288 * 12288 parameters.
The parameter count doesn’t depend on the context window size (which for GPT-3 is 2048 and for GPT-3.5 is 4096), or the number of attention heads, or (to a first approximation) the vocabulary size (which for GPT-3 is 50,257; the token-embedding matrix does scale with it, but accounts for well under 1% of the total). Instead, it essentially just depends on the number of layers and the model dimension (the embedding size of each token).
According to this, GPT-3 needs 740 teraflops to infer a single token, due to each parameter being used many times (which, in turn, allows the model to do more computation than the raw parameter count would imply). Later versions of GPT-3 had lower inference costs (?).
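A back-of-the-envelope check of those figures (the 2-FLOPs-per-multiply-add convention and the use of the full 2048-token context are assumptions on my part):

```python
# Rough GPT-3 parameter count from the per-block figures quoted above.
d_model = 12288                    # model dimension (embedding size per token)
n_blocks = 96                      # transformer blocks
attn = 4 * d_model * d_model       # Q, K, V and output projections
ffn = 8 * d_model * d_model        # two feed-forward matrices (d_model x 4*d_model each)
params = n_blocks * (attn + ffn)
print(f"{params / 1e9:.0f}B parameters")                 # ~174B, close to the quoted 175B

# Rough compute to run a full 2048-token context, assuming ~2 FLOPs
# (a multiply and an add) per parameter per token position.
context = 2048
print(f"{2 * params * context / 1e12:.0f} teraFLOPs")    # ~712, the same ballpark as 740
```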
In the paper, they say:
for most tasks we find relatively smooth scaling with model capacity
and then, intriguingly:
the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners
For completeness, here are the GPT-2 small, medium, large and XL model sizes.

Memory bandwidth constraints imply economies of scale in AI inference - avoiding GPU bandwidth limits by favouring things like matrix multiply where the computation is O(N³) and the memory is O(N²), possibly with tiling to make sure things fit in the cache.
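To make the O(N³)-compute versus O(N²)-memory point concrete, here is a toy arithmetic-intensity calculation (the fp16 element size and the ideal one-pass memory traffic are simplifying assumptions):

```python
# FLOPs per byte moved for an N x N x N matrix multiply. The ratio grows
# linearly with N, which is why big matmuls can hide memory-bandwidth limits.
def arithmetic_intensity(n, bytes_per_element=2):    # fp16
    flops = 2 * n ** 3                               # one multiply-add per term
    bytes_moved = 3 * n * n * bytes_per_element      # read A and B, write C, once each
    return flops / bytes_moved

for n in (128, 1024, 8192):
    print(n, round(arithmetic_intensity(n)))         # 43, 341, 2731 FLOPs/byte
```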
Chinchilla’s wild implications - running out of data.
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens
Text generation
How to generate text - Greedy search, Beam search, Top-K Sampling, Top-p sampling
As ad-hoc decoding methods, top-p and top-K sampling seem to produce more fluent text than traditional greedy and beam search on open-ended language generation. There is evidence that the apparent flaws of greedy and beam search - mainly generating repetitive word sequences - are caused by the model (especially the way the model is trained), rather than the decoding method.
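A minimal NumPy sketch of top-K and top-p (nucleus) truncation of a next-token distribution; the toy probabilities are made up for illustration:

```python
import numpy as np

def sample(probs, top_k=None, top_p=None, rng=None):
    """Sample a token id after optionally truncating the distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                  # most probable first
    if top_k is not None:
        order = order[:top_k]                        # keep the K most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        order = order[:cutoff]                       # smallest set with mass >= top_p
    kept = probs[order] / probs[order].sum()         # renormalise what remains
    return rng.choice(order, p=kept)

probs = [0.5, 0.2, 0.15, 0.1, 0.05]                  # toy next-token distribution
print(sample(probs, top_k=3))                        # only tokens 0-2 can be drawn
print(sample(probs, top_p=0.8))                      # tokens 0-2 cover 0.85 >= 0.8
```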
The Difficulties of Text Generation using Autoregressive Language Models: A Brief Overview
While some may criticize the autoregressive formulation because people generally don’t write purely autoregressively, there actually are authors who use this sort of technique to write entire books.
“GPT learning has been great at capturing the underlying reality and maybe the weak point is the text generation” - Sutskever - https://www.youtube.com/watch?v=SjhIlw3Iffs
The Curious Case of Neural Text Degeneration - Beam search text (blue) is less surprising than human text (orange):

Why is human-written text not the most probable text? … people optimize against stating the obvious.
GPT-3 has a habit of repeating its input
Interpreting GPT
A Comprehensive Mechanistic Interpretability Explainer & Glossary
Interpreting GPT: the logit lens - Also see the “Mentioned in” section at the end.

A jargon-free explanation of how AI large language models work - including the brilliant squirrels analogy for how NNs are trained.
You can think of the attention mechanism as a matchmaking service for words.
You can think of all those [12288] dimensions as a kind of “scratch space” that GPT-3 can use to write notes to itself about the context of each word. Notes made by earlier layers can be read and modified by later layers, allowing the model to gradually sharpen its understanding of the passage as a whole.
There’s also this quip:
A kind of “clever Hans” effect, only in language models rather than horses.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Transformer Feed-Forward Layers Are Key-Value Memories
A Mechanism for Solving Relational Tasks in Transformer Language Models - e.g., capital_of(Poland)=Warsaw
A Mathematical Framework for Transformer Circuits - Anthropic
In-context Learning and Induction Heads - Anthropic
GPT-3 has “learned how to learn”: in its endless training on so many gigabytes of text, it encounters so many different kinds of text that it had no choice but to learn abstractions.
This family of phenomena is perhaps driven by neural networks functioning as ensembles of many sub-networks.
It is sufficiently powerful a model that its sub-models can do anything from poetry to arithmetic, and it is trained on so much data that those superficial models may do well early on, but gradually fall behind more abstract models.
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
Softmax Linear Units - Anthropic - Making models more interpretable
- Many MLP neurons appear to be polysemantic, responding to multiple unrelated features (unlike the Grandmother cell).
- One plausible explanation for polysemanticity is the superposition hypothesis, which suggests that neural network layers have more features than neurons as part of a “sparse coding” strategy to simulate a much larger layer (see the toy sketch below).
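A toy numerical illustration of that superposition idea: random unit vectors in a high-dimensional space are nearly orthogonal, so a layer can host many more feature directions than it has neurons (the 512/4096 sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
neurons, features = 512, 4096                        # far more "features" than neurons
directions = rng.standard_normal((features, neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between distinct feature directions: small but nonzero,
# which is the price of packing 4096 directions into 512 dimensions.
overlaps = directions @ directions.T
np.fill_diagonal(overlaps, 0.0)
print(np.abs(overlaps).max())                        # around 0.2 for these sizes, versus 1.0 for a collision
```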
Training
From the GPT-3 paper, Language Models are Few-Shot Learners:

Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.
Here is the training data for LLaMA: Open and Efficient Foundation Language Models:

Optimization
BitNet: Scaling 1-bit Transformers for Large Language Models - reduces the FF matrices to 1-bit precision, using just +1 or -1 values.
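A minimal sketch of the ±1 idea: keep only the sign of each weight plus one full-precision scale per matrix. This is a simplification for illustration, not BitNet’s exact recipe:

```python
import numpy as np

def binarize(w):
    """Approximate a weight matrix by {-1, +1} entries and a single fp scale."""
    centered = w - w.mean()              # zero-centre before taking signs
    scale = np.abs(centered).mean()      # the one number kept in full precision
    return np.sign(centered), scale

w = np.random.default_rng(0).standard_normal((4, 4)) * 0.02
signs, scale = binarize(w)
print(np.abs(w - scale * signs).mean())  # reconstruction error of the toy matrix
```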
Generating Long Sequences with Sparse Transformers - sparse factorizations of the attention matrix which reduce the cost to O(n√n).
Nvidia increases performance a thousandfold:
- Number Representation: 16x
- Complex Instructions: 12.5x
- Moore’s Law: 2.5x
- Sparsity: 2x
You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi
Regarding AI’s rate of progress, a fellow AI reporter told Ars, “It’s like those videos of dogs where you upend a crate of tennis balls on them. [They] don’t know where to chase first and get lost in the confusion.”
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tuning
“In general, it’s interesting how little “poking” the “originally trained” network seems to need to get it to usefully go in particular directions.” - Wolfram
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
LoRA: Low-Rank Adaptation of Large Language Models - explanation here.
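The core trick in a few lines of NumPy: freeze the pretrained weight W and train only a low-rank update BA, so the trainable parameter count drops from d*k to r*(d+k). The shapes and rank below are arbitrary:

```python
import numpy as np

d, k, r = 768, 768, 8                    # layer shape and LoRA rank (arbitrary)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weights
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k
B = np.zeros((d, r))                     # trainable, d x r; zero init so W + BA == W at the start

def forward(x):
    # Effective weight is W + B @ A, but the full update is never materialised.
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((2, k))
print(forward(x).shape)                                      # (2, 768)
print(f"trainable: {A.size + B.size} vs frozen: {W.size}")   # 12288 vs 589824
```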
Using LLMs
Catching up on the weird world of LLMs
I read academic papers by pasting pieces of them into GPT-4 and asking it to explain every jargon term in the extract.
I no longer dread naming things. I can ask it for 20 ideas for names, and maybe option number 15 is the one I go with. (It beats brainstorming for an hour). Always ask for “twenty ideas for” — you’ll find that the first ten are super-obvious, but once you get past those things start getting interesting.
“With GPT questions, the vaguer the better sometimes” - Terence Tao
Brex’s Prompt Engineering Guide ⊗
A prompt may include a few examples for a model to learn from, such as “maison -> house, chat -> cat, chien ->”, an approach called few-shot learning. - Wikipedia
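A tiny sketch of assembling such a few-shot prompt as a plain string; the instruction line is a placeholder, and any completion or chat API could consume the result:

```python
examples = [("maison", "house"), ("chat", "cat")]
query = "chien"

prompt = "Translate French to English.\n"
prompt += "\n".join(f"{fr} -> {en}" for fr, en in examples)
prompt += f"\n{query} ->"

print(prompt)
# Translate French to English.
# maison -> house
# chat -> cat
# chien ->        (the model is expected to continue with " dog")
```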
Brains
Thinking, Fast and Slow, says unexpected words cause a distinctive pattern in brain activity to start within a fifth of a second. So, the brain is continuously running an LLM! Apparently, the peak brain clock rate is 200 Hz and the synapse speed is less than a millisecond.
“The reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought” - Wolfram
Overviews
State of GPT - including finetuning and RLHF at 13:00 to 17:00, and LoRA at 35:18.
Adam is based on the idea of adapting separate learning rates for each parameter.
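The per-parameter adaptation in a few lines (the standard Adam update with its usual default hyperparameters; the example gradient is made up):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: each parameter effectively gets its own step size."""
    m = b1 * m + (1 - b1) * grad           # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
theta, m, v = adam_step(theta, grad=np.array([0.1, -0.5]), m=m, v=v, t=1)
print(theta)                               # each coordinate moves by ~lr, with its own sign
```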
AlphaGo consists of a policy network and a value network, which narrow the breadth of the search tree and truncate its depth, respectively.
Neural Architecture Search has become common practice in the field for squeezing every drop of performance out of networks.
The Lottery Ticket Hypothesis asserts that most of a network’s performance comes from a certain subnetwork due to a lucky initialization.
Landmark papers
Attention Is All You Need - Transformers
Improving Language Understanding by Generative Pre-Training - GPT
Language Models are Unsupervised Multitask Learners - GPT-2
Language Models are Few-Shot Learners - GPT-3
Resources
https://github.com/huggingface/transformers
Neural Networks: Zero to Hero - Karpathy’s 12-hour video series that builds towards this implementation.
Visualizing Matrix Multiplication
A Comprehensive Overview of Large Language Models
The impacts of AI
AI Alignment Is Turning from Alchemy Into Chemistry
On the Opportunities and Risks of Foundation Models
Random thoughts
We can’t help equating language ability and intelligence. Therefore, the LLMs’ eloquence might be fooling us into thinking they are smart.
LLMs don’t contain control-flow things like “if” statements.
An LLM is like a fuzzy hologram of the web.
Retrieval-Augmented Generation. OpenAI embeddings API.
Wikipedia: GPT-2, GPT-3, Prompt engineering ⊗, GPT-4, The Pile
What OpenAI Really Wants - A fun history of OpenAI.
Where We See Shapes, AI Sees Textures
[Hinton: Evolution is a tinkerer … we probably don’t need all those kinds of neurons to get an intelligent system](https://youtu.be/Gg-w_n9NJIE?t=4355)
[Hassabis: AI-enabled scientific revolution](https://youtu.be/Gg-w_n9NJIE?t=4423)
