Intro
From writing poetry to giving medical advice, large language models are amazingly general, giving us a first hint of the G in artificial general intelligence. They are also surprisingly simple, and yet deeply opaque.
Weaknesses
As Dan Piponi noted:
more has been written about what ChatGPT can’t do than has been written about what any other tool can’t do. It’s all very strange.
and:
“The problem with citations about ChatGPT is that their authenticity can be difficult to discern” – Plato
Anyway, here are some weaknesses:
- They don’t know when they are telling the truth - they never know when to shut up!
- They can’t do anything requiring precision (unlike, say, a database).
- They need a huge amount of training data (a trillion words, unlike a ten-year-old child, who has seen or heard about a hundred million words).
- They only correspond to a small part of the human brain, and so they may not be as smart as we think.

Wolfram highlights another major gap:
Try to give it rules for an actual “deep” computation that involves many potentially computationally irreducible steps and it just won’t work. (Remember that at each step it’s always just “feeding data forward” in its network, never looping except by virtue of generating new tokens.)
Of course, the network can learn the answer to specific “irreducible” computations. But as soon as there are combinatorial numbers of possibilities, no such “table-lookup-style” approach will work.
Or in summary:
Cases that a human “can solve in a glance” the neural net can solve too. But cases that require doing something “more algorithmic” the neural net tends to somehow be “too computationally shallow” to reliably do.
Prompt injection: What’s the worst that can happen?
Hallucinations
Looking for an interpretability paper
Chain-of-Verification Reduces Hallucination in Large Language Models
Scale
The GPT-3 paper, Language Models are Few-Shot Learners, has this:

The last row shows that the full GPT-3 has 175 billion parameters, arranged in 96 blocks where each block has an attention layer with 4 * 12288 * 12288 parameters and a couple of feed forward layers with a further 8 * 12288 * 12288 parameters.
The parameter count doesn’t depend on the context window size (which for GPT-3 is 2048 and for GPT-3.5 is 4096), or the number of attention heads, or (to a first approximation) the vocabulary size (which for GPT-3 is 50,257; the token-embedding matrix does scale with it, but accounts for well under 1% of the total). Instead, it essentially just depends on the number of layers and the model dimension (the embedding size of each token).
According to this, GPT-3 needs 740 teraflops to infer a single token, due to each parameter being used many times (which, in turn, allows the model to do more computation than the raw parameter count would imply). Later versions of GPT-3 had lower inference costs (?).
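A back-of-the-envelope check of those figures (the 2-FLOPs-per-multiply-add convention and the use of the full 2048-token context are assumptions on my part):

```python
# Rough GPT-3 parameter count from the per-block figures quoted above.
d_model = 12288                    # model dimension (embedding size per token)
n_blocks = 96                      # transformer blocks
attn = 4 * d_model * d_model       # Q, K, V and output projections
ffn = 8 * d_model * d_model        # two feed-forward matrices (d_model x 4*d_model each)
params = n_blocks * (attn + ffn)
print(f"{params / 1e9:.0f}B parameters")                 # ~174B, close to the quoted 175B

# Rough compute to run a full 2048-token context, assuming ~2 FLOPs
# (a multiply and an add) per parameter per token position.
context = 2048
print(f"{2 * params * context / 1e12:.0f} teraFLOPs")    # ~712, the same ballpark as 740
```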
In the paper, they say:
for most tasks we find relatively smooth scaling with model capacity
and then, intriguingly:
the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners
For completeness, here are the GPT-2 small, medium, large and XL model sizes.

Memory bandwidth constraints imply economies of scale in AI inference - avoiding GPU bandwidth limits by favouring things like matrix multiply where the computation is O(N³) and the memory is O(N²), possibly with tiling to make sure things fit in the cache.
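To make the O(N³)-compute versus O(N²)-memory point concrete, here is a toy arithmetic-intensity calculation (the fp16 element size and the ideal one-pass memory traffic are simplifying assumptions):

```python
# FLOPs per byte moved for an N x N x N matrix multiply. The ratio grows
# linearly with N, which is why big matmuls can hide memory-bandwidth limits.
def arithmetic_intensity(n, bytes_per_element=2):    # fp16
    flops = 2 * n ** 3                               # one multiply-add per term
    bytes_moved = 3 * n * n * bytes_per_element      # read A and B, write C, once each
    return flops / bytes_moved

for n in (128, 1024, 8192):
    print(n, round(arithmetic_intensity(n)))         # 43, 341, 2731 FLOPs/byte
```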
Chinchilla’s wild implications - running out of data.
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens
Text generation
How to generate text - Greedy search, Beam search, Top-K Sampling, Top-p sampling
As ad-hoc decoding methods, top-p and top-K sampling seem to produce more fluent text than traditional greedy and beam search on open-ended language generation. There is evidence that the apparent flaws of greedy and beam search - mainly generating repetitive word sequences - are caused by the model (especially the way the model is trained), rather than the decoding method.
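A minimal NumPy sketch of top-K and top-p (nucleus) truncation of a next-token distribution; the toy probabilities are made up for illustration:

```python
import numpy as np

def sample(probs, top_k=None, top_p=None, rng=None):
    """Sample a token id after optionally truncating the distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                  # most probable first
    if top_k is not None:
        order = order[:top_k]                        # keep the K most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        order = order[:cutoff]                       # smallest set with mass >= top_p
    kept = probs[order] / probs[order].sum()         # renormalise what remains
    return rng.choice(order, p=kept)

probs = [0.5, 0.2, 0.15, 0.1, 0.05]                  # toy next-token distribution
print(sample(probs, top_k=3))                        # only tokens 0-2 can be drawn
print(sample(probs, top_p=0.8))                      # tokens 0-2 cover 0.85 >= 0.8
```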
The Difficulties of Text Generation using Autoregressive Language Models: A Brief Overview
While some may criticize the autoregressive formulation because people generally don’t write purely autoregressively, there actually are authors who use this sort of technique to write entire books.
“GPT learning has been great at capturing the underlying reality and maybe the weak point is the text generation” - Sutskever - https://www.youtube.com/watch?v=SjhIlw3Iffs
The Curious Case of Neural Text Degeneration - Beam search text (blue) is less surprising than human text (orange):

Why is human-written text not the most probable text? … people optimize against stating the obvious.
GPT-3 has a habit of repeating its input
Interpreting GPT
A Comprehensive Mechanistic Interpretability Explainer & Glossary
Interpreting GPT: the logit lens - Also see the “Mentioned in” section at the end.

A jargon-free explanation of how AI large language models work - including the brilliant squirrels analogy for how NNs are trained.
You can think of the attention mechanism as a matchmaking service for words.
You can think of all those [12288] dimensions as a kind of “scratch space” that GPT-3 can use to write notes to itself about the context of each word. Notes made by earlier layers can be read and modified by later layers, allowing the model to gradually sharpen its understanding of the passage as a whole.
There’s also this quip:
A kind of “clever Hans” effect, only in language models rather than horses.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Transformer Feed-Forward Layers Are Key-Value Memories
A Mechanism for Solving Relational Tasks in Transformer Language Models - e.g., capital_of(Poland)=Warsaw
A Mathematical Framework for Transformer Circuits - Anthropic
In-context Learning and Induction Heads - Anthropic
GPT-3 has “learned how to learn”: in its endless training on so many gigabytes of text, it encounters so many different kinds of text that it had no choice but to learn abstractions.
This family of phenomena is perhaps driven by neural networks functioning as ensembles of many sub-networks.
It is sufficiently powerful a model that its sub-models can do anything from poetry to arithmetic, and it is trained on so much data that those superficial models may do well early on, but gradually fall behind more abstract models.
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
Softmax Linear Units - Anthropic - Making models more interpretable
- Many MLP neurons appear to be polysemantic, responding to multiple unrelated features (unlike the Grandmother cell).
- One plausible explanation for polysemanticity is the superposition hypothesis, which suggests that neural network layers have more features than neurons as part of a “sparse coding” strategy to simulate a much larger layer (see the toy sketch below).
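A toy numerical illustration of that superposition idea: random unit vectors in a high-dimensional space are nearly orthogonal, so a layer can host many more feature directions than it has neurons (the 512/4096 sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
neurons, features = 512, 4096                        # far more "features" than neurons
directions = rng.standard_normal((features, neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between distinct feature directions: small but nonzero,
# which is the price of packing 4096 directions into 512 dimensions.
overlaps = directions @ directions.T
np.fill_diagonal(overlaps, 0.0)
print(np.abs(overlaps).max())                        # around 0.2 for these sizes, versus 1.0 for a collision
```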
Training
From the GPT-3 paper, Language Models are Few-Shot Learners:

Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.
Here is the training data for LLaMA: Open and Efficient Foundation Language Models:

Optimization
BitNet: Scaling 1-bit Transformers for Large Language Models - reduces the FF matrices to 1-bit precision, using just +1 or -1 values.
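A minimal sketch of the ±1 idea: keep only the sign of each weight plus one full-precision scale per matrix. This is a simplification for illustration, not BitNet’s exact recipe:

```python
import numpy as np

def binarize(w):
    """Approximate a weight matrix by {-1, +1} entries and a single fp scale."""
    centered = w - w.mean()              # zero-centre before taking signs
    scale = np.abs(centered).mean()      # the one number kept in full precision
    return np.sign(centered), scale

w = np.random.default_rng(0).standard_normal((4, 4)) * 0.02
signs, scale = binarize(w)
print(np.abs(w - scale * signs).mean())  # reconstruction error of the toy matrix
```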
Generating Long Sequences with Sparse Transformers - sparse factorizations of the attention matrix which reduce the cost to O(n√n).
Nvidia increases performance a thousandfold:
- Number Representation: 16x
- Complex Instructions: 12.5x
- Moore’s Law: 2.5x
- Sparsity: 2x
You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi
Regarding AI’s rate of progress, a fellow AI reporter told Ars, “It’s like those videos of dogs where you upend a crate of tennis balls on them. [They] don’t know where to chase first and get lost in the confusion.”
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tuning
“In general, it’s interesting how little “poking” the “originally trained” network seems to need to get it to usefully go in particular directions.” - Wolfram
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
LoRA: Low-Rank Adaptation of Large Language Models - explanation here.
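The core trick in a few lines of NumPy: freeze the pretrained weight W and train only a low-rank update BA, so the trainable parameter count drops from d*k to r*(d+k). The shapes and rank below are arbitrary:

```python
import numpy as np

d, k, r = 768, 768, 8                    # layer shape and LoRA rank (arbitrary)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weights
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k
B = np.zeros((d, r))                     # trainable, d x r; zero init so W + BA == W at the start

def forward(x):
    # Effective weight is W + B @ A, but the full update is never materialised.
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((2, k))
print(forward(x).shape)                                      # (2, 768)
print(f"trainable: {A.size + B.size} vs frozen: {W.size}")   # 12288 vs 589824
```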
Using LLMs
Catching up on the weird world of LLMs
I read academic papers by pasting pieces of them into GPT-4 and asking it to explain every jargon term in the extract.
I no longer dread naming things. I can ask it for 20 ideas for names, and maybe option number 15 is the one I go with. (It beats brainstorming for an hour). Always ask for “twenty ideas for” — you’ll find that the first ten are super-obvious, but once you get past those things start getting interesting.
“With GPT questions, the vaguer the better sometimes” - Terence Tao
Brex’s Prompt Engineering Guide ⊗
A prompt may include a few examples for a model to learn from, such as “maison -> house, chat -> cat, chien ->”, an approach called few-shot learning. - Wikipedia
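A tiny sketch of assembling such a few-shot prompt as a plain string; the instruction line is a placeholder, and any completion or chat API could consume the result:

```python
examples = [("maison", "house"), ("chat", "cat")]
query = "chien"

prompt = "Translate French to English.\n"
prompt += "\n".join(f"{fr} -> {en}" for fr, en in examples)
prompt += f"\n{query} ->"

print(prompt)
# Translate French to English.
# maison -> house
# chat -> cat
# chien ->        (the model is expected to continue with " dog")
```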
Brains
Thinking, Fast and Slow, says unexpected words cause a distinctive pattern in brain activity to start within a fifth of a second. So, the brain is continuously running an LLM! Apparently, the peak brain clock rate is 200 Hz and the synapse speed is less than a millisecond.
“The reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought” - Wolfram
Overviews
State of GPT - including finetuning and RLHF at 13:00 to 17:00, and LoRA at 35:18.
Adam is based on the idea of adapting separate learning rates for each parameter.
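The per-parameter adaptation in a few lines (the standard Adam update with its usual default hyperparameters; the example gradient is made up):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: each parameter effectively gets its own step size."""
    m = b1 * m + (1 - b1) * grad           # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
theta, m, v = adam_step(theta, grad=np.array([0.1, -0.5]), m=m, v=v, t=1)
print(theta)                               # each coordinate moves by ~lr, with its own sign
```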
AlphaGo consists of a policy network and a value network, which narrow the breadth of the search tree and truncate its depth, respectively.
Neural Architecture Search has become common practice in the field for squeezing every drop of performance out of networks.
The Lottery Ticket Hypothesis asserts that most of a network’s performance comes from a certain subnetwork due to a lucky initialization.
Landmark papers
Attention Is All You Need - Transformers
Improving Language Understanding by Generative Pre-Training - GPT
Language Models are Unsupervised Multitask Learners - GPT-2
Language Models are Few-Shot Learners - GPT-3
Resources
https://github.com/huggingface/transformers
Neural Networks: Zero to Hero - Karpathy’s 12-hour video series that builds towards this implementation.
Visualizing Matrix Multiplication
A Comprehensive Overview of Large Language Models
The impacts of AI
AI Alignment Is Turning from Alchemy Into Chemistry
On the Opportunities and Risks of Foundation Models
Random thoughts
We can’t help equating language ability and intelligence. Therefore, the LLMs’ eloquence might be fooling us into thinking they are smart.
LLMs don’t contain control-flow things like “if” statements.
An LLM is like a fuzzy hologram of the web.
Retrieval-Augmented Generation. OpenAI embeddings API.
Wikipedia: GPT-2, GPT-3, Prompt engineering ⊗, GPT-4, The Pile
What OpenAI Really Wants - A fun history of OpenAI.
Where We See Shapes, AI Sees Textures
[Hinton: Evolution is a tinkerer … we probably don’t need all those kinds of neurons to get an intelligent system](https://youtu.be/Gg-w_n9NJIE?t=4355)
[Hassabis: AI-enabled scientific revolution](https://youtu.be/Gg-w_n9NJIE?t=4423)
