Training
From the GPT-3 paper, Language Models are Few-Shot Learners:

Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.
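The arithmetic behind that caption: the number of passes over a dataset is its mix weight times the total training tokens, divided by the dataset's size. A quick sketch (the token counts below are illustrative round numbers, not the paper's exact figures):

```python
def epochs_seen(mix_weight, dataset_tokens, total_tokens=300e9):
    """How many passes over a dataset its sampling weight implies."""
    return mix_weight * total_tokens / dataset_tokens

# Small dataset sampled heavily (illustrative numbers):
small = epochs_seen(mix_weight=0.03, dataset_tokens=3e9)    # seen ~3 times
# Huge dataset sampled lightly (illustrative numbers):
large = epochs_seen(mix_weight=0.60, dataset_tokens=410e9)  # seen less than once
```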
Here is the training data for LLaMA: Open and Efficient Foundation Language Models:

Optimization
BitNet: Scaling 1-bit Transformers for Large Language Models - reduces the feed-forward weight matrices to 1-bit precision, using only +1 or -1 values.
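A minimal sketch of the 1-bit idea, assuming sign-binarization with one full-precision scale per matrix (a simplification, not BitNet's exact training recipe):

```python
import numpy as np

def binarize(W):
    """Sign-binarize a weight matrix to +1/-1 with one full-precision
    scale (a simplified sketch, not BitNet's exact recipe)."""
    alpha = np.mean(np.abs(W))            # scale preserves overall magnitude
    return alpha, np.where(W >= 0, 1.0, -1.0)

W = np.array([[0.3, -0.7], [0.1, 0.5]])
alpha, Wb = binarize(W)
x = np.array([1.0, 2.0])
y = alpha * (Wb @ x)    # approximates W @ x using only 1-bit weights
```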
Generating Long Sequences with Sparse Transformers - sparse factorizations of the attention matrix that reduce the cost from O(n²) to O(n√n).
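One of the paper's factorized patterns, a local window plus strided connections, can be sketched as a boolean mask; with stride ≈ √n, each row has O(√n) nonzeros (a sketch of the pattern only, not the fused kernel):

```python
import numpy as np

def strided_mask(n, stride):
    """Causal mask where token i attends to a local window of `stride` tokens
    plus every stride-th earlier token - one of the paper's factorized
    patterns (pattern sketch only)."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - stride + 1): i + 1] = True    # local window
        mask[i, i % stride: i + 1: stride] = True        # strided links
    return mask

n = 16
mask = strided_mask(n, stride=int(n ** 0.5))
# each row has O(sqrt(n)) nonzeros instead of O(n)
```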
Nvidia attributes a thousandfold increase in performance to four multiplicative factors:
- Number Representation: 16x
- Complex Instructions: 12.5x
- Moore’s Law: 2.5x
- Sparsity: 2x
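These factors compound multiplicatively; a one-liner confirms they come to a thousandfold:

```python
from math import prod

factors = {"number representation": 16, "complex instructions": 12.5,
           "moore's law": 2.5, "sparsity": 2}
total = prod(factors.values())  # 16 * 12.5 * 2.5 * 2 = 1000.0
```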
You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi
Regarding AI’s rate of progress, a fellow AI reporter told Ars, “It’s like those videos of dogs where you upend a crate of tennis balls on them. [They] don’t know where to chase first and get lost in the confusion.”
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
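The kernel's key trick, computing the softmax incrementally over blocks so the full n×n attention matrix is never materialized, can be sketched for a single query row (a simplified sketch of the online-softmax idea, not the actual fused kernel):

```python
import numpy as np

def online_softmax_attention(scores, values, block=2):
    """softmax(scores) @ values computed block by block with a running max
    and normalizer, so the full score vector is never held at once
    (the core idea behind FlashAttention, for a single query)."""
    m = -np.inf                        # running max of scores
    l = 0.0                            # running softmax denominator
    acc = np.zeros(values.shape[1])    # running weighted sum of values
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)    # correct previous partial sums
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        acc = acc * rescale + p @ v
        m = m_new
    return acc / l

scores = np.array([0.1, 0.5, -0.2, 0.3, 0.9, 0.0])
values = np.arange(18.0).reshape(6, 3)
out = online_softmax_attention(scores, values)
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ values           # the naive all-at-once computation
```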
Tuning
“In general, it’s interesting how little ‘poking’ the ‘originally trained’ network seems to need to get it to usefully go in particular directions.” - Wolfram
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
LoRA: Low-Rank Adaptation of Large Language Models - explanation here.
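The LoRA idea in a few lines: freeze the pretrained weight W0 and train only a low-rank update BA, with B zero-initialized so training starts from the original model (a minimal sketch, not the paper's full setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                  # adapter rank r << d

W0 = rng.normal(size=(d, k))         # pretrained weight, frozen
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x):
    """y = W0 x + B A x; only A and B (2*d*r params vs d*k) get gradients."""
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=k)
# Zero-init of B means the adapter starts as an exact no-op:
baseline_matches = np.allclose(lora_forward(x), W0 @ x)
```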
Using LLMs
Catching up on the weird world of LLMs
I read academic papers by pasting pieces of them into GPT-4 and asking it to explain every jargon term in the extract.
I no longer dread naming things. I can ask it for 20 ideas for names, and maybe option number 15 is the one I go with. (It beats brainstorming for an hour.) Always ask for “twenty ideas for” - you’ll find that the first ten are super-obvious, but once you get past those, things start getting interesting.
“With GPT questions, the vaguer the better sometimes” - Terence Tao
Brex’s Prompt Engineering Guide ⊗
Overviews
Adam adapts a separate learning rate for each parameter, based on running estimates of the gradient’s first and second moments.
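The update rule can be sketched in a few lines (a minimal NumPy sketch of standard Adam, minimizing a toy quadratic):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; m and v give each parameter its own effective step size."""
    m = b1 * m + (1 - b1) * grad        # EMA of gradients (first moment)
    v = b2 * v + (1 - b2) * grad ** 2   # EMA of squared gradients (second moment)
    m_hat = m / (1 - b1 ** t)           # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = sum(theta**2); the gradient is 2*theta.
theta = np.array([1.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
# theta is driven toward the minimum at 0
```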
AlphaGo consists of a policy network, which narrows the breadth of the search tree, and a value network, which allows the depth of the search to be truncated.
Neural Architecture Search has become common practice in the field for squeezing every drop of performance out of networks.
The Lottery Ticket Hypothesis asserts that most of a network’s performance comes from a certain subnetwork due to a lucky initialization.
Landmark papers
Attention Is All You Need - Transformers
Improving Language Understanding by Generative Pre-Training - GPT
Language Models are Unsupervised Multitask Learners - GPT-2
Language Models are Few-Shot Learners - GPT-3
Resources
https://github.com/huggingface/transformers
Neural Networks: Zero to Hero - Karpathy’s 12-hour video series that builds towards this implementation.
Visualizing Matrix Multiplication
A Comprehensive Overview of Large Language Models
The impacts of AI
AI Alignment Is Turning from Alchemy Into Chemistry
On the Opportunities and Risks of Foundation Models
Random thoughts
We can’t help equating language ability with intelligence, so an LLM’s eloquence may fool us into thinking it is smart.
LLMs don’t contain explicit control-flow constructs like “if” statements; every token is produced by the same fixed sequence of matrix operations.
An LLM is like a fuzzy hologram of the web.
Retrieval-Augmented Generation (RAG). The OpenAI embeddings API.
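The retrieval half of RAG reduces to nearest-neighbor search over embedding vectors. A toy sketch, using made-up 3-d vectors in place of real embeddings (which would come from an embeddings endpoint such as OpenAI's and have on the order of a thousand dimensions):

```python
import numpy as np

# Toy vectors standing in for real document embeddings (hypothetical values).
docs = {
    "llama paper": np.array([0.9, 0.1, 0.0]),
    "adam optimizer": np.array([0.1, 0.9, 0.2]),
    "flash attention": np.array([0.2, 0.1, 0.9]),
}

def retrieve(query_vec, k=1):
    """Rank documents by cosine similarity to the query embedding; the top
    hits are then pasted into the prompt as context for the generator."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(docs, key=lambda name: cos(query_vec, docs[name]), reverse=True)[:k]

query = np.array([0.15, 0.85, 0.1])   # pretend embedding of "how does Adam work?"
hits = retrieve(query)
```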
Wikipedia: Prompt engineering ⊗, The Pile
What OpenAI Really Wants - A fun history of OpenAI.