Training
From the GPT-3 paper, Language Models are Few-Shot Learners:

Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.
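The arithmetic behind that caption: the number of passes over a dataset is its mix weight times the total training tokens, divided by the dataset's size. A quick sketch (the token counts below are illustrative round numbers, not the paper's exact figures):

```python
def epochs_seen(mix_weight, dataset_tokens, total_tokens=300e9):
    """How many passes over a dataset its sampling weight implies."""
    return mix_weight * total_tokens / dataset_tokens

# Small dataset sampled heavily (illustrative numbers):
small = epochs_seen(mix_weight=0.03, dataset_tokens=3e9)    # seen ~3 times
# Huge dataset sampled lightly (illustrative numbers):
large = epochs_seen(mix_weight=0.60, dataset_tokens=410e9)  # seen less than once
```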
Here is the training data for LLaMA: Open and Efficient Foundation Language Models:

Optimization
BitNet: Scaling 1-bit Transformers for Large Language Models - reduces the feed-forward weight matrices to 1-bit precision, using only +1 or -1 values.
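A minimal sketch of the 1-bit idea, assuming sign-binarization with one full-precision scale per matrix (a simplification, not BitNet's exact training recipe):

```python
import numpy as np

def binarize(W):
    """Sign-binarize a weight matrix to +1/-1 with one full-precision
    scale (a simplified sketch, not BitNet's exact recipe)."""
    alpha = np.mean(np.abs(W))            # scale preserves overall magnitude
    return alpha, np.where(W >= 0, 1.0, -1.0)

W = np.array([[0.3, -0.7], [0.1, 0.5]])
alpha, Wb = binarize(W)
x = np.array([1.0, 2.0])
y = alpha * (Wb @ x)    # approximates W @ x using only 1-bit weights
```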
Generating Long Sequences with Sparse Transformers - sparse factorizations of the attention matrix that reduce the cost from O(n²) to O(n√n).
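One of the paper's factorized patterns, a local window plus strided connections, can be sketched as a boolean mask; with stride ≈ √n, each row has O(√n) nonzeros (a sketch of the pattern only, not the fused kernel):

```python
import numpy as np

def strided_mask(n, stride):
    """Causal mask where token i attends to a local window of `stride` tokens
    plus every stride-th earlier token - one of the paper's factorized
    patterns (pattern sketch only)."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - stride + 1): i + 1] = True    # local window
        mask[i, i % stride: i + 1: stride] = True        # strided links
    return mask

n = 16
mask = strided_mask(n, stride=int(n ** 0.5))
# each row has O(sqrt(n)) nonzeros instead of O(n)
```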
Nvidia attributes a thousandfold increase in performance to four multiplicative factors:
- Number Representation: 16x
- Complex Instructions: 12.5x
- Moore’s Law: 2.5x
- Sparsity: 2x
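These factors compound multiplicatively; a one-liner confirms they come to a thousandfold:

```python
from math import prod

factors = {"number representation": 16, "complex instructions": 12.5,
           "moore's law": 2.5, "sparsity": 2}
total = prod(factors.values())  # 16 * 12.5 * 2.5 * 2 = 1000.0
```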
You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi
Regarding AI’s rate of progress, a fellow AI reporter told Ars, “It’s like those videos of dogs where you upend a crate of tennis balls on them. [They] don’t know where to chase first and get lost in the confusion.”
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
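The kernel's key trick, computing the softmax incrementally over blocks so the full n×n attention matrix is never materialized, can be sketched for a single query row (a simplified sketch of the online-softmax idea, not the actual fused kernel):

```python
import numpy as np

def online_softmax_attention(scores, values, block=2):
    """softmax(scores) @ values computed block by block with a running max
    and normalizer, so the full score vector is never held at once
    (the core idea behind FlashAttention, for a single query)."""
    m = -np.inf                        # running max of scores
    l = 0.0                            # running softmax denominator
    acc = np.zeros(values.shape[1])    # running weighted sum of values
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)    # correct previous partial sums
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        acc = acc * rescale + p @ v
        m = m_new
    return acc / l

scores = np.array([0.1, 0.5, -0.2, 0.3, 0.9, 0.0])
values = np.arange(18.0).reshape(6, 3)
out = online_softmax_attention(scores, values)
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ values           # the naive all-at-once computation
```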
Tuning
“In general, it’s interesting how little ‘poking’ the ‘originally trained’ network seems to need to get it to usefully go in particular directions.” - Wolfram
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
LoRA: Low-Rank Adaptation of Large Language Models - explanation here.
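The LoRA idea in a few lines: freeze the pretrained weight W0 and train only a low-rank update BA, with B zero-initialized so training starts from the original model (a minimal sketch, not the paper's full setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                  # adapter rank r << d

W0 = rng.normal(size=(d, k))         # pretrained weight, frozen
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x):
    """y = W0 x + B A x; only A and B (2*d*r params vs d*k) get gradients."""
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=k)
# Zero-init of B means the adapter starts as an exact no-op:
baseline_matches = np.allclose(lora_forward(x), W0 @ x)
```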
Using LLMs
Catching up on the weird world of LLMs
I read academic papers by pasting pieces of them into GPT-4 and asking it to explain every jargon term in the extract.
I no longer dread naming things. I can ask it for 20 ideas for names, and maybe option number 15 is the one I go with. (It beats brainstorming for an hour.) Always ask for “twenty ideas for” - you’ll find that the first ten are super-obvious, but once you get past those, things start getting interesting.
“With GPT questions, the vaguer the better sometimes” - Terence Tao
Brex’s Prompt Engineering Guide ⊗
Overviews
Adam adapts a separate learning rate for each parameter, based on running estimates of the gradient’s first and second moments.
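The update rule can be sketched in a few lines (a minimal NumPy sketch of standard Adam, minimizing a toy quadratic):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; m and v give each parameter its own effective step size."""
    m = b1 * m + (1 - b1) * grad        # EMA of gradients (first moment)
    v = b2 * v + (1 - b2) * grad ** 2   # EMA of squared gradients (second moment)
    m_hat = m / (1 - b1 ** t)           # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = sum(theta**2); the gradient is 2*theta.
theta = np.array([1.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
# theta is driven toward the minimum at 0
```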
AlphaGo consists of a policy network, which narrows the breadth of the search tree, and a value network, which allows the depth of the search to be truncated.
Neural Architecture Search has become common practice in the field for squeezing every drop of performance out of networks.
The Lottery Ticket Hypothesis asserts that most of a network’s performance comes from a certain subnetwork due to a lucky initialization.
Landmark papers
Attention Is All You Need - Transformers
Improving Language Understanding by Generative Pre-Training - GPT
Language Models are Unsupervised Multitask Learners - GPT-2
Language Models are Few-Shot Learners - GPT-3
Resources
https://github.com/huggingface/transformers
Neural Networks: Zero to Hero - Karpathy’s 12-hour video series that builds towards this implementation.
Visualizing Matrix Multiplication
A Comprehensive Overview of Large Language Models
The impacts of AI
AI Alignment Is Turning from Alchemy Into Chemistry
On the Opportunities and Risks of Foundation Models
Random thoughts
We can’t help equating language ability with intelligence, so an LLM’s eloquence may fool us into thinking it is smart.
LLMs don’t contain explicit control-flow constructs like “if” statements; every token is produced by the same fixed sequence of matrix operations.
An LLM is like a fuzzy hologram of the web.
Retrieval-Augmented Generation (RAG). The OpenAI embeddings API.
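The retrieval half of RAG reduces to nearest-neighbor search over embedding vectors. A toy sketch, using made-up 3-d vectors in place of real embeddings (which would come from an embeddings endpoint such as OpenAI's and have on the order of a thousand dimensions):

```python
import numpy as np

# Toy vectors standing in for real document embeddings (hypothetical values).
docs = {
    "llama paper": np.array([0.9, 0.1, 0.0]),
    "adam optimizer": np.array([0.1, 0.9, 0.2]),
    "flash attention": np.array([0.2, 0.1, 0.9]),
}

def retrieve(query_vec, k=1):
    """Rank documents by cosine similarity to the query embedding; the top
    hits are then pasted into the prompt as context for the generator."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(docs, key=lambda name: cos(query_vec, docs[name]), reverse=True)[:k]

query = np.array([0.15, 0.85, 0.1])   # pretend embedding of "how does Adam work?"
hits = retrieve(query)
```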
Wikipedia: Prompt engineering ⊗, The Pile
What OpenAI Really Wants - A fun history of OpenAI.