The GPT-3 paper, Language Models are Few-Shot Learners, has this table of model sizes:

The last row shows that the full GPT-3 has 175 billion parameters, arranged in 96 blocks, where each block has an attention layer with 4 * 12288 * 12288 parameters (the query, key, value and output projections) and a pair of feed-forward layers with a further 8 * 12288 * 12288 parameters (expanding 12288 up to 4 * 12288 and back down again).
The parameter count doesn’t depend on the context window size (which for GPT-3 is 2048 and for GPT-3.5 is 4096), or the number of attention heads, or the vocabulary size (which for GPT-3 is 50,257). Instead, it depends almost entirely on the number of layers and the model dimension (the embedding size of each token); the token-embedding matrix, at vocabulary size times model dimension ≈ 0.6 billion parameters, is small enough to ignore at this scale.
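The arithmetic above can be checked directly. A minimal sketch, using the per-block counts from the text (4d² for attention, 8d² for the feed-forward pair):

```python
# Transformer parameter count as described in the text: each of the
# 96 blocks has an attention layer with 4*d*d parameters and a
# feed-forward pair with 8*d*d parameters (embeddings ignored).
def transformer_params(n_layers: int, d_model: int) -> int:
    attention = 4 * d_model * d_model
    feed_forward = 8 * d_model * d_model
    return n_layers * (attention + feed_forward)

gpt3 = transformer_params(n_layers=96, d_model=12288)
print(f"{gpt3:,}")  # 173,946,175,488 — i.e. ~175 billion
```

Note how everything reduces to 12 * layers * d², which is why only those two numbers matter.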
According to this, GPT-3 needs roughly 740 teraflops to process a full 2048-token context — about 2 * 175 billion ≈ 0.35 teraflops per token — since every parameter takes part in the computation for every token (which, in turn, allows the model to do more computation than the raw parameter count would imply). Later versions of GPT-3 had lower inference costs (?).
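The 740-teraflop figure can be roughly reproduced with the standard rule of thumb (an assumption here, not something stated in the paper) that a forward pass costs about 2 FLOPs per parameter per token — one multiply and one add per weight:

```python
# Rule-of-thumb inference cost: ~2 FLOPs per parameter per token.
# This ignores the attention-score computation, which only adds a
# few percent at GPT-3's scale.
PARAMS = 175e9
FLOPS_PER_TOKEN = 2 * PARAMS       # ~0.35 teraflops per token

context = 2048
total = FLOPS_PER_TOKEN * context  # ~7.2e14 FLOPs for a full context
print(f"{FLOPS_PER_TOKEN / 1e12:.2f} TFLOPs/token, "
      f"{total / 1e12:.0f} TFLOPs per full context")
```

The ~717 TFLOPs this gives lands close to the quoted 740; the gap is plausibly the attention computation the rule of thumb omits.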
In the paper, they say:
for most tasks we find relatively smooth scaling with model capacity
and then, intriguingly:
the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners
For completeness, here are the GPT-2 small, medium, large and XL model sizes.
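Applying the same 12 * layers * d² rule to the GPT-2 family (the layer counts and model dimensions below are from the GPT-2 paper; the formula itself is the approximation above):

```python
# GPT-2 family: (layers, d_model) per the GPT-2 paper.
GPT2_SIZES = {
    "small":  (12, 768),
    "medium": (24, 1024),
    "large":  (36, 1280),
    "xl":     (48, 1600),
}

def approx_params(n_layers: int, d_model: int) -> int:
    # 4*d^2 attention + 8*d^2 feed-forward per block
    return 12 * n_layers * d_model * d_model

for name, (layers, d) in GPT2_SIZES.items():
    print(f"{name:>6}: ~{approx_params(layers, d) / 1e6:.0f}M")
```

This prints ~85M / ~302M / ~708M / ~1475M, versus the official 124M / 355M / 774M / 1.5B: the formula undercounts more at small scale, because there the 50,257 * d token-embedding matrix is no longer negligible.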

Memory bandwidth constraints imply economies of scale in AI inference - avoiding GPU bandwidth limits by favouring things like matrix multiply, where the computation is O(N³) and the memory traffic is O(N²), possibly with tiling to make sure things fit in the cache.
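The matmul point can be made concrete: an N×N matrix multiply does 2N³ FLOPs while moving about 3N² values, so FLOPs-per-byte grows linearly with N. A sketch (the 3N² counts reading both operands and writing the result once, ignoring caching effects):

```python
# Arithmetic intensity of an N x N matrix multiply:
# compute is 2*N^3 FLOPs (N^3 multiply-adds), memory traffic is
# 3*N^2 elements (read A, read B, write C) at 4 bytes each for fp32.
def arithmetic_intensity(n: int) -> float:
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * 4
    return flops / bytes_moved  # FLOPs per byte

for n in (128, 1024, 8192):
    print(f"N={n:5d}: {arithmetic_intensity(n):8.1f} FLOPs/byte")
```

Bigger matrices keep the arithmetic units busier per byte fetched, which is one reason large batched inference is so much cheaper per token than small-batch inference.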
Chinchilla’s wild implications - running out of data.