The end of programming

Matt Welsh:

How does all of this change how we think about the field of computer science? The new atomic unit of computation becomes not a processor, memory, and I/O system implementing a von Neumann machine, but rather a massive, pre-trained, highly adaptive AI model. This is a seismic shift in the way we think about computation—not as a predictable, static process, governed by instruction sets, type systems, and notions of decidability. AI-based computation has long since crossed the Rubicon of being amenable to static analysis and formal proof. We are rapidly moving toward a world where the fundamental building blocks of computation are temperamental, mysterious, adaptive agents.

This shift is underscored by the fact that nobody actually understands how large AI models work. People are publishing research papers [3,4,5] actually discovering new behaviors of existing large models, even though these systems have been "engineered" by humans. Large AI models are capable of doing things that they have not been explicitly trained to do, which should scare the living daylights out of Nick Bostrom [2] and anyone else worried (rightfully) about a superintelligent AI running amok. We currently have no way, apart from empirical study, to determine the limits of current AI systems. As for future AI models that are orders of magnitude larger and more complex—good luck!

The shift in focus from programs to models should be obvious to anyone who has read any modern machine learning papers. These papers barely mention the code or systems underlying their innovations; the building blocks of AI systems are much higher-level abstractions like attention layers, tokenizers, and datasets. A time traveler from even 20 years ago would have a hard time making sense of the three sentences in the (75-page!) GPT-3 paper [3] describing the actual software built for the model: "We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer. To study the dependence of ML performance on model size, we train eight different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks."
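The "smooth power law" the quoted passage invokes is the hypothesis that validation loss falls roughly as L(N) ≈ a · N^(−α) in parameter count N. The sketch below shows what testing that hypothesis looks like in a few lines of Python: the eight model sizes are the ones named in the quote (125M through 175B parameters), but the loss values are illustrative placeholders, not figures from the paper, and the fitted exponent is likewise only for demonstration.

```python
import numpy as np

# The eight GPT-3 model sizes mentioned in the quote (parameters).
sizes = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])

# Hypothetical validation losses, invented for illustration only.
losses = np.array([3.00, 2.80, 2.65, 2.55, 2.45, 2.35, 2.28, 2.10])

# A power law L(N) = a * N**(-alpha) is a straight line in log-log space:
#   log L = log a - alpha * log N
# so an ordinary least-squares fit on the logs recovers alpha and a.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted power law: loss ~ {a:.2f} * N^(-{alpha:.3f})")

# If the fit held, you could extrapolate to an untrained size --
# here a hypothetical 1-trillion-parameter model.
print(f"extrapolated loss at 1T params: {a * 1e12 ** -alpha:.2f}")
```

The point of the exercise is the one Welsh is making: the interesting object here is not the code (which is trivial) but the empirical behavior of the model family, observable only by training many models and measuring them.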