Capturing Attention: Decoding the Success of Transformer Models in Natural Language Processing
Zian (Andy) Wang
The Transformer model, introduced in the paper “Attention Is All You Need,” has influenced virtually every subsequent language modeling architecture or technique, from novel models such as BERT, Transformer-XL, and RoBERTa to the recent ChatGPT, which has enthralled the internet as one of the most impressive conversational chatbots yet. It is clear that the transformer-based architecture has left an undeniable imprint on the field of language modeling and on machine learning in general.
Transformers are so powerful that a study by Buck Shlegeris et al. found that Transformer-based language models outperform humans [1, see references below] at next-word prediction, with humans achieving 38% accuracy compared to GPT-3’s 56%. In a Lex Fridman podcast interview, computer scientist Andrej Karpathy stated that “the Transformers [are] taking over AI,” that the architecture has proven “remarkably resilient,” and went even further by calling it a “general purpose differentiable computer.”
Transformers and their variants have proven to be incredibly effective at understanding and deciphering the intricate structure of languages. Their exceptional capacity for logical reasoning and text comprehension has generated curiosity about what makes them so powerful.
In the same podcast, Karpathy mentioned that the Transformer is “powerful in the forward pass because it can express […] general computation […] as something that looks like message passing.” This leads to the first critical factor contributing to the Transformer’s success: the residual stream. [2]
The Residual Stream
Most people understand neural networks as operating sequentially: each layer receives a tensor as input (for those unfamiliar, in this context a “tensor” is just the processed matrix handed down from the previous layer or the raw input), transforms it, and passes the result on for the next layer to consume. Each layer’s output thus serves as the next layer’s input, creating a strict chain of dependencies between layers. Transformer-based models instead operate by extracting information from a common “residual stream” shared by all attention and MLP blocks.
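To make the contrast concrete, here is a minimal PyTorch sketch of the two composition patterns (the stack of linear layers is a stand-in for illustration, not any particular model): a plain sequential stack, where each layer’s output fully replaces its input, versus a residual stack, where each layer’s output is added back into a shared stream.

```python
import torch
import torch.nn as nn

# Plain sequential composition: layer i depends entirely on layer i-1's output.
def sequential_forward(layers, x):
    for layer in layers:
        x = layer(x)  # the new output fully replaces the previous tensor
    return x

# Residual composition: each layer reads the shared stream and "writes"
# its contribution back by addition, so earlier information persists.
def residual_forward(layers, x):
    for layer in layers:
        x = x + layer(x)  # the stream accumulates every layer's output
    return x

layers = [nn.Linear(64, 64) for _ in range(4)]  # hypothetical stand-in layers
x = torch.randn(2, 64)
print(residual_forward(layers, x).shape)  # torch.Size([2, 64])
```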
Transformer-based models, such as the GPT family, comprise stacked residual blocks, each consisting of an attention layer followed by a multilayer perceptron (MLP) layer. Whether attention head or MLP, every layer “reads” from the residual stream and “writes” its results back to it; the stream is thus an accumulation of the outputs of all prior layers. Specifically, each layer reads from the residual stream through a linear projection and writes to it through a linear projection followed by addition.
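As a rough illustration, a simplified pre-LayerNorm, GPT-style residual block might look like the sketch below (the dimensions, names, and use of PyTorch’s built-in nn.MultiheadAttention are assumptions for illustration, not the implementation of any specific model):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One GPT-style block: the attention and MLP sublayers each read
    from the shared residual stream and write back into it by addition."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # project up
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back to stream width
        )

    def forward(self, x):
        # "Read": a normalized linear view of the stream feeds each sublayer;
        # "write": the sublayer's projected output is added back to the stream.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out               # attention writes to the residual stream
        x = x + self.mlp(self.ln2(x))  # the MLP writes to the residual stream
        return x

stream = torch.randn(1, 16, 768)                    # (batch, sequence, d_model)
blocks = nn.Sequential(*[ResidualBlock() for _ in range(2)])
stream = blocks(stream)   # the same stream flows through every stacked block
```

Notice that no sublayer’s output ever replaces the stream; each is merely added to it, which is what allows a late layer to read information deposited by a much earlier one.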
To provide a high-level understanding of how the residual stream contributes to Transformer-based models, imagine a team of data scientists working on a machine learning problem. They start by collecting and storing a large dataset in a database. They divide the work among themselves based on their specialties. The first few members of the team, like data engineers and data analysts, perform initial processing on the data. They might clean and organize it or identify patterns and trends.
They then pass the “preprocessed” data to the machine learning engineers, who use it to build and train predictive models. In a transformer-based model, we can think of each layer as a team member: each layer (or series of layers) has its own specialty, and through training they learn to work together like a real team. For example, a set of attention blocks might act as the “data engineering” team, responsible for moving and manipulating data samples and removing useless information (and yes, in reality, attention blocks can do that!).