The core of the Transformer's processing lies in the Transformer block, which comprises multi-head self-attention and a Multi-Layer Perceptron (MLP) layer. Most models consist of multiple such blocks stacked sequentially one after the other. The token representations evolve through the layers, from the first block to the twelfth, allowing the model to build up an intricate understanding of each token. This layered approach leads to higher-order representations of the input.
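As a minimal sketch of this stacking (PyTorch is assumed here, since the underlying model comes from a PyTorch implementation; the block itself is a placeholder whose internals are covered in the sections below):

    import torch

    n_blocks, d_model = 12, 768            # GPT-2 (small): 12 blocks, 768-dim representations

    def transformer_block(x):
        # Placeholder for one block: multi-head self-attention followed by the MLP,
        # as described in the sections that follow.
        return x

    x = torch.randn(1, 5, d_model)         # representations for a 5-token input
    for _ in range(n_blocks):              # representations evolve block by block
        x = transformer_block(x)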
Multi-Head Self-Attention

The self-attention mechanism enables the model to focus on relevant parts of the input sequence, allowing it to capture complex relationships and dependencies within the data. Let's look at how this self-attention is computed step by step.
Step 1: Query, Key, and Value Matrices

Figure 2. Computing Query, Key, and Value matrices from the original embedding.
Each token's embedding vector is transformed into three vectors: Query (Q), Key (K), and Value (V). These vectors are derived by multiplying the input embedding matrix with learned weight matrices for Q, K, and V. Here's a web search analogy to help us build some intuition behind these matrices:

Query (Q) is the search text you type in the search engine bar. This is the token you want to "find more information about".

Key (K) is the title of each web page in the search result window. It represents the possible tokens the query can attend to.

Value (V) is the actual content of the web pages shown. Once we have matched the appropriate search term (Query) with the relevant results (Key), we want to get the content (Value) of the most relevant pages.

By using these QKV values, the model can calculate attention scores, which determine how much focus each token should receive when generating predictions.
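As a minimal PyTorch sketch of this step (sizes follow GPT-2 small; in the actual GPT-2 code the three projections are typically fused into a single matrix, but they are kept separate here for clarity):

    import torch
    import torch.nn as nn

    d_model = 768                      # GPT-2 (small) embedding size
    x = torch.randn(1, 5, d_model)     # embedding matrix for a 5-token input

    # Learned weight matrices for Q, K, and V
    W_q = nn.Linear(d_model, d_model)
    W_k = nn.Linear(d_model, d_model)
    W_v = nn.Linear(d_model, d_model)

    Q, K, V = W_q(x), W_k(x), W_v(x)   # each has shape (1, 5, 768)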
Step 2: Masked Self-Attention

Masked self-attention allows the model to generate sequences by focusing on relevant parts of the input while preventing access to future tokens.

Figure 3. Using Query, Key, and Value matrices to calculate masked self-attention.
Attention Score: The dot product of the Query and Key matrices determines the alignment of each query with each key, producing a square matrix that reflects the relationship between all input tokens.

Masking: A mask is applied to the upper triangle of the attention matrix to prevent the model from accessing future tokens, setting these values to negative infinity. The model needs to learn how to predict the next token without “peeking” into the future.

Softmax: After masking, the attention scores are converted into probabilities by the softmax operation, which takes the exponent of each attention score. Each row of the matrix then sums to one and indicates the relevance of all tokens to its left.
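Continuing the sketch above for a single head, the three sub-steps look roughly like this (the 1/sqrt(d_k) scaling used in the actual model is included, even though it is not spelled out above):

    import torch

    T, d_k = 5, 64                                   # sequence length, per-head dimension
    Q, K, V = torch.randn(3, T, d_k)                 # stand-ins for one head's Q, K, V

    # 1. Attention score: dot product of Query and Key
    scores = Q @ K.T / d_k ** 0.5                    # (T, T) token-to-token scores

    # 2. Masking: set the upper triangle (future tokens) to negative infinity
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    # 3. Softmax: each row becomes a probability distribution over visible tokens
    weights = torch.softmax(scores, dim=-1)          # each row sums to one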
Step 3: Output

The model uses the masked self-attention scores and multiplies them with the Value matrix to get the final output of the self-attention mechanism. GPT-2 has 12 self-attention heads, each capturing different relationships between tokens. The outputs of these heads are concatenated and passed through a linear projection.
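A sketch of this step, continuing the single-head example above and extending it to 12 heads (the attention weights below are stand-ins for the masked, softmaxed scores computed earlier):

    import torch
    import torch.nn as nn

    n_heads, d_head, T = 12, 64, 5                       # 12 heads x 64 dims = 768
    weights = torch.softmax(torch.randn(n_heads, T, T), dim=-1)   # per-head attention weights
    V = torch.randn(n_heads, T, d_head)                  # per-head Value vectors

    head_out = weights @ V                               # (12, 5, 64): weighted sums of Values
    concat = head_out.transpose(0, 1).reshape(T, n_heads * d_head)   # (5, 768)

    proj = nn.Linear(768, 768)                           # linear projection after concatenation
    out = proj(concat)                                   # final self-attention output per token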
MLP: Multi-Layer Perceptron

Figure 4. Using the MLP layer to project the self-attention representations into higher dimensions to enhance the model's representational capacity.
After the multiple heads of self-attention capture the diverse relationships between the input tokens, the concatenated outputs are passed through the Multi-Layer Perceptron (MLP) layer to enhance the model's representational capacity. The MLP block consists of two linear transformations with a GELU activation function in between. The first linear transformation increases the dimensionality of the input four-fold, from 768 to 3072. The second linear transformation reduces the dimensionality back to the original size of 768, ensuring that the subsequent layers receive inputs of consistent dimensions. Unlike the self-attention mechanism, the MLP processes tokens independently and simply maps them from one representation to another.
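A minimal sketch of the MLP block (sizes follow GPT-2 small):

    import torch
    import torch.nn as nn

    mlp = nn.Sequential(
        nn.Linear(768, 3072),   # expand four-fold: 768 -> 3072
        nn.GELU(),              # GELU activation in between
        nn.Linear(3072, 768),   # project back down: 3072 -> 768
    )

    x = torch.randn(5, 768)     # 5 token representations from self-attention
    y = mlp(x)                  # tokens are processed independently; shape stays (5, 768)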
Output Probabilities

After the input has been processed through all Transformer blocks, the output is passed through the final linear layer to prepare it for token prediction. This layer projects the final representations into a 50,257-dimensional space, where every token in the vocabulary has a corresponding value called a logit. Any token can be the next word, so this process allows us to simply rank these tokens by their likelihood of being that next word. We then apply the softmax function to convert the logits into a probability distribution that sums to one. This allows us to sample the next token based on its likelihood.
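A sketch of this final projection (variable names are illustrative):

    import torch
    import torch.nn as nn

    d_model, vocab_size = 768, 50257
    lm_head = nn.Linear(d_model, vocab_size, bias=False)   # final linear layer

    h = torch.randn(1, d_model)              # final representation of the last token
    logits = lm_head(h)                      # one logit per vocabulary token: (1, 50257)
    probs = torch.softmax(logits, dim=-1)    # probability distribution that sums to one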
Figure 5. Each token in the vocabulary is assigned a probability based on the model's output logits. These probabilities determine the likelihood of each token being the next word in the sequence.
The final step is to generate the next token by sampling from this distribution. The temperature hyperparameter plays a critical role in this process. Mathematically speaking, it is a very simple operation: the model output logits are simply divided by the temperature:
temperature = 1: Dividing logits by one has no effect on the softmax outputs.

temperature < 1: Lower temperature makes the model more confident and deterministic by sharpening the probability distribution, leading to more predictable outputs.

temperature > 1: Higher temperature creates a softer probability distribution, allowing for more randomness in the generated text – what some refer to as model “creativity”.
Adjust the temperature and see how you can balance between deterministic and diverse outputs!
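As a small numeric sketch of the same effect (the logit values are made up for illustration):

    import torch

    logits = torch.tensor([2.0, 1.0, 0.1])                    # example output logits

    for temperature in (0.5, 1.0, 2.0):
        probs = torch.softmax(logits / temperature, dim=-1)   # divide logits by temperature
        next_token = torch.multinomial(probs, num_samples=1)  # sample the next token
        print(temperature, probs.tolist(), next_token.item())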
Advanced Architectural Features

There are several advanced architectural features that enhance the performance of Transformer models. While important for the model's overall performance, they are not as important for understanding the core concepts of the architecture. Layer Normalization, Dropout, and Residual Connections are crucial components in Transformer models, particularly during the training phase. Layer Normalization stabilizes training and helps the model converge faster. Dropout prevents overfitting by randomly deactivating neurons. Residual Connections allow gradients to flow directly through the network and help to prevent the vanishing gradient problem.
Layer Normalization

Layer Normalization helps to stabilize the training process and improves convergence. It works by normalizing the inputs across the features, ensuring that the mean and variance of the activations are consistent. This normalization helps mitigate issues related to internal covariate shift, allowing the model to learn more effectively and reducing the sensitivity to the initial weights. Layer Normalization is applied twice in each Transformer block, once before the self-attention mechanism and once before the MLP layer.
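A minimal sketch of what Layer Normalization computes over the feature dimension:

    import torch
    import torch.nn as nn

    ln = nn.LayerNorm(768)          # normalizes across the 768 features of each token

    x = torch.randn(5, 768)         # 5 token representations
    y = ln(x)                       # each row now has roughly zero mean and unit variance
    print(y.mean(dim=-1), y.std(dim=-1))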
Dropout

Dropout is a regularization technique used to prevent overfitting in neural networks by randomly setting a fraction of neuron activations to zero during training. This encourages the model to learn more robust features and reduces dependency on specific neurons, helping the network generalize better to new, unseen data. During model inference, dropout is deactivated. This essentially means that we are using an ensemble of the trained subnetworks, which leads to better model performance.
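A short sketch of the training-versus-inference behaviour:

    import torch
    import torch.nn as nn

    drop = nn.Dropout(p=0.1)        # zero out roughly 10% of values during training

    x = torch.ones(8)
    drop.train()
    print(drop(x))                  # some entries zeroed, the rest scaled by 1 / (1 - p)
    drop.eval()
    print(drop(x))                  # deactivated at inference: output equals input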
Residual Connections

Residual connections were first introduced in the ResNet model in 2015. This architectural innovation revolutionized deep learning by enabling the training of very deep neural networks. Essentially, residual connections are shortcuts that bypass one or more layers, adding the input of a layer to its output. This helps mitigate the vanishing gradient problem, making it easier to train deep networks with multiple Transformer blocks stacked on top of each other. In GPT-2, residual connections are used twice within each Transformer block: once before the MLP and once after, ensuring that gradients flow more easily and earlier layers receive sufficient updates during backpropagation.
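A sketch of how the two residual connections and the two Layer Normalizations sit inside one GPT-2 block (the attention sub-layer is a placeholder here):

    import torch
    import torch.nn as nn

    ln1, ln2 = nn.LayerNorm(768), nn.LayerNorm(768)
    attn = nn.Linear(768, 768)      # placeholder for masked multi-head self-attention
    mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

    def block(x):
        x = x + attn(ln1(x))        # residual connection around self-attention
        x = x + mlp(ln2(x))         # residual connection around the MLP
        return x

    y = block(torch.randn(5, 768))  # shape is preserved: (5, 768)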
Interactive Features

Transformer Explainer is built to be interactive and allows you to explore the inner workings of the Transformer. Here are some of the interactive features you can play with:

Input your own text sequence to see how the model processes it and predicts the next word. Explore attention weights, intermediate computations, and see how the final output probabilities are calculated.

Use the temperature slider to control the randomness of the model's predictions. Explore how you can make the model output more deterministic or more creative by changing the temperature value.

Interact with attention maps to see how the model focuses on different tokens in the input sequence. Hover over tokens to highlight their attention weights and explore how the model captures context and relationships between words.
Video Tutorial
How is Transformer Explainer Implemented?

Transformer Explainer features a live GPT-2 (small) model running directly in the browser. This model is derived from the PyTorch implementation of GPT from Andrej Karpathy's nanoGPT project and has been converted to ONNX Runtime for seamless in-browser execution. The interface is built using JavaScript, with Svelte as a front-end framework and D3.js for creating dynamic visualizations. Numerical values are updated live following the user input.
Who developed the Transformer Explainer?

Transformer Explainer was created by Aeree Cho, Grace C. Kim, Alexander Karpekov, Alec Helbling, Jay Wang, Seongmin Lee, Benjamin Hoover, and Polo Chau at the Georgia Institute of Technology.