diff --git a/src/components/Attention.svelte b/src/components/Attention.svelte index 0ea8df8..db38bff 100644 --- a/src/components/Attention.svelte +++ b/src/components/Attention.svelte @@ -47,11 +47,11 @@
-
Multi-head Self Attention
+
多头自注意力
-
Transformer Block 1
+
Transformer 块 1
-
Out
+
输出
{#each $tokens as token, index}
diff --git a/src/components/AttentionMatrix.svelte b/src/components/AttentionMatrix.svelte index d2a06c0..0f4bf68 100644 --- a/src/components/AttentionMatrix.svelte +++ b/src/components/AttentionMatrix.svelte @@ -276,7 +276,7 @@ shape={'circle'} colorScale={qkColorScale} /> -
Dot product
+
点积
@@ -333,7 +333,7 @@ colorScale={maskedColorScale} />
-
Scaling · Mask
+
缩放 · 掩码
@@ -393,7 +393,7 @@ />
-
Softmax · Dropout
+
Softmax · 暂退
@@ -423,7 +423,7 @@ colorScale={softmaxColorScale} /> -
Attention
+
注意力
diff --git a/src/components/Embedding.svelte b/src/components/Embedding.svelte index d5d14d9..b3ae56a 100644 --- a/src/components/Embedding.svelte +++ b/src/components/Embedding.svelte @@ -118,7 +118,7 @@ on:mouseenter={handleMouseEnter} on:mouseleave={handleMouseLeave} > -
Embedding
+
嵌入
@@ -142,10 +142,10 @@
- Token
Embedding
Token 嵌入{`Converts tokens into \nsemantically meaningful \nnumerical representations.`}{`将 Token 转换为具有语义意义的数字表示`}
{#each $tokens as token, index} @@ -205,10 +205,10 @@
- Positional
Encoding
位置编码{`Encodes positional \ninformation of tokens into \nnumerical representations.`}{`将标记的位置信息编码为数字表示`}
{#each $tokens as token, index} diff --git a/src/components/HelpPopover.svelte b/src/components/HelpPopover.svelte index b6a4796..d08bc1f 100644 --- a/src/components/HelpPopover.svelte +++ b/src/components/HelpPopover.svelte @@ -35,7 +35,7 @@ {#if goTo}
- Read more + 阅读更多
{/if}
- Examples + 示例 {#each inputTextExample as text, index} @@ -185,10 +185,10 @@ {/if} {#if $isLoaded && $isFetchingModel} Try the examples while GPT-2 model is being downloaded (600MB) 在 GPT-2 模型(600MB)下载期间,可先尝试这些示例 {:else if exceedLimit} - You can enter up to {wordLimit} words. + 您最多可以输入 {wordLimit} 个词。 {/if}
@@ -201,7 +201,7 @@ type="submit" on:click={handleSubmit} > - Generate + 生成 diff --git a/src/components/LinearSoftmax.svelte b/src/components/LinearSoftmax.svelte index dfa7783..2439dc2 100644 --- a/src/components/LinearSoftmax.svelte +++ b/src/components/LinearSoftmax.svelte @@ -150,7 +150,7 @@ on:mouseenter={handleMouseEnter} on:mouseleave={handleMouseLeave} > -
Probabilities
+
概率
Logits
-
Exponents
+
指数
diff --git a/src/components/Mlp.svelte b/src/components/Mlp.svelte index 34a6173..15e1682 100644 --- a/src/components/Mlp.svelte +++ b/src/components/Mlp.svelte @@ -30,7 +30,7 @@
- MLP + 多层感知器(MLP)
diff --git a/src/components/Operation.svelte b/src/components/Operation.svelte index 768b326..cb1a94b 100644 --- a/src/components/Operation.svelte +++ b/src/components/Operation.svelte @@ -62,7 +62,7 @@ - Layer Normalization + 层归一化 {/if}
{:else if type === 'residual-start'} @@ -70,7 +70,7 @@
{#if head} - Residual{/if} + 残差{/if}
diff --git a/src/components/Popovers/ActivationPopover.svelte b/src/components/Popovers/ActivationPopover.svelte index fc3a3aa..f1b01d8 100644 --- a/src/components/Popovers/ActivationPopover.svelte +++ b/src/components/Popovers/ActivationPopover.svelte @@ -11,7 +11,7 @@
- Applies activation function to neuron outputs. + 将激活函数应用于神经元输出。
diff --git a/src/components/Popovers/CommonPopover.svelte b/src/components/Popovers/CommonPopover.svelte index 985f6e5..3688b6e 100644 --- a/src/components/Popovers/CommonPopover.svelte +++ b/src/components/Popovers/CommonPopover.svelte @@ -42,7 +42,7 @@ {#if goTo}
- Read more + 阅读更多
{/if}
-
Disables randomly selected neurons.
禁用随机选择的神经元。 diff --git a/src/components/Popovers/LayerNormPopover.svelte index f0a8094..50b5304 100644 --- a/src/components/Popovers/LayerNormPopover.svelte +++ b/src/components/Popovers/LayerNormPopover.svelte @@ -14,7 +14,7 @@
- Standardizes layer inputs to maintain consistent mean and variance. + 标准化层输入以保持一致的均值和方差。
diff --git a/src/components/Popovers/MLPWeightPopover.svelte b/src/components/Popovers/MLPWeightPopover.svelte index 26c81e8..1db3268 100644 --- a/src/components/Popovers/MLPWeightPopover.svelte +++ b/src/components/Popovers/MLPWeightPopover.svelte @@ -283,7 +283,7 @@
-
Token Embedding
+
Token 嵌入
×
-
Q·K·V Weights
+
Q·K·V 权重
+
-
bias
+
偏置
-
Position
+
位置
{#each $tokens as token, token_idx}
-
Embedding
+
嵌入
{#each $tokens as token, token_idx}
-
Encoding Matrix
+
编码矩阵
diff --git a/src/components/Popovers/QKVWeightPopover.svelte b/src/components/Popovers/QKVWeightPopover.svelte index 8b9f42d..fefb0b2 100644 --- a/src/components/Popovers/QKVWeightPopover.svelte +++ b/src/components/Popovers/QKVWeightPopover.svelte @@ -221,7 +221,7 @@
-

QKV Calculation

+

QKV 计算过程

{#if isAnimationActive}
-
Embedding
+
嵌入
×
-
Q·K·V Weights
+
Q·K·V 权重
+
-
Bias
+
偏置
- Adds skip-connections to allow for better gradient flow. + 添加跳跃连接(skip connection)以实现更好的梯度流。
diff --git a/src/components/SubsequentBlocks.svelte b/src/components/SubsequentBlocks.svelte index 816f885..b6d0841 100644 --- a/src/components/SubsequentBlocks.svelte +++ b/src/components/SubsequentBlocks.svelte @@ -32,9 +32,9 @@ -->
- {$modelMeta.layer_num - 1} more identical
Transformer
Blocks
. + 还有 {$modelMeta.layer_num - 1} 个相同的
Transformer 块
-
Temperature
+
Temperature(温度)
- {`Changes the output \nprobability distribution \nand randomness \nof next token.`} + {`改变下一个 token 的输出概率分布和随机性。`}
diff --git a/src/components/article/Article.svelte b/src/components/article/Article.svelte index 2deda20..fbda609 100644 --- a/src/components/article/Article.svelte +++ b/src/components/article/Article.svelte @@ -15,472 +15,380 @@
-

What is a Transformer?

+

什么是 Transformer?

- Transformer is a neural network architecture that has fundamentally changed the approach to - Artificial Intelligence. Transformer was first introduced in the seminal paper + Transformer 是一种从根本上改变了人工智能方法的神经网络架构。它首次出现在 2017 年的开创性论文 "Attention is All You Need"《Attention is All You Need》 - in 2017 and has since become the go-to architecture for deep learning models, powering text-generative - models like OpenAI's GPT, Meta's Llama, and Google's - Gemini. Beyond text, Transformer is also applied in + 中,此后已成为深度学习模型的首选架构,为 OpenAI + 的 GPT、Meta 的 Llama 和 Google 的 + Gemini 等文本生成模型提供支持。 + 除了文本之外,Transformer 还应用于 audio generation, + target="_blank">音频生成、 image recognition, + target="_blank">图像识别、 protein structure prediction, and even + >蛋白质结构预测,甚至 game playing, demonstrating its versatility across numerous domains. + target="_blank">游戏中,展示了其在众多领域的通用性。

- Fundamentally, text-generative Transformer models operate on the principle of next-word prediction: given a text prompt from the user, what is the most probable next word that will follow - this input? The core innovation and power of Transformers lie in their use of self-attention mechanism, - which allows them to process entire sequences and capture long-range dependencies more effectively - than previous architectures. + 从根本上讲,文本生成 Transformer 模型的运行原理是下一个单词预测:给定用户的文本提示, + 紧随此输入之后的最有可能的下一个单词是什么?Transformer 的核心创新和强大之处在于它们使用了 + 自注意力机制,这使得它们能够比以前的架构更有效地处理整个序列并捕获长距离依赖关系。

- GPT-2 family of models are prominent examples of text-generative Transformers. Transformer - Explainer is powered by the + GPT-2 系列模型是文本生成 Transformers 的杰出代表。Transformer Explainer 基于 GPT-2 - (small) model which has 124 million parameters. While it is not the latest or most powerful Transformer - model, it shares many of the same architectural components and principles found in the current - state-of-the-art models making it an ideal starting point for understanding the basics. + (small),该模型有 1.24 亿个参数。虽然它不是最新或最强大的 Transformer 模型, + 但它具有许多与当前最先进模型相同的架构组件和原理,使其成为理解基础知识的理想起点。

-

Transformer Architecture

+

Transformer 架构

- Every text-generative Transformer consists of these three key components: + 每个文本生成 Transformer 都由以下三个关键组件组成:

  1. - Embedding: Text input is divided into smaller units - called tokens, which can be words or subwords. These tokens are converted into numerical - vectors called embeddings, which capture the semantic meaning of words. + 嵌入(Embedding):文本输入被划分为更小的单位, + 称为标记(token),可以是单词或子单词。这些标记被转换成数值向量,称为嵌入(Embedding),用于捕获单词的语义。
  2. - Transformer Block is the fundamental building block of - the model that processes and transforms the input data. Each block includes: + Transformer Block 是模型的基本构建块,用于处理和转换输入数据。 + 每个块包括:
    • - Attention Mechanism, the core component of the Transformer block. It - allows tokens to communicate with other tokens, capturing contextual information and - relationships between words. + 注意力机制(Attention Mechanism),Transformer 模块的核心组件。它允许 + token 与其他 token 进行通信,从而捕获上下文信息和单词之间的关系。
  • - MLP (Multilayer Perceptron) Layer, a feed-forward network that operates - on each token independently. While the goal of the attention layer is to route - information between tokens, the goal of the MLP is to refine each token's - representation. + MLP 层(多层感知器 Multilayer Perceptron), + 一个独立对每个标记进行操作的前馈网络。注意力层的目标是在标记之间路由 + 信息,而 MLP 的目标是优化每个标记的表示。
  3. - Output Probabilities: The final linear and softmax - layers transform the processed embeddings into probabilities, enabling the model to make - predictions about the next token in a sequence. + 输出概率(Output Probabilities): + 最后的线性层和 softmax 层将处理后的嵌入转换为概率,使模型能够对序列中的下一个标记做出预测。
-

Embedding

+

嵌入

- Let's say you want to generate text using a Transformer model. You add the prompt like this - one: “Data visualization empowers users to”. This input needs to be converted - into a format that the model can understand and process. That is where embedding comes in: - it transforms the text into a numerical representation that the model can work with. To - convert a prompt into embedding, we need to 1) tokenize the input, 2) obtain token - embeddings, 3) add positional information, and finally 4) add up token and position - encodings to get the final embedding. Let’s see how each of these steps is done. + 假设您想使用 Transformer 模型生成文本。您添加如下提示词(prompt):“Data visualization empowers users to”。 + 此输入需要转换为模型可以理解和处理的格式。这就是嵌入的作用所在:它将文本转换为模型可以使用的数字表示。要将提示转换为嵌入, + 我们需要 1) 对输入进行标记,2) 获取标记嵌入,3) 添加位置信息,最后 4) 将标记和位置编码相加以获得最终嵌入。 + 让我们看看每个步骤是如何完成的。

- Figure 1. Expanding the Embedding layer view, showing how the - input prompt is converted to a vector representation. The process involves - (1) Tokenization, (2) Token Embedding, (3) Positional Encoding, - and (4) Final Embedding. + 图 1:展开嵌入层视图,显示如何将输入提示转换为向量表示。 + 该过程涉及 (1) 标记化、(2) 标记嵌入、(3) 位置编码和 (4) 最终嵌入。
-

Step 1: Tokenization

+

步骤1:标记化

- Tokenization is the process of breaking down the input text into smaller, more manageable - pieces called tokens. These tokens can be a word or a subword. The words "Data" - and "visualization" correspond to unique tokens, while the word - "empowers" - is split into two tokens. The full vocabulary of tokens is decided before training the model: - GPT-2's vocabulary has 50,257 unique tokens. Now that we split our input text - into tokens with distinct IDs, we can obtain their vector representation from embeddings. + 标记化(Tokenization)是将输入文本分解为更小、更易于管理的部分(称为标记)的过程。这些标记可以是单词或子单词。 + 单词 “Data” 和 “visualization” 对应于唯一标记,而单词 “empowers” 则 + 被拆分为两个标记。完整的标记词汇表是在训练模型之前确定的:GPT-2 的词汇表有 50,257 个唯一标记。 + 现在我们已将输入文本拆分为具有不同 ID 的标记,接下来就可以从嵌入中获取它们的向量表示。
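作为补充,下面用 tiktoken 库(假设已安装,并非本项目实际使用的实现)演示 GPT-2 的 BPE 分词;具体的 token ID 以实际运行结果为准:

```python
# 极简示意:用 tiktoken(假设已安装)查看 GPT-2 的 BPE 分词结果
import tiktoken

enc = tiktoken.get_encoding("gpt2")                    # GPT-2 的 BPE 编码,词表共 50,257 个 token
ids = enc.encode("Data visualization empowers users to")
print(ids)                                             # 每个 token 对应一个唯一 ID
print([enc.decode([i]) for i in ids])                  # 每个 ID 对应的文本片段,可以看到 "empowers" 被拆成子词
print(enc.n_vocab)                                     # 50257
```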

-

Step 2. Token Embedding

+

步骤2:Token 嵌入

- GPT-2 Small represents each token in the vocabulary as a 768-dimensional vector; the - dimension of the vector depends on the model. These embedding vectors are stored in a - matrix of shape (50,257, 768), containing approximately 39 million - parameters! This extensive matrix allows the model to assign semantic meaning to each - token. + GPT-2 Small 将词汇表中的每个标记表示为一个 768 维向量;向量的维度取决于模型。这些嵌入向量存储在形状为 + (50,257, 768) 的矩阵中,包含大约 3900 万个参数!这个庞大的矩阵使模型能够为每个标记赋予语义含义。
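下面是一个极简的 NumPy 示意(wte 为假设的变量名,权重随机初始化,仅用于说明矩阵形状、参数量与查表过程):

```python
import numpy as np

vocab_size, d_model = 50257, 768
wte = np.random.randn(vocab_size, d_model).astype(np.float32)  # 词嵌入矩阵(随机数,仅作演示)
print(wte.size)                      # 50257 * 768 = 38,597,376,约 3900 万个参数

token_ids = [1, 42, 300]             # 假设的 token ID,仅作演示
token_emb = wte[token_ids]           # 按 ID 查表得到嵌入,形状为 (3, 768)
print(token_emb.shape)
```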

-

Step 3. Positional Encoding

+

步骤3:位置编码

- The Embedding layer also encodes information about each token's position in the input - prompt. Different models use various methods for positional encoding. GPT-2 trains its own - positional encoding matrix from scratch, integrating it directly into the training - process. + Embedding 层还对每个 token 在输入提示中的位置信息进行编码。不同的模型使用不同的方法进行位置编码。 + GPT-2 从头开始训练自己的位置编码矩阵,将其直接集成到训练过程中。

-

Step 4. Final Embedding

+

步骤4:最终嵌入

- Finally, we sum the token and positional encodings to get the final embedding - representation. This combined representation captures both the semantic meaning of the - tokens and their position in the input sequence. + 最后,我们将标记和位置编码相加以获得最终的嵌入表示。这种组合表示既捕获了标记的语义含义,也捕获了它们在输入序列中的位置。
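步骤 3 和步骤 4 可以用同样的方式示意(wpe 为假设的变量名;GPT-2 的学习式位置编码矩阵形状为 (1024, 768),1024 是其最大上下文长度;数值均为随机,仅作演示):

```python
import numpy as np

d_model, context_len, seq_len = 768, 1024, 6
wpe = np.random.randn(context_len, d_model)    # 学习得到的位置编码矩阵(此处随机,仅作演示)

token_emb = np.random.randn(seq_len, d_model)  # 步骤 2 得到的 token 嵌入(此处随机代替)
pos_emb = wpe[np.arange(seq_len)]              # 步骤 3:按位置 0..5 查表

final_emb = token_emb + pos_emb                # 步骤 4:逐元素相加得到最终嵌入
print(final_emb.shape)                         # (6, 768)
```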

-

Transformer Block

+

Transformer 块

- The core of the Transformer's processing lies in the Transformer block, which comprises - multi-head self-attention and a Multi-Layer Perceptron layer. Most models consist of - multiple such blocks that are stacked sequentially one after the other. The token - representations evolve through layers, from the first block to the 12th one, allowing the - model to build up an intricate understanding of each token. This layered approach leads to - higher-order representations of the input. + Transformer 处理的核心在于 Transformer 块,它由多头自注意力和多层感知器层组成。大多数模型由多个这样的块组成, + 这些块按顺序一个接一个地堆叠在一起。Token 表示通过层级演变,从第一个块到第 12 个块,使模型能够对每个 Token 建立复杂的理解。 + 这种分层方法可以实现输入的高阶表示。

-

Multi-Head Self-Attention

+

多头自注意力

- The self-attention mechanism enables the model to focus on relevant parts of the input - sequence, allowing it to capture complex relationships and dependencies within the data. - Let’s look at how this self-attention is computed step-by-step. + 自注意力机制使模型能够专注于输入序列的相关部分,从而能够捕获数据中的复杂关系和依赖关系。 + 让我们一步步看看这种自注意力是如何计算的。

-

Step 1: Query, Key, and Value Matrices

+

第一步:查询、键和值矩阵(Query, Key, and Value Matrices)

- Figure 2. Computing Query, Key, and Value matrices from - the original embedding. + 图 2:根据原始嵌入计算查询、键和值矩阵。

- Each token's embedding vector is transformed into three vectors: - Query (Q), - Key (K), and - Value (V). These vectors are derived by multiplying the - input embedding matrix with learned weight matrices for - Q, - K, and - V. Here's a web search analogy to help us build some - intuition behind these matrices: + 每个 token 的嵌入向量被转换成三个向量: + Query (Q)、 + Key (K)和 + Value (V)。这些向量是通过将输入嵌入矩阵分别与 + Q、 + K 和 + V 对应的可学习权重矩阵相乘而得出的。这里有一个网络搜索类比,可以帮助我们建立这些矩阵背后的一些直觉:

  • - Query (Q) is the search text you type in - the search engine bar. This is the token you want to - "find more information about". + Query (Q) 是您在搜索引擎栏中输入的搜索文本。 + 这是您想要“查找更多信息”的标记。
  • - Key (K) is the title of each web page in the - search result window. It represents the possible tokens the query can attend to. + Key (K) 是搜索结果窗口中每个网页的标题。 + 它表示查询可以关注的可能的标记。
  • - Value (V) is the actual content of web pages - shown. Once we matched the appropriate search term (Query) with the relevant results (Key), - we want to get the content (Value) of the most relevant pages. + Value (V)是网页显示的实际内容。 + 当我们将适当的搜索词(Query)与相关结果(Key)匹配后,我们希望获得最相关页面的内容(Value)。

- By using these QKV values, the model can calculate attention scores, which determine how - much focus each token should receive when generating predictions. + 通过使用这些 QKV 值,模型可以计算注意力分数,这决定了每个标记在生成预测时应该获得多少关注。
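下面用 NumPy 给出一个单头、随机权重的简化示意(维度与 GPT-2 small 的单个注意力头一致,即 768/12 = 64;变量名均为假设),说明 Q、K、V 是如何由嵌入与各自的权重矩阵相乘并加上偏置得到的:

```python
import numpy as np

seq_len, d_model, d_head = 6, 768, 64          # GPT-2 small:12 个头,每头 64 维
X = np.random.randn(seq_len, d_model)          # 输入嵌入(随机代替)

W_q, b_q = np.random.randn(d_model, d_head), np.zeros(d_head)   # 可学习参数(此处随机/零,仅作演示)
W_k, b_k = np.random.randn(d_model, d_head), np.zeros(d_head)
W_v, b_v = np.random.randn(d_model, d_head), np.zeros(d_head)

Q = X @ W_q + b_q      # 查询:想“搜索”什么
K = X @ W_k + b_k      # 键:可被匹配的“标题”
V = X @ W_v + b_v      # 值:真正要取回的“内容”
print(Q.shape, K.shape, V.shape)   # 均为 (6, 64)
```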

-

Step 2: Masked Self-Attention

+

第二步:掩码自注意力机制

- Masked self-attention allows the model to generate sequences by focusing on relevant - parts of the input while preventing access to future tokens. + 掩码自注意力机制(Masked Self-Attention)允许模型通过关注输入的相关部分来生成序列,同时阻止访问未来的标记。

- Figure 3. Using Query, Key, and Value matrices to - calculate masked self-attention. + 图 3:使用查询、键和值矩阵计算掩码自注意力。
  • - Attention Score: The dot product of - Query - and Key matrices determines the alignment of each query with - each key, producing a square matrix that reflects the relationship between all input tokens. + 注意力分数QueryKey + 矩阵的点积确定每个查询与每个键的对齐方式,从而产生一个反映所有输入标记之间关系的方阵。
  • - Masking: A mask is applied to the upper triangle of the attention - matrix to prevent the model from accessing future tokens, setting these values to - negative infinity. The model needs to learn how to predict the next token without - “peeking” into the future. + 掩码:对注意力矩阵的上三角应用掩码,以防止模型访问未来的标记,并将这些值设置为负无穷大。 + 模型需要学习如何在不“窥视”未来的情况下预测下一个标记。
  • - Softmax: After masking, the attention score is converted into - probability by the softmax operation which takes the exponent of each attention score. - Each row of the matrix sums up to one and indicates the relevance of every other token - to the left of it. + Softmax:经过掩码处理后,注意力得分通过 softmax 运算转换为概率,该运算取每个注意 + 力得分的指数。矩阵的每一行总和为 1,并表示其左侧每个其他标记的相关性。
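把上面三个步骤(注意力分数、掩码、softmax)串起来,可以得到如下简化的数值示意(单头、随机输入,缩放因子为 √d_head,省略了 dropout):

```python
import numpy as np

def masked_attention_weights(Q, K):
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)                     # 注意力分数:缩放点积
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # 上三角为 True
    scores = np.where(mask, -np.inf, scores)               # 掩码:未来位置设为负无穷
    scores -= scores.max(axis=-1, keepdims=True)           # 数值稳定
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)   # softmax:每行之和为 1

Q, K = np.random.randn(6, 64), np.random.randn(6, 64)
A = masked_attention_weights(Q, K)
print(np.allclose(A.sum(axis=1), 1.0))   # True:每行是一个概率分布
print(np.allclose(A[0, 1:], 0.0))        # True:第一个 token 只能关注它自己
```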
-

Step 3: Output

+

第三步:输出

- The model uses the masked self-attention scores and multiplies them with the - Value matrix to get the - final output - of the self-attention mechanism. GPT-2 has 12 self-attention heads, each capturing - different relationships between tokens. The outputs of these heads are concatenated and passed - through a linear projection. + 该模型使用掩码后的自注意力得分,并将其与 Value 矩阵相乘, + 以获得自注意力机制的 最终输出。GPT-2 有 12 个 + 自注意力头(head),每个头捕获 token 之间的不同关系。这些头的输出被拼接起来,并通过一个线性投影(linear projection)层。
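单个头的输出就是注意力权重与 V 的乘积;下面用随机矩阵示意 12 个头的输出如何拼接并经过输出投影(维度与 GPT-2 small 一致,权重随机,仅作演示):

```python
import numpy as np

seq_len, n_head, d_head, d_model = 6, 12, 64, 768
heads = []
for _ in range(n_head):
    A = np.random.rand(seq_len, seq_len)
    A /= A.sum(axis=-1, keepdims=True)        # 假设这是掩码 softmax 之后的注意力权重
    V = np.random.randn(seq_len, d_head)
    heads.append(A @ V)                       # 单头输出:(6, 64)

concat = np.concatenate(heads, axis=-1)       # 拼接 12 个头:(6, 768)
W_proj = np.random.randn(d_model, d_model)    # 输出线性投影(随机,仅作演示)
print((concat @ W_proj).shape)                # (6, 768)
```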

-

MLP: Multi-Layer Perceptron

+

多层感知器

- Figure 4. Using MLP layer to project the self-attention - representations into higher dimensions to enhance the model's representational capacity. + 图 4:使用 MLP 层将自注意力表征投影到更高维度,以增强模型的表征能力。

- After the multiple heads of self-attention capture the diverse relationships between the - input tokens, the concatenated outputs are passed through the Multilayer Perceptron - (MLP) layer to enhance the model's representational capacity. The MLP block consists of - two linear transformations with a GELU activation function in between. The first linear - transformation increases the dimensionality of the input four-fold from 768 - to 3072. The second linear transformation reduces the dimensionality back - to the original size of 768, ensuring that the subsequent layers receive - inputs of consistent dimensions. Unlike the self-attention mechanism, the MLP processes - tokens independently and simply map them from one representation to another. + 在多个自注意力头捕获输入 token 之间的不同关系后,拼接后的输出将通过多层感知器(MLP,Multi-Layer Perceptron)层, + 以增强模型的表示能力。MLP 块由两个线性变换组成,中间有一个 GELU 激活函数。 + 第一个线性变换将输入的维数从 768 扩大四倍至 3072。 + 第二个线性变换将维数降低回原始大小 768,确保后续层接收一致维度的输入。 + 与自注意力机制不同,MLP 独立处理 token 并简单地将它们从一种表示映射到另一种表示。
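一个简化的 MLP 块示意如下(GELU 采用常见的 tanh 近似;权重随机,仅作演示):

```python
import numpy as np

def gelu(x):  # GELU 的 tanh 近似
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

d_model, d_hidden = 768, 3072                 # 3072 = 4 × 768
W1, b1 = np.random.randn(d_model, d_hidden), np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model), np.zeros(d_model)

x = np.random.randn(6, d_model)               # 每个 token 独立经过同一个 MLP
h = gelu(x @ W1 + b1)                         # 升维:768 -> 3072
y = h @ W2 + b2                               # 降维:3072 -> 768
print(y.shape)                                # (6, 768)
```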

-

Output Probabilities

+

输出概率

- After the input has been processed through all Transformer blocks, the output is passed - through the final linear layer to prepare it for token prediction. This layer projects - the final representations into a 50,257 - dimensional space, where every token in the vocabulary has a corresponding value called - logit. Any token can be the next word, so this process allows us to simply - rank these tokens by their likelihood of being that next word. We then apply the softmax - function to convert the logits into a probability distribution that sums to one. This - will allow us to sample the next token based on its likelihood. + 在输入经过所有 Transformer 块处理后,输出将通过最后的线性层,为标记预测做好准备。 + 此层将最终表示投影到 50,257 维空间中,词汇表中的每个标记都有一个对应的值, + 称为 logit。任何标记都可以是下一个单词,因此此过程允许我们根据它们成为 + 下一个单词的可能性对这些标记进行简单排序。然后,我们应用 softmax 函数将 logit 转换为 + 总和为 1 的概率分布。这将使我们能够根据其可能性对下一个标记进行采样。
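下面示意最后的线性层与 softmax 如何把 768 维表示变成 50,257 个 token 上的概率分布(权重随机,仅作演示;GPT-2 的输出层实际与词嵌入矩阵共享权重):

```python
import numpy as np

d_model, vocab_size = 768, 50257
W_out = np.random.randn(d_model, vocab_size)   # 输出投影(随机,仅作演示)
h_last = np.random.randn(d_model)              # 最后一个 token 经过所有块后的表示(随机代替)

logits = h_last @ W_out                        # 每个词表 token 一个 logit,共 50,257 个
logits -= logits.max()                         # 数值稳定
probs = np.exp(logits) / np.exp(logits).sum()  # softmax:总和为 1
print(probs.shape, round(float(probs.sum()), 6))
```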

- Figure 5. Each token in the vocabulary is assigned a - probability based on the model's output logits. These probabilities determine the - likelihood of each token being the next word in the sequence. + 图 5:词汇表中的每个标记都根据模型的输出 logits + 分配一个概率,这些概率决定了每个标记成为序列中下一个单词的可能性。

- The final step is to generate the next token by sampling from this distribution The temperature - hyperparameter plays a critical role in this process. Mathematically speaking, it is a very - simple operation: model output logits are simply divided by the - temperature: + 最后一步是从该分布中采样来生成下一个标记。temperature 超参数在 + 此过程中起着关键作用。从数学上讲,这是一个非常简单的操作:模型输出 logits 只 + 需除以 temperature

  • - temperature = 1: Dividing logits by one has no effect on the softmax - outputs. + temperature = 1:将 logits 除以 1 对 softmax 输出没有影响。
  • - temperature < 1: Lower temperature makes the model more confident and - deterministic by sharpening the probability distribution, leading to more predictable - outputs. + temperature < 1:较低的温度通过锐化概率分布使模型更加自信和确定,从而产生更可预测的输出。
  • - temperature > 1: Higher temperature creates a softer probability - distribution, allowing for more randomness in the generated text – what some refer to - as model “creativity”. + temperature > 1:较高的温度会产生更柔和的概率分布,从而允许生成的文本具有更多的随机性,有些人称之为模型的“创造力”。

- Adjust the temperature and see how you can balance between deterministic and diverse - outputs! + 调节温度,看看如何在确定性和多样化输出之间取得平衡!
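温度的作用可以用几行代码直观感受(logits 为假设的数值,仅含 3 个候选 token):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for T in (0.5, 1.0, 2.0):
    print(T, softmax(logits / T).round(3))     # T<1 分布更尖锐,T>1 分布更平缓

next_token = np.random.choice(len(logits), p=softmax(logits / 1.0))  # 按概率采样下一个 token
```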

-

Advanced Architectural Features

+

高级架构功能

- There are several advanced architectural features that enhance the performance of - Transformer models. While important for the model's overall performance, they are not as - important for understanding the core concepts of the architecture. Layer Normalization, - Dropout, and Residual Connections are crucial components in Transformer models, - particularly during the training phase. Layer Normalization stabilizes training and - helps the model converge faster. Dropout prevents overfitting by randomly deactivating - neurons. Residual Connections allows gradients to flow directly through the network and - helps to prevent the vanishing gradient problem. + 有几种高级架构功能可增强 Transformer 模型的性能。虽然它们对于模型的整体性能很重要, + 但对于理解架构的核心概念却不那么重要。层归一化、Dropout 和残差连接是 Transformer + 模型中的关键组件,尤其是在训练阶段。层归一化可以稳定训练并帮助模型更快地收敛。 + Dropout 通过随机停用神经元来防止过度拟合。残差连接允许梯度直接流过网络并有助于防止梯度消失问题。

-

Layer Normalization

+

层归一化

- Layer Normalization helps to stabilize the training process and improves convergence. - It works by normalizing the inputs across the features, ensuring that the mean and - variance of the activations are consistent. This normalization helps mitigate issues - related to internal covariate shift, allowing the model to learn more effectively and - reducing the sensitivity to the initial weights. Layer Normalization is applied twice - in each Transformer block, once before the self-attention mechanism and once before - the MLP layer. + 层归一化(Layer Normalization)有助于稳定训练过程并提高收敛性。它通过在特征维度上对输入进行归一化, + 确保激活的均值和方差一致。这种归一化有助于缓解与内部协变量偏移相关的问题, + 使模型能够更有效地学习并降低对初始权重的敏感度。每个 Transformer 块中都会 + 应用两次层归一化,一次在自注意力机制之前,一次在 MLP 层之前。
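层归一化可以用几行 NumPy 示意(gamma、beta 是可学习的缩放和平移参数;仅作演示):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)      # 对每个 token 的 768 个特征求均值
    var = x.var(axis=-1, keepdims=True)        # 及方差
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(6, 768)
y = layer_norm(x, gamma=np.ones(768), beta=np.zeros(768))
print(y.mean(axis=-1).round(4), y.std(axis=-1).round(3))  # 每个 token 的均值≈0、标准差≈1
```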

-

Dropout

+

暂退法

- Dropout is a regularization technique used to prevent overfitting in neural networks - by randomly setting a fraction of model weights to zero during training. This - encourages the model to learn more robust features and reduces dependency on specific - neurons, helping the network generalize better to new, unseen data. During model - inference, dropout is deactivated. This essentially means that we are using an - ensemble of the trained subnetworks, which leads to a better model performance. + 暂退法(Dropout)是一种正则化技术,通过在训练期间随机将模型权重的一部分设置为零来防止神经网络过度拟合。 + 这鼓励模型学习更稳健的特征并减少对特定神经元的依赖,帮助网络更好地推广到新的、未见过的数据。 + 在模型推理期间,Dropout 被停用。这本质上意味着我们正在使用经过训练的子网络的集合,从而提高模型性能。
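下面是训练阶段随机丢弃(置零)部分神经元输出的一个简化示意(采用常见的 inverted dropout 写法,即按保留概率缩放;推理时直接关闭):

```python
import numpy as np

def dropout(x, p=0.1, training=True):
    if not training or p == 0.0:
        return x                               # 推理时不做任何处理
    mask = np.random.rand(*x.shape) >= p       # 以概率 p 随机丢弃
    return x * mask / (1.0 - p)                # 缩放以保持期望不变

x = np.random.randn(6, 768)
print(dropout(x, p=0.1, training=True).shape)  # (6, 768),其中约 10% 的元素被置零
```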

-

Residual Connections

- +

残差连接

- Residual connections were first introduced in the ResNet model in 2015. This - architectural innovation revolutionized deep learning by enabling the training of very - deep neural networks. Essentially, residual connections are shortcuts that bypass one - or more layers, adding the input of a layer to its output. This helps mitigate the - vanishing gradient problem, making it easier to train deep networks with multiple - Transformer blocks stacked on top of each other. In GPT-2, residual connections are - used twice within each Transformer block: once before the MLP and once after, ensuring - that gradients flow more easily, and earlier layers receive sufficient updates during - backpropagation. + 残差连接(Residual Connections)于 2015 年首次在 ResNet 模型中引入。这种架构创新通过实现非常深的神经网络的训练, + 彻底改变了深度学习。本质上,残差连接是绕过一个或多个层的捷径,将层的输入添加到其输出中。 + 这有助于缓解梯度消失问题,从而更容易训练堆叠在一起的多个 Transformer 块的深度网络。 + 在 GPT-2 中,每个 Transformer 块内使用两次残差连接:一次在 MLP 之前,一次在 MLP 之后, + 以确保梯度更容易流动,并且较早的层在反向传播期间获得足够的更新。
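残差连接本身只有一行:把子层的输入直接加回其输出(sublayer 为任意子层,例如注意力或 MLP;以下仅作示意):

```python
import numpy as np

def residual(x, sublayer):
    return x + sublayer(x)                     # 跳跃连接:输入直接加到子层输出上,便于梯度回传

x = np.random.randn(6, 768)
y = residual(x, lambda h: 0.1 * h)             # 用一个简单的函数代替注意力/MLP 子层
print(y.shape)                                 # (6, 768)
```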

-

Interactive Features

+

互动功能

- Transformer Explainer is built to be interactive and allows you to explore the inner - workings of the Transformer. Here are some of the interactive features you can play - with: + Transformer Explainer 是交互式的,可让您探索 Transformer 的内部工作原理。以下是您可以使用的一些交互式功能:

  • - Input your own text sequence to see how the model processes it and predicts - the next word. Explore attention weights, intermediate computations, and see how the final - output probabilities are calculated. + 输入您自己的文本序列,看看模型如何处理它并预测下一个单词。探索注意力权重、中间计算, + 并看看最终输出概率是如何计算的。
  • - Use the temperature slider to control the randomness of the model’s predictions. - Explore how you can make the model output more deterministic or more creative by changing - the temperature value. + 使用温度滑块控制模型预测的随机性。探索如何通过更改温度值使模型输出更具确定性或更具创造性。
  • - Interact with attention maps to see how the model focuses on different - tokens in the input sequence. Hover over tokens to highlight their attention weights and - explore how the model captures context and relationships between words. + 与注意力图交互,查看模型如何关注输入序列中的不同标记。将鼠标悬停在标记上 + 以突出显示其注意力权重,并探索模型如何捕获上下文和单词之间的关系。
-

Video Tutorial

+

视频教程

+