## Inverted index

Elasticsearch uses a structure called an **inverted index**, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears.

For example, let's say we have two documents, each with a `content` field containing:

1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first split the `content` field of each document into separate words (which we call **terms** or **tokens**), create a sorted list of all the unique terms, and then list in which documents each term appears. The result looks something like this:

```
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------
```
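The two steps above (tokenize each document, then record which documents contain each unique term) can be sketched in a few lines of Python. This is only an illustration with naive whitespace tokenization, not how Elasticsearch actually builds or stores its index:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Map each term to the set of documents it appears in.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():  # naive whitespace tokenization
        inverted_index[term].add(doc_id)

# A sorted list of the unique terms, each with its documents.
for term in sorted(inverted_index):
    print(term, sorted(inverted_index[term]))
```

Note that, just as in the table, `"Quick"` and `"quick"` end up as distinct terms because nothing has normalized them yet.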

Now, if we want to search for `"quick brown"`, we just need to find the documents in which each term appears:

```
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1
```

Both documents match, but the first document has more matches than the second. If we apply a naive **similarity algorithm** that just counts the number of matching terms, then we can say that the first document is a better match (it is more relevant to our query) than the second document.
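That match-counting idea can be sketched as follows. This is a deliberately naive scorer, nothing like Lucene's real relevance model, and the document set is rebuilt inline so the snippet is self-contained:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Build the term -> documents mapping described above.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def naive_score(query_terms):
    """Score each document by counting how many query terms it contains."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return dict(scores)

print(naive_score(["quick", "brown"]))  # doc 1 scores 2, doc 2 scores 1
```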

But there are a few problems with our current inverted index:

1. `"Quick"` and `"quick"` appear as separate terms, while the user probably thinks of them as the same word.

2. `"fox"` and `"foxes"` are pretty similar, as are `"dog"` and `"dogs"`: they share the same root word.

3. `"jumped"` and `"leap"`, while not from the same root word, are similar in meaning: they are synonyms.

With the above index, a search for `"+Quick +fox"` wouldn't match any documents. (Remember, a preceding `+` means that the word must be present.) Both the term `"Quick"` and the term `"fox"` have to be in the same document in order to satisfy the query, but the first doc contains `"quick fox"` and the second doc contains `"Quick foxes"`.

Our user could reasonably expect both documents to match the query. We can do better.

If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:

1. `"Quick"` can be lowercased to become `"quick"`.

2. `"foxes"` can be **stemmed** (reduced to its root form) to become `"fox"`. Similarly, `"dogs"` could be stemmed to `"dog"`.

3. `"jumped"` and `"leap"` are synonyms and can be indexed as just the single term `"jump"`.

Now the index looks like this:

```
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------
```

But we're not there yet. Our search for `"+Quick +fox"` would *still* fail, because we no longer have the exact term `"Quick"` in our index. However, if we apply the same normalization rules that we used on the `content` field to our query string, it would become a query for `"+quick +fox"`, which would match both documents!
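Applying identical normalization at index time and at query time is the crux. A minimal Python sketch, with a hand-rolled stemming/synonym table standing in for real token filters, might look like:

```python
# A toy stand-in for analysis: lowercasing, whitespace tokenization, and a
# tiny hand-written stemming/synonym table. Real analyzers are far more
# sophisticated; this only illustrates the principle.
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}

def analyze(text):
    """Tokenize and normalize text into index terms."""
    return [STEMS.get(token, token) for token in text.lower().split()]

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Index with the SAME rules we will later apply to the query string.
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

# "+Quick +fox": every term is required, so intersect the postings.
required = [t.lstrip("+") for t in "+Quick +fox".split()]
matching = set(docs)
for term in analyze(" ".join(required)):
    matching &= index.get(term, set())

print(sorted(matching))  # both documents now match
```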

> ### IMPORTANT
> This is very important. You can only find terms that actually exist in your index, so **both the indexed text and the query string must be normalized into the same form**.

This process of tokenization and normalization is called **analysis**, which we discuss in the next section.