Commit 9000719

committed
Finished 6.3
1 parent 6774398 commit 9000719

3 files changed

+26
-59
lines changed

050_Search/20_Query_string.md

+1-1
@@ -1,6 +1,6 @@
11
## Lite search
22

3-
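The `search` API has two forms: a "lite" **query string** version that defines all parameters in the query string, and a **request body** version that uses a JSON request body and a rich search language called the structured query language.
3+
The `search` API has two forms: a "lite" **query string** version that defines all parameters in the query string, and a full version that uses a JSON **request body** and a rich search language called the structured query language (DSL).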
44

55
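Query-string search is particularly useful for running **ad hoc** queries from the command line. For example, this query finds all documents of type `tweet` that contain the word `elasticsearch` in the `tweet` field: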
66

052_Mapping_Analysis/30_Exact_vs_full_text.md

+1-1
@@ -43,7 +43,7 @@ WHERE name = "John Smith"
4343

4444
* `"fox news hunting"` should return stories about hunting on Fox News, while `"fox hunting news"` should return news stories about fox hunting.
4545

46-
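To facilitate these types of queries on full-text fields, Elasticsearch first _analyzes_ the text and then uses the results to build an _inverted index_ (反向索引). We will discuss the inverted index and the analysis process in the next two sections.
46+
To facilitate these types of queries on full-text fields, Elasticsearch first _analyzes_ the text and then uses the results to build an _inverted index_ (倒排索引). We will discuss the inverted index and the analysis process in the next two sections.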
4747

4848

4949

+24-57
@@ -1,21 +1,13 @@
1-
[[inverted-index]]
2-
=== Inverted index
1+
## Inverted index
32

4-
Elasticsearch uses a structure called an _inverted index_ which is designed
5-
to allow very fast full text searches. An inverted index consists of a list
6-
of all the unique words that appear in any document, and for each word, a list
7-
of the documents in which it appears.
3+
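Elasticsearch uses a structure called an **inverted index**, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears.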
84

9-
For example, let's say we have two documents, each with a `content` field
10-
containing:
5+
For example, say we have two documents, each with a `content` field containing:
116

12-
1. ``The quick brown fox jumped over the lazy dog''
13-
2. ``Quick brown foxes leap over lazy dogs in summer''
7+
1. The quick brown fox jumped over the lazy dog
8+
2. Quick brown foxes leap over lazy dogs in summer
149

15-
To create an inverted index, we first split the `content` field of each
16-
document into separate words (which we call _terms_ or _tokens_), create a
17-
sorted list of all the unique terms, then list in which document each term
18-
appears. The result looks something like this:
10+
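To create an inverted index, we first split the `content` field of each document into separate words (which we call **terms** or **tokens**) (Translator's note: the Chinese renderings of `terms` and `tokens` are rather stiff; it is enough to know that both names refer to the individual units produced by splitting the text), create a sorted list of all the unique terms, then list the documents in which each term appears. The result looks something like this: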
1911

2012
Term Doc_1 Doc_2
2113
-------------------------
@@ -36,8 +28,7 @@ appears. The result looks something like this:
3628
the | X |
3729
------------------------
3830

39-
Now, if we want to search for `"quick brown"` we just need to find the
40-
documents in which each term appears:
31+
Now, if we want to search for `"quick brown"`, we just need to find the documents in which each term appears:
4132

4233

4334
Term Doc_1 Doc_2
@@ -47,44 +38,26 @@ documents in which each term appears:
4738
------------------------
4839
Total | 2 | 1
4940

50-
Both documents match, but the first document has more matches than the second.
51-
If we apply a naive _similarity algorithm_ which just counts the number of
52-
matching terms, then we can say that the first document is a better match --
53-
is _more relevant_ to our query -- than the second document.
41+
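Both documents match, but the first document has more matches than the second.
42+
If we apply a naive **similarity algorithm** that just counts the number of matching terms, then we can say that the first document is a better match (is more relevant to our query) than the second.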
5443

55-
But there are a few problems with our current inverted index:
44+
But there are a few problems with our current inverted index:
5645

57-
1. `"Quick"` and `"quick"` appear as separate terms, while the user probably
58-
thinks of them as the same word.
46+
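1. `"Quick"` and `"quick"` appear as separate terms, while the user probably thinks of them as the same word.
47+
2. `"fox"` and `"foxes"` are pretty similar, as are `"dog"` and `"dogs"`: they share the same root word.
48+
3. `"jumped"` and `"leap"`, while not from the same root word, are similar in meaning: they are synonyms.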
5949

60-
2. `"fox"` and `"foxes"` are pretty similar, as are `"dog"` and `"dogs"`
61-
-- they share the same root word.
50+
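With the above index, a search for `"+Quick +fox"` would not match any documents (remember, a preceding `+` means that the word must be present). Both the term `"Quick"` and the term `"fox"` have to be in the same document to satisfy the query, but the first document contains `"quick fox"` and the second contains `"Quick foxes"`. (Translator's note: this passage is long-winded; put plainly, singular/plural forms and synonyms cannot be matched.)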
6251

63-
3. `"jumped"` and `"leap"`, while not from the same root word, are similar
64-
in meaning -- they are synonyms.
52+
Our user could reasonably expect both documents to match the query. We can do better.
6553

66-
With the above index, a search for `"+Quick +fox"` wouldn't match any
67-
documents. (Remember, a preceding `+` means that the word must be present).
68-
Both the term `"Quick"` and the term `"fox"` have to be in the same document
69-
in order to satisfy the query, but the first doc contains `"quick fox"` and
70-
the second doc contains `"Quick foxes"`.
54+
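If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For example: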
7155

72-
Our user could reasonably expect both documents to match the query. We can do
73-
better.
56+
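1. `"Quick"` can be lowercased to become `"quick"`.
57+
2. `"foxes"` can be reduced to its root form to become `"fox"`. Similarly, `"dogs"` can be reduced to `"dog"`.
58+
3. `"jumped"` and `"leap"` are synonyms and can be indexed as just the single term `"jump"`.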
7459

75-
If we normalize the terms into a standard format, then we can find documents
76-
that contain terms that are not exactly the same as the user requested, but
77-
are similar enough to still be relevant. For instance:
78-
79-
1. `"Quick"` can be lowercased to become `"quick"`.
80-
81-
2. `"foxes"` can be _stemmed_ -- reduced to its root form -- to
82-
become `"fox"`. Similarly `"dogs"` could be stemmed to `"dog"`.
83-
84-
3. `"jumped"` and `"leap"` are synonyms and can be indexed as just the
85-
single term `"jump"`.
86-
87-
Now the index looks like this:
60+
Now the index looks like this:
8861

8962
Term Doc_1 Doc_2
9063
-------------------------
@@ -100,15 +73,9 @@ Now the index looks like this:
10073
the | X | X
10174
------------------------
10275

103-
But we're not there yet. Our search for `"+Quick +fox"` would *still* fail,
104-
because we no longer have the exact term `"Quick"` in our index. However, if
105-
we apply the same normalization rules that we used on the `content` field to
106-
our query string, it would become a query for `"+quick +fox"`, which would
107-
match both documents!
76+
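But we're not there yet. Our search for `"+Quick +fox"` would *still* fail, because we no longer have the exact term `"Quick"` in our index. However, if we apply the same normalization rules that we used on the `content` field to our query string, it becomes a query for `"+quick +fox"`, which matches both documents.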
10877

109-
IMPORTANT: This is very important. You can only find terms that actually exist in your
110-
index, so: *both the indexed text and the query string must be normalized
111-
into the same form*.
78+
>### IMPORTANT
79+
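>This is very important. You can only find terms that actually exist in your index, so **both the indexed text and the query string must be normalized into the same form**.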
11280
113-
This process of tokenization and normalization is called _analysis_, which we
114-
discuss in the next section.
81+
This process of tokenization and normalization is called **analysis**, which we discuss in the next section.
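
The walkthrough in this file can be reproduced with a small sketch. This is not how Elasticsearch implements its index or its analyzers; it is a toy model under stated assumptions. The helper names (`analyze`, `build_index`, `search_all`, `count_matches`) and the hard-coded `STEMS` and `SYNONYMS` tables are hypothetical and cover only the two example documents.

```python
from collections import defaultdict

# Hypothetical toy rules standing in for a real analyzer (assumption).
STEMS = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}


def analyze(text, normalize=True):
    """Split text on whitespace and, optionally, normalize each token."""
    tokens = text.split()
    if not normalize:
        return tokens
    terms = []
    for token in tokens:
        term = token.lower()             # "Quick" -> "quick"
        term = STEMS.get(term, term)     # "foxes" -> "fox"
        term = SYNONYMS.get(term, term)  # "jumped", "leap" -> "jump"
        terms.append(term)
    return terms


def build_index(docs, normalize=True):
    """Map each unique term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text, normalize):
            index[term].add(doc_id)
    return index


def search_all(index, query, normalize=True):
    """Return ids of documents containing every query term (`+` semantics)."""
    terms = analyze(query.replace("+", ""), normalize)
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def count_matches(index, query, normalize=True):
    """Naive similarity: count how many query terms each document contains."""
    scores = defaultdict(int)
    for term in analyze(query, normalize):
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return dict(scores)


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

raw_index = build_index(docs, normalize=False)
normalized_index = build_index(docs, normalize=True)

# Raw index: "Quick" is only in doc 2 and "fox" only in doc 1, so nothing matches.
print(search_all(raw_index, "+Quick +fox", normalize=False))
# Normalized index with a normalized query: both documents match.
print(search_all(normalized_index, "+Quick +fox"))
# Normalized index with an un-normalized query: "Quick" is no longer a term.
print(search_all(normalized_index, "+Quick +fox", normalize=False))
```

The last call illustrates the point of the IMPORTANT note: normalizing only the indexed text is not enough, because the query term must take the same form as the terms stored in the index.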
