Commit 0d3e016

Author: aiddroid
Parents: 10565ce + eecb1b2 (merge commit)

File tree: 2 files changed (+43, -60 lines)

270_Fuzzy_matching/10_Intro.md (+13, -23)

[[fuzzy-matching]]
== Typos and Misspellings

We expect a query on structured data like dates and prices to return only
documents that match exactly. ((("typoes and misspellings", "fuzzy matching")))((("fuzzy matching"))) However, good
full-text search shouldn't have the same restriction. Instead, we can widen
the net to include words that _may_ match, but use the relevance score to
push the better matches to the top of the result set.

In fact, full-text search ((("full text search", "fuzzy matching"))) that only matches exactly will probably
frustrate your users. Wouldn't you expect a search for ``quick brown fox'' to
match a document containing ``fast brown foxes,'' ``Johnny Walker'' to match
``Johnnie Walker,'' or ``Arnold Shcwarzenneger'' to match
``Arnold Schwarzenegger''?

If documents exist that _do_ contain exactly what the user has queried, they
should appear at the top of the result set, but weaker matches can be
included further down the list. If no documents match exactly, at least we
can show the user potential matches; they may even be what the user
originally intended!

We have already looked at diacritic-free matching in <<token-normalization>>,
word stemming in <<stemming>>, and synonyms in <<synonyms>>, but all of those
approaches presuppose that words are spelled correctly, or that there is only
one way to spell each word.

Fuzzy matching allows for query-time matching of misspelled words, while
phonetic token filters at index time can be used for _sounds-like_ matching.

270_Fuzzy_matching/20_Fuzziness.md (+30, -37)

[[fuzziness]]
=== Fuzziness

_Fuzzy matching_ treats two words that are ``fuzzily'' similar as if they
were the same word. ((("typoes and misspellings", "fuzziness, defining")))((("fuzziness"))) First, we need to
define what we mean by _fuzziness_.

In 1965, Vladimir Levenshtein developed the
http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein distance],
which measures ((("Levenshtein distance"))) the number of single-character
edits required to transform one word into the other. He proposed three types
of one-character edits:

* _Substitution_ of one character for another: _f_ox -> _b_ox

* _Insertion_ of a new character: sic -> sic_k_

* _Deletion_ of a character: b_l_ack -> back

http://en.wikipedia.org/wiki/Frederick_J._Damerau[Frederick Damerau]
((("Damerau, Frederick J."))) later expanded these operations to include one
more:

* _Transposition_ of two adjacent characters: _st_ar -> _ts_ar

For example, converting the word `bieber` into `beaver` requires the
following steps:

1. Substitute `v` for `b`: bie_b_er -> bie_v_er
2. Substitute `a` for `i`: b_i_ever -> b_a_ever
3. Transpose `a` and `e`: b_ae_ver -> b_ea_ver

These three steps represent a
http://bit.ly/1ymgZPB[Damerau-Levenshtein edit distance] of 3.
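
The edit-distance arithmetic above is easy to check in code. Below is a
minimal Python sketch of the restricted Damerau-Levenshtein distance (the
optimal string alignment variant), built from the four single-character edits
listed above with a small dynamic-programming table. It illustrates the
metric itself; it is not the implementation Elasticsearch uses internally.

[source,python]
----
def damerau_levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, substitutions, and
    transpositions of adjacent characters needed to turn `a` into `b`."""
    # d[i][j] = distance between the first i chars of `a` and first j of `b`
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete everything: i edits
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert everything: j edits
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("bieber", "beaver"))  # 3
print(damerau_levenshtein("fox", "box"))        # 1 (one substitution)
print(damerau_levenshtein("star", "tsar"))      # 1 (one transposition)
----

Running it confirms that `bieber` -> `beaver` costs 3 edits, while
`star` -> `tsar` costs only 1 thanks to the transposition operation.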

Clearly, `bieber` is a long way from `beaver`; they are too far apart to be
considered a simple misspelling. Damerau observed that 80% of human
misspellings have an edit distance of 1. In other words, 80% of misspellings
could be corrected with a _single edit_ to the original string.

Elasticsearch supports a maximum edit distance, specified with the
`fuzziness` parameter, of 2.

Of course, the impact that a single edit has on a string depends on the
length of the string. Two edits to the word `hat` can produce `mad`, so
allowing two edits on a string of length 3 is overkill. The `fuzziness`
parameter can be set to `AUTO`, which results in the following maximum edit
distances (mirrored by the short sketch after this list):

* `0` for strings of one or two characters
* `1` for strings of three, four, or five characters
* `2` for strings of more than five characters
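
As a quick paraphrase of the `AUTO` rules, the mapping from term length to
the allowed number of edits can be written as a tiny helper. This only
mirrors the list above; it is not Elasticsearch code.

[source,python]
----
def auto_max_edits(term: str) -> int:
    """Edit distance allowed for `term` under `fuzziness: AUTO`,
    following the three length rules listed above."""
    n = len(term)
    if n <= 2:
        return 0   # one or two characters: exact match only
    if n <= 5:
        return 1   # three, four, or five characters: one edit
    return 2       # more than five characters: up to two edits

for word in ["at", "fox", "quick", "misspelling"]:
    print(word, "->", auto_max_edits(word))   # 0, 1, 1, 2
----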

Of course, you may find that an edit distance of `2` is still overkill, and
returns results that don't appear to be related. You may get better results,
and better performance, with a maximum `fuzziness` of `1`.
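
To show where the `fuzziness` parameter fits in practice, here is a minimal
sketch of a `match` query that tolerates misspellings, sent to the `_search`
endpoint with the Python `requests` library. The index name `my_index`, the
field name `text`, and the use of `requests` against a local node are
assumptions for illustration; only the shape of the query body follows the
standard query DSL.

[source,python]
----
import requests

# Hypothetical index and field names; adjust them to your own mapping.
query = {
    "query": {
        "match": {
            "text": {
                "query": "quick brown fox",
                "fuzziness": "AUTO"   # or a fixed edit distance such as 1 or 2
            }
        }
    }
}

# Assumes an Elasticsearch node listening on localhost:9200.
response = requests.post("http://localhost:9200/my_index/_search", json=query)
for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
----

Setting `"fuzziness": 1` instead of `"AUTO"` applies the stricter limit
recommended in the paragraph above.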
