Commit d540abe

committed
index 170
1 parent c707e1c commit d540abe

33 files changed

+3650
-0
lines changed

120_Proximity_Matching/00_Intro.md

+38
@@ -0,0 +1,38 @@
1+
[[proximity-matching]]
2+
== Proximity Matching
3+
4+
Standard full-text search with TF/IDF treats documents, or at least each field
5+
within a document, as a big _bag of words_.((("proximity matching"))) The `match` query can tell us whether
6+
that bag contains our search terms, but that is only part of the story.
7+
It can't tell us anything about the relationship between words.
8+
9+
Consider the difference between these sentences:
10+
11+
* Sue ate the alligator.
12+
* The alligator ate Sue.
13+
* Sue never goes anywhere without her alligator-skin purse.
14+
15+
A `match` query for `sue alligator` would match all three documents, but it
16+
doesn't tell us whether the two words form part of the same idea, or even the same
17+
paragraph.
18+
19+
Understanding how words relate to each other is a complicated problem, and
20+
we can't solve it by just using another type of query,
21+
but we can at least find words that appear to be related because they appear
22+
near each other or even right next to each other.
23+
24+
Each document may be much longer than the examples we have presented: `Sue`
25+
and `alligator` may be separated by paragraphs of other text. Perhaps we still
26+
want to return these documents in which the words are widely separated, but we
27+
want to give documents in which the words are close together a higher relevance
28+
score.
29+
30+
This is the province of _phrase matching_, or _proximity matching_.
31+
32+
[TIP]
33+
==================================================
34+
35+
In this chapter, we are using the same example documents that we used for
36+
the <<match-test-data,`match` query>>.
37+
38+
==================================================
@@ -0,0 +1,120 @@
1+
[[phrase-matching]]
2+
=== Phrase Matching
3+
4+
In the same way that the `match` query is the go-to query for standard
5+
full-text search, the `match_phrase` query((("proximity matching", "phrase matching")))((("phrase matching")))((("match_phrase query"))) is the one you should reach for
6+
when you want to find words that are near each other:
7+
8+
[source,js]
9+
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
18+
--------------------------------------------------
19+
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json
20+
21+
Like the `match` query, the `match_phrase` query first analyzes the query
22+
string to produce a list of terms. It then searches for all the terms, but
23+
keeps only documents that contain _all_ of the search terms, in the same
24+
_positions_ relative to each other. A query for the phrase `quick fox`
25+
would not match any of our documents, because no document contains the word
26+
`quick` immediately followed by `fox`.
27+
28+
[TIP]
29+
==================================================
30+
31+
The `match_phrase` query can also be written as a `match` query with type
32+
`phrase`:
33+
34+
[source,js]
35+
--------------------------------------------------
"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}
42+
--------------------------------------------------
43+
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json
44+
45+
==================================================
46+
47+
==== Term Positions
48+
49+
When a string is analyzed, the analyzer returns not((("phrase matching", "term positions")))((("match_phrase query", "position of terms")))((("position-aware matching"))) only a list of terms, but
50+
also the _position_, or order, of each term in the original string:
51+
52+
[source,js]
53+
--------------------------------------------------
GET /_analyze?analyzer=standard
Quick brown fox
56+
--------------------------------------------------
57+
// SENSE: 120_Proximity_Matching/05_Term_positions.json
58+
59+
This returns the following:
60+
61+
[role="pagebreak-before"]
62+
[source,js]
63+
--------------------------------------------------
{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 <1>
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 <1>
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 <1>
      }
   ]
}
89+
--------------------------------------------------
90+
<1> The `position` of each term in the original string.
91+
92+
Positions can be stored in the inverted index, and position-aware queries like
93+
the `match_phrase` query can use them to match only documents that contain
94+
all the words in exactly the order specified, with no words in between.
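
To make the idea of term positions concrete, here is a minimal Python sketch of a toy analyzer. It is only a stand-in for the `standard` analyzer: the `analyze` function and its regular expression are assumptions made for illustration, and real analysis handles far more cases. It lowercases each word and records its character offsets and a 1-based position, mirroring the fields in the response above.

```python
import re

def analyze(text):
    """Toy stand-in for an analyzer (hypothetical helper, not the real one).

    Lowercases each alphanumeric run and records its character offsets
    and its 1-based position in the original string.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text), start=1):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

tokens = analyze("Quick brown fox")
for token in tokens:
    print(token)
```

The positions recorded here are what a position-aware query compares; the offsets only describe where each token sat in the original text.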
95+
96+
==== What Is a Phrase
97+
98+
For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase ``quick brown fox,'' the following must be true:
99+
100+
* `quick`, `brown`, and `fox` must all appear in the field.
101+
102+
* The position of `brown` must be `1` greater than the position of `quick`.
103+
104+
* The position of `fox` must be `2` greater than the position of `quick`.
105+
106+
If any of these conditions is not met, the document is not considered a match.
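
The conditions above can be sketched in a few lines of Python. This is an illustrative model of exact phrase matching over an already analyzed list of terms, not the code that Elasticsearch or Lucene runs; `index_positions` and `phrase_matches` are hypothetical names.

```python
from collections import defaultdict

def index_positions(terms):
    """Map each term to the list of 1-based positions where it occurs."""
    positions = defaultdict(list)
    for pos, term in enumerate(terms, start=1):
        positions[term].append(pos)
    return positions

def phrase_matches(doc_terms, query_terms):
    """True if every query term is present at consecutive positions."""
    positions = index_positions(doc_terms)
    if not query_terms or any(term not in positions for term in query_terms):
        return False
    # Try each occurrence of the first term as the start of the phrase:
    # term i of the query must sit exactly i positions after it.
    for start in positions[query_terms[0]]:
        if all(start + i in positions[term] for i, term in enumerate(query_terms)):
            return True
    return False

doc = ["quick", "brown", "fox"]
print(phrase_matches(doc, ["quick", "brown", "fox"]))  # True
print(phrase_matches(doc, ["quick", "fox"]))           # False
```

The second call fails because `fox` is two positions after `quick` in the document, while the query requires it to be one position after.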
107+
108+
[TIP]
109+
==================================================
110+
111+
Internally, the `match_phrase` query uses the low-level `span` query family to
112+
do position-aware matching. ((("match_phrase query", "use of span queries for position-aware matching")))((("span queries")))Span queries are term-level queries, so they have
113+
no analysis phase; they search for the exact term specified.
114+
115+
Thankfully, most people never need to use the `span` queries directly, as the
116+
`match_phrase` query is usually good enough. However, certain specialized
117+
fields, like patent searches, use these low-level queries to perform very
118+
specific, carefully constructed positional searches.
119+
120+
==================================================

120_Proximity_Matching/10_Slop.md

+61
@@ -0,0 +1,61 @@
1+
[[slop]]
2+
=== Mixing It Up
3+
4+
Requiring exact-phrase matches ((("proximity matching", "slop parameter")))may be too strict a constraint. Perhaps we _do_
5+
want documents that contain ``quick brown fox'' to be considered a match for
6+
the query ``quick fox,'' even though the positions aren't exactly equivalent.
7+
8+
We can introduce a degree ((("slop parameter")))of flexibility into phrase matching by using the
9+
`slop` parameter:
10+
11+
[source,js]
12+
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}
24+
--------------------------------------------------
25+
// SENSE: 120_Proximity_Matching/10_Slop.json
26+
27+
The `slop` parameter tells the `match_phrase` query how((("match_phrase query", "slop parameter"))) far apart terms are
28+
allowed to be while still considering the document a match. By _how far
29+
apart_ we mean _how many times do you need to move a term in order to make
30+
the query and document match_?
31+
32+
We'll start with a simple example. To make the query `quick fox` match
33+
a document containing `quick brown fox` we need a `slop` of just `1`:
34+
35+
                Pos 1         Pos 2         Pos 3
    -----------------------------------------------
    Doc:        quick         brown         fox
    -----------------------------------------------
    Query:      quick         fox
    Slop 1:     quick                 ↳     fox
42+
43+
Although all words need to be present in phrase matching, even when using `slop`,
44+
the words don't necessarily need to be in the same sequence in order to
45+
match. With a high enough `slop` value, words can be arranged in any order.
46+
47+
To make the query `fox quick` match our document, we need a `slop` of `3`:
48+
                Pos 1         Pos 2         Pos 3
    -----------------------------------------------
    Doc:        quick         brown         fox
    -----------------------------------------------
    Query:      fox           quick
    Slop 1:     fox|quick  ↵  <1>
    Slop 2:     quick      ↳  fox
    Slop 3:     quick                 ↳     fox
57+
58+
<1> Note that `fox` and `quick` occupy the same position in this step.
59+
Switching word order from `fox quick` to `quick fox` thus requires two
60+
steps, or a `slop` of `2`.
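
The move counting in these examples can be reproduced with a small Python sketch. It uses a simplified model that assumes each query term occurs exactly once in the document: a term's shift is its document position minus its index in the query, and the number of moves is the spread between the largest and smallest shift. The function name `required_slop` is hypothetical, and this is an illustration that agrees with the examples above rather than a description of the actual implementation.

```python
def required_slop(doc_positions, query_terms):
    """Smallest slop at which query_terms would match, in a simplified model.

    doc_positions maps each term to its single position in the document.
    """
    # How far each term sits from where the query alone would place it.
    shifts = [doc_positions[term] - i for i, term in enumerate(query_terms)]
    # Terms that share a shift are already aligned; the spread is the
    # number of single-position moves needed to align the rest.
    return max(shifts) - min(shifts)

doc = {"quick": 1, "brown": 2, "fox": 3}
print(required_slop(doc, ["quick", "brown", "fox"]))  # 0
print(required_slop(doc, ["quick", "fox"]))           # 1
print(required_slop(doc, ["fox", "quick"]))           # 3
```

Under the same model, matching the query `fox quick` against a document that contains just `quick fox` gives a spread of 2, which is the two-step swap described in the callout.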
61+
@@ -0,0 +1,83 @@
1+
=== Multivalue Fields
2+
3+
A curious thing can happen when you try to use phrase matching on multivalue
4+
fields. ((("proximity matching", "on multivalue fields")))((("match_phrase query", "on multivalue fields"))) Imagine that you index this document:
5+
6+
[source,js]
7+
--------------------------------------------------
PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}
12+
--------------------------------------------------
13+
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json
14+
15+
Then run a phrase query for `Abraham Lincoln`:
16+
17+
[source,js]
18+
--------------------------------------------------
GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
27+
--------------------------------------------------
28+
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json
29+
30+
Surprisingly, our document matches, even though `Abraham` and `Lincoln`
31+
belong to two different people in the `names` array. The reason for this comes
32+
down to the way arrays are indexed in Elasticsearch.
33+
34+
When `John Abraham` is analyzed, it produces this:
35+
36+
* Position 1: `john`
37+
* Position 2: `abraham`
38+
39+
Then when `Lincoln Smith` is analyzed, it produces this:
40+
41+
* Position 3: `lincoln`
42+
* Position 4: `smith`
43+
44+
In other words, Elasticsearch produces exactly the same list of tokens as it would have
45+
for the single string `John Abraham Lincoln Smith`. Our example query
46+
looks for `abraham` directly followed by `lincoln`, and these two terms do
47+
indeed exist, and they are right next to each other, so the query matches.
48+
49+
Fortunately, there is a simple workaround for cases like these, called the
50+
`position_offset_gap`, which((("mapping (types)", "position_offset_gap")))((("position_offset_gap"))) we need to configure in the field mapping:
51+
52+
[source,js]
53+
--------------------------------------------------
DELETE /my_index/groups/ <1>

PUT /my_index/_mapping/groups <2>
{
    "properties": {
        "names": {
            "type":                "string",
            "position_offset_gap": 100
        }
    }
}
65+
--------------------------------------------------
66+
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json
67+
68+
<1> First delete the `groups` mapping and all documents of that type.
69+
<2> Then create a new `groups` mapping with the correct values.
70+
71+
The `position_offset_gap` setting tells Elasticsearch that it should increase
72+
the current term `position` by the specified value for every new array
73+
element. So now, when we index the array of names, the terms are emitted with
74+
the following positions:
75+
76+
* Position 1: `john`
77+
* Position 2: `abraham`
78+
* Position 103: `lincoln`
79+
* Position 104: `smith`
80+
81+
Our phrase query would no longer match a document like this because `abraham`
82+
and `lincoln` are now separated by a gap of 100 positions. You would have to add a `slop`
83+
value of 100 in order for this document to match.
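
As a rough illustration of the position arithmetic, the following Python sketch assigns positions to the terms of a multivalue field with and without a gap. The `array_positions` helper is hypothetical and splits on whitespace only; it is meant to reproduce the positions listed above, not the indexing code itself.

```python
def array_positions(values, position_offset_gap=0):
    """Assign 1-based positions to the terms of a multivalue field.

    Before each array element after the first, the running position is
    bumped by position_offset_gap.
    """
    positions = []
    last = 0
    for i, value in enumerate(values):
        if i > 0:
            last += position_offset_gap
        for term in value.lower().split():
            last += 1
            positions.append((term, last))
    return positions

names = ["John Abraham", "Lincoln Smith"]
without_gap = dict(array_positions(names))
with_gap = dict(array_positions(names, position_offset_gap=100))

print(without_gap)  # {'john': 1, 'abraham': 2, 'lincoln': 3, 'smith': 4}
print(with_gap)     # {'john': 1, 'abraham': 2, 'lincoln': 103, 'smith': 104}

# Number of positions that now lie between `abraham` and `lincoln`.
print(with_gap["lincoln"] - with_gap["abraham"] - 1)  # 100
```

Without the gap, `abraham` and `lincoln` sit at adjacent positions, which is why the phrase query matched across the two array elements.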

120_Proximity_Matching/20_Scoring.md

+54
@@ -0,0 +1,54 @@
1+
=== Closer Is Better
2+
3+
Whereas a phrase query simply excludes documents that don't contain the exact
4+
query phrase, a _proximity query_ (a ((("proximity matching", "proximity queries")))((("slop parameter", "proximity queries and")))phrase query where `slop` is greater
5+
than `0`) incorporates the proximity of the query terms into the final
6+
relevance `_score`. By setting a high `slop` value like `50` or `100`, you can
7+
exclude documents in which the words are really too far apart, but give a higher
8+
score to documents in which the words are closer together.
9+
10+
The following proximity query for `quick dog` matches both documents that
11+
contain the words `quick` and `dog`, but gives a higher score to the
12+
document((("relevance scores", "for proximity queries"))) in which the words are nearer to each other:
13+
14+
[source,js]
15+
--------------------------------------------------
POST /my_index/my_type/_search
{
   "query": {
      "match_phrase": {
         "title": {
            "query": "quick dog",
            "slop":  50 <1>
         }
      }
   }
}
27+
--------------------------------------------------
28+
// SENSE: 120_Proximity_Matching/20_Scoring.json
29+
30+
<1> Note the high `slop` value.
31+
32+
[source,js]
33+
--------------------------------------------------
{
  "hits": [
     {
        "_id":      "3",
        "_score":   0.75, <1>
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":      "2",
        "_score":   0.28347334, <2>
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}
52+
--------------------------------------------------
53+
<1> Higher score because `quick` and `dog` are close together
54+
<2> Lower score because `quick` and `dog` are further apart
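
To see why closer terms rank higher, here is a hedged Python sketch of a proximity weight for a two-term phrase. It assumes a weighting of `1 / (distance + 1)`, modelled on the sloppy-frequency formula in Lucene's classic TF/IDF similarity, and it considers only the closest pairing of the two terms. The helper names are hypothetical, and the weights it prints are not the `_score` values shown above, which also depend on other scoring factors.

```python
def term_positions(text):
    """Map each lowercased, whitespace-separated term to its 1-based positions."""
    positions = {}
    for pos, term in enumerate(text.lower().split(), start=1):
        positions.setdefault(term, []).append(pos)
    return positions

def best_distance(text, first, second):
    """Fewest moves needed for the phrase `first second` to match the text."""
    positions = term_positions(text)
    return min(abs(b - a - 1) for a in positions[first] for b in positions[second])

def proximity_weight(text, first, second, slop):
    """Assumed weighting: 1 / (distance + 1), or 0 when beyond the slop."""
    distance = best_distance(text, first, second)
    return 1.0 / (distance + 1) if distance <= slop else 0.0

doc_3 = "The quick brown fox jumps over the quick dog"
doc_2 = "The quick brown fox jumps over the lazy dog"

print(best_distance(doc_3, "quick", "dog"))  # 0
print(best_distance(doc_2, "quick", "dog"))  # 6
print(proximity_weight(doc_3, "quick", "dog", slop=50))
print(proximity_weight(doc_2, "quick", "dog", slop=50))
```

With a `slop` of `50` both documents stay within range, but the document in which `quick` is immediately followed by `dog` receives the larger weight.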
