Skip to content

[Feature Request] Improve the performance of numeric range queries #18334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
kkewwei opened this issue May 19, 2025 · 7 comments
Open

[Feature Request] Improve the performance of numeric range queries #18334

kkewwei opened this issue May 19, 2025 · 7 comments
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance

Comments

@kkewwei
Copy link
Collaborator

kkewwei commented May 19, 2025

Is your feature request related to a problem? Please describe

In #13788, we introduce ApproximatePointRangeQuery for long and date type. However, several potential optimizations remain to be explored:

  1. ApproximatePointRangeQuery is only used in long type, but doesn't used in integer, double, float type, we can extend the ApproximatePointRangeQuery to those numeric types.
  2. ApproximatePointRangeQuery can't used in the follow dsl, which just add a bool on the outer compared with the standard dsl.
GET logs1/_search
{
    "query": {
        "bool": {
            "must": [
               {
                   "range": {
                      "size": {
                         "from": 1
                      }
                   }
               }
            ]
        }
    }
}
  1. Consider extending ApproximatePointRangeQuery to term queries. For low-cardinality field, this could yield better optimization results.
  2. For the ApproximatePointRangeQuery, we will visit at least trackTotalHitsUpTo(defalut 10k) docs in BKD tree, it may be more efficient to limit the visit to the size (default: 10) to reduce unnecessary overhead for the non-scoring case.

Describe the solution you'd like

no

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

@kkewwei kkewwei added enhancement Enhancement or improvement to existing feature or request untriaged labels May 19, 2025
@prudhvigodithi
Copy link
Member

Hey @kkewwei here is the similar issue to extend Approximation to other numeric types #14406 (comment) which I'm looking into.

Also I agree it would be more efficient to limit the visit to the size default to 10.

I'm about to open an issue with boolean approximation, any boolean query with filter (and must not) clause produces ConstantScore, example the following query

curl -X GET "localhost:9200/big5/_search" -H "Content-Type: application/json" -d'
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "gte": "2023-01-01T00:00:00",
              "lt": "2023-01-03T00:00:00"
            }
          }
        },
        {
          "match": {
            "process.name": "kernel"
          }
        }
      ]
    }
  }
}' | jq '.'

Can be rewritten to and can be approximated

curl -X GET "localhost:9200/big5/_search" -H "Content-Type: application/json" -d'
{
  "explain": true,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "2023-01-01T00:00:00",
              "lt": "2023-01-03T00:00:00"
            }
          }
        },
        {
          "term": {
            "process.name": "kernel"
          }
        }
      ]
    }
  }
}'

Thanks
@getsaurabh02 @harshavamsi

@kkewwei
Copy link
Collaborator Author

kkewwei commented May 20, 2025

@prudhvigodithi For the third, I'm not certain I grasp your point.

For multi-condition queries within a filter clause, approximating document IDs (docIDs) for range query seems challenging. For above example, suppose a range query matches docIDs in [0–20,000], and a term query matches docIDs in [19,000–30,000]. The intersection of these docIDs would be [19,000–20,000]. However, if we approximate the range query’s docIDs as [0–10,000], the intersection would incorrectly appear empty.

I’ve also pondered this issue. Perhaps instead of traversing all docIDs in the BKD tree upfront, we could represent them using an iterator and traverse them in real time as needed.

@prudhvigodithi
Copy link
Member

prudhvigodithi commented May 20, 2025

Thats true @kkewwei we should come up with few ideas on how we can do a safe approximation for bool queries. We can brainstorm some ideas in this same issue and later can move that to a separate issue for bool queries approximation.

@msfroh
Copy link
Contributor

msfroh commented May 20, 2025

For the ApproximatePointRangeQuery, we will visit at least trackTotalHitsUpTo(defalut 10k) docs in BKD tree, it may be more efficient to limit the visit to the size (default: 10) to reduce unnecessary overhead for the non-scoring case.

When @harshavamsi and I discussed the idea of ApproximatePointRangeQuery, we opted to collect up to trackTotalHitsUpTo to keep the same behavior regarding hit counts. By default, you'll get the exact number of matches if the number is < 10k.

I'm a bit reluctant to change that behavior, since it impacts the user experience.

@bowenlan-amzn
Copy link
Member

For the ApproximatePointRangeQuery, we will visit at least trackTotalHitsUpTo(defalut 10k) docs in BKD tree, it may be more efficient to limit the visit to the size (default: 10) to reduce unnecessary overhead for the non-scoring case.

When @harshavamsi and I discussed the idea of ApproximatePointRangeQuery, we opted to collect up to trackTotalHitsUpTo to keep the same behavior regarding hit counts. By default, you'll get the exact number of matches if the number is < 10k.

I'm a bit reluctant to change that behavior, since it impacts the user experience.

Add on this, we discovered there's shortcutTotalHitCount logic for MatchAllQuery in a bug fix PR, so we can just vist the size and stop super early. But yeah, for bool query, we have to return 10k because no shortcut way to tell total hits.

static int shortcutTotalHitCount(IndexReader reader, Query query) throws IOException {
while (true) {
// remove wrappers that don't matter for counts
// this is necessary so that we don't only optimize match_all
// queries but also match_all queries that are nested in
// a constant_score query
if (query instanceof ConstantScoreQuery) {
query = ((ConstantScoreQuery) query).getQuery();
} else if (query instanceof BoostQuery) {
query = ((BoostQuery) query).getQuery();
} else if (query instanceof ApproximateScoreQuery) {
query = ((ApproximateScoreQuery) query).getOriginalQuery();
} else {
break;
}
}
if (query.getClass() == MatchAllDocsQuery.class) {
return reader.numDocs();
} else if (query.getClass() == TermQuery.class && reader.hasDeletions() == false) {
final Term term = ((TermQuery) query).getTerm();
int count = 0;
for (LeafReaderContext context : reader.leaves()) {
count += context.reader().docFreq(term);
}
return count;
} else if (query.getClass() == FieldExistsQuery.class && reader.hasDeletions() == false) {
final String field = ((FieldExistsQuery) query).getField();
int count = 0;
for (LeafReaderContext context : reader.leaves()) {
FieldInfos fieldInfos = context.reader().getFieldInfos();
FieldInfo fieldInfo = fieldInfos.fieldInfo(field);
if (fieldInfo != null) {
if (fieldInfo.getPointIndexDimensionCount() > 0) {
PointValues points = context.reader().getPointValues(field);
if (points != null) {
count += points.getDocCount();
}
} else if (fieldInfo.getIndexOptions() != IndexOptions.NONE) {
Terms terms = context.reader().terms(field);
if (terms != null) {
count += terms.getDocCount();
}
} else {
return -1; // no shortcut possible for fields that are not indexed
}
}
}
return count;
} else {
return -1;
}
}

@kkewwei I am not sure the idea for this point, would you explain a bit more?

Consider extending ApproximatePointRangeQuery to term queries. For low-cardinality field, this could yield better optimization results.
I'm thinking the benefit we get from ApproximatePointRangeQuery is the time saved when building iterator from BKD. But for term, I suppose there's not much cost to build iterator but just use the posting list for that term.

@kkewwei
Copy link
Collaborator Author

kkewwei commented Jun 5, 2025

But for term, I suppose there's not much cost to build iterator but just use the posting list for that term.

@bowenlan-amzn Yes. When the number of matched terms exceeds 10,000, the ApproximatePointRangeQuery becomes effective, which is particularly beneficial for low-cardinality field. Conversely, if the matched count is below 10,000, there will be no performance degradation.

Add on this, we discovered there's shortcutTotalHitCount logic for MatchAllQuery in a bug fix PR, so we can just vist the size and stop super early

Yes, I find another bug in MatchAllQuery, it can return accurate totalhits, but not, relation=gte.
Image

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Jun 5, 2025

@kkewwei FYI here is PR where the bug was fixed to include the shortcut https://github.com/opensearch-project/OpenSearch/pull/18189/files#diff-53173a30404a65ce7f35073d65258143cfa2d47947ff8e0817337ff3d37e01f3, as part of this the +1 is removed which shows the relation gte (without +1 it shows as eq) . But I have to add it back again in this PR #18235 to fix the flaky test #16851 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance
Projects
Status: 🆕 New
Development

No branches or pull requests

5 participants