[Feature Request] Improve the performance of numeric range queries #18334

kkewwei · 2025-05-19T12:58:08Z

Is your feature request related to a problem? Please describe

In #13788, we introduce ApproximatePointRangeQuery for long and date type. However, several potential optimizations remain to be explored:

ApproximatePointRangeQuery is only used in long type, but doesn't used in integer, double, float type, we can extend the ApproximatePointRangeQuery to those numeric types.
ApproximatePointRangeQuery can't used in the follow dsl, which just add a bool on the outer compared with the standard dsl.

GET logs1/_search
{
    "query": {
        "bool": {
            "must": [
               {
                   "range": {
                      "size": {
                         "from": 1
                      }
                   }
               }
            ]
        }
    }
}

Consider extending ApproximatePointRangeQuery to term queries. For low-cardinality field, this could yield better optimization results.
For the ApproximatePointRangeQuery, we will visit at least trackTotalHitsUpTo(defalut 10k) docs in BKD tree, it may be more efficient to limit the visit to the size (default: 10) to reduce unnecessary overhead for the non-scoring case.

Describe the solution you'd like

no

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

prudhvigodithi · 2025-05-19T17:19:22Z

Hey @kkewwei here is the similar issue to extend Approximation to other numeric types #14406 (comment) which I'm looking into.

Also I agree it would be more efficient to limit the visit to the size default to 10.

I'm about to open an issue with boolean approximation, any boolean query with filter (and must not) clause produces ConstantScore, example the following query

curl -X GET "localhost:9200/big5/_search" -H "Content-Type: application/json" -d'
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "gte": "2023-01-01T00:00:00",
              "lt": "2023-01-03T00:00:00"
            }
          }
        },
        {
          "match": {
            "process.name": "kernel"
          }
        }
      ]
    }
  }
}' | jq '.'

Can be rewritten to and can be approximated

curl -X GET "localhost:9200/big5/_search" -H "Content-Type: application/json" -d'
{
  "explain": true,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "2023-01-01T00:00:00",
              "lt": "2023-01-03T00:00:00"
            }
          }
        },
        {
          "term": {
            "process.name": "kernel"
          }
        }
      ]
    }
  }
}'

Thanks
@getsaurabh02 @harshavamsi

kkewwei · 2025-05-20T01:08:48Z

@prudhvigodithi For the third, I'm not certain I grasp your point.

For multi-condition queries within a filter clause, approximating document IDs (docIDs) for range query seems challenging. For above example, suppose a range query matches docIDs in [0–20,000], and a term query matches docIDs in [19,000–30,000]. The intersection of these docIDs would be [19,000–20,000]. However, if we approximate the range query’s docIDs as [0–10,000], the intersection would incorrectly appear empty.

I’ve also pondered this issue. Perhaps instead of traversing all docIDs in the BKD tree upfront, we could represent them using an iterator and traverse them in real time as needed.

prudhvigodithi · 2025-05-20T16:06:50Z

Thats true @kkewwei we should come up with few ideas on how we can do a safe approximation for bool queries. We can brainstorm some ideas in this same issue and later can move that to a separate issue for bool queries approximation.

msfroh · 2025-05-20T16:50:18Z

For the ApproximatePointRangeQuery, we will visit at least trackTotalHitsUpTo(defalut 10k) docs in BKD tree, it may be more efficient to limit the visit to the size (default: 10) to reduce unnecessary overhead for the non-scoring case.

When @harshavamsi and I discussed the idea of ApproximatePointRangeQuery, we opted to collect up to trackTotalHitsUpTo to keep the same behavior regarding hit counts. By default, you'll get the exact number of matches if the number is < 10k.

I'm a bit reluctant to change that behavior, since it impacts the user experience.

bowenlan-amzn · 2025-06-04T19:03:48Z

For the ApproximatePointRangeQuery, we will visit at least trackTotalHitsUpTo(defalut 10k) docs in BKD tree, it may be more efficient to limit the visit to the size (default: 10) to reduce unnecessary overhead for the non-scoring case.

When @harshavamsi and I discussed the idea of ApproximatePointRangeQuery, we opted to collect up to trackTotalHitsUpTo to keep the same behavior regarding hit counts. By default, you'll get the exact number of matches if the number is < 10k.

I'm a bit reluctant to change that behavior, since it impacts the user experience.

Add on this, we discovered there's shortcutTotalHitCount logic for MatchAllQuery in a bug fix PR, so we can just vist the size and stop super early. But yeah, for bool query, we have to return 10k because no shortcut way to tell total hits.

OpenSearch/server/src/main/java/org/opensearch/search/query/TopDocsCollectorContext.java

Lines 712 to 763 in a6eb368

    
           static int shortcutTotalHitCount(IndexReader reader, Query query) throws IOException { 
        
               while (true) { 
        
                   // remove wrappers that don't matter for counts 
        
                   // this is necessary so that we don't only optimize match_all 
        
                   // queries but also match_all queries that are nested in 
        
                   // a constant_score query 
        
                   if (query instanceof ConstantScoreQuery) { 
        
                       query = ((ConstantScoreQuery) query).getQuery(); 
        
                   } else if (query instanceof BoostQuery) { 
        
                       query = ((BoostQuery) query).getQuery(); 
        
                   } else if (query instanceof ApproximateScoreQuery) { 
        
                       query = ((ApproximateScoreQuery) query).getOriginalQuery(); 
        
                   } else { 
        
                       break; 
        
                   } 
        
               } 
        
               if (query.getClass() == MatchAllDocsQuery.class) { 
        
                   return reader.numDocs(); 
        
               } else if (query.getClass() == TermQuery.class && reader.hasDeletions() == false) { 
        
                   final Term term = ((TermQuery) query).getTerm(); 
        
                   int count = 0; 
        
                   for (LeafReaderContext context : reader.leaves()) { 
        
                       count += context.reader().docFreq(term); 
        
                   } 
        
                   return count; 
        
               } else if (query.getClass() == FieldExistsQuery.class && reader.hasDeletions() == false) { 
        
                   final String field = ((FieldExistsQuery) query).getField(); 
        
                   int count = 0; 
        
                   for (LeafReaderContext context : reader.leaves()) { 
        
                       FieldInfos fieldInfos = context.reader().getFieldInfos(); 
        
                       FieldInfo fieldInfo = fieldInfos.fieldInfo(field); 
        
                       if (fieldInfo != null) { 
        
                           if (fieldInfo.getPointIndexDimensionCount() > 0) { 
        
                               PointValues points = context.reader().getPointValues(field); 
        
                               if (points != null) { 
        
                                   count += points.getDocCount(); 
        
                               } 
        
                           } else if (fieldInfo.getIndexOptions() != IndexOptions.NONE) { 
        
                               Terms terms = context.reader().terms(field); 
        
                               if (terms != null) { 
        
                                   count += terms.getDocCount(); 
        
                               } 
        
                           } else { 
        
                               return -1; // no shortcut possible for fields that are not indexed 
        
                           } 
        
                       } 
        
                   } 
        
                   return count; 
        
               } else { 
        
                   return -1; 
        
               } 
        
           }

@kkewwei I am not sure the idea for this point, would you explain a bit more?

Consider extending ApproximatePointRangeQuery to term queries. For low-cardinality field, this could yield better optimization results.
I'm thinking the benefit we get from ApproximatePointRangeQuery is the time saved when building iterator from BKD. But for term, I suppose there's not much cost to build iterator but just use the posting list for that term.

kkewwei · 2025-06-05T02:21:23Z

But for term, I suppose there's not much cost to build iterator but just use the posting list for that term.

@bowenlan-amzn Yes. When the number of matched terms exceeds 10,000, the ApproximatePointRangeQuery becomes effective, which is particularly beneficial for low-cardinality field. Conversely, if the matched count is below 10,000, there will be no performance degradation.

Add on this, we discovered there's shortcutTotalHitCount logic for MatchAllQuery in a bug fix PR, so we can just vist the size and stop super early

Yes, I find another bug in MatchAllQuery, it can return accurate totalhits, but not, relation=gte.

prudhvigodithi · 2025-06-05T02:34:12Z

@kkewwei FYI here is PR where the bug was fixed to include the shortcut https://github.com/opensearch-project/OpenSearch/pull/18189/files#diff-53173a30404a65ce7f35073d65258143cfa2d47947ff8e0817337ff3d37e01f3, as part of this the +1 is removed which shows the relation gte (without +1 it shows as eq) . But I have to add it back again in this PR #18235 to fix the flaky test #16851 (comment).

kkewwei added enhancement Enhancement or improvement to existing feature or request untriaged labels May 19, 2025

github-actions bot added the Search:Performance label May 19, 2025

github-project-automation bot added this to Search Project Board May 19, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board May 19, 2025

sandeshkr419 removed the untriaged label May 21, 2025

prudhvigodithi mentioned this issue Jun 4, 2025

Workload updates to test and support OpenSearch Approximation Framework opensearch-project/opensearch-benchmark-workloads#655

Closed

6 tasks

This was referenced Jun 16, 2025

Extend Approximation Framework to other numeric types #18530

Draft

Support new queries for approximation framework opensearch-project/opensearch-benchmark-workloads#674

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feature Request] Improve the performance of numeric range queries #18334

[Feature Request] Improve the performance of numeric range queries #18334

kkewwei commented May 19, 2025 •

edited

Loading

prudhvigodithi commented May 19, 2025

Uh oh!

kkewwei commented May 20, 2025 •

edited

Loading

Uh oh!

prudhvigodithi commented May 20, 2025 •

edited

Loading

Uh oh!

msfroh commented May 20, 2025

Uh oh!

bowenlan-amzn commented Jun 4, 2025

Uh oh!

kkewwei commented Jun 5, 2025 •

edited

Loading

Uh oh!

prudhvigodithi commented Jun 5, 2025 •

edited

Loading

Uh oh!

[Feature Request] Improve the performance of numeric range queries #18334

[Feature Request] Improve the performance of numeric range queries #18334

Comments

kkewwei commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Related component

Describe alternatives you've considered

Additional context

prudhvigodithi commented May 19, 2025

Uh oh!

kkewwei commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

prudhvigodithi commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

msfroh commented May 20, 2025

Uh oh!

bowenlan-amzn commented Jun 4, 2025

Uh oh!

kkewwei commented Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

prudhvigodithi commented Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kkewwei commented May 19, 2025 •

edited

Loading

kkewwei commented May 20, 2025 •

edited

Loading

prudhvigodithi commented May 20, 2025 •

edited

Loading

kkewwei commented Jun 5, 2025 •

edited

Loading

prudhvigodithi commented Jun 5, 2025 •

edited

Loading