SOLR-15437: ReRanking/LTR does not work in combination with custom sort and SolrCloud #151
Conversation
The basic idea is to have two queues in the QueryComponent: first collect the results of the reranked docs, then place the other docs after them.
solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
queue.insertWithOverflow(shardDoc);
if(reRankQueue != null && docCounter++ <= reRankDocsSize) {
  ShardDoc droppedShardDoc = reRankQueue.insertWithOverflow(shardDoc);
  // FIXME: Only works if the original request does not sort by score
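For readers skimming the diff fragment above, here is a self-contained sketch of the two-queue routing it implements. This is a sketch under assumptions: the collector class, its constructor and the drop-out handling are illustrative, and only ShardDoc, insertWithOverflow and the field names come from the diff; the queues are stand-ins with Lucene PriorityQueue semantics.

import org.apache.lucene.util.PriorityQueue;
import org.apache.solr.handler.component.ShardDoc;

// Illustrative only -- not the PR's QueryComponent code.
class TwoQueueCollectorSketch {
  private final PriorityQueue<ShardDoc> queue;       // ordered by the request's sort
  private final PriorityQueue<ShardDoc> reRankQueue; // ordered by reranked score
  private final int reRankDocsSize;
  private int docCounter = 0;

  TwoQueueCollectorSketch(PriorityQueue<ShardDoc> queue,
                          PriorityQueue<ShardDoc> reRankQueue,
                          int reRankDocsSize) {
    this.queue = queue;
    this.reRankQueue = reRankQueue;
    this.reRankDocsSize = reRankDocsSize;
  }

  void collect(ShardDoc shardDoc) {
    if (reRankQueue != null && docCounter++ <= reRankDocsSize) {
      // insertWithOverflow returns the evicted element once the queue is full
      ShardDoc droppedShardDoc = reRankQueue.insertWithOverflow(shardDoc);
      if (droppedShardDoc != null) {
        queue.insertWithOverflow(droppedShardDoc); // demote into the original-sort queue
      }
    } else {
      queue.insertWithOverflow(shardDoc);
    }
  }
}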
The current solution only works if the original sort did not sort by score.
This is because the score of the documents is overwritten during the reRanking.
The reRankDocs param, which specifies the number of docs that should be reRanked, is used per shard, but also has to be applied while combining the results.
Therefore, each shard response may contain documents which were reRanked on the shard but should not be reRanked in the combined result.
Example:
reRankDocs = 2
shard1: doc_1 (score 200, reRanked), doc_2 (score 100, reRanked, original score 40), doc_3 (score 30)
shard2: doc_4 (score 300, reRanked), doc_5 (score 50, reRanked, original score 25), doc_6 (score 20)
expected result:
doc_4 (score 300, reRanked), doc_1 (score 200, reRanked), doc_2, doc_3, doc_5, doc_6
actual result:
doc_4 (score 300, reRanked), doc_1 (score 200, reRanked), doc_2, doc_5, doc_3, doc_6
The problem is that we compare the score after reRanking (docs 2 & 5) with the score before reRanking (docs 3 & 6).
We have no access to the score before reRanking at this point, and I currently see no way to retrieve it again.
Depending on the reRanking algorithm used, these scores may differ greatly, which results in an incorrect ordering of the results starting at positions > reRankDocs.
<thinking out aloud>
... The reRankDocs-param which specifies the amount of docs that should be reRanked, is used per shard, but also has to be applied while combining the results. ...
Example: ... reRankDocs = 2 shard1: ... shard2: ...
Might it be helpful to include in the example the original score for doc_1 and doc_4 and to have a "pre-example" of what the logic and outcome is if the collection was single sharded and a "post-example" for (say) 4-sharded?
Can distributed, sharded re-ranking arrive at exactly the same outcome? Might the "but also has to be applied while combining the results" be up-for-debate e.g. a reRankDocs on shard level and a (possibly different) foobarReRankDocs on the federation level?
... We have no access to the score before reRanking at this point and I currently see no possibility to retrieve it again. ...
Conceptually, if a feature store had an OriginalScoreFeature and fl=[fv] feature values were requested, then in terms of "transport" the score before reRanking could be available in the federator. As a proof-of-concept that could be explored by making sure the store has the feature and the fl asks for the feature values. As a real implementation it could be a bit messy to capture, send and use the score that way. And actually, hmm, the score might be available via the fl through the "get fields" part of the distributed use of the query component, but if/since the id merging happens before "get fields", then no original score is available during merging.
</thinking out aloud>
The current solution only works if the original sort did not sort by score. ...
... We have no access to the score before reRanking at this point and I currently see no possibility to retrieve it again. ...
I had some more ideas around this; it's a semi-structured trail of thoughts, but I'm sharing it in that form in the hope that it might be useful and/or trigger further ideas and thoughts.
So as indicated by the location of the // FIXME: Only works if the original request does not sort by score comment, the problem happens when a document drops out of reRankQueue (where it's compared against other reranked documents via the reranked score) and joins queue (where it needs to be compared against other (reranked or unreranked) documents via the sort clause).
One way to avoid this is to ensure dropping out never happens but that is a change to the overall reranking logic and so for now here let's assume that is not the logic we want ...
OriginalScoreFeature via fl=[fv] providing an original score is too late timing wise since it happens during "get fields" after the "merge ids".
ShardParams.DISTRIB_SINGLE_PASS "distrib.singlePass" could be used to bring timing forward but that would have other performance etc. implications.
Implying "distrib.singlePass=true" and implying fl=[fv] and an OriginalScoreFeature that might not have been there in the model or feature store would also be messy both on shard and federator code level.
All that aside, this is a reranking problem generally, i.e. not an LTR reranking problem specifically, i.e. OriginalScoreFeature and fl=[fv] is too specific a solution.
So let's leave the merging logic here alone for a moment and go to the shard code and determine where exactly the "original_score" information is lost. If we find where it's lost then we can preserve it there and pass it along via the code path that the "score" information takes and eventually it would reach the merging logic here, at least that's the theory.
The score changes during rescoring. In the case of LTR rescoring, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/contrib/ltr/src/java/org/apache/solr/ltr/LTRRescorer.java#L121-L131 allocates reranked, which is then filled from firstPassResults, i.e. structurally this is where the information loss happens. For non-LTR rescoring something equivalent presumably happens elsewhere.
For thinking purposes let's assume that we could add an original_score field to ScoreDoc itself, i.e. at line 210 we could preserve the score:
scorer.getDocInfo().setOriginalDocScore(hit.score);
+ hit.original_score = hit.score;
hit.score = scorer.score();
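As a self-contained illustration of that capture point: the helper below and its parallel-array return are assumptions for illustration, not LTRRescorer's actual code.

import java.util.function.ToDoubleFunction;
import org.apache.lucene.search.ScoreDoc;

final class RescoreSketch {
  // Capture each first-pass score before the second pass overwrites it; the
  // returned array is exactly the information that is lost today and would
  // still need transporting from shard to federator.
  static float[] rescorePreservingOriginal(ScoreDoc[] firstPassHits,
                                           ToDoubleFunction<ScoreDoc> secondPass) {
    float[] originalScores = new float[firstPassHits.length];
    for (int i = 0; i < firstPassHits.length; i++) {
      originalScores[i] = firstPassHits[i].score;
      firstPassHits[i].score = (float) secondPass.applyAsDouble(firstPassHits[i]);
    }
    return originalScores;
  }
}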
Next let's explore how a distributed query's scores get from the shard to the federator.
(Shoutout to Hoss' "Lifecycle of a Solr Search Request" talk at this point e.g. http://people.apache.org/~hossman/rev2017/ slides.)
Somewhat jumping into the middle of the code QueryComponent L368 and L1512
QueryResult result = new QueryResult();
...
rb.setResult(result);
QueryResult contains DocListAndSet, which contains DocList and DocSet. DocSlice implements DocList, and one of the places (there might be others) where ScoreDoc turns into DocSlice is at https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1623
So for thinking purposes we have added an original_score field to ScoreDoc already, and DocSlice contains an optional scores list, so let's imagine adding an optional original_scores list alongside it: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/DocSlice.java#L39
public class DocSlice extends DocSetBase implements DocList {
...
final int[] docs; // a slice of documents (docs 0-100 of the query)
final float[] scores; // optional score list
+ final float[] original_scores; // optional original_score list
final long matches;
final TotalHits.Relation matchesRelation;
...
Now then though, how will that new original score list get "transported" from shard to federator?
Using an unusual(?) trick here: instead of navigating the code via DocSlice.scores we could navigate via DocSlice.matchesRelation on the assumption that it leads to the same area but more easily.
One of the areas is https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java#L189-L197 and there the res.getReturnFields() leads to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrReturnFields.java#L51-L53
public class SolrReturnFields extends ReturnFields {
// Special Field Keys
public static final String SCORE = "score";
and the speculation that perhaps a SolrReturnFields.ORIGINAL_SCORE might be needed.
That's where the trail of thoughts led me.
Of course, it is too early to reach any conclusions here, but going intuitively and based on the code trails above it seems to me that getting access to the original_score is feasible but would be quite an undertaking. So before investing effort in that direction it could be helpful to first further consider, and firmly rule out, any overall reranking logic that requires only the already available information: e.g. for shards to use reRankDocs, and for the federator whilst combining results to not use reRankDocs but to combine all shards' reranked documents separately in a reRankQueue that has the same sizing as queue (documents dropping out of reRankQueue then need not be added to queue, since their dropping out means there are already sufficient documents in reRankQueue to serve the request).
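To make that alternative concrete, a hedged sketch of the federator-side merge under that logic. The queue types are stand-ins with Lucene PriorityQueue semantics, and the "first reRankDocs docs in each shard response are the reranked ones" assumption is discussed further below.

import org.apache.lucene.util.PriorityQueue;
import org.apache.solr.handler.component.ShardDoc;

// Illustrative only: reRankQueue is sized like queue, so docs overflowing out
// of it are simply discarded rather than demoted -- enough reranked docs
// remain in reRankQueue to serve the request.
class FederatorMergeSketch {
  void merge(Iterable<ShardDoc> shardDocs, int reRankDocsPerShard,
             PriorityQueue<ShardDoc> reRankQueue, PriorityQueue<ShardDoc> queue) {
    for (ShardDoc doc : shardDocs) {
      if (doc.orderInShard < reRankDocsPerShard) {
        reRankQueue.insertWithOverflow(doc); // compared via reranked score
      } else {
        queue.insertWithOverflow(doc);       // compared via the request's sort
      }
    }
    // final results: drain reRankQueue first, then fill any remainder from queue
  }
}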
I hope that is not too opinionated a way of sharing one perspective and that some bits of it are useful? This distributed reranking stuff is an intriguing problem! :-)
I really appreciate how detailed you share your thoughts!
This is extremely helpful to discover different ways to solve this problem.
First regarding the original score:
Personally, I like the idea of adding the value to the ScoreDoc, but that would entail a PR to Lucene, and we would have to wait for a newer version to use that change in Solr. Like you already mentioned, that sounds like quite an undertaking.
Moving on to the problem with sort by score:
e.g. for shards to use reRankDocs and for the federator whilst combining results to not use reRankDocs but to combine all shards' reranked documents separately in a reRankQueue that has the same sizing as queue (and then documents dropping out of reRankQueue need not be added to queue since them dropping out means there's already sufficient documents in reRankQueue to serve the request).
I struggle to understand your idea.
First I understood that you want to collect the returned documents before reranking in one queue (queue A) and the reranked documents in another queue (queue B). Then we could take all the documents in queue A first, remove them from queue B, and add the results from queue B after the results from queue A.
But the reranking happens on the shards and we are on the coordinator (in mergeIds), so we only have the already "mixed" (reranked and not reranked) results from the shards.
Then I thought, maybe you want the shards to return multiple result lists? One with the reranked documents and another one with the results before reranking?
Then we could first merge the results from the reranked lists (= take the top ones), remove these documents from the queues that contain the results before reranking and then merge these results to append them after the reranked docs.
I did not look into the code yet but that feels like quite a big task.
What do you think about deferring both problems (making the original score visible and fixing the reranking with sort by score) to another PR? So that we first make LTR with a cloud setup possible at all, and in a next step fix the problem with sort=score?
I know that the sort by score is the default, but I am also positive that there are many people with custom sorting that could already benefit from the first part (this PR without fix for score sort).
We could also aim to finish that by 8.9 maybe?
If I did not understand your last idea correctly, please let me know. And please be aware that I do not want to create pressure by suggesting the exclusion of the fix from this PR. :)
... but that would entail a PR to Lucene and we would have to wait for a newer version ...
Maybe, maybe not. In terms of thinking about the code it is perhaps easiest to imagine ScoreDoc having an additional original_score element, but in terms of writing the code it might not necessarily work out that way, e.g. instead imagine within Solr
public class ScoreDocWithOriginalScore extends org.apache.lucene.search.ScoreDoc {
  public float original_score;

  // sketch: constructor spelled out here for completeness
  public ScoreDocWithOriginalScore(int doc, float score, float original_score) {
    super(doc, score);
    this.original_score = original_score;
  }
}
or something similar and for that class to be used if and only if needed thus reducing impact of the changes on code paths that don't need the change. But still even then there'd be quite a bit of Solr code to cover to build confidence that the original score makes it to all the right places and isn't somehow lost somewhere.
... to collect the returned documents before reranking in one queue (queue A) and the reranked documents in another queue (queue B). ...
... But the reranking happens on the shards and we are on the coordinator (in mergeIds), so we only have the already "mixed" (reranked and not reranked) results from the shards. ...
Yes, the idea was to collect reranked and not-reranked documents in separate non-overlapping queues, and it assumed that within our "mixed" results we can already tell apart the reranked and not-reranked results, i.e. within each shard's response the first reRankDocs documents will be reranked and documents (if any) after that are not-reranked, i.e. the i < reRankDocsSize condition. I have not verified if that assumption is accurate, but here's a short example to explore the idea, noting that in the dual-sharded collection the client application chooses a halved reRankDocs on the understanding that the collection has two shards.
single sharded collection
shard1: a b c d e f g h i j x y z # without reranking
shard1: J I H G F E D C B A x y z # with reRankDocs=10 (assuming reranking reverses order of reranked docs and uppercases them)
overall results: J I H G F E D C B A x y z
dual sharded collection
shard1: a c e g i y # without reranking
shard2: b d f h j x z # without reranking
shard1: I G E C A y # with reRankDocs=5 (assuming reranking reverses order of reranked docs and uppercases them)
shard2: J H F D B x z # with reRankDocs=5 (assuming reranking reverses order of reranked docs and uppercases them)
reRankedQueue: J I H G F E D C B A
queue: x y z
overall results: J I H G F E D C B A x y z
... What do you think about excluding ... to another PR? ... people with custom sorting that could already benefit from the first part ...
Yes, that possibility had occurred to me too :-)
In principle I think a fix for half of the issue could be better than no fix:
- if it can be clearly documented/communicated what is fixed and what is not fixed
- if existing behaviour is maintained for the 'not (yet) fixed' scenario
- if we anticipate that the fix for the second half is not going to break/change things for anyone who benefitted from the first fix
In terms of the specifics here, the "fixed" condition would be that "sort" did not use "score" in any way, i.e. sort=popularity desc is "fixed" but both sort=score desc and sort=popularity desc, score desc are "not fixed" as yet. I wonder if QueryComponent and SortedHitQueueManager could accommodate that logic so that we would maintain existing behaviour for the not-yet-fixed use cases?
... If I did not understand your last idea correctly, please let me know. ...
Thanks for identifying which bits were and weren't clear! I've tried to clarify with the example and by making the key assumption explicit. Please let me know if that was useful or not. I think that approach would solve both the without-score and the with-score scenario.
... And please be aware that I do not want to create pressure by suggesting the exclusion of the fix from this PR. :)
I did not perceive any pressure here at all but appreciate your openness and clarity w.r.t. no pressure being intended, thank you. :)
Hi @cpoerschke please excuse the delay, I am very sorry!
I just found time to look at your commit.
The tests fail because they assume that the number of reRankDocs is applied to the whole result, whereas your change applies it to each shard.
So with e.g. 3 shards and reRankDocs=2 we expect the first 2 docs to be reranked, but instead 6 docs get reranked. Then the assertion on sort by id fails.
(Actually, the 6 is variable. I ran tests where 5 docs were reRanked in total, because shard 3 only had one doc.)
Because the client cannot (and should not have to) know how many shards the cluster has, we would have to convert the reRankDocs that were passed in the {!ltr...} query to a per-shard value. So basically <value that was passed in> / numShards.
However, I don't think that we can convert the global setting for reRankDocs into a per-shard setting, because we cannot know how the top reRankDocs are distributed across the shards.
So when the client wants 6 docs to be reRanked and we have 3 shards, we cannot tell each shard that it should reRank the first 2 documents.
This is because the top 6 reRanked docs may all be on one shard.
That was our reasoning behind the droppedShardDoc logic.
So referring to your examples (thank you for the visualization idea! :) ):
dual sharded collection with reRankDocs=6 (3 per shard)
shard1: a c e g i y # without reranking
shard2: b d f h j x z # without reranking
shard1: E C A g i y
shard2: F D B h j x z
reRankedQueue: E C A F D B # assuming that all scores are higher on shard 1
queue: g i y h j x z
overall results: E C A F D B g i y h j x z
But the top 6 docs by their reRanked score could be:
dual sharded collection with reRankDocs=6 (also applied on shard level)
shard1: a c e g i y # without reranking
shard2: b d f h j x z # without reranking
shard1: Y I G E C A
shard2: X J H F D B z
reRankedQueue: Y I G E C A # assuming that all scores are higher on shard 1
queue: b d f h j x z
overall results: Y I G E C A b d f h j x z
That's a bit sad, because if we could just convert the global param for the reRankDocs to a per-shard one, we would not have the problem with the missing originalScore for documents that were reRanked on a shard but not in the whole result. :/
Hi @cpoerschke please excuse the delay, I am very sorry!
Hi @tomglk, no worries at all and nothing to be sorry about, we all contribute here as and when time permits. :)
I just found time to look at your commit.
Thank you for taking a look at the commit!
The tests fail because they assume, that the number of reRankDocs is applied to the whole result. Your change applies it to each shard. ...
... Because the client cannot (and should not have to) know, how many shards the cluster has ...
... if the reRankScore for all docs on shard2 is higher than the score for each reRanked doc on shard1. ...
Yes, very well put, so this all shows then that (with my change applied) someone (Solr or the client) doing reRankDocs / numShards could give different results than reRankDocs for a single-sharded collection.
That's a bit sad, ...
Yes it is.
Hmm, so let's maybe then loop back to and continue with your idea of a multi-part solution? I.e. this pull request to fix use cases with a score-free sort, and future pull request(s) for use cases whose sort contains a score (and which therefore require access to the original score in the coordinator, where it is currently not available).
Hi @cpoerschke ,
thank you for the reassurance. :)
I just found the timing quite unlucky, because we were suggesting an inclusion of this PR in 8.9 just before the "break", and I have no idea when 8.9 may be released.
So I cannot estimate the time pressure we would be under if we decided to split the PRs and aim for the 8.9 release.
Do you have any information on this? Maybe just a broad timeframe like 2 weeks / 2 months?
Having said that,
maybe then loop back to and continue with your idea of a multi-part solution? I.e. this pull request to fix for use cases with a score-free sort and future pull request(s) for use cases whose sort contains a score (and which therefore requires access to the original score in the coordinator but where it is currently not available).
I think that this approach would make the most sense.
We were looking into different methods to gain access to the original score during the week but do not have a good solution yet.
Just your point
if existing behaviour is maintained for the 'not (yet) fixed' scenario
is something we would have to look at again. Currently both the existing behavior and the sort by score with our fix are broken, but I don't think that they are broken in the same way.
And this point
if it can be clearly documented/communicated what is fixed and what is not fixed
really is important. I do not know where we would communicate this best, but we definitely should make really clear what was not working, what is working after this PR, and what will need another PR to work.
This
fix for the second half is not going to break/change things for anyone who benefitted from the first fix
should be no problem. The first PR would only work for people with sort != score, and the second PR would only provide benefit for people with sort == score. Returning information for the original score also should not break anything for people benefiting from this PR, because that is just a bit more information that can, but does not have to, be used.
Regarding
future pull request(s)
I think that access to the original score and allowing the sort by score are very intertwined. So as soon as we solve access to the score, sorting by score is no longer a big problem. Therefore this could be one PR, in my opinion.
... I have no idea when 8.9 may be released. ... Do you have any information on this? Maybe just a broad timeframe like 2 weeks / 2 months? ...
As per https://lists.apache.org/thread.html/rc9b83cd6c1608088365a911b40fa98f0775a00b21ae1491725d63ebc%40%3Cdev.lucene.apache.org%3E the 8.9 release process is now underway and from the "Release Lucene/Solr 8.9.0 should we have it soon" mailing list thread subject i.e. https://lists.apache.org/thread.html/r6bde6e242a0801a63386dc3b7efb10d77038271cd9c69dc07f9c76a1%40%3Cdev.lucene.apache.org%3E my understanding is that this is for both Lucene and Solr.
I hope that's not a disappointment; perhaps I could have provided more clarity w.r.t. likely 8.9 timelines earlier here, sorry!
Perhaps "we are targeting 8.10" rather than "we just missed 8.9" could be another way to look at it?
W.r.t. a broad timeframe for 8.10 (assuming there will be an 8.10 before 9.0) -- I don't know, but historically (e.g. http://archive.apache.org/dist/lucene/solr/) there's usually a release every couple of months, though it depends (we have documentation on this somewhere which explains it better but I can't find it right now). Hope that helps.
"we are targetting 8.10" rather than "we just missed 8.9"
I like that mindset!
That also means that we are not under time pressure, and we can aim to solve the original score problem by then, to be able to solve all problems in one release.
So no worries! :)
This looks very interesting and the narrative description of the approach in the solution part of the pull request really helped me find a quick way into reading of the code changes, thank you!
…as used to keep the shortcut logic. Instead, make it configurable via constructor param.
…eues to SortedHitQueueManager to remove complexity from mergeIds( )
…fl, because we need it for sorting
if(rb.getRankQuery() != null && rankQuery instanceof AbstractReRankQuery){
... this pull request to fix for use cases with a score-free sort and future pull request(s) for use cases whose sort contains a score ...
I think something like this could work:
boolean hasScoreSortField = false;
for (SortField sortField : sortFields) {
  if (SortField.FIELD_SCORE.equals(sortField)) {
    hasScoreSortField = true;
    break;
  }
}
if(rb.getRankQuery() != null && rankQuery instanceof AbstractReRankQuery && hasScoreSortField == false){
What do you think? (And no time pressure or rush at all to consider this idea, I just wanted to post it here today to perhaps help us shift away from the "bit sad" conclusion of the "The current solution only works if ..." conversation above.)
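For reference, a standalone illustration of the detection logic in the suggestion above. The "popularity" field is a made-up example, and the assumption is that a sort on score parses to Lucene's SortField.FIELD_SCORE:

import org.apache.lucene.search.SortField;

public class ScoreSortCheck {
  public static void main(String[] args) {
    SortField scoreDesc = SortField.FIELD_SCORE; // what a sort on score yields
    SortField popularityDesc = new SortField("popularity", SortField.Type.LONG, true); // sort=popularity desc
    // true -> score is in the sort, fall back to the existing single-queue behaviour
    System.out.println(SortField.FIELD_SCORE.equals(scoreDesc));
    // false -> score-free sort, the dual-queue fix applies
    System.out.println(SortField.FIELD_SCORE.equals(popularityDesc));
  }
}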
Hi @cpoerschke ,
because your solution looked good to me, I added it in df6f30a. :)
I also added some comments about why we do that and linked to this PR.
Would you agree that this now finishes the scope of this PR?
I think this was the only ToDo left.
From my point of view the only thing left is to find a way of communicating what this PR solves and what is still broken. I do not know the right place for that, but maybe you have an idea?
Hi @tomglk,
Thanks for updating the branch so that the existing behaviour is maintained when the sort includes a score!
In terms of communicating the scope of the fix, I can see a few complementary possibilities:
- solr/CHANGES.txt entry for the JIRA issue: if this is very specific, e.g. "custom sort without a score" rather than only "custom sort" wording, then that could convey the scope and non-scope of the fix.
- Solr Ref Guide: https://solr.apache.org/guide/8_9/query-re-ranking.html appears to not yet mention distributed queries, so there's an opportunity there to document limitations or caveats, e.g. https://solr.apache.org/guide/8_9/result-grouping.html#distributed-result-grouping-caveats does something similar.
- code javadocs: I've pushed an exploratory commit to the branch -- feel free to revert or to adapt, e.g. the (Single/Dual) SortedHitQueueManager name choice is probably too generic -- and in it the javadocs for the DualSortedHitQueueManager.supports method could provide a natural place to communicate what is and what is not supported, and the usage of the method when deciding the DualSortedHitQueueManager vs. SingleSortedHitQueueManager instantiation helps communicate structurally when existing behaviour is maintained, since SingleSortedHitQueueManager is a simple wrapper around ShardFieldSortedHitQueue.
... I think this was the only ToDo left. ...
The test coverage in solr/core remains on the ToDo list I think, though it's easy to miss since it's rather hidden amongst all the discussion detail that led us to this point.
Hi @cpoerschke ,
thank you for the different pointers for the documentation. :)
You are right, the test coverage still is a ToDo.
I like the approach of splitting the SortedHitQueueManager.
But I do not think that the QueryComponent should know when to create which SortedHitQueueManager, so I moved the newSortedHitQueueManager() into the manager itself and changed the interface to an abstract class. (bbf887f)
What do you think about that change?
But I do not think that the QueryComponent should know when to create which SortedHitQueueManager. So I moved the newSortedHitQueueManager() into the manager itself and changed the interface to an abstract class. (bbf887f) What do you think about that change?
Makes sense. And thanks for adding wonderful javadocs for the new classes too!
Hi @cpoerschke , we have a very rough first draft of a possible solution and created a draft PR ( #171 ) which is based upon the code from this PR. Apart from that, we would continue with the scope of only fixing the sort now, starting with applying your suggestion.
// we always need the score for reRanking
rb.setFieldFlags(rb.getFieldFlags() | GET_SCORES); |
909a25d proposes to continue to have only one setFieldFlags call.
That looks good to me! :)
One thing that I stumbled upon while looking at your change is that we actually override the flags instead of just adding the score, because we start with this:
int flags = 0;
A comment why we do this instead of starting with
int flags = rb.getFieldFlags();
would be great, I think.
Great point. How about something along these lines?
int flags = 0; // we presume to be first to set this flag and hereby override any flags other components' prepare methods may have already set
That's exactly what I was looking for, thank you. :)
I appreciate all your ideas for improved comments! I included them in the commit where I added your comment above (dc2b84f).
…ager and change interface to abstract class
import org.apache.lucene.search.SortField;

/**
 * This class is used to manage the multiple SortedHitQueues that we need during mergeIds( ).
minor:
* This class is used to manage the multiple SortedHitQueues that we need during {@link QueryComponent#mergeIds(ResponseBuilder, ShardRequest)}.
if (sortField.getType().equals(SortField.Type.SCORE) && sortField.getField() == null) {
  // using two queues makes reRanking on Solr Cloud possible (@see https://github.com/apache/solr/pull/151 )
  // however, the fix does not work if the non-reRanked docs should be sorted by score ( @see https://github.com/apache/solr/pull/151#discussion_r640664451 )
  // to maintain the existing behavior for these cases, we disable the broken use of two queues and use the (also broken) status quo instead
minor:
// to maintain the existing behavior for these cases, we do not support the broken use of two queues and continue to use the (also broken) status quo instead
if(shardDoc.orderInShard < reRankDocsSize) {
  ShardDoc droppedShardDoc = reRankQueue.insertWithOverflow(shardDoc);
  // Only works if the original request does not sort by score
  // @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-15437")
@AwaitsFix is usually used for tests I think, and using it in non-test code (albeit in a comment) could be confusing. Suggestion:
// see https://issues.apache.org/jira/browse/SOLR-15437
* In all other cases only one queue is used.
*
* @param sortFields the fields by which the results should be sorted
* @param ss SortSpec of the responseBuilder
* @param ss SortSpec
or perhaps we can remove the ss parameter then and do rb.getSort() instead?
…d; improve comments
I ran into this problem today -- I'm glad others are working on it. Unless I've misattributed this issue, I think this problem is not related to a "custom sort" -- one only needs to have a re-ranking boost function that could reduce the score from what it was. QueryComponent.mergeIds doesn't know how to deal with scores that don't descend when you're sorting by score (as is the default). The solution above is one approach and I think it's basically fine. It assumes that …
We're also affected by this bug, and first of all we're wondering: is anyone still working on these PRs? Furthermore, while the solution proposed in PRs 151/171 also covers many corner cases, e.g. uneven distributions of re-ranked docs across the shards, we were thinking about deploying a temporary, simpler fix for our local installation, perhaps a variant of what @dsmiley mentioned in his last comment: to adjust the scores of all re-ranked documents locally in the shards to make them larger than the largest original score of the documents before re-ranking. There is still the possibility that the original scores differ per shard, but since we also accept that for normal ranking (except for distributed IDF via global statsCache), we figured we could ignore that. We would also accept that the scores of all re-ranked docs are always better than the highest non-reranked score, and we'd set reRankDocs to be our intended total number divided by the actual number of shards we're using. However (not being a Solr expert), is there something obvious I'm missing here?
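For what it's worth, a minimal sketch of that workaround, assuming non-negative scores and that the first reRankDocs hits in a shard's result are the reranked ones; the class and method names are made up, this is not a Solr API:

import org.apache.lucene.search.ScoreDoc;

final class RerankScoreLiftSketch {
  // Shift every reranked score above the best first-pass score so the merged
  // per-shard streams still descend by score; the relative order among the
  // reranked docs is preserved.
  static void liftRerankedScores(ScoreDoc[] hits, int reRankDocs, float maxOriginalScore) {
    int n = Math.min(reRankDocs, hits.length);
    for (int i = 0; i < n; i++) {
      hits[i].score += maxOriginalScore + 1.0f; // assumes non-negative reranked scores
    }
  }
}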
I had to stop due to a lack of time. However, I can try to manage collaborating with you to finally fix this.
Thanks a lot for the offer, @tomglk! I guess it will take us some time to get a full understanding of the issue(s) and the proposed solution. Perhaps we'll start by looking into the "simpler" fix proposed above and see if we can get that to run on our side -- that will probably already give us a better understanding of these issues. We'll definitely get back to you and are looking forward to discussions on this!
This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!
This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.
https://issues.apache.org/jira/browse/SOLR-15437
Description
We found out that a plain SolrCloud setup will return randomly sorted docs if you define a custom sort field or function. This problem is also mentioned on mailing lists (see the Jira issue) and has not really been fixed yet.
Solution
The basic idea is to fix this in the mergeIds method of the QueryComponent. A PriorityQueue was used to collect the results from all shards. We added another PriorityQueue next to the existing one. The new queue is used to collect the defined count of reRanked documents. After collecting all documents, the two queues have to be combined to fill the resultIds.
Be aware that the reRankDocs threshold is used per shard but also has to be applied across the whole result. To handle this, documents get removed from the reRankQueue and inserted into the normal queue if their score after reRanking does not place them under the top results defined by the reRankDocs.
The documents in the reRankQueue are sorted by the score (after reRanking). The original sort is applied to all documents in the original queue.
To be able to correctly sort the not-reRanked documents, we had to remove the shortcut of taking the orderInShard instead of comparing the sortValues from the ShardFieldSortedHitQueue. We extracted the functionality to the class ShardFieldSortedHitQueueWithSameShardCompareSkip, which is used for the reRankQueue or in cases without reRanking.
Tests
We've added two tests with custom sorting and reranking activated. One test reranks the whole result set and the other one reranks only a subset, to make sure that the custom sort is applied to docs that were not reranked.
Checklist
Please review the following and check all that apply:
- I have developed this patch against the main branch.
- I have run ./gradlew check.