fix filter and ordering issues with FloatField and IntegerField #186
base: master
Any progress on this?
@PoByBolek, if you are still interested in working on this issue, we could work together to bring those improvements to the master branch. Note that we now require xapian 1.4. And please, one PR per type of issue, thanks!
It's been a few years since I opened this. But sure, I can give this another shot. Do you want me to create a new PR or update this one to match the current master branch?
Whatever works for you will be fine.
this requires a new format for FloatFields: the hex-encoded result of xapian.sortable_serialise(). This is necessary because sortable_serialise() returns bytes in Python 3 which we can't pass to xapian.Query. Note that hex-encoded byte strings have the same sort order as the actual byte strings. Also, use xapian's OP_VALUE_LE and OP_VALUE_GE in the filter_* methods because calling the XHValueRangeProcessor seems too hacky.
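A quick way to check the sort-order claim without Xapian installed is the sketch below. It uses random byte strings as a stand-in for the output of sortable_serialise(), and `hex_encode` is a hypothetical helper name, not a function from this patch.

```python
import binascii
import random


def hex_encode(raw: bytes) -> str:
    """Return the lowercase hex form of ``raw`` as a text string."""
    return binascii.hexlify(raw).decode('ascii')


# Random byte strings of varying length stand in for serialised values
# (assumption: the real values would come from xapian.sortable_serialise()).
random.seed(0)
samples = [
    bytes(random.randrange(256) for _ in range(random.randrange(0, 9)))
    for _ in range(1000)
]

# Sorting by the raw bytes and sorting by the hex text give the same order,
# because every hex digit pair compares the same way as the byte it encodes.
by_bytes = sorted(samples)
by_hex = sorted(samples, key=hex_encode)
order_preserved = by_bytes == by_hex
```

The property holds because the ASCII digits `0`-`9` sort before the letters `a`-`f`, so comparing two hex digits gives the same result as comparing the nibbles they encode.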
this requires storing integers as hex encoded sortable_serialise()d values which should be safe up to 2**53 (see https://en.wikipedia.org/wiki/Double-precision_floating-point_format)
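The 2**53 limit can be illustrated in plain Python, whose float type is an IEEE 754 double. This is only a sketch of the precision boundary; `survives_double` is a hypothetical helper and does not call Xapian.

```python
LIMIT = 2 ** 53


def survives_double(n: int) -> bool:
    """Return True if ``n`` is unchanged by a round trip through a double."""
    return int(float(n)) == n


# Integers up to and including 2**53 (in either direction) round-trip exactly.
exact_at_limit = survives_double(LIMIT) and survives_double(-LIMIT)
exact_below_limit = survives_double(LIMIT - 1)

# The first integer past the limit has no exact double representation,
# so it is rounded to a neighbouring value.
exact_past_limit = survives_double(LIMIT + 1)
```

Here `exact_at_limit` and `exact_below_limit` are true while `exact_past_limit` is false, so two distinct integers beyond the limit may end up with the same stored value.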
if not begin: | ||
if begin: | ||
if field_type == 'float': | ||
begin = _term_to_xapian_value(float(begin), field_type) | ||
elif field_type == 'integer': | ||
begin = _term_to_xapian_value(int(begin), field_type) | ||
else: | ||
if field_type == 'text': | ||
begin = 'a' # TODO: A better way of getting a min text value? | ||
elif field_type == 'integer': | ||
begin = -sys.maxsize - 1 | ||
elif field_type == 'float': | ||
begin = float('-inf') | ||
elif field_type in ('float', 'integer'): | ||
# floats and ints are both serialised using xapian.sortable_serialise | ||
# so we can use -Infinity as the lower bound for both of them. | ||
begin = _term_to_xapian_value(float('-inf'), field_type) | ||
elif field_type == 'date' or field_type == 'datetime': | ||
begin = '00010101000000' | ||
elif end == '*': | ||
if end == '*': | ||
if field_type == 'text': | ||
end = 'z' * 100 # TODO: A better way of getting a max text value? | ||
elif field_type == 'integer': | ||
end = sys.maxsize | ||
elif field_type == 'float': | ||
end = float('inf') | ||
elif field_type in ('float', 'integer'): | ||
# floats and ints are both serialised using xapian.sortable_serialise | ||
# so we can use +Infinity as the upper bound for both of them. | ||
end = _term_to_xapian_value(float('inf'), field_type) | ||
elif field_type == 'date' or field_type == 'datetime': | ||
end = '99990101000000' | ||
else: | ||
if field_type == 'float': | ||
end = _term_to_xapian_value(float(end), field_type) | ||
elif field_type == 'integer': | ||
end = _term_to_xapian_value(int(end), field_type) |
I think we can also simplify the logic for the other field types here. ValueRangeProcessors (which should maybe just be RangeProcessors now?) can return empty values to represent open ranges. In that case we wouldn't have to make up arbitrary lower and upper bounds for text and date values.
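To make the open-range idea concrete, here is a rough pure-Python sketch of the decision logic. It does not call Xapian: `normalise_bound` and `plan_range_query` are hypothetical names, the returned tuples merely name the kind of query that would be built, and treating a range that is open on both sides as "match all" is an assumption.

```python
from typing import Optional, Tuple

# Markers treated as "no bound given" (assumption for this sketch).
OPEN_MARKERS = ('', '*', None)


def normalise_bound(value: Optional[str]) -> str:
    """Map an open-bound marker to the empty string."""
    return '' if value in OPEN_MARKERS else value


def plan_range_query(begin: Optional[str], end: Optional[str]) -> Tuple[str, ...]:
    """Pick the kind of value query for a possibly open range."""
    begin = normalise_bound(begin)
    end = normalise_bound(end)
    if begin and end:
        return ('VALUE_RANGE', begin, end)
    if begin:
        # Only a lower bound: everything greater than or equal to it.
        return ('VALUE_GE', begin)
    if end:
        # Only an upper bound: everything less than or equal to it.
        return ('VALUE_LE', end)
    # Neither bound given (assumption: nothing to restrict).
    return ('MATCH_ALL',)
```

With this shape no artificial sentinels such as 'a', 'z' * 100 or '00010101000000' are needed for text and date fields.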
This is now condensed in two commits. But the logic remains the same: we now store integer and float fields as the hex encoded result of |
Could the OP_VALUE_LE/OP_VALUE_GE change be part of another PR?
You mean in the |
Hm.... I just realized why I changed the The To avoid these double encoding issues I changed the |
Not sure I understand, do you mean that your |
Not really, no. Because the |
Thanks for exploring that. Then I'll plan a global review of this patch as a whole.
I think the main concern here is compatibility. This will need complete re-indexing of all indexes containing floats and integers. This should be emphasized in a Changelog entry, and it probably also justifies bumping to 4.0.0.
@asedeno, would you mind having a look, too?
Bump. This feels a bit like two years ago... Is there anything else I can do to help move this along?
As I mentioned in a comment above, the main issue here is backwards compatibility. Indexes with integers/floats need to be entirely rebuilt, which can be a non-trivial task on some systems. Therefore, the change must be clearly emphasized in the Changelog.
Aside from the compatibility issues that @claudep wants to call out in the release notes for a new major version, I think this looks good.
We might want some more tests around large integers that can't be represented as doubles, to show we know that there are now new issues there, though the (presumed) release notes may be sufficient.
@PoByBolek would you mind trying to write some release notes for your patch?
Filtering on FloatFields crashed with a ValueError because it tried to convert the already serialised float back into an actual float. Also, in Python 3 xapian.sortable_serialise() returns bytes instead of str, which we then can't use any further (because xapian.Query expects a unicode string). So we now store the hex-encoded result of sortable_serialise(), which has the same sort order as the result of sortable_serialise() itself.
Ordering on IntegerFields with negative values didn't produce the expected results because the integer format produced the following order: -000000000001, -000000000002, -000000000003, 000000000000, 000000000001, 000000000002, 000000000003. By storing integers the same way as floats, i.e. hex-encoding the result of xapian.sortable_serialise(), we get the correct order. Note that sortable_serialise() deals with doubles, but a double can store integers up to 2**53 without any loss of precision.
I also noticed that Xapian 1.4 allows returning empty strings for begin and end from a ValueRangeProcessor and then turns the VALUE_RANGE query into a VALUE_LE or VALUE_GE query respectively, which would greatly simplify the XHValueRangeProcessor. However, this does not work with Xapian 1.2 (and I didn't check Xapian 1.3).
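The IntegerField ordering problem described in these notes can be reproduced with plain string sorting. In the sketch below, `old_style` is a hypothetical reconstruction of the previous format, inferred only from the example values quoted above (an optional minus sign followed by twelve zero-padded digits); the backend's actual formatting code may differ.

```python
def old_style(n: int) -> str:
    """Format ``n`` like the quoted examples: optional '-' plus 12 digits.

    Hypothetical reconstruction, not the backend's real implementation.
    """
    sign = '-' if n < 0 else ''
    return '%s%012d' % (sign, abs(n))


values = [-3, -2, -1, 0, 1, 2, 3]

# Sorting by the stored strings compares character by character, so among
# the negatives the smallest magnitude comes first.
by_string = sorted(values, key=old_style)

# Numeric order, which is what users expect from order_by().
by_number = sorted(values)
```

`by_string` comes out as -1, -2, -3, 0, 1, 2, 3, which matches the order quoted in the notes and differs from `by_number`.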