Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix filter and ordering issues with FloatField and IntegerField #186

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

PoByBolek
Copy link

Filtering on FloatFields crashed with a ValueError because it tried to convert the already serialised float back into an actual float. Also, in Python 3 xapian.sortable_serialise() returns bytes instead of str which we then can't use any further (because xapian.Query expects a unicode string). So we now store the hex encoded result of sortable_serialise() which has the same sort order as the result of sortable_serialise() itself.

Ordering on IntegerFields with negative values didn't produce the expected results because the integer format produced the following order: -000000000001, -000000000002, -000000000003, 000000000000, 000000000001, 000000000002, 000000000003. By storing integers the same way as floats, i.e. hex encode the result of xapian.sortable_serialise(), we get the correct order. Note that sortable_serialise() deals with doubles but a double can store integers up to 2**53 without any loss of precision.

I also noticed that Xapian 1.4 allows returning empty strings for begin and end from a ValueRangeProcessor and then turns the VALUE_RANGE query into VALUE_LE and VALUE_GE query respectively, which would greatly simplify the XHValueRangeProcessor. However, this does not work with Xapian 1.2 (and I didn't check Xapian 1.3).

@coveralls
Copy link

coveralls commented Apr 26, 2019

Coverage Status

Coverage decreased (-0.2%) to 97.345% when pulling 62685e2 on PoByBolek:develop into 3fc4cfe on notanumber:master.

@PoByBolek
Copy link
Author

Any progress on this?

@claudep
Copy link
Collaborator

claudep commented Feb 6, 2022

@PoByBolek, if you are still interested in working on this issue, we could work together to bring those improvements to the master branch. Note we are now requiring xapian 1.4. And please one PR per type of issue, thanks!

@PoByBolek
Copy link
Author

It's been a few years since I opened this. But sure, I can give this another shot. Do you want me to create a new PR or update this one to match the current master branch?

@claudep
Copy link
Collaborator

claudep commented Feb 7, 2022

What works for you will be fine.

this requires a new format for FloatFields: the hex-encoded result of xapian.sortable_serialise(). This is necessary because sortable_serialise() returns bytes in Python 3 which we can't pass to xapian.Query. Note that hex-encoded byte strings have the same sort order as the actual byte strings.

Also, use xapian's OP_VALUE_LE and OP_VALUE_GE in the filter_* methods because calling the XHValueRangeProcessor seems too hacky.
this requires storing integers as hex encoded sortable_serialise()d values which should be safe up to 2**53 (see https://en.wikipedia.org/wiki/Double-precision_floating-point_format)
Comment on lines -105 to +132
if not begin:
if begin:
if field_type == 'float':
begin = _term_to_xapian_value(float(begin), field_type)
elif field_type == 'integer':
begin = _term_to_xapian_value(int(begin), field_type)
else:
if field_type == 'text':
begin = 'a' # TODO: A better way of getting a min text value?
elif field_type == 'integer':
begin = -sys.maxsize - 1
elif field_type == 'float':
begin = float('-inf')
elif field_type in ('float', 'integer'):
# floats and ints are both serialised using xapian.sortable_serialise
# so we can use -Infinity as the lower bound for both of them.
begin = _term_to_xapian_value(float('-inf'), field_type)
elif field_type == 'date' or field_type == 'datetime':
begin = '00010101000000'
elif end == '*':

if end == '*':
if field_type == 'text':
end = 'z' * 100 # TODO: A better way of getting a max text value?
elif field_type == 'integer':
end = sys.maxsize
elif field_type == 'float':
end = float('inf')
elif field_type in ('float', 'integer'):
# floats and ints are both serialised using xapian.sortable_serialise
# so we can use +Infinity as the upper bound for both of them.
end = _term_to_xapian_value(float('inf'), field_type)
elif field_type == 'date' or field_type == 'datetime':
end = '99990101000000'
else:
if field_type == 'float':
end = _term_to_xapian_value(float(end), field_type)
elif field_type == 'integer':
end = _term_to_xapian_value(int(end), field_type)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also simplify the logic for the other field types here. ValueRangeProcessors (which should maybe just be RangeProcessors now?) can return empty values to represent open ranges. In that case we wouldn't have to make up arbitrary lower and upper bounds for text and date values.

@PoByBolek
Copy link
Author

This is now condensed in two commits. But the logic remains the same: we now store integer and float fields as the hex encoded result of xapian.sortable_serialise() which preserves order even for negative integer values.

@claudep
Copy link
Collaborator

claudep commented Feb 9, 2022

Could the OP_VALUE_LE/OP_VALUE_GE change part of another PR?

@PoByBolek
Copy link
Author

You mean in the _filter_gte() and _filter_lte() methods? Sure!

@PoByBolek
Copy link
Author

Hm.... I just realized why I changed the _filter_gte() and _filter_lte() methods to not use the XHValueRangeProcessor anymore.

The XHValueRangeProcessor.__call__() method expects strings as its begin and end parameters, which is why we call _term_to_xapian_value() in the _filter_gte() and _filter_lte() methods. However, the value range processor then calls _term_to_xapian_value() again in certain situations. This is especially problematic for integer and float fields because it would essentially try to encode the terms twice (which was the original problem I had with the float fields). Even if float(_term_to_xapian_value(term)) didn't raise a ValueError, it would definitely not be the same value anymore, i.e. float(term) != float(_term_to_xapian_value(term)).

To avoid these double encoding issues I changed the _filter_gte() and _filter_lte() methods to use the OP_VALUE_GE and OP_VALUE_LE queries directly and leave the XHValueRangeProcessor for raw query uses (maybe?), i.e. for situations where the user provides a range query like float_field:12.3..23.4. In those situations I wouldn't expect the user to know our internal encoding for the float and integer fields and just provide "normal" string formatted values. This means that the XHValueRangeProcessor always deals with "normal" string values, doesn't have to work around these double encoding issues, and can safely call _term_to_xapian_value(float(term)) again.

@claudep
Copy link
Collaborator

claudep commented Feb 9, 2022

Not sure I understand, do you mean that your _filter_gte()/_filter_lte() changes cannot be decoupled from the rest of your changes?

@PoByBolek
Copy link
Author

Not really, no. Because the XHValueRangeProcessor would have to know whether it's being called from _filter_gte() / _filter_lte() (in which case it shouldn't encode the float and integer terms) or from a user-provided raw range query like float_field:12.3..23.4 (in which case it must encode the float and integer terms).

@claudep
Copy link
Collaborator

claudep commented Feb 9, 2022

Thanks for exploring that. Then I'll plan a global review of this patch as a whole.

@claudep
Copy link
Collaborator

claudep commented Feb 10, 2022

I think the main concern here is compatibility. This will need complete re-indexing of all indexes containing floats and integers. This should be emphasized in a Changelog entry, and this probably also justify bumping to 4.0.0.

@claudep
Copy link
Collaborator

claudep commented Feb 10, 2022

@asedeno, would you mind having a look, too?

@PoByBolek
Copy link
Author

bump

This feels a bit like two years ago... Is there anything else I can do to help move this along?

@claudep
Copy link
Collaborator

claudep commented Mar 29, 2022

As I mentioned in a comment above, the main issue here is backwards compatibility. Indexes with integers/floats need to be entirely rebuilt, which can be a non trivial task on some systems. Therefore, the change must be clearly emphasized in the Changelog.

Copy link
Collaborator

@asedeno asedeno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from compatibility issues @claudep wants to call out in the release notes for a new major version for this, I think it looks good.

We might want some more tests around large integers that can't be represented as doubles to show we know that there are now new issues there, though the (presumed) release notes may be sufficient.

@claudep
Copy link
Collaborator

claudep commented Apr 8, 2022

@PoByBolek would you mind trying to write some release notes for your patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants