fix filter and ordering issues with FloatField and IntegerField #186

PoByBolek · 2019-04-26T13:35:34Z

Filtering on FloatFields crashed with a ValueError because the backend tried to convert the already serialised float back into an actual float. Also, in Python 3 xapian.sortable_serialise() returns bytes instead of str, which we can't use any further (because xapian.Query expects a unicode string). So we now store the hex-encoded result of sortable_serialise(), which has the same sort order as the result of sortable_serialise() itself.
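The claim that hex encoding keeps the sort order can be checked without xapian. The sketch below is only an illustration: random byte strings stand in for the output of sortable_serialise(), and hex_encode() is a hypothetical helper name, not something taken from the patch.

```python
import binascii
import random


def hex_encode(raw: bytes) -> str:
    """Return the lowercase hex form of raw as a str (hypothetical helper)."""
    return binascii.hexlify(raw).decode("ascii")


# Random byte strings of varying length stand in for serialised values.
rng = random.Random(0)
samples = [
    bytes(rng.randrange(256) for _ in range(rng.randrange(1, 9)))
    for _ in range(1000)
]

# Sorting the raw bytes and sorting by their hex form give the same order,
# because every byte maps to two characters whose ASCII order matches the
# numeric order of the byte ('0'-'9' sort before 'a'-'f').
sorted_by_bytes = sorted(samples)
sorted_by_hex = sorted(samples, key=hex_encode)
assert sorted_by_bytes == sorted_by_hex
```

This relies on lowercase hex digits; mixing upper and lower case would break the ordering.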

Ordering on IntegerFields with negative values didn't produce the expected results because the integer format produced the following order: -000000000001, -000000000002, -000000000003, 000000000000, 000000000001, 000000000002, 000000000003. By storing integers the same way as floats, i.e. hex-encoding the result of xapian.sortable_serialise(), we get the correct order. Note that sortable_serialise() deals with doubles, but a double can store integers up to 2**53 without any loss of precision.
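A small sketch of why the zero-padded format misorders negative values. The padded() function is an assumption that merely reproduces the strings quoted above; it is not necessarily the exact format string the backend used.

```python
def padded(value: int) -> str:
    """Zero-pad to 12 digits with a leading '-' for negatives (illustrative)."""
    sign = "-" if value < 0 else ""
    return "%s%012d" % (sign, abs(value))


values = [3, -2, 0, -3, 1, -1, 2]

# Comparing the padded strings puts all negatives first (since '-' sorts
# before '0'), but orders them by absolute value, i.e. the wrong way round.
by_string = sorted(values, key=padded)
by_number = sorted(values)

print(by_string)  # [-1, -2, -3, 0, 1, 2, 3]
print(by_number)  # [-3, -2, -1, 0, 1, 2, 3]
```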

I also noticed that Xapian 1.4 allows returning empty strings for begin and end from a ValueRangeProcessor and then turns the VALUE_RANGE query into VALUE_LE and VALUE_GE queries respectively, which would greatly simplify the XHValueRangeProcessor. However, this does not work with Xapian 1.2 (and I didn't check Xapian 1.3).

coveralls · 2019-04-26T14:00:44Z

Coverage decreased (-0.2%) to 97.345% when pulling 62685e2 on PoByBolek:develop into 3fc4cfe on notanumber:master.

PoByBolek · 2019-05-24T11:06:35Z

Any progress on this?

claudep · 2022-02-06T21:50:33Z

@PoByBolek, if you are still interested in working on this issue, we could work together to bring those improvements to the master branch. Note we are now requiring xapian 1.4. And please one PR per type of issue, thanks!

PoByBolek · 2022-02-07T18:36:47Z

It's been a few years since I opened this. But sure, I can give this another shot. Do you want me to create a new PR or update this one to match the current master branch?

claudep · 2022-02-07T18:48:25Z

What works for you will be fine.

this requires a new format for FloatFields: the hex-encoded result of xapian.sortable_serialise(). This is necessary because sortable_serialise() returns bytes in Python 3 which we can't pass to xapian.Query. Note that hex-encoded byte strings have the same sort order as the actual byte strings. Also, use xapian's OP_VALUE_LE and OP_VALUE_GE in the filter_* methods because calling the XHValueRangeProcessor seems too hacky.

this requires storing integers as hex encoded sortable_serialise()d values which should be safe up to 2**53 (see https://en.wikipedia.org/wiki/Double-precision_floating-point_format)

PoByBolek · 2022-02-08T22:33:35Z

xapian_backend.py

-                if not begin:
+                if begin:
+                    if field_type == 'float':
+                        begin = _term_to_xapian_value(float(begin), field_type)
+                    elif field_type == 'integer':
+                        begin = _term_to_xapian_value(int(begin), field_type)
+                else:
                    if field_type == 'text':
                        begin = 'a'  # TODO: A better way of getting a min text value?
-                    elif field_type == 'integer':
-                        begin = -sys.maxsize - 1
-                    elif field_type == 'float':
-                        begin = float('-inf')
+                    elif field_type in ('float', 'integer'):
+                        # floats and ints are both serialised using xapian.sortable_serialise
+                        # so we can use -Infinity as the lower bound for both of them.
+                        begin = _term_to_xapian_value(float('-inf'), field_type)
                    elif field_type == 'date' or field_type == 'datetime':
                        begin = '00010101000000'
-                elif end == '*':
+
+                if end == '*':
                    if field_type == 'text':
                        end = 'z' * 100  # TODO: A better way of getting a max text value?
-                    elif field_type == 'integer':
-                        end = sys.maxsize
-                    elif field_type == 'float':
-                        end = float('inf')
+                    elif field_type in ('float', 'integer'):
+                        # floats and ints are both serialised using xapian.sortable_serialise
+                        # so we can use +Infinity as the upper bound for both of them.
+                        end = _term_to_xapian_value(float('inf'), field_type)
                    elif field_type == 'date' or field_type == 'datetime':
                        end = '99990101000000'
+                else:
+                    if field_type == 'float':
+                        end = _term_to_xapian_value(float(end), field_type)
+                    elif field_type == 'integer':
+                        end = _term_to_xapian_value(int(end), field_type)


I think we can also simplify the logic for the other field types here. ValueRangeProcessors (which should maybe just be RangeProcessors now?) can return empty values to represent open ranges. In that case we wouldn't have to make up arbitrary lower and upper bounds for text and date values.

PoByBolek · 2022-02-08T22:36:46Z

This is now condensed into two commits. But the logic remains the same: we now store integer and float fields as the hex-encoded result of xapian.sortable_serialise(), which preserves order even for negative integer values.

claudep · 2022-02-09T07:39:16Z

Could the OP_VALUE_LE/OP_VALUE_GE change be part of another PR?

PoByBolek · 2022-02-09T09:31:05Z

You mean in the _filter_gte() and _filter_lte() methods? Sure!

PoByBolek · 2022-02-09T21:37:13Z

Hm.... I just realized why I changed the _filter_gte() and _filter_lte() methods to not use the XHValueRangeProcessor anymore.

The XHValueRangeProcessor.__call__() method expects strings as its begin and end parameters, which is why we call _term_to_xapian_value() in the _filter_gte() and _filter_lte() methods. However, the value range processor then calls _term_to_xapian_value() again in certain situations. This is especially problematic for integer and float fields because it would essentially try to encode the terms twice (which was the original problem I had with the float fields). Even if float(_term_to_xapian_value(term)) didn't raise a ValueError, it would definitely not be the same value anymore, i.e. float(term) != float(_term_to_xapian_value(term)).
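The double-encoding problem can be reproduced without xapian. In the sketch below, fake_sortable_serialise() is a stand-in that I am assuming for illustration: it is an order-preserving 8-byte encoding of a double, NOT xapian's real format, and encode() and try_float() are hypothetical helpers.

```python
import binascii
import struct


def fake_sortable_serialise(value: float) -> bytes:
    """Order-preserving encoding of a double (stand-in, not xapian's format)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & (1 << 63):
        bits ^= 0xFFFFFFFFFFFFFFFF  # negative: flip every bit
    else:
        bits |= 1 << 63  # non-negative: set the sign bit
    return struct.pack(">Q", bits)


def encode(value: float) -> str:
    """Hex-encode the serialised value so it can be used as a str."""
    return binascii.hexlify(fake_sortable_serialise(value)).decode("ascii")


def try_float(text: str):
    """Return float(text), or None if text is not a valid float literal."""
    try:
        return float(text)
    except ValueError:
        return None


# Encoding once keeps the numeric order.
assert encode(-1.0) < encode(0.0) < encode(12.3)

# Feeding an encoded term back through float() either fails outright ...
assert try_float(encode(12.3)) is None
# ... or silently yields a completely different number.
assert try_float(encode(0.0)) == 8e15
```

Either outcome means a term must be encoded exactly once, which is what the change to the filter methods is meant to guarantee.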

To avoid these double encoding issues I changed the _filter_gte() and _filter_lte() methods to use the OP_VALUE_GE and OP_VALUE_LE queries directly and leave the XHValueRangeProcessor for raw query uses (maybe?), i.e. for situations where the user provides a range query like float_field:12.3..23.4. In those situations I wouldn't expect the user to know our internal encoding for the float and integer fields and just provide "normal" string formatted values. This means that the XHValueRangeProcessor always deals with "normal" string values, doesn't have to work around these double encoding issues, and can safely call _term_to_xapian_value(float(term)) again.

claudep · 2022-02-09T21:42:06Z

Not sure I understand, do you mean that your _filter_gte()/_filter_lte() changes cannot be decoupled from the rest of your changes?

PoByBolek · 2022-02-09T21:51:01Z

Not really, no. Because the XHValueRangeProcessor would have to know whether it's being called from _filter_gte() / _filter_lte() (in which case it shouldn't encode the float and integer terms) or from a user-provided raw range query like float_field:12.3..23.4 (in which case it must encode the float and integer terms).

claudep · 2022-02-09T21:53:01Z

Thanks for exploring that. Then I'll plan a global review of this patch as a whole.

claudep · 2022-02-10T19:55:44Z

I think the main concern here is compatibility. This will need complete re-indexing of all indexes containing floats and integers. This should be emphasized in a Changelog entry, and this probably also justifies bumping to 4.0.0.

claudep · 2022-02-10T19:56:32Z

@asedeno, would you mind having a look, too?

PoByBolek · 2022-03-28T21:01:38Z

bump

This feels a bit like two years ago... Is there anything else I can do to help move this along?

claudep · 2022-03-29T06:47:06Z

As I mentioned in a comment above, the main issue here is backwards compatibility. Indexes with integers/floats need to be entirely rebuilt, which can be a non-trivial task on some systems. Therefore, the change must be clearly emphasized in the Changelog.

asedeno

Aside from the compatibility issues that @claudep wants to call out in the release notes for a new major version, I think it looks good.

We might want some more tests around large integers that can't be represented as doubles to show we know that there are now new issues there, though the (presumed) release notes may be sufficient.
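A minimal sketch of what such a test could assert, using only plain Python floats (which are doubles) rather than the backend itself. The survives_double() helper is hypothetical.

```python
LIMIT = 2 ** 53


def survives_double(value: int) -> bool:
    """True if value converts to a double and back without changing."""
    return int(float(value)) == value


# Every integer up to 2**53 in magnitude round-trips exactly.
assert survives_double(LIMIT)
assert survives_double(-LIMIT)

# The next integer does not: it collapses onto its neighbour, so two
# distinct integers would be stored (and sorted) as the same value.
assert not survives_double(LIMIT + 1)
assert float(LIMIT) == float(LIMIT + 1)
```

Some larger integers (for example even ones just above the limit) still round-trip, so the issue is that exactness is no longer guaranteed, not that every large value is wrong.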

claudep · 2022-04-08T16:14:28Z

@PoByBolek would you mind trying to write some release notes for your patch?

PoByBolek added 2 commits February 8, 2022 22:56

fix order_by for IntegerFields with negative values

62685e2

PoByBolek force-pushed the develop branch from 8395a96 to 62685e2 Compare February 8, 2022 22:26

claudep mentioned this pull request Apr 6, 2022

GitHub Actions: rework test matrix (drop Django 2.2, py3.9) #225

Open

asedeno approved these changes Apr 6, 2022
