Management of 'empty' index query #35

Open
sgeulette opened this issue Mar 21, 2018 · 19 comments

Comments

@sgeulette

Hi,
Since version 3.0 it is possible to search with 'not', which is great! However, items with an empty index value are not included in the results.
It would be nice to also be able to query for empty index values in the catalog,
and then combine the two kinds of query if needed.

What do you think about this improvement?
Regards

@andbag
Member

andbag commented Mar 21, 2018

Could you please define an example for a better understanding?

@sgeulette
Author

Take a KeywordIndex ('tags') that stores, for example, tags ;-).
If I want to find brains without any tag, I cannot query on [] or None.
One workaround is to store a special value such as ['no_tag'] in the index and query on that value.
I think it would be better if the catalog handled this itself and allowed querying on empty values.
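
The workaround described above can be sketched in plain Python. This is only an illustration: `ToyKeywordIndex`, `keywords_to_index` and `NO_TAG` are hypothetical names, and the class is a stand-in for a keyword index, not the real KeywordIndex API.

```python
from collections import defaultdict

NO_TAG = 'no_tag'  # sentinel keyword from the comment; the name is arbitrary


def keywords_to_index(tags):
    """Return the keywords to store, substituting the sentinel for an empty value."""
    return list(tags) if tags else [NO_TAG]


class ToyKeywordIndex:
    """Plain-Python stand-in for a keyword index (hypothetical, not KeywordIndex)."""

    def __init__(self):
        self._index = defaultdict(set)  # keyword -> set of document ids

    def index_doc(self, docid, tags):
        for keyword in keywords_to_index(tags):
            self._index[keyword].add(docid)

    def search(self, keyword):
        # .get() avoids creating an entry for unknown keywords
        return set(self._index.get(keyword, ()))


index = ToyKeywordIndex()
index.index_doc(1, ['zope', 'plone'])
index.index_doc(2, [])      # empty tag list
index.index_doc(3, None)    # no tags at all

untagged = index.search(NO_TAG)  # documents 2 and 3
```

The cost of this approach is that every application has to agree on the sentinel and keep it out of the real tag vocabulary, which is the reason for asking the catalog to handle it.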

@vincentfretin
Member

I guess you'll need a TreeSet of rids for non indexed docs in the index, like it is done for example in hypatia FieldIndex (used by substanced)
https://github.com/Pylons/hypatia/blob/master/hypatia/field/__init__.py#L94
See my example here:
Pylons/hypatia#9 (comment)
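
A minimal sketch of that idea, assuming a plain Python `set` in place of a persistent TreeSet. The class and method names (`ToyFieldIndex`, `not_indexed`) are hypothetical and only loosely inspired by the linked approach; this is not hypatia's or ZCatalog's code.

```python
class ToyFieldIndex:
    """Toy field index that also remembers rids seen without a value."""

    def __init__(self):
        self._fwd = {}             # value -> set of rids
        self._rev = {}             # rid -> value
        self._not_indexed = set()  # rids without a value (a TreeSet in a real index)

    def index_doc(self, rid, value):
        self.unindex_doc(rid)      # drop any earlier state for this rid
        if value is None:
            self._not_indexed.add(rid)
            return
        self._fwd.setdefault(value, set()).add(rid)
        self._rev[rid] = value

    def unindex_doc(self, rid):
        self._not_indexed.discard(rid)
        value = self._rev.pop(rid, None)
        if value is not None:
            rids = self._fwd[value]
            rids.discard(rid)
            if not rids:
                del self._fwd[value]

    def search(self, value):
        return set(self._fwd.get(value, ()))

    def not_indexed(self):
        return set(self._not_indexed)


idx = ToyFieldIndex()
idx.index_doc(1, 'a')
idx.index_doc(2, None)
```

Because the index itself records the rids it was asked to index without a value, it can answer the "empty" query without looking at the enclosing catalog.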

@andbag
Member

andbag commented Mar 21, 2018

You can test this branch

https://github.com/zopefoundation/Products.ZCatalog/tree/not-parm-patch

Queries on [] should work. If that's what you needed, I'll create a PR.

@icemac
Member

icemac commented Jun 7, 2018

@sgeulette Were you able to try the branch @andbag suggested?

@sgeulette
Author

I couldn't test it in Plone.

@d-maurer
Contributor

d-maurer commented Mar 8, 2019

It would be nice to also be able to query for empty index values in the catalog.

It would be difficult to implement this at the index level: an index knows only about the documents it has indexed itself, not about all the documents known to the catalog. What you are asking for are the documents known to the catalog but not known to the index.

Products.AdvancedQuery allows you to formulate queries like this via ~ Indexed(index) (i.e. search for the documents not indexed by index).
As ZCatalog does not have a general "not", you would need a new index query parameter telling the index to look into the enclosing catalog and determine the set of its known objects. Making assumptions about the enclosing catalog (and how to determine its known objects) is not nice -- at least on the conceptual level.
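
Conceptually, the requested result is a set difference between what the catalog knows and what the index knows. The sketch below uses plain Python sets and a hypothetical helper name; it makes no claim about how ZCatalog or Products.AdvancedQuery compute this internally.

```python
def docs_without_index_value(catalog_rids, index_rids):
    """Return rids known to the catalog but absent from the index."""
    return set(catalog_rids) - set(index_rids)


catalog_rids = {1, 2, 3, 4}   # every document the catalog knows
tags_index_rids = {1, 3}      # documents the 'tags' index has a value for

missing = docs_without_index_value(catalog_rids, tags_index_rids)
```

The difficulty raised in the comment is visible here: the first argument is catalog state, which an index on its own does not have.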

@andbag
Member

andbag commented Apr 13, 2019

@d-maurer by the way, you identified a bug. The same argument applies to the 'pure not' operation. The index can currently only return documents that the index knows. Objects without a value (== None) belong to the result set of a 'pure not' operation.

@d-maurer
Contributor

d-maurer commented Apr 13, 2019 via email

@andbag
Member

andbag commented Apr 14, 2019

I think it's more of a bug than a feature. 'not' support was added on 25 Mar 2012, and indexing of objects with an empty value was disabled due to a BTrees 4.0+ compatibility problem on 2 Nov 2014 (two years later). There must have been a phase in between in which BTrees (<4.0) were used; otherwise the fix would surely have been added earlier. If a unit test had existed for this case, the bug would have been exposed. Since no unit test exists for this case, the bug has not been detected.

@d-maurer as vincentfretin suggests, you can collect the document ids with empty values separately in the index. My current incomplete branch follows this idea. What's your opinion?

@d-maurer
Contributor

d-maurer commented Apr 14, 2019 via email

@d-maurer
Contributor

d-maurer commented Apr 15, 2019 via email

@andbag
Member

andbag commented Apr 15, 2019

@d-maurer I suggest that UnIndex should generally support this feature, with a way to disable or enable it. Then we should decide which indexes derived from UnIndex should support the feature by default. Even for debugging, it would be helpful to be able to use the catalog to quickly identify those objects that have no value set for an index. However, I currently have no idea what to name the query option to disable or enable the feature.

@d-maurer
Contributor

However, I currently have no idea what to name the query option to disable or enable the feature.

@andbag I suggest modeling this not via a query option but via a special (search) "term" (aka "key"). This way, it could be combined with "normal" "term"s via and, or and not. This also reflects the behaviour of some indexes in their [un]index_object: they use a "_marker" to represent the case that an object has no value for a given index. This "_marker" could become global and part of the official interface to represent "the index has no value for the object".
Drawback: the feature would be available only to Python code, not directly to through-the-web queries (as the special term has no natural textual representation which could easily be used in a web form).

None could be a natural choice to represent the case "no value for this index" (I have chosen this for Products.ManageableIndex). However, there might be indexes around which use None as a meaningful object value, and those may break if we used None for the new purpose. Therefore, I suggest introducing a special marker object.

Indexes which support the new feature could be marked with an interface, maybe IIndexingMissingValue:
Products.PluginIndexes.interfaces:

...
MissingValue = object()  # can be used as query "term" to query for objects the index does not have a value for.

class IIndexingMissingValue(Interface):
    """Marker interface for indexes which support the `MissingValue` query term."""
...
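
How such a marker term might behave inside an index can be sketched as follows. This is a toy under stated assumptions: `ToyIndex` and its methods are hypothetical, terms are combined with "or", and a local `MissingValue = object()` stands in for the proposed marker, which does not exist in Products.PluginIndexes as far as this thread shows.

```python
MissingValue = object()  # local stand-in for the proposed marker object


class ToyIndex:
    """Toy keyword index accepting MissingValue as a query term (hypothetical)."""

    def __init__(self):
        self._index = {}       # keyword -> set of rids
        self._missing = set()  # rids for which the object had no value

    def index_doc(self, rid, keywords):
        if keywords is None:
            self._missing.add(rid)
            return
        for keyword in keywords:
            self._index.setdefault(keyword, set()).add(rid)

    def query(self, terms):
        """Return the union ('or') of the rids matching each term."""
        result = set()
        for term in terms:
            if term is MissingValue:  # identity check, so None stays a normal value
                result |= self._missing
            else:
                result |= self._index.get(term, set())
        return result


index = ToyIndex()
index.index_doc(1, ['f'])
index.index_doc(2, ['g'])
index.index_doc(3, None)
```

Comparing with `is` rather than `==` is what lets an index keep using None as a meaningful value, which is the concern raised above.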

@andbag
Member

andbag commented Apr 16, 2019

@d-maurer I am currently experimenting with the IIndexingMissingValue interface. If KeywordIndex implements this interface, should the index treat MissingValue implicitly or explicitly? There are some special queries for which I would like to know what results are expected from the search, e.g.

q1 = {'query': ['f'], 'not': ['f']}

q2 = {'query': ['f', 'g'], 'not': ['f']}

q3 = {'not': ['f']}

Should the results here contain implicitly items with MissingValue? Or do I have to explicitly specify the term 'MissingValue' in the query?

q1 = {'query': [MissingValue]} etc.

@sgeulette
Author

hi,

I think q3 should return MissingValue brains too, like "pure not" mentioned before.
{'not': ['f', MissingValue]} should explicitly be used if MissingValue is not desired.
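
The semantics proposed in this comment can be sketched with a toy function. Everything here is illustrative: `pure_not` is a hypothetical helper, documents are modelled as a dict of rid to keyword list (None meaning "no value"), and `MissingValue` is a local stand-in for the marker discussed above.

```python
MissingValue = object()  # local stand-in for the proposed marker


def pure_not(docs, excluded):
    """Return rids of documents matching none of the excluded terms.

    Documents without a value are included unless MissingValue is excluded.
    """
    exclude_missing = any(term is MissingValue for term in excluded)
    result = set()
    for rid, keywords in docs.items():
        if keywords is None:
            if not exclude_missing:
                result.add(rid)
        elif not any(keyword in excluded for keyword in keywords):
            result.add(rid)
    return result


docs = {1: ['f'], 2: ['g'], 3: ['f', 'g'], 4: None}

implicit = pure_not(docs, ['f'])                # missing-value docs included
explicit = pure_not(docs, ['f', MissingValue])  # missing-value docs excluded
```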

regards

@d-maurer
Contributor

d-maurer commented Apr 17, 2019 via email

@andbag
Member

andbag commented May 2, 2019

@d-maurer I still have questions. How should the search for the empty set be defined? And which variations should be supported? Which results are expected? Examples:

q1 = {'query': []}
Should return all items with empty sets, but not items with MissingValue.

q2 = {'query': [(), 'a']}
Should return all items with empty sets and items with keyword 'a'.

q3 = {'not': []}
Should return all items with sets (length greater than 0) and consequently items with MissingValue.

q4 = {'not': [(), MissingValue]}
Should return all items with sets (length greater than 0) but not items with MissingValue.

I'm not sure if this is the right syntax to query for items with empty set.
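
To make the four expectations above concrete, here is a toy evaluator over three kinds of documents: with keywords, with an empty set, and with no value. It rests on assumptions that the thread leaves open: an empty `query` or `not` list is treated as the single term `()`, `()` as a term means "empty set", and `MissingValue` is a local stand-in. The names `search`, `matches` and `normalize` are hypothetical.

```python
MissingValue = object()  # local stand-in for the proposed marker
EMPTY = ()               # assumed term meaning "indexed with an empty set"


def normalize(terms):
    """Assumption: an empty term list is shorthand for [EMPTY]."""
    terms = list(terms)
    return terms if terms else [EMPTY]


def matches(value, terms):
    """value is None (no value) or a sequence of keywords."""
    for term in terms:
        if term is MissingValue:
            if value is None:
                return True
        elif term == EMPTY:
            if value is not None and len(value) == 0:
                return True
        elif value is not None and term in value:
            return True
    return False


def search(docs, query):
    result = set(docs)
    if 'query' in query:
        terms = normalize(query['query'])
        result = {rid for rid in result if matches(docs[rid], terms)}
    if 'not' in query:
        terms = normalize(query['not'])
        result = {rid for rid in result if not matches(docs[rid], terms)}
    return result


docs = {1: ['a'], 2: ['b'], 3: [], 4: None}

r1 = search(docs, {'query': []})                    # empty sets only
r2 = search(docs, {'query': [(), 'a']})             # empty sets or keyword 'a'
r3 = search(docs, {'not': []})                      # non-empty sets and missing values
r4 = search(docs, {'not': [(), MissingValue]})      # non-empty sets only
```

Under these assumptions the results agree with the expectations listed for q1 to q4; a different choice of syntax for the empty set would only change `normalize` and the `EMPTY` term.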

@d-maurer
Contributor

d-maurer commented May 2, 2019 via email


5 participants