-
Notifications
You must be signed in to change notification settings - Fork 49
Description
What happened?
Hello! Great work on the library! I recently found an edge case in the get_raw_interactions() calls, which may or may not be intended, so I'm opening an issue for discussion.
Context
Namely, the get_raw_interactions() method is a couple of wrappers for to the convert_to_external() calls from each identifier, which uses get_from_series_by_index().
get_from_series_by_index() seems to have an explicit assumption that NaNs in the resulting dataframe are caused by the reindex() operation. The operation introduces NaNs when there is a mismatch between item IDs and user IDs:
https://github.com/MobileTeleSystems/RecTools/blob/main/rectools/utils/indexing.py#L123-L127
Edge Case
The reindex() operation isn't the only time NaNs can be introduced, though. For example, if a Dataset has a NaN by itself in its raw form - this exception is raised anyways. For example:
>>> import rectools
>>> import pandas as pd
>>> from rectools.dataset import Dataset
>>> data = {
... 'user_id': [1, 2, 3, None, 5, 1, 2, 3, 4, 5], # There is a None in the User IDs
... 'item_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
... 'rating': [5, 3, 4, 2, 1, 4, 5, 3, 2, 1]
... }
>>> df = pd.DataFrame(data)
>>> df['weight'] = 1
>>> df['datetime'] = pd.to_datetime('2025-02-25')
>>> dataset = Dataset.construct(df)
>>> dataset.get_raw_interactions()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/dataset.py", line 248, in get_raw_interactions
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/interactions.py", line 181, in to_external
Columns.User: user_id_map.convert_to_external(self.df[Columns.User].values),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/identifiers.py", line 221, in convert_to_external
result = get_from_series_by_index(self.to_external, internal, strict, return_missing)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/utils/indexing.py", line 127, in get_from_series_by_index
raise KeyError("Some indices do not exist")
KeyError: 'Some indices do not exist'
While get_from_series_by_index() itself has a very useful strict mode that can be turned off, the get_raw_interactions() method doesn't allow passing this argument down to the external export calls.
Workaround
A simple workaround is to simply re-implement the constituent calls and pass strict=False in a utility module or patch the lib. However, it would be much nicer developer UX if we can simply pass it through the call natively.
Proposal
One backwards-compatible way to allow for this edge case would be to allow **kwargs in the get_raw_interactions() method and pass them to the underlying calls.
Alternatively, it would be nice UX to also raise which indices are missing or provide a bit of extra context in the error message to alleviate part of the debugging.
Thoughts?
Expected behavior
No response
Relevant logs and/or screenshots
No response
Operating System
MacOS
Python Version
3.11
RecTools version
0.8.0