Skip to content

get_raw_interactions() NaN assumptions and edge case #269

@DavidLandup0

Description

@DavidLandup0

What happened?

Hello! Great work on the library! I recently found an edge case in the get_raw_interactions() calls, which may or may not be intended, so I'm opening an issue for discussion.

Context

Namely, the get_raw_interactions() method is a couple of wrappers for to the convert_to_external() calls from each identifier, which uses get_from_series_by_index().

get_from_series_by_index() seems to have an explicit assumption that NaNs in the resulting dataframe are caused by the reindex() operation. The operation introduces NaNs when there is a mismatch between item IDs and user IDs:

https://github.com/MobileTeleSystems/RecTools/blob/main/rectools/utils/indexing.py#L123-L127

Edge Case

The reindex() operation isn't the only time NaNs can be introduced, though. For example, if a Dataset has a NaN by itself in its raw form - this exception is raised anyways. For example:

>>> import rectools
>>> import pandas as pd
>>> from rectools.dataset import Dataset

>>> data = {
...     'user_id': [1, 2, 3, None, 5, 1, 2, 3, 4, 5],  # There is a None in the User IDs
...     'item_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
...     'rating': [5, 3, 4, 2, 1, 4, 5, 3, 2, 1]
... }

>>> df = pd.DataFrame(data)
>>> df['weight'] = 1
>>> df['datetime'] = pd.to_datetime('2025-02-25')
>>> dataset = Dataset.construct(df)
>>> dataset.get_raw_interactions()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/dataset.py", line 248, in get_raw_interactions
    return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/interactions.py", line 181, in to_external
    Columns.User: user_id_map.convert_to_external(self.df[Columns.User].values),
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/identifiers.py", line 221, in convert_to_external
    result = get_from_series_by_index(self.to_external, internal, strict, return_missing)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/utils/indexing.py", line 127, in get_from_series_by_index
    raise KeyError("Some indices do not exist")
KeyError: 'Some indices do not exist'

While get_from_series_by_index() itself has a very useful strict mode that can be turned off, the get_raw_interactions() method doesn't allow passing this argument down to the external export calls.

Workaround

A simple workaround is to simply re-implement the constituent calls and pass strict=False in a utility module or patch the lib. However, it would be much nicer developer UX if we can simply pass it through the call natively.

Proposal

One backwards-compatible way to allow for this edge case would be to allow **kwargs in the get_raw_interactions() method and pass them to the underlying calls.

Alternatively, it would be nice UX to also raise which indices are missing or provide a bit of extra context in the error message to alleviate part of the debugging.

Thoughts?

Expected behavior

No response

Relevant logs and/or screenshots

No response

Operating System

MacOS

Python Version

3.11

RecTools version

0.8.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggood first issueGood for newcomers

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions