Conversion rules for polars. #14

Merged
merged 22 commits into from
Feb 17, 2024

lgautier
Member

No description provided.

@lgautier lgautier marked this pull request as ready for review December 28, 2023 19:04
@lgautier
Member Author

@Wainberg - are you available to review the PR?

@Wainberg

Thanks for taking a crack at this! I think you'll need some additional logic to handle the conversion between pl.Categorical/pl.Enum and R factors. I seem to remember this was also an issue for the pandas to R conversion.

For what it's worth, here are the functions I've written for my lab members to convert between Python and R data frames.

  • I don't support Enums yet because pl.col(pl.Enum) doesn't work yet and I haven't bothered coming up with a workaround because they're fixing it.
  • I run signal.signal(signal.SIGINT, signal.default_int_handler) after from rpy2.robjects import r because the import messes up the Ctrl + C behavior - it triggers a full traceback. Would be great if you could fix this :)
>>> # before, pressing Ctrl + C
KeyboardInterrupt
>>> from rpy2.robjects import r
>>> # after, pressing Ctrl + C
Traceback (most recent call last):
  File "/home/wainberg/miniforge3/lib/python3.12/site-packages/rpy2/rinterface.py", line 94, in _sigint_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
  • Because polars DataFrames don't have row names but R data.frames do, I give the user the option to specify the row names (when converting polars -> R) or whether to add the row names as the first column of the DataFrame (when converting R -> polars). Converting string columns is like 100-1000x more expensive (I forget the exact amount) than converting numeric columns, so converting the rownames is off by default.
  • If some things look unintuitive or roundabout, it's probably because I benchmarked them and they're more efficient than the more straightforward way.

As a general point, I find it cleaner, more intuitive and less error-prone to import a function that does the conversion (which can then be included in a method chain using df.pipe()) rather than using a with statement and auto-converting. This has always bugged me about rpy2.

Anyway, here's the code:

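import signal

import polars as pl

# Note: series_to_rvector() and rvector_to_series(), used below, are helper
# functions that are not shown in this comment.

def df_to_rdf(df, *, rownames=None):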
    """
    Converts a polars DataFrame to an R data frame via rpy2.
    
    Args:
        df: a polars DataFrame
        rownames: an optional polars Series (or tuple, list, etc.) of rownames
                  for the R data frame; column names will be copied from df

    Returns:
        The corresponding R data frame.
    """
    import rpy2_arrow.arrow as pyra
    from rpy2.robjects import r
    signal.signal(signal.SIGINT, signal.default_int_handler)
    if not isinstance(df, pl.DataFrame):
        raise TypeError(f'input to df_to_rdf should be a polars DataFrame, '
                        f'but you gave it a {type(df)}')
    # Handle categoricals by separating off the categories, converting the
    # codes, and adding the categories back at the end
    # TODO also handle pl.Enum
    categories = df\
        .select(pl.col(pl.Categorical).cat.get_categories().implode())
    df = df.with_columns(pl.col(pl.Categorical).cast(pl.Int32))
    # Convert
    rdf = r['as.data.frame'](pyra.pyarrow_table_to_r_table(df.to_arrow()))
    # Add the categories back
    for col_categories in categories:
        col_index = rdf.colnames.index(col_categories.name)
        rdf[col_index].rclass = 'factor'
        rdf[col_index].slots['levels'] = series_to_rvector(col_categories[0])
    if rownames is not None:
        rdf.rownames = series_to_rvector(pl.Series(rownames))
    return rdf

def rdf_to_df(rdf, *, keep_rownames=False):
    """
    Converts an R data frame to a polars DataFrame via rpy2.
    
    Args:
        rdf: an R data frame
        keep_rownames: if True, adds the R data frame's row names as the first
                       column of the output polars DataFrame (called rownames,
                       or rownames_ if there's already a rownames column)

    Returns:
        The corresponding polars DataFrame.
    """
    import rpy2_arrow.arrow as pyra
    from rpy2.robjects import DataFrame, r
    signal.signal(signal.SIGINT, signal.default_int_handler)
    if not isinstance(rdf, DataFrame):
        raise TypeError(f'input to rdf_to_df should be an R data frame, '
                        f'but you gave it a {type(rdf)}')
    # Remove any classes rvector has (except factor) to avoid conversion errors
    df = {}
    for col_name, rvector in rdf.items():
        original_rclass = rvector.rclass
        try:
            rvector.rclass = 'factor' if 'factor' in rvector.rclass else ()
            df[col_name] = \
                pyra.rarrow_to_py_array(r('arrow::Array$create')(rvector))
        finally:
            # Put the original classes back
            rvector.rclass = original_rclass
    df = pl.DataFrame(df)
    if keep_rownames and rdf.rownames:
        rownames_column_name = 'rownames'
        while rownames_column_name in df:
            rownames_column_name += '_'
        df.insert_column(0, rvector_to_series(rdf.rownames)
                         .rename(rownames_column_name))
    return df

Again, I'd like to reiterate that it would be fabulous if you could incorporate rpy2-arrow directly into rpy2, because then the polars folks would be on board with doing polars_df.to_r() and pl.from_r(r_df). I wish they were less particular about whether rpy2 and rpy2-arrow are two separate libraries versus one, but they seem to care a lot.

@lgautier
Member Author

lgautier commented Jan 7, 2024

> Thanks for taking a crack at this! I think you'll need some additional logic to handle the conversion between pl.Categorical/pl.Enum and R factors. I seem to remember this was also an issue for the pandas to R conversion.

Thanks. I'll look at what you have and what might be missing in the PR (as well as performance).

pandas's Categorical should be handled.
https://github.com/rpy2/rpy2/blob/e0f2155e4857c61b1bf9bed4aecc7650e7ffb6d3/rpy2/robjects/pandas2ri.py#L89
https://github.com/rpy2/rpy2/blob/e0f2155e4857c61b1bf9bed4aecc7650e7ffb6d3/rpy2/robjects/pandas2ri.py#L343

Please open an issue if that is not the case.

> For what it's worth, here are the functions I've written for my lab members to convert between Python and R data frames.
>
> * I don't support Enums yet because `pl.col(pl.Enum)` doesn't work yet and I haven't bothered coming up with a workaround because they're fixing it.

pandas can have features that don't completely work yet. I am also staying away from them as much as possible.

> * I run `signal.signal(signal.SIGINT, signal.default_int_handler)` after `from rpy2.robjects import r` because the import messes up the Ctrl + C behavior - it triggers a full traceback. Would be great if you could fix this :)

Thanks.

IIRC R will modify SIGINT handling during initialization. rpy2 is trying to revert that right after R is initialized (see here) with

def _sigint_handler(sig, frame):
    raise KeyboardInterrupt()

which I had understood to be Python's typical behaviour for KeyboardInterrupt. If Python has a default interrupt handler, using it would work as well.
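
To illustrate the difference under discussion, here is a minimal sketch (not rpy2's code). In CPython, a handler written in Python adds its own frame to the traceback, whereas the built-in `signal.default_int_handler` is implemented in C and adds none, which is likely why the prompt shows a traceback only after the import. The names `python_level_handler` and `frames_in_traceback` are made up for this illustration; the handler body simply copies the shape quoted above.

```python
import signal
import traceback


def python_level_handler(sig, frame):
    # Same shape as the handler quoted above (illustrative copy, not rpy2's code)
    raise KeyboardInterrupt()


def frames_in_traceback(handler):
    """Call a handler directly and list the function names in its traceback."""
    try:
        handler(signal.SIGINT, None)
    except KeyboardInterrupt as exc:
        return [f.name for f in traceback.extract_tb(exc.__traceback__)]
    return []


# The Python-level handler contributes one extra frame to the traceback;
# the built-in default handler does not.
python_frames = frames_in_traceback(python_level_handler)
default_frames = frames_in_traceback(signal.default_int_handler)
print(python_frames)
print(default_frames)
```

Both handlers raise `KeyboardInterrupt`; they differ only in what the traceback contains.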

> * Because polars DataFrames don't have row names but R data.frames do, I give the user the option to specify the row names (when converting polars -> R) or whether to add the row names as the first column of the DataFrame (when converting R -> polars). Converting string columns is like 100-1000x more expensive (I forget the exact amount) than converting numeric columns, so converting the rownames is off by default.

String management in Python and R is different, and IIRC both use some form of caching / pointer strategy to minimize object creation and memory usage when identical strings are present.

> * If some things look unintuitive or roundabout, it's probably because I benchmarked them and they're more efficient than the more straightforward way.

More static, less modular code can bring efficiencies. In the case of rpy2, there is also the notion of convenience vs performance. The robjects layer brings a few convenience features that abstract conversion or make wrappers for R objects more pythonic, but that can come at the cost of performance. The rinterface layer can be used instead in that case. I prefer to have specifics about the performance issue motivating odd-looking code (say, as code comments).

> As a general point, I find it cleaner, more intuitive and less error-prone to import a function that does the conversion (which can then be included in a method chain using df.pipe()) rather than using a with statement and auto-converting. This has always bugged me about rpy2.

Several things here. Conversion rule sets spare users a lot of tedious explicit calling of conversions, and spare them from each re-implementing dispatch or casting logic. You are actually leveraging the default conversion rule set in your code. Even the rinterface-level interface to R applies some conversion rules. Thinking about conversions as a rule set with a built-in dispatch mechanism allows things like "Python list of Python polars DataFrames -> R" to "just work" by adding "Python polars DataFrame -> R" to an existing default rule set handling "Python list -> R".
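
A toy analogy of that dispatch idea, using only the standard library. This is a sketch under stated assumptions, not rpy2's implementation: `FakeFrame` is a hypothetical stand-in for a polars DataFrame, `to_r` is a made-up name, and the "R objects" are just strings.

```python
from functools import singledispatch


class FakeFrame:
    """Hypothetical stand-in for a polars DataFrame."""

    def __init__(self, name):
        self.name = name


@singledispatch
def to_r(obj):
    # Fallback when no rule is registered for the object's type
    raise NotImplementedError(f"no conversion rule for {type(obj).__name__}")


@to_r.register(int)
def _(obj):
    return f"IntVector({obj})"


@to_r.register(list)
def _(obj):
    # The list rule dispatches again on each element
    return "list(" + ", ".join(to_r(item) for item in obj) + ")"


# Adding a single rule for the new type is enough for lists of it to work
@to_r.register(FakeFrame)
def _(obj):
    return f"data.frame<{obj.name}>"


converted = to_r([FakeFrame("a"), FakeFrame("b"), 1])
print(converted)
```

The list rule never needs to know about `FakeFrame`; it only relies on the dispatcher finding a rule for each element.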

The use of a context manager (and the with statement) comes from the fact that conversion rule sets have to be implemented as "global" but should facilitate their handling in a context-dependent fashion.
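
One way to picture "global but context-dependent" is the sketch below, which uses only the standard library. It is an illustration, not rpy2's code: the rule set is a plain dict, and `rules_context` and `convert` are hypothetical names.

```python
import contextlib
import contextvars

# The "global" default rule set: a dict mapping a type to a conversion function
_active_rules = contextvars.ContextVar("active_rules", default={int: str})


@contextlib.contextmanager
def rules_context(extra_rules):
    """Temporarily layer extra rules on top of the currently active ones."""
    merged = {**_active_rules.get(), **extra_rules}
    token = _active_rules.set(merged)
    try:
        yield merged
    finally:
        # Restore whatever rule set was active before entering the block
        _active_rules.reset(token)


def convert(obj):
    """Convert obj with whichever rule set is active at call time."""
    rules = _active_rules.get()
    try:
        rule = rules[type(obj)]
    except KeyError:
        raise TypeError(f"no rule for {type(obj).__name__}") from None
    return rule(obj)


with rules_context({float: lambda x: f"{x:.1f}"}):
    inside = convert(2.0)  # the float rule is active here

try:
    convert(2.0)  # outside the block, only the default rules apply
    outside = "converted"
except TypeError:
    outside = "no rule"

print(inside, outside)
```

Code that calls `convert()` does not change; only the surrounding `with` block decides which rules are in effect.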

I hear you that having to wrap calls in a with block can be annoying in some contexts. However, creating a function like the one you find more intuitive is rather trivial.

def df_to_rdf(df):
    with polars2ri.converter.context() as conversion_ctx:
        return conversion_ctx.py2rpy(df)

That function definition could be in the module with the conversion rule set, so that all use cases are covered.

> Again, I'd like to reiterate that it would be fabulous if you could incorporate rpy2-arrow directly into rpy2, because then the polars folks would be on board with doing polars_df.to_r() and pl.from_r(r_df).

My reticence is about the maintenance of rpy2 with a proliferation of optional dependencies, when the resources to maintain rpy2 are very limited and cross-language dependencies are not handled by package management for Python. In the case of conversion to polars, it relies on an R package that is not even available in the standard package repository for R. I am fine with including it in rpy2_arrow, even though polars will be an optional dependency. At least optional features are contained in thematic packages.

I guess that we are in a situation where there is no perfect solution, but a choice between trade-offs. For example, the pandas model seems to include features that might be optional, or not yet fully functioning, while the jupyter model is to have extension modules. The former can be argued to make it more likely that users will try it, but this can come at the cost of entropy. The latter can help independently maintained extensions develop at a faster pace.

> I wish they were less particular about whether rpy2 and rpy2-arrow are two separate libraries versus one, but they seem to care a lot.

I am also unsure about the rationale for this, when making it an optional dependency in rpy2 would not mean it is used more, or necessarily better maintained.

However, I also don't understand the importance of having to_r() and from_r() be part of polars when something like rpy2_arrow.to_pypolars() or rpy2_arrow.from_rpolars(), or any similar functionality in a package other than rpy2_arrow, would work just as well.

@Wainberg

Wainberg commented Jan 8, 2024

Yeah it's really just about convenience. A surprisingly large number of people aren't even aware that Python-to-R conversion is possible, probably because it's not integrated into pandas/NumPy.

Would be great if you can fix the sigint thing since rpy2's current behavior is quite annoying.

@lgautier
Member Author

lgautier commented Jan 8, 2024

> Yeah it's really just about convenience. A surprisingly large number of people aren't even aware that Python-to-R conversion is possible, probably because it's not integrated into pandas/NumPy.

Speaking of awareness, IIRC you mentioned that you did not know about the ggplot2 and dplyr wrappers in rpy2. IIUC R polars can be used in combination with dplyr in R. This could be used to call the dplyr API from Python while the data is in polars: https://rpy2.github.io/doc/v3.5.x/html/lib_dplyr.html

> Would be great if you can fix the sigint thing since rpy2's current behavior is quite annoying.

I am looking into it. This is tracked here: rpy2/rpy2#1085

@lgautier
Member Author

The issue with a test seems to be caused by an inconsistency or issue with the R package arrow. I reported it (apache/arrow#39603). I will have to find a workaround for that test.

@lgautier
Member Author

I hit another issue. This time it seems to be with the R package polars (reported here: pola-rs/r-polars#725).

@lgautier lgautier merged commit acd28f7 into main Feb 17, 2024
34 checks passed
@lgautier
Member Author

@Wainberg The equivalents of rdf_to_df() and df_to_rdf() are called rpl_to_pl() and pl_to_rpl(), with the intent to make them handle as many polars object types as possible.

This is now merged and will be in release 0.1.0 of the package.

@Wainberg

Awesome!

I tried out pl_to_rpl with a DataFrame (doesn't seem to support polars Series) but got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "rpy2_arrow/polars.py", line 108, in pl_to_rpl
    with polars.converter.context() as conversion_ctx:
         ^^^^^^^^^^^^^^^^
AttributeError: module 'polars' has no attribute 'converter'. Did you mean: 'convert'?

I think you meant with converter.context() instead of with polars.converter.context() here:

def pl_to_rpl(df):
    """Convenience shortcut to convert a polars object to a R polars object."""
    with polars.converter.context() as conversion_ctx:
        return conversion_ctx.py2rpy(df)


def rpl_to_pl(df):
    """Convenience shortcut to convert a R polars object to a polars object."""
    with polars.converter.context() as conversion_ctx:
        return conversion_ctx.rpy2py(df)

@lgautier
Member Author

Thanks. It looks like I thought I had tests, but the tests are not using the functions they were meant to test.
https://github.com/rpy2/rpy2-arrow/blob/main/rpy2_arrow/tests_polars.py#L154

🤦

@lgautier
Member Author

Fix in progress with PR #15

@lgautier
Member Author

Should be fixed in 0.1.1 (now on PyPI).
