Conversion rules for polars. #14

lgautier · 2023-12-28T16:55:45Z

No description provided.

lgautier · 2023-12-30T17:36:22Z

@Wainberg - are you available to review the PR?

Wainberg · 2023-12-30T20:28:57Z

Thanks for taking a crack at this! I think you'll need some additional logic to handle the conversion between pl.Categorical/pl.Enum and R factors. I seem to remember this was also an issue for the pandas to R conversion.

For what it's worth, here are the functions I've written for my lab members to convert between Python and R data frames.

I don't support Enums yet because pl.col(pl.Enum) doesn't work yet and I haven't bothered coming up with a workaround because they're fixing it.
I run signal.signal(signal.SIGINT, signal.default_int_handler) after from rpy2.robjects import r because the import messes up the Ctrl + C behavior - it triggers a full traceback. Would be great if you could fix this :)

>>> # before, pressing Ctrl + C
KeyboardInterrupt
>>> from rpy2.robjects import r
>>> # after, pressing Ctrl + C
Traceback (most recent call last):
  File "/home/wainberg/miniforge3/lib/python3.12/site-packages/rpy2/rinterface.py", line 94, in _sigint_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
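
The workaround mentioned above can be tried without rpy2. The following is a minimal sketch using only the standard `signal` module; `noisy_handler` is a hypothetical stand-in for whatever handler an import might install, not rpy2's actual code.

```python
import signal


def noisy_handler(signum, frame):
    # Hypothetical stand-in for a handler installed by a third-party import.
    raise KeyboardInterrupt()


# Simulate a library replacing the SIGINT handler at import time.
signal.signal(signal.SIGINT, noisy_handler)
replaced = signal.getsignal(signal.SIGINT)

# The workaround: put Python's default SIGINT handler back.
signal.signal(signal.SIGINT, signal.default_int_handler)
restored = signal.getsignal(signal.SIGINT)
```

Note that `signal.signal()` can only be called from the main thread of the main interpreter.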

Because polars DataFrames don't have row names but R data.frames do, I give the user the option to specify the row names (when converting polars -> R) or whether to add the row names as the first column of the DataFrame (when converting R -> polars). Converting string columns is like 100-1000x more expensive (I forget the exact amount) than converting numeric columns, so converting the rownames is off by default.
If some things look unintuitive or roundabout, it's probably because I benchmarked them and they're more efficient than the more straightforward way.

As a general point, I find it cleaner, more intuitive and less error-prone to import a function that does the conversion (which can then be included in a method chain using df.pipe()) rather than using a with statement and auto-converting. This has always bugged me about rpy2.

Anyway, here's the code:

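import signal

import polars as pl

# series_to_rvector() and rvector_to_series() are helpers that are not shown
# in this comment.

def df_to_rdf(df, *, rownames=None):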
    """
    Converts a polars DataFrame to an R data frame via rpy2.
    
    Args:
        df: a polars DataFrame
        rownames: an optional polars Series (or tuple, list, etc.) of rownames
                  for the R data frame; column names will be copied from df

    Returns:
        The corresponding R data frame.
    """
    import rpy2_arrow.arrow as pyra
    from rpy2.robjects import r
    signal.signal(signal.SIGINT, signal.default_int_handler)
    if not isinstance(df, pl.DataFrame):
        raise TypeError(f'input to df_to_rdf should be a polars DataFrame, '
                        f'but you gave it a {type(df)}')
    # Handle categoricals by separating off the categories, converting the
    # codes, and adding the categories back at the end
    # TODO also handle pl.Enum
    categories = df\
        .select(pl.col(pl.Categorical).cat.get_categories().implode())
    df = df.with_columns(pl.col(pl.Categorical).cast(pl.Int32))
    # Convert
    rdf = r['as.data.frame'](pyra.pyarrow_table_to_r_table(df.to_arrow()))
    # Add the categories back
    for col_categories in categories:
        col_index = rdf.colnames.index(col_categories.name)
        rdf[col_index].rclass = 'factor'
        rdf[col_index].slots['levels'] = series_to_rvector(col_categories[0])
    if rownames is not None:
        rdf.rownames = series_to_rvector(pl.Series(rownames))
    return rdf

def rdf_to_df(rdf, *, keep_rownames=False):
    """
    Converts an R data frame to a polars DataFrame via rpy2.
    
    Args:
        rdf: an R data frame
        keep_rownames: if True, adds the R data frame's row names as the
                       first column of the output polars DataFrame (called
                       rownames, or rownames_ if there's already a rownames
                       column)

    Returns:
        The corresponding polars DataFrame.
    """
    import rpy2_arrow.arrow as pyra
    from rpy2.robjects import DataFrame, r
    signal.signal(signal.SIGINT, signal.default_int_handler)
    if not isinstance(rdf, DataFrame):
        raise TypeError(f'input to rdf_to_df should be an R data frame, '
                        f'but you gave it a {type(rdf)}')
    # Remove any classes rvector has (except factor) to avoid conversion errors
    df = {}
    for col_name, rvector in rdf.items():
        original_rclass = rvector.rclass
        try:
            rvector.rclass = 'factor' if 'factor' in rvector.rclass else ()
            df[col_name] = \
                pyra.rarrow_to_py_array(r('arrow::Array$create')(rvector))
        finally:
            # Put the original classes back
            rvector.rclass = original_rclass
    df = pl.DataFrame(df)
    if keep_rownames and rdf.rownames:
        rownames_column_name = 'rownames'
        while rownames_column_name in df:
            rownames_column_name += '_'
        df.insert_column(0, rvector_to_series(rdf.rownames)
                         .rename(rownames_column_name))
    return df

Again, I'd like to reiterate that it would be fabulous if you could incorporate rpy2-arrow directly into rpy2, because then the polars folks would be on board with doing polars_df.to_r() and pl.from_r(r_df). I wish they were less particular about whether rpy2 and rpy2-arrow are two separate libraries versus one, but they seem to care a lot.

lgautier · 2024-01-07T21:43:40Z

> Thanks for taking a crack at this! I think you'll need some additional logic to handle the conversion between pl.Categorical/pl.Enum and R factors. I seem to remember this was also an issue for the pandas to R conversion.

Thanks. I'll look at what you have and what might be missing in the PR (as well as performance).

pandas's Categorical should be handled.
https://github.com/rpy2/rpy2/blob/e0f2155e4857c61b1bf9bed4aecc7650e7ffb6d3/rpy2/robjects/pandas2ri.py#L89
https://github.com/rpy2/rpy2/blob/e0f2155e4857c61b1bf9bed4aecc7650e7ffb6d3/rpy2/robjects/pandas2ri.py#L343

Please open an issue if that is not the case.

> For what it's worth, here are the functions I've written for my lab members to convert between Python and R data frames.
> * I don't support Enums yet because `pl.col(pl.Enum)` doesn't work yet and I haven't bothered coming up with a workaround because they're fixing it.

pandas can have features that don't completely work yet. I am also staying away from them as much as possible.

> * I run `signal.signal(signal.SIGINT, signal.default_int_handler)` after `from rpy2.robjects import r` because the import messes up the Ctrl + C behavior - it triggers a full traceback. Would be great if you could fix this :)

Thanks.

IIRC R will modify SIGINT handling during initialization. rpy2 is trying to revert that right after R is initialized (see here) with

def _sigint_handler(sig, frame):
    raise KeyboardInterrupt()

which I had understood to be Python's typical behaviour for KeyboardInterrupt. If there is a default interruption handler in Python, using it would work as well.

> * Because polars DataFrames don't have row names but R data.frames do, I give the user the option to specify the row names (when converting polars -> R) or whether to add the row names as the first column of the DataFrame (when converting R -> polars). Converting string columns is like 100-1000x more expensive (I forget the exact amount) than converting numeric columns, so converting the rownames is off by default.

String management in Python and R is different, and IIRC both use some form of caching / pointer strategy to minimize object creation and memory usage when identical strings are present.
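
As a small illustration of the Python side of that remark only, here is a sketch of explicit string interning with the standard library. It makes no claim about how rpy2 or R handle strings during conversion.

```python
import sys

# Build two equal strings at runtime so they are not merged at compile time.
first = sys.intern("".join(["gene", "_", "a"]))
second = sys.intern("".join(["gene", "_", "a"]))

# After interning, both names refer to one shared string object.
same_object = first is second
```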

> * If some things look unintuitive or roundabout, it's probably because I benchmarked them and they're more efficient than the more straightforward way.

More static, less modular code can bring efficiencies. In the case of rpy2, there is also the notion of convenience vs performance. The robjects layer brings a few convenience features that abstract conversion or make wrappers for R objects more pythonic, but that can come at the cost of performance. The rinterface layer can then be used instead. I prefer to have specifics about the performance issue motivating odd-looking code (say, as code comments).

> As a general point, I find it cleaner, more intuitive and less error-prone to import a function that does the conversion (which can then be included in a method chain using df.pipe()) rather than using a with statement and auto-converting. This has always bugged me about rpy2.

Several things here. Conversion rule sets spare users a lot of tedious explicit conversion calls, and spare everyone from re-implementing dispatch or casting logic. You are actually leveraging the default conversion rule set in your code. Even the rinterface-level interface to R applies some conversion. Thinking about conversions as a rule set with a built-in dispatch mechanism allows things like "Python list of Python polars DataFrames -> R" to "just work" by adding "Python polars DataFrame -> R" to an existing default rule set handling "Python list -> R".
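
The "just work" point can be illustrated with a toy rule set. This is a sketch built on `functools.singledispatch`, not rpy2's implementation; `Frame`, `py2r`, and the tuple-based "R objects" are all hypothetical stand-ins.

```python
from functools import singledispatch


class Frame:
    """Hypothetical stand-in for a data frame type (not polars)."""

    def __init__(self, columns):
        self.columns = list(columns)


@singledispatch
def py2r(obj):
    # Fallback when no rule is registered for the type.
    raise NotImplementedError(f"no conversion rule for {type(obj).__name__}")


@py2r.register(int)
def _(obj):
    return ("r_integer", obj)


@py2r.register(list)
def _(obj):
    # The list rule dispatches again on each element.
    return ("r_list", [py2r(item) for item in obj])


# Adding a single rule for Frame is enough for lists of Frames to convert.
@py2r.register(Frame)
def _(obj):
    return ("r_data_frame", obj.columns)


converted = py2r([Frame(["a", "b"]), 3])
```

Because the list rule re-dispatches on its elements, no dedicated "list of Frame" rule is needed.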

The use of a context manager (and the with statement) comes from the fact that conversion rule sets have to be implemented as "global" state, but should still be adjustable in a context-dependent fashion.
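
One way to picture "global but context-dependent" is a context variable holding the active rules, with a context manager that swaps them temporarily. The sketch below is a simplified illustration under that assumption; `rules_context`, `convert`, and the rule table are hypothetical names, not rpy2's API.

```python
import contextlib
import contextvars

# Globally visible default rule set: type -> conversion function.
_active_rules = contextvars.ContextVar("active_rules", default={int: str})


@contextlib.contextmanager
def rules_context(extra_rules):
    # Layer extra rules on top of the current ones for the duration of the block.
    merged = {**_active_rules.get(), **extra_rules}
    token = _active_rules.set(merged)
    try:
        yield merged
    finally:
        # Restore whatever rule set was active before entering the block.
        _active_rules.reset(token)


def convert(obj):
    rule = _active_rules.get().get(type(obj))
    if rule is None:
        raise TypeError(f"no rule for {type(obj).__name__}")
    return rule(obj)


with rules_context({float: round}):
    inside = convert(2.6)  # the float rule is active here

try:
    convert(2.6)  # outside the block, the float rule is gone
    outside_ok = True
except TypeError:
    outside_ok = False
```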

I hear you that having to wrap calls in a with block can be annoying in some contexts. However, creating a function like the one you find more intuitive is rather trivial.

def df_to_rdf(df):
    with polars2ri.converter.context() as conversion_ctx:
        return conversion_ctx.py2rpy(df)

That function definition could be in the module with the conversion rule set; this way all use cases are covered.

> Again, I'd like to reiterate that it would be fabulous if you could incorporate rpy2-arrow directly into rpy2, because then the polars folks would be on board with doing polars_df.to_r() and pl.from_r(r_df).

My reticence is around the maintenance of rpy2: a proliferation of optional dependencies is hard to sustain when the resources to maintain rpy2 are very limited and cross-language dependencies are not handled by package management for Python. In the case of conversion to polars, it relies on an R package that is not even available in the standard package repository for R. I am fine with including it in rpy2_arrow, even though polars will be an optional dependency. At least optional features are then contained in thematic packages.

I guess that we are in a situation where there is no perfect solution, only a choice between trade-offs. For example, the pandas model seems to include features that might be optional, or not yet fully functioning, while the jupyter model is to have extension modules. The former can be argued to make it more likely that users will try a feature, but this can come at the cost of entropy. The latter can help independently maintained extensions develop at a faster pace.

> I wish they were less particular about whether rpy2 and rpy2-arrow are two separate libraries versus one, but they seem to care a lot.

I am also unsure about the rationale for this, since being an optional dependency in rpy2 would not mean it is more used, or necessarily better maintained.

However, I also don't understand why it is important to have those to_r() and from_r() be part of polars when something like rpy2_arrow.to_pypolars() or rpy2_arrow.from_rpolars(), or similar functionality in a package other than rpy2_arrow, would work just as well.

Wainberg · 2024-01-08T01:12:08Z

Yeah it's really just about convenience. A surprisingly large number of people aren't even aware that Python-to-R conversion is possible, probably because it's not integrated into pandas/NumPy.

Would be great if you can fix the sigint thing since rpy2's current behavior is quite annoying.

lgautier · 2024-01-08T01:45:34Z

> Yeah it's really just about convenience. A surprisingly large number of people aren't even aware that Python-to-R conversion is possible, probably because it's not integrated into pandas/NumPy.

Speaking of awareness, IIRC you mentioned that you did not know about the ggplot2 and dplyr wrappers in rpy2. IIUC R polars can be used in combination with dplyr in R. This could be used to access the dplyr API from Python while the data is in polars: https://rpy2.github.io/doc/v3.5.x/html/lib_dplyr.html

> Would be great if you can fix the sigint thing since rpy2's current behavior is quite annoying.

I am looking into it. This is tracked here: rpy2/rpy2#1085

lgautier · 2024-01-14T21:24:15Z

The issue with a test seems to be caused by an inconsistency or issue in the R package arrow. I reported it (apache/arrow#39603). I will have to find a workaround for that test.

lgautier · 2024-01-20T18:37:06Z

I hit another issue. This time it seems to be with the R package polars (reported here: pola-rs/r-polars#725).

See pola-rs/r-polars#725 (comment). and pola-rs/r-polars#728.

lgautier · 2024-02-18T16:35:23Z

@Wainberg The equivalent to rdf_to_df() and df_to_rdf() is called rpl_to_pl() and pl_to_rpl() with the intent to make them handle as many polars object types as possible.

This is now merged and will be in the release 0.1.0 of the package.

Wainberg · 2024-02-20T01:59:16Z

Awesome!

I tried out pl_to_rpl with a DataFrame (doesn't seem to support polars Series) but got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "rpy2_arrow/polars.py", line 108, in pl_to_rpl
    with polars.converter.context() as conversion_ctx:
         ^^^^^^^^^^^^^^^^
AttributeError: module 'polars' has no attribute 'converter'. Did you mean: 'convert'?

I think you meant with converter.context() instead of with polars.converter.context() here:

def pl_to_rpl(df):
    """Convenience shortcut to convert a polars object to a R polars object."""
    with polars.converter.context() as conversion_ctx:
        return conversion_ctx.py2rpy(df)


def rpl_to_pl(df):
    """Convenience shortcut to convert a R polars object to a polars object."""
    with polars.converter.context() as conversion_ctx:
        return conversion_ctx.rpy2py(df)

lgautier · 2024-02-20T04:13:38Z

Thanks. It looks like I thought I had tests, but the tests are not using the functions they were meant to test.
https://github.com/rpy2/rpy2-arrow/blob/main/rpy2_arrow/tests_polars.py#L154

🤦

lgautier · 2024-02-20T04:20:27Z

> Thanks. It looks like I thought I had tests, but the tests are not using the functions they were meant to test. https://github.com/rpy2/rpy2-arrow/blob/main/rpy2_arrow/tests_polars.py#L154
>
> 🤦

Fix in progress with PR #15

lgautier · 2024-02-20T13:52:52Z

Should be fixed in 0.1.1 (now on pypi).

lgautier added 5 commits December 28, 2023 11:54

Conversion rules for polars.

0cf0075

Linting.

e6f3fe2

Linting (again).

0ee390f

Linting, better testing.

b1d447a

Overly optimistic copy/paste-ing.

75523c6

lgautier marked this pull request as ready for review December 28, 2023 19:04

lgautier mentioned this pull request Dec 28, 2023

Add Arrow support rpy2/rpy2#1080

Closed

Add doc page for polars.

8e25804

lgautier added 3 commits January 7, 2024 19:20

Add more tests, add coverage for polars.

6b5d16e

More tests.

61f4265

Linting.

c94e5da

lgautier mentioned this pull request Jan 8, 2024

Use Python's default signal handler. rpy2/rpy2#1085

Merged

Fixed quotes for the shell.

5266eba

lgautier added 7 commits January 15, 2024 11:55

Fix tests.

cc00f54

Split install for MacOS/Linux into different steps.

f84263d

Install explicitly nanoarrow on MacOS.

aa4cb6c

Set minimum R polars package version. Fixes for 0.12.

8e01794

Fix test for polars 0.12.

1e8be37

Remove debugging use of pdb.

a54ff0d

nanoarrow also needed on Linux.

6ac3e1f

lgautier added 3 commits January 21, 2024 11:28

polars::pl$from_arrow() is deprecated. Use as_polars_df().

b67694a

See pola-rs/r-polars#725 (comment). and pola-rs/r-polars#728.

Issue with segfault was fixed upstream (in the R polars package).

cd2f2c1

(see pola-rs/r-polars#725)

Fix type hint for polars.

6b6ca1d

lgautier added 2 commits February 10, 2024 18:31

Added convenience functions rpl_to_pl() and pl_to_rpl().

c406c10

Linting.

85e5e3e

lgautier merged commit acd28f7 into main Feb 17, 2024
34 checks passed
