Conversion rules for polars. #14
@Wainberg - are you available to review the PR? |
Thanks for taking a crack at this! I think you'll need some additional logic to handle the conversion between For what it's worth, here are the functions I've written for my lab members to convert between Python and R data frames.
As a general point, I find it cleaner, more intuitive and less error-prone to import a function that does the conversion (which can then be included in a method chain using Anyway, here's the code:

import signal
import polars as pl

# series_to_rvector and rvector_to_series are helper functions that are
# not shown in this comment

def df_to_rdf(df, *, rownames=None):
    """
    Converts a polars DataFrame to an R data frame via rpy2.

    Args:
        df: a polars DataFrame
        rownames: an optional polars Series (or tuple, list, etc.) of rownames
                  for the R data frame; column names will be copied from df

    Returns:
        The corresponding R data frame.
    """
    import rpy2_arrow.arrow as pyra
    from rpy2.robjects import r
    signal.signal(signal.SIGINT, signal.default_int_handler)
    if not isinstance(df, pl.DataFrame):
        raise TypeError(f'input to df_to_rdf should be a polars DataFrame, '
                        f'but you gave it a {type(df)}')
    # Handle categoricals by separating off the categories, converting the
    # codes, and adding the categories back at the end
    # TODO also handle pl.Enum
    categories = df\
        .select(pl.col(pl.Categorical).cat.get_categories().implode())
    df = df.with_columns(pl.col(pl.Categorical).cast(pl.Int32))
    # Convert
    rdf = r['as.data.frame'](pyra.pyarrow_table_to_r_table(df.to_arrow()))
    # Add the categories back
    for col_categories in categories:
        col_index = rdf.colnames.index(col_categories.name)
        rdf[col_index].rclass = 'factor'
        rdf[col_index].slots['levels'] = series_to_rvector(col_categories[0])
    if rownames is not None:
        rdf.rownames = series_to_rvector(pl.Series(rownames))
    return rdf

def rdf_to_df(rdf, *, keep_rownames=False):
    """
    Converts an R data frame to a polars DataFrame via rpy2.

    Args:
        rdf: an R data frame
        keep_rownames: if True, adds the R data frame's rownames as the first
                       column of the output polars DataFrame (called rownames,
                       or rownames_ if there's already a rownames column)

    Returns:
        The corresponding polars DataFrame.
    """
    import rpy2_arrow.arrow as pyra
    from rpy2.robjects import DataFrame, r
    signal.signal(signal.SIGINT, signal.default_int_handler)
    if not isinstance(rdf, DataFrame):
        raise TypeError(f'input to rdf_to_df should be an R data frame, '
                        f'but you gave it a {type(rdf)}')
    # Remove any classes rvector has (except factor) to avoid conversion errors
    df = {}
    for col_name, rvector in rdf.items():
        original_rclass = rvector.rclass
        try:
            rvector.rclass = 'factor' if 'factor' in rvector.rclass else ()
            df[col_name] = \
                pyra.rarrow_to_py_array(r('arrow::Array$create')(rvector))
        finally:
            # Put the original classes back
            rvector.rclass = original_rclass
    df = pl.DataFrame(df)
    if keep_rownames and rdf.rownames:
        rownames_column_name = 'rownames'
        while rownames_column_name in df:
            rownames_column_name += '_'
        df.insert_column(0, rvector_to_series(rdf.rownames)
                         .rename(rownames_column_name))
    return df

Again, I'd like to reiterate that it would be fabulous if you could incorporate rpy2-arrow directly into rpy2, because then the polars folks would be on board with doing |
Thanks. I'll look at what you have and what might be missing in the PR (as well as performance). pandas's Please open an issue if not the case.
Thanks. IIRC R will modify SIGINT handling during initialization. rpy2 is trying to revert that right after R is initialized (see here) with

def _sigint_handler(sig, frame):
    raise KeyboardInterrupt()

which I had understood to be Python's typical behaviour for
String management in Python and R is different, and IIRC both use some form of caching / pointer strategy to minimize object creation and memory usage when identical strings are present.
More static, less modular code can bring efficiencies. In the case of rpy2, there is also the notion of convenience vs performance. The
Several things here. Conversion rule sets spare users a lot of tedious explicit calling of conversions, or all re-implementing dispatch or casting logic. You are actually leveraging the default conversion rule set in your code. Even the The use of a context manager (and the I hear you that having to wrap calls into

def df_to_rdf(df):
    with polars2ri.converter.context() as conversion_ctx:
        return conversion_ctx.py2rpy(df)

That function definition could be in the module with the conversion rules set, this way all use-cases are covered.
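To illustrate the idea of a conversion rule set plus a convenience wrapper without requiring R, here is a toy sketch in pure Python. It is not rpy2's implementation: `ToyConverter`, `toy_converter` and `to_toy_r` are hypothetical names, and the "R objects" are just tagged tuples. It only shows how type-based dispatch lets users register rules once and then call a single function.

```python
from contextlib import contextmanager
from functools import singledispatch


class ToyConverter:
    """A toy rule set: maps Python types to conversion functions."""

    def __init__(self, name):
        self.name = name

        @singledispatch
        def py2rpy(obj):
            # Fallback used when no rule is registered for type(obj).
            raise NotImplementedError(
                f"{name}: no conversion rule for {type(obj).__name__}")

        self.py2rpy = py2rpy

    @contextmanager
    def context(self):
        # A real implementation would activate the rules for the duration
        # of the block; this toy version just hands back the converter.
        yield self


toy_converter = ToyConverter("toy")


@toy_converter.py2rpy.register(int)
def _(obj):
    return ("integer", obj)


@toy_converter.py2rpy.register(str)
def _(obj):
    return ("character", obj)


def to_toy_r(obj):
    """Convenience shortcut wrapping the context manager in a plain function."""
    with toy_converter.context() as conversion_ctx:
        return conversion_ctx.py2rpy(obj)
```

With this layout, callers who prefer a plain function use `to_toy_r(obj)`, while callers who want several conversions under the same rules can open the context once and reuse it.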
My reticence is about the maintenance of rpy2 given a proliferation of optional dependencies, when the resources to maintain rpy2 are very limited and cross-language dependencies are not handled by package management for Python. In the case of conversion to polars, it relies on an R package that is not even available in the standard package repository for R. I am fine with including into I guess that we are in a situation where there is no perfect solution, only a choice between trade-offs. For example, the
I am also unsure about the rationale of this when an optional dependency in However, I also don't understand the importance of having those |
Yeah it's really just about convenience. A surprisingly large number of people aren't even aware that Python-to-R conversion is possible, probably because it's not integrated into pandas/NumPy. Would be great if you can fix the sigint thing since rpy2's current behavior is quite annoying. |
Speaking of awareness, IIRC you mentioned that you did not know about the
I am looking into it. This is tracked here: rpy2/rpy2#1085 |
The issue with a test seems to be caused by an inconsistency or issue with the R package |
I hit another issue. This time it seems to be with the R package |
@Wainberg The equivalent to This is now merged and will be in the release 0.1.0 of the package. |
Awesome! I tried out

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "rpy2_arrow/polars.py", line 108, in pl_to_rpl
    with polars.converter.context() as conversion_ctx:
         ^^^^^^^^^^^^^^^^
AttributeError: module 'polars' has no attribute 'converter'. Did you mean: 'convert'?

I think you meant

def pl_to_rpl(df):
    """Convenience shortcut to convert a polars object to a R polars object."""
    with polars.converter.context() as conversion_ctx:
        return conversion_ctx.py2rpy(df)

def rpl_to_pl(df):
    """Convenience shortcut to convert a R polars object to a polars object."""
    with polars.converter.context() as conversion_ctx:
        return conversion_ctx.rpy2py(df)
|
Thanks. It looks like I thought that I had tests, but the tests are not using the functions they were meant to test. 🤦 |
Fix in progress with PR #15 |
Should be fixed in 0.1.1 (now on pypi). |