Closed
Labels
needs triage: Issue that has not been reviewed by xarray team member
Description
Overview
When to_dataframe() is called on an xarray Dataset with a multi-index along a given dimension, the index coordinates are translated both:
- into levels of a pandas MultiIndex for the dataframe
- into individual columns of the dataframe.
Is this expected and intended behavior?
Main reprex
import numpy as np
import pandas as pd
import xarray as xr
data_dict = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
data_dict_w_dims = {k: ("my_dim", v) for k, v in data_dict.items()}
# create a dataset multi-indexed along "my_dim" by "x" and "y"
xr_dat = xr.Dataset(data_dict_w_dims).set_coords(["x", "y"]).set_xindex(["x", "y"])
print(xr_dat)
# <xarray.Dataset> Size: 140B
# Dimensions: (my_dim: 5)
# Coordinates:
# * my_dim (my_dim) object 40B MultiIndex
# * x (my_dim) int64 40B 1 2 1 2 1
# * y (my_dim) <U1 20B 'a' 'a' 'b' 'b' 'b'
# Data variables:
# z (my_dim) int64 40B 5 10 15 20 25
print(xr_dat.to_dataframe()) # x and y present both as columns and as multi-index
#       z  x  y
# x y
# 1 a   5  1  a
# 2 a  10  2  a
# 1 b  15  1  b
# 2 b  20  2  b
# 1 b  25  1  b
Cause
I believe the key line is here in the _to_dataframe() internal method:
Lines 7092 to 7095 in 699d895
def _to_dataframe(self, ordered_dims: Mapping[Any, int]):
    from xarray.core.extension_array import PandasExtensionArray

    columns_in_order = [k for k in self.variables if k not in self.dims]
The constituent IndexVariable objects of the multi-index are present in self.variables (and not in self.dims), so they become columns:
"x" in xr_dat.dims
# False
"x" in xr_dat.variables
# True
xr_dat.variables["x"]
# <xarray.IndexVariable 'my_dim' (my_dim: 5)> Size: 40B
# [5 values with dtype=int64]
This has consequences for pandas -> xarray -> pandas conversion
Because of this, converting a MultiIndex-ed pandas dataframe to an xarray Dataset via the xr.Dataset() constructor and then converting back to pandas via .to_dataframe() will not give back the original dataframe.
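Until the behavior is settled, one possible pandas-side workaround is to drop any column whose name duplicates a named index level. The sketch below is only an illustration: drop_index_level_columns is a hypothetical helper (not part of xarray or pandas), and the "round-tripped" frame is simulated with set_index(..., drop=False) instead of an actual xarray round trip, so it assumes the duplicated columns are the only difference from the original.

```python
import pandas as pd


def drop_index_level_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose names duplicate a named level of the index.

    Hypothetical helper for illustration; not an xarray or pandas API.
    """
    level_names = {name for name in df.index.names if name is not None}
    duplicated = [col for col in df.columns if col in level_names]
    return df.drop(columns=duplicated)


data_dict = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])

# The original multi-indexed frame with a single column z.
original = pd.DataFrame(data_dict).set_index(["x", "y"])

# Assumption: a stand-in for the round-tripped frame, where x and y are
# kept both as index levels and as ordinary columns.
round_tripped = pd.DataFrame(data_dict).set_index(["x", "y"], drop=False)

restored = drop_index_level_columns(round_tripped)
print(restored.equals(original))
```

This only removes the duplicated columns; any dtype or ordering changes introduced by a real round trip would need to be handled separately.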
Reprex
# create a multi-indexed pandas dataframe
pd_df = pd.DataFrame(data_dict).set_index(["x", "y"])
print(pd_df) # multi-indexed-df with one column
#       z
# x y
# 1 a   5
# 2 a  10
# 1 b  15
# 2 b  20
# 1 b  25
# Conversion to xarray is as expected:
xr_from_pd = xr.Dataset(pd_df)
print(xr_from_pd)
# <xarray.Dataset> Size: 160B
# Dimensions: (dim_0: 5)
# Coordinates:
# * dim_0 (dim_0) object 40B MultiIndex
# * x (dim_0) int64 40B 1 2 1 2 1
# * y (dim_0) object 40B 'a' 'a' 'b' 'b' 'b'
# Data variables:
# z (dim_0) int64 40B 5 10 15 20 25
# Converting back to pandas df via `to_dataframe()` yields a df multi-indexed by
# x and y that also contains `x` and `y` as columns:
print(xr_from_pd.to_dataframe()) # x and y as multi-index and as columns
#      x  y   z
# x y
# 1 a  1  a   5
# 2 a  2  a  10
# 1 b  1  b  15
# 2 b  2  b  20
# 1 b  1  b  25
Thoughts
- If this behavior is not intended, the flagged line in _to_dataframe() should be changed to determine column names in a way that ignores IndexVariables that form part of a multi-index.
- It might be important not just to filter to data variables, because one might want coordinates to become columns when they are not going to be part of the pandas MultiIndex, e.g.
# similar dataset with x and y as coordinates but not as a multi-index
dat_no_multiindex = xr.Dataset(data_dict_w_dims).set_coords(["x", "y"])
# potentially intended behavior?
print(dat_no_multiindex.to_dataframe())
#         x  y   z
# my_dim
# 0       1  a   5
# 1       2  a  10
# 2       1  b  15
# 3       2  b  20
# 4       1  b  25
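The selection rule suggested in the two points above could look roughly like the following. This is a sketch under stated assumptions, not the xarray implementation: select_columns is a hypothetical function that takes plain lists of names, and the variable, dimension, and level names passed to it are written out by hand to mirror the examples in this issue.

```python
from typing import Hashable, Iterable, List


def select_columns(
    variable_names: Iterable[Hashable],
    dim_names: Iterable[Hashable],
    multiindex_level_names: Iterable[Hashable],
) -> List[Hashable]:
    """Keep variables that are neither dimensions nor multi-index levels.

    Hypothetical sketch of the proposed rule; not xarray's actual code.
    """
    dims = set(dim_names)
    levels = set(multiindex_level_names)
    return [
        name
        for name in variable_names
        if name not in dims and name not in levels
    ]


# Multi-indexed case: x and y are levels of the multi-index along my_dim,
# so only z should become a column.
print(select_columns(["my_dim", "x", "y", "z"], ["my_dim"], ["x", "y"]))

# No multi-index: x and y are ordinary coordinates and stay as columns.
print(select_columns(["x", "y", "z"], ["my_dim"], []))
```

With a real Dataset, the level names could presumably be collected from the names of any pd.MultiIndex objects found in the dataset's indexes, but that wiring is left out here.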