Modin 0.19.0
This release introduces Modin's new, experimental NumPy API. It also features
many bug fixes, improvements to documentation, and performance optimizations,
including faster initialization with NumPy arrays.
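As a rough illustration of the NumPy-array initialization path mentioned above, the sketch below builds a DataFrame directly from a 2-D NumPy array. It assumes Modin and an execution engine are installed; the fallback to plain pandas, the column names `a` and `b`, and the sample data are illustrative choices, not part of the release.

```python
import numpy as np

try:
    # Modin's pandas-compatible API; assumes Modin and an engine are installed.
    import modin.pandas as pd
except ImportError:
    # Illustrative fallback so the sketch still runs without Modin.
    import pandas as pd

# A small 2-D NumPy array standing in for real data (hypothetical values).
data = np.arange(6).reshape(3, 2)

# Construct a DataFrame directly from the NumPy array.
df = pd.DataFrame(data, columns=["a", "b"])

# Per-column sums, indexed by column name.
column_sums = df.sum()
print(df.shape)
print(int(column_sums["a"]), int(column_sums["b"]))
```

The same constructor call works with either import, so existing pandas code that starts from NumPy arrays should need only the import line changed.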
Key Features and Updates Since 0.18.0
- Stability and Bugfixes
- FIX-#0000: Fix a typo in `expr.py` (#5757)
- FIX-#1227: Avoid `RecursionError` for `__int__` and `__float__` (#5502)
- FIX-#1503: Proper implementation of `Series.values` (#5469)
- FIX-#2320: Raise exceptions in read_csv in some cases with `skipfooter!=0` (#5522)
- FIX-#2493: Defaults to pandas for read_csv if lineterminator!=None (#5515)
- FIX-#2494: Defaults to pandas for read_csv if escapechar!=None (#5521)
- FIX-#2508: Defaults to pandas for read_csv if `dialect!=None` (#5512)
- FIX-#3080: read_csv with HDK backend doesn't handle duplicated columns (#5639)
- FIX-#3305: Fix `read_excel` when `usecols` and `index_cols` parameters are provided (#5508)
- FIX-#3620: Fix construction of dataframe from index (#5490)
- FIX-#3928: Fix column insertion into empty data frame (#5103)
- FIX-#4154: add value_counts method for SeriesGroupBy and DataFrameGroupBy (#5453)
- FIX-#4186: Fix `__repr__` of Modin categorical Series (#5516)
- FIX-#4640: Fix `__repr__` when `display.max_rows=None` (#5504)
- FIX-#5165: make 'groupby' handle non-str 'by' columns (#5411)
- FIX-#5273: Make `ParquetFileToRead` a named tuple (#5352)
- FIX-#5430: Make groupby work on empty frames (#5442)
- FIX-#5436: Fix '.index' extraction for an empty frame (#5431)
- FIX-#5473: Fixed a bug that ignored positional arguments in `DataFrameGroupBy.take()` (#5474)
- FIX-#5477: Fix TypeError: read_sas() takes 1 positional argument but 2 were given (#5465)
- FIX-#5488: Remove usage of deprecated numpy types (#5487)
- FIX-#5492: Fix `Series.values` when `Series.dtype==ExtensionDtype` (#5493)
- FIX-#5514: pin sphinx<6.0.0 (#5513)
- FIX-#5531: Fix failure when inserting a 2D python list into a frame (#5555)
- FIX-#5537: disable empty-groupby handling logic in experimental mode (#5538)
- FIX-#5539: Allow partitioning to adapt to the shape changes caused by '.merge' (#5556)
- FIX-#5545: Aligned with pandas default 'groupby.skew' results for invalid data (#5558)
- FIX-#5552: Fix sort_values when data is over-partitioned. (#5553)
- FIX-#5561: CalciteSerializer does not support unsigned integers (#5563)
- FIX-#5568: Pin 'fastparquet<2023.1.0' (#5569)
- FIX-#5581: Don't use deprecated `inplace` parameter for `set_axis` function (#5579)
- FIX-#5589: Do not trigger metadata materialization on 'filter' (#5588)
- FIX-#5597: pin sqlalchemy<1.4.46 as pandas does to fix CI (#5593)
- FIX-#5598: make `PyArrowDataset.files` work for `3.0.0 <= pyarrow < 8.0.0` (#5592)
- FIX-#5600: Copy '.dtypes' on 'df.copy()' (#5601)
- FIX-#5604: Fix dictionary groupby aggregation for a single col partition case (#5605)
- FIX-#5608: Pin openpyxl<3.1.0 (#5603)
- FIX-#5610: Add default to pandas implementation for qcut (#5611)
- FIX-#5621: Do not preserve suboptimal partitioning on `keep_partitioning=False` (#5622)
- FIX-#5625: Fix set_index with modin series. (#5630)
- FIX-#5628: BUG: HDK: Unable to concatenate tables with different number of non-numeric columns (#5673)
- FIX-#5629: Make read_sql alias compatible with snowflake. (#5631)
- FIX-#5650: Restore the right dtype for applying Series.cat (#5651)
- FIX-#5665: Fix operations that flatten an array, as well as handling of where argument in such operations (#5668)
- FIX-#5698: Read list of parquet files (#5725)
- FIX-#5702: Fix passing RangeIndex to loc. (#5719)
- FIX-#5714: BUG: Empty frames concatenation with inner join is not valid (#5715)
- FIX-#5720: Ensure that modin.numpy.array's propagate NaN values when computing mean (#5735)
- FIX-#5721: Fix loc[tuple] on multiindex. (#5726)
- FIX-#5730: Add repr, len, size, and make dtype changing lazy. (#5731)
- FIX-#5733: Allow all Modin objects in all Modin object constructors, and make sure copy=False works (#5736)
- FIX-#5742: BUG: HDK: Binary operations on strings are not supported (#5743)
- FIX-#5761: Add _exp, _sqrt to query compiler (#5762)
- Performance enhancements
- PERF-#5182: Precompute dtypes when performing binary operations in certain cases (#5494)
- PERF-#5183: Compute dtypes when performing from_labels operation (#5478)
- PERF-#5247: Make MultiIndex use memory more efficiently (#5632)
- PERF-#5369: `GroupBy.skew` implementation via MapReduce pattern (#5318)
- PERF-#5484: speed up read_csv; compute metadata after skipping rows (#5482)
- PERF-#5549: copy dtypes for invert op (#5541)
- PERF-#5550: Don't trigger axes computation in `to_pandas` function (#5544)
- PERF-#5551: Preserve index and columns on `_repartition` (#5543)
- PERF-#5554: Implement `drop_duplicates` via new `duplicated` (#5587)
- PERF-#5557: Don't trigger axes computation in `pivot_table` (#5546)
- PERF-#5573: Don't trigger axes computation in `columnarize` function (#5548)
- PERF-#5575: Don't trigger axes computation in `reset_index` function (#5547)
- PERF-#5586: Precompute resulting '.merge' partitioning based on the arguments (#5585)
- PERF-#5589: Do not trigger 'dtypes' materialization for '.filter()' (#5595)
- PERF-#5596: Do not trigger index materialization for '.merge' result (#5619)
- PERF-#5613: Optimize `duplicated` in case there is only one column partition (#5640)
- FIX-#5641: Add fastpath for numpy arrays to dataframe constructor (#5655)
- PERF-#5657: Don't trigger axes computation when accessing `.str.*` methods (#5658)
- PERF-#5660: Don't trigger axes computation when accessing cat.codes (#5661)
- PERF-#5680: Don't trigger axes computation when doing binary operations (#5681)
- PERF-#5682: Don't trigger axes computation when calling `isin` (#5683)
- PERF-#5690: move `read_callback` from dispatchers into parsers (#5689)
- PERF-#5691: Set item via `.loc` without converting a Series to np.array (#5693)
- PERF-#5700: Treat numpy arrays more efficiently at `df.__setitem__` (#5708)
- PERF-#5705: Preserve metadata when applying `Series.cat.codes` (#5706)
- PERF-#5709: Avoid re-putting a distributed Series to the engine's object store at `.map()` (#5704)
- PERF-#5710: Avoid re-putting a distributed Series to the engine's object store at `.isin()` (#5707)
- Refactor Codebase
- REFACTOR-#0000: make deploy functions in virtual_partition.py files private (#5455)
- REFACTOR-#1531: move `default_to_pandas` into base query_compiler class (#5479)
- REFACTOR-#3883: Unify tests execution approach in the Github workflow files (#5520)
- REFACTOR-#3948: Use `__constructor__` in `DataFrame` and `Series` classes (#5485)
- REFACTOR-#5275: Deduplicate code for Ray and Unidist engines (#5457)
- REFACTOR-#5370: Move merge_asof implementation to base query compiler. (#5371)
- REFACTOR-#5393: remove unused '_VIEW_IS_COPY_WARNING' global var (#5392)
- REFACTOR-#5416: fix `FutureWarning: the mangle_dupe_cols keyword is deprecated` for `read_excel` (#5415)
- REFACTOR-#5434: Define public interfaces in `modin.core.execution.dask` module (#5418)
- REFACTOR-#5459: Install code linters through conda and unpin flake8 (#5450)
- REFACTOR-#5462: Update execution.ray public api with virtual partitions (#5456)
- REFACTOR-#5467: remove FutureWarning for `df.iloc[:, i] = newvals` (#5468)
- REFACTOR-#5471: add `FutureWarning` for `DataFrameGroupBy.backfill` (#5472)
- REFACTOR-#5475: Update execution.unidist public api with virtual partitions (#5476)
- REFACTOR-#5535: remove duplication for 'columnarize' method (#5534)
- REFACTOR-#5607: Fix missing formatting with 'black' (#5606)
- REFACTOR-#5685: add `RayWrapper.put` implementation (#5686)
- REFACTOR-#5687: add `UnidistWrapper.put` implementation (#5688)
- REFACTOR-#5703: align 'DaskWrapper.deploy' behavior with others (#5701)
- REFACTOR-#5718: add `columns` parameter for `get_dtypes` function (#5717)
- Update testing suite
- TEST-#0000: correct behavior of CI for push action (#5748)
- TEST-#5420: port asv benchmarks for Repr, MaskBool, isNull, dropNa and equals functions (#5421)
- TEST-#5444: reduce Series' shape for TimeReindex asv bench (#5443)
- TEST-#5448: reduce Dataframe' shape for 'time_merge_default' asv bench (#5446)
- TEST-#5451: reduce shapes for TimeLevelAlign, TimeStack and TimeUnstack ASV benchmarks (#5452)
- TEST-#5540: add module level setup function for ASV benchmarks (#5530)
- TEST-#5664: speedup `Post Run conda-incubator/setup-miniconda@v2` step on Windows (#5662)
- TEST-#5747: Synchronize jobs between push.yml and ci.yml that are used to measure test coverage (#5745)
- TEST-#5764: run test-asv-benchmarks CI job only for PRs (#5765)
- Documentation improvements
- New Features
- FEAT-#5147: implement xs (#5143)
- FEAT-#5423: Add a NumPy API to Modin (#5422)
- FEAT-#5481: Implement dictionary groupby aggregation via TreeReduce (#5503)
- FEAT-#5559: Upgrade pandas to 1.5.3 (#5560)
- FEAT-#5562: Upgrade pyhdk to 0.3.1 (#5564)
- FEAT-#5620: Synchronize parameters of `apply_full_axis` with `broadcast_apply_full_axis` (#5637)
- FEAT-#5666: Support logic operations on modin numpy arrays (#5667)
- FEAT-#5751: Bump pyhdk version to 0.4 (#5752)
- FEAT-#5753: Add math functions necessary for picoGPT (#5756)
- FEAT-#5754: Add np.linalg operations (#5755)
Contributors
@AndreyPavlenko
@Egor-Krivov
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@mvashishtha
@noloerino
@vnlitvinov
@Billy2551
@Retribution98
@shalearkane