Release Modin 0.19.0 · modin-project/modin

Modin 0.19.0

This release introduces Modin's new, experimental NumPy API. It also features
many bug fixes, improvements to documentation, and performance optimizations,
including faster initialization with NumPy arrays.
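
As a rough illustration of the NumPy-array initialization path mentioned above, the sketch below builds a frame directly from an array. It uses plain pandas so that it runs without extra setup; with Modin installed, the intent is that only the import line changes, as noted in the comment. The array shape and column names are invented for the example and do not come from the release.

```python
import numpy as np
import pandas as pd  # with Modin installed, swap for: import modin.pandas as pd

# Hypothetical data: 1,000 rows by 4 columns of floats.
data = np.arange(4000, dtype="float64").reshape(1000, 4)

# Construct the frame straight from the array; the column names are illustrative.
df = pd.DataFrame(data, columns=["a", "b", "c", "d"])

shape = df.shape
first_row = df.iloc[0].tolist()
```

The example shows only the calling pattern. How much faster the constructor is in this release depends on the engine and the data, which the notes above do not quantify.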

Key Features and Updates Since 0.18.0

Stability and Bugfixes
- FIX-#0000: Fix a typo in expr.py (#5757)
- FIX-#1227: Avoid RecursionError for __int__ and __float__ (#5502)
- FIX-#1503: Proper implementation of Series.values (#5469)
- FIX-#2320: Raise exceptions in read_csv in some cases with skipfooter!=0 (#5522)
- FIX-#2493: Defaults to pandas for read_csv if lineterminator!=None (#5515)
- FIX-#2494: Defaults to pandas for read_csv if escapechar!=None (#5521)
- FIX-#2508: Defaults to pandas for read_csv if dialect!=None (#5512)
- FIX-#3080: read_csv with HDK backend doesn't handle duplicated columns (#5639)
- FIX-#3305: Fix read_excel when usecols and index_col parameters are provided (#5508)
- FIX-#3620: Fix construction of dataframe from index (#5490)
- FIX-#3928: Fix column insertion into empty data frame (#5103)
- FIX-#4154: add value_counts method for SeriesGroupBy and DataFrameGroupBy (#5453)
- FIX-#4186: Fix __repr__ of Modin categorical Series (#5516)
- FIX-#4640: Fix __repr__ when display.max_rows=None (#5504)
- FIX-#5165: make 'groupby' handle non-str 'by' columns (#5411)
- FIX-#5273: Make ParquetFileToRead a named tuple (#5352)
- FIX-#5430: Make groupby work on empty frames (#5442)
- FIX-#5436: Fix '.index' extraction for an empty frame (#5431)
- FIX-#5473: Fixed a bug that ignored positional arguments in DataFrameGroupBy.take() (#5474)
- FIX-#5477: Fix TypeError: read_sas() takes 1 positional argument but 2 were given (#5465)
- FIX-#5488: Remove usage of deprecated numpy types (#5487)
- FIX-#5492: Fix Series.values when Series.dtype==ExtensionDtype (#5493)
- FIX-#5514: pin sphinx<6.0.0 (#5513)
- FIX-#5531: Fix failure when inserting a 2D python list into a frame (#5555)
- FIX-#5537: disable empty-groupby handling logic in experimental mode (#5538)
- FIX-#5539: Allow partitioning to adapt to the shape changes caused by '.merge' (#5556)
- FIX-#5545: Aligned with pandas default 'groupby.skew' results for invalid data (#5558)
- FIX-#5552: Fix sort_values when data is over-partitioned. (#5553)
- FIX-#5561: CalciteSerializer does not support unsigned integers (#5563)
- FIX-#5568: Pin 'fastparquet<2023.1.0' (#5569)
- FIX-#5581: Don't use deprecated inplace parameter for set_axis function (#5579)
- FIX-#5589: Do not trigger metadata materialization on 'filter' (#5588)
- FIX-#5597: pin sqlalchemy<1.4.46 as pandas does to fix CI (#5593)
- FIX-#5598: make PyArrowDataset.files work for 3.0.0 <= pyarrow < 8.0.0 (#5592)
- FIX-#5600: Copy '.dtypes' on 'df.copy()' (#5601)
- FIX-#5604: Fix dictionary groupby aggregation for a single col partition case (#5605)
- FIX-#5608: Pin openpyxl<3.1.0 (#5603)
- FIX-#5610: Add default to pandas implementation for qcut (#5611)
- FIX-#5621: Do not preserve suboptimal partitioning on keep_partitioning=False (#5622)
- FIX-#5625: Fix set_index with modin series. (#5630)
- FIX-#5628: BUG: HDK: Unable to concatenate tables with different number of non-numeric columns (#5673)
- FIX-#5629: Make read_sql alias compatible with snowflake. (#5631)
- FIX-#5650: Restore the right dtype for applying Series.cat (#5651)
- FIX-#5665: Fix operations that flatten an array, as well as handling of where argument in such operations (#5668)
- FIX-#5698: Read list of parquet files (#5725)
- FIX-#5702: Fix passing RangeIndex to loc. (#5719)
- FIX-#5714: BUG: Empty frames concatenation with inner join is not valid (#5715)
- FIX-#5720: Ensure that modin.numpy arrays propagate NaN values when computing mean (#5735)
- FIX-#5721: Fix loc[tuple] on multiindex. (#5726)
- FIX-#5730: Add repr, len, size, and make dtype changing lazy. (#5731)
- FIX-#5733: Allow all Modin objects in all Modin object constructors, and make sure copy=False works (#5736)
- FIX-#5742: BUG: HDK: Binary operations on strings are not supported (#5743)
- FIX-#5761: Add _exp, _sqrt to query compiler (#5762)
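
For readers unfamiliar with the groupby `value_counts` method referenced in FIX-#4154, here is a minimal sketch of the pandas behavior it is meant to mirror. It is written against plain pandas, on the assumption that the Modin call is identical once the import is swapped. The frame contents are made up.

```python
import pandas as pd  # with Modin installed, swap for: import modin.pandas as pd

# Illustrative data: two teams with repeated scores.
df = pd.DataFrame(
    {"team": ["x", "x", "y", "y", "y"], "score": [1, 1, 2, 3, 3]}
)

# SeriesGroupBy.value_counts counts each distinct score within each team.
# The result is a Series indexed by (team, score).
counts = df.groupby("team")["score"].value_counts()

x_ones = int(counts.loc[("x", 1)])
y_twos = int(counts.loc[("y", 2)])
y_threes = int(counts.loc[("y", 3)])
```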

Performance enhancements
- PERF-#5182: Precompute dtypes when performing binary operations in certain cases (#5494)
- PERF-#5183: Compute dtypes when performing from_labels operation (#5478)
- PERF-#5247: Make MultiIndex use memory more efficiently (#5632)
- PERF-#5369: GroupBy.skew implementation via MapReduce pattern (#5318)
- PERF-#5484: speed up read_csv; compute metadata after skipping rows (#5482)
- PERF-#5549: copy dtypes for invert op (#5541)
- PERF-#5550: Don't trigger axes computation in to_pandas function (#5544)
- PERF-#5551: Preserve index and columns on _repartition (#5543)
- PERF-#5554: Implement drop_duplicates via new duplicated (#5587)
- PERF-#5557: Don't trigger axes computation in pivot_table (#5546)
- PERF-#5573: Don't trigger axes computation in columnarize function (#5548)
- PERF-#5575: Don't trigger axes computation in reset_index function (#5547)
- PERF-#5586: Precompute resulting '.merge' partitioning based on the arguments (#5585)
- PERF-#5589: Do not trigger 'dtypes' materialization for '.filter()' (#5595)
- PERF-#5596: Do not trigger index materialization for '.merge' result (#5619)
- PERF-#5613: Optimize duplicated in case there is only one column partition (#5640)
- FIX-#5641: Add fastpath for numpy arrays to dataframe constructor (#5655)
- PERF-#5657: Don't trigger axes computation when accessing .str.* methods (#5658)
- PERF-#5660: Don't trigger axes computation when accessing cat.codes (#5661)
- PERF-#5680: Don't trigger axes computation when doing binary operations (#5681)
- PERF-#5682: Don't trigger axes computation when calling isin (#5683)
- PERF-#5690: move read_callback from dispatchers into parsers (#5689)
- PERF-#5691: Set item via .loc without converting a Series to np.array (#5693)
- PERF-#5700: Treat numpy arrays more efficiently at df.__setitem__ (#5708)
- PERF-#5705: Preserve metadata when applying Series.cat.codes (#5706)
- PERF-#5709: Avoid re-putting a distributed Series to the engine's object store at .map() (#5704)
- PERF-#5710: Avoid re-putting a distributed Series to the engine's object store at .isin() (#5707)

Refactor Codebase
- REFACTOR-#0000: make deploy functions in virtual_partition.py files private (#5455)
- REFACTOR-#1531: move default_to_pandas into base query_compiler class (#5479)
- REFACTOR-#3883: Unify tests execution approach in the Github workflow files (#5520)
- REFACTOR-#3948: Use __constructor__ in DataFrame and Series classes (#5485)
- REFACTOR-#5275: Deduplicate code for Ray and Unidist engines (#5457)
- REFACTOR-#5370: Move merge_asof implementation to base query compiler. (#5371)
- REFACTOR-#5393: remove unused '_VIEW_IS_COPY_WARNING' global var (#5392)
- REFACTOR-#5416: fix FutureWarning: the mangle_dupe_cols keyword is deprecated for read_excel (#5415)
- REFACTOR-#5434: Define public interfaces in modin.core.execution.dask module (#5418)
- REFACTOR-#5459: Install code linters through conda and unpin flake8 (#5450)
- REFACTOR-#5462: Update execution.ray public api with virtual partitions (#5456)
- REFACTOR-#5467: remove FutureWarning for df.iloc[:, i] = newvals (#5468)
- REFACTOR-#5471: add FutureWarning for DataFrameGroupBy.backfill (#5472)
- REFACTOR-#5475: Update execution.unidist public api with virtual partitions (#5476)
- REFACTOR-#5535: remove duplication for 'columnarize' method (#5534)
- REFACTOR-#5607: Fix missing formatting with 'black' (#5606)
- REFACTOR-#5685: add RayWrapper.put implementation (#5686)
- REFACTOR-#5687: add UnidistWrapper.put implementation (#5688)
- REFACTOR-#5703: align 'DaskWrapper.deploy' behavior with others (#5701)
- REFACTOR-#5718: add columns parameter for get_dtypes function (#5717)

Update testing suite
- TEST-#0000: correct behavior of CI for push action (#5748)
- TEST-#5420: port asv benchmarks for Repr, MaskBool, isNull, dropNa and equals functions (#5421)
- TEST-#5444: reduce Series' shape for TimeReindex asv bench (#5443)
- TEST-#5448: reduce DataFrame shape for 'time_merge_default' asv bench (#5446)
- TEST-#5451: reduce shapes for TimeLevelAlign, TimeStack and TimeUnstack ASV benchmarks (#5452)
- TEST-#5540: add module level setup function for ASV benchmarks (#5530)
- TEST-#5664: speedup Post Run conda-incubator/setup-miniconda@v2 step on Windows (#5662)
- TEST-#5747: Synchronize jobs between push.yml and ci.yml that are used to measure test coverage (#5745)
- TEST-#5764: run test-asv-benchmarks CI job only for PRs (#5765)

Documentation improvements
- DOCS-#3803: Update "building modin from source" docs (#5480)
- DOCS-#5157: Add a note regarding poor perf of the first op with Modin on Ray (#5491)
- DOCS-#5463: Add jupyter tutorials for Modin on Unidist (#5464)
- DOCS-#5498: mention 'DataFrame._repartition' API at docs (#5499)

New Features
- FEAT-#5147: implement xs (#5143)
- FEAT-#5423: Add a NumPy API to Modin (#5422)
- FEAT-#5481: Implement dictionary groupby aggregation via TreeReduce (#5503)
- FEAT-#5559: Upgrade pandas to 1.5.3 (#5560)
- FEAT-#5562: Upgrade pyhdk to 0.3.1 (#5564)
- FEAT-#5620: Synchronize parameters of apply_full_axis with broadcast_apply_full_axis (#5637)
- FEAT-#5666: Support logic operations on modin numpy arrays (#5667)
- FEAT-#5751: Bump pyhdk version to 0.4 (#5752)
- FEAT-#5753: Add math functions necessary for picoGPT (#5756)
- FEAT-#5754: Add np.linalg operations (#5755)
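
Two of the features listed above, `xs` (FEAT-#5147) and dictionary groupby aggregation (FEAT-#5481), follow the pandas API, so a small pandas sketch may help show what the calls look like. This is an assumption-laden illustration, not Modin's implementation: it runs on plain pandas, and the index, column names, and values are invented.

```python
import pandas as pd  # with Modin installed, swap for: import modin.pandas as pd

# Illustrative frame with a two-level row index.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["key", "num"]
)
df = pd.DataFrame({"v": [10, 20, 30], "w": [1.0, 2.0, 3.0]}, index=index)

# xs: take the cross-section of rows whose "key" level equals "a".
# The selected level is dropped from the result's index.
section = df.xs("a", level="key")
section_values = section["v"].tolist()

# Dictionary aggregation: a different reduction per column.
agg = df.groupby(level="key").agg({"v": "sum", "w": "max"})
a_sum = int(agg.loc["a", "v"])
b_max = float(agg.loc["b", "w"])
```

The dictionary form is handy when each column needs its own reduction. Whether a given call takes the TreeReduce path in Modin is an internal detail that this sketch does not exercise.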

Contributors

@AndreyPavlenko
@Egor-Krivov
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@mvashishtha
@noloerino
@vnlitvinov
@Billy2551
@Retribution98
@shalearkane
