Releases: apache/iceberg-python

PyIceberg 0.9.0

07 Mar 04:58
8bfb16c

Full Changelog: pyiceberg-0.8.0...pyiceberg-0.9.0

There have been 243 new commits since the last minor release, 0.8.0, including 148 commits from various contributors and 95 from Dependabot. This release features contributions from 63 unique contributors, including 33 first-time contributors.

What's Changed

New Features

  • Introduced the capability to perform UPSERT operations on tables directly within PyIceberg.
  • Added support for dynamic overwrites as an optimization when an entire partition is replaced.
  • Implemented namespace_exists functionality for the REST catalog.
  • Extended the table updates to include the new remove-snapshot-ref and remove-snapshot actions.
  • Added view_exists method to the REST catalog as a part of the effort to add view support to the REST Catalog.
  • Implemented support for Alibaba OSS protocol in PyArrowFileIO
  • Introduced read support for the Iceberg V3 spec.
  • Added support for Location Providers for tables, which includes the ObjectStoreLocationProvider and also enables custom write paths for both data and metadata.
  • Extended S3FileIO operations to allow for cross region read support.
  • Introduced support for converting an Iceberg table scan to a polars DataFrame or LazyFrame.
  • Added support for the all_manifests metadata table.
  • Implemented support for writes to bucket partitioned tables.
  • Added automatic metadata cleanup to Iceberg tables via write.metadata.delete-after-commit.enabled.
  • Introduced syntactic sugar for and and or operations in filters.
  • Implemented configurable S3 request timeout settings for better performance tuning.
  • Added support for using the apache/iceberg-rest-fixture image for integration tests.
  • Introduced support for updating table statistics.
  • Added support for Bucket and Truncate transforms utilizing pyiceberg_core (iceberg-rust).
  • Added support for column projections from partition metadata.
  • Added support for ResidualEvaluator.
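
To make the UPSERT semantics mentioned in the list above concrete, here is a minimal sketch in plain Python. It is not PyIceberg's implementation and does not use its API: the function name upsert_rows, the row layout, and the join_cols parameter name are assumptions chosen for illustration. Rows whose join-column values match an existing row replace that row when they differ, and unmatched rows are appended.

```python
from typing import Any

Row = dict[str, Any]


def upsert_rows(
    target: list[Row], source: list[Row], join_cols: list[str]
) -> tuple[list[Row], int, int]:
    """Return (merged rows, rows updated, rows inserted); inputs are not mutated."""

    def key(row: Row) -> tuple:
        # The join columns together identify a row.
        return tuple(row[col] for col in join_cols)

    merged = {key(row): dict(row) for row in target}
    updated = inserted = 0
    for row in source:
        k = key(row)
        if k not in merged:
            merged[k] = dict(row)  # no match: insert
            inserted += 1
        elif merged[k] != row:
            merged[k] = dict(row)  # match with different values: update
            updated += 1
        # match with identical values: nothing to do
    return list(merged.values()), updated, inserted


# Hypothetical sample data.
target = [{"id": 1, "city": "Amsterdam"}, {"id": 2, "city": "Paris"}]
source = [{"id": 2, "city": "Lyon"}, {"id": 3, "city": "Berlin"}]

rows, updated, inserted = upsert_rows(target, source, join_cols=["id"])
print(rows, updated, inserted)
```

In this sample, the row with id 2 is updated and the row with id 3 is inserted, while the row with id 1 is left untouched.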

Deprecations

Catalog & Table Identifiers

  • Parsing catalog-level identifiers in Catalog references is deprecated
    • Please refer to tables using only their namespace and table name
  • Table.identifier property is deprecated
    • Use Table.name() instead

Expression Parsing

  • Parsing expressions with table names is deprecated
    • Only provide field names in row_filter

Configuration Properties

  • rest.authorization-url property is deprecated
    • Use oauth2-server-uri instead
  • gcs.endpoint property is deprecated
    • Use gcs.service.host instead
  • Properties starting with adlfs. are deprecated
    • Use properties that start with adls.

Table API Changes

  • project_table is deprecated
    • Use ArrowScan.to_table() instead
    • Use ArrowScan.to_record_batches() instead

Name Mapping

  • NameMapping.find is deprecated
    • Use apply_name_mapping instead

Table Update Field Removal

  • The initial_change field has been removed from table updates, affecting:
    • AddSchemaUpdate
    • AddPartitionSpecUpdate
    • AddSortOrderUpdate

Table Class Refactoring
Several table classes have been moved to private classes:

  • pyiceberg.table.Move → pyiceberg.table.update.schema._Move
  • pyiceberg.table.MoveOperation → pyiceberg.table.update.schema._MoveOperation
  • pyiceberg.table.DeleteFiles → pyiceberg.table.update.snapshot._DeleteFiles
  • pyiceberg.table.FastAppendFiles → pyiceberg.table.update.snapshot._FastAppendFiles
  • pyiceberg.table.MergeAppendFiles → pyiceberg.table.update.snapshot._MergeAppendFiles
  • pyiceberg.table.OverwriteFiles → pyiceberg.table.update.snapshot._OverwriteFiles

Table Properties Refactoring
Several constants have been moved to TableProperties:

  • DEFAULT_MAX_SNAPSHOT_AGE_MS → TableProperties.MAX_SNAPSHOT_AGE_MS_DEFAULT
  • DEFAULT_MIN_SNAPSHOTS_TO_KEEP → TableProperties.MIN_SNAPSHOTS_TO_KEEP_DEFAULT

Documentation Updates

  • Added documentation for the new UPSERT operation support.
  • Added documentation of the new LocationProvider feature.
  • Improved the "How to Release" documentation.
  • Added documentation linking to the community contributing guidelines.
  • Added documentation on the nightly build.

Bug Fixes

  • Fixed KeyError in add_files for Parquet files missing column stats.
  • Fixed Table.scan case sensitivity handling.
  • Resolved TypeError in create_match_filter for composite keys.
  • Allowed leading underscore in column name used in row filter.
  • Ensured correct statistics updates by removing redundant snapshot_id in SetStatisticsUpdate.
  • Fixed namespace existence check for multi-level namespaces in SqlCatalog.
  • Improved handling of S3 request timeouts.
  • Fixed TypeError in composite key joins.

Dependencies

  • Remove python 3.13 upper bound restriction
  • Remove fsspec upper bound restriction
  • Bump PyArrow to 19.0.0

Infra

  • Improve and automate release process using github workflow
  • Add support for testpypi nightly build
  • Add codespell to pre-commit
  • Replace pycln with ruff

Commits

Features

Documentations

pyiceberg-0.8.1

06 Dec 19:43

Full Changelog: pyiceberg-0.8.0...pyiceberg-0.8.1

Patch Release PR: #1384

What's Changed

The behavior of Table.name has changed: it now returns the table name without the catalog name. This is part of a broader effort to remove references to the catalog name in pyiceberg.

  • Replace usage of Table.identifier with Table.name which returns the table name without the catalog name
  • Replace the use of a deprecated function (identifier_to_tuple_without_catalog) in pyiceberg; remove unnecessary warnings

Documentation updates are included to reflect the updated process in https://py.iceberg.apache.org/

  • Update “how to release” documentation
  • 0.8.0 post-release steps

Bug fixes

  • Fix add_files for parquet files without column stats
  • Allow leading underscore in column name used in row filter
  • Ignore tables without table_type property from Glue and Hive
  • Write null in manifest list metadata when there is no parent-snapshot-id

Remove upper bound restrictions for dependency libraries; allow early testing of new versions

  • Remove Python library version upper bound restriction; allow Python 3.13
  • Remove fsspec library version upper bound restriction

Commits

36 new commits since the 0.8.0 release.

12 new commits will be included in 0.8.1

  • 11 commits cherry-picked as bug fixes (listed below)
  • 1 commit to bump version to 0.8.1

11 bug fixes (cherry-picked)

acbd071 Write null when there is no parent-snapshot-id (#1383)
bb078cf Add instruction for patch release (#1373)
ab43c6c fix KeyError raised by add_files when parquet file does not have column stats (#1354)
cc1ab2c Improve documentation for "how to release" (#1359)
64dc6fe Remove Python 3.13 upper bound restriction (#1355)
d86ab6e Allow leading underscore in column name used in row filter (#1358)
7a4734e Replace reference of Table.identifier with Table.name (#1346)
a66ddc0 Ignore tables without table_type from Glue and Hive (#1332)
2cbc77d Drop upper bounds for fsspec and its implementations (#1341)
7660a5b 0.8.0 post release steps (#1334)
b2f0a9e use the non-deprecated func (#1326)

New Contributors

pyiceberg-0.8.0

18 Nov 19:35
3ccdc44

What's Changed

PR

  • Update PyIceberg Verify Release doc by @chinmay-bhat in #976
  • DOCS: Add Github Actions Screenshots to Release Notes by @sungwy in #975
  • Bump up version in dev Dockerfile and Issue Template by @ndrluis in #981
  • Fix pydantic warning in the commit process by @ndrluis in #972
  • Bump up Iceberg version to 1.6.0 by @ndrluis in #982
  • Bug Fix: use appropriate partition spec for delete by @sungwy in #984
  • [Bug Fix]Use self.table_metadata when in transaction by @HonahX in #985
  • DOCS: Add more post release notes by @sungwy in #983
  • Treat warning as error in CI/Dev by @ndrluis in #973
  • Use 'strtobool' instead of comparing with a string. by @ndrluis in #988
  • Fix: accept empty arrays in struct field lookup by @grobgl in #997
  • Add ndrluis as collaborator by @sungwy in #1009
  • Fix list namespace response in rest catalog by @ndrluis in #995
  • Pyarrow IO property for configuring large v small types on read by @sungwy in #986
  • Update metadata-log for non-rest catalogs by @soumya-ghosh in #977
  • Exclude Python 3.9.7 due to import error in catalog module by @ndrluis in #526
  • Deprecate rest.authorization-url in favor of oauth2-server-uri by @ndrluis in #962
  • Allow setting write.parquet.row-group-limit by @Fokko in #1016
  • Deprecate Redundant Identifier Support in TableIdentifier, and row_filter by @sungwy in #994
  • Fix: Handle Empty RecordBatch within _task_to_record_batches, fix correctness issue with positional deletes by @sungwy in #1026
  • Fix overwrite when filtering all the data by @ndrluis in #1023
  • Allow setting write.parquet.page-row-limit by @Fokko in #1017
  • DOCS: Remove older row for write.parquet.row-group-limit by @sungwy in #1030
  • Improve test_version_format() error message for version mismatches by @laksh-krishna-sharma in #1015
  • Bump version to 0.7.1 by @sungwy in #1034
  • Support s3.signer.endpoint for nessie by @guitcastro in #1029
  • [bug] fix reading with to_arrow_batch_reader and limit by @kevinjqliu in #1042
  • Use VisitorWithPartner for name-mapping by @Fokko in #1014
  • Fix tracing existing entries when there are deletes by @Fokko in #1046
  • Coverage Run unit tests first before docker containers are set up by @Minfante377 in #1055
  • Update "verify release" instruction by @kevinjqliu in #1064
  • Fix Install Issues with docutils = 0.21.post1 and exclude 3.12 from supported python dependencies by @sungwy in #1067
  • Post Release 0.7.1 version updates by @sungwy in #1073
  • Update create table doc to clarify ID re-assignment by @paulcichonski in #1072
  • Refactor PyArrow DataFiles Projection functions by @sungwy in #1043
  • DOCS: Exclude signature files from twine upload by @sungwy in #1071
  • Increase the minimal required pyarrow version to 14.0.0 by @ndrluis in #1090
  • Fix table_exists behavior in REST catalog by @ndrluis in #1096
  • fix: improve makefile by @TiansuYu in #1091
  • fix (issue-1079): allow update_column to set doc as '' by @TiansuYu in #1083
  • prevent adding duplicate files by @amitgilad3 in #1036
  • Add list_views to rest catalog by @ndrluis in #817
  • Emit warnings instead of failing when seeing unsupported configuration by @Fokko in #1111
  • Use markdownlint instead of mdformat by @kevinjqliu in #1118
  • Add drop_view to the rest catalog by @ndrluis in #820
  • Support python 3.12 by @kevinjqliu in #1068
  • Make commit_table public by @Fokko in #1112
  • Refactoring: Break down very large table/__init__.py module by @sungwy in #1144
  • fix: Invert case_sensitive logic in StructType by @AnthonyLam in #1147
  • Bump duckdb to version 1.1.0 by @kevinjqliu in #1149
  • Deprecate ADLFS prefix in favor of ADLS by @ndrluis in #961
  • Cache Manifest files by @chinmay-bhat in #787
  • Use the correct spec when rewriting existing manifests by @Fokko in #1157
  • Bug Fix: Use historical partition field name by @sungwy in #1161
  • fix: remove old, incorrect docstring by @dataders in #1166
  • Preserve Backward compatibility in 0.8.0 for #1144 by @sungwy in #1151
  • follow up for more cleanup by @dataders in #1168
  • [bug] [REST] Dont remove identifier root by @kevinjqliu in #1172
  • fix: support MonthTransform for partitioning by @felixscherz in #1176
  • Add metadata tables for data_files and delete_files by @soumya-ghosh in #1066
  • Use ArrowScan.to_table to replace project_table by @JE-Chen in #1180
  • Add Docstrings to pyiceberg/table/__init__.py by @sungwy in #1189
  • Support python 3.12 in poetry by @kevinjqliu in #1192
  • Use cachetools's LRUCache to cache manifest list by @kevinjqliu in #1187
  • HA HMS support by @awdavidson in #752
  • Bug Fix: Position Deletes + row_filter yields less data when the DataFile is large by @sungwy in #1141
  • Remove dead loom link by @kevinjqliu in #1213
  • Drop support for Python 3.8 by @raulcd in #1221
  • Add clarifying docs to transform result types by @kevinzwang in #1211
  • Add flag to allow disabling creation of catalog tables by @isc-patrick in #1155
  • Bug Fix: Glue and Hive catalog return only Iceberg tables by @mark-major in #1145
  • Move snapshot history expire table properties to constants by @ndrluis in #1217
  • abort the whole table transaction if any updates in the transaction has failed by @stevie9868 in #1246
  • PyArrow: Pass in null-mask by @Fokko in #1264
  • Bump PyArrow to 18.0.0 by @Fokko in #1256
  • Remove numpy as a hard dependency by @Fokko in #1270
  • Allow for missing operation by @Fokko in #1263
  • fix: list_tables method in glue catalog now only return tables. by @omkenge in #1258
  • Replace numpy usage and remove from pyproject.toml by @kevinjqliu in #1272
  • Bump version to 0.8.0 by @Fokko in #1276
  • Remove initial_change when CreateTableTransaction apply table updates on an empty metadata by @HonahX in #1219
  • Deprecate for 0.8.0 release by @kevinjqliu in #1269
  • Pass table-token to commit endpoint by @Fokko in #1278
  • Updating configuration docs by @Samreay in #1292
  • Allow union of {int,long}, {float,double}, etc by @Fokko in #1283
  • Allow passing in ARN Role and Session name to the PyArrowFileIO by @Fokko in #1...

pyiceberg-0.8.0rc2

14 Nov 20:47
3ccdc44
Pre-release

What's Changed


pyiceberg-0.8.0-rc1

07 Nov 20:49
0eaadb9
Pre-release

What's Changed

PRs


pyiceberg-0.7.1

19 Aug 18:34

What's Changed

  • Fix delete to trace existing manifests when a data file is partially rewritten by @Fokko in #1046
  • Fix 'to_arrow_batch_reader' to respect the limit input arg by @kevinjqliu in #1042
  • Fix correctness of applying positional deletes on Merge-On-Read tables by @sungwy in #1026
  • Fix overwrite when filtering data by @ndrluis in #1023
  • Bug fix for deletes across multiple partition specs on partition evolution by @sungwy in #984
  • Fix evolving the table and writing in the same transaction by @HonahX in #985
  • Fix scans when result is empty by @grobgl in #997
  • Fix ListNamespace response in REST Catalog by @ndrluis in #995
  • Exclude Python 3.9.7 from list of supported versions by @ndrluis in #526
  • Allow setting write.parquet.row-group-limit by @Fokko in #1016
  • Allow setting write.parquet.page-row-limit by @Fokko in #1017
  • Fix pydantic warning during commit by @ndrluis in #972

Full Changelog: pyiceberg-0.7.0...pyiceberg-0.7.1

pyiceberg-0.7.0

30 Jul 23:44
be5c426

What's Changed


PyIceberg 0.6.1

PyIceberg 0.6.0

20 Feb 10:34
cc44926

What's Changed


PyIceberg 0.5.1

30 Oct 13:49