Releases: mjakubowski84/parquet4s
v1.9.0
This version brings many changes. It slowly builds the foundation for version 2.x, in which you may expect many changes both internally and in the API.
Parquet schema migrated to version 1.11+
- Parquet4S no longer uses deprecated schema types internally.
- Defining custom types using the old schema types is deprecated in favour of the new ones.
- A simplified API for defining custom types is introduced - check the README and examples.
- The deprecated API got slightly modified but is supposed to be backwards compatible. Those who use it very extensively may encounter small compilation issues after the upgrade.
- Please migrate to the new API, as the deprecated one is going to be removed in version 2.x.
Revamped Filtering
- Ability to filter by UDP - a user-defined predicate. Write your own filter predicate!
- Simplified, rewritten internal API. The old API was mostly private, so no backward incompatibility is expected for most users.
- The new Filter API is public, so you can now define filters for your own custom types, too. Just implement `FilterCodec[T]` for your type `T`.
- Filtering by `Array[Byte]` is now allowed.
- The multi-arg version of the `In` predicate got modified so that it now expects at least a single parameter - that enforces proper usage at compilation time. See the sketch below.
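A minimal sketch of the revamped filtering in the core module - the path, schema, and values are hypothetical, while `Col`, the comparison operators, and `in` come from the documented filter API; the construction of a custom `FilterCodec[T]` is omitted here:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader}

case class Transaction(id: Long, amount: Double, payload: Array[Byte])

// Combine built-in predicates; the multi-arg `in` now requires at least
// one argument, so an accidentally empty predicate no longer compiles.
// Since this release you can also filter by Array[Byte] columns,
// e.g. Col("payload") === someBytes.
val transactions = ParquetReader.read[Transaction](
  path   = "file:///data/transactions",
  filter = Col("amount") >= 100.0 && Col("id").in(1L, 2L, 3L)
)
try transactions.foreach(println)
finally transactions.close()
```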
Bugfixes
- FS2's `viaParquet` was processing the `partitionBy` parameters improperly. In effect, one could not partition by a non-root field. It is fixed now.
v1.8.3
Multiple small improvements added to the code:
- Fixed various compilation warnings
- Fixed several Scaladoc links
- Reorganised build.sbt
- CI/CD improvements
- Created basic benchmarks for testing future code changes
- Using `Blocker` when creating the writer in FS2's single-file write.
v1.8.2
A critical bug introduced in version 1.8.0 made the core module usable with nothing but the local file system. All should be fixed with this bugfix release.
It was a simple code mistake caused by improper usage of the hadoop-client API. In effect, the `file://` scheme was enforced when listing files on a path while preparing the `Stats` component. Moreover, in order to delay this premature file listing, initialisation of the `Stats` component is now made lazy.
v1.8.1
This release contains a fatal bug in the core module. Please use version 1.8.2 or higher.
@malonsocasas reported and fixed an old issue with reading one of the many legacy list formats. The fix ships with this release.
v1.8.0
Release 1.8.0 introduces new functionalities and improvements in the core library. Besides that, each module undergoes multiple upgrades of Scala, Parquet, and other dependencies.
This release contains a fatal bug in the core module. Please use version 1.8.2 or higher.
New features
- From now on, when calling `size` on `ParquetIterable`, Parquet4S does not iterate over all records but tries to leverage the file's metadata and statistics. It is especially fast in the case of unfiltered files. But it is also quite fast when reading with a filter, as Parquet4S omits row groups which, thanks to statistics, it already knows cannot contain matching values.
- `ParquetIterable` also receives `min` and `max` functions that provide the smallest and the greatest value of the chosen column. Similarly to the new implementation of `size`, Parquet4S leverages file metadata, and it works for both filtered and unfiltered files.
- You can also access the aforementioned functions by a direct call to `Stats`.
- Added custom errors for unresolved implicits, for better feedback on how to use Parquet4S with custom types.
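A minimal sketch of the new functions on `ParquetIterable` - the path and schema are hypothetical, and the exact shape of the `min`/`max` signatures (a plain string column path here) is an assumption based on the description above:

```scala
import com.github.mjakubowski84.parquet4s.ParquetReader

case class User(id: Long, age: Int)

val users = ParquetReader.read[User]("file:///data/users")
try {
  // resolved from file metadata and statistics, without a full scan
  val count = users.size
  // smallest and greatest value of the chosen column; the column-argument
  // type is an assumption
  val youngest: Option[Int] = users.min[Int]("age")
  val oldest: Option[Int]   = users.max[Int]("age")
  println(s"$count users, ages from $youngest to $oldest")
} finally users.close()
```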
Upgrades
- Scala 2.12 is upgraded to 2.12.13 and Scala 2.13 to 2.13.5
- Parquet is upgraded to 1.12.0. Please note a change that is not breaking in the case of interoperability with older versions of Parquet4S and Spark, but might (however shouldn't) be breaking in the case of other systems - from now on, `Map` is saved internally using a `key_value` field instead of `map`. See the note after this list.
- FS2 upgraded to 2.5.4
- Shapeless upgraded to 2.3.4
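To illustrate the `Map` change with a hypothetical class: only the name of the repeated inner group in the written Parquet schema differs; reading the data back is unaffected:

```scala
// The `settings` map used to be written with its repeated inner group named
// `map`; with Parquet 1.12.0 it is written under the standard `key_value`
// name instead. The class itself is a hypothetical example.
case class Preferences(userId: Long, settings: Map[String, String])
```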
v1.7.0
This is the next maintenance release; it improves the stability and functionality of the integrations with Akka Streams and FS2.
Akka Streams:
- Thanks to @dkwi, `viaParquet` receives a new functionality: `withPostWriteHandler` allows you to monitor, flush files or take any other action based on the current state of the ParquetWriter. See the sketch after the FS2 notes below.
- Further fixes of resource cleanup in `viaParquet`. From now on, writers are properly closed also on internal and downstream errors.
FS2:
- `viaParquet` receives a similar `PostWriteHandler` to the one in Akka Streams.
- Redundant synchronisation in `viaParquet` is removed for better performance.
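A rough sketch of a post-write handler in the Akka module: `withPostWriteHandler` is named in the notes above, but the surrounding builder methods, the path, and the shape of the state passed to the handler are assumptions:

```scala
import com.github.mjakubowski84.parquet4s.ParquetStreams

case class Metric(name: String, value: Double)

// An indefinite-stream writer that reports after every write.
val metricsFlow = ParquetStreams
  .viaParquet[Metric]("file:///data/metrics") // hypothetical path
  .withMaxCount(50000)                        // assumed builder method
  .withPostWriteHandler { state =>
    // monitor, flush files or take any action based on the writer's state
    println(s"post-write state: $state")
  }
  .build()
```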
v1.6.0
Release 1.6.0 brings an important feature of Parquet that was missing so far - the ability to read a subset of columns from Parquet files. This is called schema projection, and now it is available in every module of Parquet4S. Check the updated README for more.
The new feature implied the need to redesign the API. The core library just got a new function pointing to the new reader, but in the Akka and FS2 modules new reader builders are introduced, and those deprecate the old readers.
Moreover, the recently introduced FS2 module received several API fixes that may be breaking for some. Unfortunately, those were required for consistency.
Full list of changes:
- core: `ParquetReader.withProjection[YourSchema]` points to a reader that has the schema projection applied
- akka: `ParquetStreams.fromParquet[YourSchema](path, options, filter)` is deprecated in favour of a builder with the same name: `ParquetStreams.fromParquet[YourSchema]`
- fs2:
  - the function `parquet.read` is deprecated in favour of the builder `parquet.fromParquet`
  - the `Builder` trait used in the API of `parquet.viaParquet` is moved to the `rotatingWriter` package
  - `withPreWriteTransformation` in `parquet.viaParquet` is replaced by `preWriteTransformation` for consistency
  - the redundant dependency of `parquet.viaParquet` on an implicit `Sync[F]` is removed, as it already depends on `Concurrent[F]`
  - `parquet.writeSingleFile` now returns `Stream[F, fs2.INothing]` instead of `Stream[F, Unit]` in order to emphasise that it doesn't emit anything
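For instance, in the core module, projection may look like this minimal sketch (the path and classes are hypothetical):

```scala
import com.github.mjakubowski84.parquet4s.ParquetReader

// The file was written with a wider schema, e.g.
// case class Item(id: Long, name: String, description: String);
// reading with a projection fetches only the columns of the narrower class.
case class ItemName(id: Long, name: String)

val names = ParquetReader.withProjection[ItemName].read("file:///data/items")
try names.foreach(println)
finally names.close()
```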
v1.5.1
This release contains a bug fix for the Akka module. `toParquetSingleFile` and `viaParquet` now close the underlying file writers in case of stream failure. Thanks to that, all writes executed so far are flushed and the probability of data loss is minimised.
v1.5.0
Release 1.5.0 introduces an integration of Parquet4S with FS2. It provides similar functionality to the integration with Akka Streams:
- read a file, directory or partitioned directory, with optional filter
- write a single file
- write an indefinite stream, optionally partitioned
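A minimal sketch of a single-file write with the FS2 module as of this release, in the cats-effect 2 style with a `Blocker`; the exact parameter order of `writeSingleFile` and the path are assumptions:

```scala
import cats.effect.{Blocker, ContextShift, IO}
import com.github.mjakubowski84.parquet4s.parquet
import fs2.Stream

import scala.concurrent.ExecutionContext

case class Event(id: Long, name: String)

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

// Write a bounded stream of events into a single Parquet file.
val write: IO[Unit] = Blocker[IO].use { blocker =>
  Stream
    .emits(Seq(Event(1L, "a"), Event(2L, "b")))
    .covary[IO]
    .through(parquet.writeSingleFile[IO, Event](blocker, "file:///data/events.parquet"))
    .compile
    .drain
}
```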
Additionally:
- Scala 2.12 is upgraded to 2.12.12
- Scala-collection-compat is upgraded to 2.2.0
v1.4.0
Two main features come with this release:
- Thanks to @mac01021, schema names are now, by default, determined from the canonical class name. In the case of generic records, the schema name comes from the provided original schema. Thanks to that, schemas are more descriptive and files created with Parquet4S are more compliant with the Avro reader. The signature of Parquet4S is now written into the file metadata (instead of into the schema name, as before).
- The `viaParquet` Akka flow now gets the ability to write generic `RowParquetRecord`, like the other writers.