Releases: mjakubowski84/parquet4s
v2.7.1
The release contains many minor dependency updates not included in v2.7.0.
Additionally, thanks to an update of Cats Effect and multiple refactorings, the FS2 module's performance improved significantly. The biggest improvement can be observed in reading.
v2.7.0
The release introduces multiple changes to the core library and its downstream modules. The changes are concentrated around support for INT64 timestamps:
- Out-of-the-box support for reading all versions of INT64 timestamps if projection and filtering are not in use.
- Support for writing and reading with projection and filtering (see the sketch after this list):
  - INT64 micros by importing `import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Micros._`
  - INT64 millis by importing `import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Millis._`
  - INT64 nanos by importing `import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Nanos._`

  All the aforementioned imports include value codecs, filter codecs and schemas.
- Fixes in internal transformations of `java.sql.Timestamp` so that they do not rely on the system timezone anymore. Parquet4s uses solely the timezone provided in the options.
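A minimal sketch of reading and writing INT64 micros with filtering, assuming a hypothetical `Event` schema and path:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, ParquetWriter, Path}
// Brings INT64 micros value codecs, filter codecs and schemas into scope:
import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Micros._

import java.sql.Timestamp

// Hypothetical schema
case class Event(id: Int, createdAt: Timestamp)

object TimestampExample extends App {
  val path = Path("/tmp/events.parquet")

  // createdAt is written as an INT64 timestamp in microseconds.
  ParquetWriter
    .of[Event]
    .writeAndClose(path, Seq(Event(1, Timestamp.valueOf("2023-01-01 12:00:00"))))

  // Filtering on the timestamp column works thanks to the imported filter codecs.
  val recent = ParquetReader
    .as[Event]
    .filter(Col("createdAt") >= Timestamp.valueOf("2023-01-01 00:00:00"))
    .read(path)
  try recent.foreach(println)
  finally recent.close()
}
```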
v2.6.0
Numerous dependency updates. Most notable:
- Scala 2.12.x updated to 2.12.16
- Scala 3.1.x updated to 3.1.2
- Parquet updated to 1.12.3
- FS2 updated to 3.2.8
- Cats Effect updated to 3.3.12
Bugfix:
- An NPE thrown by parquet-hadoop when closing a writer under some conditions is now ignored, so library users will not see misleading exceptions in their logs.
API changes:
- Due to the update of FS2, `Pipe`s in the writer now return `Nothing` in place of the deprecated `fs2.INothing`. That is a small breaking API change, but as `fs2.INothing` is bounded by `Nothing`, the change should not be a problem for library users.
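For illustration only, a minimal FS2 sketch (not the Parquet4s API) of a writer-like pipe whose output type is `Nothing`:

```scala
import cats.effect.{IO, IOApp}
import fs2.{Pipe, Stream}

object NothingPipeExample extends IOApp.Simple {

  // A sink-like pipe: it consumes elements and emits nothing,
  // hence the output type Nothing (previously fs2.INothing).
  def logSink[T]: Pipe[IO, T, Nothing] =
    _.evalMap(t => IO.println(t)).drain

  val run: IO[Unit] =
    Stream(1, 2, 3).covary[IO].through(logSink).compile.drain
}
```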
v2.5.1
Release 2.5.1 addresses a reported issue of a failing `postStop` action on `viaParquet` in the Akka module. By design, Akka is supposed to not allow concurrent calls to a flow's logic; however, the reported error could only be caused by that. In order to mitigate the problem, the state of the flow is now held in a concurrent map.
v2.5.0
Release 2.5.0 continues the evolution of `ParquetIterable`. By taking advantage of the previously introduced compound iterables, support for reading partitioned data is now introduced to the core module. Unlike in the Akka and FS2 modules, reading partitions must be enabled explicitly. Such an approach is chosen because looking for partitions adds an I/O overhead, which is unwelcome in low-level libraries. You can enable the new feature by just calling the `partitioned` switch in the builder:
```scala
ParquetReader.as[YourSchema].partitioned.read(yourPath)
```
Moreover, the number of experimental ETL features is growing. A convenient way of writing datasets is added: you can now call `writeAndClose` on `ParquetIterable` directly to write the dataset and release all open resources. This makes the ETL DSL clean and more readable.
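A minimal sketch combining both features, assuming a hypothetical `User` schema and paths, and that `writeAndClose` accepts a target path with default writer options:

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}

// Hypothetical schema; the part column comes from the directory
// structure, e.g. /data/users/part=a/file.parquet
case class User(id: Long, name: String, part: String)

object EtlExample extends App {
  // Reads the partitioned dataset and writes it to a single file,
  // releasing all resources opened by the read.
  ParquetReader
    .as[User]
    .partitioned
    .read(Path("/data/users"))
    .writeAndClose(Path("/tmp/users.parquet"))
}
```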
v2.4.1
Release 2.4.1 introduces several minor improvements to reading partitioned data.
- More special characters are allowed in partition names. The following characters are now accepted: `?`, `+`, `,`, `&`, `$`, `:`, `;` and `/`.
- Do not attempt to read empty directories in the Akka module.
- Minor dependency updates.
v2.4.0
This is a major release that brings important improvements and fixes.
TypedSchemaDef fix
A long time ago it seemed to be a great idea to make `TypedSchemaDef` a tagged type alias for `SchemaDef`. And as it was used only by `ParquetSchemaResolver`, it was defined in its scope, with all implicit implementations defined inside the companion object of `ParquetSchemaResolver`. This design didn't change much for a long time, but it had a big flaw: it was not a proper type class. As the implicit implementations were not defined in a companion object of `TypedSchemaDef`, users could encounter problems with ambiguous implicits when defining their own schema definitions.
In this release `TypedSchemaDef` is turned into a proper type class with its own trait and companion object. `ParquetSchemaResolver.TypedSchemaDef` is left as an alias to the new trait but is marked as deprecated. All implementations are moved to the companion object of `TypedSchemaDef`, so if you are referencing the provided schema definitions explicitly, then this is a breaking change for you. However, the change is necessary to bring the improvement.
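As a sketch, a user-defined schema for a hypothetical custom type might now look like this (a complete custom type also needs codecs; see the documentation):

```scala
import com.github.mjakubowski84.parquet4s.{SchemaDef, TypedSchemaDef}
import org.apache.parquet.schema.{LogicalTypeAnnotation, PrimitiveType}

// Hypothetical custom type stored as a UTF-8 string
case class CustomType(value: String)

object CustomSchema {
  // Built-in instances now live in the companion object of TypedSchemaDef,
  // so a user-defined instance like this no longer risks ambiguous implicits.
  implicit val customTypeSchema: TypedSchemaDef[CustomType] =
    SchemaDef
      .primitive(
        primitiveType = PrimitiveType.PrimitiveTypeName.BINARY,
        required = true,
        logicalTypeAnnotation = Option(LogicalTypeAnnotation.stringType())
      )
      .typed[CustomType]
}
```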
Chunks in FS2
FS2 suggests processing stream elements in chunks for the best performance. Until now, Parquet4s was not following this advice. That had the advantage of keeping the code simple and allowed library users to take an action per each processed element. However, this approach had a big impact on performance, which was especially visible when reading or writing local files.
Now you are able to choose whether you want to process your data in chunks or not. By default, `fromParquet` and `viaParquet` use chunks of size `16`. To keep the previous behaviour, change it to `1` using the `chunkSize` property, or increase it if you are not afraid of higher memory consumption.
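A minimal FS2 sketch of setting the chunk size when reading, assuming a hypothetical `Data` schema and path:

```scala
import cats.effect.{IO, IOApp}
import com.github.mjakubowski84.parquet4s.Path
import com.github.mjakubowski84.parquet4s.parquet.fromParquet

// Hypothetical schema
case class Data(id: Long, text: String)

object ChunkedReadExample extends IOApp.Simple {
  val run: IO[Unit] =
    fromParquet[IO]
      .as[Data]
      .chunkSize(16) // the new default; set to 1 to restore the old per-element behaviour
      .read(Path("/tmp/data.parquet"))
      .evalMap(data => IO.println(data))
      .compile
      .drain
}
```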
Simple `write` uses chunking provided by the upstream and remains unchanged.
Local benchmarks show that, with chunks of 16 elements, reading speed increased by up to 10x, while `viaParquet` is ~2x faster than in previous releases of Parquet4s.
Upgraded dependencies
- Shapeless upgraded to 2.3.9
- Akka upgraded to 2.6.19
- FS2 upgraded to 3.2.7
- Cats Effect upgraded to 3.3.9
v2.3.0
- Upgrades Scala 3 version to 3.1.1
- Adds Scala 3 version of Akka integration module
- Upgrades FS2 to version 3.2.4
- Upgrades Cats Effect to 3.3.5
- Several minor performance improvements in FS2 module
v2.2.0
This is a maintenance release.
The feature of custom types (including custom codecs and schemas) was revised and improved. It turned out that, due to insufficient tests, the functionality deteriorated with each library release and each new Scala version; the severity depended on the Scala version. The problem mostly manifested when a case class was used as the definition of a custom type, and Parquet4s was unable to resolve custom implicit codecs and schemas. Now the problem is solved in each supported Scala version.
Moreover, in some JDK 8 distributions an internal error is thrown when `getSimpleName` or `getCanonicalName` is called on a deeply nested Scala class. Those functions are used internally by Parquet4s. In order to better explain the nature of the problem to the user, the internal error is now caught, a warning with feedback is logged, and a failover action is taken.
v2.1.0
Version 2.1.0 brings new features and further improvements to Parquet4s.
Column projection
Available in all modules of Parquet4s: core, akka and fs2. It makes generic reads much more powerful: you can use an SQL-like DSL to define a projection over specific columns, extract values from nested fields, and use aliases. Please refer to the documentation and examples to learn more.
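A minimal sketch of a generic read with projection, assuming hypothetical column names and a hypothetical path:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

object ProjectionExample extends App {
  // Projects two columns: a top-level field and a nested field with an alias.
  val records = ParquetReader
    .projectedGeneric(
      Col("id").as[Long],
      Col("address.city").as[String].alias("city")
    )
    .read(Path("/data/users"))
  try records.foreach(println)
  finally records.close()
}
```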
(Experimental) joins and concats
The implementation of `ParquetIterable` undergoes an evolution that will eventually simplify ETL operations on your datasets. In the future it will allow reading partitioned datasets in the core module, but for now it introduces join and concat operations on datasets. Please refer to the documentation and examples for more information.
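A minimal sketch of joining and concatenating datasets, assuming generic reads and hypothetical column names and paths:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

object JoinExample extends App {
  val left  = ParquetReader.generic.read(Path("/data/users"))
  val right = ParquetReader.generic.read(Path("/data/orders"))

  // Inner join of the two datasets on matching columns.
  val joined = left.innerJoin(right, onLeft = Col("id"), onRight = Col("userId"))

  // Concatenation of two datasets sharing the same schema.
  val both = left.concat(right)

  try joined.foreach(println)
  finally {
    left.close()
    right.close()
  }
}
```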