
Releases: mjakubowski84/parquet4s

v2.7.1

15 Dec 11:05

The release contains many minor dependency updates not included in v2.7.0.

Additionally, thanks to an update of Cats Effect and multiple refactorings, the performance of the FS2 module has improved significantly. The biggest improvement can be observed in reading.

v2.7.0

12 Dec 18:24

The release introduces multiple changes to the core library and its downstream modules. The changes centre on support for INT64 timestamps:

  • Out-of-the-box support for reading all versions of INT64 timestamps when projection and filtering are not in use.
  • Support for writing and reading with projection and filtering:
    • INT64 micros via import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Micros._
    • INT64 millis via import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Millis._
    • INT64 nanos via import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Nanos._
      All of the aforementioned imports include value codecs, filter codecs, and schemas.
  • Fixes to the internal transformations of java.sql.Timestamp so that they no longer rely on the system timezone. Parquet4s now uses solely the timezone provided in the options.
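As a hedged sketch of how the new imports might be used (the schema, file path, and filter column here are illustrative assumptions, not from the release notes; the builder's filter API may differ in detail):

```scala
// Sketch: enabling INT64 micros codecs, filter codecs and schemas
// for reading with filtering. Event, occurredAt and the path are hypothetical.
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}
import com.github.mjakubowski84.parquet4s.TimestampFormat.Implicits.Micros._

case class Event(name: String, occurredAt: java.sql.Timestamp)

// Filter on an INT64-micros timestamp column while reading
val events = ParquetReader
  .as[Event]
  .filter(Col("occurredAt") >= java.sql.Timestamp.valueOf("2022-12-01 00:00:00"))
  .read(Path("events.parquet"))
```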

v2.6.0

16 Jun 13:24

This release contains numerous dependency updates. The most notable are:

  • Scala 2.12.x updated to 2.12.16
  • Scala 3.1.x updated to 3.1.2
  • Parquet updated to 1.12.3
  • FS2 updated to 3.2.8
  • Cats Effect updated to 3.3.12

Bugfix:

  • An NPE thrown by parquet-hadoop when closing a writer under some conditions is now ignored, so library users will no longer see misleading exceptions in their logs.

API changes:

  • Due to the update of FS2, the writer's Pipe now returns Nothing in place of the deprecated fs2.INothing. This is a small breaking API change, but as fs2.INothing is bounded by Nothing, it should not be a problem for library users.

v2.5.1

04 May 11:22

Release 2.5.1 addresses a reported issue of a failing postStop action on viaParquet in the Akka module. By design, Akka should not allow concurrent calls to a flow's logic; however, the reported error could only have been caused by such calls. To mitigate the problem, the flow's state is now held in a concurrent map.

v2.5.0

24 Apr 17:17

Release 2.5.0 continues the evolution of ParquetIterable. Taking advantage of the previously introduced compound iterables, support for reading partitioned data is now available in the core module. Unlike in the Akka and FS2 modules, reading partitions must be enabled explicitly. This approach was chosen because looking for partitions adds I/O overhead, which is unwelcome in low-level libraries. You can enable the new feature by simply calling the partitioned switch in the builder:

ParquetReader.as[YourSchema].partitioned.read(yourPath)

Moreover, the number of experimental ETL features is growing. A convenient way of writing datasets has been added: you can now call writeAndClose on ParquetIterable directly to write the dataset and release all open resources. This makes the ETL DSL cleaner and more readable.
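Putting the two features together, a read-transform-write pipeline might look like the following hedged sketch (the User schema and paths are hypothetical; exact signatures may differ from this illustration):

```scala
// Sketch: read a partitioned dataset in the core module, then write it out
// and release all resources in one call with the experimental writeAndClose.
import com.github.mjakubowski84.parquet4s.{ParquetReader, ParquetWriter, Path}

case class User(id: Long, name: String)

ParquetReader
  .as[User]
  .partitioned                 // explicitly enable partition discovery
  .read(Path("in/users"))
  .writeAndClose(Path("out/users.parquet"), ParquetWriter.Options())
```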

v2.4.1

09 Apr 16:38

Release 2.4.1 introduces several minor improvements to reading partitioned data.

  1. More special characters are allowed in partition names. The following characters are now accepted: ?, +, ,, &, $, :, ;, / and . (dot). Beware that the file system you use may not allow such characters in directory names, or that they may require special treatment.
  2. The Akka module no longer attempts to read empty directories.
  3. Minor dependency updates.

v2.4.0

02 Apr 13:27

This is a major release that brings important improvements and fixes.

TypedSchemaDef fix

A long time ago it seemed like a great idea to make TypedSchemaDef a tagged type alias for SchemaDef. As it was used only by ParquetSchemaResolver, it was defined in its scope, and all implicit implementations were defined inside the companion object of ParquetSchemaResolver. This design did not change much for a long time, but it had a big flaw: it was not a proper type class. Because the implicit implementations were not defined in a companion object of TypedSchemaDef, users could encounter ambiguous-implicit problems when defining their own schema definitions.

In this release TypedSchemaDef is turned into a proper type class with its own trait and companion object. ParquetSchemaResolver.TypedSchemaDef is left as an alias to the new trait but is marked as deprecated. All implementations are moved to the companion object of TypedSchemaDef, so if you reference the provided schema definitions explicitly, this is a breaking change for you. However, this change is necessary to bring the improvement.
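The reason this fixes the ambiguity is a general Scala rule, not anything parquet4s-specific: implicits in a type class's own companion object live in the implicit scope, which has lower priority than locally defined or imported implicits. A minimal generic sketch (Schema here is a stand-in, not the real TypedSchemaDef):

```scala
// Generic illustration of the proper type-class pattern.
trait Schema[A] { def describe: String }

object Schema {
  // Default instance in the companion object: found automatically,
  // no import required, and never ambiguous with user-defined instances.
  implicit val intSchema: Schema[Int] = new Schema[Int] { def describe = "INT32" }
}

object Demo {
  def schemaOf[A](implicit s: Schema[A]): String = s.describe

  // Companion instance resolves without any import
  val default: String = schemaOf[Int]

  // A user-defined local instance simply takes precedence - no ambiguity
  val overridden: String = {
    implicit val customInt: Schema[Int] = new Schema[Int] { def describe = "INT64" }
    schemaOf[Int]
  }
}
```

With the old design, the provided instances sat in ParquetSchemaResolver's companion instead, so a user's own instance could compete with them at the same priority and trigger an ambiguity error.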

Chunks in FS2

FS2 recommends processing stream elements in chunks for the best performance. So far, Parquet4s had not followed this advice. That had the advantage of keeping the code simple and allowed library users to take an action per processed element. However, this approach had a big impact on performance, which was especially visible when reading or writing local files.

Now you are able to choose whether or not to process your data in chunks. By default, fromParquet and viaParquet use chunks of size 16. To keep the previous behaviour, change it to 1 using the chunkSize property, or increase it if you are not afraid of higher memory consumption.
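A hedged sketch of tuning the chunk size on a read (the Record schema and path are illustrative; the builder method placement may differ slightly from this sketch):

```scala
// Sketch: overriding the default chunk size of 16 in the FS2 module.
import cats.effect.IO
import com.github.mjakubowski84.parquet4s.Path
import com.github.mjakubowski84.parquet4s.parquet.fromParquet

case class Record(value: Long)

val records = fromParquet[IO]
  .as[Record]
  .chunkSize(64) // larger chunks: faster reads, higher memory consumption
  .read(Path("data.parquet"))
```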

Simple write uses chunking provided by the upstream and remains unchanged.

Local benchmarks show that with chunks of 16 elements, reading speed increased by up to 10x, while viaParquet is ~2x faster than in previous releases of Parquet4s.

Upgraded dependencies

  1. Shapeless upgraded to 2.3.9
  2. Akka upgraded to 2.6.19
  3. FS2 upgraded to 3.2.7
  4. Cats Effect upgraded to 3.3.9

v2.3.0

21 Feb 07:58
  1. Upgrades Scala 3 version to 3.1.1
  2. Adds Scala 3 version of Akka integration module
  3. Upgrades FS2 to version 3.2.4
  4. Upgrades Cats Effect to 3.3.5
  5. Several minor performance improvements in FS2 module

v2.2.0

09 Feb 16:35

This is a maintenance release.

The feature of custom types (including custom codecs and schemas) was revised and improved. It turned out that, due to insufficient tests, the functionality had deteriorated with each library release and each new Scala version; severity depended on the Scala version. The problem mostly manifested when using a case class as the definition of a custom type: Parquet4s was unable to resolve custom implicit codecs and schemas. The problem is now solved in every supported Scala version.

Moreover, in some JDK 8 distributions an internal error is thrown when getSimpleName or getCanonicalName is called on a deeply nested Scala class. These functions are used internally by Parquet4s. To give the user a better explanation of the nature of the problem, the internal error is now caught, a warning with feedback is logged, and a failover action is taken.

v2.1.0

30 Dec 11:29

Version 2.1.0 brings new features and further improvements to Parquet4s.

Column projection

Available in all modules of Parquet4s: core, akka and fs2. It makes generic reads much more powerful, allowing you to use an SQL-like DSL to define a projection over specific columns, extract values from nested fields, and use aliases. Please refer to the documentation and examples to learn more.
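A hedged sketch of what a projected generic read might look like in the core module (the column names and path are hypothetical, and the exact builder method may differ from this illustration):

```scala
// Sketch: projecting specific columns in a generic read,
// extracting a nested field and giving it an alias.
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

val records = ParquetReader
  .projectedGeneric(
    Col("name").as[String],
    Col("user.id").as[Long].alias("userId") // nested field with an alias
  )
  .read(Path("users.parquet"))
```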

(Experimental) joins and concats

The implementation of ParquetIterable is undergoing an evolution that will eventually simplify ETL operations on your datasets, and in the future it will allow reading partitioned datasets in the core module. For now, it introduces join and concat operations on datasets. Please refer to the documentation and examples for more information.
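A hedged sketch of the experimental join and concat operations (the schemas, paths, and join-key columns are hypothetical, and the method names and signatures may differ from this illustration):

```scala
// Sketch: joining two datasets on a key column and concatenating two
// datasets of the same schema via ParquetIterable.
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

case class User(id: Long, name: String)
case class Order(userId: Long, total: BigDecimal)

val users     = ParquetReader.as[User].read(Path("users.parquet"))
val moreUsers = ParquetReader.as[User].read(Path("users-archive.parquet"))
val orders    = ParquetReader.as[Order].read(Path("orders.parquet"))

// Join two datasets on a column...
val joined = users.innerJoin(orders, onLeft = Col("id"), onRight = Col("userId"))
// ...or concatenate datasets of the same schema
val allUsers = users.concat(moreUsers)
```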