Support DST timezones conversion for ORC by res-life · Pull Request #14544 · NVIDIA/cudf-spark

res-life · 2026-04-03T10:07:50Z

Depends on:

part 1: [Part1] [ORC-timezone]: Build ORC timezone metadata at runtime, drop pregenerated binary cudf-spark-jni#4539
part 2: TODO
part 3: TODO
part 4: TODO
part 5: TODO

Description

Context

ORC OSS uses java.util.TimeZone for timestamp rebasing; it does not use the java.time.ZoneId API.
The two APIs are not always consistent: java.util.TimeZone.getOffset and java.time.ZoneId.getRules().getOffset can return different offsets for the same instant.
cuDF has a java.time-compatible implementation, but no java.util-compatible one.

For more details:

import java.time.Instant;
import java.time.ZoneId;

static void testDiffBehaviorBetweenTwoAPIs() {
  // diff in `Africa/Casablanca` timezone at 6424721300000: 0 vs 3600000
  String tzId = "Africa/Casablanca";
  long epochMillis = 6424721300000L;
  // Offset according to the legacy java.util API
  int offsetMillis_1 = java.util.TimeZone.getTimeZone(tzId).getOffset(epochMillis);
  // Offset according to the java.time API
  int offsetMillis_2 = ZoneId.of(tzId, ZoneId.SHORT_IDS).getRules()
      .getOffset(Instant.ofEpochMilli(epochMillis)).getTotalSeconds() * 1000;
  if (offsetMillis_1 != offsetMillis_2) {
    // print: get diff!! 0 vs 3600000
    System.out.println("get diff!! " + offsetMillis_1 + " vs " + offsetMillis_2);
  }
}

Solution 1 [not feasible]: use cuDF with ignoreTimezoneInStripeFooter=False.

cuDF handles the ORC writer timezone during decode.
Problem: for far-future timestamps projected into the synthetic 400-year cycle, dates before the first synthetic DST transition incorrectly use the first cycle entry, which is the DST offset. This causes a +1 hour drift in winter, for example for America/Los_Angeles with years like 8770.
cuDF's implementation is compatible with java.time rather than java.util, so it cannot produce the results ORC expects.
So this solution is not feasible.

Solution 2: develop a kernel, use cuDF with ignoreTimezoneInStripeFooter=True

cuDF does not handle the ORC writer timezone; it decodes timestamps as UTC.
All timestamp rebasing is handled by a custom JNI kernel that is compatible with java.util.
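
A minimal Java sketch of the kind of wall-clock-preserving rebase a java.util-compatible kernel has to reproduce. The class and method names are hypothetical, and the two-step offset lookup is an assumption about how such a conversion is commonly written on the CPU; it is not code taken from ORC or from this PR.

```java
import java.util.TimeZone;

class OrcRebaseSketch {
  // Hypothetical helper: returns the instant whose wall-clock time in `reader`
  // matches the wall-clock time of `millis` in `writer`.
  static long convertBetweenTimezones(TimeZone writer, TimeZone reader, long millis) {
    long writerOffset = writer.getOffset(millis);
    long readerOffset = reader.getOffset(millis);
    long adjusted = millis + writerOffset - readerOffset;
    // The shift may cross a DST transition in the reader zone,
    // so look up the reader offset again at the shifted instant.
    long adjustedReaderOffset = reader.getOffset(adjusted);
    return millis + writerOffset - adjustedReaderOffset;
  }

  public static void main(String[] args) {
    TimeZone la = TimeZone.getTimeZone("America/Los_Angeles");
    TimeZone utc = TimeZone.getTimeZone("UTC");
    // Cross-timezone case: the value is shifted by the writer offset.
    System.out.println(convertBetweenTimezones(la, utc, 0L));
    // Same-timezone case: the value is returned unchanged.
    System.out.println(convertBetweenTimezones(la, la, 0L));
  }
}
```

Every offset lookup in the sketch goes through java.util.TimeZone.getOffset, which is the call a GPU kernel would need to match bit-for-bit to agree with the CPU path.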

changes

Remove the orc_timezone_info.data file and build the timezone info dynamically.
Why: a pre-built timezone info file cannot cover all Java versions, and the timezone info may change in the future.
Implement the java.util.TimeZone logic.
Add OrcTimezoneSuite to test reader/writer timezone combinations.

Related cuDF issue

ORC reader returns incorrect timestamp for epoch boundary values rapidsai/cudf#21993

perf number

Test	CPU avg (ms)	GPU avg (ms)	Speedup
cross-tz (LA→UTC)	23,662	13,815	1.71x
same-tz-baseline (LA→LA)	79,699	13,766	5.79x

How to run (see `OrcTimezonePerfSuite.scala`):

argLine="-DenableOrcTimeZonePerf=true \
         -DorcPerfWriterTZ=America/Los_Angeles \
         -DorcPerfReaderTZ=UTC \
         -DorcPerfRows=1073741824" \
mvn test -Dbuildver=350 \
  -DwildcardSuites=com.nvidia.spark.rapids.timezone.OrcTimezonePerfSuite

Checklists

This PR has added documentation for new or modified features or behaviors.
This PR has added new tests or modified existing tests to cover new code paths.
OrcTimezonePerfSuite covers the ORC timezone read path end-to-end (CPU vs GPU correctness + performance).
Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.
Yes, see the perf numbers above.

Checklists

I have reviewed my PR using AI tools and by myself.
This PR has added documentation for new or modified features or behaviors.
This PR has added new tests or modified existing tests to cover new code paths.
OrcTimezoneSuite covers the ORC reader/writer timezones conversion.
Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.
yes, for the perf number, please refer to the above section.

Signed-off-by: Chong Gao chongg@nvidia.com

Signed-off-by: Chong Gao <res_life@163.com>

greptile-apps · 2026-04-08T01:45:06Z

Greptile Summary

This PR adds full DST-aware cross-timezone support for ORC reads by implementing a java.util.TimeZone-compatible JNI kernel path instead of relying on cuDF's java.time-based (and incompatible) rebasing. It also removes the previous UTC-only restriction on ORC reads and writes.

Core conversion logic (GpuOrcTimezoneUtils): A new rebaseOrcTimestamps dispatcher routes to a fast same-TZ base-offset path or a cross-TZ JNI kernel path (GpuTimeZoneDB.convertOrcTimezones), correctly using ZoneId.of + SHORT_IDS to canonicalise ORC footer timezone IDs and throwing explicitly for truly unrecognised IDs.
Writer timezone plumbing (GpuOrcScan): writerTimezone is extracted from ORC stripe footers and threaded through OrcPartitionReaderContext, OrcExtraInfo, and OrcBlockMetaForSplitCheck, with a new writerTimezonesShareRules split-check guard.
Config / scan guard cleanup: Removes the experimental orcReadIgnoreWriterTimezone config, the UTC-only write-side guard, and the UTC-only read-side guard; adds OrcTimezoneSuite matrix tests and an optional perf benchmark.

Confidence Score: 4/5

The ORC cross-timezone conversion path is well-structured and safe to merge; the one concern is a minor inconsistency in the batch split-check helper that only affects non-standard timezone IDs.

The conversion logic in GpuOrcTimezoneUtils correctly avoids the silent-GMT fallback by using ZoneId.of. The new writerTimezonesShareRules function in GpuOrcScan uses TimeZone.getTimeZone for the batch split-check, which can silently map unrecognised timezone IDs to GMT, creating an inconsistency that could yield wrong timestamps for non-standard ORC writer timezone strings. Practical risk is low since standard IANA/SHORT_IDS timezone IDs are handled correctly by both paths.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala — the writerTimezonesShareRules helper at line 998 should use the same ZoneId-based comparison as hasSameTimezoneRules in GpuOrcTimezoneUtils.
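
The silent-GMT fallback mentioned above can be reproduced in isolation. The following is a small Java sketch with hypothetical class and method names; "Bogus/Zone" is a made-up ID used only to trigger the fallback.

```java
import java.time.DateTimeException;
import java.time.ZoneId;
import java.util.TimeZone;

class GmtFallbackSketch {
  // java.util.TimeZone never throws for an ID it cannot understand;
  // it returns the GMT zone instead.
  static String resolveWithTimeZone(String id) {
    return TimeZone.getTimeZone(id).getID();
  }

  // java.time.ZoneId rejects an unknown ID with a DateTimeException.
  static boolean acceptedByZoneId(String id) {
    try {
      ZoneId.of(id, ZoneId.SHORT_IDS);
      return true;
    } catch (DateTimeException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String unknown = "Bogus/Zone"; // made-up ID for illustration
    System.out.println(resolveWithTimeZone(unknown)); // falls back to GMT
    System.out.println(acceptedByZoneId(unknown));    // rejected
  }
}
```

A split-check built on TimeZone.getTimeZone would therefore treat any two unrecognised writer IDs as the same GMT zone, while a ZoneId-based check fails loudly.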

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcTimezoneUtils.scala	Adds rebaseOrcTimestamps dispatcher and rebaseWithWriterTimezone cross-TZ path; correctly uses ZoneId.of+SHORT_IDS for ORC footer IDs and builds per-table JNI context; memory management looks correct.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala	Propagates writerTimezone through OrcPartitionReaderContext / OrcExtraInfo / OrcBlockMetaForSplitCheck; adds split-check writerTimezonesShareRules that uses TimeZone.getTimeZone (silent GMT fallback) inconsistently with hasSameTimezoneRules in GpuOrcTimezoneUtils.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala	Removes the now-superseded ORC_READ_IGNORE_WRITE_TIMEZONE experimental config and its accessor; clean removal with no remaining references.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala	Removes the UTC-only write-side timezone guard; correct since the GPU ORC writer uses UTC internally and the reader now handles cross-TZ conversion.
tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezoneSuite.scala	New correctness test for all writer x reader TZ combinations including legacy IDs; fixed seed avoids non-determinism; uses withTempPath for cleanup.
tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezonePerfSuite.scala	New optional perf benchmark gated by system property; uses per-run UUID suffix to avoid cross-run conflicts.
integration_tests/src/main/python/orc_test.py	Removes non_utc_allow_orc_scan fallback allowances and renames the non-UTC timezone integration test.

Reviews (6): Last reviewed commit: "Compare ORC writer/reader timezones via ..."

Signed-off-by: Chong Gao <res_life@163.com>


firestarman · 2026-04-24T06:34:09Z

Can you address the comments from greptile first?

Replace scala.util.Random.nextLong() with a fixed seed (42L) so the ORC timezone matrix tests are reproducible. A non-deterministic seed risks intermittent failures if a generated timestamp happens to land near a DST boundary that exposes a latent bug. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
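
As an illustration of why a fixed seed makes the generated test data reproducible, here is a short Java sketch. The class and method names are hypothetical; the test suite itself is written in Scala and is not shown here.

```java
import java.util.Arrays;
import java.util.Random;

class FixedSeedSketch {
  // Generates `n` pseudo-random epoch-millisecond values from a given seed.
  static long[] sampleEpochMillis(long seed, int n) {
    Random rng = new Random(seed);
    long[] out = new long[n];
    for (int i = 0; i < n; i++) {
      out[i] = rng.nextLong();
    }
    return out;
  }

  public static void main(String[] args) {
    long[] first = sampleEpochMillis(42L, 5);
    long[] second = sampleEpochMillis(42L, 5);
    // The same seed yields the same sequence on every run.
    System.out.println(Arrays.equals(first, second));
  }
}
```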

Replace the hardcoded "/tmp/tmp_OrcTimezonePerfSuite" path with a per-run path under java.io.tmpdir suffixed with a UUID, so concurrent runs on the same host (e.g., CI matrix) cannot corrupt each other's data. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
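
A per-run path of this shape can be built as in the Java sketch below. The helper name is hypothetical and the prefix is only borrowed from the commit message; the actual suite is Scala and may construct the path differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

class PerRunTempPathSketch {
  // Builds a path under java.io.tmpdir with a random UUID suffix,
  // so two concurrent runs never share a directory.
  static Path perRunPath(String prefix) {
    String tmpDir = System.getProperty("java.io.tmpdir");
    return Paths.get(tmpDir, prefix + "_" + UUID.randomUUID());
  }

  public static void main(String[] args) {
    System.out.println(perRunPath("tmp_OrcTimezonePerfSuite"));
    System.out.println(perRunPath("tmp_OrcTimezonePerfSuite"));
  }
}
```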

ORC footers can contain legacy/short timezone IDs like "PST", "CST", or "ACT". On JDK 21 ZoneId.of(...) rejects these, while java.util.TimeZone.getTimeZone(...) still accepts them, so the cross-TZ GPU path could fail on files the CPU path reads cleanly. Route the writer timezone through TimeZone.getTimeZone(...).toZoneId once at the entry point so downstream code (the same-rules check, the JNI kernel, and the writer base-offset computation) all see a canonical ZoneId ID consistent with the java.util.TimeZone semantics ORC uses. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>

buildOutputStripes was using raw string equality on the per-stripe writer timezone, so a file whose stripes carry semantically equivalent but differently spelled IDs (e.g. "US/Pacific" vs "America/Los_Angeles", "UTC" vs "GMT", or "" alongside an explicit JVM-default ID) would be rejected with an IOException even though the CPU path reads it fine. Group the collected IDs through java.util.TimeZone.hasSameRules and only fail when the underlying rules actually differ. When all stripes agree, prefer the first non-empty ID so downstream code keeps an explicit timezone string. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
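
The grouping described in this commit can be sketched in Java as follows. All names are hypothetical and the plugin code is Scala; the sketch only illustrates deduplicating IDs by java.util.TimeZone.hasSameRules, with an empty ID standing for the JVM default and an explicit ID preferred as the representative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TimeZone;

class StripeTimezoneGroupingSketch {
  // An empty writer timezone ID is treated as the JVM default timezone.
  static TimeZone resolve(String id) {
    return id.isEmpty() ? TimeZone.getDefault() : TimeZone.getTimeZone(id);
  }

  // Keeps one representative ID per distinct rule set.
  static List<String> distinctByRules(List<String> ids) {
    List<String> reps = new ArrayList<>();
    for (String id : ids) {
      boolean matched = false;
      for (int i = 0; i < reps.size(); i++) {
        if (resolve(reps.get(i)).hasSameRules(resolve(id))) {
          // Prefer an explicit ID over the empty placeholder.
          if (reps.get(i).isEmpty() && !id.isEmpty()) {
            reps.set(i, id);
          }
          matched = true;
          break;
        }
      }
      if (!matched) {
        reps.add(id);
      }
    }
    return reps;
  }

  public static void main(String[] args) {
    // Aliases collapse into one group; a real mismatch leaves two groups.
    System.out.println(distinctByRules(Arrays.asList("US/Pacific", "America/Los_Angeles")));
    System.out.println(distinctByRules(Arrays.asList("America/Los_Angeles", "Asia/Shanghai")));
  }
}
```

A caller would fail only when more than one representative remains after grouping.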

The matrix only exercised canonical region IDs, so the new ORC read path's behavior around legacy/alias timezone strings (e.g. "PST", "US/Pacific") was untested. java.util.TimeZone accepts these and they can show up in real ORC footers, but ZoneId.of rejects them on JDK 21. Add "US/Pacific" (alias of "America/Los_Angeles") and "PST" (legacy short ID) to the writer/reader matrix so the alias-equivalence path in the stripe-timezone check and the legacy-ID normalization in the cross-TZ rebase are both covered. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>

The wildcard `import org.apache.spark.sql._` already pulls the project's own `org.apache.spark.sql.FileUtils` object into scope (the same pattern used by `TimeZonePerfSuite`), but greptile's static analysis flags the call as an unresolved symbol. Add an explicit import so the dependency on `FileUtils.deleteRecursively` is obvious to readers and tools. Signed-off-by: Chong Gao <chongg@nvidia.com>

firestarman

One low-priority nit from my pass.

Signed-off-by: Chong Gao <res_life@163.com>

…check Per review: the intra-file check in buildOutputStripes already compares stripe writer timezones via hasSameRules, but the multi-file path in isNeedToSplitDataBlock was still using raw string equality. That forces semantically equivalent IDs (e.g. US/Pacific vs America/Los_Angeles, "" vs JVM default) into separate batches across files, hurting batching efficiency without a correctness benefit. Signed-off-by: Chong Gao <res_life@163.com>

Per review: ZoneId.SHORT_IDS rewrites EST/MST/HST to numeric offsets (-05:00/-07:00/-10:00). TimeZone.getTimeZone then silently maps those unrecognized ids to GMT, so hasSameTimezoneRules("-05:00", "UTC") was returning true and skipping the rebase entirely — EST/MST/HST footers were silently read as UTC. Compare ZoneId.getRules directly. Both inputs are already canonical ZoneId ids at the call site, so ZoneId.of will not throw. Signed-off-by: Chong Gao <res_life@163.com>
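
The false match described in this commit can be reproduced with a few lines of Java. The class and method names are hypothetical; the two helpers only contrast a comparison through java.util.TimeZone with one through ZoneId.getRules, using the "-05:00" ID quoted in the commit message.

```java
import java.time.ZoneId;
import java.util.TimeZone;

class ShortIdRulesSketch {
  // Comparison through java.util.TimeZone: an ID such as "-05:00" is not
  // understood by TimeZone.getTimeZone and silently becomes GMT.
  static boolean sameRulesViaTimeZone(String a, String b) {
    return TimeZone.getTimeZone(a).hasSameRules(TimeZone.getTimeZone(b));
  }

  // Comparison through java.time: both IDs must already be valid ZoneId IDs.
  static boolean sameRulesViaZoneId(String a, String b) {
    return ZoneId.of(a).getRules().equals(ZoneId.of(b).getRules());
  }

  public static void main(String[] args) {
    // How this JDK normalizes the short ID "EST" (the result depends on the JDK).
    System.out.println(ZoneId.of("EST", ZoneId.SHORT_IDS).getId());
    System.out.println(sameRulesViaTimeZone("-05:00", "UTC")); // false match via GMT fallback
    System.out.println(sameRulesViaZoneId("-05:00", "UTC"));   // rules differ
  }
}
```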

revans2

Generally it looks good. My main comment is that it would really be nice to have the code parse/normalize the time zones once, and then pass them around instead of passing strings around. It gets really confusing to know if this string has been normalized or not, if it has then we don't need to worry about short ids, if it has not, then we do...

revans2 · 2026-05-20T14:01:18Z

+# The `spark.sql.session.timeZone` here does not impact reader and writer timezone, but any way, we test it.
+# For the tests that reader and writer timezones are different, refer to `OrcTimezoneSuite`
 @pytest.mark.parametrize("reader_confs", reader_opt_confs, ids=idfn)
-# Setting end timestamp as None almost always generate ts >= 2200 year.


Why delete the comments that explain why we are setting the start and end timestamp to what they are? Have we tested outside of this range recently? Do we have a follow on issue to fix the range limitations?

revans2 · 2026-05-20T15:08:51Z

+        val readerDefaultTz = java.util.TimeZone.getDefault
+        val zones = distinctTzs.map { tz =>
+          if (tz.isEmpty) readerDefaultTz else java.util.TimeZone.getTimeZone(tz)
+        }


Even though this section is small it is duplicated at least here and at writerTimezonesShareRules. It might be worth trying to make the conversion to a timezone a standard function. Also this is deduping time zones similarly. It might be nice to also make that generic.

revans2 · 2026-05-20T15:11:30Z

+   * @param writerTimezone the writer timezone from the ORC stripe footer
+   * @return table with rebased timestamp columns; input is closed
+   */
+  def rebaseOrcTimestamps(input: Table, writerTimezone: String): Table = {


Would it be cleaner to pass around a ZoneId instead of the string for the writer everywhere? It looks like we try to resolve it in multiple places. Might be nice to do it once.

revans2 · 2026-05-20T15:13:27Z

+      readerTz
+    } else {
+      try {
+        ZoneId.of(writerTimezone, ZoneId.SHORT_IDS).getId


Are these not rejected by java.util.TimeZone.getTimeZone? Are the different ways of parsing/normalizing them consistent?

nvauto · 2026-05-25T05:03:22Z

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

res-life mentioned this pull request Apr 3, 2026

Support DST timezones conversion for ORC NVIDIA/cudf-spark-jni#4432

Draft

3 tasks

res-life changed the title ~~Optimize ORC timezone rebasing and add perf test suite~~ Support non-DST(Daylight Saving Time) timezone for ORC datasource Apr 3, 2026

res-life changed the title ~~Support non-DST(Daylight Saving Time) timezone for ORC datasource~~ Support DST(Daylight Saving Time) timezone for ORC datasource Apr 7, 2026

res-life changed the title ~~Support DST(Daylight Saving Time) timezone for ORC datasource~~ Support DST timezones conversion for ORC Apr 7, 2026

res-life force-pushed the orc-tz branch from e6fdddc to ee288d8 Compare April 7, 2026 08:06

Chong Gao added 2 commits April 7, 2026 17:04

Support DST timezones for ORC

183176d

Signed-off-by: Chong Gao <res_life@163.com>

Update tests

3e4fca1

Signed-off-by: Chong Gao <res_life@163.com>

res-life force-pushed the orc-tz branch from ee288d8 to 3e4fca1 Compare April 7, 2026 10:24

Update tests

181c5e7

Signed-off-by: Chong Gao <res_life@163.com>

res-life marked this pull request as ready for review April 8, 2026 01:40

greptile-apps Bot reviewed Apr 8, 2026

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezoneSuite.scala Outdated

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezonePerfSuite.scala Outdated

res-life requested review from jihoonson and revans2 April 8, 2026 01:48

Update tests

c969c75

Signed-off-by: Chong Gao <res_life@163.com>

sameerz added the feature request New feature or request label Apr 21, 2026

firestarman reviewed Apr 24, 2026

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcTimezoneUtils.scala

firestarman reviewed Apr 24, 2026

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala Outdated

firestarman reviewed Apr 24, 2026

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezoneSuite.scala Outdated

Chong Gao added 5 commits May 6, 2026 13:21

greptile-apps Bot reviewed May 6, 2026

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezonePerfSuite.scala

greptile-apps Bot reviewed May 7, 2026

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcTimezoneUtils.scala

firestarman reviewed May 8, 2026

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala Outdated

Chong Gao added 2 commits May 9, 2026 17:38

Fix comment

ebabd91

Signed-off-by: Chong Gao <res_life@163.com>

greptile-apps Bot reviewed May 9, 2026

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcTimezoneUtils.scala

revans2 reviewed May 20, 2026

5 participants
