
Support DST timezones conversion for ORC#14544

Open
res-life wants to merge 13 commits into NVIDIA:main from res-life:orc-tz

Conversation

@res-life

@res-life res-life commented Apr 3, 2026

Collaborator

Fixes #13437.

Depends on:

Description

Context

ORC OSS rebases timestamps with java.util.TimeZone; it does not use the java.time.ZoneId API.
The two APIs are not always consistent: java.util.TimeZone.getOffset and the java.time.ZoneId rules can return different offsets for the same instant.
cuDF has a java.time-compatible implementation, but no java.util-compatible one.

Example of the difference:
import java.time.Instant;
import java.time.ZoneId;

static void testDiffBehaviorBetweenTwoAPIs() {
  // In `Africa/Casablanca` at epoch millis 6424721300000 the two APIs disagree: 0 vs 3600000
  String tzId = "Africa/Casablanca";
  long epochMillis = 6424721300000L;
  // Offset according to the legacy java.util API
  int offsetMillis_1 = java.util.TimeZone.getTimeZone(tzId).getOffset(epochMillis);
  // Offset according to the java.time API
  int offsetMillis_2 = ZoneId.of(tzId, ZoneId.SHORT_IDS).getRules()
      .getOffset(Instant.ofEpochMilli(epochMillis)).getTotalSeconds() * 1000;
  if (offsetMillis_1 != offsetMillis_2) {
    // prints: get diff!! 0 vs 3600000
    System.out.println("get diff!! " + offsetMillis_1 + " vs " + offsetMillis_2);
  }
}

Solution 1 [not feasible]: use cuDF with ignoreTimezoneInStripeFooter=False.

cuDF handles the ORC writer timezone during decode.
Problem: for far-future timestamps projected into the synthetic 400-year cycle, dates before the first synthetic DST transition incorrectly use the first cycle entry, which is the DST offset. This causes a +1 hour drift in winter for America/Los_Angeles in years such as 8770.
cuDF's implementation is compatible with java.time rather than java.util, so it cannot produce the correct result.
So this solution is not feasible.

Solution 2: develop a kernel and use cuDF with ignoreTimezoneInStripeFooter=True.

cuDF does not handle the ORC writer timezone; it decodes timestamps as UTC.
All time rebasing logic is handled by a custom JNI kernel that is compatible with java.util.
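
A minimal sketch of what a java.util-compatible rebase could compute, assuming the conversion preserves wall-clock time between the writer and reader zones. The class `OrcRebaseSketch` and the method `shiftBetweenZones` are hypothetical names for illustration and are not taken from this PR or its JNI kernel; only `java.util.TimeZone.getOffset` is a real API here.

```java
import java.util.TimeZone;

final class OrcRebaseSketch {
    /**
     * Hypothetical helper: returns the writer offset minus the reader offset,
     * where the reader offset is re-evaluated at the shifted instant.
     */
    static long shiftBetweenZones(TimeZone writer, TimeZone reader, long millis) {
        long writerOffset = writer.getOffset(millis);
        long readerOffset = reader.getOffset(millis);
        // First estimate of the shifted instant, using both offsets at `millis`.
        long adjusted = millis + writerOffset - readerOffset;
        // The shifted instant may fall on the other side of a reader DST
        // transition, so look up the reader offset again at that instant.
        long adjustedReaderOffset = reader.getOffset(adjusted);
        return writerOffset - adjustedReaderOffset;
    }

    public static void main(String[] args) {
        TimeZone la = TimeZone.getTimeZone("America/Los_Angeles");
        TimeZone utc = TimeZone.getTimeZone("UTC");
        long winter = 1704067200000L; // 2024-01-01T00:00:00Z, standard time in LA
        System.out.println("LA -> UTC shift: " + shiftBetweenZones(la, utc, winter));
        System.out.println("LA -> LA shift:  " + shiftBetweenZones(la, la, winter));
    }
}
```

When writer and reader share the same rules the shift is zero, which is why a same-timezone read can skip the conversion. The second `getOffset` lookup is a design choice of this sketch to cope with shifts that cross a DST boundary in the reader zone.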

Changes

  • Remove the orc_timezone_info.data file and get the timezone info dynamically.
    Why: a pre-built timezone info file cannot cover all Java versions, and the timezone info may change in the future.
  • Implement the java.util.TimeZone logic.
  • Add OrcTimezoneSuite to test reader/writer timezone combinations.

Related cuDF issue

Perf numbers

| Test | CPU avg (ms) | GPU avg (ms) | Speedup |
|---|---|---|---|
| cross-tz (LA→UTC) | 23,662 | 13,815 | 1.71x |
| same-tz-baseline (LA→LA) | 79,699 | 13,766 | 5.79x |

How to run (see `OrcTimezonePerfSuite.scala`):
argLine="-DenableOrcTimeZonePerf=true \
         -DorcPerfWriterTZ=America/Los_Angeles \
         -DorcPerfReaderTZ=UTC \
         -DorcPerfRows=1073741824" \
mvn test -Dbuildver=350 \
  -DwildcardSuites=com.nvidia.spark.rapids.timezone.OrcTimezonePerfSuite

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    OrcTimezonePerfSuite covers the ORC timezone read path end-to-end (CPU vs GPU correctness + performance).
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.
    Yes; see the perf numbers above.

Checklists

  • I have reviewed my PR using AI tools and by myself.
  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    OrcTimezoneSuite covers ORC reader/writer timezone conversion.
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.
    Yes; for the perf numbers, please refer to the section above.

Signed-off-by: Chong Gao <chongg@nvidia.com>

@res-life res-life changed the title Optimize ORC timezone rebasing and add perf test suite Support non-DST(Daylight Saving Time) timezone for ORC datasource Apr 3, 2026
@res-life res-life changed the title Support non-DST(Daylight Saving Time) timezone for ORC datasource Support DST(Daylight Saving Time) timezone for ORC datasource Apr 7, 2026
@res-life res-life changed the title Support DST(Daylight Saving Time) timezone for ORC datasource Support DST timezones conversion for ORC Apr 7, 2026
Chong Gao added 2 commits April 7, 2026 17:04
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life marked this pull request as ready for review April 8, 2026 01:40
@greptile-apps

greptile-apps Bot commented Apr 8, 2026

Contributor

Greptile Summary

This PR adds full DST-aware cross-timezone support for ORC reads by implementing a java.util.TimeZone-compatible JNI kernel path instead of relying on cuDF's java.time-based (and incompatible) rebasing. It also removes the previous UTC-only restriction on ORC reads and writes.

  • Core conversion logic (GpuOrcTimezoneUtils): A new rebaseOrcTimestamps dispatcher routes to a fast same-TZ base-offset path or a cross-TZ JNI kernel path (GpuTimeZoneDB.convertOrcTimezones), correctly using ZoneId.of + SHORT_IDS to canonicalise ORC footer timezone IDs and throwing explicitly for truly unrecognised IDs.
  • Writer timezone plumbing (GpuOrcScan): writerTimezone is extracted from ORC stripe footers and threaded through OrcPartitionReaderContext, OrcExtraInfo, and OrcBlockMetaForSplitCheck, with a new writerTimezonesShareRules split-check guard.
  • Config / scan guard cleanup: Removes the experimental orcReadIgnoreWriterTimezone config, the UTC-only write-side guard, and the UTC-only read-side guard; adds OrcTimezoneSuite matrix tests and an optional perf benchmark.

Confidence Score: 4/5

The ORC cross-timezone conversion path is well-structured and safe to merge; the one concern is a minor inconsistency in the batch split-check helper that only affects non-standard timezone IDs.

The conversion logic in GpuOrcTimezoneUtils correctly avoids the silent-GMT fallback by using ZoneId.of. The new writerTimezonesShareRules function in GpuOrcScan uses TimeZone.getTimeZone for the batch split-check, which can silently map unrecognised timezone IDs to GMT, creating an inconsistency that could yield wrong timestamps for non-standard ORC writer timezone strings. Practical risk is low since standard IANA/SHORT_IDS timezone IDs are handled correctly by both paths.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala — the writerTimezonesShareRules helper at line 998 should use the same ZoneId-based comparison as hasSameTimezoneRules in GpuOrcTimezoneUtils.

Important Files Changed

| Filename | Overview |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcTimezoneUtils.scala | Adds rebaseOrcTimestamps dispatcher and rebaseWithWriterTimezone cross-TZ path; correctly uses ZoneId.of+SHORT_IDS for ORC footer IDs and builds per-table JNI context; memory management looks correct. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala | Propagates writerTimezone through OrcPartitionReaderContext / OrcExtraInfo / OrcBlockMetaForSplitCheck; adds split-check writerTimezonesShareRules that uses TimeZone.getTimeZone (silent GMT fallback) inconsistently with hasSameTimezoneRules in GpuOrcTimezoneUtils. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala | Removes the now-superseded ORC_READ_IGNORE_WRITE_TIMEZONE experimental config and its accessor; clean removal with no remaining references. |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala | Removes the UTC-only write-side timezone guard; correct since the GPU ORC writer uses UTC internally and the reader now handles cross-TZ conversion. |
| tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezoneSuite.scala | New correctness test for all writer x reader TZ combinations including legacy IDs; fixed seed avoids non-determinism; uses withTempPath for cleanup. |
| tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezonePerfSuite.scala | New optional perf benchmark gated by system property; uses per-run UUID suffix to avoid cross-run conflicts. |
| integration_tests/src/main/python/orc_test.py | Removes non_utc_allow_orc_scan fallback allowances and renames the non-UTC timezone integration test. |

Reviews (6): Last reviewed commit: "Compare ORC writer/reader timezones via ..." | Re-trigger Greptile

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezoneSuite.scala Outdated
Comment thread tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezonePerfSuite.scala Outdated
@res-life res-life requested review from jihoonson and revans2 April 8, 2026 01:48
Signed-off-by: Chong Gao <res_life@163.com>
@greptile-apps

greptile-apps Bot commented Apr 8, 2026

Contributor

Tip:

Greploops — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

@sameerz sameerz added the feature request New feature or request label Apr 21, 2026
@firestarman

Collaborator

Can you address the comments from greptile first?

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala Outdated
Comment thread tests/src/test/scala/com/nvidia/spark/rapids/timezone/OrcTimezoneSuite.scala Outdated
Chong Gao added 5 commits May 6, 2026 13:21
Replace scala.util.Random.nextLong() with a fixed seed (42L) so the
ORC timezone matrix tests are reproducible. A non-deterministic seed
risks intermittent failures if a generated timestamp happens to land
near a DST boundary that exposes a latent bug.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <res_life@163.com>
Replace the hardcoded "/tmp/tmp_OrcTimezonePerfSuite" path with a
per-run path under java.io.tmpdir suffixed with a UUID, so concurrent
runs on the same host (e.g., CI matrix) cannot corrupt each other's
data.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <res_life@163.com>
ORC footers can contain legacy/short timezone IDs like "PST", "CST",
or "ACT". On JDK 21 ZoneId.of(...) rejects these, while
java.util.TimeZone.getTimeZone(...) still accepts them, so the cross-TZ
GPU path could fail on files the CPU path reads cleanly.

Route the writer timezone through TimeZone.getTimeZone(...).toZoneId
once at the entry point so downstream code (the same-rules check, the
JNI kernel, and the writer base-offset computation) all see a canonical
ZoneId ID consistent with the java.util.TimeZone semantics ORC uses.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <res_life@163.com>
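
The normalization described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `LegacyZoneIdSketch`, `acceptedByPlainZoneId`, and `normalize` are hypothetical names, while `ZoneId.of`, `TimeZone.getTimeZone`, and `TimeZone.toZoneId` are standard JDK calls.

```java
import java.time.DateTimeException;
import java.time.ZoneId;
import java.util.TimeZone;

final class LegacyZoneIdSketch {
    /** True when ZoneId.of(id), called without an alias map, accepts the ID. */
    static boolean acceptedByPlainZoneId(String id) {
        try {
            ZoneId.of(id);
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    /**
     * Hypothetical normalizer: resolve the ID the way java.util.TimeZone does,
     * then convert once to a ZoneId that downstream code can share.
     * Note: TimeZone.getTimeZone falls back to GMT for IDs it does not know.
     */
    static ZoneId normalize(String id) {
        return TimeZone.getTimeZone(id).toZoneId();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"PST", "America/Los_Angeles"}) {
            System.out.println(id
                + ": plain ZoneId.of accepts = " + acceptedByPlainZoneId(id)
                + ", normalized = " + normalize(id).getId());
        }
    }
}
```

Resolving the ID in one place keeps every later step on the same ZoneId, at the cost of inheriting the GMT fallback of `TimeZone.getTimeZone` for unknown strings.
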
buildOutputStripes was using raw string equality on the per-stripe
writer timezone, so a file whose stripes carry semantically equivalent
but differently spelled IDs (e.g. "US/Pacific" vs "America/Los_Angeles",
"UTC" vs "GMT", or "" alongside an explicit JVM-default ID) would be
rejected with an IOException even though the CPU path reads it fine.

Group the collected IDs through java.util.TimeZone.hasSameRules and
only fail when the underlying rules actually differ. When all stripes
agree, prefer the first non-empty ID so downstream code keeps an
explicit timezone string.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <res_life@163.com>
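
A rough sketch of the rule-based comparison described in the commit message above, written in Java for illustration. `StripeTimezoneSketch` and its methods are hypothetical names, and treating an empty ID as the JVM default follows the commit message rather than any code shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TimeZone;

final class StripeTimezoneSketch {
    /** Empty ID means the JVM default zone; otherwise use the java.util lookup. */
    static TimeZone resolve(String id) {
        return id.isEmpty() ? TimeZone.getDefault() : TimeZone.getTimeZone(id);
    }

    /** True when every ID resolves to a zone with the same rules as the first one. */
    static boolean allShareRules(List<String> ids) {
        if (ids.isEmpty()) {
            return true;
        }
        TimeZone first = resolve(ids.get(0));
        for (String id : ids) {
            if (!first.hasSameRules(resolve(id))) {
                return false;
            }
        }
        return true;
    }

    /** Prefer the first non-empty ID so callers keep an explicit timezone string. */
    static String firstNonEmpty(List<String> ids) {
        for (String id : ids) {
            if (!id.isEmpty()) {
                return id;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        List<String> aliases = Arrays.asList("US/Pacific", "America/Los_Angeles");
        List<String> mixed = Arrays.asList("America/Los_Angeles", "Asia/Shanghai");
        System.out.println("aliases share rules: " + allShareRules(aliases));
        System.out.println("mixed share rules:   " + allShareRules(mixed));
        System.out.println("chosen ID: " + firstNonEmpty(Arrays.asList("", "US/Pacific")));
    }
}
```
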
The matrix only exercised canonical region IDs, so the new ORC read
path's behavior around legacy/alias timezone strings (e.g. "PST",
"US/Pacific") was untested. java.util.TimeZone accepts these and they
can show up in real ORC footers, but ZoneId.of rejects them on JDK 21.

Add "US/Pacific" (alias of "America/Los_Angeles") and "PST" (legacy
short ID) to the writer/reader matrix so the alias-equivalence path
in the stripe-timezone check and the legacy-ID normalization in the
cross-TZ rebase are both covered.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <res_life@163.com>
The wildcard `import org.apache.spark.sql._` already pulls the project's
own `org.apache.spark.sql.FileUtils` object into scope (the same pattern
used by `TimeZonePerfSuite`), but greptile's static analysis flags the
call as an unresolved symbol. Add an explicit import so the dependency
on `FileUtils.deleteRecursively` is obvious to readers and tools.

Signed-off-by: Chong Gao <chongg@nvidia.com>

@firestarman firestarman left a comment

Collaborator

One low-priority nit from my pass.

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala Outdated
Chong Gao added 2 commits May 9, 2026 17:38
Signed-off-by: Chong Gao <res_life@163.com>
…check

Per review: the intra-file check in buildOutputStripes already compares
stripe writer timezones via hasSameRules, but the multi-file path in
isNeedToSplitDataBlock was still using raw string equality. That forces
semantically equivalent IDs (e.g. US/Pacific vs America/Los_Angeles, ""
vs JVM default) into separate batches across files, hurting batching
efficiency without a correctness benefit.

Signed-off-by: Chong Gao <res_life@163.com>
Per review: ZoneId.SHORT_IDS rewrites EST/MST/HST to numeric offsets
(-05:00/-07:00/-10:00). TimeZone.getTimeZone then silently maps those
unrecognized ids to GMT, so hasSameTimezoneRules("-05:00", "UTC") was
returning true and skipping the rebase entirely — EST/MST/HST footers
were silently read as UTC.

Compare ZoneId.getRules directly. Both inputs are already canonical
ZoneId ids at the call site, so ZoneId.of will not throw.

Signed-off-by: Chong Gao <res_life@163.com>
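
The pitfall described in the commit message above can be reproduced with a small Java sketch. `SameRulesSketch` and its two methods are hypothetical names for illustration; the JDK calls are real, and the sketch assumes both inputs are IDs that `ZoneId.of` accepts.

```java
import java.time.ZoneId;
import java.util.TimeZone;

final class SameRulesSketch {
    /** Comparison through java.util.TimeZone; unknown IDs silently become GMT. */
    static boolean sameRulesViaTimeZone(String a, String b) {
        return TimeZone.getTimeZone(a).hasSameRules(TimeZone.getTimeZone(b));
    }

    /** Comparison through java.time rules; assumes both IDs are valid ZoneId IDs. */
    static boolean sameRulesViaZoneId(String a, String b) {
        return ZoneId.of(a).getRules().equals(ZoneId.of(b).getRules());
    }

    public static void main(String[] args) {
        // "-05:00" is a valid ZoneId (a fixed offset) but not a valid
        // java.util.TimeZone ID, so TimeZone.getTimeZone falls back to GMT.
        System.out.println("TimeZone ID for -05:00: " + TimeZone.getTimeZone("-05:00").getID());
        System.out.println("via TimeZone: " + sameRulesViaTimeZone("-05:00", "UTC"));
        System.out.println("via ZoneId:   " + sameRulesViaZoneId("-05:00", "UTC"));
    }
}
```

The two helpers disagree on this input, which is why a same-rules check applied to already-normalized ZoneId strings should not go back through `TimeZone.getTimeZone`.
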

@revans2 revans2 left a comment

Collaborator

Generally it looks good. My main comment is that it would really be nice to have the code parse/normalize the time zones once, and then pass them around instead of passing strings around. It gets really confusing to know if this string has been normalized or not, if it has then we don't need to worry about short ids, if it has not, then we do...

# The `spark.sql.session.timeZone` here does not impact reader and writer timezone, but any way, we test it.
# For the tests that reader and writer timezones are different, refer to `OrcTimezoneSuite`
@pytest.mark.parametrize("reader_confs", reader_opt_confs, ids=idfn)
# Setting end timestamp as None almost always generate ts >= 2200 year.

Collaborator

Why delete the comments that explain why we are setting the start and end timestamp to what they are? Have we tested outside of this range recently? Do we have a follow on issue to fix the range limitations?

val readerDefaultTz = java.util.TimeZone.getDefault
val zones = distinctTzs.map { tz =>
  if (tz.isEmpty) readerDefaultTz else java.util.TimeZone.getTimeZone(tz)
}

Collaborator

Even though this section is small it is duplicated at least here and at writerTimezonesShareRules. It might be worth trying to make the conversion to a timezone a standard function. Also this is deduping time zones similarly. It might be nice to also make that generic.

* @param writerTimezone the writer timezone from the ORC stripe footer
* @return table with rebased timestamp columns; input is closed
*/
def rebaseOrcTimestamps(input: Table, writerTimezone: String): Table = {

Collaborator

Would it be cleaner to pass around a ZoneId instead of the string for the writer everywhere? It looks like we try to resolve it in multiple places. Might be nice to do it once.

  readerTz
} else {
  try {
    ZoneId.of(writerTimezone, ZoneId.SHORT_IDS).getId

Collaborator

Are these not rejected by java.util.TimeZone.getTimeZone? Are the different ways of parsing/normalizing them consistent?

@nvauto

nvauto commented May 25, 2026

Collaborator

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.


Labels

feature request New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] [follow-up] ORC reading supports rebasing writer timezones.

5 participants