Support DST timezones conversion for ORC#14544
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
Greptile SummaryThis PR adds full DST-aware cross-timezone support for ORC reads by implementing a
Confidence Score: 4/5The ORC cross-timezone conversion path is well-structured and safe to merge; the one concern is a minor inconsistency in the batch split-check helper that only affects non-standard timezone IDs. The conversion logic in GpuOrcTimezoneUtils correctly avoids the silent-GMT fallback by using ZoneId.of. The new writerTimezonesShareRules function in GpuOrcScan uses TimeZone.getTimeZone for the batch split-check, which can silently map unrecognised timezone IDs to GMT, creating an inconsistency that could yield wrong timestamps for non-standard ORC writer timezone strings. Practical risk is low since standard IANA/SHORT_IDS timezone IDs are handled correctly by both paths. sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala — the writerTimezonesShareRules helper at line 998 should use the same ZoneId-based comparison as hasSameTimezoneRules in GpuOrcTimezoneUtils. Important Files Changed
Reviews (6): Last reviewed commit: "Compare ORC writer/reader timezones via ..." | Re-trigger Greptile |
Signed-off-by: Chong Gao <res_life@163.com>
|
Tip: Greploops — Automatically fix all review issues by running Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal. |
|
Can you address the comments from greptile first ? |
Replace scala.util.Random.nextLong() with a fixed seed (42L) so the ORC timezone matrix tests are reproducible. A non-deterministic seed risks intermittent failures if a generated timestamp happens to land near a DST boundary that exposes a latent bug. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
Replace the hardcoded "/tmp/tmp_OrcTimezonePerfSuite" path with a per-run path under java.io.tmpdir suffixed with a UUID, so concurrent runs on the same host (e.g., CI matrix) cannot corrupt each other's data. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
ORC footers can contain legacy/short timezone IDs like "PST", "CST", or "ACT". On JDK 21 ZoneId.of(...) rejects these, while java.util.TimeZone.getTimeZone(...) still accepts them, so the cross-TZ GPU path could fail on files the CPU path reads cleanly. Route the writer timezone through TimeZone.getTimeZone(...).toZoneId once at the entry point so downstream code (the same-rules check, the JNI kernel, and the writer base-offset computation) all see a canonical ZoneId ID consistent with the java.util.TimeZone semantics ORC uses. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
buildOutputStripes was using raw string equality on the per-stripe writer timezone, so a file whose stripes carry semantically equivalent but differently spelled IDs (e.g. "US/Pacific" vs "America/Los_Angeles", "UTC" vs "GMT", or "" alongside an explicit JVM-default ID) would be rejected with an IOException even though the CPU path reads it fine. Group the collected IDs through java.util.TimeZone.hasSameRules and only fail when the underlying rules actually differ. When all stripes agree, prefer the first non-empty ID so downstream code keeps an explicit timezone string. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
The matrix only exercised canonical region IDs, so the new ORC read path's behavior around legacy/alias timezone strings (e.g. "PST", "US/Pacific") was untested. java.util.TimeZone accepts these and they can show up in real ORC footers, but ZoneId.of rejects them on JDK 21. Add "US/Pacific" (alias of "America/Los_Angeles") and "PST" (legacy short ID) to the writer/reader matrix so the alias-equivalence path in the stripe-timezone check and the legacy-ID normalization in the cross-TZ rebase are both covered. Signed-off-by: Chong Gao <chongg@nvidia.com> Signed-off-by: Chong Gao <res_life@163.com>
The wildcard `import org.apache.spark.sql._` already pulls the project's own `org.apache.spark.sql.FileUtils` object into scope (the same pattern used by `TimeZonePerfSuite`), but greptile's static analysis flags the call as an unresolved symbol. Add an explicit import so the dependency on `FileUtils.deleteRecursively` is obvious to readers and tools. Signed-off-by: Chong Gao <chongg@nvidia.com>
firestarman
left a comment
There was a problem hiding this comment.
One low-priority nit from my pass.
Signed-off-by: Chong Gao <res_life@163.com>
…check Per review: the intra-file check in buildOutputStripes already compares stripe writer timezones via hasSameRules, but the multi-file path in isNeedToSplitDataBlock was still using raw string equality. That forces semantically equivalent IDs (e.g. US/Pacific vs America/Los_Angeles, "" vs JVM default) into separate batches across files, hurting batching efficiency without a correctness benefit. Signed-off-by: Chong Gao <res_life@163.com>
Per review: ZoneId.SHORT_IDS rewrites EST/MST/HST to numeric offsets
(-05:00/-07:00/-10:00). TimeZone.getTimeZone then silently maps those
unrecognized ids to GMT, so hasSameTimezoneRules("-05:00", "UTC") was
returning true and skipping the rebase entirely — EST/MST/HST footers
were silently read as UTC.
Compare ZoneId.getRules directly. Both inputs are already canonical
ZoneId ids at the call site, so ZoneId.of will not throw.
Signed-off-by: Chong Gao <res_life@163.com>
revans2
left a comment
There was a problem hiding this comment.
Generally it looks good. My main comment is that it would really be nice to have the code parse/normalize the time zones once, and then pass them around instead of passing strings around. It gets really confusing to know if this string has been normalized or not, if it has then we don't need to worry about short ids, if it has not, then we do...
| # The `spark.sql.session.timeZone` here does not impact reader and writer timezone, but any way, we test it. | ||
| # For the tests that reader and writer timezones are different, refer to `OrcTimezoneSuite` | ||
| @pytest.mark.parametrize("reader_confs", reader_opt_confs, ids=idfn) | ||
| # Setting end timestamp as None almost always generate ts >= 2200 year. |
There was a problem hiding this comment.
Why delete the comments that explain why we are setting the start and end timestamp to what they are? Have we tested outside of this range recently? Do we have a follow on issue to fix the range limitations?
| val readerDefaultTz = java.util.TimeZone.getDefault | ||
| val zones = distinctTzs.map { tz => | ||
| if (tz.isEmpty) readerDefaultTz else java.util.TimeZone.getTimeZone(tz) | ||
| } |
There was a problem hiding this comment.
Even though this section is small it is duplicated at least here and at writerTimezonesShareRules. It might be worth trying to make the conversion to a timezone a standard function. Also this is deduping time zones similarly. It might be nice to also make that generic.
| * @param writerTimezone the writer timezone from the ORC stripe footer | ||
| * @return table with rebased timestamp columns; input is closed | ||
| */ | ||
| def rebaseOrcTimestamps(input: Table, writerTimezone: String): Table = { |
There was a problem hiding this comment.
Would it be cleaner to pass around a ZoneId instead of the string for the writer everywhere? It looks like we try to resolve it in multiple places. Might be nice to do it once.
| readerTz | ||
| } else { | ||
| try { | ||
| ZoneId.of(writerTimezone, ZoneId.SHORT_IDS).getId |
There was a problem hiding this comment.
Are these not rejected by java.util.TimeZone.getTimeZone? Are the different ways of parsing/normalizing them consistent?
|
NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release. |
Fixes #13437.
Depends on:
Description
Context
ORC OSS uses java.util.TimeZone to do rebase, it does not use java.time.ZoneId API.
The java.util.TimeZone and java.time.ZoneId have inconsistent behavior.
cuDF have
java.timecompatible impl, but does not havejava.utilcompatible impljava.util.TimeZone.getOffsetandjava.time.ZoneId.getOffsetare not always consistent.For more details, click to expand
Solution 1 [not feasible], use cuDF with ignoreTimezoneInStripeFooter=False.
cuDF manages the ORC writer timezone decode.
Problem: for far-future timestamps projected into the synthetic 400-year cycle, dates before the first synthetic DST transition were incorrectly using the first cycle entry, which is the DST offset. That causes exactly the +1 hour winter drift you saw for America/Los_Angeles with years like 8770.
cuDF has
java.timecompatible impl instead ofjava.util. We can not get correct result.So this solution is not feasible.
Solution 2, develope kernel, use cuDF with ignoreTimezoneInStripeFooter=True
cuDF does not manage the ORC writer timezone decode. Decode as UTC in cuDF.
All time rebasing logic is handled by customized JNI kernel which is compatible to
java.util.changes
Why: A pre-built timezone info file can not handle for all Java versions, in future the timezone info may change.
java.util.TimezonelogicRelated cuDF issue
perf number
How to run, refer to `OrcTimezonePerfSuite.scala`, click to expand
Checklists
OrcTimezonePerfSuitecovers the ORC timezone read path end-to-end (CPU vs GPU correctness + performance).yes, perf number refer to the above.
Checklists
OrcTimezoneSuitecovers the ORC reader/writer timezones conversion.yes, for the perf number, please refer to the above section.
Signed-off-by: Chong Gao chongg@nvidia.com