Skip to content

[Part1] [ORC-timezone]: Build ORC timezone metadata at runtime, drop pregenerated binary#4539

Merged
res-life merged 17 commits into
NVIDIA:mainfrom
res-life:orc-tz-pr1-runtime-info
May 25, 2026
Merged

[Part1] [ORC-timezone]: Build ORC timezone metadata at runtime, drop pregenerated binary#4539
res-life merged 17 commits into
NVIDIA:mainfrom
res-life:orc-tz-pr1-runtime-info

Conversation

@res-life

Copy link
Copy Markdown
Collaborator

First in a stack splitting #4432 into reviewable chunks.

Description

Previously OrcTimezoneInfo loaded its transition tables from a pregenerated 544KB binary (src/main/resources/orc_timezone_info.data, derived from sun.util.calendar.ZoneInfo on OpenJDK 8). This froze ORC timezone behavior to that snapshot and required regenerating and committing the binary whenever the data needed to change.

This PR replaces the binary with a runtime build path:

  • buildHistoricalTransitions walks ZoneRules.getTransitions() from year 0001 (the lower bound ORC supports) and probes TimeZone.getOffset() to capture the offset before each known transition.
  • collectTimeZoneTransitionsByScanning covers any gap between two known transitions by stepping forward at 1-day granularity and binary-searching the exact transition millisecond when an offset change is detected.
  • Per-id results are cached in RUNTIME_TIMEZONE_INFOS.
  • getAllTimezoneIds() now returns the live TimeZone.getAvailableIDs() instead of a hard-coded array.

OrcTimezoneInfo keeps its (rawOffset, transitions, offsets) shape so the CUDA kernel is unchanged. DST timezones remain blocked by the existing hasDaylightSavingTime guard in convertOrcTimezones — follow-up PRs lift this guard:

  1. (this PR) Build OrcTimezoneInfo at runtime
  2. DST rule extraction (Java side, CPU verified)
  3. GPU DST kernel + new convertOrcTimezones signature
  4. OrcTimezoneContext + remaining tests

isSupportedTimeZone is broadened from catch (ZoneRulesException) to catch (DateTimeException) so IDs like +05:30 are rejected/accepted consistently with the runtime build path (ZoneId.of may throw either, depending on whether the id has a malformed offset vs. an unknown name).

Checklists

  • Documentation: no user-facing change.
  • Tests: behavior preserved for non-DST timezones; covered by existing tests in this repo and the companion spark-rapids PR.
  • Performance: no expected change for the existing non-DST path. The first call per timezone pays a one-time runtime-build cost (measured ~4 ms typical, ~12 ms worst on the previous full-DST iteration); subsequent calls hit the cache.

Signed-off-by: Chong Gao chongg@nvidia.com

@res-life res-life changed the title Build ORC timezone metadata at runtime, drop pregenerated binary [Part1] [ORC-timezone]: Build ORC timezone metadata at runtime, drop pregenerated binary May 12, 2026
@nvauto

nvauto commented May 18, 2026

Copy link
Copy Markdown
Collaborator

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

@res-life

Copy link
Copy Markdown
Collaborator Author

build

Replace `orc_timezone_info.data` (544KB, OpenJDK-8 snapshot) with a
runtime build path that derives historical transitions from
`ZoneRules.getTransitions()` and `TimeZone.getOffset()`. Per-id results
are cached in `RUNTIME_TIMEZONE_INFOS`. `OrcTimezoneInfo` keeps its
`(rawOffset, transitions, offsets)` shape; the CUDA kernel and the
`hasDaylightSavingTime` DST guard are unchanged.

Switch `isSupportedTimeZone` to catch `DateTimeException` so IDs like
`+05:30` are handled the same as named zones.

Split from NVIDIA#4432.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@res-life res-life force-pushed the orc-tz-pr1-runtime-info branch from 05f1663 to e0df2a6 Compare May 19, 2026 07:18
@res-life res-life changed the base branch from main to release/26.06 May 19, 2026 09:26
@res-life

Copy link
Copy Markdown
Collaborator Author

build

@res-life res-life marked this pull request as ready for review May 19, 2026 13:46
@res-life res-life requested a review from firestarman May 19, 2026 13:46
@greptile-apps

greptile-apps Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR replaces a frozen 544 KB pregenerated binary (orc_timezone_info.data) with a runtime build path that derives ORC timezone metadata from public java.time/java.util APIs. Results are cached per ID in a ConcurrentHashMap, the timezone-ID list now comes from the live JVM tzdata, and isSupportedTimeZone broadens its catch to DateTimeException so offset-style IDs are handled consistently.

  • OrcTimezoneInfo is completely rewritten: buildHistoricalTransitions walks ZoneRules.getTransitions() from year 0001, while collectTimeZoneTransitionsByScanning fills any gap using exponential probing + binary search; a fixed-offset fast path handles +HH:MM-style IDs correctly (the old TimeZone.getTimeZone() would have silently mapped them to GMT).
  • GpuTimeZoneDB gains two new tests covering the DateTimeException-widening regression and the no-GMT-fallback contract; getOrcSupportedTimezones Javadoc is updated to reflect the pre-filtered list.
  • OrcTimezoneInfoTest is added with unit coverage for fixed-offset zones, cache identity, invalid-ID rejection, strict monotonicity of the generated transition table, and the getAllTimezoneIds sort/distinct/filter contract.

Confidence Score: 5/5

Safe to merge for the non-DST path; the existing DST guard in convertOrcTimezones keeps the one documented scan limitation dormant until a follow-up PR lifts it.

The core algorithms (exponential expand + binary search, fixed-offset fast path, canonical ID resolution via zoneId.getId()) are well-reasoned and the previous thread concerns have all been addressed. The only residual limitation — an A→B→A paired transition within the 6 h scan step — is explicitly documented in comments and is unreachable in practice because the DST guard blocks the code path that would exercise it. No new logic errors were found in the scanning, cache, or test coverage.

OrcTimezoneInfo.java (the new runtime build path) and OrcTimezoneInfoTest.java (test contract strength) are the files worth a close second read.

Important Files Changed

Filename Overview
src/main/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfo.java Complete rewrite replacing a 544 KB pregenerated binary with a runtime build path using ZoneRules + TimeZone.getOffset(). Key algorithms: exponential-expand probe + binary search for the gap scan, per-ID ConcurrentHashMap cache, and a fixed-offset fast path. Logic is sound for non-DST zones; an acknowledged A→B→A paired-transition gap in the exponential scan is kept dormant by the DST guard in convertOrcTimezones.
src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java Minimal targeted changes: isSupportedTimeZone now catches the broader DateTimeException instead of just ZoneRulesException (so offset-style IDs like "+25:00" are consistently rejected), a comment on the fixed-offset null-check is removed, and the getOrcSupportedTimezones Javadoc is updated to describe the pre-filtered contract.
src/test/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfoTest.java New test class covering: fixed-offset zones (+05:30, UTC), cache identity, invalid-ID contract, getAllTimezoneIds sort/distinct/pre-filter contract, and strict-monotonicity of transitions for Asia/Shanghai. The getAllTimezoneIds contract test verifies isSupportedTimeZone() for each ID rather than calling OrcTimezoneInfo.get() directly.
src/test/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDBTest.java Adds testIsSupportedTimeZone (covers the DateTimeException widening for offset IDs) and testConvertOrcTimezonesRejectsInvalidId (regression guard for the no-silent-GMT-fallback contract).
src/main/resources/orc_timezone_info.data Deleted 544 KB pregenerated binary; all its data is now derived at runtime. Deletion is correct — the binary is no longer referenced anywhere.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant OrcTimezoneInfo
    participant Cache as RUNTIME_TIMEZONE_INFOS
    participant GpuTimeZoneDB
    participant ZoneRules
    participant TimeZone

    Caller->>OrcTimezoneInfo: get(timezoneId)
    OrcTimezoneInfo->>Cache: computeIfAbsent(timezoneId)
    alt cache hit
        Cache-->>OrcTimezoneInfo: cached OrcTimezoneInfo
    else cache miss
        OrcTimezoneInfo->>GpuTimeZoneDB: getZoneId(timezoneId)
        GpuTimeZoneDB-->>OrcTimezoneInfo: ZoneId (or DateTimeException)
        OrcTimezoneInfo->>ZoneRules: getRules() → isFixedOffset()?
        alt fixed offset
            OrcTimezoneInfo-->>Cache: OrcTimezoneInfo(fixedOffsetMs, null, null)
        else historical zone
            OrcTimezoneInfo->>TimeZone: getTimeZone(zoneId.getId())
            OrcTimezoneInfo->>ZoneRules: getTransitions()
            OrcTimezoneInfo->>OrcTimezoneInfo: buildHistoricalTransitions(tz, transitions)
            loop "each ZoneOffsetTransition >= year 0001"
                OrcTimezoneInfo->>TimeZone: getOffset(transitionMs-1)
                alt gap detected (offset mismatch)
                    OrcTimezoneInfo->>OrcTimezoneInfo: collectTimeZoneTransitionsByScanning(exponential expand + binarySearch)
                end
                OrcTimezoneInfo->>TimeZone: getOffset(transitionMs)
            end
            OrcTimezoneInfo-->>Cache: OrcTimezoneInfo(rawOffset, transitions[], offsets[])
        end
        Cache-->>OrcTimezoneInfo: new OrcTimezoneInfo
    end
    OrcTimezoneInfo-->>Caller: OrcTimezoneInfo
Loading

Reviews (5): Last reviewed commit: "Use canonical zone id when looking up Ti..." | Re-trigger Greptile

Comment thread src/main/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfo.java
Comment thread src/main/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfo.java
Comment thread src/main/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfo.java Outdated
Chong Gao added 13 commits May 21, 2026 17:03
Cover the named-zone, unknown-id, and offset-style paths in
isSupportedTimeZone. The "+25:00" case in particular guards the
widened DateTimeException catch added in this PR, which was previously
only ZoneRulesException.

Signed-off-by: Chong Gao <chongg@nvidia.com>
rawOffset, transitions, and offsets are only assigned in the
constructor. Instances are cached in RUNTIME_TIMEZONE_INFOS (a
ConcurrentMap) and shared across threads via the public get(...)
method, so the JLS 17.5 final-field semantics are what actually
guarantee safe publication without external synchronization.

Signed-off-by: Chong Gao <chongg@nvidia.com>
The private buildRuntimeOrcTimezoneInfo Javadoc already states "no
silent fallback to GMT" on invalid IDs, but the public get(...) entry
point did not surface this in its contract. Add the @throws clause so
callers do not assume invalid IDs return null or a default.

Also normalize the abbreviation "Id" to "ID" in the @param to match
the all-caps convention used by java.util.TimeZone.

Signed-off-by: Chong Gao <chongg@nvidia.com>
The end-to-end testConvertOrcTimezones only exercises OrcTimezoneInfo
via the GPU pipeline on a curated non-DST list, so the four core
behaviours of the new runtime build path have no direct coverage:

  - fixed-offset zones (+05:30) return ZoneRules-derived rawOffset,
    not the silent GMT fallback of TimeZone.getTimeZone
  - computeIfAbsent returns the cached instance on repeated lookups
  - invalid IDs throw IllegalArgumentException
  - named historical-transition zones produce parallel, strictly
    increasing transitions[] and offsets[] arrays

Signed-off-by: Chong Gao <chongg@nvidia.com>
The call site utcMillisForDate(1, 0, 1) mixed 0-based month with
1-based day and relied on a +1 adjustment inside the helper. Make
all three parameters 1-indexed so the call reads as year=1, month=1
(January), day=1 — matching LocalDate.of and removing the silent
asymmetry flagged in PR review thread 3.

The computed instant (year 0001-01-01 UTC, MIN_SUPPORTED_ORC_UTC_MILLIS)
is unchanged.

Signed-off-by: Chong Gao <chongg@nvidia.com>
The runtime build path documents that invalid timezone IDs fail with
an exception rather than silently falling back to GMT, but no test
exercised that contract end-to-end through convertOrcTimezones. Add
one so a future refactor that broadens a catch block can't quietly
re-introduce the silent-GMT regression.

Signed-off-by: Chong Gao <chongg@nvidia.com>
TimeZone.getAvailableIDs() returns POSIX-style names on some JDK
distributions (e.g. EST5EDT, SystemV/AST4) that ZoneId.of(id,
ZoneId.SHORT_IDS) rejects. Test code that iterates this list via the
getOrcSupportedTimezones() bridge and calls OrcTimezoneInfo.get(id)
would hit IllegalArgumentException for those entries. Pre-filtering
via isSupportedTimeZone aligns the lister with the loader.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Document that the returned list is the intersection of
TimeZone.getAvailableIDs() and isSupportedTimeZone, matches the
contract of get(...), and is recomputed on every call.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Verify the documented contract: non-empty, sorted, includes the
well-known UTC and Asia/Shanghai entries, and every returned id is
accepted by isSupportedTimeZone (i.e. the lister and the loader
agree).

Signed-off-by: Chong Gao <chongg@nvidia.com>
The test bridge promises "all supported timezones" but used to return
the raw TimeZone.getAvailableIDs() output. Now that getAllTimezoneIds
filters to ids OrcTimezoneInfo.get can build, surface the alignment in
the Javadoc so future readers know the contract matches the name and
that callers do not need to pre-filter.

Signed-off-by: Chong Gao <chongg@nvidia.com>
The previous fixed 24h-step scan walked ~686k probes when called from
year 0001 toward a zone's first historical transition (~year 1880 for
typical IANA zones). Because the entire pre-history stretch has no
transitions, each tz.getOffset call returned the same offset and the
loop did no useful work — yet it ran under
RUNTIME_TIMEZONE_INFOS.computeIfAbsent, blocking other threads waiting
on the same key on cold-cache lookups.

Replace it with an exponential-extend-then-bisect strategy: keep the
24h base step, but double the step whenever the offset is unchanged
so empty centuries are skipped in O(log N) probes. Once a mismatch is
seen, hand the [lo, hi] bracket to the existing binarySearchTransition
to pin the transition to 1ms precision.

The optimisation assumes at most one offset transition lives inside
the expanded bracket, which is true for IANA tzdata; A->B->A pairs
narrower than the base step are an independent concern of the step
size itself, not of this optimisation.

Signed-off-by: Chong Gao <chongg@nvidia.com>
The 24h step let a pair of offset transitions A->B->A inside the same
window net to zero: tz.getOffset(probe) would still return A and the
scanner would silently skip both. Real IANA tzdata does not contain
such pairs today, but the previous step left no margin if a future
data update added one (or if the upcoming DST work routes DST zones
through this scanner). 6h is the largest base step that still sits
safely under the minimum spacing of real-world transitions.

Combined with the exponential extension in the previous commit, the
empty-pre-history sweep remains O(log N) probes — the smaller base
step only adds a constant factor at the bottom of the search.

Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <res_life@163.com>
@res-life

Copy link
Copy Markdown
Collaborator Author

@greptile-apps review

@res-life res-life changed the base branch from release/26.06 to main May 22, 2026 01:14
- Fix copyright year to 2026 (new file)
- Document the ZoneRules.getTransitions() completeness invariant the
  scan algorithm relies on, flagged for DST follow-ups to revisit
- Cover the named-fixed-offset case (UTC) in OrcTimezoneInfoTest
- Tighten the getAllTimezoneIds sort check to also assert distinctness
- Document the empty-getTransitions() coverage gap in
  testGetHistoricalTransitionsZone

Signed-off-by: Chong Gao <res_life@163.com>
Comment thread src/main/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfo.java Outdated
Chong Gao added 2 commits May 22, 2026 17:06
The base step constant was shrunk to 6h but the comment in
collectTimeZoneTransitionsByScanning still said "base 24h step".

Signed-off-by: Chong Gao <res_life@163.com>
GpuTimeZoneDB.getZoneId resolves caller input through ZoneId.SHORT_IDS,
so "IST" -> Asia/Kolkata, "CTT" -> Asia/Shanghai, etc. The follow-up
TimeZone.getTimeZone(timezoneId) used the raw caller string, and on
JVM distributions whose legacy TimeZone database maps "IST"/"BST"/etc.
differently the transition list and the tz.getOffset probes targeted
different zones, silently producing mixed offset data. Use
zoneId.getId() so both APIs always refer to the same zone.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life

Copy link
Copy Markdown
Collaborator Author

build

@firestarman firestarman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@res-life res-life merged commit 8ccbd97 into NVIDIA:main May 25, 2026
5 checks passed
res-life added a commit that referenced this pull request Jun 5, 2026
… APIs (#4635)

## Context

Part 2 of the split of #4432 (full ORC DST GPU support). Part 1 (#4539,
runtime build of `OrcTimezoneInfo`) is merged. This PR adds the
Java-side machinery to **extract** a recurring DST rule for a zone, but
does **not** yet wire it into the GPU dispatch path — production
behavior is unchanged.

## What this adds

New file `OrcDstRuleExtractor.java` (package-private `final class`),
which hosts:
- `static final class DstRule` — SimpleTimeZone-shaped DST descriptor
(raw offset, dst savings, start/end month / day / dayOfWeek / time /
mode), 0-based month and Calendar-style 1=Sun..7=Sat to match the GPU
consumer.
- `extractDstRule(timezoneId, tz, rules)` — entry point. Returns `null`
for non-DST zones (`rules.isFixedOffset()` or `!tz.useDaylightTime()`).
Fail-fast sanity check that `tz` and `rules` describe the same zone
(guards against the silent-GMT trap where `TimeZone.getTimeZone(id)`
returns GMT for offset-style ids).
- **Path B (tried first):** `extractDstRuleByProbing` — probe
`tz.getOffset()` hourly across a reference year, find the two
transitions, encode them in SimpleTimeZone form. Captures what
`java.util.TimeZone.getOffset` actually returns, which is the source of
truth the GPU side must match.
- **Path A (fallback):** `extractDstRuleFromZoneRules` — derive from
`ZoneRules.getTransitionRules()`. Used for zones whose recurring rule
cannot be recovered by probing alone.
- Cross-year verification: the extracted rule is re-evaluated against
`tz.getOffset` across anchor years **2060, 2400, 9997** to catch
divergence well into the recurring-rule fallback regime.

In `OrcTimezoneInfo.java`:
- Widen `utcMillisForDate` and `binarySearchTransition` from `private`
to package-private so `OrcDstRuleExtractor` can reuse the same UTC
anchor and bracketed binary search. No behavioral change.

## What this does **not** change

- `GpuTimeZoneDB.convertOrcTimezones` — DST gate still throws
`UnsupportedOperationException` for DST zones.
- `buildRuntimeOrcTimezoneInfo` — does not yet populate any `DstRule`.
- `OrcTimezoneInfo`'s public constructor / fields (the two widened
methods are package-private, not public).
- JNI, CUDA.

`extractDstRule` is package-private and exercised only by
`OrcTimezoneInfoTest`.

## Verification

Extracted rules cross-checked against tzdata; the test suite is green on
**both JDK 8 (the build's `maven.compiler.target`) and JDK 17 Zulu**:

| Zone | start | end | dstSavings |
|------|-------|-----|------------|
| America/New_York | 2nd Sun of March, 02:00 standard | 1st Sun of
November, 01:00 standard | +1h |
| Europe/London | last Sun of March, 01:00 standard | last Sun of
October, 01:00 standard | +1h |
| Australia/Sydney | 1st Sun of October, 02:00 standard | 1st Sun of
April, 02:00 standard | +1h |
| Asia/Shanghai, UTC, +05:30 | `null` | `null` | — |

Real-zone tests assert the exact (`month`, `day`, `dayOfWeek`, `time`,
`mode`) shape for each zone. Synthetic-zone tests additionally cover
both extraction paths (probing success, ZoneRules fallback), every
`IllegalStateException` branch (unsupported rule count / shape,
zero-/both-positive-/both-negative-delta, mismatched savings, negative
day indicator, cross-year verification failure, tz/rules zone mismatch),
and edge cases (`isMidnightEndOfDay`, `TimeDefinition.WALL`, February
day-of-month clamp, multiple-DST-on probing guard). 26 tests total.

## Diff size

`+1314 / -2` across 3 files:
- `OrcDstRuleExtractor.java` `+529` (new file)
- `OrcTimezoneInfo.java` `+6 / -2` (visibility widening)
- `OrcTimezoneInfoTest.java` `+779`

No changes outside this slice.

## Follow-up parts

- **Part 3**: CUDA civil-calendar utilities + Java↔CUDA `DstRule`
serialization scaffolding (still all dead-code).
- **Part 4**: CUDA DST kernel implementation + kernel-level C++ tests.
- **Part 5**: Remove the DST gate in `convertOrcTimezones`, dispatch DST
zones through the new kernel, add end-to-end Java tests. This is the
only follow-up part that changes runtime behavior.

---------

Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants