[Part1] [ORC-timezone]: Build ORC timezone metadata at runtime, drop pregenerated binary#4539
Conversation
|
NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release. |
|
build |
Replace `orc_timezone_info.data` (544KB, OpenJDK-8 snapshot) with a runtime build path that derives historical transitions from `ZoneRules.getTransitions()` and `TimeZone.getOffset()`. Per-id results are cached in `RUNTIME_TIMEZONE_INFOS`. `OrcTimezoneInfo` keeps its `(rawOffset, transitions, offsets)` shape; the CUDA kernel and the `hasDaylightSavingTime` DST guard are unchanged. Switch `isSupportedTimeZone` to catch `DateTimeException` so IDs like `+05:30` are handled the same as named zones. Split from NVIDIA#4432. Signed-off-by: Chong Gao <chongg@nvidia.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
05f1663 to
e0df2a6
Compare
|
build |
Greptile SummaryThis PR replaces a frozen 544 KB pregenerated binary (
Confidence Score: 5/5Safe to merge for the non-DST path; the existing DST guard in convertOrcTimezones keeps the one documented scan limitation dormant until a follow-up PR lifts it. The core algorithms (exponential expand + binary search, fixed-offset fast path, canonical ID resolution via zoneId.getId()) are well-reasoned and the previous thread concerns have all been addressed. The only residual limitation — an A→B→A paired transition within the 6 h scan step — is explicitly documented in comments and is unreachable in practice because the DST guard blocks the code path that would exercise it. No new logic errors were found in the scanning, cache, or test coverage. OrcTimezoneInfo.java (the new runtime build path) and OrcTimezoneInfoTest.java (test contract strength) are the files worth a close second read. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant OrcTimezoneInfo
participant Cache as RUNTIME_TIMEZONE_INFOS
participant GpuTimeZoneDB
participant ZoneRules
participant TimeZone
Caller->>OrcTimezoneInfo: get(timezoneId)
OrcTimezoneInfo->>Cache: computeIfAbsent(timezoneId)
alt cache hit
Cache-->>OrcTimezoneInfo: cached OrcTimezoneInfo
else cache miss
OrcTimezoneInfo->>GpuTimeZoneDB: getZoneId(timezoneId)
GpuTimeZoneDB-->>OrcTimezoneInfo: ZoneId (or DateTimeException)
OrcTimezoneInfo->>ZoneRules: getRules() → isFixedOffset()?
alt fixed offset
OrcTimezoneInfo-->>Cache: OrcTimezoneInfo(fixedOffsetMs, null, null)
else historical zone
OrcTimezoneInfo->>TimeZone: getTimeZone(zoneId.getId())
OrcTimezoneInfo->>ZoneRules: getTransitions()
OrcTimezoneInfo->>OrcTimezoneInfo: buildHistoricalTransitions(tz, transitions)
loop "each ZoneOffsetTransition >= year 0001"
OrcTimezoneInfo->>TimeZone: getOffset(transitionMs-1)
alt gap detected (offset mismatch)
OrcTimezoneInfo->>OrcTimezoneInfo: collectTimeZoneTransitionsByScanning(exponential expand + binarySearch)
end
OrcTimezoneInfo->>TimeZone: getOffset(transitionMs)
end
OrcTimezoneInfo-->>Cache: OrcTimezoneInfo(rawOffset, transitions[], offsets[])
end
Cache-->>OrcTimezoneInfo: new OrcTimezoneInfo
end
OrcTimezoneInfo-->>Caller: OrcTimezoneInfo
Reviews (5): Last reviewed commit: "Use canonical zone id when looking up Ti..." | Re-trigger Greptile |
Cover the named-zone, unknown-id, and offset-style paths in isSupportedTimeZone. The "+25:00" case in particular guards the widened DateTimeException catch added in this PR, which was previously only ZoneRulesException. Signed-off-by: Chong Gao <chongg@nvidia.com>
rawOffset, transitions, and offsets are only assigned in the constructor. Instances are cached in RUNTIME_TIMEZONE_INFOS (a ConcurrentMap) and shared across threads via the public get(...) method, so the JLS 17.5 final-field semantics are what actually guarantee safe publication without external synchronization. Signed-off-by: Chong Gao <chongg@nvidia.com>
The private buildRuntimeOrcTimezoneInfo Javadoc already states "no silent fallback to GMT" on invalid IDs, but the public get(...) entry point did not surface this in its contract. Add the @throws clause so callers do not assume invalid IDs return null or a default. Also normalize the abbreviation "Id" to "ID" in the @param to match the all-caps convention used by java.util.TimeZone. Signed-off-by: Chong Gao <chongg@nvidia.com>
The end-to-end testConvertOrcTimezones only exercises OrcTimezoneInfo
via the GPU pipeline on a curated non-DST list, so the four core
behaviours of the new runtime build path have no direct coverage:
- fixed-offset zones (+05:30) return ZoneRules-derived rawOffset,
not the silent GMT fallback of TimeZone.getTimeZone
- computeIfAbsent returns the cached instance on repeated lookups
- invalid IDs throw IllegalArgumentException
- named historical-transition zones produce parallel, strictly
increasing transitions[] and offsets[] arrays
Signed-off-by: Chong Gao <chongg@nvidia.com>
The call site utcMillisForDate(1, 0, 1) mixed 0-based month with 1-based day and relied on a +1 adjustment inside the helper. Make all three parameters 1-indexed so the call reads as year=1, month=1 (January), day=1 — matching LocalDate.of and removing the silent asymmetry flagged in PR review thread 3. The computed instant (year 0001-01-01 UTC, MIN_SUPPORTED_ORC_UTC_MILLIS) is unchanged. Signed-off-by: Chong Gao <chongg@nvidia.com>
The runtime build path documents that invalid timezone IDs fail with an exception rather than silently falling back to GMT, but no test exercised that contract end-to-end through convertOrcTimezones. Add one so a future refactor that broadens a catch block can't quietly re-introduce the silent-GMT regression. Signed-off-by: Chong Gao <chongg@nvidia.com>
TimeZone.getAvailableIDs() returns POSIX-style names on some JDK distributions (e.g. EST5EDT, SystemV/AST4) that ZoneId.of(id, ZoneId.SHORT_IDS) rejects. Test code that iterates this list via the getOrcSupportedTimezones() bridge and calls OrcTimezoneInfo.get(id) would hit IllegalArgumentException for those entries. Pre-filtering via isSupportedTimeZone aligns the lister with the loader. Signed-off-by: Chong Gao <chongg@nvidia.com>
Document that the returned list is the intersection of TimeZone.getAvailableIDs() and isSupportedTimeZone, matches the contract of get(...), and is recomputed on every call. Signed-off-by: Chong Gao <chongg@nvidia.com>
Verify the documented contract: non-empty, sorted, includes the well-known UTC and Asia/Shanghai entries, and every returned id is accepted by isSupportedTimeZone (i.e. the lister and the loader agree). Signed-off-by: Chong Gao <chongg@nvidia.com>
The test bridge promises "all supported timezones" but used to return the raw TimeZone.getAvailableIDs() output. Now that getAllTimezoneIds filters to ids OrcTimezoneInfo.get can build, surface the alignment in the Javadoc so future readers know the contract matches the name and that callers do not need to pre-filter. Signed-off-by: Chong Gao <chongg@nvidia.com>
The previous fixed 24h-step scan walked ~686k probes when called from year 0001 toward a zone's first historical transition (~year 1880 for typical IANA zones). Because the entire pre-history stretch has no transitions, each tz.getOffset call returned the same offset and the loop did no useful work — yet it ran under RUNTIME_TIMEZONE_INFOS.computeIfAbsent, blocking other threads waiting on the same key on cold-cache lookups. Replace it with an exponential-extend-then-bisect strategy: keep the 24h base step, but double the step whenever the offset is unchanged so empty centuries are skipped in O(log N) probes. Once a mismatch is seen, hand the [lo, hi] bracket to the existing binarySearchTransition to pin the transition to 1ms precision. The optimisation assumes at most one offset transition lives inside the expanded bracket, which is true for IANA tzdata; A->B->A pairs narrower than the base step are an independent concern of the step size itself, not of this optimisation. Signed-off-by: Chong Gao <chongg@nvidia.com>
The 24h step let a pair of offset transitions A->B->A inside the same window net to zero: tz.getOffset(probe) would still return A and the scanner would silently skip both. Real IANA tzdata does not contain such pairs today, but the previous step left no margin if a future data update added one (or if the upcoming DST work routes DST zones through this scanner). 6h is the largest base step that still sits safely under the minimum spacing of real-world transitions. Combined with the exponential extension in the previous commit, the empty-pre-history sweep remains O(log N) probes — the smaller base step only adds a constant factor at the bottom of the search. Signed-off-by: Chong Gao <chongg@nvidia.com>
|
@greptile-apps review |
- Fix copyright year to 2026 (new file) - Document the ZoneRules.getTransitions() completeness invariant the scan algorithm relies on, flagged for DST follow-ups to revisit - Cover the named-fixed-offset case (UTC) in OrcTimezoneInfoTest - Tighten the getAllTimezoneIds sort check to also assert distinctness - Document the empty-getTransitions() coverage gap in testGetHistoricalTransitionsZone Signed-off-by: Chong Gao <res_life@163.com>
The base step constant was shrunk to 6h but the comment in collectTimeZoneTransitionsByScanning still said "base 24h step". Signed-off-by: Chong Gao <res_life@163.com>
GpuTimeZoneDB.getZoneId resolves caller input through ZoneId.SHORT_IDS, so "IST" -> Asia/Kolkata, "CTT" -> Asia/Shanghai, etc. The follow-up TimeZone.getTimeZone(timezoneId) used the raw caller string, and on JVM distributions whose legacy TimeZone database maps "IST"/"BST"/etc. differently the transition list and the tz.getOffset probes targeted different zones, silently producing mixed offset data. Use zoneId.getId() so both APIs always refer to the same zone. Signed-off-by: Chong Gao <res_life@163.com>
|
build |
… APIs (#4635) ## Context Part 2 of the split of #4432 (full ORC DST GPU support). Part 1 (#4539, runtime build of `OrcTimezoneInfo`) is merged. This PR adds the Java-side machinery to **extract** a recurring DST rule for a zone, but does **not** yet wire it into the GPU dispatch path — production behavior is unchanged. ## What this adds New file `OrcDstRuleExtractor.java` (package-private `final class`), which hosts: - `static final class DstRule` — SimpleTimeZone-shaped DST descriptor (raw offset, dst savings, start/end month / day / dayOfWeek / time / mode), 0-based month and Calendar-style 1=Sun..7=Sat to match the GPU consumer. - `extractDstRule(timezoneId, tz, rules)` — entry point. Returns `null` for non-DST zones (`rules.isFixedOffset()` or `!tz.useDaylightTime()`). Fail-fast sanity check that `tz` and `rules` describe the same zone (guards against the silent-GMT trap where `TimeZone.getTimeZone(id)` returns GMT for offset-style ids). - **Path B (tried first):** `extractDstRuleByProbing` — probe `tz.getOffset()` hourly across a reference year, find the two transitions, encode them in SimpleTimeZone form. Captures what `java.util.TimeZone.getOffset` actually returns, which is the source of truth the GPU side must match. - **Path A (fallback):** `extractDstRuleFromZoneRules` — derive from `ZoneRules.getTransitionRules()`. Used for zones whose recurring rule cannot be recovered by probing alone. - Cross-year verification: the extracted rule is re-evaluated against `tz.getOffset` across anchor years **2060, 2400, 9997** to catch divergence well into the recurring-rule fallback regime. In `OrcTimezoneInfo.java`: - Widen `utcMillisForDate` and `binarySearchTransition` from `private` to package-private so `OrcDstRuleExtractor` can reuse the same UTC anchor and bracketed binary search. No behavioral change. ## What this does **not** change - `GpuTimeZoneDB.convertOrcTimezones` — DST gate still throws `UnsupportedOperationException` for DST zones. - `buildRuntimeOrcTimezoneInfo` — does not yet populate any `DstRule`. - `OrcTimezoneInfo`'s public constructor / fields (the two widened methods are package-private, not public). - JNI, CUDA. `extractDstRule` is package-private and exercised only by `OrcTimezoneInfoTest`. ## Verification Extracted rules cross-checked against tzdata; the test suite is green on **both JDK 8 (the build's `maven.compiler.target`) and JDK 17 Zulu**: | Zone | start | end | dstSavings | |------|-------|-----|------------| | America/New_York | 2nd Sun of March, 02:00 standard | 1st Sun of November, 01:00 standard | +1h | | Europe/London | last Sun of March, 01:00 standard | last Sun of October, 01:00 standard | +1h | | Australia/Sydney | 1st Sun of October, 02:00 standard | 1st Sun of April, 02:00 standard | +1h | | Asia/Shanghai, UTC, +05:30 | `null` | `null` | — | Real-zone tests assert the exact (`month`, `day`, `dayOfWeek`, `time`, `mode`) shape for each zone. Synthetic-zone tests additionally cover both extraction paths (probing success, ZoneRules fallback), every `IllegalStateException` branch (unsupported rule count / shape, zero-/both-positive-/both-negative-delta, mismatched savings, negative day indicator, cross-year verification failure, tz/rules zone mismatch), and edge cases (`isMidnightEndOfDay`, `TimeDefinition.WALL`, February day-of-month clamp, multiple-DST-on probing guard). 26 tests total. ## Diff size `+1314 / -2` across 3 files: - `OrcDstRuleExtractor.java` `+529` (new file) - `OrcTimezoneInfo.java` `+6 / -2` (visibility widening) - `OrcTimezoneInfoTest.java` `+779` No changes outside this slice. ## Follow-up parts - **Part 3**: CUDA civil-calendar utilities + Java↔CUDA `DstRule` serialization scaffolding (still all dead-code). - **Part 4**: CUDA DST kernel implementation + kernel-level C++ tests. - **Part 5**: Remove the DST gate in `convertOrcTimezones`, dispatch DST zones through the new kernel, add end-to-end Java tests. This is the only follow-up part that changes runtime behavior. --------- Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com>
First in a stack splitting #4432 into reviewable chunks.
Description
Previously
OrcTimezoneInfoloaded its transition tables from a pregenerated 544KB binary (src/main/resources/orc_timezone_info.data, derived fromsun.util.calendar.ZoneInfoon OpenJDK 8). This froze ORC timezone behavior to that snapshot and required regenerating and committing the binary whenever the data needed to change.This PR replaces the binary with a runtime build path:
buildHistoricalTransitionswalksZoneRules.getTransitions()from year 0001 (the lower bound ORC supports) and probesTimeZone.getOffset()to capture the offset before each known transition.collectTimeZoneTransitionsByScanningcovers any gap between two known transitions by stepping forward at 1-day granularity and binary-searching the exact transition millisecond when an offset change is detected.RUNTIME_TIMEZONE_INFOS.getAllTimezoneIds()now returns the liveTimeZone.getAvailableIDs()instead of a hard-coded array.OrcTimezoneInfokeeps its(rawOffset, transitions, offsets)shape so the CUDA kernel is unchanged. DST timezones remain blocked by the existinghasDaylightSavingTimeguard inconvertOrcTimezones— follow-up PRs lift this guard:convertOrcTimezonessignatureOrcTimezoneContext+ remaining testsisSupportedTimeZoneis broadened fromcatch (ZoneRulesException)tocatch (DateTimeException)so IDs like+05:30are rejected/accepted consistently with the runtime build path (ZoneId.ofmay throw either, depending on whether the id has a malformed offset vs. an unknown name).Checklists
Signed-off-by: Chong Gao chongg@nvidia.com