Skip to content

ORC tz split 3/6: OrcTimezoneInfo runtime build and registry#4547

Closed
res-life wants to merge 1 commit into
orc-tz-2-orctzinfo-dst-extractfrom
orc-tz-3-orctzinfo-runtime-build
Closed

ORC tz split 3/6: OrcTimezoneInfo runtime build and registry#4547
res-life wants to merge 1 commit into
orc-tz-2-orctzinfo-dst-extractfrom
orc-tz-3-orctzinfo-runtime-build

Conversation

@res-life

Copy link
Copy Markdown
Collaborator

Part of the split of #4432.
Companion: NVIDIA/cudf-spark#14544
Previous: #4546

Completes OrcTimezoneInfo by adding the runtime build pipeline and the remaining DST math helpers:

  • verifyDstRule, verifyDstRuleAcrossReferenceYears: lightweight verification (~200 getOffset() calls instead of ~52K).
  • computeDstOffset, computeTransitionUtcMillis, computeRuleDay, utcMillisForDate: SimpleTimeZone-compatible offset math.
  • toString.
  • RUNTIME_TIMEZONE_INFOS registry, get(timezoneId), buildRuntimeOrcTimezoneInfo, getAllTimezoneIds.
  • getInitialOffset, buildHistoricalTransitions, collectTimeZoneTransitionsByScanning, toLongArray / toIntArray, and the HistoricalTransitions value class.

After this PR, OrcTimezoneInfo.java matches the version in #4432. Callers introduced in #orc-tz-1 now resolve.

Signed-off-by: Chong Gao chongg@nvidia.com

Part of the split of #4432.

Completes OrcTimezoneInfo by adding the runtime build pipeline and the
remaining DST math helpers:

- verifyDstRule, verifyDstRuleAcrossReferenceYears: lightweight
  verification (~200 getOffset() calls instead of ~52K).
- computeDstOffset, computeTransitionUtcMillis, computeRuleDay,
  utcMillisForDate: SimpleTimeZone-compatible offset math.
- toString.
- RUNTIME_TIMEZONE_INFOS registry, get(timezoneId),
  buildRuntimeOrcTimezoneInfo, getAllTimezoneIds.
- getInitialOffset, buildHistoricalTransitions,
  collectTimeZoneTransitionsByScanning, toLongArray / toIntArray, and
  the HistoricalTransitions value class.

After this PR, OrcTimezoneInfo.java matches the version in #4432.
Callers introduced in orc-tz-1-jni-plumbing now resolve.

Signed-off-by: Chong Gao <chongg@nvidia.com>
@res-life res-life force-pushed the orc-tz-3-orctzinfo-runtime-build branch from d0baa11 to 944737e Compare May 19, 2026 08:33
@res-life res-life force-pushed the orc-tz-2-orctzinfo-dst-extract branch from c11c7cd to 0d0a2c4 Compare May 19, 2026 08:33
@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR completes OrcTimezoneInfo by adding the runtime build pipeline (buildRuntimeOrcTimezoneInfo, get, RUNTIME_TIMEZONE_INFOS), DST verification helpers (verifyDstRule, computeDstOffset, computeTransitionUtcMillis, computeRuleDay, utcMillisForDate), and the historical-transition collector (buildHistoricalTransitions, collectTimeZoneTransitionsByScanning, HistoricalTransitions).

  • DST rule extraction and verification probe TimeZone.getOffset() hourly to find transition boundaries, then verify the extracted rule against the JVM across three far-future reference years, reducing verification cost from ~52 K to ~200 getOffset() calls.
  • getAllTimezoneIds() sources its list from TimeZone.getAvailableIDs(), which includes POSIX/legacy aliases (e.g. \"EST5EDT\") not accepted by GpuTimeZoneDB.getZoneId() (backed by ZoneId.of), creating a mismatch with get() — any caller iterating the list and calling get() for each entry will encounter IllegalArgumentException for those IDs.
  • Historical transition scanning uses binary search to locate exact transition milliseconds and guards against ZoneOffsetTransition entries that don't produce an observable offset change in TimeZone.getOffset().

Confidence Score: 3/5

Safe to merge with caution — the DST math and historical-transition logic are well-designed and self-verified, but getAllTimezoneIds() produces IDs that get() cannot handle, which will surface as runtime exceptions in any code path that iterates and looks up every ID.

The getAllTimezoneIds() / get() mismatch is a concrete functional gap: TimeZone.getAvailableIDs() emits POSIX-style aliases like 'EST5EDT' and 'SystemV/CST6CDT' that ZoneId.of(id, ZoneId.SHORT_IDS) cannot resolve, so iterating the returned list and calling get() for each entry will throw for a subset of IDs. GpuTimeZoneDB.getOrcSupportedTimezones() is already wired to delegate here, so tests exercising all supported zones are directly at risk. The rest of the code — verification loop, binary-search narrowing, ConcurrentHashMap caching, and fixed-offset fast path — looks correct and robust.

src/main/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfo.java — specifically the getAllTimezoneIds() implementation and its interaction with get().

Important Files Changed

Filename Overview
src/main/java/com/nvidia/spark/rapids/jni/OrcTimezoneInfo.java Adds runtime DST math helpers, a ConcurrentHashMap-backed registry, and the buildRuntimeOrcTimezoneInfo pipeline. One functional inconsistency: getAllTimezoneIds() sources IDs from TimeZone.getAvailableIDs() while get() validates via GpuTimeZoneDB.getZoneId() (ZoneId.of + SHORT_IDS), so legacy aliases like 'EST5EDT' returned by the former will throw when passed to the latter.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant OrcTimezoneInfo
    participant GpuTimeZoneDB
    participant ZoneRules
    participant TimeZone

    Caller->>OrcTimezoneInfo: get(timezoneId)
    OrcTimezoneInfo->>OrcTimezoneInfo: RUNTIME_TIMEZONE_INFOS.computeIfAbsent(...)
    OrcTimezoneInfo->>GpuTimeZoneDB: getZoneId(timezoneId)
    GpuTimeZoneDB-->>OrcTimezoneInfo: ZoneId (or throws IllegalArgumentException)
    OrcTimezoneInfo->>ZoneRules: zoneId.getRules()
    alt Fixed offset
        OrcTimezoneInfo-->>Caller: OrcTimezoneInfo(fixedOffset, ...)
    else DST zone
        OrcTimezoneInfo->>TimeZone: getTimeZone(timezoneId)
        OrcTimezoneInfo->>OrcTimezoneInfo: getInitialOffset(tz)
        OrcTimezoneInfo->>OrcTimezoneInfo: extractDstRule(timezoneId, tz, rules)
        note over OrcTimezoneInfo: probe hourly → binarySearchTransition → verifyDstRuleAcrossReferenceYears
        OrcTimezoneInfo->>ZoneRules: getTransitions()
        OrcTimezoneInfo->>OrcTimezoneInfo: buildHistoricalTransitions(tz, transitionList)
        note over OrcTimezoneInfo: scan gaps, binary-search exact timestamps
        OrcTimezoneInfo-->>Caller: OrcTimezoneInfo(initialOffset, rawOffset, transitions[], offsets[], dstRule)
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller
    participant OrcTimezoneInfo
    participant GpuTimeZoneDB
    participant ZoneRules
    participant TimeZone

    Caller->>OrcTimezoneInfo: get(timezoneId)
    OrcTimezoneInfo->>OrcTimezoneInfo: RUNTIME_TIMEZONE_INFOS.computeIfAbsent(...)
    OrcTimezoneInfo->>GpuTimeZoneDB: getZoneId(timezoneId)
    GpuTimeZoneDB-->>OrcTimezoneInfo: ZoneId (or throws IllegalArgumentException)
    OrcTimezoneInfo->>ZoneRules: zoneId.getRules()
    alt Fixed offset
        OrcTimezoneInfo-->>Caller: OrcTimezoneInfo(fixedOffset, ...)
    else DST zone
        OrcTimezoneInfo->>TimeZone: getTimeZone(timezoneId)
        OrcTimezoneInfo->>OrcTimezoneInfo: getInitialOffset(tz)
        OrcTimezoneInfo->>OrcTimezoneInfo: extractDstRule(timezoneId, tz, rules)
        note over OrcTimezoneInfo: probe hourly → binarySearchTransition → verifyDstRuleAcrossReferenceYears
        OrcTimezoneInfo->>ZoneRules: getTransitions()
        OrcTimezoneInfo->>OrcTimezoneInfo: buildHistoricalTransitions(tz, transitionList)
        note over OrcTimezoneInfo: scan gaps, binary-search exact timestamps
        OrcTimezoneInfo-->>Caller: OrcTimezoneInfo(initialOffset, rawOffset, transitions[], offsets[], dstRule)
    end
Loading

Reviews (1): Last reviewed commit: "ORC tz split 3/6: OrcTimezoneInfo runtim..." | Re-trigger Greptile

Comment on lines +605 to +609
public static List<String> getAllTimezoneIds() {

String[] ids = TimeZone.getAvailableIDs();
Arrays.sort(ids);
return Arrays.asList(ids);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 getAllTimezoneIds() returns IDs that get() cannot handle

TimeZone.getAvailableIDs() includes POSIX/legacy aliases such as "EST5EDT", "PST8PDT", "SystemV/CST6CDT", etc. that are not recognized by ZoneId.of(id, ZoneId.SHORT_IDS) — the mechanism used internally by GpuTimeZoneDB.getZoneId(). When any caller (including GpuTimeZoneDB.getOrcSupportedTimezones()) iterates the result and calls OrcTimezoneInfo.get() for each entry, those legacy IDs will throw IllegalArgumentException. Replacing TimeZone.getAvailableIDs() with ZoneId.getAvailableZoneIds() produces exactly the set of IDs that ZoneId.of() accepts and is therefore consistent with get().

Comment on lines +605 to +610
public static List<String> getAllTimezoneIds() {

String[] ids = TimeZone.getAvailableIDs();
Arrays.sort(ids);
return Arrays.asList(ids);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Extra blank line after the opening brace — minor style nit.

Suggested change
public static List<String> getAllTimezoneIds() {
String[] ids = TimeZone.getAvailableIDs();
Arrays.sort(ids);
return Arrays.asList(ids);
}
public static List<String> getAllTimezoneIds() {
String[] ids = TimeZone.getAvailableIDs();
Arrays.sort(ids);
return Arrays.asList(ids);
}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +492 to +530
switch (ruleMode) {
case 1: {
if (ruleDay > 0) {
int diff = ruleDayOfWeek - firstDayOfWeek;
if (diff < 0) {
diff += 7;
}
return 1 + diff + (ruleDay - 1) * 7;
} else {
int lastDayOfWeek = toCalendarDayOfWeek(
LocalDate.of(year, month + 1, monthLength).getDayOfWeek().getValue());
int diff = lastDayOfWeek - ruleDayOfWeek;
if (diff < 0) {
diff += 7;
}
return monthLength - diff + (ruleDay + 1) * 7;
}
}
case 2: {
int targetDayOfWeek = toCalendarDayOfWeek(
LocalDate.of(year, month + 1, ruleDay).getDayOfWeek().getValue());
int diff = ruleDayOfWeek - targetDayOfWeek;
if (diff < 0) {
diff += 7;
}
return ruleDay + diff;
}
case 3: {
int targetDayOfWeek = toCalendarDayOfWeek(
LocalDate.of(year, month + 1, ruleDay).getDayOfWeek().getValue());
int diff = targetDayOfWeek - ruleDayOfWeek;
if (diff < 0) {
diff += 7;
}
return ruleDay - diff;
}
default:
return ruleDay;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 computeRuleDay cases 1 and 3 are dead code with potential out-of-bounds results

All DST rules in this class are encoded exclusively as DOW_GE_DOM_MODE (value 2) — both decodeTransition and fillDstRuleFromTransitionRule hard-code DstRuleMode.DOW_GE_DOM_MODE.value — so cases 1 and 3 in computeRuleDay are never reached today. However, if case 1 (DOW_IN_MONTH) were ever invoked with ruleDay = 5 on a short month, 1 + diff + (ruleDay - 1) * 7 can exceed the month length and LocalDate.of(year, month+1, day) downstream would throw DateTimeException. Similarly, case 3 with a small ruleDay (e.g. 1) can yield a non-positive result. A comment noting the dead-code status or an explicit guard on the returned day would prevent silent misbehavior if the switch is extended later.

@res-life

Copy link
Copy Markdown
Collaborator Author

useless now.

@res-life res-life closed this Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants