feat: rand expression support #1199
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##               main    #1199      +/-  ##
============================================
- Coverage     34.78%   34.11%   -0.67%
+ Complexity      957      925      -32
============================================
  Files           115      115
  Lines         43569    43586     +17
  Branches       9528     9556     +28
============================================
- Hits          15155    14870    -285
- Misses        25449    25763    +314
+ Partials       2965     2953     -12

View full report in Codecov by Sentry.
Thanks @akupchinskiy. I plan on reviewing this after the holidays.
Are the partition-related changes necessary for this PR? Otherwise, it might be better to reduce the scope to just the …
native/spark-expr/src/rand.rs
Outdated
const DOUBLE_UNIT: f64 = 1.1102230246251565e-16;
const SPARK_MURMUR_ARRAY_SEED: u32 = 0x3c074a61;
It would be really helpful if you could add documentation / references around these constants.
Added doc comments with all the references.
native/spark-expr/src/rand.rs
Outdated
match self.seed.evaluate(batch)? {
    ColumnarValue::Scalar(seed) => self.evaluate_batch(seed, batch.num_rows()),
    ColumnarValue::Array(_arr) => Err(DataFusionError::NotImplemented(format!(
        "Only literal seeds are not supported for {}",
Error message seems to have a typo
Thx, fixed
@@ -317,7 +317,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_executePlan(
     // query plan, we need to defer stream initialization to first time execution.
     if exec_context.root_op.is_none() {
         let start = Instant::now();
-        let planner = PhysicalPlanner::new(Arc::clone(&exec_context.session_ctx))
+        let planner = PhysicalPlanner::new(Arc::clone(&exec_context.session_ctx), partition)
This part is interesting. Is there any reason the partition is not used in the Comet native physical planner? It is definitely used in the DF physical plan during plan node execution: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/execution_plan.rs#L371
The Spark partition index is erased, for some reason, when a native DF plan is sent for execution: https://github.com/apache/datafusion-comet/blob/main/native/core/src/execution/jni_api.rs#L496
There is a handful of expressions besides rand() relying on the partition index. All of them implement a nondeterministic trait providing a hook method to initialize state before a partition is evaluated in the Spark runtime. Encapsulation-wise, I agree that the scope of the partition exposure should be limited, but I could not find another way to extract it other than making it part of the planner struct.
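The hook described above can be sketched as follows. This is an illustrative sketch only, not Comet's actual API: the trait name, method name, and the `Rand` fields are assumptions. The per-partition seeding rule (`seed + partition_index`) mirrors how Spark itself initializes `rand()` per partition.

```rust
use std::fmt::Debug;

/// Hypothetical trait (names are illustrative, not Comet's real API) for
/// expressions whose output depends on the Spark partition index.
trait Nondeterministic: Debug {
    /// Called once per partition, before any batch of that partition is
    /// evaluated, so the expression can derive partition-local state.
    fn initialize(&mut self, partition_index: usize);
}

#[derive(Debug)]
struct Rand {
    init_seed: i64,
    /// Partition-local RNG seed, set by `initialize`.
    state: Option<i64>,
}

impl Nondeterministic for Rand {
    fn initialize(&mut self, partition_index: usize) {
        // Spark seeds rand() per partition as seed + partitionIndex,
        // which is what makes the partition index necessary here.
        self.state = Some(self.init_seed + partition_index as i64);
    }
}
```

With such a hook, the planner only needs to thread the partition index through once at plan-build time, which is why it ends up stored on the planner struct.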
native/spark-expr/src/rand.rs
Outdated
/// Adoption of the XOR-shift algorithm used in Apache Spark.
/// See: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/random/XORShiftRandom.scala

/// Normalization multiplier used in mapping from a random i64 value to the f64 interval [0.0, 1.0).
/// Corresponds to the java implementation: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/Random.java#L302
/// Due to the lack of hexadecimal float literal support in rust, scientific notation is used instead.
const DOUBLE_UNIT: f64 = 1.1102230246251565e-16;

/// Spark-compatible initial seed which is actually a part of the scala standard library murmurhash3 implementation.
/// References:
/// https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/random/XORShiftRandom.scala#L63
/// https://github.com/scala/scala/blob/2.13.x/src/library/scala/util/hashing/MurmurHash3.scala#L331
const SPARK_MURMUR_ARRAY_SEED: u32 = 0x3c074a61;
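To make the role of `DOUBLE_UNIT` concrete, here is a minimal sketch (not the PR's code) of how Spark's `XORShiftRandom` step and `java.util.Random#nextDouble` combine 26 + 27 random bits into a 53-bit mantissa and scale it into [0.0, 1.0); the struct and method names are assumptions for illustration.

```rust
/// 1.0 / 2^53, matching java.util.Random's DOUBLE_UNIT.
const DOUBLE_UNIT: f64 = 1.1102230246251565e-16;

struct XorShiftRandom {
    seed: i64,
}

impl XorShiftRandom {
    /// XOR-shift step as in Spark's XORShiftRandom: mutate the state,
    /// then keep the low `bits` bits of the new state.
    fn next(&mut self, bits: u32) -> i32 {
        let mut next_seed = self.seed ^ (self.seed << 21);
        next_seed ^= ((next_seed as u64) >> 35) as i64; // unsigned shift, like Scala's >>>
        next_seed ^= next_seed << 4;
        self.seed = next_seed;
        (next_seed & ((1i64 << bits) - 1)) as i32
    }

    /// Combine 26 high + 27 low bits into a 53-bit integer, then multiply
    /// by DOUBLE_UNIT to land in [0.0, 1.0).
    fn next_f64(&mut self) -> f64 {
        let high = (self.next(26) as i64) << 27;
        let low = self.next(27) as i64;
        ((high + low) as f64) * DOUBLE_UNIT
    }
}
```

Because the 53-bit value is strictly less than 2^53 and `DOUBLE_UNIT = 1/2^53`, the product never reaches 1.0, which is exactly the half-open interval `rand()` promises.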
Please replace the links with permalinks: when the file moves, the link will no longer be valid, and if the logic changes we can tell when.
Which issue does this PR close?
Closes #1198
Rationale for this change
Support for the Spark rand() expression.
What changes are included in this PR?
How are these changes tested?
Spark compatibility tests and an expression correctness test are included in the PR.