Describe the bug
This is a follow-up on #15058 (comment). The skills define a helper, assertDataFrameEquals, that checks DataFrame equality by sorting the rows on each side by their toString before comparing them. However, the toString of a map value depends on the map's insertion order, so two equal rows can sort to different positions and be spuriously misaligned.
(This bug also exists in Spark's internal test utilities, QueryTest.scala and assertDataFrameEqual, from which this method was derived.)
Steps/Code to reproduce bug
For example, the following scala-cli script demonstrates the bug:
//> using scala 2.12.15
//> using dep "org.apache.spark::spark-sql:3.5.5"
//> using dep "org.apache.spark:spark-sql_2.12:3.5.5,classifier=tests"
//> using dep "org.scalatest::scalatest:3.2.17"
import org.apache.spark.sql.{Row, QueryTest}

object MapSortRepro {
  def main(args: Array[String]): Unit = {
    // Same data, but the map entries are in a different insertion order
    val actual = Seq(
      Row(Map("z" -> 1, "a" -> 2)),
      Row(Map("m" -> 3, "n" -> 4))
    )
    val expected = Seq(
      Row(Map("a" -> 2, "z" -> 1)),
      Row(Map("m" -> 3, "n" -> 4))
    )

    // The maps in row 0 are equal as values, but their toString differs
    val row0Equal = actual(0).getMap[String, Int](0) == expected(0).getMap[String, Int](0)
    println("row 0 map values are equal: " + row0Equal)
    println("actual(0) toString = " + actual(0))
    println("expected(0) toString = " + expected(0))

    // Mimic the helper: sort each side by toString, then compare position by position
    val a = actual.sortBy(_.toString)
    val e = expected.sortBy(_.toString)
    val oursDiffers = a.indices.exists(i => a(i).getMap[String, Int](0) != e(i).getMap[String, Int](0))
    println("assertDataFrameEquals reports equality: " + !oursDiffers)
  }
}
$ scala-cli run --server=false MapSortRepro.scala
row 0 map values are equal: true
actual(0) toString = [Map(z -> 1, a -> 2)]
expected(0) toString = [Map(a -> 2, z -> 1)]
assertDataFrameEquals reports equality: false
Expected behavior
Our assertDataFrameEquals should use a multiset comparison or use a canonical sort key to properly sort on map types to avoid spurious false negatives (reporting not equal on equal values).
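As a rough illustration of the multiset option, here is a minimal sketch that needs no Spark dependency. It assumes each row has already been converted to a plain Seq[Any] (for example via Row.toSeq). The names MultisetCompareSketch, asMultiset and sameRows are hypothetical and not part of the existing helper.

```scala
object MultisetCompareSketch {
  // Hypothetical helper: count how many times each row occurs.
  // Scala Map equality and hashing ignore insertion order, so rows that
  // differ only in map entry order land in the same bucket.
  def asMultiset(rows: Seq[Seq[Any]]): Map[Seq[Any], Int] =
    rows.groupBy(identity).map { case (row, dups) => row -> dups.size }

  // Two collections of rows match when every distinct row occurs
  // the same number of times on both sides; no sorting is involved.
  def sameRows(actual: Seq[Seq[Any]], expected: Seq[Seq[Any]]): Boolean =
    asMultiset(actual) == asMultiset(expected)

  def main(args: Array[String]): Unit = {
    // Same data as the repro above, with rows written as plain Seq[Any]
    val actual = Seq(
      Seq[Any](Map("z" -> 1, "a" -> 2)),
      Seq[Any](Map("m" -> 3, "n" -> 4))
    )
    val expected = Seq(
      Seq[Any](Map("a" -> 2, "z" -> 1)),
      Seq[Any](Map("m" -> 3, "n" -> 4))
    )
    println("multiset comparison reports equality: " + sameRows(actual, expected))
  }
}
```

This sketch relies on value equality, so it would not cover columns holding arrays (which compare by reference on the JVM) or NaN doubles. A real fix would need to normalize those first or fall back to a canonical sort key.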