[BUG] Potential misalignment of map types in assertDataFrameEqual

**Describe the bug**

This is a follow-up on https://github.com/NVIDIA/spark-rapids/pull/15058#discussion_r3415842458. The skills define a helper `assertDataFrameEquals` to check dataframe equality. It sorts the rows of both sides by their `toString` and then compares them position by position. However, the `toString` of a map value depends on the insertion order of its entries, so two rows that are equal can sort to different positions and end up compared against the wrong counterpart.

(This bug also exists in Spark's internal test utils [QueryTest.scala](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala#L1013) and [assertDataFrameEqual](https://github.com/apache/spark/blob/0caf79030573233c6db526a4289499f8756adeaf/python/pyspark/testing/utils.py#L692), from which this method was derived.)

**Steps/Code to reproduce bug**

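The following `scala-cli` script demonstrates the bug. Both sides hold the same two rows, but sorting by `toString` aligns them differently.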

```scala
//> using scala 2.12.15
//> using dep "org.apache.spark::spark-sql:3.5.5"
//> using dep "org.apache.spark:spark-sql_2.12:3.5.5,classifier=tests"
//> using dep "org.scalatest::scalatest:3.2.17"

import org.apache.spark.sql.{Row, QueryTest}

object MapSortRepro {
  def main(args: Array[String]): Unit = {
    // Same data, but the map entries are in a different insertion order
    val actual = Seq(
      Row(Map("z" -> 1, "a" -> 2)),
      Row(Map("m" -> 3, "n" -> 4))
    )
    val expected = Seq(
      Row(Map("a" -> 2, "z" -> 1)),
      Row(Map("m" -> 3, "n" -> 4))
    )
    // The row-0 maps compare equal even though their entries were inserted in a different order.
    val row0Equal = actual(0).getMap[String, Int](0) == expected(0).getMap[String, Int](0)
    println("row 0 map values are equal: " + row0Equal)
    println("actual(0) toString = " + actual(0))
    println("expected(0) toString = " + expected(0))

    // Mimic the helper: sort each side by Row.toString, then compare position by position.
    val a = actual.sortBy(_.toString)
    val e = expected.sortBy(_.toString)
    val oursDiffers = a.indices.exists(i => a(i).getMap[String, Int](0) != e(i).getMap[String, Int](0))
    println("assertDataFrameEquals reports equality: " + !oursDiffers)
  }
}

```

```bash
$ scala-cli run --server=false MapSortRepro.scala
row 0 map values are equal: true
actual(0) toString = [Map(z -> 1, a -> 2)]
expected(0) toString = [Map(a -> 2, z -> 1)]
assertDataFrameEquals reports equality: false
```

**Expected behavior**

Our `assertDataFrameEquals` should either compare the two sides as multisets of rows or sort them by a canonical key that does not depend on map insertion order. Either approach avoids spurious false negatives, i.e. reporting "not equal" for dataframes whose rows are equal.
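
Both options can be sketched roughly as follows. This is only an illustration under stated assumptions, not the helper's actual code: the names `canonical`, `sameRowsSorted` and `sameRowsMultiset` are hypothetical, and rows are modeled as plain `Seq[Any]` to keep the example dependency-free (a Spark `Row` could be converted with `row.toSeq`, though nested `Row` values would need their own case).

```scala
object CanonicalRowCompare {
  // Hypothetical helper: render a value so that map entries always appear
  // sorted by their (canonical) key, regardless of insertion order.
  def canonical(value: Any): String = value match {
    case null => "null"
    case m: scala.collection.Map[_, _] =>
      m.toSeq
        .map { case (k, v) => canonical(k) -> canonical(v) }
        .sorted
        .map { case (k, v) => s"$k -> $v" }
        .mkString("Map(", ", ", ")")
    case s: scala.collection.Seq[_] =>
      s.map(canonical).mkString("[", ", ", "]")
    case other => other.toString
  }

  // Option 1: sort both sides by the canonical key, then compare in order.
  // Assumes rows with the same canonical key are also equal as values.
  def sameRowsSorted(actual: Seq[Seq[Any]], expected: Seq[Seq[Any]]): Boolean =
    actual.sortBy(r => canonical(r)) == expected.sortBy(r => canonical(r))

  // Option 2: compare as multisets (row -> number of occurrences).
  // Relies on Scala Map equality and hashing being order-independent.
  def sameRowsMultiset(actual: Seq[Seq[Any]], expected: Seq[Seq[Any]]): Boolean = {
    def counts(rows: Seq[Seq[Any]]): Map[Seq[Any], Int] =
      rows.groupBy(identity).map { case (row, group) => row -> group.size }
    counts(actual) == counts(expected)
  }

  def main(args: Array[String]): Unit = {
    // Same data as the repro above, with rows modeled as Seq[Any].
    val actual = Seq(Seq[Any](Map("z" -> 1, "a" -> 2)), Seq[Any](Map("m" -> 3, "n" -> 4)))
    val expected = Seq(Seq[Any](Map("a" -> 2, "z" -> 1)), Seq[Any](Map("m" -> 3, "n" -> 4)))
    println("canonical key of actual(0): " + canonical(actual(0)))
    println("sorted comparison reports equality: " + sameRowsSorted(actual, expected))
    println("multiset comparison reports equality: " + sameRowsMultiset(actual, expected))
  }
}
```

The multiset variant avoids sorting altogether but depends on well-behaved `equals`/`hashCode` for every column type; the canonical-key variant keeps the existing sort-then-compare structure and only changes the key.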

[BUG] Potential misalignment of map types in assertDataFrameEqual #15095
