[BUG] Potential misalignment of map types in assertDataFrameEqual #15095

Description

@rishic3

Describe the bug

This is a follow-up to #15058 (comment). The skills define a helper, assertDataFrameEquals, that checks DataFrame equality by sorting the rows of both sides by their toString and then comparing them position by position. However, the toString of a map value depends on insertion order, so two equal rows can get different sort keys, land at different positions, and be spuriously reported as unequal.

(This bug also exists in Spark's internal test utilities, QueryTest.scala and assertDataFrameEqual, from which this helper was derived.)

Steps/Code to reproduce bug

For example, the following scala-cli script demonstrates the bug:

//> using scala 2.12.15
//> using dep "org.apache.spark::spark-sql:3.5.5"
//> using dep "org.apache.spark:spark-sql_2.12:3.5.5,classifier=tests"
//> using dep "org.scalatest::scalatest:3.2.17"

import org.apache.spark.sql.{Row, QueryTest}

object MapSortRepro {
  def main(args: Array[String]): Unit = {
    // Same data, but the map entries are in a different insertion order
    val actual = Seq(
      Row(Map("z" -> 1, "a" -> 2)),
      Row(Map("m" -> 3, "n" -> 4))
    )
    val expected = Seq(
      Row(Map("a" -> 2, "z" -> 1)),
      Row(Map("m" -> 3, "n" -> 4))
    )
    println("row 0 map values are equal: " + (actual(0).getMap[String, Int](0) == expected(0).getMap[String, Int](0)))
    println("actual(0) toString = " + actual(0))
    println("expected(0) toString = " + expected(0))

    // Mimic the helper: sort both sides by Row.toString, then compare position by position
    val a = actual.sortBy(_.toString)
    val e = expected.sortBy(_.toString)
    val oursDiffers = a.indices.exists(i => a(i).getMap[String, Int](0) != e(i).getMap[String, Int](0))
    println("assertDataFrameEquals reports equality: " + !oursDiffers)
  }
}

Running the script:

$ scala-cli run --server=false MapSortRepro.scala
row 0 map values are equal: true
actual(0) toString = [Map(z -> 1, a -> 2)]
expected(0) toString = [Map(a -> 2, z -> 1)]
assertDataFrameEquals reports equality: false

Expected behavior

Our assertDataFrameEquals should either use a multiset comparison or sort by a canonical key that does not depend on map insertion order, so that equal values are not spuriously reported as unequal (false negatives).
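
A minimal sketch of both options, not the helper's actual implementation. To stay dependency-free it models each row as a Seq[Any] (roughly what Row.toSeq returns); the names CanonicalRowCompare, SimpleRow, canonicalKey, and sameRows are hypothetical and do not exist in the skills or in Spark. It ignores cases such as binary columns (arrays compare by reference) and floating-point tolerance.

```scala
object CanonicalRowCompare {
  // Assumption: a row is modeled as a plain Seq[Any] instead of a Spark Row.
  type SimpleRow = Seq[Any]

  // Option 1: a sort key that does not depend on map insertion order.
  // Map entries are rendered recursively and then sorted, so two equal maps
  // always produce the same key. The key is only used for ordering rows;
  // equality is still decided with ==.
  def canonicalKey(value: Any): String = value match {
    case null => "null"
    case m: scala.collection.Map[_, _] =>
      m.iterator
        .map { case (k, v) => canonicalKey(k) + " -> " + canonicalKey(v) }
        .toSeq
        .sorted
        .mkString("Map(", ", ", ")")
    case s: Seq[_] => s.map(canonicalKey).mkString("[", ", ", "]")
    case other => other.toString
  }

  // Option 2: a multiset comparison that avoids sorting altogether.
  // Each distinct row is counted; Scala's Map equality and hashCode ignore
  // entry order, so equal rows fall into the same bucket.
  def sameRows(actual: Seq[SimpleRow], expected: Seq[SimpleRow]): Boolean = {
    def counts(rows: Seq[SimpleRow]): Map[SimpleRow, Int] =
      rows.groupBy(identity).map { case (row, group) => row -> group.size }
    counts(actual) == counts(expected)
  }

  def main(args: Array[String]): Unit = {
    // Same data as the repro above, with rows modeled as Seq[Any].
    val actual = Seq(Seq(Map("z" -> 1, "a" -> 2)), Seq(Map("m" -> 3, "n" -> 4)))
    val expected = Seq(Seq(Map("a" -> 2, "z" -> 1)), Seq(Map("m" -> 3, "n" -> 4)))

    val a = actual.sortBy(row => canonicalKey(row))
    val e = expected.sortBy(row => canonicalKey(row))
    println("aligned after canonical sort: " + (a == e))
    println("multiset comparison reports equality: " + sameRows(actual, expected))
  }
}
```

The multiset variant is the simpler of the two, but it gives up the row-by-row alignment that makes mismatch messages readable. The canonical key keeps that alignment and could replace toString in the existing sortBy calls.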

Labels: bug, skills
