This repository was archived by the owner on Dec 15, 2021. It is now read-only.

The Apache Avro library failed to parse the header #57

@matthew-fishkin

Spark version: 2.2.0
Spotify/spark-bigquery version: 0.2.2

Hi,

I am trying to use the saveAsBigQueryTable function to write a DataFrame whose schema has an array of structs as a field. However, I am getting the following error:

The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .topic_scores

The offending field is:


{
    "type": [
        {
            "items": [
                {
                    "namespace": ".topic_scores",
                    "type": "record",
                    "name": "topic_scores",
                    "fields": [
                        {
                            "type": "int",
                            "name": "index"
                        },
                        {
                            "type": "float",
                            "name": "score"
                        }
                    ]
                },
                "null"
            ],
            "type": "array"
        },
        "null"
    ],
    "name": "topic_scores"
}

You can see that the namespace field begins with a dot. My guess is that the issue stems from https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L342-L346
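To illustrate why a leading dot is rejected, here is a minimal sketch of a namespace check based on my reading of the naming rules in the Avro specification (a namespace is a dot-separated sequence of names, and the empty string means "no namespace"). The object and method names are made up for this example and are not part of Avro, spark-avro, or spark-bigquery.

```scala
object AvroNamespaceCheck {
  // One name component: a letter or underscore, followed by
  // letters, digits, or underscores.
  private val NameComponent = "[A-Za-z_][A-Za-z0-9_]*".r

  /** True if `ns` is empty (no namespace) or a dot-separated
    * sequence of valid name components. */
  def isValidNamespace(ns: String): Boolean =
    ns.isEmpty || ns
      .split("\\.", -1) // limit -1 keeps empty components, e.g. the one before a leading dot
      .forall(c => NameComponent.pattern.matcher(c).matches())

  def main(args: Array[String]): Unit = {
    // The namespace from the offending field above.
    println(isValidNamespace(".topic_scores"))
    // The namespace used in the spark-avro documentation example.
    println(isValidNamespace("com.databricks.spark.avro"))
  }
}
```

Under these rules, ".topic_scores" fails because the component before the dot is empty, which would be consistent with the "Invalid namespace" error.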

I can't find a way to configure the recordNamespace value. According to the spark-avro documentation:

You can specify the record name and namespace like this:

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
val df = spark.read.avro("src/test/resources/episodes.avro")

val name = "AvroTest"
val namespace = "com.databricks.spark.avro"
val parameters = Map("recordName" -> name, "recordNamespace" -> namespace)

df.write.options(parameters).avro("/tmp/output")

I think this is the line that reads that option, and sets the value to an empty string if not provided: https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/DefaultSource.scala#L114

The Spotify library does not expose these options anywhere. Has anyone seen this issue, or does anyone have a workaround? Thanks!
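For reference, here is a rough sketch of how I suspect the leading dot arises, and one possible guard. It assumes the nested record's namespace is built by joining the parent namespace and the field name with a dot. I have not confirmed that this is exactly what SchemaConverters does, and both helpers are hypothetical names for illustration only.

```scala
object NamespaceJoin {
  // Hypothetical: join parent namespace and field name unconditionally.
  // With an empty parent this produces a leading dot.
  def naiveChild(parent: String, field: String): String =
    s"$parent.$field"

  // Hypothetical guard: skip the dot when the parent namespace is empty.
  def safeChild(parent: String, field: String): String =
    if (parent.isEmpty) field else s"$parent.$field"

  def main(args: Array[String]): Unit = {
    println(naiveChild("", "topic_scores"))
    println(safeChild("", "topic_scores"))
    println(safeChild("com.databricks.spark.avro", "topic_scores"))
  }
}
```

If the converter behaves like naiveChild, either a guard along these lines or a way to pass a non-empty recordNamespace through the Spotify library would avoid the invalid namespace.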
