
Releases: MobileTeleSystems/data-rentgen

0.4.5 (2025-12-24)

24 Dec 15:48
02b53ee


Improvements

Allow disabling SessionMiddleware, as it is only required by KeycloakAuthProvider.

0.4.4 (2025-11-21)

21 Nov 16:51
d76fdb5


Bug Fixes

  • Fix inputs with 0 bytes statistics, broken by the 0.4.3 release.

0.4.3 (2025-11-21)

21 Nov 15:52
04d73bb


Features

  • Disable server.session.enabled by default. It is required only by KeycloakAuthProvider which is not used by default.
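If you do use KeycloakAuthProvider, sessions have to be re-enabled explicitly. A minimal sketch; the variable name below is an assumption inferred from the DATA_RENTGEN__UI__AUTH_PROVIDER naming convention, so check the configuration docs for the exact key:

```shell
# Assumed mapping of server.session.enabled to an environment variable
# (double-underscore convention); re-enable sessions only for Keycloak:
export DATA_RENTGEN__SERVER__SESSION__ENABLED=true
```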

Bug Fixes

  • Escape unprintable ASCII characters in SQL queries before storing them in Postgres. Previously, saving queries containing the \x00 character led to exceptions.
  • Kafka topic with malformed messages no longer has to use the same number of partitions as input topics.
  • Prevent OpenLineage from reporting events which claim to read 8 exabytes of data; this is actually a Spark quirk.
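The escaping fix above can be sketched as follows; `sanitize_sql` is a hypothetical helper illustrating the idea, not the actual function used in the codebase:

```python
import re

# Strip unprintable ASCII control characters (including \x00, which
# Postgres text columns reject) while keeping tabs and newlines.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def sanitize_sql(query: str) -> str:
    # Hypothetical helper: remove control characters before storing.
    return CONTROL_CHARS.sub("", query)
```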

0.4.2 (2025-10-29)

29 Oct 15:32
bb01ca3


Bug Fixes

  • Fix search query filter on UI Run list page.
  • Fix passing multiple filters to GET /v1/runs.

Doc-only Changes

  • Document DATA_RENTGEN__UI__AUTH_PROVIDER config variable.

0.4.1 (2025-10-08)

08 Oct 14:15
c5a2ade


Features

  • Add new GET /v1/locations/types endpoint returning list of all known location types. (#328)

  • Add new filter to GET /v1/jobs (#328):

    • location_type: list[str]
  • Add new filter to GET /v1/datasets (#328):

    • location_type: list[str]
  • Allow passing multiple location_type filters to GET /v1/locations. (#328)

  • Allow passing multiple values to GET endpoints with filters like job_id, parent_run_id, and so on. (#329)
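Multiple filter values are passed as repeated query parameters. A quick sketch using the standard library; the base URL is assumed, the endpoint path comes from the notes above:

```python
from urllib.parse import urlencode

# doseq=True encodes each list element as a separate key=value pair.
params = {"location_type": ["postgres", "oracle"]}
query = urlencode(params, doseq=True)
# query == "location_type=postgres&location_type=oracle"
url = f"http://localhost:8000/v1/jobs?{query}"
```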

0.4.0 (2025-10-03)

03 Oct 13:56
9e97ab2


Features

  • Introduce new http2kafka component. (#281)

    It allows using DataRentgen with OpenLineage HttpTransport. Authentication is done using personal tokens.

  • Add REST API endpoints for managing personal tokens. (#276)

    • List of endpoints:

      • GET /personal-tokens - get personal tokens for current user.
      • POST /personal-tokens - create new personal token for current user.
      • PATCH /personal-tokens/:id - refresh personal token (revoke token and create new one).
      • DELETE /personal-tokens/:id - revoke personal token.
  • Add new entities Tag and TagValue. #268

    Tags can be used as additional properties for other entities. This feature is still under construction.

  • Added endpoint GET /v1/tags. #289

    Tag names and values can be paginated, searched by, or fetched by ids.

    Response example

    [
        {
            "id": 1,
            "name": "env",
            "values": [
                {
                    "id": 1,
                    "value": "dev"
                },
                {
                    "id": 2,
                    "value": "prod"
                }
            ]
        }
    ]
  • Updated GET /v1/datasets to include tags: [...] in response. #289

    Dataset response examples

    Before:

    {
        "id": "8400",
        "location": {...},
        "name": "dataset_name",
        "schema": {},
    }

    After:

    {
        "id": "25896",
        "location": {...},
        "name": "dataset_name",
        "schema": {...},
        "tags": [  # <---
            {
                "id": "1",
                "name": "environment",
                "values": [
                    {
                        "id": "2",
                        "value": "production"
                    }
                ]
            },
            {
                "id": "2",
                "name": "team",
                "values": [
                    {
                        "id": "4",
                        "value": "my_awesome_team"
                    }
                ]
            }
        ]
    }
  • Added new filters to GET /v1/datasets endpoint. (#294, #289)

    • Query params:

      • location_id: int
      • tag_value_id: list[int] - if multiple values are passed, the dataset must have all of them.
  • Added new filters for GET /v1/jobs endpoint. #319

    • Query params:

      • location_id: int
      • job_type: list[str]
  • Added new filters to GET /v1/runs endpoint. (#322, #323)

    • Query params:

      • job_type: list[str]
      • status: list[RunStatus]
      • started_since: datetime | None
      • started_until: datetime | None
      • ended_since: datetime | None
      • ended_until: datetime | None
      • job_location_id: int | None
      • started_by_user: list[str] | None
  • Added new endpoint GET /v1/jobs/types. #319

  • Add custom dataRentgen_run and dataRentgen_operation facets. #265

    • These facets allow:

      • Passing custom external_id, persistent_log_url and other fields of Run.
      • Passing custom name, description, group, position fields of Operation.
      • Marking an event as containing only Operation data, or both Run + Operation data.
  • Set output.type based on executed SQL query, e.g. INSERT, UPDATE, DELETE, and so on. #310
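A sketch of attaching the custom dataRentgen_run facet to an OpenLineage run event. Field names are taken from the notes above, but the exact facet schema and the example values are assumptions:

```python
# Illustrative OpenLineage run event carrying the custom facet;
# the facet schema here is an assumption, not the official spec.
run_event = {
    "eventType": "START",
    "run": {
        "runId": "00000000-0000-0000-0000-000000000000",
        "facets": {
            "dataRentgen_run": {
                "external_id": "my-scheduler-task-123",
                "persistent_log_url": "https://logs.example.com/task/123",
            },
        },
    },
}
```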

Improvements

  • Improve consumer performance by reducing DB load on reading operations. #314

  • Add a workaround for OpenLineage emitting Spark application events with job.name=unknown. #263

    This requires installing OpenLineage with this fix merged: OpenLineage/OpenLineage#3848.

  • Dataset symlinks with no inputs/outputs are no longer removed from lineage graph. #269

  • Make matching for addresses and locations more deterministic by converting them to lowercase. #313

    Items oracle://host:1521 and ORACLE://HOST:1521 now resolve to the same item, oracle://host:1521.

  • Make matching for datasets, jobs, tags and user names case-insensitive by using unique indexes on lower(name) expression. #313

    Items database.schema.table and DATABASE.SCHEMA.TABLE are now the same item.

    As the canonical dataset name depends on the database naming convention (UPPERCASE for Oracle, lowercase for Postgres), we can't convert names into one specific case (upper or lower). Instead, the first received value is used as the canonical one.
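The matching rule above can be sketched as a lowercase-keyed map that keeps the first received spelling as canonical. A minimal illustration only; the actual implementation uses unique indexes on the lower(name) expression:

```python
# Map of lowercased name -> first received (canonical) spelling.
canonical: dict[str, str] = {}

def register(name: str) -> str:
    # Compare case-insensitively, but keep the first received
    # spelling as the canonical one.
    return canonical.setdefault(name.lower(), name)

register("DATABASE.SCHEMA.TABLE")  # first seen becomes canonical
register("database.schema.table")  # matches the same item
```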

Bug Fixes

  • For lineage with granularity=DATASET, return the real lineage graph. #264

    Previous versions resolved lineage by run_id, which could produce a wrong lineage graph. v0.4.x now resolves lineage by operation_id.

  • Exclude self-referencing lineage edges in case granularity=DATASET. #261

    If some run uses the same table as both input and output (e.g. merging duplicates or performing some checks before writing), DataRentgen excludes dataset1 -> dataset1 relations from lineage.

    This doesn't affect chains like dataset1 -> job1 -> dataset1 or dataset1 -> dataset2 -> dataset1.
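A minimal sketch of the edge filtering described above: direct self-references are dropped, longer cycles survive untouched:

```python
# Illustrative dataset-level lineage edges (src, dst).
edges = [
    ("dataset1", "dataset1"),  # direct self-reference: dropped
    ("dataset1", "dataset2"),  # part of a longer cycle: kept
    ("dataset2", "dataset1"),  # kept as well
]
filtered = [(src, dst) for src, dst in edges if src != dst]
```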

0.3.1 (2025-07-04)

04 Jul 20:26
f6d5474


Breaking changes

  • Drop Dataset.format field.

Improvements

  • Added syntax highlighting for SQL queries.

0.3.0 (2025-07-04)

04 Jul 14:50
ca2155a


Features

  • Improved support for openlineage-airflow #210.

    Before, we tracked only DAG and Task start/stop events, but not lineage. Now we store lineage produced by Airflow Operators like SQLExecuteQueryOperator.

  • Added support for openlineage-flink #214.

  • Added support for openlineage-hive #245.

  • Added support for openlineage-dbt #223.

  • Add DATASET granularity for GET /api/datasets/lineage #235.

  • Store SQL queries received from OpenLineage integrations. (#213, #218).

Breaking changes

  • Change Output.type in GET /api/*/lineage response from Enum to List[Enum] #222.

    Response examples

    Before:

    {
        "nodes": {...},
        "relations": {
            "outputs": [
                {
                "from": {"kind": "JOB", "id": 3981},
                "to": {"kind": "DATASET", "id": 8400},
                "types": "OVERWRITE",  # <---
                ...
                }
            ]
        },
    }

    After:

    {
        "nodes": {...},
        "relations": {
            "outputs": [
                {
                "from": {"kind": "JOB", "id": 3981},
                "to": {"kind": "DATASET", "id": 8400},
                "types": ["OVERWRITE", "DROP", "TRUNCATE"],  # <---
                ...
                }
            ]
        },
    }

    We use the output schema, if any, with a fallback to the input schema.

  • Moved Input.schema and Output.schema to Dataset.schema in GET /api/*/lineage response #249.

    Response examples

    Before:

    {
        "nodes": {
            "datasets": {
                "8400": {
                    "id": "8400",
                    "location": {...},
                    "name": "dataset_name",
                    ...
                }
            }
    
        },
        "relations": {
            "outputs": [
                {
                    "from": {"kind": "JOB", "id": 3981},
                    "to": {"kind": "DATASET", "id": 8400},
                    "types": "OVERWRITE",
                    "schema": {  # <---
                        "id": "10062",
                        "fields": [ ... ],
                        "relevance_type": "EXACT_MATCH"
                    }
                }
            ]
        },
    }

    After:

    {
        "nodes": {
            "datasets": {
                "8400": {
                    "id": "25896",
                    "location": {...},
                    "name": "dataset_name",
                    "schema": {  # <---
                        "id": "10062",
                        "fields": [...],
                        "relevance_type": "EXACT_MATCH"
                    }
                }
            }
            ...
        },
        "relations": {
            "outputs": [
                {
                    "from": {"kind": "JOB", "id": 3981},
                    "to": {"kind": "DATASET", "id": 8400},
                    "types": ["OVERWRITE", "DROP", "TRUNCATE"],
                }
            ]
        },
    }

Improvements

  • Added cleanup_partitions.py script to automate the cleanup of old table partitions #254.

  • Added data_rentgen.db.seed script which creates example data in the database #257.

  • Speedup fetching Run and Operation from the database by id #247.

  • Speedup consuming OpenLineage events from Kafka #236.

  • Make consumer message parsing more robust #204.

    Previously, malformed OpenLineage events (JSON) led to skipping the entire message batch read from Kafka. Now messages are parsed separately, and malformed ones are sent to the input.runs__malformed Kafka topic.

  • Improve storing lineage data for long running operations #253.

    Previously, if an operation ran for a long time (more than a day; Flink streaming jobs can easily run for months or years) and the lineage graph was built for the last day, there was no Flink job/run/operation in the graph.

    This is because input/output/column lineage rows were created at operation start, and RUNNING events of the same operation (checkpoints) just updated the statistics of the same rows.

    Now we create new input/output/column lineage rows for checkpoint events as well, but only one row per hour since the operation started, as increasing the number of rows slows down lineage graph resolution.

    For short-lived operations (most batch operations take less than an hour) the behavior remains unchanged.
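The one-row-per-hour rule can be sketched as bucketing checkpoint events by whole hours elapsed since the operation started; `hour_bucket` is a hypothetical helper, not the actual implementation:

```python
from datetime import datetime, timedelta

def hour_bucket(operation_started: datetime, event_time: datetime) -> int:
    # All checkpoint events falling into the same hour since operation
    # start update the same lineage row instead of creating new ones.
    return int((event_time - operation_started) // timedelta(hours=1))
```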

Bug Fixes

  • Fix Airflow 3.x DAG and Task url template (#227).

0.2.1 (2025-04-07)

07 Apr 12:03
a4f8cf2


Improvements

  • Reduce Docker image size by 2x.
  • Change docker image user from root to data-rentgen, to improve security.
  • SBOM file is generated on release.

0.2.0 (2025-03-25)

25 Mar 13:49
9c40220


TL;DR

  • Implemented column lineage support.
  • HDFS/S3 partitions are now truncated from table path.
  • Added total run/operation statistics (input/output bytes, rows, files).
  • Lineage graph UX improvements.
  • Kafka -> consumer integrations improvements.

Breaking Changes

  • Change response schema of GET /operations. (#158)

    Operation properties are moved to the data key, and a new statistics key is added. This allows showing operation statistics in the UI without building a lineage graph.

    Response examples

    Before:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "kind": "OPERATION",
                "id": "00000000-0000-0000-0000-000000000000",
                "name": "abc",
                "description": "some",
                // ...
            }
        ],
    }

    After:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "id": "00000000-0000-0000-0000-000000000000",
                "data": {
                    "id": "00000000-0000-0000-0000-000000000000",
                    "name": "abc",
                    "description": "some",
                    // ...
                },
                "statistics": {
                    "inputs": {
                        "total_datasets": 2,
                        "total_bytes": 123456,
                        "total_rows": 100,
                        "total_files": 0,
                    },
                    "outputs": {
                        "total_datasets": 2,
                        "total_bytes": 123456,
                        "total_rows": 100,
                        "total_files": 0,
                    },
                },
            }
        ],
    }
  • Change response schema of GET /runs. (#159)

    Run properties are moved to the data key, and a new statistics key is added. This allows showing run statistics in the UI without building a lineage graph.

    Response examples

    Before:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "kind": "RUN",
                "id": "00000000-0000-0000-0000-000000000000",
                "external_id": "abc",
                "description": "some",
                // ...
            }
        ],
    }

    After:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "id": "00000000-0000-0000-0000-000000000000",
                "data": {
                    "id": "00000000-0000-0000-0000-000000000000",
                    "external_id": "abc",
                    "description": "some",
                    // ...
                },
                "statistics": {
                    "inputs": {
                        "total_datasets": 2,
                        "total_bytes": 123456,
                        "total_rows": 100,
                        "total_files": 0,
                    },
                    "outputs": {
                        "total_datasets": 2,
                        "total_bytes": 123456,
                        "total_rows": 100,
                        "total_files": 0,
                    },
                    "operations": {
                        "total_operations": 10,
                    },
                },
            }
        ],
    }
  • Change response schema of GET /locations. (#160)

    Location properties are moved to the data key, and a new statistics key is added. This allows showing location statistics in the UI.

    Response examples

    Before:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "kind": "LOCATION",
                "id": 123,
                "name": "rnd_dwh",
                "type": "hdfs",
                // ...
            }
        ],
    }

    After:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "id": "123",
                "data": {
                    "id": "123",
                    "name": "rnd_dwh",
                    "type": "hdfs",
                    // ...
                },
                "statistics": {
                    "datasets": {"total_datasets": 2},
                    "jobs": {"total_jobs": 0},
                },
            }
        ],
    }

    Same for PATCH /locations/:id:

    Response examples

    Before:

    {
        "kind": "LOCATION",
        "id": 123,
        "name": "abc",
        // ...
    }

    After:

    {
        "id": "123",
        "data": {
            "id": "123",
            "name": "abc",
            // ...
        },
        "statistics": {
            "datasets": {"total_datasets": 2},
            "jobs": {"total_jobs": 0},
        },
    }
  • Change response schema of GET /datasets. (#161)

    Dataset properties are moved to the data key. This makes the API response more consistent with other endpoints (e.g. GET /runs, GET /operations).

    Response examples

    Before:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "kind": "DATASET",
                "id": 123,
                "name": "abc",
                // ...
            }
        ],
    }

    After:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "id": "123",
                "data": {
                    "id": "123",
                    "name": "abc",
                    // ...
                },
            }
        ],
    }
  • Change response schema of GET /jobs. (#162)

    Job properties are moved to the data key. This makes the API response more consistent with other endpoints (e.g. GET /runs, GET /operations).

    Response examples

    Before:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "kind": "JOB",
                "id": 123,
                "name": "abc",
                // ...
            }
        ],
    }

    After:

    {
        "meta": {
            // ...
        },
        "items": [
            {
                "id": "123",
                "data": {
                    "id": "123",
                    "name": "abc",
                    // ...
                },
            }
        ],
    }
  • Change response schema of GET /:entity/lineage. (#164)

    The list of all nodes (e.g. list[Node]) is split by node type and converted to a map (e.g. dict[str, Dataset], dict[str, Job]).

    The list of all relations (e.g. list[Relation]) is split by relation type (e.g. list[DatasetSymlink], list[Input]).

    Response examples

    Before:

    {
        "relations": [
            {
                "kind": "PARENT",
                "from": {"kind": "JOB", "id": 123},
                "to": {"kind": "RUN", "id": "00000000-0000-0000-0000-000000000000"},
            },
            {
                "kind": "SYMLINK",
                "from": {"kind": "DATASET", "id": 234},
                "to": {"kind": "DATASET", "id": 999},
            },
            {
                "kind": "INPUT",
                "from": {"kind": "DATASET", "id": 234},
                "to": {"kind": "OPERATION", "id": "11111111-1111-1111-1111-111111111111"},
            },
            {
                "kind": "OUTPUT",
                "from": {"kind": "OPERATION", "id": "11111111-1111-1111-1111-111111111111"},
                "to": {"kind": "DATASET", "id": 234},
            },
        ],
        "nodes": [
            {"kind": "DATASET", "id": 123, "name": "abc"},
            {"kind": "JOB", "id": 234, "name": "cde"},
            {
                "kind": "RUN",
                "id": "00000000-0000-0000-0000-000000000000",
                "external_id": "def",
            },
            {
                "kind": "OPERATION",
                "id": "11111111-1111-1111-1111-111111111111",
                "name": "efg",
            },
        ],
    }

    After:

    {
        "relations": {
            "parents": [
                {
                    "from": {"kind": "JOB", "id": "123"},
                    "to": {"kind": "RUN", "id": "00000000-0000-0000-0000-000000000000"},
                },
            ],
            "symlinks": [
                {
                    "from": {"kind": "DATASET", "id": "234"},
                    "to": {"kind": "DATASET", "id": "999"},
                },
            ],
            "inputs": [
                {
                    "from": {"kind": "DATASET", "id": "234"},
                    "to": {
                        "kind": "OPERATION",
                        "id": "11111111-1111-1111-1111-111111111111",
                    },
                },
            ],
            "outputs": [
                {
                    "from": {
                        "kind": "...
Read more