Releases: MobileTeleSystems/data-rentgen
0.4.5 (2025-12-24)
Improvements
- Allow disabling `SessionMiddleware`, as it is only required by `KeycloakAuthProvider`.
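As a minimal sketch of what this enables (assuming a FastAPI app and a hypothetical settings object; the actual wiring in DataRentgen may differ):

```python
from dataclasses import dataclass

from fastapi import FastAPI
from starlette.middleware.sessions import SessionMiddleware


@dataclass
class SessionSettings:
    # Hypothetical stand-in for the real server.session settings model.
    enabled: bool = False  # matches the server.session.enabled default from 0.4.3
    secret_key: str = "change-me"


settings = SessionSettings()
app = FastAPI()
if settings.enabled:
    # Server-side sessions are needed only when KeycloakAuthProvider is used.
    app.add_middleware(SessionMiddleware, secret_key=settings.secret_key)
```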
0.4.4 (2025-11-21)
Bug Fixes
- Fix inputs with 0 bytes statistics, which were broken by the 0.4.3 release.
0.4.3 (2025-11-21)
Features
- Disable `server.session.enabled` by default. It is required only by `KeycloakAuthProvider`, which is not used by default.
Bug Fixes
- Escape unprintable ASCII symbols in SQL queries before storing them in Postgres. Previously, saving queries containing the `\x00` symbol led to exceptions.
- Kafka topic with malformed messages doesn't have to use the same number of partitions as input topics.
- Prevent OpenLineage from reporting events which claim to read 8 exabytes of data; this is actually a Spark quirk.
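As a minimal sketch of the escaping idea (the real implementation may differ), unprintable characters can be replaced with their escaped textual representation before the query reaches Postgres, which rejects `\x00` in `TEXT` values:

```python
import re

# Non-printable ASCII: control characters except \t, \n, \r, plus DEL.
UNPRINTABLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def escape_unprintable(sql: str) -> str:
    # Replace each character with its literal escape, e.g. the text "\x00".
    return UNPRINTABLE.sub(lambda m: f"\\x{ord(m.group()):02x}", sql)


assert escape_unprintable("SELECT 'a\x00b'") == "SELECT 'a\\x00b'"
```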
0.4.2 (2025-10-29)
Bug Fixes
- Fix search query filter on UI Run list page.
- Fix passing multiple filters to `GET /v1/runs`.
Doc-only Changes
- Document the `DATA_RENTGEN__UI__AUTH_PROVIDER` config variable.
0.4.1 (2025-10-08)
Features
- Add new `GET /v1/locations/types` endpoint returning a list of all known location types. (#328)
- Add new filter to `GET /v1/jobs` (#328):
  - `location_type: list[str]`
- Add new filter to `GET /v1/datasets` (#328):
  - `location_type: list[str]`
- Allow passing multiple `location_type` filters to `GET /v1/locations`. (#328)
- Allow passing multiple values to `GET` endpoints with filters like `job_id`, `parent_run_id`, and so on (see the sketch below). (#329)
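For example, multiple values can be passed as repeated query parameters. A hedged sketch with `requests` (the host, port and filter values are illustrative, not documented here):

```python
import requests

# requests encodes list values as repeated parameters, i.e.
# GET /v1/runs?job_id=1&job_id=2&status=FAILED&status=KILLED
response = requests.get(
    "http://localhost:8000/v1/runs",
    params={"job_id": [1, 2], "status": ["FAILED", "KILLED"]},
)
response.raise_for_status()
print(response.json())
```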
0.4.0 (2025-10-03)
Features
- Introduce new `http2kafka` component. (#281)
  It allows using DataRentgen with OpenLineage `HttpTransport`. Authentication is done using personal tokens; a hedged usage sketch follows below.
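For illustration, a minimal sketch of posting an OpenLineage event to `http2kafka` with Python's `requests`. The host, port and endpoint path are assumptions (OpenLineage's HTTP transport conventionally posts to `/api/v1/lineage`), and the Bearer header is a guess at how personal tokens are presented:

```python
import requests

# A minimal OpenLineage RunEvent; real events carry run/job facets.
event = {
    "eventType": "START",
    "eventTime": "2025-10-03T00:00:00Z",
    "producer": "https://example.com/my-producer",
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
    "job": {"namespace": "spark://localhost", "name": "my_job"},
}

response = requests.post(
    "http://localhost:8002/api/v1/lineage",  # assumed http2kafka address
    json=event,
    headers={"Authorization": "Bearer <personal-token>"},
)
response.raise_for_status()
```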
- Add REST API endpoints for managing personal tokens. (#276)
  List of endpoints:
  - `GET /personal-tokens` - get personal tokens for the current user.
  - `POST /personal-tokens` - create a new personal token for the current user.
  - `PATCH /personal-tokens/:id` - refresh a personal token (revoke the token and create a new one).
  - `DELETE /personal-tokens/:id` - revoke a personal token.
- Add new entities `Tag` and `TagValue`. (#268)
  Tags can be used as additional properties for other entities. This feature is still under construction.
- Added endpoint `GET /v1/tags`. (#289)
  Tag names and values can be paginated, searched, or fetched by ids.
  Response example:
  ```
  [
    {
      "id": 1,
      "name": "env",
      "values": [
        { "id": 1, "value": "dev" },
        { "id": 2, "value": "prod" }
      ]
    }
  ]
  ```
- Updated `GET /v1/datasets` to include `tags: [...]` in response. (#289)
  Dataset response examples.
  Before:
  ```
  {
    "id": "8400",
    "location": {...},
    "name": "dataset_name",
    "schema": {},
  }
  ```
  After:
  ```
  {
    "id": "25896",
    "location": {...},
    "name": "dataset_name",
    "schema": {...},
    "tags": [  # <---
      {
        "id": "1",
        "name": "environment",
        "values": [
          { "id": "2", "value": "production" }
        ]
      },
      {
        "id": "2",
        "name": "team",
        "values": [
          { "id": "4", "value": "my_awesome_team" }
        ]
      }
    ]
  }
  ```
- Added new filters to `GET /v1/datasets` endpoint. (#294, #289)
  Query params:
  - `location_id: int`
  - `tag_value_id: list[int]` - if multiple values are passed, the dataset should have all of them.
- Added new filters for `GET /v1/jobs` endpoint. (#319)
  Query params:
  - `location_id: int`
  - `job_type: list[str]`
- Added new filters to `GET /v1/runs` endpoint. (#322, #323)
  Query params:
  - `job_type: list[str]`
  - `status: list[RunStatus]`
  - `started_since: datetime | None`
  - `started_until: datetime | None`
  - `ended_since: datetime | None`
  - `ended_until: datetime | None`
  - `job_location_id: int | None`
  - `started_by_user: list[str] | None`
- Added new endpoint `GET /v1/jobs/types`. (#319)
- Add custom `dataRentgen_run` and `dataRentgen_operation` facets. (#265)
  These facets allow to:
  - Pass custom `external_id`, `persistent_log_url` and other fields of Run.
  - Pass custom `name`, `description`, `group`, `position` fields of Operation.
  - Mark an event as containing only Operation data, or both Run + Operation data.
  An illustrative payload is sketched below.
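The following Python dict sketches what these facets might look like inside an OpenLineage event. All values are made up, and the exact field set is defined by the facet schemas from #265:

```python
# Hypothetical example: custom DataRentgen facets attached to a run.
run_facets = {
    "dataRentgen_run": {
        "external_id": "application_1695024000_0001",  # made-up value
        "persistent_log_url": "https://history-server/application_1695024000_0001",
    },
    "dataRentgen_operation": {
        "name": "insert_into_target_table",
        "description": "Daily increment load",
        "group": "dml",
        "position": 1,
    },
}
```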
- Set `output.type` based on the executed SQL query, e.g. `INSERT`, `UPDATE`, `DELETE`, and so on. (#310)
Improvements
- Improve consumer performance by reducing DB load on reading operations. (#314)
- Add a workaround for OpenLineage emitting Spark application events with `job.name=unknown`. (#263)
  This requires installing OpenLineage with this fix merged: OpenLineage/OpenLineage#3848.
- Dataset symlinks with no inputs/outputs are no longer removed from the lineage graph. (#269)
- Make matching for addresses and locations more deterministic by converting them to lowercase. (#313)
  Items `oracle://host:1521` and `ORACLE://HOST:1521` are the same item `oracle://host:1521` now.
- Make matching for datasets, jobs, tags and user names case-insensitive by using unique indexes on the `lower(name)` expression. (#313)
  Items `database.schema.table` and `DATABASE.SCHEMA.TABLE` are the same item now.
  As the canonical dataset name depends on the database naming convention (`UPPERCASE` for Oracle, `lowercase` for Postgres), we can't convert names into one specific case (upper or lower). Instead, we use the first received value as the canonical one. A sketch of the indexing technique follows below.
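A sketch of the technique with SQLAlchemy (not the project's actual model; the table and index names are assumptions):

```python
from sqlalchemy import Index, Integer, String, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Dataset(Base):
    __tablename__ = "dataset"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String, nullable=False)


# Unique index on lower(name): "DATABASE.SCHEMA.TABLE" and
# "database.schema.table" now collide, while the stored (canonical)
# name keeps the case of the first received value.
Index("uq_dataset_name_lower", func.lower(Dataset.name), unique=True)
```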
Bug Fixes
- For lineage with `granularity=DATASET`, return the real lineage graph. (#264)
  v0.3.x resolved lineage by `run_id`, but this could produce wrong lineage. v0.4.x now resolves lineage by `operation_id`.
- Exclude self-referencing lineage edges in case of `granularity=DATASET`. (#261)
  If some run uses the same table as both input and output (e.g. merging duplicates or performing some checks before writing), DataRentgen excludes `dataset1 -> dataset1` relations from lineage.
  This doesn't affect chains like `dataset1 -> job1 -> dataset1` or `dataset1 -> dataset2 -> dataset1`.
0.3.1 (2025-07-04)
Breaking changes
- Drop `Dataset.format` field.
Improvements
- Added syntax highlighting for SQL queries.
0.3.0 (2025-07-04)
Features
- Improved support for `openlineage-airflow`. (#210)
  Before, we tracked only DAG and Task start/stop events, but not lineage. Now we store lineage produced by Airflow operators like `SQLExecuteQueryOperator`.
- Added support for `openlineage-flink`. (#214)
- Added support for `openlineage-hive`. (#245)
- Added support for `openlineage-dbt`. (#223)
- Add `DATASET` granularity for `GET /api/datasets/lineage`. (#235)
- Store SQL queries received from OpenLineage integrations. (#213, #218)
Breaking changes
- Change `Output.type` in `GET /api/*/lineage` response from `Enum` to `List[Enum]`. (#222)
  Response examples.
  Before:
  ```
  {
    "nodes": {...},
    "relations": {
      "outputs": [
        {
          "from": {"kind": "JOB", "id": 3981},
          "to": {"kind": "DATASET", "id": 8400},
          "types": "OVERWRITE",  # <---
          ...
        }
      ]
    },
  }
  ```
  After:
  ```
  {
    "nodes": {...},
    "relations": {
      "outputs": [
        {
          "from": {"kind": "JOB", "id": 3981},
          "to": {"kind": "DATASET", "id": 8400},
          "types": ["OVERWRITE", "DROP", "TRUNCATE"],  # <---
          ...
        }
      ]
    },
  }
  ```
- Moved `Input.schema` and `Output.schema` to `Dataset.schema` in `GET /api/*/lineage` response. (#249)
  We use the output schema, if any, then fall back to the input schema.
  Response examples.
  Before:
  ```
  {
    "nodes": {
      "datasets": {
        "8400": {
          "id": "8400",
          "location": {...},
          "name": "dataset_name",
          ...
        }
      }
    },
    "relations": {
      "outputs": [
        {
          "from": {"kind": "JOB", "id": 3981},
          "to": {"kind": "DATASET", "id": 8400},
          "types": "OVERWRITE",
          "schema": {  # <---
            "id": "10062",
            "fields": [ ... ],
            "relevance_type": "EXACT_MATCH"
          }
        }
      ]
    },
  }
  ```
  After:
  ```
  {
    "nodes": {
      "datasets": {
        "8400": {
          "id": "25896",
          "location": {...},
          "name": "dataset_name",
          "schema": {  # <---
            "id": "10062",
            "fields": [...],
            "relevance_type": "EXACT_MATCH"
          }
        }
      }
      ...
    },
    "relations": {
      "outputs": [
        {
          "from": {"kind": "JOB", "id": 3981},
          "to": {"kind": "DATASET", "id": 8400},
          "types": ["OVERWRITE", "DROP", "TRUNCATE"],
        }
      ]
    },
  }
  ```
Improvements
- Added `cleanup_partitions.py` script to automate the cleanup of old table partitions. (#254)
- Added `data_rentgen.db.seed` script which creates example data in the database. (#257)
- Speedup fetching `Run` and `Operation` from database by id. (#247)
- Speedup consuming OpenLineage events from Kafka. (#236)
- Make consumer message parsing more robust. (#204)
  Previously, malformed OpenLineage events (JSON) led to skipping the entire message batch read from Kafka. Now messages are parsed separately, and malformed ones are sent back to the `input.runs__malformed` Kafka topic.
- Improve storing lineage data for long-running operations. (#253)
  Previously, if an operation was running for a long time (more than a day; Flink streaming jobs can easily run for months or years), and the lineage graph was built for the last day, there was no Flink job/run/operation in the graph.
  This is because we created input/output/column lineage at operation start, and `RUNNING` events of the same operation (checkpoints) were just updating the statistics of the same row.
  Now we create a new input/output/column lineage row for checkpoint events as well, but only one row for each hour since the operation was started, as an increasing number of rows slows down lineage graph resolution.
  For short-lived operations (most batch operations take less than an hour) the behavior remains unchanged.
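A sketch of the hourly bucketing idea (not the actual implementation; names are illustrative):

```python
from datetime import datetime, timedelta


def hourly_bucket(operation_started_at: datetime, event_time: datetime) -> int:
    # All checkpoint events falling into the same whole hour since the
    # operation started update the same input/output row; each new hour
    # starts a new row.
    return (event_time - operation_started_at) // timedelta(hours=1)


# Example: events at +10 and +50 minutes share bucket 0, +70 minutes gets bucket 1.
start = datetime(2025, 7, 4, 12, 0)
assert hourly_bucket(start, start + timedelta(minutes=50)) == 0
assert hourly_bucket(start, start + timedelta(minutes=70)) == 1
```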
Bug Fixes
- Fix Airflow 3.x DAG and Task URL template. (#227)
0.2.1 (2025-04-07)
Improvements
- Reduce image size by 2x.
- Change Docker image user from `root` to `data-rentgen`, to improve security.
- SBOM file is generated on release.
0.2.0 (2025-03-25)
TL;DR
- Implemented column lineage support.
- HDFS/S3 partitions are now truncated from table path.
- Added total run/operation statistics (input/output bytes, rows, files).
- Lineage graph UX improvements.
- Kafka -> consumer integrations improvements.
Breaking Changes
- Change response schema of `GET /operations`. (#158)
  Operation properties are moved to the `data` key, and a new `statistics` key is added. This allows showing operation statistics in UI without building up a lineage graph.
  Response examples.
  Before:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "kind": "OPERATION",
        "id": "00000000-0000-0000-0000-000000000000",
        "name": "abc",
        "description": "some",
        // ...
      }
    ],
  }
  ```
  After:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "id": "00000000-0000-0000-0000-000000000000",
        "data": {
          "id": "00000000-0000-0000-0000-000000000000",
          "name": "abc",
          "description": "some",
          // ...
        },
        "statistics": {
          "inputs": {
            "total_datasets": 2,
            "total_bytes": 123456,
            "total_rows": 100,
            "total_files": 0,
          },
          "outputs": {
            "total_datasets": 2,
            "total_bytes": 123456,
            "total_rows": 100,
            "total_files": 0,
          },
        },
      }
    ],
  }
  ```
- Change response schema of `GET /runs`. (#159)
  Run properties are moved to the `data` key, and a new `statistics` key is added. This allows showing run statistics in UI without building up a lineage graph.
  Response examples.
  Before:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "kind": "RUN",
        "id": "00000000-0000-0000-0000-000000000000",
        "external_id": "abc",
        "description": "some",
        // ...
      }
    ],
  }
  ```
  After:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "id": "00000000-0000-0000-0000-000000000000",
        "data": {
          "id": "00000000-0000-0000-0000-000000000000",
          "external_id": "abc",
          "description": "some",
          // ...
        },
        "statistics": {
          "inputs": {
            "total_datasets": 2,
            "total_bytes": 123456,
            "total_rows": 100,
            "total_files": 0,
          },
          "outputs": {
            "total_datasets": 2,
            "total_bytes": 123456,
            "total_rows": 100,
            "total_files": 0,
          },
          "operations": {
            "total_operations": 10,
          },
        },
      }
    ],
  }
  ```
Change response schema of
GET /locations. (160)Location properties are moved to
datakey, added newstatisticskey. This allows to show location statistics in UI.Response examples
Before:
{ "meta": { // ... }, "items": [ { "kind": "LOCATION", "id": 123, "name": "rnd_dwh", "type": "hdfs", // ... } ], }to:
{ "meta": { // ... }, "items": [ { "id": "123", "data": { "id": "123", "name": "rnd_dwh", "type": "hdfs", // ... }, "statistics": { "datasets": {"total_datasets": 2}, "jobs": {"total_jobs": 0}, }, } ], }Same for
PATCH /locations/:id:Response examples
Before:
{ "kind": "LOCATION", "id": 123, "name": "abc", // ... }after:
{ "id": "123", "data": { "id": "123", "name": "abc", // ... }, "statistics": { "datasets": {"total_datasets": 2}, "jobs": {"total_jobs": 0}, }, } -
- Change response schema of `GET /datasets`. (#161)
  Dataset properties are moved to the `data` key. This makes the API response more consistent with others (e.g. `GET /runs`, `GET /operations`).
  Response examples.
  Before:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "kind": "DATASET",
        "id": 123,
        "name": "abc",
        // ...
      }
    ],
  }
  ```
  After:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "id": "123",
        "data": {
          "id": "123",
          "name": "abc",
          // ...
        },
      }
    ],
  }
  ```
- Change response schema of `GET /jobs`. (#162)
  Job properties are moved to the `data` key. This makes the API response more consistent with others (e.g. `GET /runs`, `GET /operations`).
  Response examples.
  Before:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "kind": "JOB",
        "id": 123,
        "name": "abc",
        // ...
      }
    ],
  }
  ```
  After:
  ```
  {
    "meta": { // ... },
    "items": [
      {
        "id": "123",
        "data": {
          "id": "123",
          "name": "abc",
          // ...
        },
      }
    ],
  }
  ```
Change response schema of
GET /:entity/lineage. (164)List of all nodes (e.g.
list[Node]) is split by node type, and converted to map (e.g.dict[str, Dataset],dict[str, Job]).List of all relations (e.g.
list[Relation]) is split by relation type (e.g.list[DatasetSymlink],list[Input]).Response examples
Before:
{ "relations": [ { "kind": "PARENT", "from": {"kind": "JOB", "id": 123}, "to": {"kind": "RUN", "id": "00000000-0000-0000-0000-000000000000"}, }, { "kind": "SYMLINK", "from": {"kind": "DATASET", "id": 234}, "to": {"kind": "DATASET", "id": 999}, }, { "kind": "INPUT", "from": {"kind": "DATASET", "id": 234}, "to": {"kind": "OPERATION", "id": "11111111-1111-1111-1111-111111111111"}, }, { "kind": "OUTPUT", "from": {"kind": "OPERATION", "id": "11111111-1111-1111-1111-111111111111"}, "to": {"kind": "DATASET", "id": 234}, }, ], "nodes": [ {"kind": "DATASET", "id": 123, "name": "abc"}, {"kind": "JOB", "id": 234, "name": "cde"}, { "kind": "RUN", "id": "00000000-0000-0000-0000-000000000000", "external_id": "def", }, { "kind": "OPERATION", "id": "11111111-1111-1111-1111-111111111111", "name": "efg", }, ], }after:
{ "relations": { "parents": [ { "from": {"kind": "JOB", "id": "123"}, "to": {"kind": "RUN", "id": "00000000-0000-0000-0000-000000000000"}, }, ], "symlinks": [ { "from": {"kind": "DATASET", "id": "234"}, "to": {"kind": "DATASET", "id": "999"}, }, ], "inputs": [ { "from": {"kind": "DATASET", "id": "234"}, "to": { "kind": "OPERATION", "id": "11111111-1111-1111-1111-111111111111", }, }, ], "outputs": [ { "from": { "kind": "...