Proposal for a DJ materialization service #407
Replies: 8 comments 5 replies
-
Re: the materialization service calling the query service, I'm wondering if the materialization service should have direct connections to the engines the way the query service does, mainly because I think the SLAs for both of these services are so different that it might make sense to keep their logical paths separate. The materialization service being down for 3 hours may almost be a non-event that users don't even notice, whereas the query service being down for 1 hour is probably a huge event (all dashboards, reports, BI tools, etc. stop working).

Overall this design makes sense to me, except for the part where multiple tables with different columns and different data are attached to a single node, although I think that's a separate feature that has implications for how we materialize.
-
I'm not sure I understand your point. I think we both agree that the query service needs a high SLA and the materialization service a lower one. If the materialization service uses the query service it should be fine then; it can go down for a few hours and no one will care. As long as we have a low SLA service depending on a high SLA one it should be fine, no?
That part is orthogonal to the materialization service; I just put it here to reinforce the fact that nodes should have multiple materialization attributes.
-
Is there a reason why the Druid call needs to be a separate plugin? I think it can live as a part of the OSS djqs as well, because in my view the "query service" is not just limited to running SQL, but acts more like a data access layer. So DJ can just make a call to the query service that tells it to ingest to Druid, with the fields necessary to generate the Druid ingestion spec, and the query service can generate the full spec and hit the Druid cluster, assuming one is configured.
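To make that concrete, here is a minimal sketch of what such a call could look like. The endpoint path, the payload fields, and the idea that djqs exposes an ingestion call at all are assumptions for illustration, not an existing API.

```python
# Hypothetical sketch: DJ asking the query service (djqs) to ingest a node's
# data into Druid. The endpoint and payload shape are assumptions only.
import requests

payload = {
    "node": "B",                                  # node to send to Druid
    "source_table": "analytics.b_materialized",  # table holding the node's data
    "timestamp_column": "ds",                     # for Druid's timestampSpec
    "dimensions": ["foo", "bar"],                 # columns exposed as dimensions
    "metrics": [{"type": "longSum", "name": "cnt", "fieldName": "cnt"}],
}

# djqs would expand this into a full Druid ingestion spec and submit it to the
# Druid cluster configured for the catalog, if one is configured.
response = requests.post("http://djqs:8001/druid/ingestions", json=payload, timeout=30)
response.raise_for_status()
```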
-
I was thinking more about how the materialization service and the main DJ service will have to compete for the query service's resources. In a scenario where the query service is under high load, we may end up going down the path of trying to do things like adding a priority queue to separate the requests within the query service, and I guess I'm just wondering if we should separate them altogether from the start.
-
I've been thinking about this since our discussion in the last sync. I agree that we should address this now rather than later for users that are dealing with fragmented data. I sort of see 3 user scenarios and potential solutions listed below in order of increasing complexity. @betodealmeida let me know if these scenarios are aligned with how you view this problem space.
Tackling scenario 2 next feels like the right thing to do because it's aligned with where the industry is heading. Most organizations have a data lake strategy where they ingest data from all of these disparate sources into a data lake, usually OSS table formats in cloud storage, and then choose from the myriad of OSS compute engines depending on the individual use case. The organizations that don't have this setup are in the process of migrating towards it. Scenario 3 has the most challenging solution and represents the complexity that probably moved most orgs towards the data lake strategy in the first place. I think it's important that we don't understate the additional complexity that trying to solve for it would add to the DJ codebase. I mentioned in the sync that I have a strong suspicion that if we successfully deliver scenario 2, people will actually find that a preferable alternative to more complex runtime analysis.
-
@betodealmeida @samredai @shangyian I read through all your comments about the materialization service functionality vs. the query service functionality in light of competition for resources. One thing that I feel hasn't been mentioned clearly yet (sorry if I missed it somewhere) is that the query service's main data access mode is Read, while the materialization service's main data access mode is Write. With that in mind I'd like to propose the following:
The above solution would by default separate the read and write access paths, even though some small parts of the implementation would still go through the same DA layer in the end. But even then, the DJ admin would be able to see the Read and Write calls coming from two separate services, which should be easier to manage.
-
+1 to splitting the functionality between the services along read/write access paths. I think we can have two materialization config options: (1) using a pre-configured ... With the latter materialization option, we would expect this flow:
Note that steps (a) and (b) could be combined into a single endpoint in DJ.
-
Some additional thoughts on when/how we should materialize. In the initial version of the materialization service, I think we should materialize all transform, dimension, and cube nodes, assuming that these nodes have materialization attributes set.

Option "Always Wait for Upstreams": The DJ core service will return nodes with their immediate parents, which lets us determine which nodes to materialize first. With this DAG of nodes, the materialization service would start by finding all nodes with only source-node upstreams and kick off the materializations for these first. Then, when each node materialization finishes, it would find that node's immediate child nodes, determine if they're ready to be materialized (only true if all of their immediate parents have finished materialization), and kick them off if they are.

Option "Materialize Straight Away": In this strategy, the DJ core service just returns a list of nodes. The materialization service will materialize each node without looking at its parents, meaning that each materialization query will include the transform SQL of each of its parent nodes rather than using the materialized parent node table. This setup is more wasteful, but the logic is very simple. It is also the case that if we have many transform nodes operating on "small" data, the extra overhead of waiting for upstreams so that we can reuse materialized datasets is unnecessary.

My vote is for "Always Wait for Upstreams" as that seems more robust.
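To illustrate the "Always Wait for Upstreams" option, here is a minimal sketch of the ordering logic, assuming the DJ core service hands the scheduler each node with its immediate (non-source) parents; the function names and data shapes are made up for illustration.

```python
# Sketch of "Always Wait for Upstreams" ordering (illustrative only).
# `nodes` maps node name -> set of immediate non-source parent names, and
# `materialize(name)` submits and waits for one node's materialization.
from collections import deque

def materialize_in_order(nodes: dict[str, set[str]], materialize) -> None:
    remaining = {name: set(parents) for name, parents in nodes.items()}
    children: dict[str, set[str]] = {}
    for name, parents in nodes.items():
        for parent in parents:
            children.setdefault(parent, set()).add(name)

    # Nodes whose upstreams are all source nodes can be materialized right away.
    ready = deque(name for name, parents in remaining.items() if not parents)
    finished = set()
    while ready:
        name = ready.popleft()
        materialize(name)  # e.g. submit the CTAS to the query service and wait
        finished.add(name)
        for child in children.get(name, ()):
            remaining[child].discard(name)
            if not remaining[child]:  # all immediate parents are materialized
                ready.append(child)

    if len(finished) != len(nodes):
        raise RuntimeError("some nodes were never ready (cycle or missing parent)")

# Example: D is a hypothetical transform built on top of B, which only has
# source-node upstreams; B is materialized first, then D.
materialize_in_order({"B": set(), "D": {"B"}}, materialize=print)
```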
-
Summary
This is a proposal for a materialization service, responsible for reading declarative metadata in non-source nodes and running all the necessary queries to make the metadata true. Note that the term "service" here is used not for an HTTP service, but a scheduler worker that periodically runs tasks; for this reason "service" and "scheduler" are used interchangeably to refer to the materialization service.
Motivation
There are 2 main use cases for this service:

1. Materializing slow nodes so that queries against them can read from pre-computed tables instead of recomputing their transforms.
2. Creating and managing tables in a data warehouse, similar to how `dbt` works.

Example
Imagine a DAG with simple source, transform, and metric nodes. The source node has 2 tables in different catalogs.
(Note: it would've been nice if we had standardized on colors and shapes for node types! I'll ask Kim about it!)
When computing metric `C`, a query is built that inlines the SQL of transform `B` as an inner select over `some_table`. (Note that `some_other_table` can't be used because we filter on `bar`.)

Now maybe `some_table` doesn't have an index on the `bar` column, so the inner select in the metric query is slow. A user might want to materialize `B`. They would do this by adding materialization attributes to the node, in pseudo-syntax (note that here nodes can have multiple representations).

The materialization service would poll the DJ API for all nodes with a `materialization` attribute. For each node it would fetch a "materialization SQL"; for node `B` this is a statement that creates the target table from `B`'s query, which the materialization service can then run in the query service.
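Since the original SQL snippets aren't reproduced above, here is a minimal sketch of the kind of statement the service might submit, shown as a Python string as the scheduler might hold it. The target table name, the exact columns, and `B`'s transform logic are all assumptions for illustration.

```python
# Illustrative only: DJ would generate the real "materialization SQL" for node B
# from its transform definition. Table/column names below follow the example
# (some_table, bar); the target table name and transform body are made up.
MATERIALIZE_B = """
CREATE TABLE IF NOT EXISTS analytics.b_materialized AS
SELECT foo, bar            -- columns exposed by transform B (assumed)
FROM some_table            -- B's physical source table from the example
WHERE bar IS NOT NULL      -- placeholder for B's actual transform logic
"""
```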
Now the DAG includes a physical table backing `B`, and queries for metric `C` are faster because they can read from the materialized table instead of re-running the inner select against `some_table`.

The second use case would be for creating and managing tables in a data warehouse, similar to how `dbt` works. This way all the information about how tables are generated would live in a single system, DJ, and migrating to different databases would be relatively easy.

Architecture
Goals
Discovery
The first step in materializing nodes is discovering which nodes need to be materialized. This could be done either via the GraphQL API or via the REST API. For the REST API we could add a new endpoint, `/nodes/searches`, where the service would post a search for nodes that have materialization attributes.

Ideally this would return a list of nodes with the materialization attributes in the payload, so that additional requests are no longer needed. For this reason GraphQL might be a better interface for the native scheduler, since it's probably much simpler.
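A minimal sketch of what the discovery call might look like; the payload shape for `/nodes/searches` and the GraphQL field names are assumptions, since neither is spelled out above.

```python
# Illustrative discovery calls (shapes are assumptions, not an existing API).
import requests

DJ_URL = "http://dj:8000"  # assumed address of the DJ core service

# REST variant: post a search to the proposed /nodes/searches endpoint asking
# for nodes that carry a materialization attribute.
search = {"filters": {"has_attribute": "materialization"}}
rest_nodes = requests.post(f"{DJ_URL}/nodes/searches", json=search, timeout=30).json()

# GraphQL variant: fetch the nodes and their materialization attributes in a
# single request, avoiding follow-up calls per node.
query = """
{
  nodes(attribute: "materialization") {
    name
    materialization { schema_ table timeColumn }
  }
}
"""
gql_nodes = requests.post(f"{DJ_URL}/graphql", json={"query": query}, timeout=30).json()
```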
Execution
Once the service knows the query for each node, it can use the query service to submit the materialization query. The response from the query service should be atomic and idempotent, i.e., it shouldn't be possible for the `CREATE TABLE ...` to run but for the `SELECT ...` to fail (the service should roll back), and the result should be the same if the query is submitted multiple times. This allows the materialization service to handle retries on its own.

Advanced
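Because the submission is assumed to be idempotent, the scheduler's retry logic can stay very simple. A sketch is below; the `submit_materialization` helper and its signature are made up for illustration.

```python
# Illustrative retry loop. Because submitting the same materialization query
# twice is assumed to yield the same result (idempotency), the scheduler can
# retry on failure without risking partial or duplicated tables.
import time

def run_with_retries(submit_materialization, node_name: str, sql: str,
                     attempts: int = 3, backoff_seconds: float = 60.0) -> None:
    for attempt in range(1, attempts + 1):
        try:
            submit_materialization(node_name, sql)  # call into the query service
            return
        except Exception:  # in practice, catch the query-service error type
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
```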
Incremental materialization
For many nodes we'll want to perform incremental materialization. One way of doing that is allowing users to specify a special temporal column in the `materialization` attribute. When a time column is present, the materialization query would take it into consideration, restricting each run to the time range being processed.
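A minimal sketch of how the scheduler might turn a time column into an incremental query; the attribute fields and the INSERT OVERWRITE dialect are assumptions.

```python
# Illustrative: building an incremental materialization statement from a node's
# materialization attribute. Attribute fields and SQL dialect are assumptions.
from datetime import date

def incremental_sql(materialization: dict, node_sql: str, ds: date) -> str:
    table = materialization["table"]              # e.g. "analytics.b_materialized"
    time_column = materialization["time_column"]  # e.g. "ds"
    # Replace only the partition for `ds`, so re-runs and backfills stay idempotent.
    return (
        f"INSERT OVERWRITE TABLE {table} PARTITION ({time_column} = '{ds}')\n"
        f"SELECT * FROM ({node_sql}) AS node\n"
        f"WHERE node.{time_column} = '{ds}'"
    )

# Example usage with made-up values:
print(incremental_sql(
    {"table": "analytics.b_materialized", "time_column": "ds"},
    "SELECT foo, bar, ds FROM some_table",
    date(2023, 1, 1),
))
```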
If the upstream table is partitioned on the time column the scheduler could leverage that, allowing the initial materialization (or backfills) to be run in parallel for different partitions.
Non-standard materialization
Some use cases might require tables to be materialized via methods other than SQL. For example, Netflix might have a tool to send data to Druid, since currently Druid only supports `SELECT` statements. When materialization cannot be done via SQL, the node attribute would specify a custom plugin. The service would then find the `NetflixDruidMaterialization` plugin via an entry point, as sketched below.

Note that a drawback of this approach is that it requires the schedulers to be written in Python, so they have access to the plugin.
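A minimal sketch of how the scheduler could load such a plugin through a Python entry point; the entry-point group name `dj.materialization_plugins` and the plugin interface are assumptions.

```python
# Illustrative plugin discovery via Python entry points. The group name and the
# plugin interface are assumptions; the proposal only says "via an entry point".
from importlib.metadata import entry_points

def load_materialization_plugin(name: str):
    """Return the plugin class registered under the given entry-point name."""
    for ep in entry_points(group="dj.materialization_plugins"):
        if ep.name == name:
            return ep.load()
    raise ValueError(f"No materialization plugin registered as {name!r}")

# The plugin package would declare something like this in its pyproject.toml:
#
#   [project.entry-points."dj.materialization_plugins"]
#   NetflixDruidMaterialization = "netflix_dj_plugins.druid:NetflixDruidMaterialization"
#
# Usage (assuming such a plugin package is installed):
#   plugin_cls = load_materialization_plugin("NetflixDruidMaterialization")
#   plugin_cls().materialize(node)   # assumed plugin interface
```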
Open questions
Schema changes
What should we do for schema changes? The easy solution would be to drop and recreate the table when the schema changes, but that might be expensive and might result in data loss if someone is editing a node and accidentally drops a table. The main concerns here are:

- Adding a column could be done via `ALTER TABLE`, but the column would need to be backfilled, so it might be easier to create a new table instead.
- For a removed column, instead of `DROP COLUMN` we could keep the column and fill it with `NULL` when the materialization runs after the column is deleted. Since the deleted column won't be referenced in any queries it shouldn't be a problem to keep it.

Who is responsible for detecting schema changes? Should (1) the materialization service compare the schema of existing tables and the nodes being materialized? Or should (2) the DJ service compare the schemas of different versions of a given node when building the SQL?

The benefit of the second approach is that the materialization SQL produced could contain the `ALTER TABLE` statements needed to adjust the schema. But it would require the DJ service to keep track of the version currently materialized and the version being materialized, which IMHO should be the responsibility of the materialization service.

The first approach might be easier, and the materialization service can use the reflection service to get the schema of an existing table, so it can be compared with the schema of the node being materialized.
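A minimal sketch of what approach (1) could look like inside the materialization service, assuming the reflection service returns column name/type pairs; the function name and return shapes are assumptions.

```python
# Illustrative schema comparison for approach (1). The reflection-service call
# and the shape of its response are assumptions made for this sketch.
def diff_schemas(existing: dict[str, str], desired: dict[str, str]) -> dict:
    """Compare {column: type} mappings for the existing table and the node."""
    return {
        "added": {c: t for c, t in desired.items() if c not in existing},
        "removed": {c: t for c, t in existing.items() if c not in desired},
        "retyped": {
            c: (existing[c], desired[c])
            for c in existing.keys() & desired.keys()
            if existing[c] != desired[c]
        },
    }

# Example: a column was added and another removed since the last materialization.
print(diff_schemas(
    {"foo": "INT", "bar": "VARCHAR"},
    {"foo": "INT", "baz": "TIMESTAMP"},
))
# -> {'added': {'baz': 'TIMESTAMP'}, 'removed': {'bar': 'VARCHAR'}, 'retyped': {}}
```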