Commit 1f4e38a

Add information about the Merlin DAG
- Define the important terms of the DAG.
- Incorporate Karl's information.
1 parent c2f9519 commit 1f4e38a

12 files changed

+920
-5
lines changed

docs/images/graph_schema.svg

Lines changed: 410 additions & 0 deletions

docs/images/graph_simple.svg

Lines changed: 369 additions & 0 deletions

docs/requirements-doc.txt

Lines changed: 1 addition & 2 deletions
@@ -18,5 +18,4 @@ mergedeep<1.4
 docker<5.1
 PyGithub<1.56
 semver>=2,<3
-pytest<7.3
-coverage<6.6
+

docs/source/about-dag.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# About the Merlin Directed Acyclic Graph

```{contents}
---
depth: 2
local: true
backlinks: none
---
```

Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and operations in a recommender system, such as creating an ensemble or filtering candidate items during inference.

Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.

## Graph Terminology

node
: A node in the DAG is a group of columns and at least one _operator_.
The columns are specified with a _column selector_.
A node has an _input schema_ and an _output schema_.
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.

column selector
: A column selector specifies the columns to select from a dataset using column names or _tags_.

operator
: An operator performs a transformation on data and returns a new _node_.
The data is identified by the _column selector_.
Some simple operators like `+` and `-` add or remove columns.
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.

schema
: A Merlin schema is metadata that describes the columns in a dataset.
Each column has its own schema that identifies the column name and can specify _tags_ and properties.

tag
: A Merlin tag categorizes information about a column.
Adding a tag to a column enables you to select columns for operations by tag rather than name.

For example, you can add the `USER` and `ITEM` tags to columns.
Modeling and inference operations can use that information to act accordingly on the dataset.
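
As a minimal sketch of how selecting columns by tag could work, the following self-contained example models a schema as a plain dictionary that maps column names to sets of tags. The `Tag` enum, the `schema` dictionary, and the `select_by_tag` helper are hypothetical names used only for illustration; they are not the Merlin API.

```python
from enum import Enum


class Tag(Enum):
    """Hypothetical tags, standing in for tags such as USER and ITEM."""

    USER = "user"
    ITEM = "item"


# Toy schema: each column name maps to the set of tags attached to it.
schema = {
    "user_id": {Tag.USER},
    "user_age": {Tag.USER},
    "item_id": {Tag.ITEM},
}


def select_by_tag(schema, tag):
    """Return the names of the columns that carry the given tag."""
    return [name for name, tags in schema.items() if tag in tags]


print(select_by_tag(schema, Tag.USER))  # ['user_id', 'user_age']
print(select_by_tag(schema, Tag.ITEM))  # ['item_id']
```

The selection depends only on the tags, so renaming a column does not change the code that selects it.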

## Understanding Operators, Columns, Nodes, and Schema

Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side."

You can specify an explicit list of columns to run an Operator on just the specified columns.
The following code block shows the syntax for explicit column names:

```python
result = ["col1", "col2"] >> SomeOperator(...)
```

Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:

```python
result = AnOperator(...) >> OtherOperator(...)
```
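
To make the `>>` chaining concrete, here is a toy sketch of how a Python library could support both forms shown above by overloading the right-shift operator. The `Node` and `Operator` classes and the `chain` method are assumptions invented for this illustration; they only mimic the idea and are not Merlin's implementation.

```python
class Node:
    """Toy graph node: a column selection or an operator, plus a parent link."""

    def __init__(self, columns=None, op=None, parent=None):
        self.columns = list(columns or [])
        self.op = op
        self.parent = parent

    def __rshift__(self, op):
        # node >> op: feed this node's output into a new node that holds `op`.
        return Node(op=op, parent=self)

    def chain(self):
        """Return the node labels from the root of the graph to this node."""
        if self.op is not None:
            label = type(self.op).__name__
        else:
            label = f"Selection{self.columns}"
        return (self.parent.chain() if self.parent else []) + [label]


class Operator:
    """Toy operator base class (hypothetical)."""

    def __rrshift__(self, columns):
        # ["col1", "col2"] >> op: list does not implement `>>`, so Python
        # falls back to the right-hand operand's reflected method.
        return Node(columns=columns) >> self

    def __rshift__(self, other):
        # op >> other_op: wrap the left operator in a node first.
        return Node(op=self) >> other


class SomeOperator(Operator):
    pass


class OtherOperator(Operator):
    pass


result = ["col1", "col2"] >> SomeOperator() >> OtherOperator()
print(result.chain())
```

Each `>>` creates one new node that points back to its parent, which is why a chain of Operators naturally forms a directed graph.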

Chaining Operators together builds a graph.
The following figure shows how each node in the graph has an Operator.

![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)

Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator.
The following figure represents an Operator that adds `colB` to a dataset.

![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)
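
The relationship between an input schema and an output schema can be sketched in a few lines. In this simplified example a schema is just a list of column names, and `AddColumn` with its `output_schema_for` method is a hypothetical operator made up for the illustration, not a Merlin class.

```python
class AddColumn:
    """Hypothetical operator that appends one new column to its input."""

    def __init__(self, name):
        self.name = name

    def output_schema_for(self, input_schema):
        # The output keeps every input column and adds the new one.
        # A new list is returned so the input schema is left unchanged.
        return input_schema + [self.name]


input_schema = ["colA"]
output_schema = AddColumn("colB").output_schema_for(input_schema)

print(input_schema)   # ['colA']
print(output_schema)  # ['colA', 'colB']
```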

In practice, when Merlin builds the graph, the workflow does not yet know which columns are processed or produced.
This is for two reasons:

1. Merlin enables you to build graphs that process categories of columns.
   The categories are specified by _tags_ instead of an explicit list of column names.

   For example, you can select the continuous columns from your dataset with code like the following example:

   ```python
   [Tags.CONTINUOUS] >> Operator(...)
   ```

2. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
   The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.
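
The delayed resolution described by these two reasons can be illustrated with a small sketch. The `TagSelector` class, its `resolve` method, and the two toy schemas are assumptions made for this example only; the point is that the selector is created before any dataset exists and yields different column names for different datasets.

```python
class TagSelector:
    """Hypothetical selector that stores a tag and resolves column names later."""

    def __init__(self, tag):
        self.tag = tag

    def resolve(self, dataset_schema):
        # Column names become known only when a dataset schema is supplied.
        return [name for name, tags in dataset_schema.items() if self.tag in tags]


# Built up front, before any dataset is available.
selector = TagSelector("continuous")

# Two toy dataset schemas: column name -> set of tags.
schema_a = {"price": {"continuous"}, "city": {"categorical"}}
schema_b = {"age": {"continuous"}, "income": {"continuous"}}

print(selector.resolve(schema_a))  # ['price']
print(selector.resolve(schema_b))  # ['age', 'income']
```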

docs/source/about-model-blocks.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# About Merlin Model Blocks

FIXME

docs/source/about-operators.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# About Merlin Operators

## How to Build an Operator

FIXME

docs/source/about-schema.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# About the Merlin Schema

FIXME

docs/source/conf.py

Lines changed: 8 additions & 0 deletions
@@ -118,6 +118,14 @@
 
 autosummary_generate = True
 
+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3", None),
+    "merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
+    "merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
+    "merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
+    "NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
+}
+
 copydirs_additional_dirs = ["../../examples/", "../../README.md"]
 
 copydirs_file_rename = {

docs/source/technical-concepts.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Merlin Technical Concepts

The following pages provide a deeper technical understanding of Merlin concepts.
These concepts can help you to develop your own operator to implement a more sophisticated recommender system.

docs/source/toc.yaml

Lines changed: 8 additions & 0 deletions
@@ -46,5 +46,13 @@ subtrees:
   title: Deploy the HugeCTR Model with Triton
 - file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
   title: Deploy the TensorFlow Model with Triton
+- title: Merlin Technical Concepts
+  file: technical-concepts.md
+  entries:
+    - file: about-dag.md
+      title: Graph Concepts
+    - file: about-schema.md
+    - file: about-operators.md
+    - file: about-model-blocks.md
 - file: containers.rst
 - file: support_matrix/index.rst
