Commit 787e88c

Add information about the Merlin DAG
- Define the important terms of the DAG.
- Incorporate Karl's information.
- Karl's info about Operators and ColumnSelectors.
- Karl's info about Dataset.
1 parent 5388a1d commit 787e88c

15 files changed

+2837
-6
lines changed

docs/images/dataset_and_dataframe.svg

Lines changed: 1196 additions & 0 deletions

docs/images/graph_schema.svg

Lines changed: 410 additions & 0 deletions

docs/images/graph_simple.svg

Lines changed: 369 additions & 0 deletions

docs/images/parquet_and_dataset.svg

Lines changed: 555 additions & 0 deletions

docs/requirements-doc.txt

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,5 +18,4 @@ mergedeep<1.4
18 18
docker<5.1
19 19
PyGithub<1.56
20 20
semver>=2,<3
21-
pytest<7.3
22-
coverage<6.6
21+

docs/source/about-dag.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
1+
# About the Merlin Graph
2+
3+
```{contents}
4+
---
5+
depth: 2
6+
local: true
7+
backlinks: none
8+
---
9+
```
10+
11+
## Purpose of the Merlin Graph
12+
13+
Merlin uses a directed acyclic graph (DAG) to represent operations on data such as normalizing or clipping values and to represent operations in a recommender system such as creating an ensemble or filtering candidate items during inference.
14+
15+
Understanding the Merlin DAG is helpful if you want to develop your own Operator or build a recommender system with Merlin.
16+
17+
## Graph Terminology
18+
19+
node
20+
: A node in the DAG is a group of columns and at least one _operator_.
21+
The columns are specified with a _column selector_.
22+
A node has an _input schema_ and an _output schema_.
23+
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.
24+
25+
column selector
26+
: A column selector specifies the columns to select from a dataset using column names or _tags_.
27+
28+
operator
29+
: An operator performs a transformation on data and returns a new _node_.
30+
The data is identified by the _column selector_.
31+
Some simple operators like `+` and `-` add or remove columns.
32+
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.
33+
34+
schema
35+
: A Merlin schema is metadata that describes the columns in a dataset.
36+
Each column has its own schema that identifies the column name and can specify _tags_ and properties.
37+
38+
tag
39+
: A Merlin tag categorizes information about a column.
40+
Adding a tag to a column enables you to select columns for operations by tag rather than name.
41+
42+
For example, you can add the `CONTINUOUS` or `CATEGORICAL` tags to columns.
43+
Feature engineering Operators, modeling, and inference operations can use that information to operate accordingly on the dataset.
44+
45+
## Introduction to Operators, Columns, Nodes, and Schema
46+
47+
The NVTabular library uses Operators for feature engineering.
48+
One example of an NVTabular Operator is `Normalize`.
49+
The Operator standardizes continuous variables so that they have a mean of `0` and a standard deviation of `1`.
50+
51+
The Merlin Systems library uses Operators for building ensembles and performing inference.
52+
The library includes Operators such as `FilterCandidates` and `PredictTensorflow`.
53+
You use these Operators to put your models into production and serve recommendations.
54+
55+
Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
56+
The `>>` syntax means "take the output columns from the left-hand side and feed them as the input columns to the right-hand side."
57+
58+
You can specify an explicit list of column names for an Operator.
59+
The following code block shows the syntax for explicit column names:
60+
61+
```python
62+
result = ["col1", "col2"] >> SomeOperator(...)
63+
```
64+
65+
Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:
66+
67+
```python
68+
result = AnOperator(...) >> OtherOperator(...)
69+
```
70+
71+
Chaining Operators together builds a graph.
72+
The following figure shows how each node in the graph has an Operator.
73+
74+
![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)
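
How chaining with `>>` can produce a graph is easier to see in a small plain-Python sketch. The classes below (`ToyNode`, `ToySelection`, `ToyOperator`) are hypothetical stand-ins written for this illustration and are not Merlin's implementation; they only show how Python's `__rshift__` and `__rrshift__` hooks let a list of column names or an Operator start a chain of nodes.

```python
class ToyNode:
    """A stand-in for a graph node: one operator plus links to its parent nodes."""

    def __init__(self, op, parents=()):
        self.op = op
        self.parents = list(parents)

    def __rshift__(self, operator):
        # node >> operator: the new node consumes this node's output columns.
        return ToyNode(operator, parents=[self])

    def describe(self):
        # Walk back through the first parent of each node to list the chain in order.
        chain, node = [], self
        while node is not None:
            chain.append(node.op.label())
            node = node.parents[0] if node.parents else None
        return " >> ".join(reversed(chain))


class ToySelection:
    """Stand-in for a selection operator that picks columns by name."""

    def __init__(self, names):
        self.names = list(names)

    def label(self):
        return f"Selection{self.names}"


class ToyOperator:
    """Stand-in base class for an operator (hypothetical, for illustration only)."""

    def label(self):
        return type(self).__name__

    def __rrshift__(self, columns):
        # ["col1", "col2"] >> operator: Python falls back to this method because
        # list does not define >>. The list is wrapped in a selection node first.
        return ToyNode(ToySelection(columns)) >> self

    def __rshift__(self, operator):
        # operator >> operator: promote the left-hand operator to a node first.
        return ToyNode(self) >> operator


class SomeOperator(ToyOperator):
    pass


class OtherOperator(ToyOperator):
    pass


graph = ["col1", "col2"] >> SomeOperator() >> OtherOperator()
print(graph.describe())
# Selection['col1', 'col2'] >> SomeOperator >> OtherOperator
```

In this sketch, every `>>` returns a new node that remembers its parent, which is why a chain of Operators ends up as a graph rather than a single composed function.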
75+
76+
```{tip}
77+
After you build an NVTabular workflow or Merlin Systems transform workflow, you can visualize the graph and create an image like the preceding example by running the `graph` method.
78+
```
79+
80+
Each node in a graph has an input schema and an output schema that describe the input columns to the Operator and the output columns produced by the Operator.
81+
The following figure represents an Operator, `SomeOperator`, that adds `colB` to a dataset.
82+
83+
![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)
84+
85+
In practice, when Merlin first builds the graph, the workflow does not yet know which columns are inputs or outputs.
86+
This is for two reasons:
87+
88+
1. Merlin enables you to build graphs that process categories of columns.
89+
The categories are specified by _tags_ instead of an explicit list of column names.
90+
91+
For example, you can select the continuous columns from your dataset with code like the following example:
92+
93+
```python
94+
[Tags.CONTINUOUS] >> Operator(...)
95+
```
96+
97+
1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
98+
The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.
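
The two reasons above can be sketched in plain Python. In this sketch, `ToyTags`, `resolve`, and the dictionary-based schema are assumptions made for illustration and are not part of the Merlin API; the point is only that a tag-based selector cannot be turned into column names until a schema for a concrete dataset is available.

```python
from enum import Enum


class ToyTags(Enum):
    # Hypothetical stand-ins for Merlin tags.
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


def resolve(selector, schema):
    """Turn a selector (column names and/or tags) into concrete column names.

    `schema` maps each column name to the set of tags on that column.
    """
    selected = []
    for column, tags in schema.items():
        by_name = column in selector
        by_tag = any(tag in selector for tag in tags)
        if by_name or by_tag:
            selected.append(column)
    return selected


# The selector is written before any dataset exists.
selector = [ToyTags.CONTINUOUS]

# Only once a dataset, and therefore a schema, is available can it be resolved.
schema = {
    "user_id": {ToyTags.CATEGORICAL},
    "price": {ToyTags.CONTINUOUS},
    "age": {ToyTags.CONTINUOUS},
}
print(resolve(selector, schema))  # ['price', 'age']
```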
99+
100+
## Reference Documentation
101+
102+
- {py:class}`nvtabular.ops.Normalize`
103+
- {py:class}`nvtabular.workflow.workflow.Workflow`
104+
- {py:class}`merlin.systems.dag.ops.workflow.TransformWorkflow`
105+
- {py:class}`merlin.systems.dag.Ensemble`
106+
- {py:class}`merlin.systems.dag.ops.session_filter.FilterCandidates`
107+
- {py:class}`merlin.systems.dag.ops.tensorflow.PredictTensorflow`

docs/source/about-dataset.md

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
# About the Merlin Dataset
2+
3+
```{contents}
4+
---
5+
depth: 2
6+
local: true
7+
backlinks: none
8+
---
9+
```
10+
11+
## On-disk Representation
12+
13+
The Apache Parquet file format is the most frequently used file format for Merlin datasets.
14+
15+
Parquet is a columnar storage format.
16+
The format arranges the values for each column in a long list.
17+
This format is in contrast with a row-oriented format---such as a comma-separated values format---that arranges all the data for one row together.
18+
19+
As an analogy, columnar storage is like a dictionary of columns, whereas row-oriented storage is like a list of rows.
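
The dictionary-versus-list analogy can be written out directly in Python. The column names and values here are made up for illustration.

```python
# Row-oriented: a list of rows, each row holding every field for one record.
rows = [
    {"user_id": 1, "price": 9.99},
    {"user_id": 2, "price": 4.50},
    {"user_id": 3, "price": 12.00},
]

# Columnar: a dictionary of columns, each column holding one field for all records.
columns = {
    "user_id": [1, 2, 3],
    "price": [9.99, 4.50, 12.00],
}

# Reading one column touches every row in the row-oriented layout...
prices_from_rows = [row["price"] for row in rows]

# ...but is a single lookup in the columnar layout.
prices_from_columns = columns["price"]

assert prices_from_rows == prices_from_columns
```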
20+
21+
In most cases, a Parquet dataset includes multiple files in one or more directories.
22+
23+
![The Merlin dataset class can read a directory of Parquet files for data access.](../images/parquet_and_dataset.svg)
24+
25+
The Merlin dataset class, `merlin.io.Dataset`, treats a collection of many Parquet files as a single dataset.
26+
By treating the collection as a single dataset, Merlin simplifies distributing computation over multiple GPUs or multiple machines.
27+
28+
The dataset class is not a copy of the data or a modification of the Parquet files.
29+
An instance of the class is similar to a collection of pointers to the Parquet files.
30+
31+
When you create an instance of the dataset class, Merlin attempts to infer a schema by reading one record of the data.
32+
Merlin attempts to determine the column names and data types.
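
As a rough illustration of inferring column names and types from a single record, consider the following plain-Python sketch. The `infer_schema` helper is hypothetical and does not reflect how `merlin.io.Dataset` performs inference internally.

```python
def infer_schema(first_record):
    """Map each column name to the Python type name of its value in one record."""
    return {name: type(value).__name__ for name, value in first_record.items()}


# A made-up first record from a dataset.
record = {"user_id": 42, "price": 9.99, "category": "books"}
print(infer_schema(record))
# {'user_id': 'int', 'price': 'float', 'category': 'str'}
```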
33+
34+
## Processing Data: Dataset and DataFrame
35+
36+
When you perform a computation on a Merlin dataset, the dataset reads from the files on disk and converts them into a set of DataFrames.
37+
The DataFrames, like Parquet files, use a columnar storage format.
38+
The API for a DataFrame is similar to a Python dictionary---you can reference a column with syntax like `dataframe['col1']`.
39+
40+
![A Merlin dataset reads data from disk and becomes several DataFrames.](../images/dataset_and_dataframe.svg)
41+
42+
Merlin processes each DataFrame individually and aggregates the results across the DataFrames as needed.
43+
There are two kinds of computations that you can perform on a dataset: `fit` and `transform`.
44+
45+
The `fit` computations perform a full pass over the dataset to compute statistics, find unique values, perform grouping, or carry out any other operation that requires information from multiple DataFrames.
46+
47+
The `transform` computations process each DataFrame individually.
48+
These computations use the information gathered from `fit` to alter the DataFrame.
49+
For example, the `Normalize` and `Clip` Operators compute new values for columns, and the `Rename` Operator adds and removes columns.
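
The split between a full-pass `fit` and a per-DataFrame `transform` can be sketched without any Merlin dependency. In this sketch each DataFrame is modeled as a dictionary of columns, and `fit_mean` and `transform_center` are hypothetical helpers, not Merlin functions.

```python
# Toy dataset: two partitions, each a dictionary of columns.
partitions = [
    {"price": [2.0, 4.0]},
    {"price": [6.0, 8.0]},
]


def fit_mean(partitions, column):
    """Full pass: combine partial sums from every partition into one statistic."""
    total = sum(sum(part[column]) for part in partitions)
    count = sum(len(part[column]) for part in partitions)
    return total / count


def transform_center(partition, column, mean):
    """Per-partition step: use the fitted statistic to compute new values."""
    result = dict(partition)
    result[column] = [value - mean for value in partition[column]]
    return result


mean = fit_mean(partitions, "price")
centered = [transform_center(part, "price", mean) for part in partitions]
print(mean)      # 5.0
print(centered)  # [{'price': [-3.0, -1.0]}, {'price': [1.0, 3.0]}]
```

Note that `fit_mean` needs every partition to produce the correct answer, while `transform_center` only ever sees one partition at a time.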
50+
51+
More information about the `fit` and `transform` methods is provided in [](./about-operators.md).
52+
53+
## Reference Documentation
54+
55+
- {py:class}`merlin.io.Dataset`
56+
- {py:class}`nvtabular.ops.Normalize`
57+
- {py:class}`nvtabular.ops.Clip`
58+
- {py:class}`nvtabular.ops.Rename`

docs/source/about-model-blocks.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# About Merlin Model Blocks
2+
3+
FIXME

docs/source/about-operators.md

Lines changed: 89 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,89 @@
1+
# About Merlin Operators
2+
3+
```{contents}
4+
---
5+
depth: 2
6+
local: true
7+
backlinks: none
8+
---
9+
```
10+
11+
## Understanding Operators
12+
13+
Merlin uses Operators to perform computation on datasets such as normalizing continuous variables, bucketing continuous variables, clipping variables between minimum and maximum values, and so on.
14+
15+
An Operator implements two key methods:
16+
17+
Fit
18+
: The `fit` method performs any pre-computation steps that are required before modifying the data.
19+
20+
For example, the `Normalize` Operator standardizes the values of a continuous variable to a mean of 0 and a standard deviation of 1.
21+
The `fit` method computes the mean and standard deviation.
22+
23+
The method is optional.
24+
For example, the `Bucketize` and `Clip` Operators do not implement the method because you specify the bucket boundaries or the minimum and maximum values for clipping.
25+
These Operators do not need to access the data to perform any pre-computation steps.
26+
27+
Transform
28+
: The `transform` method applies the operation to the data, such as normalizing values, bucketing, or clipping.
29+
This method modifies the data.
30+
31+
Another difference between the two methods is that the `fit` method accepts a Merlin dataset object and the `transform` method accepts a DataFrame object.
32+
The difference is an implementation detail---the `fit` method must access all the data and the `transform` method processes each part of the dataset one at a time.
33+
34+
```{code-block} python
35+
---
36+
emphasize-lines: 5, 12
37+
---
38+
# Typical signature of a fit method.
39+
def fit(
40+
self,
41+
selector: ColumnSelector,
42+
dataset: Dataset
43+
) -> Any: ...
44+
45+
# Typical signature of a transform method.
46+
def transform(
47+
self,
48+
selector: ColumnSelector,
49+
df: DataFrame
50+
) -> DataFrame: ...
51+
```
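
To make the two signatures concrete, here is a minimal plain-Python sketch of one operator that needs no `fit` step and one that does. `ToyClip` and `ToyMinMaxScale` are hypothetical classes for illustration, not Merlin or NVTabular Operators; the selector is modeled as a list of column names, the dataset as a list of partitions, and each DataFrame as a dictionary of columns.

```python
class ToyClip:
    """No fit step: the bounds come from the constructor, not from the data."""

    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value

    def transform(self, selector, df):
        # `selector` is a plain list of column names; `df` is a dict of columns.
        out = dict(df)
        for name in selector:
            out[name] = [min(max(v, self.min_value), self.max_value) for v in df[name]]
        return out


class ToyMinMaxScale:
    """Needs a fit step: the minimum and maximum must be read from all the data."""

    def fit(self, selector, dataset):
        # `dataset` is a list of dict-of-columns partitions in this sketch.
        self.stats = {}
        for name in selector:
            values = [v for part in dataset for v in part[name]]
            self.stats[name] = (min(values), max(values))
        return self.stats

    def transform(self, selector, df):
        out = dict(df)
        for name in selector:
            low, high = self.stats[name]
            out[name] = [(v - low) / (high - low) for v in df[name]]
        return out


dataset = [{"price": [0.0, 5.0]}, {"price": [10.0, 20.0]}]

# Clipping works on a single partition with no prior pass over the data.
clipped = ToyClip(1.0, 12.0).transform(["price"], dataset[1])

# Scaling first gathers statistics from every partition, then transforms each one.
scaler = ToyMinMaxScale()
scaler.fit(["price"], dataset)
scaled = [scaler.transform(["price"], part) for part in dataset]
```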
52+
53+
## Operators and Columns: Column Selector
54+
55+
In most cases, you want an Operator to process a subset of the columns in your input dataset.
56+
Both the `fit` and `transform` methods have a `selector` argument that specifies the columns to operate on.
57+
Merlin uses a `ColumnSelector` class to represent the columns.
58+
59+
The simplest column selector is a list of strings that specify some column names.
60+
In the following sample code, `["col1", "col2"]` becomes an instance of a `ColumnSelector` class.
61+
62+
```python
63+
result = ["col1", "col2"] >> SomeOperator(...)
64+
```
65+
66+
Column selectors also offer a more powerful and flexible way to specify columns.
67+
You can specify the input columns to an Operator with tags.
68+
In the following sample code, the Operator processes all the continuous variables in a dataset.
69+
70+
```python
71+
result = [Tags.CONTINUOUS] >> SomeOperator(...)
72+
```
73+
74+
Using tags to create a column selector offers the following advantages:
75+
76+
- Enables you to apply several Operators to the same kind of columns, such as categorical or continuous variables.
77+
- Reduces code maintenance by enabling your code to automatically operate on newly added columns in a dataset.
78+
- Simplifies code by avoiding lists of strings for column names.
79+
80+
## How to Build an Operator
81+
82+
Blah.
83+
84+
## Reference Documentation
85+
86+
- {py:class}`merlin.dag.BaseOperator`
87+
- {py:class}`merlin.dag.ColumnSelector`
88+
- {py:class}`merlin.schema.Tags`
89+
- {py:class}`merlin.io.Dataset`

docs/source/about-schema.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# About the Merlin Schema
2+
3+
FIXME
