| 1 | +# About the Merlin Directed Acyclic Graph |
| 2 | + |
| 3 | +```{contents} |
| 4 | +--- |
| 5 | +depth: 2 |
| 6 | +local: true |
| 7 | +backlinks: none |
| 8 | +--- |
| 9 | +``` |
| 10 | + |
| 11 | +Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and operations in a recommender system, such as creating an ensemble or filtering candidate items during inference. |
| 12 | + |
| 13 | +Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin. |
| 14 | + |
| 15 | +## Graph Terminology |
| 16 | + |
| 17 | +node |
| 18 | +: A node in the DAG is a group of columns and at least one _operator_. |
| 19 | + The columns are specified with a _column selector_. |
| 20 | + A node has an _input schema_ and an _output schema_. |
| 21 | + Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset. |
| 22 | + |
| 23 | +column selector |
| 24 | +: A column selector specifies the columns to select from a dataset using column names or _tags_. |
| 25 | + |
| 26 | +operator |
| 27 | +: An operator performs a transformation on data and returns a new _node_. |
| 28 | + The data is identified by the _column selector_. |
| 29 | + Some simple operators like `+` and `-` add or remove columns. |
| 30 | + More complex operations are applied by shifting the operators onto the column selector with the `>>` notation. |
| 31 | + |
| 32 | +schema |
| 33 | +: A Merlin schema is metadata that describes the columns in a dataset. |
| 34 | + Each column has its own schema that identifies the column name and can specify _tags_ and properties. |
| 35 | + |
| 36 | +tag |
| 37 | +: A Merlin tag categorizes information about a column. |
| 38 | + Adding a tag to a column enables you to select columns for operations by tag rather than name. |
| 39 | + |
| 40 | + For example, you can add the `USER` and `ITEM` tags to columns. |
| 41 | + Modeling and inference operations can use that information to act accordingly on the dataset. |
| 42 | + |
| 43 | +## Understanding Operators, Columns, Nodes, and Schema |
| 44 | + |
| 45 | +Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows. |
| 46 | +The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side." |
| 47 | + |
| 48 | +To run an Operator on specific columns only, specify an explicit list of column names. |
| 49 | +The following code block shows the syntax for explicit column names: |
| 50 | + |
| 51 | +```python |
| 52 | +result = ["col1", "col2"] >> SomeOperator(...) |
| 53 | +``` |
| 54 | + |
| 55 | +Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator: |
| 56 | + |
| 57 | +```python |
| 58 | +result = AnOperator(...) >> OtherOperator(...) |
| 59 | +``` |
| 60 | + |
| 61 | +Chaining Operators together builds a graph. |
| 62 | +The following figure shows how each node in the graph has an Operator. |
| 63 | + |
| 64 | + |
| 65 | + |
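To make the chaining idea concrete, the following is a minimal, self-contained sketch of how a `>>` syntax can build a graph in plain Python. It is a toy model and not Merlin's implementation: the `Node` and `Operator` classes and the `describe` helper are hypothetical names introduced only for this illustration. The sketch relies on Python's `__rshift__` and `__rrshift__` hooks, so that a list of column names, an operator, or an existing node can each appear on the left-hand side.

```python
class Node:
    """Toy graph node: holds either selected columns or an operator, plus its parents."""

    def __init__(self, op=None, columns=(), parents=()):
        self.op = op
        self.columns = list(columns)
        self.parents = list(parents)

    def __rshift__(self, operator):
        # node >> operator: create a new node that consumes this node's output.
        return Node(op=operator, parents=[self])


class Operator:
    """Toy operator (hypothetical); only carries a name for display."""

    def __init__(self, name):
        self.name = name

    def __rshift__(self, other):
        # operator >> operator: wrap the left operator in a node first.
        return Node(op=self) >> other

    def __rrshift__(self, columns):
        # ["col1", "col2"] >> operator: lists do not define >>,
        # so Python falls back to the right operand's __rrshift__.
        return Node(columns=columns) >> self


def describe(node):
    """Return the labels along the chain, from the first node to the last."""
    chain = []
    while node is not None:
        chain.append(node.op.name if node.op else f"select{node.columns}")
        node = node.parents[0] if node.parents else None
    return list(reversed(chain))


graph = ["col1", "col2"] >> Operator("A") >> Operator("B")
print(describe(graph))
```

Each `>>` returns a new node that points back at the node on its left, which is why a chain of operators ends up as a graph rather than as a single object.
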
| 66 | +Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator. |
| 67 | +The following figure represents an Operator that adds `colB` to a dataset. |
| 68 | + |
| 69 | + |
| 70 | + |
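The relationship between an input schema and an output schema can be sketched with plain dictionaries. This is an illustration under stated assumptions, not Merlin's schema API: the `add_column` function is hypothetical, a dictionary that maps column names to dtype strings stands in for a schema, and the dtypes are made up for the example.

```python
def add_column(input_schema, name, dtype):
    """Return an output schema: every input column plus one new column.

    Schemas here are plain dicts of column name -> dtype string (an assumption
    made for this sketch). The input schema is left unmodified.
    """
    if name in input_schema:
        raise ValueError(f"column {name!r} already exists")
    return {**input_schema, name: dtype}


input_schema = {"colA": "int64"}
output_schema = add_column(input_schema, "colB", "float32")

print("input: ", list(input_schema))
print("output:", list(output_schema))
```
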
| 71 | +In practice, when Merlin builds the graph, the workflow does not yet know which columns are processed or produced. |
| 72 | +This is for two reasons: |
| 73 | + |
| 74 | +1. Merlin enables you to build graphs that process categories of columns. |
| 75 | + The categories are specified by _tags_ instead of an explicit list of column names. |
| 76 | + |
| 77 | + For example, you can select the continuous columns from your dataset with code like the following example: |
| 78 | + |
| 79 | + ```python |
| 80 | + [Tags.CONTINUOUS] >> Operator(...) |
| 81 | + ``` |
| 82 | + |
| 83 | +1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset. |
| 84 | + The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names. |
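
The two reasons above can be sketched as a selector that stores only a tag and looks up column names later. This is a simplified model, not Merlin's classes: `TagSelector`, its `resolve` method, and the example schemas (dictionaries that map a column name to a set of tag strings) are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagSelector:
    """Toy selector that remembers a tag and resolves column names on demand."""

    tag: str

    def resolve(self, schema):
        # schema: dict of column name -> set of tags (an assumption of this sketch).
        return [name for name, tags in schema.items() if self.tag in tags]


# The selector is created before any dataset is known.
selector = TagSelector("continuous")

# Two different datasets, each described by its own schema.
schema_a = {"price": {"continuous"}, "user_id": {"categorical", "user"}}
schema_b = {
    "age": {"continuous"},
    "height": {"continuous"},
    "city": {"categorical"},
}

# The same selector yields different columns depending on the dataset.
columns_a = selector.resolve(schema_a)
columns_b = selector.resolve(schema_b)
print(columns_a, columns_b)
```

Because the selector holds only the tag, the same graph definition can be reused with datasets that have different column names.
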