| 1 | +# About the Merlin Graph
| 2 | + |
| 3 | +```{contents} |
| 4 | +--- |
| 5 | +depth: 2 |
| 6 | +local: true |
| 7 | +backlinks: none |
| 8 | +--- |
| 9 | +``` |
| 10 | + |
| 11 | +## Purpose of the Merlin Graph |
| 12 | + |
| 13 | +Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as normalizing or clipping values, and operations in a recommender system, such as creating an ensemble or filtering candidate items during inference.
| 14 | + |
| 15 | +Understanding the Merlin DAG is helpful if you want to develop your own Operator or build a recommender system with Merlin.
| 16 | + |
| 17 | +## Graph Terminology |
| 18 | + |
| 19 | +node |
| 20 | +: A node in the DAG is a group of columns and at least one _operator_. |
| 21 | + The columns are specified with a _column selector_. |
| 22 | + A node has an _input schema_ and an _output schema_. |
| 23 | + Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset. |
| 24 | + |
| 25 | +column selector |
| 26 | +: A column selector specifies the columns to select from a dataset using column names or _tags_. |
| 27 | + |
| 28 | +operator |
| 29 | +: An operator performs a transformation on data and returns a new _node_.
| 30 | + The data is identified by the _column selector_. |
| 31 | + Some simple operators like `+` and `-` add or remove columns. |
| 32 | + More complex operations are applied by shifting the operators onto the column selector with the `>>` notation. |
| 33 | + |
| 34 | +schema |
| 35 | +: A Merlin schema is metadata that describes the columns in a dataset. |
| 36 | + Each column has its own schema that identifies the column name and can specify _tags_ and properties. |
| 37 | + |
| 38 | +tag |
| 39 | +: A Merlin tag categorizes information about a column. |
| 40 | + Adding a tag to a column enables you to select columns for operations by tag rather than name. |
| 41 | + |
| 42 | + For example, you can add the `CONTINUOUS` or `CATEGORICAL` tags to columns. |
| 43 | +  Feature engineering Operators, models, and inference operations can then use the tags to decide how to treat each column in the dataset.
| 44 | + |
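To make the terminology above concrete, the following is a minimal sketch in plain Python of a schema whose columns carry tags, plus a helper that selects columns by name or by tag. The `ColumnSchema` dataclass, the `select` function, and the lowercase tag strings are hypothetical stand-ins for illustration only; they are not the Merlin classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ColumnSchema:
    """Hypothetical per-column metadata: a name plus a set of tags."""

    name: str
    tags: frozenset = field(default_factory=frozenset)


def select(schema, names=(), tags=()):
    """Return the names of columns matched by name or by at least one tag."""
    wanted_names, wanted_tags = set(names), set(tags)
    return [
        col.name
        for col in schema
        if col.name in wanted_names or wanted_tags & col.tags
    ]


# A toy schema: a list of column schemas (an assumption of this sketch).
schema = [
    ColumnSchema("price", frozenset({"continuous"})),
    ColumnSchema("age", frozenset({"continuous"})),
    ColumnSchema("item_id", frozenset({"categorical"})),
]

continuous_columns = select(schema, tags=["continuous"])
named_columns = select(schema, names=["item_id"])
```

Selecting by tag means the calling code never has to spell out `price` or `age`, which is the property that lets a graph be written once and reused across datasets.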
| 45 | +## Introduction to Operators, Columns, Nodes, and Schema |
| 46 | + |
| 47 | +The NVTabular library uses Operators for feature engineering. |
| 48 | +One example of an NVTabular Operator is `Normalize`. |
| 49 | +The Operator standardizes continuous variables so that each has a mean of `0` and a standard deviation of `1`.
| 50 | + |
| 51 | +The Merlin Systems library uses Operators for building ensembles and performing inference. |
| 52 | +The library includes Operators such as `FilterCandidates` and `PredictTensorflow`. |
| 53 | +You use these Operators to put your models into production and serve recommendations. |
| 54 | + |
| 55 | +Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows. |
| 56 | +The `>>` syntax means "take the output columns from the left-hand side and feed them as the input columns to the right-hand side." |
| 57 | + |
| 58 | +You can specify an explicit list of column names for an Operator.
| 59 | +The following code block shows the syntax for explicit column names: |
| 60 | + |
| 61 | +```python |
| 62 | +result = ["col1", "col2",] >> SomeOperator(...) |
| 63 | +``` |
| 64 | + |
| 65 | +Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator: |
| 66 | + |
| 67 | +```python |
| 68 | +result = AnOperator(...) >> OtherOperator(...) |
| 69 | +``` |
| 70 | + |
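The `>>` notation is ordinary Python operator overloading. The sketch below shows one way such chaining could build a graph, using the `__rshift__` and `__rrshift__` special methods. The `Node` and `Operator` classes here are hypothetical toys written for this illustration and do not reproduce the Merlin implementation.

```python
class Node:
    """Hypothetical graph node: a label plus links to its parent nodes."""

    def __init__(self, label, parents=()):
        self.label = label
        self.parents = list(parents)

    def __rshift__(self, other):
        # node >> operator: wrap the operator in a new node fed by this node.
        if isinstance(other, Operator):
            return Node(other.name, parents=[self])
        return NotImplemented

    def lineage(self):
        # Follow the first parent of each node and return labels in run order.
        chain, node = [], self
        while node is not None:
            chain.append(node.label)
            node = node.parents[0] if node.parents else None
        return list(reversed(chain))


class Operator:
    """Hypothetical operator that only carries a name."""

    def __init__(self, name):
        self.name = name

    def __rrshift__(self, other):
        # list >> operator: lists do not define >>, so Python calls this
        # reflected method on the right-hand operand instead.
        if isinstance(other, (list, tuple)):
            selection = Node("select(" + ", ".join(other) + ")")
            return selection >> self
        return NotImplemented

    def __rshift__(self, other):
        # operator >> operator: promote the left operator to a node first.
        if isinstance(other, Operator):
            return Node(self.name) >> other
        return NotImplemented


graph = ["col1", "col2"] >> Operator("normalize") >> Operator("clip")
pair = Operator("a") >> Operator("b")
```

Because `>>` is left-associative, the list is combined with the first operator before the second operator is attached, so `graph.lineage()` walks from the column selection through each operator in order.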
| 71 | +Chaining Operators together builds a graph. |
| 72 | +The following figure shows how each node in the graph has an Operator. |
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | +```{tip} |
| 77 | +After you build an NVTabular workflow or Merlin Systems transform workflow, you can visualize the graph and create an image like the preceding example by running the `graph` method. |
| 78 | +``` |
| 79 | + |
| 80 | +Each node in a graph has an input schema and an output schema that describe the input columns to the Operator and the output columns produced by the Operator. |
| 81 | +The following figure represents an Operator, `SomeOperator`, that adds `colB` to a dataset. |
| 82 | + |
| 83 | + |
| 84 | + |
| 85 | +In practice, when Merlin first builds the graph, the workflow does not yet know which columns are inputs and which are outputs.
| 86 | +This is for two reasons: |
| 87 | + |
| 88 | +1. Merlin enables you to build graphs that process categories of columns. |
| 89 | + The categories are specified by _tags_ instead of an explicit list of column names. |
| 90 | + |
| 91 | + For example, you can select the continuous columns from your dataset with code like the following example: |
| 92 | + |
| 93 | + ```python |
| 94 | + [Tags.CONTINUOUS] >> Operator(...) |
| 95 | + ``` |
| 96 | + |
| 97 | +1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset. |
| 98 | + The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names. |
| 99 | + |
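The two reasons above amount to deferred resolution: a node can be created from a tag alone, and its input and output columns are filled in only when a dataset's column metadata becomes available. The following is a minimal sketch of that idea. The `TagSelection` class, its `resolve` method, the `_scaled` suffix, and the dictionary that maps column names to tags are all assumptions made for illustration, not Merlin APIs.

```python
class TagSelection:
    """Hypothetical node that knows a tag but no column names yet."""

    def __init__(self, tag, suffix):
        self.tag = tag
        self.suffix = suffix
        # Both schemas stay unresolved until a dataset is seen.
        self.input_columns = None
        self.output_columns = None

    def resolve(self, dataset_tags):
        # dataset_tags maps column name -> set of tags for one dataset.
        self.input_columns = [
            name for name, tags in dataset_tags.items() if self.tag in tags
        ]
        # This toy operator emits one renamed column per input column.
        self.output_columns = [name + self.suffix for name in self.input_columns]
        return self


# Build the node before any dataset exists.
node = TagSelection("continuous", "_scaled")
unresolved = node.input_columns

# Later, column metadata from a dataset resolves the schemas.
node.resolve({"price": {"continuous"}, "item_id": {"categorical"}})
```

The same node could be resolved against a different dataset and would pick up whichever columns carry the tag there.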
| 100 | +## Reference Documentation |
| 101 | + |
| 102 | +- {py:class}`nvtabular.ops.Normalize` |
| 103 | +- {py:class}`nvtabular.workflow.workflow.Workflow` |
| 104 | +- {py:class}`merlin.systems.dag.ops.workflow.TransformWorkflow` |
| 105 | +- {py:class}`merlin.systems.dag.Ensemble` |
| 106 | +- {py:class}`merlin.systems.dag.ops.session_filter.FilterCandidates` |
| 106 | +- {py:class}`merlin.systems.dag.ops.tensorflow.PredictTensorflow`