Consider updating API design for clarity

Consider the following:

```
ds = Dataset.from_list(samples)

skills_flow = SynthGroundedSkillsFlow(client, "mixtral", teacher_model).get_flow()
skills_pipe = Pipeline(skills_flow)

sdg = SDG([skills_pipe])
gen_data = sdg.generate(ds)
```

or:

```
ds = Dataset.from_list(samples)

mmlu_flow = MMLUBenchFlow(client, teacher_model).get_flow()
mmlu_pipe = Pipeline(mmlu_flow)
knowledge_flow = SynthKnowledgeFlow(client, teacher_model).get_flow()
knowledge_pipe = Pipeline(knowledge_flow)

sdg = SDG([mmlu_pipe, knowledge_pipe])
gen_data = sdg.generate(ds)
```

Consider the nouns:

- **Dataset** - this is [from Hugging Face's datasets library](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)
- **Block** - not shown in the code above, but required to understand a flow - a block provides a `generate()` method that transforms an input dataset and returns an output dataset
- **Block config** - a description of how to instantiate and invoke a block
- **Flow** - a class which describes how to render a sequence of block configs from a template
- **Pipeline** - a pipeline is created from a sequence of block configs, and provides a `generate()` method in which it instantiates and invokes blocks in turn, passing the input dataset and collecting the output
- **SDG** - an SDG is created from a list of pipelines, and its `generate()` method calls pipelines in turn
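To make the relationship between these nouns concrete, here is a minimal sketch of a block, a block config, and a pipeline. It is not the library's implementation: a plain list of row dicts stands in for `Dataset`, the two block classes are hypothetical, and the `block_type` / `block_kwargs` key names are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Stand-in for a Hugging Face Dataset: a plain list of row dicts (assumption).
Rows = List[Dict[str, Any]]


class UppercaseBlock:
    """Hypothetical block: upper-cases one column of every row."""

    def __init__(self, column: str) -> None:
        self.column = column

    def generate(self, rows: Rows) -> Rows:
        return [{**row, self.column: row[self.column].upper()} for row in rows]


class DropShortBlock:
    """Hypothetical block: drops rows whose column is shorter than min_len."""

    def __init__(self, column: str, min_len: int) -> None:
        self.column = column
        self.min_len = min_len

    def generate(self, rows: Rows) -> Rows:
        return [row for row in rows if len(row[self.column]) >= self.min_len]


class Pipeline:
    """Instantiates and invokes blocks in turn from a sequence of block configs."""

    def __init__(self, block_configs: List[Dict[str, Any]]) -> None:
        self.block_configs = list(block_configs)

    def generate(self, rows: Rows) -> Rows:
        for config in self.block_configs:
            # A block config says which block to build and with which arguments.
            block = config["block_type"](**config["block_kwargs"])
            rows = block.generate(rows)
        return rows


# Key names in the block configs are assumed, not taken from the real API.
block_configs = [
    {"block_type": UppercaseBlock, "block_kwargs": {"column": "text"}},
    {"block_type": DropShortBlock, "block_kwargs": {"column": "text", "min_len": 3}},
]

samples = [{"text": "hello"}, {"text": "hi"}]
gen_data = Pipeline(block_configs).generate(samples)
print(gen_data)
```

In this sketch a flow would simply be whatever produces the `block_configs` list.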


Proposals:

1. Remove `SDG` - we don't need both `SDG` and `Pipeline` since `Pipeline` can already do everything `SDG` can do
2. Model `Flow` as a block config template - it would be clearer if we reinforced the idea that a "flow" is a template for a sequence of block configs. A `render()` method makes sense to me, along with an extensible `params` object for the common case of instantiating multiple flows
3. Create a pipeline from a sequence of flows - add a `Pipeline.from_flows()` convenience class method that knows how to render block configs from a sequence of flows
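Proposal 1 rests on the claim that chaining pipelines through `SDG` gives nothing that a single `Pipeline` over the concatenated block configs does not. A rough sketch of that equivalence follows. It assumes `SDG.generate()` feeds each pipeline's output into the next one; if `SDG` combines results some other way, the comparison would need adjusting. The block class, the block config key names, and the list-of-dicts dataset are all hypothetical stand-ins.

```python
from typing import Any, Dict, List

Rows = List[Dict[str, Any]]  # stand-in for a Hugging Face Dataset (assumption)


class AddColumnBlock:
    """Hypothetical block: adds a constant column to every row."""

    def __init__(self, name: str, value: Any) -> None:
        self.name = name
        self.value = value

    def generate(self, rows: Rows) -> Rows:
        return [{**row, self.name: self.value} for row in rows]


class Pipeline:
    def __init__(self, block_configs: List[Dict[str, Any]]) -> None:
        self.block_configs = list(block_configs)

    def generate(self, rows: Rows) -> Rows:
        for config in self.block_configs:
            block = config["block_type"](**config["block_kwargs"])
            rows = block.generate(rows)
        return rows


class SDG:
    """Assumed behaviour: each pipeline's output is the next pipeline's input."""

    def __init__(self, pipelines: List[Pipeline]) -> None:
        self.pipelines = pipelines

    def generate(self, rows: Rows) -> Rows:
        for pipeline in self.pipelines:
            rows = pipeline.generate(rows)
        return rows


configs_a = [{"block_type": AddColumnBlock, "block_kwargs": {"name": "a", "value": 1}}]
configs_b = [{"block_type": AddColumnBlock, "block_kwargs": {"name": "b", "value": 2}}]
ds = [{"id": 1}]

# Two pipelines wrapped in an SDG versus one pipeline over all block configs.
via_sdg = SDG([Pipeline(configs_a), Pipeline(configs_b)]).generate(ds)
via_pipeline = Pipeline(configs_a + configs_b).generate(ds)
print(via_sdg == via_pipeline)
```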

So we could have, e.g.:

```
ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = SynthGroundedSkillsFlow(flow_params).render()

skills_pipe = Pipeline(block_configs)

gen_data = skills_pipe.generate(ds)
```

or:

```
ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

knowledge_pipe = Pipeline(block_configs)

gen_data = knowledge_pipe.generate(ds)
```

or:

```
ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

knowledge_pipe = Pipeline.from_flows([MMLUBenchFlow, SynthKnowledgeFlow], flow_params)

gen_data = knowledge_pipe.generate(ds)
```
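For reference, one way `Flow.render()`, `FlowParams`, and `Pipeline.from_flows()` could fit together is sketched below. This is only an illustration of the proposal, not existing code: the `FlowParams` field names, the two toy flow classes, the `TagBlock` block, and the block config key names are all assumptions, and a list of row dicts stands in for `Dataset`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Type

Rows = List[Dict[str, Any]]  # stand-in for a Hugging Face Dataset (assumption)


@dataclass
class FlowParams:
    """Shared parameters handed to every flow (field names are assumptions)."""

    client: Any
    model_family: str
    model_id: str


class TagBlock:
    """Hypothetical block: stamps every row with a label column."""

    def __init__(self, label: str, model_id: str) -> None:
        self.label = label
        self.model_id = model_id

    def generate(self, rows: Rows) -> Rows:
        return [{**row, self.label: self.model_id} for row in rows]


class Flow:
    """A template that renders a sequence of block configs from its params."""

    def __init__(self, params: FlowParams) -> None:
        self.params = params

    def render(self) -> List[Dict[str, Any]]:
        raise NotImplementedError


class ToyMMLUBenchFlow(Flow):
    """Toy stand-in, not the real MMLUBenchFlow."""

    def render(self) -> List[Dict[str, Any]]:
        kwargs = {"label": "mmlu", "model_id": self.params.model_id}
        return [{"block_type": TagBlock, "block_kwargs": kwargs}]


class ToySynthKnowledgeFlow(Flow):
    """Toy stand-in, not the real SynthKnowledgeFlow."""

    def render(self) -> List[Dict[str, Any]]:
        kwargs = {"label": "knowledge", "model_id": self.params.model_id}
        return [{"block_type": TagBlock, "block_kwargs": kwargs}]


class Pipeline:
    def __init__(self, block_configs: List[Dict[str, Any]]) -> None:
        self.block_configs = list(block_configs)

    @classmethod
    def from_flows(cls, flow_types: Sequence[Type[Flow]], params: FlowParams) -> "Pipeline":
        # Render each flow with the shared params and concatenate the results.
        block_configs: List[Dict[str, Any]] = []
        for flow_type in flow_types:
            block_configs.extend(flow_type(params).render())
        return cls(block_configs)

    def generate(self, rows: Rows) -> Rows:
        for config in self.block_configs:
            block = config["block_type"](**config["block_kwargs"])
            rows = block.generate(rows)
        return rows


flow_params = FlowParams(client=None, model_family="mixtral", model_id="teacher")
knowledge_pipe = Pipeline.from_flows([ToyMMLUBenchFlow, ToySynthKnowledgeFlow], flow_params)
gen_data = knowledge_pipe.generate([{"question": "q1"}])
print(gen_data)
```

Passing flow classes rather than instances keeps the call site short, at the cost of requiring every flow to accept the same `FlowParams` in its constructor.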

Resolving this issue would require an update to [the design doc](https://github.com/instructlab/dev-docs/blob/main/docs/sdg/sdg-api-interface.md) and a code change.

It would definitely be better to do this before users of the API proliferate.
