Consider updating API design for clarity #61

@markmc

Consider the following:

ds = Dataset.from_list(samples)

skills_flow = SynthGroundedSkillsFlow(client, "mixtral", teacher_model).get_flow()
skills_pipe = Pipeline(skills_flow)

sdg = SDG([skills_pipe])
gen_data = sdg.generate(ds)

or:

ds = Dataset.from_list(samples)

mmlu_flow = MMLUBenchFlow(client, teacher_model).get_flow()
mmlu_pipe = Pipeline(mmlu_flow)
knowledge_flow = SynthKnowledgeFlow(client, teacher_model).get_flow()
knowledge_pipe = Pipeline(knowledge_flow)

sdg = SDG([mmlu_pipe, knowledge_pipe])
gen_data = sdg.generate(ds)

Consider the nouns:

  • Dataset - this is from Hugging Face's datasets library
  • Block - not shown in the code above, but required to understand a flow - a block provides a generate() method that transforms an input dataset and returns an output dataset
  • Block config - a description of how to instantiate and invoke a block
  • Flow - a class which describes how to render a sequence of block configs from a template
  • Pipeline - a pipeline is created from a sequence of block configs, and provides a generate() method in which it instantiates and invokes blocks in turn, passing the input dataset and collecting the output
  • SDG - an SDG is created from a list of pipelines, and its generate() method calls pipelines in turn
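To make the relationship between blocks, block configs and pipelines concrete, here is a minimal sketch. It is not the project's actual code: the two block classes, the `block_type`/`block_config` dictionary keys and the plain list-of-dicts stand-in for a Hugging Face Dataset are all assumptions made for illustration. SDG is left out because the description above does not say how it combines the outputs of its pipelines.

```python
from typing import Any

# Assumption: a plain list of dicts stands in for a Hugging Face Dataset.
Rows = list[dict[str, Any]]


class UppercaseBlock:
    """Hypothetical block: generate() maps an input dataset to an output dataset."""

    def __init__(self, column: str) -> None:
        self.column = column

    def generate(self, rows: Rows) -> Rows:
        return [{**row, self.column: str(row[self.column]).upper()} for row in rows]


class SuffixBlock:
    """Hypothetical block that appends a fixed suffix to one column."""

    def __init__(self, column: str, suffix: str) -> None:
        self.column = column
        self.suffix = suffix

    def generate(self, rows: Rows) -> Rows:
        return [{**row, self.column: f"{row[self.column]}{self.suffix}"} for row in rows]


# A block config describes how to instantiate a block (the key names are assumptions).
block_configs = [
    {"block_type": UppercaseBlock, "block_config": {"column": "text"}},
    {"block_type": SuffixBlock, "block_config": {"column": "text", "suffix": "!"}},
]


class Pipeline:
    """Instantiates and invokes blocks in turn, feeding each output to the next block."""

    def __init__(self, block_configs: list[dict[str, Any]]) -> None:
        self.block_configs = block_configs

    def generate(self, rows: Rows) -> Rows:
        for cfg in self.block_configs:
            block = cfg["block_type"](**cfg["block_config"])
            rows = block.generate(rows)
        return rows


result = Pipeline(block_configs).generate([{"text": "hello"}])
```

In this sketch the pipeline is the only object that ever instantiates a block; everything upstream of it just produces configs.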

Proposals:

  1. Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do
  2. Model Flow as a block config template - it would be clearer if we reinforced the idea that a "flow" is a template for a block config sequence - a render() method makes sense to me, along with an extensible params object for the common case of instantiating multiple flows
  3. Create a pipeline from a sequence of flows - add a Pipeline.from_flows() convenience class method to Pipeline that knows how to render block configs from a sequence of flows
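One way the three proposals could fit together is sketched below. This is only an illustration under assumptions: `FlowParams`, `render()` and `Pipeline.from_flows()` are the names proposed in this issue, but the field names (`client`, `model_family`, `teacher_model`), the two example flow classes and the contents of the rendered block configs are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FlowParams:
    """Shared parameters for rendering flows (field names are assumptions)."""

    client: Any
    model_family: str
    teacher_model: str


class Flow:
    """A template that renders a sequence of block configs from params."""

    def __init__(self, params: FlowParams) -> None:
        self.params = params

    def render(self) -> list[dict[str, Any]]:
        raise NotImplementedError


class ExampleQuestionFlow(Flow):
    """Hypothetical flow that renders a single block config."""

    def render(self) -> list[dict[str, Any]]:
        return [{"block_name": "gen_questions", "model": self.params.teacher_model}]


class ExampleAnswerFlow(Flow):
    """Hypothetical flow that renders a single block config."""

    def render(self) -> list[dict[str, Any]]:
        return [{"block_name": "gen_answers", "model": self.params.teacher_model}]


class Pipeline:
    """Holds a flat sequence of block configs (block execution is omitted here)."""

    def __init__(self, block_configs: list[dict[str, Any]]) -> None:
        self.block_configs = list(block_configs)

    @classmethod
    def from_flows(cls, flow_classes: list[type[Flow]], params: FlowParams) -> "Pipeline":
        # Render each flow with the shared params and concatenate the results.
        block_configs: list[dict[str, Any]] = []
        for flow_cls in flow_classes:
            block_configs.extend(flow_cls(params).render())
        return cls(block_configs)


params = FlowParams(client=None, model_family="mixtral", teacher_model="example-teacher")
pipe = Pipeline.from_flows([ExampleQuestionFlow, ExampleAnswerFlow], params)
block_names = [cfg["block_name"] for cfg in pipe.block_configs]
```

With this shape, a pipeline built from several flows is just one longer block config sequence, so no separate SDG wrapper is needed to combine them.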

So we could have e.g.

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = SynthGroundedSkillsFlow(flow_params).render()

skills_pipe = Pipeline(block_configs)

gen_data = skills_pipe.generate(ds)

or:

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

knowledge_pipe = Pipeline(block_configs)

gen_data = knowledge_pipe.generate(ds)

or:

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

knowledge_pipe = Pipeline.from_flows([MMLUBenchFlow, SynthKnowledgeFlow], flow_params)

gen_data = knowledge_pipe.generate(ds)

Resolving this issue would require an update to the design doc and a code change

It would definitely be better to do this before users of the API proliferate
