Closed
Description
Consider the following:
ds = Dataset.from_list(samples)
skills_flow = SynthGroundedSkillsFlow(client, "mixtral", teacher_model).get_flow()
skills_pipe = Pipeline(skills_flow)
sdg = SDG([skills_pipe])
gen_data = sdg.generate(ds)
or:
ds = Dataset.from_list(samples)
mmlu_flow = MMLUBenchFlow(client, teacher_model).get_flow()
mmlu_pipe = Pipeline(mmlu_flow)
knowledge_flow = SynthKnowledgeFlow(client, teacher_model).get_flow()
knowledge_pipe = Pipeline(knowledge_flow)
sdg = SDG([mmlu_pipe, knowledge_pipe])
gen_data = sdg.generate(ds)
Consider the nouns:
- Dataset - this is from Hugging Face's datasets library
- Block - not shown in the code above, but required to understand a flow - a block provides a `generate()` method that transforms an input dataset and returns an output dataset
- Block config - a description of how to instantiate and invoke a block
- Flow - a class which describes how to render a sequence of block configs from a template
- Pipeline - a pipeline is created from a sequence of block configs, and provides a `generate()` method in which it instantiates and invokes blocks in turn, passing the input dataset and collecting the output
- SDG - an SDG is created from a list of pipelines, and its `generate()` method calls pipelines in turn
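To make the relationships between these nouns concrete, here is a minimal sketch, not the project's actual code. It assumes a dataset can be stood in for by a plain list of row dicts, that a block config is a dict with hypothetical `block_type` and `block_config` keys, and that SDG feeds each pipeline's output into the next. `AppendBlock` is an invented example block.

```python
from typing import Any, Dict, List

# Assumption: a list of row dicts stands in for a Hugging Face Dataset so the
# sketch runs without the datasets library.
Rows = List[Dict[str, Any]]


class AppendBlock:
    """Hypothetical block: generate() maps an input dataset to an output dataset."""

    def __init__(self, column: str, suffix: str) -> None:
        self.column = column
        self.suffix = suffix

    def generate(self, rows: Rows) -> Rows:
        # Return new rows rather than mutating the input.
        return [{**row, self.column: row[self.column] + self.suffix} for row in rows]


class Pipeline:
    """Created from a sequence of block configs; runs the blocks in turn."""

    def __init__(self, block_configs: List[Dict[str, Any]]) -> None:
        self.block_configs = block_configs

    def generate(self, rows: Rows) -> Rows:
        for config in self.block_configs:
            # Instantiate the block from its config, then invoke it.
            block = config["block_type"](**config["block_config"])
            rows = block.generate(rows)
        return rows


class SDG:
    """Created from a list of pipelines; calls the pipelines in turn."""

    def __init__(self, pipelines: List[Pipeline]) -> None:
        self.pipelines = pipelines

    def generate(self, rows: Rows) -> Rows:
        # Assumption: each pipeline consumes the previous pipeline's output.
        for pipeline in self.pipelines:
            rows = pipeline.generate(rows)
        return rows


cfg_a = [{"block_type": AppendBlock, "block_config": {"column": "text", "suffix": "-a"}}]
cfg_b = [{"block_type": AppendBlock, "block_config": {"column": "text", "suffix": "-b"}}]
samples = [{"text": "x"}]

via_sdg = SDG([Pipeline(cfg_a), Pipeline(cfg_b)]).generate(samples)
via_pipeline = Pipeline(cfg_a + cfg_b).generate(samples)
```

Under these assumptions, `via_sdg` and `via_pipeline` hold the same rows: a single pipeline built from the concatenated block configs does the same work as an SDG wrapping two pipelines.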
Proposals:
- Remove `SDG` - we don't need both `SDG` and `Pipeline` since `Pipeline` can already do everything `SDG` can do
- Model `Flow` as a block config template - it would be clearer if we reinforced the idea that a "flow" is a template of a block config sequence - a `render()` method makes sense to me, and an extensible `params` object for the common case of instantiating multiple flows
- Create a pipeline from a sequence of flows - add a `Pipeline.from_flows()` convenience class method to Pipeline that knows how to render block configs from a sequence of flows
So we could have e.g.
ds = Dataset.from_list(samples)
flow_params = FlowParams(client, "mixtral", teacher_model)
block_configs = SynthGroundedSkillsFlow(flow_params).render()
skills_pipe = Pipeline(block_configs)
gen_data = skills_pipe.generate(ds)
or:
ds = Dataset.from_list(samples)
flow_params = FlowParams(client, "mixtral", teacher_model)
block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())
knowledge_pipe = Pipeline(block_configs)
gen_data = knowledge_pipe.generate(ds)
or:
ds = Dataset.from_list(samples)
flow_params = FlowParams(client, "mixtral", teacher_model)
knowledge_pipe = Pipeline.from_flows([MMLUBenchFlow, SynthKnowledgeFlow], flow_params)
gen_data = knowledge_pipe.generate(ds)
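For illustration, one way the proposed pieces could fit together is sketched below. This is a sketch under stated assumptions, not an agreed design: the `FlowParams` field names, the `block_type` and `block_config` config keys, and the two example flows (`FamilyTagFlow`, `ModelTagFlow`) are all hypothetical, and a list of row dicts stands in for a `Dataset`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Type

# Assumption: a list of row dicts stands in for a Hugging Face Dataset.
Rows = List[Dict[str, Any]]


@dataclass
class FlowParams:
    """Shared parameters for rendering flows; field names are hypothetical."""

    client: Any
    model_family: str
    model_id: str


class AppendBlock:
    """Hypothetical block that appends a suffix to one column."""

    def __init__(self, column: str, suffix: str) -> None:
        self.column = column
        self.suffix = suffix

    def generate(self, rows: Rows) -> Rows:
        return [{**row, self.column: row[self.column] + self.suffix} for row in rows]


class Flow:
    """A template that renders a sequence of block configs from params."""

    def __init__(self, params: FlowParams) -> None:
        self.params = params

    def render(self) -> List[Dict[str, Any]]:
        raise NotImplementedError


class FamilyTagFlow(Flow):
    def render(self) -> List[Dict[str, Any]]:
        suffix = f" family={self.params.model_family}"
        return [{"block_type": AppendBlock,
                 "block_config": {"column": "text", "suffix": suffix}}]


class ModelTagFlow(Flow):
    def render(self) -> List[Dict[str, Any]]:
        suffix = f" model={self.params.model_id}"
        return [{"block_type": AppendBlock,
                 "block_config": {"column": "text", "suffix": suffix}}]


class Pipeline:
    def __init__(self, block_configs: List[Dict[str, Any]]) -> None:
        self.block_configs = block_configs

    @classmethod
    def from_flows(cls, flow_classes: Sequence[Type[Flow]], params: FlowParams) -> "Pipeline":
        # Render each flow with the shared params and concatenate the configs.
        block_configs: List[Dict[str, Any]] = []
        for flow_cls in flow_classes:
            block_configs.extend(flow_cls(params).render())
        return cls(block_configs)

    def generate(self, rows: Rows) -> Rows:
        for config in self.block_configs:
            block = config["block_type"](**config["block_config"])
            rows = block.generate(rows)
        return rows


params = FlowParams(client=None, model_family="mixtral", model_id="teacher")
pipe = Pipeline.from_flows([FamilyTagFlow, ModelTagFlow], params)
gen_data = pipe.generate([{"text": "q"}])
```

Passing flow classes rather than instances lets `from_flows()` apply the same `FlowParams` to every flow, which matches the common case described in the proposal.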
Resolving this issue would require an update to the design doc and a code change.
It would definitely be better to do this before users of the API proliferate.