-
Notifications
You must be signed in to change notification settings - Fork 24
Block design #24
Block design #24
Conversation
drugilsberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm, should generalize a lot the framework
gabe-l-hart
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more NITs/tightenings in the block base
fms_dgt/base/block.py
Outdated
| ) -> None: | ||
|
|
||
| if not (isinstance(arg_fields, list) or arg_fields is None): | ||
| raise TypeError(f"arg_fields must be of type 'list'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: No need for the f" f-string prefix since there's no interpolation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, yes good catch
fms_dgt/base/block.py
Outdated
| from datasets import Dataset | ||
| import pandas as pd | ||
|
|
||
| DATASET_ROW_TYPE = Union[Dict, pd.Series] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to the question below, I think the Dict here could be further restricted as Dict[str, _something_]. If there are any restrictions on the types for the value, that _something_ could be itself a big Union or type def, or it could just be Any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go with Dict[str, Any]. I frequently pass a dictionary with the SDG object I'm building up as one of the values
Co-authored-by: Gabe Goodhart <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>
|
Hi @mvcrouse and all, I am going through a mental exercise how a module from DPK, say code quality assessment or language identification, may be enabled as a Block or ValidatorBlock. In DPK we standardize module IO data format to be pyarrow tables, which may be mapped to other iterable BLOCK_ROW_TYPE. For ValidatorBlocks, what can I reuse from the base class? |
For a ValidatorBlock, the way we had been using it is to have a |
gabe-l-hart
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super close on the block base stuff! A couple more comments
fms_dgt/base/block.py
Outdated
| result_field: str = None, | ||
| ) -> None: | ||
|
|
||
| if not (isinstance(arg_fields, list) or arg_fields is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I'm confused why we have None as a default (sidebar: if None is valid, the type hint needs to be Optional[List[str]]), but then we immediately raise if it's set to None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought was that None would be the default when the data builder configs excluded those fields, and when that would happen we'd have the arg_fields, kwarg_fields, result_field set only in code (like we use in the API-task databuilder here and here. I'd tend to want to keep these optional to have less things be forced into the configs. For me, it's been clearer to have the strings defined for those fields in the same file that they are later used, e.g., here we define what the result field is and here we use that string, so it feels less like a magic string. Though I'm not strongly opposed to the alternative if there's a case for these being required in the config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 I was reading the parens wrong. I thought this was raising if it IS None, not if it's not. I think this could be simplified to if not isinstance(arg_fields, (list, type(None))) which would be a little simpler to read
fms_dgt/base/block.py
Outdated
| class BaseUtilityBlock(BaseBlock): | ||
| pass | ||
|
|
||
|
|
||
| class BaseGeneratorBlock(BaseBlock): | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we need to reach a conclusion here. If we keep them, they need docstrings explaining why they're here
Could |
Currently we only use |
hickeyma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mvcrouse for the work on this. I like the general direction
If we are trying to be near to the InstructLab API for integration purposes then I have the following suggestions:
- Should
databuilderbe renamed aspipeline? See https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/pipeline.py - Should the databuilder/pipeline configuration be nearer to InstructLab configuration. For example
simpledatabuilder config would become something like this:
version: "1.0"
name: simple
blocks:
- name: gen_questions
type: genai
config:
arg_fields:
- prompt
kwarg_fields:
- stop_sequences
result_field: output
temperature: 0.0
max_new_tokens: 512
min_new_tokens: 1
model_id_or_path: mistralai/mixtral-8x7b-instruct-v01
- name: validate_answers
type: rouge_scorer
config:
arg_fields:
- new_toks
- all_toks
result_field: output
filter: true
threshold: 1.0
I think we'll want to maintain a distinction between |
I am thinking the In other words, move away from specifics in databuilder/sdg. |
Ah so I'm viewing a |
That is what I am thinking especially the direction being proposed in the following design docs: instructlab/dev-docs#109 and instructlab/dev-docs#113 |
Gotcha, then I think I'd rather |
gabe-l-hart
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of my comments have been addressed at this point. I was misreading the type checking logic and have one small suggestion to make it easier to read, but otherwise it looks good to go to me!
fms_dgt/base/block.py
Outdated
| result_field: str = None, | ||
| ) -> None: | ||
|
|
||
| if not (isinstance(arg_fields, list) or arg_fields is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 I was reading the parens wrong. I thought this was raising if it IS None, not if it's not. I think this could be simplified to if not isinstance(arg_fields, (list, type(None))) which would be a little simpler to read
hickeyma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with direction of the block design for now. The follow on PR should be around:
- How we deal with the pipeline or workflow of blocks
- How developer add custom data generation
Yep, totally agree |
* updating to new block abstraction --------- Co-authored-by: Max Crouse <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> Signed-off-by: Max Crouse <[email protected]>
What this PR does / why we need it
Abstracts generators / validators to share a common parent class. Now they are all called the same way and can be specified with shared configs
Special notes for your reviewer
If applicable**