This repository was archived by the owner on Nov 10, 2025. It is now read-only.

Conversation

@mvcrouse
Collaborator

@mvcrouse mvcrouse commented Jul 9, 2024

What this PR does / why we need it

Abstracts generators and validators so that they share a common parent class. They are now all called the same way and can be specified with shared configs.

Special notes for your reviewer

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

Collaborator

@drugilsberg drugilsberg left a comment

LGTM, this should generalize the framework a lot.

Contributor

@gabe-l-hart gabe-l-hart left a comment

A couple more NITs/tightenings in the block base

) -> None:

    if not (isinstance(arg_fields, list) or arg_fields is None):
        raise TypeError(f"arg_fields must be of type 'list'")
Contributor

NIT: No need for the f" f-string prefix since there's no interpolation

Collaborator Author

Ha, yes good catch

from datasets import Dataset
import pandas as pd

DATASET_ROW_TYPE = Union[Dict, pd.Series]
Contributor

Related to the question below, I think the Dict here could be further restricted as Dict[str, _something_]. If there are any restrictions on the types for the value, that _something_ could be itself a big Union or type def, or it could just be Any.

Collaborator Author

I'll go with Dict[str, Any]. I frequently pass a dictionary with the SDG object I'm building up as one of the values
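
A minimal sketch of the tightened alias agreed on in this thread, assuming string keys and unrestricted values. To keep the sketch dependency-free, the `pd.Series` member of the original `Union` is left out, and `get_field` and the `sdg_object` key are hypothetical names used only for illustration.

```python
from typing import Any, Dict

# Assumption: keys are always field names (str); values are unrestricted.
# The alias in the PR also admits pd.Series; it is omitted here so the
# sketch does not depend on pandas.
DATASET_ROW_TYPE = Dict[str, Any]


def get_field(row: DATASET_ROW_TYPE, field: str) -> Any:
    """Hypothetical helper: fetch one field from a row."""
    return row[field]


# A row may carry an arbitrary object (for example an in-progress SDG
# object) as one of its values, which is why the value type stays Any.
row: DATASET_ROW_TYPE = {"prompt": "What is 2 + 2?", "sdg_object": object()}
```

Restricting only the key type keeps the hint useful for field lookups without constraining what a data builder stores per row.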

Max Crouse and others added 6 commits July 10, 2024 11:16
@mvcrouse mvcrouse requested a review from gabe-l-hart July 10, 2024 17:35
@yuanchi2807
Collaborator

Hi @mvcrouse and all,

I am going through a mental exercise on how a module from DPK, say code quality assessment or language identification, may be enabled as a Block or ValidatorBlock.

In DPK we standardize module IO data format to be pyarrow tables, which may be mapped to other iterable BLOCK_ROW_TYPE. For ValidatorBlocks, what can I reuse from the base class?

@mvcrouse
Collaborator Author

Hi @mvcrouse and all,

I am going through a mental exercise on how a module from DPK, say code quality assessment or language identification, may be enabled as a Block or ValidatorBlock.

In DPK we standardize module IO data format to be pyarrow tables, which may be mapped to other iterable BLOCK_ROW_TYPE. For ValidatorBlocks, what can I reuse from the base class?

For a ValidatorBlock, the way we had been using it is to have a generate method (see here) and a _validate method (see here). The generate contains whatever bulk processing steps are required, then calls _validate method on each input to give it either a true or a false. When I was doing some minor testing with IL, I just called their blocks from our blocks (see here). I imagine a DPK transform would be similar?
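
The generate / _validate split described in this reply could be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `BaseValidatorBlock`, its constructor arguments, and the toy `MaxLengthValidator` are hypothetical, and rows are assumed to be plain dictionaries.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class BaseValidatorBlock(ABC):
    """Hypothetical base: generate() does bulk work, _validate() judges one input."""

    def __init__(self, arg_fields: List[str], result_field: str) -> None:
        self._arg_fields = arg_fields
        self._result_field = result_field

    def generate(self, inputs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
        outputs = []
        for row in inputs:
            # Pull the configured fields out of the row, in order.
            args = [row[field] for field in self._arg_fields]
            # Copy the row so the caller's data is left untouched.
            outputs.append({**row, self._result_field: self._validate(*args)})
        return outputs

    @abstractmethod
    def _validate(self, *args: Any) -> bool:
        """Return True if the input passes, False otherwise."""


class MaxLengthValidator(BaseValidatorBlock):
    """Toy validator (not from the PR): accepts text of at most max_len characters."""

    def __init__(self, max_len: int, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self._max_len = max_len

    def _validate(self, text: str) -> bool:
        return len(text) <= self._max_len


validator = MaxLengthValidator(max_len=5, arg_fields=["text"], result_field="is_valid")
results = validator.generate([{"text": "short"}, {"text": "far too long"}])
```

Under this shape, wrapping an external transform would mostly mean converting its table format to rows before `generate` and implementing `_validate` around the external call.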

Contributor

@gabe-l-hart gabe-l-hart left a comment

Super close on the block base stuff! A couple more comments

    result_field: str = None,
) -> None:

    if not (isinstance(arg_fields, list) or arg_fields is None):
Contributor

Hm. I'm confused why we have None as a default (sidebar: if None is valid, the type hint needs to be Optional[List[str]]), but then we immediately raise if it's set to None.

Collaborator Author

My thought was that None would be the default when the data builder configs exclude those fields. In that case, arg_fields, kwarg_fields, and result_field would be set only in code (like we do in the API-task databuilder here and here). I'd tend to want to keep these optional so that fewer things are forced into the configs. For me, it's been clearer to have the strings for those fields defined in the same file where they are later used, e.g., here we define what the result field is and here we use that string, so it feels less like a magic string. That said, I'm not strongly opposed to the alternative if there's a case for these being required in the config.

Contributor

🤦 I was reading the parens wrong. I thought this was raising if it IS None, not if it's not. I think this could be simplified to if not isinstance(arg_fields, (list, type(None))) which would be a little simpler to read
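
A small sketch of the simplified check suggested here, combined with the `Optional[List[str]]` hint and the plain (non-f) string from the earlier comments. The stand-alone function `check_arg_fields` is hypothetical; in the PR the check lives inside a constructor.

```python
from typing import List, Optional


def check_arg_fields(arg_fields: Optional[List[str]] = None) -> Optional[List[str]]:
    """Hypothetical stand-alone version of the constructor check."""
    # Accept a list or None; reject everything else.
    if not isinstance(arg_fields, (list, type(None))):
        raise TypeError("arg_fields must be of type 'list'")
    return arg_fields


check_arg_fields(None)        # accepted: None is the default
check_arg_fields(["prompt"])  # accepted: a list of field names
```

Passing a bare string such as `"prompt"` raises `TypeError`, which is the same behavior as the original `isinstance(...) or ... is None` form.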

Comment on lines 67 to 72
class BaseUtilityBlock(BaseBlock):
    pass


class BaseGeneratorBlock(BaseBlock):
    pass
Contributor

I still think we need to reach a conclusion here. If we keep them, they need docstrings explaining why they're here

@yuanchi2807
Collaborator

For a ValidatorBlock, the way we had been using it is to have a generate method

Could generate method be kept in GeneratorBlock and validate method be kept within the ValidatorBlock? One generated batch may be fed to multiple Validators? Is that a valid use case?

@mvcrouse
Collaborator Author

mvcrouse commented Jul 10, 2024

For a ValidatorBlock, the way we had been using it is to have a generate method

Could generate method be kept in GeneratorBlock and validate method be kept within the ValidatorBlock? One generated batch may be fed to multiple Validators? Is that a valid use case?

Currently we only use generate as a way of keeping some notion of API-surface compatibility with IL, rather than using that name to reflect what's really going on within the block. Previously we had all blocks implement a more generically named __call__ function which contained their core logic. All of this to say, we'll probably keep generate for now for all blocks to keep that compatibility, with the understanding that it's a bit of a confusingly named method

Collaborator

@hickeyma hickeyma left a comment

Thanks @mvcrouse for the work on this. I like the general direction

If we are trying to be near to the InstructLab API for integration purposes then I have the following suggestions:

  1. Should databuilder be renamed as pipeline? See https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/pipeline.py
  2. Should the databuilder/pipeline configuration be nearer to the InstructLab configuration? For example, the simple databuilder config would become something like this:
version: "1.0"
name: simple
blocks:
  - name: gen_questions
    type: genai
    config:
      arg_fields:
        - prompt
      kwarg_fields:
        - stop_sequences
      result_field: output
      temperature: 0.0
      max_new_tokens: 512
      min_new_tokens: 1
      model_id_or_path: mistralai/mixtral-8x7b-instruct-v01
  - name: validate_answers
    type: rouge_scorer
    config:
      arg_fields:
        - new_toks
        - all_toks
      result_field: output
      filter: true
      threshold: 1.0
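
One way a config of this shape could be turned into block objects is sketched below. It is only an illustration: the `Block` classes, `BLOCK_REGISTRY`, and `build_blocks` are hypothetical names, per-block settings are assumed to sit under `config`, and the dictionary stands in for whatever a YAML loader would return for a trimmed version of the config.

```python
from typing import Any, Dict, List, Type


class Block:
    """Hypothetical minimal block: stores its name and its config."""

    def __init__(self, name: str, **config: Any) -> None:
        self.name = name
        self.config = config


class GenAIBlock(Block):
    """Placeholder for a generator block."""


class RougeScorerBlock(Block):
    """Placeholder for a validator block."""


# Hypothetical registry mapping the `type` string to a block class.
BLOCK_REGISTRY: Dict[str, Type[Block]] = {
    "genai": GenAIBlock,
    "rouge_scorer": RougeScorerBlock,
}


def build_blocks(pipeline_cfg: Dict[str, Any]) -> List[Block]:
    """Instantiate blocks in the order in which they are listed."""
    blocks = []
    for spec in pipeline_cfg["blocks"]:
        block_cls = BLOCK_REGISTRY[spec["type"]]
        blocks.append(block_cls(name=spec["name"], **spec.get("config", {})))
    return blocks


# Stand-in for the parsed YAML (trimmed to a few settings per block).
pipeline_cfg = {
    "version": "1.0",
    "name": "simple",
    "blocks": [
        {"name": "gen_questions", "type": "genai",
         "config": {"arg_fields": ["prompt"], "result_field": "output"}},
        {"name": "validate_answers", "type": "rouge_scorer",
         "config": {"arg_fields": ["new_toks", "all_toks"], "threshold": 1.0}},
    ],
}
blocks = build_blocks(pipeline_cfg)
```

Treating `blocks` as an ordered list, rather than a mapping keyed by name, is what lets the config double as an execution order.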

@mvcrouse
Collaborator Author

Thanks @mvcrouse for the work on this. I like the general direction

If we are trying to be near to the InstructLab API for integration purposes then I have the following suggestions:

  1. Should databuilder be renamed as pipeline? See https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/pipeline.py
  2. Should the databuilder/pipeline configuration be nearer to the InstructLab configuration? For example, the simple databuilder config would become something like this:
version: "1.0"
name: simple
blocks:
  - name: gen_questions
    type: genai

I think we'll want to maintain a distinction between databuilder and pipeline, as one is more for code-based SDG and the other is config-based. However, I do think one of the next additions should be to add a pipeline class in. On point (2), I'm not opposed to that, it seems the change would just be having a name field and then treating blocks as a list?

@hickeyma
Collaborator

hickeyma commented Jul 11, 2024

I think we'll want to maintain a distinction between databuilder and pipeline, as one is more for code-based SDG and the other is config-based. However, I do think one of the next additions should be to add a pipeline class in. On point (2), I'm not opposed to that, it seems the change would just be having a name field and then treating blocks as a list?

I am thinking the pipeline is just the flow of a request that is processed by blocks. If this is the case, is Block the extension where you add new types with logic in code?

In other words, move away from specifics in databuilder/sdg.

@mvcrouse
Collaborator Author

I think we'll want to maintain a distinction between databuilder and pipeline, as one is more for code-based SDG and the other is config-based. However, I do think one of the next additions should be to add a pipeline class in. On point (2), I'm not opposed to that, it seems the change would just be having a name field and then treating blocks as a list?

I am thinking the pipeline is just the flow of a request that is processed by blocks. If this is the case, is Block the extension where you add new types with logic in code?

Ah so I'm viewing a pipeline as a direct chain of blocks, where the output of one block is directly piped into the input of another block. That's more so how they do it in IL, right?

@hickeyma
Collaborator

Ah so I'm viewing a pipeline as a direct chain of blocks, where the output of one block is directly piped into the input of another block. That's more so how they do it in IL, right?

That is what I am thinking especially the direction being proposed in the following design docs: instructlab/dev-docs#109 and instructlab/dev-docs#113

@mvcrouse
Collaborator Author

Ah so I'm viewing a pipeline as a direct chain of blocks, where the output of one block is directly piped into the input of another block. That's more so how they do it in IL, right?

That is what I am thinking especially the direction being proposed in the following design docs: instructlab/dev-docs#109 and instructlab/dev-docs#113

Gotcha, then I think I'd rather pipeline be its own separate class and keep data builders as they are, that way we can have both code-based frameworks and config-based
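
The "direct chain of blocks" reading of a pipeline from this exchange could look roughly like the sketch below. No pipeline class exists in this PR, so `Pipeline`, `run`, and the two toy blocks are hypothetical, and blocks are modeled as plain callables over lists of row dictionaries.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
BlockFn = Callable[[List[Row]], List[Row]]


class Pipeline:
    """Hypothetical chain: each block's output is the next block's input."""

    def __init__(self, blocks: List[BlockFn]) -> None:
        self._blocks = blocks

    def run(self, rows: List[Row]) -> List[Row]:
        for block in self._blocks:
            rows = block(rows)
        return rows


def add_length(rows: List[Row]) -> List[Row]:
    """Toy generator-like block: annotate each row with its text length."""
    return [{**row, "length": len(row["text"])} for row in rows]


def keep_short(rows: List[Row]) -> List[Row]:
    """Toy validator-like block: drop rows whose text exceeds 5 characters."""
    return [row for row in rows if row["length"] <= 5]


pipeline = Pipeline([add_length, keep_short])
output = pipeline.run([{"text": "hi"}, {"text": "too long"}])
```

A data builder, by contrast, is free to call blocks in any order from code, which is the distinction the thread settles on keeping.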

Contributor

@gabe-l-hart gabe-l-hart left a comment

All of my comments have been addressed at this point. I was misreading the type checking logic and have one small suggestion to make it easier to read, but otherwise it looks good to go to me!

@hickeyma hickeyma self-requested a review July 11, 2024 16:49
Collaborator

@hickeyma hickeyma left a comment

I am happy with the direction of the block design for now. The follow-on PR should be around:

  1. How we deal with the pipeline or workflow of blocks
  2. How developers add custom data generation

@mvcrouse
Collaborator Author

I am happy with the direction of the block design for now. The follow-on PR should be around:

  1. How we deal with the pipeline or workflow of blocks
  2. How developers add custom data generation

Yep, totally agree

@mvcrouse mvcrouse merged commit c2eae47 into foundation-model-stack:main Jul 11, 2024
@mvcrouse mvcrouse deleted the block_design branch July 11, 2024 16:57
mvcrouse added a commit to mvcrouse/fms-sdg that referenced this pull request Aug 1, 2024
* updating to new block abstraction

---------

Co-authored-by: Max Crouse <[email protected]>
Co-authored-by: Gabe Goodhart <[email protected]>
Signed-off-by: Max Crouse <[email protected]>
6 participants