Block design #24

mvcrouse · 2024-07-09T19:12:50Z

What this PR does / why we need it

Abstracts generators and validators so that they share a common parent class. They are now all called the same way and can be specified with shared configs.

Special notes for your reviewer

If applicable:

this PR contains documentation
this PR contains unit tests
this PR has been tested for backwards compatibility

drugilsberg

LGTM, this should generalize the framework a lot.

gabe-l-hart

A couple more NITs/tightenings in the block base

gabe-l-hart · 2024-07-10T16:06:26Z

fms_dgt/base/block.py

+    ) -> None:
+
+        if not (isinstance(arg_fields, list) or arg_fields is None):
+            raise TypeError(f"arg_fields must be of type 'list'")


NIT: No need for the f prefix on this string since there's no interpolation.

Ha, yes good catch

gabe-l-hart · 2024-07-10T16:09:23Z

fms_dgt/base/block.py

+from datasets import Dataset
+import pandas as pd
+
+DATASET_ROW_TYPE = Union[Dict, pd.Series]


Related to the question below, I think the Dict here could be further restricted as Dict[str, _something_]. If there are any restrictions on the types for the value, that _something_ could be itself a big Union or type def, or it could just be Any.

I'll go with Dict[str, Any]. I frequently pass a dictionary with the SDG object I'm building up as one of the values.

fms_dgt/base/block.py

Co-authored-by: Gabe Goodhart <[email protected]>

yuanchi2807 · 2024-07-10T18:40:43Z

Hi @mvcrouse and all,

I am going through a mental exercise on how a module from DPK, say code quality assessment or language identification, might be enabled as a Block or ValidatorBlock.

In DPK we standardize module IO data format to be pyarrow tables, which may be mapped to other iterable BLOCK_ROW_TYPE. For ValidatorBlocks, what can I reuse from the base class?

mvcrouse · 2024-07-10T19:37:35Z

> Hi @mvcrouse and all,

> I am going through a mental exercise on how a module from DPK, say code quality assessment or language identification, might be enabled as a Block or ValidatorBlock.

> In DPK we standardize module IO data format to be pyarrow tables, which may be mapped to other iterable BLOCK_ROW_TYPE. For ValidatorBlocks, what can I reuse from the base class?

For a ValidatorBlock, the way we have been using it is to have a generate method (see here) and a _validate method (see here). The generate method contains whatever bulk processing steps are required, then calls the _validate method on each input to mark it as either true or false. When I was doing some minor testing with IL, I just called their blocks from our blocks (see here). I imagine a DPK transform would be similar?
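The generate / _validate split described above could look roughly like the sketch below. This is only an illustration: LengthValidator, its max_len parameter, and the "text" and "is_valid" field names are hypothetical and are not taken from fms_dgt.

```python
from typing import Any, Dict, Iterable, List


class LengthValidator:
    """Hypothetical validator block; not the actual fms_dgt base class."""

    def __init__(self, max_len: int, result_field: str = "is_valid") -> None:
        self._max_len = max_len
        self._result_field = result_field

    def generate(
        self, inputs: Iterable[Dict[str, Any]], arg_field: str = "text"
    ) -> List[Dict[str, Any]]:
        # Bulk step: walk the batch and attach a true/false verdict to each row.
        outputs = []
        for row in inputs:
            row = dict(row)  # copy so the caller's rows are not mutated
            row[self._result_field] = self._validate(row[arg_field])
            outputs.append(row)
        return outputs

    def _validate(self, text: str) -> bool:
        # Per-input check: here, just a length threshold.
        return len(text) <= self._max_len
```

A block that wraps an external transform (for example a DPK module) would presumably keep the same shape, with generate handling the conversion to and from the external data format.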

gabe-l-hart

Super close on the block base stuff! A couple more comments

gabe-l-hart · 2024-07-10T20:25:43Z

fms_dgt/base/block.py

+        result_field: str = None,
+    ) -> None:
+
+        if not (isinstance(arg_fields, list) or arg_fields is None):


Hm. I'm confused why we have None as a default (sidebar: if None is valid, the type hint needs to be Optional[List[str]]), but then we immediately raise if it's set to None.

My thought was that None would be the default when the data builder configs exclude those fields, and when that happens we'd have arg_fields, kwarg_fields, and result_field set only in code (like we do in the API-task databuilder here and here). I'd tend to want to keep these optional so that fewer things are forced into the configs. For me, it's been clearer to have the strings for those fields defined in the same file where they are later used, e.g., here we define what the result field is and here we use that string, so it feels less like a magic string. That said, I'm not strongly opposed to the alternative if there's a case for these being required in the config.

🤦 I was reading the parens wrong. I thought this was raising if it IS None, not if it's not. I think this could be simplified to if not isinstance(arg_fields, (list, type(None))) which would be a little simpler to read
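As a quick check that the suggested simplification behaves the same as the original condition, here is a minimal sketch. The function names check_original, check_simplified, and raises_type_error are made up for this illustration; only the two conditions come from the review thread.

```python
from typing import Any, Callable, List, Optional


def check_original(arg_fields: Optional[List[str]]) -> None:
    # Condition as written in the PR diff.
    if not (isinstance(arg_fields, list) or arg_fields is None):
        raise TypeError("arg_fields must be of type 'list'")


def check_simplified(arg_fields: Optional[List[str]]) -> None:
    # Suggested form: isinstance accepts a tuple of types, and
    # type(None) is the class of None.
    if not isinstance(arg_fields, (list, type(None))):
        raise TypeError("arg_fields must be of type 'list'")


def raises_type_error(check: Callable[[Any], None], value: Any) -> bool:
    """Return True if check(value) raises TypeError."""
    try:
        check(value)
    except TypeError:
        return True
    return False
```

Both checks accept None and lists and reject everything else, so the change is purely about readability. The Optional[List[str]] annotation reflects the type-hint sidebar raised earlier in the thread.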

gabe-l-hart · 2024-07-10T20:27:48Z

fms_dgt/base/block.py

+class BaseUtilityBlock(BaseBlock):
+    pass
+
+
+class BaseGeneratorBlock(BaseBlock):
+    pass


I still think we need to reach a conclusion here. If we keep them, they need docstrings explaining why they're here

yuanchi2807 · 2024-07-10T21:05:52Z

> For a ValidatorBlock, the way we had been using it is to have a generate method

Could the generate method be kept in GeneratorBlock and the validate method be kept within ValidatorBlock? Could one generated batch be fed to multiple Validators? Is that a valid use case?

mvcrouse · 2024-07-10T22:11:48Z

> For a ValidatorBlock, the way we had been using it is to have a generate method

> Could the generate method be kept in GeneratorBlock and the validate method be kept within ValidatorBlock? Could one generated batch be fed to multiple Validators? Is that a valid use case?

Currently we only use generate as a way of keeping some notion of API-surface compatibility with IL, rather than using that name to reflect what's really going on within the block. Previously we had all blocks implement a more generically named __call__ function which contained their core logic. All of this to say, we'll probably keep generate for now for all blocks to keep that compatibility, with the understanding that it's a bit of a confusingly named method

hickeyma

Thanks @mvcrouse for the work on this. I like the general direction.

If we are trying to stay close to the InstructLab API for integration purposes, then I have the following suggestions:

1. Should databuilder be renamed to pipeline? See https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/pipeline.py
2. Should the databuilder/pipeline configuration be closer to the InstructLab configuration? For example, the simple databuilder config would become something like this:

version: "1.0"
name: simple
blocks:
  - name: gen_questions
    type: genai
    config:
      arg_fields:
      - prompt
    kwarg_fields:
      - stop_sequences
    result_field: output
    temperature: 0.0
    max_new_tokens: 512
    min_new_tokens: 1
    model_id_or_path: mistralai/mixtral-8x7b-instruct-v01
  - name: validate_answers
    type: rouge_scorer
    config:
      arg_fields:
      - new_toks
      - all_toks
    result_field: output
    filter: true
    threshold: 1.0

mvcrouse · 2024-07-11T13:28:29Z

> Thanks @mvcrouse for the work on this. I like the general direction.

> If we are trying to stay close to the InstructLab API for integration purposes, then I have the following suggestions:

> 1. Should databuilder be renamed to pipeline? See https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/pipeline.py
> 2. Should the databuilder/pipeline configuration be closer to the InstructLab configuration? For example, the simple databuilder config would become something like this:

> version: "1.0"
> name: simple
> blocks:
>   - name: gen_questions
>     type: genai

I think we'll want to maintain a distinction between databuilder and pipeline, as one is more for code-based SDG and the other is config-based. However, I do think one of the next additions should be to add a pipeline class in. On point (2), I'm not opposed to that, it seems the change would just be having a name field and then treating blocks as a list?
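One practical upside of treating blocks as a list of named entries is that duplicate names become easy to detect (a later commit in this PR is titled "make blocks a list, easier for duplicate checking"). A minimal sketch of such a check follows; find_duplicate_block_names is a hypothetical helper, not a function from fms_dgt, and the config is written as plain Python dicts rather than loaded from YAML.

```python
from collections import Counter
from typing import Any, Dict, List


def find_duplicate_block_names(blocks: List[Dict[str, Any]]) -> List[str]:
    """Return the block names that appear more than once, sorted."""
    counts = Counter(block["name"] for block in blocks)
    return sorted(name for name, n in counts.items() if n > 1)


# Block entries mirroring the names used in the proposed config above.
blocks = [
    {"name": "gen_questions", "type": "genai"},
    {"name": "validate_answers", "type": "rouge_scorer"},
    {"name": "gen_questions", "type": "genai"},  # accidental duplicate
]
duplicates = find_duplicate_block_names(blocks)
```

With blocks keyed by name in a mapping instead, a repeated key would typically be silently overwritten at load time, which is the case a list-based layout makes visible.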

hickeyma · 2024-07-11T14:20:07Z

> I think we'll want to maintain a distinction between databuilder and pipeline, as one is more for code-based SDG and the other is config-based. However, I do think one of the next additions should be to add a pipeline class in. On point (2), I'm not opposed to that, it seems the change would just be having a name field and then treating blocks as a list?

I am thinking the pipeline is just the flow of a request that is processed by blocks. If this is the case, is Block the extension where you add new types with logic in code?

In other words, move away from specifics in databuilder/sdg.

mvcrouse · 2024-07-11T14:31:50Z

> I think we'll want to maintain a distinction between databuilder and pipeline, as one is more for code-based SDG and the other is config-based. However, I do think one of the next additions should be to add a pipeline class in. On point (2), I'm not opposed to that, it seems the change would just be having a name field and then treating blocks as a list?

> I am thinking the pipeline is just the flow of a request that is processed by blocks. If this is the case, is Block the extension where you add new types with logic in code?

Ah so I'm viewing a pipeline as a direct chain of blocks, where the output of one block is directly piped into the input of another block. That's more so how they do it in IL, right?
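The "direct chain of blocks" view can be sketched in a few lines. Everything here is an assumption made for illustration: run_pipeline, the Row and BlockFn aliases, and the two toy blocks are hypothetical and do not reflect the InstructLab or fms_dgt APIs.

```python
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]
BlockFn = Callable[[List[Row]], List[Row]]


def run_pipeline(blocks: Sequence[BlockFn], rows: List[Row]) -> List[Row]:
    """Pipe the output of each block directly into the next one."""
    for block in blocks:
        rows = block(rows)
    return rows


def add_length(rows: List[Row]) -> List[Row]:
    # Toy "generator" step: annotate each row with the length of its text.
    return [{**row, "length": len(row["text"])} for row in rows]


def keep_short(rows: List[Row]) -> List[Row]:
    # Toy "validator" step: drop rows whose text is longer than 5 characters.
    return [row for row in rows if row["length"] <= 5]


result = run_pipeline([add_length, keep_short], [{"text": "hi"}, {"text": "a longer row"}])
```

A data builder, by contrast, can call blocks in any order and mix in arbitrary code between calls, which is the distinction between code-based and config-based SDG drawn in this thread.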

hickeyma · 2024-07-11T15:13:11Z

> Ah so I'm viewing a pipeline as a direct chain of blocks, where the output of one block is directly piped into the input of another block. That's more so how they do it in IL, right?

That is what I am thinking, especially given the direction being proposed in the following design docs: instructlab/dev-docs#109 and instructlab/dev-docs#113

mvcrouse · 2024-07-11T15:16:37Z

> Ah so I'm viewing a pipeline as a direct chain of blocks, where the output of one block is directly piped into the input of another block. That's more so how they do it in IL, right?

> That is what I am thinking, especially given the direction being proposed in the following design docs: instructlab/dev-docs#109 and instructlab/dev-docs#113

Gotcha, then I think I'd rather pipeline be its own separate class and keep data builders as they are; that way we can have both code-based and config-based frameworks.

gabe-l-hart

All of my comments have been addressed at this point. I was misreading the type checking logic and have one small suggestion to make it easier to read, but otherwise it looks good to go to me!

gabe-l-hart · 2024-07-11T16:14:18Z

fms_dgt/base/block.py

+        result_field: str = None,
+    ) -> None:
+
+        if not (isinstance(arg_fields, list) or arg_fields is None):


🤦 I was reading the parens wrong. I thought this was raising if it IS None, not if it's not. I think this could be simplified to if not isinstance(arg_fields, (list, type(None))) which would be a little simpler to read

hickeyma

I am happy with the direction of the block design for now. The follow-on PR should cover:

How we deal with the pipeline or workflow of blocks
How developers add custom data generation

mvcrouse · 2024-07-11T16:55:43Z

> I am happy with the direction of the block design for now. The follow-on PR should cover:

> How we deal with the pipeline or workflow of blocks

> How developers add custom data generation

Yep, totally agree

* updating to new block abstraction --------- Co-authored-by: Max Crouse <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> Signed-off-by: Max Crouse <[email protected]>

Max Crouse added 24 commits July 3, 2024 09:30

updating blocks · 72a2bd9
merge with main · 6d6f4d9
update all pub databuilders · 7b2d811
caching llm · 4ee38bb
updating with main · aea2444
adding compatibility_tests · d859057
merge main · b84e7de
template update · 3e1fbc0
template update · aec74ff
rm import · 8a215de
remove old return type · 52d26f6
add parquet saving / loading · 94ffa99
adding utility block · a19116f
adding utility block · e48ec42
rm block suffix · 0ebe2a0
remove config argument · 1d654dd
demonstrate default vals · 6d51079
demonstrate default vals · 0d17bb6
remove abstract method to simplify · fba82b6
remove abstract method to simplify · e7db4dd
misc minor changes · af01c4e
call to generate · a94c7eb
call to generate · d06a26d
non base functions · b7fcb59

mvcrouse requested review from drugilsberg, gabe-l-hart and ramon-astudillo July 9, 2024 19:12

mvcrouse requested review from pavan046 and sivasankalpp as code owners July 9, 2024 19:12

drugilsberg approved these changes Jul 9, 2024

gabe-l-hart suggested changes Jul 10, 2024

Max Crouse and others added 6 commits July 10, 2024 11:16

dataset type · 0953d9b
Update fms_dgt/base/block.py · e1a5c23 · Co-authored-by: Gabe Goodhart <[email protected]>
Update fms_dgt/base/block.py · fb48faf · Co-authored-by: Gabe Goodhart <[email protected]>
Update fms_dgt/base/block.py · b4088ce · Co-authored-by: Gabe Goodhart <[email protected]>
Update fms_dgt/base/block.py · 64f971a · Co-authored-by: Gabe Goodhart <[email protected]>
fixing base block class · 41c9144

mvcrouse requested a review from gabe-l-hart July 10, 2024 17:35

consistency · 6f5631a

gabe-l-hart reviewed Jul 10, 2024

removing empty classes · 41009e9

hickeyma suggested changes Jul 11, 2024

make blocks a list, easier for duplicate checking · 7fb7f67

gabe-l-hart approved these changes Jul 11, 2024

simpler type check · 63d4ea6

hickeyma self-requested a review July 11, 2024 16:49

hickeyma approved these changes Jul 11, 2024

mvcrouse merged commit c2eae47 into foundation-model-stack:main Jul 11, 2024

mvcrouse deleted the block_design branch July 11, 2024 16:57
