Merged

Changes from 24 of 46 commits

Commits:
72a2bd9 updating blocks (Jul 3, 2024)
6d6f4d9 merge with main (Jul 3, 2024)
7b2d811 update all pub databuilders (Jul 3, 2024)
4ee38bb caching llm (Jul 3, 2024)
aea2444 updating with main (Jul 3, 2024)
d859057 adding compatibility_tests (Jul 3, 2024)
b84e7de merge main (Jul 3, 2024)
3e1fbc0 template update (Jul 3, 2024)
aec74ff template update (Jul 3, 2024)
8a215de rm import (Jul 3, 2024)
52d26f6 remove old return type (Jul 4, 2024)
94ffa99 add parquet saving / loading (Jul 5, 2024)
a19116f adding utility block (Jul 5, 2024)
e48ec42 adding utility block (Jul 5, 2024)
0ebe2a0 rm block suffix (Jul 8, 2024)
1d654dd remove config argument (Jul 8, 2024)
6d51079 demonstrate default vals (Jul 8, 2024)
0d17bb6 demonstrate default vals (Jul 8, 2024)
fba82b6 remove abstract method to simplify (Jul 8, 2024)
e7db4dd remove abstract method to simplify (Jul 8, 2024)
af01c4e misc minor changes (Jul 8, 2024)
a94c7eb call to generate (Jul 9, 2024)
d06a26d call to generate (Jul 9, 2024)
b7fcb59 non base functions (Jul 9, 2024)
8b03d98 merge with main (Jul 10, 2024)
cb29984 registry change (Jul 10, 2024)
b49ffa1 genai req bug fix (Jul 10, 2024)
c7a49a6 Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
80b426f Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
e2bc1dc Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
79278fd Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
fed9702 throw type error (Jul 10, 2024)
030caad Merge branch 'block_design' of github.com:mvcrouse/fms-sdg into block… (Jul 10, 2024)
ed9e13b instance methods (Jul 10, 2024)
f78b81d dataset type (Jul 10, 2024)
13701af dataset type (Jul 10, 2024)
0953d9b dataset type (Jul 10, 2024)
e1a5c23 Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
fb48faf Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
b4088ce Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
64f971a Update fms_dgt/base/block.py (mvcrouse, Jul 10, 2024)
41c9144 fixing base block class (Jul 10, 2024)
6f5631a consistency (Jul 10, 2024)
41009e9 removing empty classes (Jul 10, 2024)
7fb7f67 make blocks a list, easier for duplicate checking (Jul 11, 2024)
63d4ea6 simpler type check (Jul 11, 2024)
136 changes: 136 additions & 0 deletions fms_dgt/base/block.py
@@ -0,0 +1,136 @@
# Standard
from abc import ABC
from typing import Any, Dict, Iterable, List, Optional, Union

# Third Party
from datasets import Dataset
import pandas as pd

BLOCK_ROW_TYPE = Union[Dict, pd.Series]
BLOCK_INPUT_TYPE = Union[Iterable[BLOCK_ROW_TYPE], pd.DataFrame, Dataset]
Contributor: It may be worth naming this something that's more descriptive of what it is versus how it's used. Something like DATASET_TYPE. I think this would help readers of the class understand that a Block is supposed to operate on an abstract notion of a DATASET (an iterable of rows with named columns).
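A minimal sketch of the rename the comment proposes (DATASET_TYPE is the reviewer's suggested name, not what this revision of the file defines):

# Hypothetical rename per the comment above: name the alias for what the
# data *is* (a dataset of rows) rather than how a block consumes it
DATASET_ROW_TYPE = Union[Dict, pd.Series]
DATASET_TYPE = Union[Iterable[DATASET_ROW_TYPE], pd.DataFrame, Dataset]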



class BaseBlock(ABC):
    """Base Class for all Blocks"""

    def __init__(
        self,
        name: str = None,
        arg_fields: List[str] = None,
        kwarg_fields: List[str] = None,
        result_field: str = None,
    ) -> None:

        self._name = name

        # minor type checking
Contributor: These type coercions violate the type hints in the function signature. Either the types in the signature should explicitly allow these degenerate cases where a single instance gets converted to a list of size 1, or these should simply do type assertions.

I've personally fallen into this trap a TON: you want to make things as friendly as possible to your users and try to let the user do what feels natural. The problem with this kind of "help the user out" code is that it ultimately results in "magic" which only you know about and prevents users from being able to cleanly operate in the codebase as a self-service experience. My suggestion here is to be explicit about where/how you're helping the user out by guiding them with the type hints OR keep the signature simple and just make the user do the list wrapping.

Collaborator (author): yep, that's a good point, I'll just do type assertions here
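A minimal sketch of the assertion-based approach the author agrees to, assuming the signature keeps the list types (illustrative only, not the merged code):

def __init__(
    self,
    name: str = None,
    arg_fields: List[str] = None,
    kwarg_fields: List[str] = None,
    result_field: str = None,
) -> None:
    # fail loudly instead of silently wrapping a stray str into a list,
    # so the signature tells the whole story
    assert arg_fields is None or isinstance(
        arg_fields, list
    ), f"arg_fields must be a list, got {type(arg_fields)}"
    assert kwarg_fields is None or isinstance(
        kwarg_fields, list
    ), f"kwarg_fields must be a list, got {type(kwarg_fields)}"
    assert result_field is None or isinstance(
        result_field, str
    ), f"result_field must be a str, got {type(result_field)}"
    self._name = name
    self._arg_fields = arg_fields
    self._kwarg_fields = kwarg_fields
    self._result_field = result_field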

        if type(arg_fields) == str:
Contributor: In general, any place you have if type(x) == y, it should be if isinstance(x, y). The only reason to do it like you have here is to explicitly disallow derived subclasses from passing the check, and if you're intentionally going for that, you should use is instead of == to do an exact object equivalence on the type itself.
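A quick standalone illustration of the three checks the comment distinguishes:

x = "hello"

isinstance(x, str)  # True for str and for any subclass of str
type(x) == str      # exact-type comparison; works, but the wrong idiom
type(x) is str      # idiomatic exact-type check that rejects subclasses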

            arg_fields = [arg_fields]
        if type(kwarg_fields) == str:
            kwarg_fields = [kwarg_fields]
        if type(result_field) == list:
            assert (
                len(result_field) == 1
            ), "Cannot have multiple 'result' fields for {name}"
Contributor: This assertion error is a bit confusing to me since it's not actually interpolating the {name} field.

Also, one thing to think about (outside the scope of this PR): if this library evolves into a strong foundation of abstractions, it will be important to think through how error semantics work. In particular, usage of assert should generally be reserved for programming errors. I think this case is exactly the case where a developer wrote the wrong code by sending in a list where a str is expected, but if we run into a situation where the user-provided args arrived via a service API call, we need to be able to distinguish 400-class vs 500-class errors (probably by raising some kind of appropriate exception).
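A sketch of the exception-based alternative the comment points toward, with the f-string interpolation fixed (the exact message wording is assumed):

if isinstance(result_field, list):
    if len(result_field) != 1:
        # raise a caller-facing exception rather than assert, and note
        # the f-prefix so {name} is actually interpolated
        raise ValueError(f"Cannot have multiple 'result' fields for {name}")
    result_field = result_field[0]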

            result_field = result_field[0]

        self._arg_fields = arg_fields
        self._kwarg_fields = kwarg_fields
        self._result_field = result_field

    @property
    def name(self):
        return self._name

    @property
    def arg_fields(self):
        return self._arg_fields

    @property
    def kwarg_fields(self):
        return self._kwarg_fields

    @property
    def result_field(self):
        return self._result_field

    def generate(
        self,
        inputs: BLOCK_INPUT_TYPE,
        arg_fields: Optional[List[str]] = None,
        kwarg_fields: Optional[List[str]] = None,
        result_field: Optional[str] = None,
    ):
        raise NotImplementedError
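For orientation, a hypothetical concrete block showing how a subclass might implement generate; it mirrors the loop in BaseValidatorBlock.generate below, and the uppercasing behavior is invented for the example:

class UppercaseBlock(BaseBlock):
    """Hypothetical block: uppercases one arg field into the result field."""

    def generate(
        self,
        inputs: BLOCK_INPUT_TYPE,
        arg_fields: Optional[List[str]] = None,
        kwarg_fields: Optional[List[str]] = None,
        result_field: Optional[str] = None,
    ):
        outputs = []
        for x in inputs:
            # pull the configured field out of the row, transform it,
            # and write the result back under the result field
            inp_args, _ = get_args_kwargs(x, arg_fields or self.arg_fields)
            write_result(
                x, str(inp_args[0]).upper(), result_field or self.result_field
            )
            outputs.append(x)
        return outputs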


class BaseUtilityBlock(BaseBlock):
    pass


class BaseGeneratorBlock(BaseBlock):
    pass
Contributor: These look like they're not doing anything. I could see there being value in them for type checks later on (e.g. if not isinstance(generator, BaseGeneratorBlock): ...), but if that's the case, I'd like a nice docstring explaining why these are needed, since they just look like extra indirection.

Collaborator (author): Yeah, my main thought was for type checking, but if we don't have an immediate use case for it, should we just scrap these two classes?

Contributor (@gabe-l-hart, Jul 10, 2024): This is a little tricky. I think it may be worth retaining these intermediate base classes for type checking and common subtype functionality, but it would probably need to be borne out by seeing if any of that is needed with the current suite of generators/validators that exist.

I will say that type-checking mixins can get really troublesome. I went a little crazy on this pattern in an internal library and ended up with non-planar inheritance diagrams. Let's steer clear of that!

Contributor: I still think we need to reach a conclusion here. If we keep them, they need docstrings explaining why they're here.

Collaborator (author): I think it's better we just drop them, to be honest. We don't have any type-checking considerations right now, and the directory structure provides an equivalent amount of block categorization when someone is trying to understand the code.
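For context, the kind of downstream check that would justify keeping the marker classes (purely illustrative; the thread concludes they should be dropped, and the later commit "removing empty classes" does so):

def run_generator_block(block: BaseBlock, inputs: BLOCK_INPUT_TYPE):
    # the empty subclass only earns its keep if checks like this exist
    # somewhere; otherwise it is pure indirection
    if not isinstance(block, BaseGeneratorBlock):
        raise TypeError(f"Expected a generator block, got {type(block)}")
    return block.generate(inputs)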



class BaseValidatorBlock(BaseBlock):
    def __init__(self, filter: bool = False, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self._filter_invalids = filter

    def generate(
        self,
        inputs: BLOCK_INPUT_TYPE,
        arg_fields: Optional[List[str]] = None,
        kwarg_fields: Optional[List[str]] = None,
        result_field: Optional[List[str]] = None,
    ):
        outputs = []
        for x in inputs:
            inp_args, inp_kwargs = get_args_kwargs(
                x, arg_fields or self.arg_fields, kwarg_fields or self.kwarg_fields
            )
            res = self._validate(*inp_args, **inp_kwargs)
            if res or not self._filter_invalids:
                write_result(x, res, result_field or self.result_field)
                outputs.append(x)
        return outputs

    def _validate(self, *args: Any, **kwargs: Any) -> bool:
        raise NotImplementedError


def get_args_kwargs(
Contributor: Since this is going to be a critical function for the operation of blocks, I have two thoughts:

  1. It definitely needs a docstring explaining what it does
  2. It would probably be cleaner as a @staticmethod on BaseBlock to make it clear that it's attached to the operation of a block (a sketch of this follows the function below)

    inp: BLOCK_ROW_TYPE,
    arg_fields: Optional[List[str]] = None,
    kwarg_fields: Optional[List[str]] = None,
):
    if arg_fields is None:
        arg_fields = []
    if kwarg_fields is None:
        kwarg_fields = []

    if type(inp) == dict:
        args = [inp.get(arg) for arg in arg_fields]
        kwargs = {kwarg: inp.get(kwarg) for kwarg in kwarg_fields}
    elif type(inp) in [pd.DataFrame, Dataset]:
        args = [inp.get(arg) for arg in arg_fields]
        kwargs = {kwarg: inp.get(kwarg) for kwarg in kwarg_fields}
    else:
        raise ValueError(f"Unexpected input type: {type(inp)}")
Contributor: Two things that I think would be cleaner here:

  1. It looks like the bodies of the if and elif clauses are the same, so they can be collapsed
  2. This should use the more idiomatic isinstance(x, types) type checking so that subclasses are respected

Suggested change:
-    if type(inp) == dict:
-        args = [inp.get(arg) for arg in arg_fields]
-        kwargs = {kwarg: inp.get(kwarg) for kwarg in kwarg_fields}
-    elif type(inp) in [pd.DataFrame, Dataset]:
-        args = [inp.get(arg) for arg in arg_fields]
-        kwargs = {kwarg: inp.get(kwarg) for kwarg in kwarg_fields}
-    else:
-        raise ValueError(f"Unexpected input type: {type(inp)}")
+    if isinstance(inp, (dict, pd.DataFrame, Dataset)):
+        args = [inp.get(arg) for arg in arg_fields]
+        kwargs = {kwarg: inp.get(kwarg) for kwarg in kwarg_fields}
+    else:
+        raise ValueError(f"Unexpected input type: {type(inp)}")

Collaborator (author): Ah, yes, good catch. I had initially put that second elif statement in there as a placeholder. But now revisiting it, I suspect we'll just handle it in the same way as the if statement, so collapsing the two is good.


    return args, kwargs
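A sketch of the @staticmethod variant suggested above, as it might sit inside BaseBlock, with a docstring and the isinstance collapse applied (illustrative only; the docstring wording is assumed):

    @staticmethod
    def get_args_kwargs(
        inp: BLOCK_ROW_TYPE,
        arg_fields: Optional[List[str]] = None,
        kwarg_fields: Optional[List[str]] = None,
    ):
        """Split one input row into (args, kwargs) for a block call,
        treating arg_fields and kwarg_fields as column names to look up.
        """
        arg_fields = arg_fields or []
        kwarg_fields = kwarg_fields or []
        # isinstance rather than type(...) == so subclasses are accepted
        if isinstance(inp, (dict, pd.DataFrame, Dataset)):
            args = [inp.get(arg) for arg in arg_fields]
            kwargs = {kwarg: inp.get(kwarg) for kwarg in kwarg_fields}
            return args, kwargs
        raise ValueError(f"Unexpected input type: {type(inp)}")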


def write_result(
Contributor: Same comment here about making this a @staticmethod on BaseBlock. It may also be worth making it an instance method and making result_field Optional to fall back to self.result_field.

Collaborator (author): Ah, yes, that was actually my original design; however, I was trying to go the free-function route you had mentioned earlier. It was a bit simpler as an instance method, since we didn't have to pass the odd "result_field or self._result_field" as input. Perhaps I'll put it back to that?

Contributor: Yeah, the free-function suggestion was certainly just a thought exercise. I think for these things member/static-member functions are probably good, since the type in question is explicitly bound to the base class.

(A sketch of the instance-method variant follows the function below.)

    inp: BLOCK_ROW_TYPE,
    res: Any,
    result_field: str,
):
    assert result_field is not None, "Result field cannot be None!"

    if type(inp) == dict:
        inp[result_field] = res
    elif type(inp) in [pd.DataFrame, Dataset]:
        inp[result_field] = res
    else:
        raise ValueError(f"Unexpected input type: {type(inp)}")
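A sketch of the instance-method variant discussed above, with result_field optional and falling back to the block's own configured field (illustrative only):

    def write_result(
        self,
        inp: BLOCK_ROW_TYPE,
        res: Any,
        result_field: Optional[str] = None,
    ):
        """Write a block's output into the row, defaulting to the block's
        configured result field when none is passed explicitly.
        """
        result_field = result_field or self._result_field
        if result_field is None:
            raise ValueError("Result field cannot be None!")
        if isinstance(inp, (dict, pd.DataFrame, Dataset)):
            inp[result_field] = res
        else:
            raise ValueError(f"Unexpected input type: {type(inp)}")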
126 changes: 52 additions & 74 deletions fms_dgt/base/databuilder.py
@@ -9,32 +9,26 @@
 # Third Party
 from tqdm import tqdm

-# First Party
-from fms_dgt.base.generator import BaseGenerator
-from fms_dgt.base.registry import get_generator, get_validator
+# Local
+from fms_dgt.base.block import BaseBlock
+from fms_dgt.base.registry import get_block
 from fms_dgt.base.task import SdgData, SdgTask
-from fms_dgt.base.validator import BaseValidator
-from fms_dgt.generators.llm import CachingLM, LMGenerator
+from fms_dgt.blocks.generators.llm import CachingLM, LMGenerator
 from fms_dgt.utils import all_annotations, sdg_logger


 @dataclass
 class DataBuilderConfig(dict):
     # data builder naming/registry
     name: Optional[str] = None
-    generators: Optional[Union[str, list]] = None
-    validators: Optional[Union[str, list]] = None
+    blocks: Optional[dict] = None
     generation_kwargs: Optional[dict] = None
     metadata: Optional[
         dict
     ] = None  # by default, not used in the code. allows for users to pass arbitrary info to data builders

     def __post_init__(self) -> None:
-        if self.generation_kwargs is not None:
-            if "temperature" in self.generation_kwargs:
-                self.generation_kwargs["temperature"] = float(
-                    self.generation_kwargs["temperature"]
-                )
+        pass


TYPE_KEY = "type"
@@ -66,7 +60,7 @@ def __init__(
         self._restart_generation = restart_generation

         # initializing generators / validators
-        self._init_gv(lm_cache=lm_cache)
+        self._init_blocks(lm_cache=lm_cache)

         # TODO: Data loader goes here
         self._tasks: List[SdgTask] = [
@@ -91,68 +85,52 @@ def config(self) -> DataBuilderConfig:
         return self._config

     @property
-    def generators(self) -> List[BaseGenerator]:
-        """Returns the generators associated with this class."""
-        return self._generators
+    def blocks(self) -> List[BaseBlock]:
+        """Returns the blocks associated with this class."""
+        return self._blocks
+
+    def _init_blocks(self, lm_cache: str = None):
+        self._blocks: List[BaseBlock] = []
+
+        # TODO: need to handle nested blocks
+        for obj_name, obj_config in self.config.blocks.items():
+            obj_kwargs = {**obj_config, "name": obj_name}
+            sdg_logger.debug(
+                "Initializing object %s with config %s", obj_name, obj_config
+            )

-    @property
-    def validators(self) -> List[BaseValidator]:
-        """Returns the validators associated with this class."""
-        return self._validators
-
-    def _init_gv(self, lm_cache: str = None):
-        _generators = (
-            [self.config.generators]
-            if type(self.config.generators) == str
-            else self.config.generators
-        )
-        _validators = (
-            [self.config.validators]
-            if type(self.config.validators) == str
-            else self.config.validators
-        )
-        self._generators: List[BaseGenerator] = []
-        self._validators: List[BaseValidator] = []
-
-        # TODO: need to handle nested generators / validators
-        for i, info_src in enumerate([_generators, _validators]):
-            # user may not define a generator / validator
-            if info_src is not None:
-                for obj_name, obj_config in info_src.items():
-                    sdg_logger.debug(
-                        "Initializing object %s with config %s", obj_name, obj_config
-                    )
-                    obj = (get_generator if i == 0 else get_validator)(
-                        obj_config[TYPE_KEY]
-                    )(obj_name, obj_config)
-
-                    if lm_cache is not None and isinstance(obj, LMGenerator):
-                        sdg_logger.info(
-                            "Using cache at %s",
-                            lm_cache + "_rank" + str(obj.rank) + ".db",
-                        )
-                        obj = CachingLM(
-                            obj,
-                            lm_cache
-                            # each rank receives a different cache db.
-                            # necessary to avoid multiple writes to cache at once
-                            + f"_model{os.path.split(obj.model_id_or_path)[-1]}_rank{obj.rank}.db",
-                        )
-
-                    type_annotations = all_annotations(type(self))
-                    assert (
-                        obj_name in type_annotations
-                    ), f"Object {obj_name} is missing from definition of DataBuilder {self.__class__}"
-
-                    obj_type = type_annotations[obj_name]
-
-                    # double check types
-                    assert isinstance(obj, obj_type) or (
-                        isinstance(obj, CachingLM) and isinstance(obj.lm, obj_type)
-                    ), f"Type of retrieved object {obj.__class__} for {obj_name} does not match type {obj_type} specified in DataBuilder {self.__class__}"
-
-                    setattr(self, obj_name, obj)
-                    (self._generators if i == 0 else self._validators).append(obj)
+            assert (
+                TYPE_KEY in obj_kwargs
+            ), f"'type' field missing from {obj_name} in data builder config"
+            obj = get_block(obj_kwargs.pop(TYPE_KEY))(**obj_kwargs)
+
+            if lm_cache is not None and isinstance(obj, LMGenerator):
+                sdg_logger.info(
+                    "Using cache at %s",
+                    lm_cache + "_rank" + str(obj.rank) + ".db",
+                )
+                obj = CachingLM(
+                    obj,
+                    lm_cache
+                    # each rank receives a different cache db.
+                    # necessary to avoid multiple writes to cache at once
+                    + f"_model{os.path.split(obj.model_id_or_path)[-1]}_rank{obj.rank}.db",
+                )
+
+            type_annotations = all_annotations(type(self))
+            assert (
+                obj_name in type_annotations
+            ), f"Object {obj_name} is missing from definition of DataBuilder {self.__class__}"
+
+            obj_type = type_annotations[obj_name]
+
+            # double check types
+            assert isinstance(obj, obj_type) or (
+                isinstance(obj, CachingLM) and isinstance(obj.lm, obj_type)
+            ), f"Type of retrieved object {obj.__class__} for {obj_name} does not match type {obj_type} specified in DataBuilder {self.__class__}"
+
+            setattr(self, obj_name, obj)
+            self._blocks.append(obj)
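To make the loop above concrete, a hypothetical blocks section of a data builder config (block names and type strings are invented for illustration):

# hypothetical config: each key becomes the block's name; the "type"
# value is popped and used to look the class up in the block registry
blocks = {
    "generator_llm": {"type": "some_llm_generator", "temperature": 0.7},
    "quality_check": {"type": "some_validator", "filter": True},
}

for obj_name, obj_config in blocks.items():
    obj_kwargs = {**obj_config, "name": obj_name}
    block_type = obj_kwargs.pop("type")
    # get_block(block_type) would return the registered class, which is
    # then instantiated with the remaining kwargs
    print(obj_name, block_type, obj_kwargs)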

     def execute_tasks(self):
         # main entry point to task execution
35 changes: 0 additions & 35 deletions fms_dgt/base/generator.py

This file was deleted.

4 changes: 2 additions & 2 deletions fms_dgt/base/instance.py
@@ -1,6 +1,6 @@
 # Standard
-from dataclasses import dataclass, field
-from typing import Any, List, Optional
+from dataclasses import dataclass
+from typing import Any, Optional


 @dataclass