-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset Reranking #383
Dataset Reranking #383
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ritugala , thanks a lot for this contribution! I've left several comments but I stopped halfway through the review because there are a few recurring comments that would be good to fix throughout the whole PR, and then I can take another look:
- Don't commit large data files to the repo, they can be posted online and I'll help with doing this (message me with any that you want posted)
- Please make sure all functions (at least all public functions) have type annotations, and docstrings that indicate the arguments and return values.
- I made a suggestion that we move data prep scripts out of the main
prompt2model/
library and into a higher levelscripts/
directory. I realize that we did not do this before, but because this PR adds many data processing scripts I think it's probably a good opportunity to do so.
And thanks again for the nice work!
huggingface_data/huggingface_datasets/updated_dataset_index_file.json
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/preprocessed_datasets.json
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/preprocessing.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/preprocessing.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/preprocessing.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/retrieve_dataset_info_with_configs.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/retrieve_dataset_info_with_configs.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/retrieve_dataset_info_with_configs.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/retrieve_dataset_info_with_configs.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/dataset_index_file/retrieve_dataset_info_with_configs.py
Outdated
Show resolved
Hide resolved
Hi @ritugala I can take another look when you've had a moment to finish the above revisions and fix the CI! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ritugala , really sorry it took so long to review this.
Overall this look great! I checked it very briefly and made a few small suggestions. If you can reflect those I'm happy to merge, and if you think any shouldn't be changed it's OK to merge things in anyway.
def automatic_column_selection( | ||
instruction: str, | ||
dataset_name: str, | ||
dataset_description: str, | ||
dataset_columns: str, | ||
example_rows: dict, | ||
dataset_info: dict, | ||
) -> tuple[list[str], str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the previous design of this function is better, as it makes the requirements of what needs to be passed into the function explicit. If we pass in a dict then we don't have guarantees that the dict contains the correct info. Let's revert the changes here.
dataset_description, | ||
train_columns_formatted, | ||
dataset["train"][0], | ||
prompt_spec.instruction, dataset_info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, revert
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Graham Neubig <[email protected]>
Co-authored-by: Graham Neubig <[email protected]>
Co-authored-by: Graham Neubig <[email protected]>
Co-authored-by: Graham Neubig <[email protected]>
Description
These are changes for Automatic Reranking of Datasets + changes for updating the dataset_index file.
Reranking Changes
dataset_description_retriever.py changes:
reranking_prompt notable points:
Tests:
Other Addtions:
Evaluation of Reranking
Currently I have done qualitative eval, along with running on select tasks (linked here) - there seems to be a strong inherent affinity towards either popular datasets or datasets ranked num 1 - but the "confidence" parameter helps here. I played around with minor Chain of Thought prompting, which I think helped but results were inconclusive
I also plan to run reranking across BB lite to see the difference there
Discussion
Dataset Index File Changes:
These are changes for retrieving the dataset in real-time from HuggingFace, which takes 7-10 mins. [now takes a few seconds]
Created its own folder in dataset_retriever.
Contains the light preprocessing steps (preprocessing.py) for the datasets (where we try to filter out as much as possible before heavy processing) followed by the heavy processing (retrieve_dataset_info_with_configs). Light preprocessing was expected to not have any API calls/multiprocessing.
The flow of the heavy processing is as follows
The rest of the code is just multiprocessing (parallelyl processing chunks of datasets, writing them to temp files, and merging the temp files in the end)
Blocked by:
Reranking file should be uploaded to the photron server before this is merged