
Dataset Reranking #383

Merged
merged 24 commits into main from ritu-dataset-reranking on Jan 12, 2024
Conversation

@ritugala (Collaborator) commented Dec 1, 2023

Description

These are changes for automatic reranking of datasets, plus changes for updating the dataset_index file.

Reranking Changes

dataset_description_retriever.py changes:

  1. Load retrieved datasets from the generated dataset_index file (if it exists!) and load up to 5 configs, chosen at random. Use these datasets for reranking.
  2. Added function canonicalize_dataset_automatically to replace canonicalize_dataset_using_cli. Currently both are kept.
  3. [Minor] Generally use this dataset dict everywhere instead of separate variables (e.g. in column selection, canonicalize_dataset_using_cli).
  4. [Minor] Have the dataset retriever return just dataset names, for cleaner processing elsewhere.
  5. Added functionality for loading only up to 3,000 rows; otherwise this module would take very long to run for large datasets (like amazon_polarity). See the sketch after this list.
  6. Added functionality for using gated datasets only if the user opts in (Dataset retriever fails when trying to download gated dataset #371).
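
A minimal sketch of how points 5 and 6 might look, assuming the Hugging Face datasets and huggingface_hub libraries; the function name, MAX_ROWS constant, and allow_gated flag are illustrative, not the PR's actual code:

import itertools

from datasets import Dataset, load_dataset
from huggingface_hub import dataset_info

MAX_ROWS = 3000  # cap so large datasets (e.g. amazon_polarity) load quickly


def load_capped_dataset(
    dataset_name: str, config_name: str, allow_gated: bool = False
) -> Dataset | None:
    """Stream at most MAX_ROWS rows, optionally skipping gated datasets."""
    if not allow_gated and dataset_info(dataset_name).gated:
        # The user has not opted in to gated datasets (#371).
        return None
    streamed = load_dataset(dataset_name, config_name, split="train", streaming=True)
    return Dataset.from_list(list(itertools.islice(streamed, MAX_ROWS)))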

reranking_prompt notable points:

  1. Currently using in-context learning.
  2. Even without in-context learning, the dataset reranking prompt is 4k-5k tokens by itself (maybe more in the future, since right now the reranker sees 8-15 datasets instead of the 25 in the unique dataset index).
  3. So we need to use LLMs with a 16k context length to run dataset reranking.
  4. The response is expected in the format (dataset_name, config_name, confidence).

Tests:

  1. We use a tiny dataset_index_w_configs.json file for mocking HuggingFace calls
  2. Added 2 tests for reranking
  3. Mocked reranking, and updated tests that previously mocked canonicalize_dataset_using_cli to mock canonicalize_dataset_automatically instead.

Other Additions:

  1. Parsing function for the reranker prompt (a sketch follows below).
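
A hypothetical sketch of what parsing the "(dataset_name, config_name, confidence)" response could look like; the actual parser lives in prompt2model/utils/parse_responses.py and may differ:

import re


def parse_reranker_response(response: str) -> tuple[str, str, str] | None:
    """Extract (dataset_name, config_name, confidence) from the LLM output."""
    match = re.search(r"\(\s*([^,()]+),\s*([^,()]+),\s*([^,()]+)\)", response)
    if match is None:
        return None  # malformed response; caller can retry or fall back
    name, config, confidence = (g.strip().strip("'\"") for g in match.groups())
    return name, config, confidence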

Evaluation of Reranking
Currently I have done a qualitative eval, along with running on select tasks (linked here). There seems to be a strong inherent affinity towards either popular datasets or datasets ranked #1, but the "confidence" parameter helps here. I played around with minor Chain of Thought prompting, which I think helped, but results were inconclusive.
I also plan to run reranking across BB lite to see the difference there.

Discussion

  • I'm currently asking it to return its confidence in the returned dataset. It returns low confidence a decent amount of the time, but there are probably better ways to do this (e.g. using log probs); see the sketch after this list.
  • Let me know if more tests are required; happy to add them.
  • Future TODOs include using the tags from HF for reranking as well. This would require analysing the tags a bit, because some datasets have a really large number of them.
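
A rough illustration of the log-probs idea from the first bullet: a sketch against the OpenAI chat completions API, not code in this PR; the model choice and the reranking_prompt placeholder are assumptions:

import math

from openai import OpenAI

reranking_prompt = "..."  # the 4k-5k token reranking prompt described above

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",  # any model that exposes token logprobs
    messages=[{"role": "user", "content": reranking_prompt}],
    logprobs=True,
)
# Use the mean per-token probability of the answer as a crude confidence
# score, instead of asking the model to self-report its confidence.
token_logprobs = [t.logprob for t in response.choices[0].logprobs.content]
confidence = math.exp(sum(token_logprobs) / len(token_logprobs))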

Dataset Index File Changes:

These are changes for retrieving the dataset index in real time from HuggingFace, which used to take 7-10 mins [it now takes a few seconds].
Created its own folder in dataset_retriever.
It contains the light preprocessing steps (preprocessing.py) for the datasets (where we try to filter out as much as possible before heavy processing), followed by the heavy processing (retrieve_dataset_info_with_configs). Light preprocessing is expected to make no API calls and use no multiprocessing.
The flow of the heavy processing is as follows

  1. Iterate through the datasets, and check whether each is valid using the API.
  2. Get all config information for a dataset: load a streaming version of the dataset, and if it takes too long to load, or the columns contain Image/Video fields, skip the config. Flatten the dataset at this level itself (streaming datasets don't have HuggingFace support for this, so I have implemented it; see the sketch after this list).
  3. All of this is wrapped in a try/except block in case loading the dataset throws an error (e.g. the dataset requires an additional pip install of libraries, can't be streamed, etc.).
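
A minimal sketch of the manual flattening mentioned in step 2; the helper name is illustrative and the PR's actual implementation may differ:

from typing import Any


def flatten_row(row: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Recursively flatten nested dict columns, so that e.g.
    {"answers": {"text": ...}} becomes {"answers.text": ...},
    mirroring what datasets.Dataset.flatten() does for non-streaming datasets."""
    flat: dict[str, Any] = {}
    for key, value in row.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_row(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat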

The rest of the code is just multiprocessing (processing chunks of datasets in parallel, writing them to temp files, and merging the temp files at the end); a rough sketch follows.
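
A rough sketch of that flow, assuming retrieve_dataset_info_with_configs returns a dict (or None) per dataset name; the helper names and chunk size are illustrative:

import json
import multiprocessing
import tempfile


def process_chunk(dataset_names: list[str]) -> str:
    """Process one chunk of datasets and write the results to a temp file."""
    results = [retrieve_dataset_info_with_configs(name) for name in dataset_names]
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump([r for r in results if r is not None], f)
        return f.name


def build_index(all_names: list[str], chunk_size: int = 100) -> list[dict]:
    """Process chunks in parallel, then merge the temp files at the end."""
    chunks = [all_names[i : i + chunk_size] for i in range(0, len(all_names), chunk_size)]
    with multiprocessing.Pool() as pool:
        temp_files = pool.map(process_chunk, chunks)
    merged: list[dict] = []
    for path in temp_files:
        with open(path) as f:
            merged.extend(json.load(f))
    return merged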

Blocked by:

The reranking file should be uploaded to the photron server before this is merged.

@ritugala ritugala requested review from viswavi and neubig December 1, 2023 21:42
@ritugala ritugala changed the title Dataset Reranking Changes Dataset Reranking Dec 1, 2023
@ritugala ritugala marked this pull request as draft December 4, 2023 16:25
@ritugala ritugala marked this pull request as ready for review December 4, 2023 22:41
@neubig (Collaborator) left a comment

Hi @ritugala , thanks a lot for this contribution! I've left several comments, but I stopped halfway through the review because there are a few recurring issues that would be good to fix throughout the whole PR, and then I can take another look:

  1. Don't commit large data files to the repo; they can be posted online, and I'll help with doing this (message me with any that you want posted).
  2. Please make sure all functions (at least all public functions) have type annotations, and docstrings that indicate the arguments and return values.
  3. I made a suggestion that we move data prep scripts out of the main prompt2model/ library and into a higher level scripts/ directory. I realize that we did not do this before, but because this PR adds many data processing scripts I think it's probably a good opportunity to do so.

And thanks again for the nice work!

@neubig (Collaborator) commented Dec 13, 2023

Hi @ritugala I can take another look when you've had a moment to finish the above revisions and fix the CI!

@ritugala ritugala requested a review from neubig December 14, 2023 14:28
@neubig (Collaborator) left a comment

Hi @ritugala , really sorry it took so long to review this.
Overall this looks great! I checked it very briefly and made a few small suggestions. If you can reflect those I'm happy to merge, and if you think any shouldn't be changed it's OK to merge things in anyway.

Comment on lines 230 to 233
def automatic_column_selection(
    instruction: str,
    dataset_name: str,
    dataset_description: str,
    dataset_columns: str,
    example_rows: dict,
    dataset_info: dict,
) -> tuple[list[str], str]:
I think the previous design of this function is better, as it makes the requirements of what needs to be passed into the function explicit. If we pass in a dict then we don't have guarantees that the dict contains the correct info. Let's revert the changes here.

dataset_description,
train_columns_formatted,
dataset["train"][0],
prompt_spec.instruction, dataset_info
Similarly, revert

prompt2model/utils/parse_responses.py (outdated; resolved)
scripts/dataset_index/retrieve_dataset_info.py (outdated; resolved)
scripts/dataset_index/retrieve_dataset_info.py (outdated; resolved)
@ritugala ritugala merged commit f2eabc1 into main Jan 12, 2024
8 checks passed
@ritugala ritugala deleted the ritu-dataset-reranking branch January 12, 2024 12:52