
Dataset Reranking #383

Merged
merged 24 commits into main from ritu-dataset-reranking on Jan 12, 2024
Conversation

@ritugala (Collaborator) commented Dec 1, 2023

Description

These are changes for automatic reranking of datasets, plus changes for updating the dataset_index file.

Reranking Changes

dataset_description_retriever.py changes:

  1. Load retrieved datasets from the generated dataset_index file (if it exists!) and load up to 5 configs, chosen at random. Use these datasets for reranking.
  2. Added function canonicalize_dataset_automatically to replace canonicalize_dataset_using_cli. Currently both are kept.
  3. [Minor] Generally use this dataset dict everywhere instead of separate variables (e.g. in column selection, canonicalize_dataset_using_cli).
  4. [Minor] Have the dataset retriever return just dataset names, for cleaner processing elsewhere.
  5. Added functionality for loading only up to 3,000 rows; otherwise this module would take very long to run for large datasets (like amazon_polarity). See the sketch after this list.
  6. Added functionality for using gated datasets only if the user opts in (Dataset retriever fails when trying to download gated dataset #371).
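
A minimal sketch of how points 5 and 6 might look, assuming the Hugging Face datasets and huggingface_hub libraries; the function name, MAX_ROWS constant, and allow_gated flag are illustrative, not the PR's actual code:

import itertools

from datasets import Dataset, load_dataset
from huggingface_hub import dataset_info

MAX_ROWS = 3000  # cap so large datasets (e.g. amazon_polarity) load quickly


def load_capped_dataset(
    dataset_name: str, config_name: str, allow_gated: bool = False
) -> Dataset | None:
    """Stream at most MAX_ROWS rows, optionally skipping gated datasets."""
    if not allow_gated and dataset_info(dataset_name).gated:
        # The user has not opted in to gated datasets (#371).
        return None
    streamed = load_dataset(dataset_name, config_name, split="train", streaming=True)
    return Dataset.from_list(list(itertools.islice(streamed, MAX_ROWS)))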

reranking_prompt notable points:

  1. Currently using in-context learning.
  2. Even without in-context learning, the dataset reranking prompt is 4k-5k tokens by itself (maybe more in the future, since right now the reranker sees 8-15 datasets instead of the 25 in the unique dataset index).
  3. So we need to use LLMs with a 16k context length to run dataset reranking.
  4. The response is expected in the format (dataset_name, config_name, confidence).

Tests:

  1. We use a tiny dataset_index_w_configs.json file for mocking HuggingFace calls
  2. Added 2 tests for reranking
  3. Mocked reranking, and updated tests that previously mocked canonicalize_dataset_using_cli to mock canonicalize_dataset_automatically instead.

Other Additions:

  1. Parsing function for the reranker prompt (a sketch follows below).
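
A hypothetical sketch of what parsing the "(dataset_name, config_name, confidence)" response could look like; the actual parser lives in prompt2model/utils/parse_responses.py and may differ:

import re


def parse_reranker_response(response: str) -> tuple[str, str, str] | None:
    """Extract (dataset_name, config_name, confidence) from the LLM output."""
    match = re.search(r"\(\s*([^,()]+),\s*([^,()]+),\s*([^,()]+)\)", response)
    if match is None:
        return None  # malformed response; caller can retry or fall back
    name, config, confidence = (g.strip().strip("'\"") for g in match.groups())
    return name, config, confidence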

Evaluation of Reranking
Currently I have done a qualitative eval, along with running on select tasks (linked here). There seems to be a strong inherent affinity towards either popular datasets or datasets ranked #1, but the "confidence" parameter helps here. I played around with minor Chain of Thought prompting, which I think helped, but results were inconclusive.
I also plan to run reranking across BB lite to see the difference there.

Discussion

  • I'm currently asking it to return its confidence in the returned dataset. It returns low confidence a decent amount of the time, but there are probably better ways to do this (e.g. using log probs); see the sketch after this list.
  • Let me know if more tests are required; happy to add them.
  • Future TODOs include using the tags from HF for reranking as well. This would require analysing the tags a bit, because some datasets have a really large number of them.
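
A rough illustration of the log-probs idea from the first bullet: a sketch against the OpenAI chat completions API, not code in this PR; the model choice and the reranking_prompt placeholder are assumptions:

import math

from openai import OpenAI

reranking_prompt = "..."  # the 4k-5k token reranking prompt described above

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",  # any model that exposes token logprobs
    messages=[{"role": "user", "content": reranking_prompt}],
    logprobs=True,
)
# Use the mean per-token probability of the answer as a crude confidence
# score, instead of asking the model to self-report its confidence.
token_logprobs = [t.logprob for t in response.choices[0].logprobs.content]
confidence = math.exp(sum(token_logprobs) / len(token_logprobs))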

Dataset Index File Changes:

These are changes for retrieving the dataset index in real time from HuggingFace, which used to take 7-10 mins [it now takes a few seconds].
Created its own folder in dataset_retriever.
It contains the light preprocessing steps (preprocessing.py) for the datasets (where we try to filter out as much as possible before heavy processing), followed by the heavy processing (retrieve_dataset_info_with_configs). Light preprocessing is expected to make no API calls and use no multiprocessing.
The flow of the heavy processing is as follows

  1. Iterate through the datasets, and check whether each is valid using the API.
  2. Get all config information for a dataset: load a streaming version of the dataset, and if it takes too long to load, or the columns contain Image/Video fields, skip the config. Flatten the dataset at this level itself (streaming datasets don't have HuggingFace support for this, so I have implemented it; see the sketch after this list).
  3. All of this is wrapped in a try/except block in case loading the dataset throws an error (e.g. the dataset requires an additional pip install of libraries, can't be streamed, etc.).
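
A minimal sketch of the manual flattening mentioned in step 2; the helper name is illustrative and the PR's actual implementation may differ:

from typing import Any


def flatten_row(row: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Recursively flatten nested dict columns, so that e.g.
    {"answers": {"text": ...}} becomes {"answers.text": ...},
    mirroring what datasets.Dataset.flatten() does for non-streaming datasets."""
    flat: dict[str, Any] = {}
    for key, value in row.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_row(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat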

The rest of the code is just multiprocessing (processing chunks of datasets in parallel, writing them to temp files, and merging the temp files at the end); a rough sketch follows.
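
A rough sketch of that flow, assuming retrieve_dataset_info_with_configs returns a dict (or None) per dataset name; the helper names and chunk size are illustrative:

import json
import multiprocessing
import tempfile


def process_chunk(dataset_names: list[str]) -> str:
    """Process one chunk of datasets and write the results to a temp file."""
    results = [retrieve_dataset_info_with_configs(name) for name in dataset_names]
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump([r for r in results if r is not None], f)
        return f.name


def build_index(all_names: list[str], chunk_size: int = 100) -> list[dict]:
    """Process chunks in parallel, then merge the temp files at the end."""
    chunks = [all_names[i : i + chunk_size] for i in range(0, len(all_names), chunk_size)]
    with multiprocessing.Pool() as pool:
        temp_files = pool.map(process_chunk, chunks)
    merged: list[dict] = []
    for path in temp_files:
        with open(path) as f:
            merged.extend(json.load(f))
    return merged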

Blocked by:

The reranking file should be uploaded to the photron server before this is merged.

@ritugala ritugala requested review from viswavi and neubig December 1, 2023 21:42
@ritugala ritugala changed the title Dataset Reranking Changes Dataset Reranking Dec 1, 2023
@ritugala ritugala marked this pull request as draft December 4, 2023 16:25
@ritugala ritugala marked this pull request as ready for review December 4, 2023 22:41
@neubig (Collaborator) left a comment

Hi @ritugala , thanks a lot for this contribution! I've left several comments, but I stopped halfway through the review because there are a few recurring issues that would be good to fix throughout the whole PR, and then I can take another look:

  1. Don't commit large data files to the repo; they can be posted online, and I'll help with doing this (message me with any that you want posted).
  2. Please make sure all functions (at least all public functions) have type annotations, and docstrings that indicate the arguments and return values.
  3. I made a suggestion that we move data prep scripts out of the main prompt2model/ library and into a higher level scripts/ directory. I realize that we did not do this before, but because this PR adds many data processing scripts I think it's probably a good opportunity to do so.

And thanks again for the nice work!

@neubig (Collaborator) commented Dec 13, 2023

Hi @ritugala I can take another look when you've had a moment to finish the above revisions and fix the CI!

@ritugala ritugala requested a review from neubig December 14, 2023 14:28
@neubig (Collaborator) left a comment

Hi @ritugala , really sorry it took so long to review this.
Overall this looks great! I checked it very briefly and made a few small suggestions. If you can reflect those I'm happy to merge, and if you think any shouldn't be changed it's OK to merge things in anyway.

Comment on lines 230 to 233
def automatic_column_selection(
    instruction: str,
    dataset_name: str,
    dataset_description: str,
    dataset_columns: str,
    example_rows: dict,
    dataset_info: dict,
) -> tuple[list[str], str]:
I think the previous design of this function is better, as it makes the requirements of what needs to be passed into the function explicit. If we pass in a dict then we don't have guarantees that the dict contains the correct info. Let's revert the changes here.

dataset_description,
train_columns_formatted,
dataset["train"][0],
prompt_spec.instruction, dataset_info
Similarly, revert

prompt2model/utils/parse_responses.py (outdated; resolved)
scripts/dataset_index/retrieve_dataset_info.py (outdated; resolved)
scripts/dataset_index/retrieve_dataset_info.py (outdated; resolved)
@ritugala ritugala merged commit f2eabc1 into main Jan 12, 2024
8 checks passed
@ritugala ritugala deleted the ritu-dataset-reranking branch January 12, 2024 12:52