
Integrate new prompt mechanism into training #40

Merged: 13 commits merged into main from integrate-prompt-mechanism on Oct 4, 2023

Conversation

jeffreymeetkai (Collaborator):

  • move train.py to the root folder
  • implement token and embedding resizing to add the new tokens
  • fix compute_metrics to use len(tokenizer) instead of tokenizer.vocab_size when reshaping the tensors/arrays (see the sketch below)
  • integrate prepare_training_inputs into the CustomDataset class
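
A minimal sketch of why the len(tokenizer) change matters (the added token strings below are placeholders, not the actual new tokens): tokenizer.vocab_size reports only the base vocabulary and does not grow when special tokens are added, while len(tokenizer) does, and the resized embedding/logit dimension follows len(tokenizer).

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("musabgultekin/functionary-7b-v1", legacy=True)
tokenizer.add_special_tokens({"additional_special_tokens": ["<placeholder_token_1>", "<placeholder_token_2>"]})

print(tokenizer.vocab_size)  # base vocabulary only; unchanged by add_special_tokens
print(len(tokenizer))        # base vocabulary + added tokens

# After model.resize_token_embeddings(len(tokenizer)), the logits have
# len(tokenizer) classes, so any reshape must use len(tokenizer):
# predictions.reshape(-1, len(tokenizer))        # correct
# predictions.reshape(-1, tokenizer.vocab_size)  # wrong once tokens were added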

train.py Outdated
Collaborator:
When you said putting it into the root, I thought you meant under "functionary/".

Collaborator:

As this can create confusion, we should move all training modules under functionary/train/.

train.py Outdated
replace_llama_attn_with_flash_attn()
from functionary.prompt import EndToken
from train.custom_datasets import CustomDataset, split_data
from train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
Collaborator:

Transformers must be imported after the patch. Let's revert to the previous order.
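
A minimal sketch of the order being requested (module paths taken from the snippet above): apply the flash-attention monkey patch first, then import transformers, so the patched Llama attention is what the rest of the script picks up.

from train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn

# Apply the patch before transformers is imported by the training script.
replace_llama_attn_with_flash_attn()

import transformers  # imported only after the patch has been applied

from functionary.prompt import EndToken
from train.custom_datasets import CustomDataset, split_data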

train.py Outdated
special_tokens_dict = {"additional_special_tokens": added_tokens}
smart_tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens_dict, tokenizer=tokenizer, model=model
)
Collaborator:

I think we should refactor tokenizer initialization into a separate function.
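
A rough sketch of what such a refactor could look like; the helper name and argument list are hypothetical, and EndToken is assumed to be an enum of the new special-token strings.

def initialize_tokenizer(model, model_name_or_path: str, model_max_length: int):
    """Hypothetical helper: load the tokenizer, register the new special tokens,
    and resize the model embeddings so tokenizer and model stay consistent."""
    tokenizer = transformers.LlamaTokenizer.from_pretrained(
        model_name_or_path, model_max_length=model_max_length, legacy=True
    )
    added_tokens = [token.value for token in EndToken]  # assumed enum of the new token strings
    special_tokens_dict = {"additional_special_tokens": added_tokens}
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens_dict, tokenizer=tokenizer, model=model
    )
    return tokenizer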

prompt_str = (
    "system:\n"
    + generate_schema_from_functions(functions=messages["functions"])
    + "\nsystem:\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary\n"
Collaborator:

We should let the get_prompt_from_messages function prepare the input, so prepend this as a message dict to the original messages.

Collaborator:

With the proper role and content, etc.
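
A minimal sketch of the suggestion, assuming the usual role/content message shape; the exact keys expected by get_prompt_from_messages may differ.

# Hypothetical: prepend the schema and the system message as ordinary message
# dicts, then let get_prompt_from_messages assemble the final prompt string.
messages_with_system = [
    {"role": "system", "content": generate_schema_from_functions(functions=functions)},
    {"role": "system", "content": SYSTEM_MESSAGE},
] + messages
prompt_str = get_prompt_from_messages(messages_with_system)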

"""Prepares a list of messages for the model by calling `prepare_message_for_model` function on each of them and
concatenating the returned input_ids and targets. Also, the function merges the text of the messages.
def prepare_training_inputs(
messages: List[Dict],
Collaborator:

are you sure this is a list?

Collaborator:

Maybe we should add some CI action to catch this kind of thing.

"input_ids": ret["input_ids"],
"labels": ret["labels"],
"attention_mask": ret["attention_mask"],
"input_ids": ret[1]["input_ids"],
Collaborator:

Why is this ret[1]? Can we find a more explicit way? If we cannot, we should at least add some clarification.
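
One possible, more explicit shape, assuming prepare_training_inputs currently returns a tuple whose second element holds the tokenized inputs (the actual return type is not shown in this thread): return named fields so the call site does not depend on positional indices.

# Hypothetical sketch only: return a dict instead of a tuple.
def prepare_training_inputs(messages, tokenizer):
    prompt_str = get_prompt_from_messages(messages)            # build the prompt text
    model_inputs = tokenizer(prompt_str, return_tensors="pt")  # labels omitted in this sketch
    return {"final_prompt": prompt_str, "inputs": model_inputs}

# The call site then reads by name instead of by index:
ret = prepare_training_inputs(example, tokenizer)
item = {
    "input_ids": ret["inputs"]["input_ids"],
    "attention_mask": ret["inputs"]["attention_mask"],
}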


import torch
import torch.distributed
from llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
import transformers
Collaborator:

Transformers must be imported after the monkey patch.

self.assertEqual(
final_prompt.strip(), self.final_prompt.strip(), "wrong final prompt from: get_prompt_from_messages"
final_prompt.strip(),
Collaborator:

Do you know why .strip() is necessary? Is there a bug?

Collaborator (Author):

Fixed


def test_prepare_training_inputs(self):
"""this function is used to test function: prepare_training_inputs"""
# note that must set legacy=True, read more: https://github.com/huggingface/transformers/issues/25176
tokenizer = LlamaTokenizer.from_pretrained("musabgultekin/functionary-7b-v1", legacy=True)
tokenizer = LlamaTokenizer.from_pretrained(
"musabgultekin/functionary-7b-v1", legacy=True
Collaborator:

Why is this using legacy=True and the fast tokenizer while our training uses the slow tokenizer with legacy=False? Can we keep the two consistent?

Collaborator (Author):

I've set both to use the fast tokenizer with legacy=True.
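
For reference, a minimal sketch of the consistent setup described in the reply (model name taken from the test above):

from transformers import LlamaTokenizerFast

# Same fast tokenizer and legacy flag used in both training and tests.
tokenizer = LlamaTokenizerFast.from_pretrained("musabgultekin/functionary-7b-v1", legacy=True)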

Collaborator:

Can we rename this file to something more appropriate?

Collaborator:

In general, let's spend more time on naming things

@@ -5,6 +5,10 @@

import torch

from functionary.schema import generate_schema_from_functions

SYSTEM_MESSAGE = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"""
Collaborator:

There was already a system message in the inference code. Either delete this one and import that one, or delete the other one.

musab-mk merged commit f12598c into main on Oct 4, 2023.
musab-mk deleted the integrate-prompt-mechanism branch on October 4, 2023 at 08:41.