Integrate new prompt mechanism into training #40
Conversation
train.py
Outdated
When you said putting it into the root, I thought you meant under "functionary/".
As this can create confusion, we should move all training modules under functionary/train/.
train.py
Outdated
replace_llama_attn_with_flash_attn()
from functionary.prompt import EndToken
from train.custom_datasets import CustomDataset, split_data
from train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
Transformers must be imported after the patch. Let's revert to the previous order.
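A minimal sketch of the ordering being asked for, assuming the patch works by rebinding attributes that later imports pick up:

# Apply the flash-attention monkey patch first...
from train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn

replace_llama_attn_with_flash_attn()

# ...and only then import transformers, so the patched attention is the one in use.
import transformers  # noqa: E402

from functionary.prompt import EndToken  # noqa: E402
from train.custom_datasets import CustomDataset, split_data  # noqa: E402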
train.py
Outdated
special_tokens_dict = {"additional_special_tokens": added_tokens}
smart_tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens_dict, tokenizer=tokenizer, model=model
)
I think we should refactor the tokenizer initialization into a separate function.
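A possible shape for that refactor, as a sketch only (the initialize_tokenizer name and its signature are illustrative, not from this PR):

import transformers

def initialize_tokenizer(model, model_name_or_path, added_tokens, cache_dir=None):
    # Illustrative helper: load the tokenizer, register the new special tokens,
    # and resize the model embeddings in one place.
    tokenizer = transformers.LlamaTokenizer.from_pretrained(
        model_name_or_path, cache_dir=cache_dir
    )
    special_tokens_dict = {"additional_special_tokens": added_tokens}
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens_dict, tokenizer=tokenizer, model=model
    )
    return tokenizer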
train/custom_datasets.py
Outdated
prompt_str = (
    "system:\n"
    + generate_schema_from_functions(functions=messages["functions"])
    + "\nsystem:\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary\n"
We should let the get_prompt_from_messages function prepare the input. So, prepend this as a message dict to the original messages.
With proper role and content fields, etc.
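A sketch of that idea, assuming get_prompt_from_messages accepts the usual list of role/content dicts and that the raw sample exposes "functions" and "messages" keys (the exact keys are an assumption here):

# Prepend the function schema and the system message as ordinary message
# dicts, then let get_prompt_from_messages render the final prompt string.
prompt_messages = [
    {
        "role": "system",
        "content": generate_schema_from_functions(functions=messages["functions"]),
    },
    {"role": "system", "content": SYSTEM_MESSAGE},
]
prompt_messages += messages["messages"]  # "messages" key assumed
prompt_str = get_prompt_from_messages(prompt_messages)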
train/custom_datasets.py
Outdated
"""Prepares a list of messages for the model by calling `prepare_message_for_model` function on each of them and | ||
concatenating the returned input_ids and targets. Also, the function merges the text of the messages. | ||
def prepare_training_inputs( | ||
messages: List[Dict], |
Are you sure this is a list?
Maybe we should add a CI action to catch these kinds of issues.
train/custom_datasets.py
Outdated
"input_ids": ret["input_ids"], | ||
"labels": ret["labels"], | ||
"attention_mask": ret["attention_mask"], | ||
"input_ids": ret[1]["input_ids"], |
Why is this ret[1]? Can we find a more explicit way? If we cannot, we should at least add some clarification.
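One more explicit option, sketched on the assumption that the function returns a (prompt, inputs) pair; unpacking into named variables documents what index 1 actually is:

# Unpack instead of indexing, so "ret[1]" gets a readable name.
final_prompt, model_inputs = prepare_training_inputs(messages, tokenizer)
batch = {
    "input_ids": model_inputs["input_ids"],
    "labels": model_inputs["labels"],
    "attention_mask": model_inputs["attention_mask"],
}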
functionary/train/train.py
Outdated
import torch
import torch.distributed
from llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
import transformers
Transformers must be imported after the monkey patch.
tests/test_prompt_creation.py
Outdated
self.assertEqual(
    final_prompt.strip(), self.final_prompt.strip(), "wrong final prompt from: get_prompt_from_messages"
    final_prompt.strip(),
Do you know why .strip() is necessary? Is there a bug?
Fixed
def test_prepare_training_inputs(self):
    """this function is used to test function: prepare_training_inputs"""
    # note that must set legacy=True, read more: https://github.com/huggingface/transformers/issues/25176
    tokenizer = LlamaTokenizer.from_pretrained("musabgultekin/functionary-7b-v1", legacy=True)
    tokenizer = LlamaTokenizer.from_pretrained(
        "musabgultekin/functionary-7b-v1", legacy=True
Why is this using legacy=True and a fast tokenizer while our training uses the slow tokenizer with legacy=False? Can we keep the two consistent?
I've set both to use the fast tokenizer with legacy=True.
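One way to protect that invariant going forward is a single helper that both training and the tests import, so the settings cannot drift apart (helper name and module location are illustrative):

# e.g. functionary/train/tokenizer_utils.py (hypothetical module)
from transformers import AutoTokenizer

def load_training_tokenizer(model_name_or_path: str):
    # Single source of truth for tokenizer settings: fast tokenizer with
    # legacy=True, per https://github.com/huggingface/transformers/issues/25176.
    return AutoTokenizer.from_pretrained(model_name_or_path, legacy=True, use_fast=True)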
functionary/train/custom_datasets.py
Outdated
Can we rename this file to something more appropriate?
In general, let's spend more time on naming things.
@@ -5,6 +5,10 @@
import torch

from functionary.schema import generate_schema_from_functions

SYSTEM_MESSAGE = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"""
There was already a system message on the inference side. Let's either delete this one and import that one, or delete the other one.
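For example, keeping a single definition and importing it here (the module holding the inference-side constant is an assumption; adjust to wherever it actually lives):

# custom_datasets.py: reuse the existing constant instead of redefining it.
from functionary.prompt import SYSTEM_MESSAGE  # assumed location of the inference-side message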
len(tokenizer) instead of tokenizer.vocab_size when reshaping the tensors/arrays.
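The distinction, sketched: vocab_size reflects only the base vocabulary, while len(tokenizer) also counts tokens registered via add_special_tokens, so the embedding matrix has to be resized with len(tokenizer):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("musabgultekin/functionary-7b-v1", legacy=True)
model = AutoModelForCausalLM.from_pretrained("musabgultekin/functionary-7b-v1")

# "<illustrative-token>" is a made-up token, purely for demonstration.
tokenizer.add_special_tokens({"additional_special_tokens": ["<illustrative-token>"]})

# tokenizer.vocab_size does NOT include the token added above, but
# len(tokenizer) does -- so any reshaping must use len(tokenizer).
model.resize_token_embeddings(len(tokenizer))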