
Support phi 3.5 #1800

Merged
1 commit merged into OpenNMT:master on Oct 17, 2024
Conversation

minhthuc2502 (Collaborator)

No description provided.

BBC-Esq commented Oct 15, 2024

After converting to int8_bfloat16, I get this error when trying to run it in a script:

  File "D:\Scripts\bench_chat\ct2_phi3.py", line 87, in main
    results_batch = generator.generate_batch(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: expected storage to be of type float32, but is of type bfloat16

I also received the exact same message when I first converted the model to plain bfloat16:

    results_batch = generator.generate_batch(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: expected storage to be of type float32, but is of type bfloat16

The latter is really weird because the model card says that it's originally in bfloat16...

[screenshot]
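For reference, the conversion step is roughly equivalent to the following (a minimal sketch using CTranslate2's TransformersConverter API; the model ID and output directory are placeholders, not necessarily the exact values used):

import ctranslate2

# Minimal conversion sketch. The Hugging Face model ID and the output
# directory below are placeholders for illustration only.
converter = ctranslate2.converters.TransformersConverter(
    "microsoft/Phi-3.5-mini-instruct",   # source model on the Hugging Face Hub
    trust_remote_code=True,              # may be needed for Phi-3.5, depending on the transformers version
)
converter.convert(
    "Phi-3.5-mini-instruct-ct2-int8_bfloat16",   # output directory
    quantization="int8_bfloat16",                # int8 weights with bfloat16 compute
)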

BBC-Esq commented Oct 15, 2024

See this if it helps:

#1792

minhthuc2502 (Collaborator, Author)

I quantized Phi-3.5 to int8_bfloat16 and didn't get any error like the one you mention above at inference time. Please provide more detail on how to reproduce this and which model you used.

BBC-Esq commented Oct 16, 2024

Sure...

The script I used to convert it is located here:

https://github.com/BBC-Esq/Ctranslate2-Converter/blob/main/Ctranslate2-Converter/convert_ctranslate2.py

And the script I used to run it is as follows:

import os
import ctranslate2
from transformers import AutoTokenizer

model_dir = r"D:\Scripts\bench_chat\models\Phi-3.5-mini-instruct-ct2-bfloat16"

def build_prompt():
    system_message = "You are a helpful AI assistant."
    user_message = "Tell me a short joke."
    
    prompt = f"""<s><|system|>
{system_message}<|end|>
<|user|>
{user_message}<|end|>
<|assistant|>
"""
    return prompt

def main():
    print(f"Loading the model: {os.path.basename(model_dir)}...")
    
    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",
        compute_type="bfloat16"
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    prompt = build_prompt()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    
    print("Generating response...")
    results = generator.generate_batch(
        [tokens],
        include_prompt_in_result=False,
        max_batch_size=4096,
        batch_type="tokens",
        beam_size=1,
        num_hypotheses=1,
        max_length=512,
        sampling_temperature=0.00,
    )
    
    output = tokenizer.decode(results[0].sequences_ids[0])
    
    print("\nGenerated response:")
    print(output)

if __name__ == "__main__":
    main()

minhthuc2502 (Collaborator, Author)

It seems like you are running this script on a GPU with compute capability < 8.x, which does not support the bfloat16 compute type. Remove the line compute_type="bfloat16" and try again. If you have any further problems, feel free to open a new issue.
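For example, a minimal sketch (the model path is a placeholder; with compute_type omitted or set to "auto", CTranslate2 picks a compute type the device actually supports instead of forcing bfloat16):

import ctranslate2

model_dir = "Phi-3.5-mini-instruct-ct2-bfloat16"   # placeholder path

generator = ctranslate2.Generator(
    model_dir,
    device="cuda",
    # Omit compute_type, or use "auto", so CTranslate2 falls back to a
    # compute type the GPU supports instead of failing on bfloat16.
    compute_type="auto",
)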

minhthuc2502 merged commit 100d49c into OpenNMT:master on Oct 17, 2024
13 checks passed
BBC-Esq commented Oct 17, 2024

> It seems like you are running this script on a GPU with compute capability < 8.x, which does not support the bfloat16 compute type. Remove the line compute_type="bfloat16" and try again. If you have any further problems, feel free to open a new issue.

Are you referring to the CUDA compute capability? The GPU I'm running it on is an RTX 4090, so it supports compute capability higher than 8...
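For reference, the compute types that the installed CTranslate2 build reports for the GPU can be checked directly (a small sketch; an RTX 4090 has compute capability 8.9, so bfloat16 should appear in the list):

import ctranslate2

# Print the compute types CTranslate2 reports as supported on the first CUDA device.
# On an RTX 4090 (compute capability 8.9) this should include "bfloat16".
print(ctranslate2.get_supported_compute_types("cuda", device_index=0))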

BBC-Esq commented Oct 17, 2024

Also, I just commented out that line and got the same thing...

[screenshot]

BBC-Esq commented Oct 17, 2024

Do you still want me to open a separate issue? It seems redundant.
