
Support phi 3.5 #1800

Merged
1 commit merged into OpenNMT:master on Oct 17, 2024
Conversation

minhthuc2502 (Collaborator)

No description provided.

BBC-Esq commented Oct 15, 2024

After converting to int8_bfloat16, I get this error when trying to run it in a script:

  File "D:\Scripts\bench_chat\ct2_phi3.py", line 87, in main
    results_batch = generator.generate_batch(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: expected storage to be of type float32, but is of type bfloat16

I also received the exact same message when I first converted the model to plain bfloat16:

    results_batch = generator.generate_batch(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: expected storage to be of type float32, but is of type bfloat16

The latter is really weird because the model card says that it's originally in bfloat16...

[screenshot]
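For reference, the conversion step is roughly equivalent to the following (a minimal sketch using CTranslate2's TransformersConverter API; the model ID and output directory are placeholders, not necessarily the exact values used):

import ctranslate2

# Minimal conversion sketch. The Hugging Face model ID and the output
# directory below are placeholders for illustration only.
converter = ctranslate2.converters.TransformersConverter(
    "microsoft/Phi-3.5-mini-instruct",   # source model on the Hugging Face Hub
    trust_remote_code=True,              # may be needed for Phi-3.5, depending on the transformers version
)
converter.convert(
    "Phi-3.5-mini-instruct-ct2-int8_bfloat16",   # output directory
    quantization="int8_bfloat16",                # int8 weights with bfloat16 compute
)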

BBC-Esq commented Oct 15, 2024

See this if it helps:

#1792

minhthuc2502 (Collaborator, Author)

I quantized Phi-3.5 to int8_bfloat16 and didn't get any error like the one you mention above at inference time. Please provide more detail on how to reproduce this and which model you used.

BBC-Esq commented Oct 16, 2024

Sure...

The script I used to convert it is located here:

https://github.com/BBC-Esq/Ctranslate2-Converter/blob/main/Ctranslate2-Converter/convert_ctranslate2.py

And the script I used to run it is as follows:

import os
import ctranslate2
from transformers import AutoTokenizer

model_dir = r"D:\Scripts\bench_chat\models\Phi-3.5-mini-instruct-ct2-bfloat16"

def build_prompt():
    system_message = "You are a helpful AI assistant."
    user_message = "Tell me a short joke."
    
    prompt = f"""<s><|system|>
{system_message}<|end|>
<|user|>
{user_message}<|end|>
<|assistant|>
"""
    return prompt

def main():
    print(f"Loading the model: {os.path.basename(model_dir)}...")
    
    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",
        compute_type="bfloat16"
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    prompt = build_prompt()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    
    print("Generating response...")
    results = generator.generate_batch(
        [tokens],
        include_prompt_in_result=False,
        max_batch_size=4096,
        batch_type="tokens",
        beam_size=1,
        num_hypotheses=1,
        max_length=512,
        sampling_temperature=0.00,
    )
    
    output = tokenizer.decode(results[0].sequences_ids[0])
    
    print("\nGenerated response:")
    print(output)

if __name__ == "__main__":
    main()

minhthuc2502 (Collaborator, Author)

It seems like you are running this script on a GPU with compute capability < 8.x, which does not support the bfloat16 compute type. Remove the line compute_type="bfloat16" and try again. If you have any further problems, feel free to open a new issue.
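For example, a minimal sketch (the model path is a placeholder; with compute_type omitted or set to "auto", CTranslate2 picks a compute type the device actually supports instead of forcing bfloat16):

import ctranslate2

model_dir = "Phi-3.5-mini-instruct-ct2-bfloat16"   # placeholder path

generator = ctranslate2.Generator(
    model_dir,
    device="cuda",
    # Omit compute_type, or use "auto", so CTranslate2 falls back to a
    # compute type the GPU supports instead of failing on bfloat16.
    compute_type="auto",
)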

minhthuc2502 merged commit 100d49c into OpenNMT:master on Oct 17, 2024
13 checks passed
BBC-Esq commented Oct 17, 2024

> It seems like you are running this script on a GPU with compute capability < 8.x, which does not support the bfloat16 compute type. Remove the line compute_type="bfloat16" and try again. If you have any further problems, feel free to open a new issue.

Are you referring to the CUDA compute capability? The GPU I'm running it on is an RTX 4090, so it supports compute capability higher than 8...
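For reference, the compute types that the installed CTranslate2 build reports for the GPU can be checked directly (a small sketch; an RTX 4090 has compute capability 8.9, so bfloat16 should appear in the list):

import ctranslate2

# Print the compute types CTranslate2 reports as supported on the first CUDA device.
# On an RTX 4090 (compute capability 8.9) this should include "bfloat16".
print(ctranslate2.get_supported_compute_types("cuda", device_index=0))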

BBC-Esq commented Oct 17, 2024

Also, I just commented out that line and got the same thing...

[screenshot]

BBC-Esq commented Oct 17, 2024

Do you still want me to open a separate issue? It seems redundant.
