
[Python] How to Run 70B LLMs on a Single 4GB GPU

by YJHTPII 2024. 4. 25.

Source: How to Run 70B LLMs on a Single 4GB GPU (generativeai.pub)
https://generativeai.pub/how-to-run-70b-llms-on-a-single-4gb-gpu-d1c61ed5258c
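
The trick behind the headline is AirLLM's layered inference: instead of holding all ~140GB of a 70B fp16 checkpoint in VRAM at once, it splits the model into its individual transformer layers and streams them through the GPU one at a time, so peak memory stays close to the size of a single layer (roughly 1.6GB for a 70B model). The cost is speed: every generated token pages all layers through the GPU, so this setup suits offline, batch-style generation rather than interactive chat. After installing the package (pip install airllm), the whole pipeline is only a few lines: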

# https://generativeai.pub/how-to-run-70b-llms-on-a-single-4gb-gpu-d1c61ed5258c

from airllm import AutoModel

MAX_LENGTH = 128

# load the model from the Hugging Face hub (the first run downloads the
# checkpoint and splits it into per-layer files on disk)
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or load the model from a local path
# model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

# prepare the input text
input_text = [
    'What is the capital of the United States?',
]

# tokenize the input text
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

# generate the output text
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

# decode the output text
output = model.tokenizer.decode(generation_output.sequences[0])

# print the output text
print(output)
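
If some accuracy loss is acceptable, the AirLLM README also describes an optional block-wise quantization mode. The sketch below assumes the compression argument documented there ('4bit' or '8bit'), which additionally requires the bitsandbytes package; because the per-layer files shrink, it mainly speeds up the disk-to-GPU streaming:

from airllm import AutoModel

# assumption: the 'compression' option as documented in the AirLLM README
# ('4bit' or '8bit'); requires `pip install bitsandbytes`
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit')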