I want to run Llama 2 on a GPU, since generating answers on the CPU takes forever. I have access to an NVIDIA A6000 through a Jupyter notebook. I have installed everything and the responses are fine, but generation is far too slow for my research purposes.
```python
import torch
import transformers
from transformers import LlamaForCausalLM, LlamaTokenizer
import setGPU

model_dir = "llama/llama-2-7b-chat-hf"
model = LlamaForCausalLM.from_pretrained(model_dir)
tokenizer = LlamaTokenizer.from_pretrained(model_dir)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
)

sequences = pipeline(
    'I wanna hear some news. What is up today',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=400,
)

for seq in sequences:
    print(f"{seq['generated_text']}")
```

This is my current code. When I run nvidia-smi in a terminal, GPU utilization always stays at 0% while CPU RAM usage climbs steeply, so it is clearly running 100% on the CPU. How can I make it run on the GPU?
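For reference, whether PyTorch can see the GPU at all can be sanity-checked from the notebook with standard `torch.cuda` calls (this is just a generic diagnostic, not something from the tutorial):

```python
import torch

# True only if this PyTorch build has CUDA support and a GPU is visible
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # Name of the first visible device, e.g. an A6000
    print(torch.cuda.get_device_name(0))
```

If this prints False, the installed PyTorch is a CPU-only build and no amount of pipeline configuration will reach the GPU.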
I followed this tutorial directly from Meta: https://ai.meta.com/blog/5-steps-to-getting-started-with-llama-2/