Channel: Active questions tagged python - Stack Overflow

GPU/RAM out of Memory PyTorch/Transformers


Context: I have six 3070 Tis (48 GB of VRAM total), 8–32 GB of RAM, and 150 GB of SSD storage on my Ubuntu 20.04 desktop. I am attempting to fine-tune meta-llama/Llama-2-7b-chat-hf with a very small dataset as a proof of concept. I took the first few entries from this dataset on Kaggle.

When I initially ran my script using DataParallel with 8 GB of RAM, I would run out of memory and it would fail (no surprise). I added a few sticks to get up to 32 GB and was finally met with this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 7.58 GiB of which 17.88 MiB is free. Including non-PyTorch memory, this process has 7.53 GiB memory in use. Of the allocated memory 7.37 GiB is allocated by PyTorch, and 5.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
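For reference, the allocator hint from the error message can be applied by setting an environment variable before launching the script. A minimal sketch — the split size of 128 is just an example value and `train.py` is a placeholder name; note this only mitigates fragmentation and will not make a full 7B fine-tune fit in 8 GB per card:

```shell
# Cap the CUDA caching allocator's block split size to reduce fragmentation.
# 128 is an example value, not a recommendation from the docs.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python train.py
```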

Using Glances, I saw that it was only using my first GPU rather than spreading the work across all of them. I decided to try my hand at DistributedDataParallel instead and still ran out of memory, even with 32 GB.
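For intuition on why plain DP/DDP runs out of memory here: both approaches replicate the full model, gradients, and optimizer states on every GPU, so each 8 GB card would need to hold the entire training state. A back-of-the-envelope estimate, assuming fp32 weights and AdamW (which keeps two extra fp32 states per parameter) and ignoring activations:

```python
# Rough per-replica memory for full fine-tuning a 7B model with AdamW.
# Order-of-magnitude estimate only; activations are ignored.
n_params = 7e9                        # Llama-2-7B parameter count (approx.)
bytes_fp32 = 4

weights = n_params * bytes_fp32       # model weights
grads = n_params * bytes_fp32         # gradients
adam = n_params * bytes_fp32 * 2      # AdamW momentum + variance states

total_gb = (weights + grads + adam) / 1024**3
print(f"~{total_gb:.0f} GB per DataParallel replica")  # ~104 GB, far beyond one 8 GB 3070 Ti
```

This is why sharding approaches (ZeRO, FSDP) or parameter-efficient methods matter on this hardware: something has to split or shrink that per-GPU footprint.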

I know there are many ways to optimize training/fine-tuning, which is why I came here for help. Would you recommend ZeRO, DP, DDP, PP, TP, or a combination of them, such as Case 3 in the Hugging Face documentation? Accelerate? Optimum (NVIDIA or others)? TrainingArguments?
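For concreteness on the ZeRO option: from what I've read, a minimal DeepSpeed ZeRO-3 config with CPU offload looks something like the sketch below (the filename and values are illustrative, not tuned), passed to the Trainer via `TrainingArguments(deepspeed="ds_config.json")`:

```json
{
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4
}
```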

There are just so many different options, and I'd love for someone to point me in the right direction and give an example of some working code I could test on my machine. To recreate the environment:

mkdir test && cd test
python3 -m venv venv
source venv/bin/activate
pip install torch transformers
# create file
# download dataset to same directory

Here is the almost-working Python code for just DP:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
import json

# Load dataset from JSON file
with open('small_en_medical_dialog.json', 'r') as file:
    dataset = json.load(file)

# Format the data
formatted_data = []
for item in dataset:
    input_text = "Description: " + item["Description"] + " Patient: " + item["Patient"]
    target_text = item["Doctor"]
    formatted_data.append({"input": input_text, "target": target_text})

# Define the custom Dataset class
class DoctorPatientDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.data = data
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        input_encoding = self.tokenizer(
            item['input'],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        target_encoding = self.tokenizer(
            item['target'],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': input_encoding['input_ids'].flatten(),
            'attention_mask': input_encoding['attention_mask'].flatten(),
            'labels': target_encoding['input_ids'].flatten()
        }

# Initialize the model and tokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set padding token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model = torch.nn.DataParallel(model)

# Training setup
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3

# Create dataset and DataLoader
train_dataset = DoctorPatientDataset(formatted_data, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=3, shuffle=True)

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}  # move tensors to GPU
        outputs = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])
        loss = outputs.loss.mean()  # DataParallel returns one loss per GPU; reduce before backward
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_dataloader)}")

# Save the fine-tuned model (unwrap the DataParallel wrapper first)
model.module.save_pretrained("my_finetuned_model")

# Inference
model.eval()
query = "Hello, how are you?"
input_ids = tokenizer.encode(query, return_tensors='pt').to(device)
with torch.no_grad():
    output = model(input_ids=input_ids)
    response_ids = output.logits.argmax(-1)
    response = tokenizer.decode(response_ids[0], skip_special_tokens=True)
print(response)

