Channel: Active questions tagged python - Stack Overflow

GPU/RAM out of Memory PyTorch/Transformers


Context: I have six 3070 Tis (48 GB of VRAM total), 8–32 GB of RAM, and 150 GB of SSD storage on my Ubuntu 20.04 desktop. I am attempting to fine-tune meta-llama/Llama-2-7b-chat-hf with a very small dataset as a proof of concept. I took the first few entries from this dataset on Kaggle.

When I initially ran my script using DataParallel with 8 GB of RAM, I would run out of memory and it would fail (no surprise). I added a few sticks to get up to 32 GB and was finally met with this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 7.58 GiB of which 17.88 MiB is free. Including non-PyTorch memory, this process has 7.53 GiB memory in use. Of the allocated memory 7.37 GiB is allocated by PyTorch, and 5.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
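For reference, the allocator hint from the error message can be applied by setting an environment variable before launching the script. A minimal sketch — the split size of 128 is just an example value and `train.py` is a placeholder name; note this only mitigates fragmentation and will not make a full 7B fine-tune fit in 8 GB per card:

```shell
# Cap the CUDA caching allocator's block split size to reduce fragmentation.
# 128 is an example value, not a recommendation from the docs.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python train.py
```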

Using Glances, I saw that it was only using my first GPU rather than spreading the work across all of them. I decided to try my hand at DistributedDataParallel instead and still ran out of memory, even with 32 GB.
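For intuition on why plain DP/DDP runs out of memory here: both approaches replicate the full model, gradients, and optimizer states on every GPU, so each 8 GB card would need to hold the entire training state. A back-of-the-envelope estimate, assuming fp32 weights and AdamW (which keeps two extra fp32 states per parameter) and ignoring activations:

```python
# Rough per-replica memory for full fine-tuning a 7B model with AdamW.
# Order-of-magnitude estimate only; activations are ignored.
n_params = 7e9                        # Llama-2-7B parameter count (approx.)
bytes_fp32 = 4

weights = n_params * bytes_fp32       # model weights
grads = n_params * bytes_fp32         # gradients
adam = n_params * bytes_fp32 * 2      # AdamW momentum + variance states

total_gb = (weights + grads + adam) / 1024**3
print(f"~{total_gb:.0f} GB per DataParallel replica")  # ~104 GB, far beyond one 8 GB 3070 Ti
```

This is why sharding approaches (ZeRO, FSDP) or parameter-efficient methods matter on this hardware: something has to split or shrink that per-GPU footprint.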

I know there are many ways to optimize training/fine-tuning, which is why I came here for help. Would you recommend ZeRO, DP, DDP, PP, TP, or a combination of them, such as Case 3 in the Hugging Face documentation? Accelerate? Optimum (NVIDIA or others)? TrainingArguments?
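For concreteness on the ZeRO option: from what I've read, a minimal DeepSpeed ZeRO-3 config with CPU offload looks something like the sketch below (the filename and values are illustrative, not tuned), passed to the Trainer via `TrainingArguments(deepspeed="ds_config.json")`:

```json
{
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4
}
```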

There are just so many different options, and I'd love for someone to point me in the right direction and give an example of some working code I could test on my machine. To recreate the environment:

mkdir test && cd test
python3 -m venv venv
source venv/bin/activate
pip install torch transformers
# create file
# download dataset to same directory

Here is the almost-working Python code for just DP:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
import json

# Load dataset from JSON file
with open('small_en_medical_dialog.json', 'r') as file:
    dataset = json.load(file)

# Format the data
formatted_data = []
for item in dataset:
    input_text = "Description: " + item["Description"] + " Patient: " + item["Patient"]
    target_text = item["Doctor"]
    formatted_data.append({"input": input_text, "target": target_text})

# Define the custom Dataset class
class DoctorPatientDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.data = data
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        input_encoding = self.tokenizer(
            item['input'],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        target_encoding = self.tokenizer(
            item['target'],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': input_encoding['input_ids'].flatten(),
            'attention_mask': input_encoding['attention_mask'].flatten(),
            'labels': target_encoding['input_ids'].flatten()
        }

# Initialize the model and tokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set padding token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model = torch.nn.DataParallel(model)

# Training setup
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3

# Create dataset and DataLoader
train_dataset = DoctorPatientDataset(formatted_data, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=3, shuffle=True)

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}  # move tensors to GPU
        outputs = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])
        loss = outputs.loss.mean()  # DataParallel returns one loss per GPU; reduce before backward
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_dataloader)}")

# Save the fine-tuned model (unwrap the DataParallel wrapper first)
model.module.save_pretrained("my_finetuned_model")

# Inference
model.eval()
query = "Hello, how are you?"
input_ids = tokenizer.encode(query, return_tensors='pt').to(device)
with torch.no_grad():
    output = model(input_ids=input_ids)
    response_ids = output.logits.argmax(-1)
    response = tokenizer.decode(response_ids[0], skip_special_tokens=True)
print(response)

