I am trying to encode some text using Hugging Face's mT5-base model. I am using the model as shown below:
```python
import torch
from transformers import MT5EncoderModel, AutoTokenizer

model = MT5EncoderModel.from_pretrained("google/mt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

def get_t5_embeddings(texts):
    last_hidden_state = model(
        input_ids=tokenizer(texts, return_tensors="pt", padding=True).input_ids
    ).last_hidden_state
    # Max-pool over the sequence dimension to get one vector per text
    pooled_sentence = torch.max(last_hidden_state, dim=1)
    return pooled_sentence[0].detach().numpy()
```

I was doing some experiments when I noticed that the same text had a low cosine similarity score with itself. After some digging, I realized that the model was returning very different embeddings if I did the encoding in batches. To validate this, I ran a small experiment that generated embeddings for "Hello" and for lists of an increasing number of "Hello"s, comparing the embedding of the lone "Hello" with that of the first "Hello" in each list (both of which should be the same).
```python
for i in range(1, 10):
    print(i, (get_t5_embeddings(["Hello"])[0] == get_t5_embeddings(["Hello"] * i)[0]).sum())
```

This prints, for each batch size, the number of values in the two embeddings that match exactly. This was the result:
```
1 768
2 768
3 768
4 768
5 768
6 768
7 768
8 27
9 27
```

Every time I run it, I get mismatches once the batch size exceeds 7.
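A note on how I am comparing the vectors: the `==` comparison above flags even tiny floating-point deviations as mismatches. Here is a minimal sketch using stand-in NumPy arrays (not actual model outputs) of the difference between exact comparison and a tolerance-based one:

```python
import numpy as np

# Stand-ins for get_t5_embeddings(["Hello"])[0] and the batched result;
# the first components differ only in the 7th decimal place.
a = np.array([1.0000001, 2.0, 3.0])
b = np.array([1.0000002, 2.0, 3.0])

print((a == b).sum())                # exact matches: 2 of 3 components
print(np.allclose(a, b, atol=1e-5))  # True: the vectors are nearly identical
```

So an exact-match count like 27 out of 768 could still mean the vectors are numerically very close.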
Why am I getting different embeddings and how do I fix this?