Channel: Active questions tagged python - Stack Overflow

Why do I get different embeddings when I perform batch encoding in huggingface MT5 model?


I am trying to encode some text using HuggingFace's mt5-base model. I am using the model as shown below

import torch
from transformers import MT5EncoderModel, AutoTokenizer

model = MT5EncoderModel.from_pretrained("google/mt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

def get_t5_embeddings(texts):
    input_ids = tokenizer(texts, return_tensors="pt", padding=True).input_ids
    last_hidden_state = model(input_ids=input_ids).last_hidden_state
    # Max-pool over the sequence dimension to get one vector per text
    pooled_sentence = torch.max(last_hidden_state, dim=1)
    return pooled_sentence[0].detach().numpy()

I was doing some experiments when I noticed that the same text had a low cosine similarity score with itself. I did some digging and realized that the model was returning very different embeddings if I did the encoding in batches. To validate this, I ran a small experiment that generated embeddings for Hello alone and for a list of i copies of Hello, for increasing i, and compared the embedding of the standalone Hello with that of the first Hello in the list (both of which should be the same).
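The similarity check that first surfaced the issue looked roughly like this (cosine_sim is my own helper, not part of transformers; the commented-out lines assume the model and get_t5_embeddings defined above):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The real comparison needs the model defined above, e.g.:
# e1 = get_t5_embeddings(["Hello"])[0]
# e2 = get_t5_embeddings(["Hello"] * 8)[0]
# print(cosine_sim(e1, e2))  # unexpectedly below 1.0

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # identical vectors give 1.0
```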

for i in range(1, 10):
    print(i, (get_t5_embeddings(["Hello"])[0] == get_t5_embeddings(["Hello"] * i)[0]).sum())

This prints the number of positions in the two embeddings whose values match exactly. This was the result:

1 768
2 768
3 768
4 768
5 768
6 768
7 768
8 27
9 27

Every time I run it, I get mismatches once the batch size goes above 7 (the embeddings have 768 dimensions, so 768 means a perfect match).

Why am I getting different embeddings and how do I fix this?
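One thing I am considering is that get_t5_embeddings max-pools over every sequence position, including padding, so padded positions could leak into the pooled vector. A minimal sketch of a mask-aware max-pool I might try instead (masked_max_pool is a hypothetical helper; the toy tensors below stand in for real hidden states):

```python
import torch

def masked_max_pool(last_hidden_state, attention_mask):
    # Replace padded positions with -inf so they can never win the max
    mask = attention_mask.unsqueeze(-1).bool()  # (batch, seq, 1)
    masked = last_hidden_state.masked_fill(~mask, float("-inf"))
    return masked.max(dim=1).values

# Toy check: the padded position should not affect the pooled vector
hidden = torch.tensor([[[1.0, 5.0], [3.0, 2.0], [9.0, 9.0]]])  # last position is padding
mask = torch.tensor([[1, 1, 0]])
print(masked_max_pool(hidden, mask))  # tensor([[3., 5.]]) — the padded [9., 9.] is ignored
```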

