How to add new tokens to an existing Huggingface AutoTokenizer?
Canonically, there's this tutorial from Huggingface https://huggingface.co/learn/nlp-course/chapter6/2, but it ends on the note of "quirks when using existing tokenizers" and then points to the train_new_from_iterator()
function in Chapter 7. I can't find any reference to how to use it to extend the tokenizer without re-training it.
I've tried the solution from Training New AutoTokenizer Hugging Face that uses train_new_from_iterator(),
but that re-trains the tokenizer rather than extending it, so the existing token indices get replaced.
import pandas as pd
from transformers import AutoTokenizer

def batch_iterator(batch_size=3, size=8):
    df = pd.DataFrame({"note_text": ['foobar', 'helloworld']})
    for x in range(0, size, batch_size):
        yield df['note_text'].to_list()

old_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)

print(len(old_tokenizer))
print(old_tokenizer(['foobarzz', 'helloworld']))
print(new_tokenizer(['foobarzz', 'hello world']))
[out]:
50265
{'input_ids': [[0, 21466, 22468, 7399, 2], [0, 20030, 1722, 39949, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
{'input_ids': [[0, 275, 2], [0, 276, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
Note: The reason the new tokens start from 275 and 276 is that ids 0-274 are reserved tokens in the newly trained vocab.
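(For what it's worth, I checked this by mapping the low ids of the re-trained tokenizer from the snippet above back to strings:)

print(new_tokenizer.convert_ids_to_tokens([0, 1, 2, 274, 275, 276]))  # low ids are reserved/special entries; 275 and 276 are the newly learned tokens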
The expected behavior of new_tokenizer(['foobarzz', 'hello world'])
is to produce IDs beyond the original tokenizer vocab size (i.e. 50265 for the roberta-base
model), like this:
{'input_ids': [[0, 50265, 2], [0, 50266, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
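From reading the docs, it looks like tokenizer.add_tokens() might do what I want (append new tokens after the existing vocab without re-training), together with model.resize_token_embeddings() on the model side. Is something like the sketch below the intended approach, or does it run into the same quirks the tutorial warns about? (This is just what I've been experimenting with, not a known-good solution.)

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModel.from_pretrained('roberta-base')

# Append the new tokens after the existing vocab instead of re-training;
# add_tokens returns the number of tokens that were actually added.
num_added = tokenizer.add_tokens(['foobarzz', 'helloworld'])
print(num_added)       # 2
print(len(tokenizer))  # 50267 (50265 original + 2 added)
print(tokenizer(['foobarzz', 'helloworld'])['input_ids'])
# hoping for something like [[0, 50265, 2], [0, 50266, 2]]

# The model's embedding matrix has to grow to match the new vocab size.
model.resize_token_embeddings(len(tokenizer))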