How to add new tokens to an existing Huggingface tokenizer?

How to add new tokens to an existing Huggingface AutoTokenizer?

Canonically, there's this tutorial from Huggingface, https://huggingface.co/learn/nlp-course/chapter6/2, but it ends on the note of "quirks when using existing tokenizers". It then points to the train_new_from_iterator() function in Chapter 7, but I can't find any reference on how to use it to extend an existing tokenizer without re-training it.

I've tried the solution from Training New AutoTokenizer Hugging Face that uses train_new_from_iterator(), but that re-trains a tokenizer from scratch rather than extending the existing one: the newly trained vocabulary replaces the existing token indices.

import pandas as pd
from transformers import AutoTokenizer

def batch_iterator(batch_size=3, size=8):
    df = pd.DataFrame({"note_text": ['foobar', 'helloworld']})
    for x in range(0, size, batch_size):
        yield df['note_text'].to_list()

old_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)

print(len(old_tokenizer))
print(old_tokenizer(['foobarzz', 'helloworld']))
print(new_tokenizer(['foobarzz', 'hello world']))

[out]:

50265
{'input_ids': [[0, 21466, 22468, 7399, 2], [0, 20030, 1722, 39949, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
{'input_ids': [[0, 275, 2], [0, 276, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}

Note: The new tokens start from 275 and 276 because IDs 0-274 are taken up by reserved tokens in the newly trained vocabulary.
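
To sanity-check that, I assume a rough peek at get_vocab() like the one below (if I'm using it correctly) would show which reserved entries occupy the low IDs of the retrained vocabulary:

# Look at the lowest IDs of the retrained vocabulary; the reserved/special tokens
# should show up before any tokens learned from the corpus.
vocab = new_tokenizer.get_vocab()   # token -> id mapping
print(sorted(vocab.items(), key=lambda kv: kv[1])[:10])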

The expected behavior of new_tokenizer(['foo bar', 'hello world']) is for the new tokens to get IDs beyond the original tokenizer's vocab size (i.e. 50265 for the roberta-base model), so the output should look like this:

{'input_ids': [[0, 50265, 2], [0, 50266, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
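
For what it's worth, the closest thing I've found to this behavior is tokenizer.add_tokens() together with model.resize_token_embeddings(). Below is a minimal sketch of the workflow I think I'm after; whether this is the recommended way to extend an existing tokenizer (instead of train_new_from_iterator()) is exactly what I'm unsure about:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

# Append the new tokens to the existing vocabulary; returns how many were
# actually added (tokens already in the vocab are skipped).
num_added = tokenizer.add_tokens(['foobarzz', 'helloworld'])
print(num_added)        # e.g. 2
print(len(tokenizer))   # 50265 + num_added

# The model's embedding matrix must be resized to match the extended vocabulary.
model.resize_token_embeddings(len(tokenizer))

# Hoped-for result: the new tokens map to IDs 50265 and 50266.
print(tokenizer(['foobarzz', 'helloworld']))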
