For my Question Answering Kaggle competition, I wanted to experiment with replacing the BERT model with RoBERTa. This meant I had to re-encode and re-tokenize the entire Natural Questions dataset into TFRecords. This process was already taking hours with the WordPiece tokenizer used for the BERT models. RoBERTa uses a faster, language-agnostic tokenizer called SentencePiece. However, in my experiment, the SentencePiece tokenizer was significantly slower and would have taken close to 12 hours to complete if I had let it continue.
Fortunately, I came across HuggingFace's Rust tokenizer, which was 10x faster, but it is still early days and it doesn't support custom tokens out of the box. While you might (rightly) think I clickbaited you (as I didn't actually write the 10x Rust tokenizer), I did write a wrapper on top of the Rust implementation to support custom tokens for separating questions ([Q]) and answers ([A]). This might seem straightforward, but it can be really tricky to implement correctly if you're not aware of how these tokenizers actually work underneath the abstractions. Hope this helps someone.
import json
from transformers import RobertaTokenizer
from tokenizers import Tokenizer, pre_tokenizers, decoders
from tokenizers.models import BPE


class CustomRobertaTokenizer:
    def __init__(self, path):
        # Keep the slow Python tokenizer around only for its vocabulary,
        # i.e. token -> id lookups in convert_tokens_to_ids().
        self.tokenizer = RobertaTokenizer.from_pretrained(path)
        vocab = path + "/vocab.json"
        merges = path + "/merges.txt"
        # Create a Rust Tokenizer using byte-level BPE
        self.rust = Tokenizer(BPE.from_files(vocab, merges))
        # Use the ByteLevel pre-tokenizer
        self.rust.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
        # Use the ByteLevel decoder
        self.rust.decoder = decoders.ByteLevel.new()
        # added_tokens.json maps each custom token (e.g. "[Q]", "[A]") to its id;
        # only the keys matter for the membership checks in tokenize().
        with open(path + '/added_tokens.json', 'r') as f:
            self.added_token = json.load(f)
        special_tokens = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>",
                          "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>",
                          "mask_token": "<mask>"}
        # Treat RoBERTa's special tokens the same way as the custom tokens
        for k, v in special_tokens.items():
            self.added_token[v] = k
            setattr(self, k, v)

    def tokenize(self, txt, add_prefix_space=True):
        # Split the text into alternating "streams": plain-text chunks (lists of
        # words) that go through BPE, and custom/special tokens (plain strings)
        # that must stay intact. Custom tokens have to be whitespace-separated.
        streams = []
        tmp = []
        for w in txt.split():
            if w in self.added_token:
                if len(tmp) != 0:
                    streams.append(tmp)
                    tmp = []
                streams.append(w)
            else:
                tmp.append(w)
        if len(tmp) != 0:
            streams.append(tmp)
        # Encode only the plain-text chunks with the Rust tokenizer, in one batch
        prefix = " " if add_prefix_space else ""
        bpes = [prefix + " ".join(x) for x in streams if isinstance(x, list)]
        bpes_result = self.rust.encode_batch(bpes)
        # Stitch the BPE pieces and the untouched custom tokens back together
        final_result = []
        for w in streams:
            if isinstance(w, list):
                final_result.extend(bpes_result.pop(0).tokens)
            else:
                final_result.append(w)
        return final_result

    def convert_tokens_to_ids(self, tokens):
        # Delegate id lookup to the original tokenizer, which already knows
        # about the added custom tokens.
        return self.tokenizer.convert_tokens_to_ids(tokens)
And you can use it as below:
# Provide the custom tokens [Q] and [A] in added_tokens.json
txt = "<s> [Q] Who founded Google? [A] Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. </s>"
# Default tokenizer, for comparison
default_tokenizer = RobertaTokenizer.from_pretrained('./nq-vocab')
print(default_tokenizer.tokenize(txt))

# With custom-token support on top of the Rust tokenizer
tokenizer = CustomRobertaTokenizer('./nq-vocab')
print(tokenizer.tokenize(txt))
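For reference, added_tokens.json is just a mapping from each custom token to its id. Here is a minimal sketch of how you could create it; the ids 50265 and 50266 are an assumption (the slots right after RoBERTa's base vocabulary), so adjust them to your own vocabulary. You can also let transformers generate this file for you by calling tokenizer.add_tokens([...]) followed by tokenizer.save_pretrained(path).

import json

# Hypothetical ids: RoBERTa's base vocabulary ends at 50264, so the custom
# tokens are appended right after it. Adjust these to match your setup.
added_tokens = {"[Q]": 50265, "[A]": 50266}

with open('./nq-vocab/added_tokens.json', 'w') as f:
    json.dump(added_tokens, f)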
Also, remember that the SentencePiece tokenizer requires additional post-processing effort, because we don't know exactly which encoding rules were applied. We have to fall back on heuristics such as a longest-common-substring match to map the tokenizer's output back to the input tokens, and this is not guaranteed to work every time. I'm not sure how others have solved this problem.
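To make that heuristic concrete, here is a rough sketch of the kind of mapping I mean. It is not the exact code from my pipeline: it uses Python's difflib (whose find_longest_match does the longest-common-substring work) as a stand-in, and the lowercasing is an assumption about how you normalise the text.

from difflib import SequenceMatcher

def map_prediction_to_context(context, predicted_text):
    # Heuristic: find the longest common substring between the detokenized
    # prediction and the raw context, and treat its position as the answer
    # span. Lowercasing smooths over case changes introduced by the tokenizer.
    a, b = context.lower(), predicted_text.lower()
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    if match.size == 0:
        return None  # the heuristic found no overlap at all
    return context[match.a:match.a + match.size], match.a

context = "Google was founded on September 4, 1998, by Larry Page and Sergey Brin."
prediction = "larry page and sergey brin"  # e.g. detokenized model output
print(map_prediction_to_context(context, prediction))

Note that this only recovers a single contiguous span; if the prediction straddles a tokenization artifact, you would need something smarter.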
You can get an excellent overview of various tokenizers from here and here.