MAP-Neo Explained
Date:
Tokenizer
- built on SentencePiece
- trained with the BPE algorithm
- the training data contains 50B (50 billion) samples drawn from the pre-training corpus
- the maximum sentence length is 64K
- higher sampling weights are assigned to code, math, and high-quality academic data
- the vocabulary size is set to 64,000, and the maximum sentence-piece length to 16
- `remove_extra_whitespaces` is set to False
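The settings above map directly onto SentencePiece trainer options. A minimal sketch of an equivalent training call, assuming the Python `sentencepiece` package; the corpus path and model prefix are hypothetical placeholders, while the flags mirror the bullets above:

```python
import sentencepiece as spm

# Hypothetical corpus path and output prefix; the flag values mirror
# the settings listed above (BPE, 64K max sentence length, 64,000-entry
# vocabulary, max piece length 16, extra whitespace preserved).
spm.SentencePieceTrainer.train(
    input="pretrain_corpus_sample.txt",  # hypothetical path to sampled corpus
    model_prefix="map_neo_tokenizer",    # hypothetical output name
    model_type="bpe",                    # train with the BPE algorithm
    vocab_size=64000,                    # vocabulary size of 64,000
    max_sentence_length=65536,           # 64K maximum sentence length
    max_sentencepiece_length=16,         # longest allowed piece is 16
    remove_extra_whitespaces=False,      # do not collapse extra whitespace
)
```

Setting `remove_extra_whitespaces=False` matters for code-heavy data, where runs of spaces (e.g. indentation) are meaningful and should survive tokenization round-trips.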