MAP-Neo Explained
Date:
Tokenizer
- built on SentencePiece
- trained with the BPE algorithm
- the training data contains 50B (50 billion) samples drawn from the pre-training corpus
- the maximum sentence length is 64K
- higher sampling weights are assigned to code, math, and high-quality academic data
- the vocabulary size is set to 64,000, and the maximum sentence-piece length to 16
- `remove_extra_whitespaces` is set to False
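The settings above map directly onto SentencePiece trainer options. A minimal sketch of an equivalent training call, assuming the Python `sentencepiece` package; the corpus path and model prefix are hypothetical placeholders, while the flags mirror the bullets above:

```python
import sentencepiece as spm

# Hypothetical corpus path and output prefix; the flag values mirror
# the settings listed above (BPE, 64K max sentence length, 64,000-entry
# vocabulary, max piece length 16, extra whitespace preserved).
spm.SentencePieceTrainer.train(
    input="pretrain_corpus_sample.txt",  # hypothetical path to sampled corpus
    model_prefix="map_neo_tokenizer",    # hypothetical output name
    model_type="bpe",                    # train with the BPE algorithm
    vocab_size=64000,                    # vocabulary size of 64,000
    max_sentence_length=65536,           # 64K maximum sentence length
    max_sentencepiece_length=16,         # longest allowed piece is 16
    remove_extra_whitespaces=False,      # do not collapse extra whitespace
)
```

Setting `remove_extra_whitespaces=False` matters for code-heavy data, where runs of spaces (e.g. indentation) are meaningful and should survive tokenization round-trips.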