Learning NLP with CS11-711 (CMU)
- Subword Models (SentencePiece is a tool for this)
- Byte Pair Encoding
incrementally merge the most frequent pair of adjacent tokens into a single new token
- Unigram Models. Cons: hard to use in multilingual settings, and the segmentation can be arbitrary
- Continuous Word Embeddings
- Continuous Bag of Words (CBOW)
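
The Byte Pair Encoding idea noted above (repeatedly combining the most frequent token pair) can be sketched roughly as follows. This is a toy illustration, not the SentencePiece implementation: the function name `learn_bpe`, the character-level starting vocabulary, and the first-seen tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges merges from a list of words (toy sketch).

    Hypothetical helper for illustration; real tokenizers also handle
    whitespace and end-of-word markers, which are omitted here.
    """
    # Each distinct word becomes a tuple of single characters with its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first (an assumption).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word, replacing the chosen pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(corpus, num_merges=2)
print(merges)
print(dict(vocab))
```

The returned `merges` list is the ordered set of merge rules; applying them in the same order to new text would give the subword segmentation. The number of merges controls the final vocabulary size.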