Summary
- Tokenization
- Lemmatization and Stemming
- Stopword
- Regular Expression
- Text Preprocessing Tools for Korean Text
1. Tokenization
- Tokenization : Dividing a given corpus into tokens, which are units defined for a specific purpose.
- Sentence Tokenization : The unit of the token is a sentence.
  - A period (.) usually marks the boundary between sentences, but it also appears inside abbreviations such as ‘Ph.D.’, so splitting on periods alone is unreliable.
- Word Tokenization : The unit of the token is a word.
  - In English, word tokens can mostly be separated by splitting on spaces (" "), but in Korean this is not the case.
- Subword Tokenizer (Morpheme Tokenization) : The unit of the token is a morpheme. For Korean, morpheme tokenization is preferred for the following reasons:
  - Postpositions (조사) : Postpositions attach directly to the preceding word without a space, so the same word is easily treated as several different words, e.g. ‘그는’ and ‘그가’.
  - Inaccurate spacing : Compared to English, spacing in Korean text is often inconsistent, so splitting on spaces is unreliable.
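The points above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the abbreviation list and the helper name `split_sentences` are assumptions made for the example, and a sentence that happens to end with a listed abbreviation would not be split.

```python
# Hypothetical abbreviation list; a real system would need a far longer one.
ABBREVIATIONS = {"Ph.D.", "Dr.", "Mr.", "e.g."}

def split_sentences(text):
    """Close a sentence after a token ending in . ! or ?, unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks end punctuation
        sentences.append(" ".join(current))
    return sentences

text = "She earned her Ph.D. in 2020. Now she teaches NLP."
sentences = split_sentences(text)
print(sentences)

# Splitting on spaces works reasonably for English words...
print(text.split())

# ...but in Korean the postposition stays glued to the word ('그는' = '그' + '는').
korean_words = "그는 학교에 갔다".split()
print(korean_words)
```

The Korean output keeps ‘그는’ as a single token, which is why a morpheme analyzer is usually applied instead of whitespace splitting.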
1.1 Byte Pair Encoding (BPE)
- Out of vocabulary (OOV) : When a word that is not in the training vocabulary appears in real data, the model cannot represent it. This is the OOV problem, and subword tokenizers reduce it by splitting unseen words into known pieces.
- BPE : A simple data compression algorithm in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data.
- Ex) Suppose we have the data `aaabdaaabac`, which needs to be encoded (compressed).
  - The byte pair `aa` occurs most often, so we replace it with `Z`, which does not occur in our data. We now have `ZabdZabac`, where `Z = aa`.
  - The next most common byte pair is `ab`, so we replace it with `Y`. We now have `ZYdZYac`, where `Z = aa` and `Y = ab`.
  - Applying byte pair encoding recursively, we encode `ZY` as `X`. The data is now `XdXac`, where `X = ZY`, `Y = ab`, and `Z = aa`.
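The worked example can be reproduced with a short sketch. The function names and the tie-breaking rule are assumptions made for illustration: in the second step `Za` and `ab` both occur twice, and this sketch breaks the tie by comparing the pairs themselves, which happens to pick `ab` as in the example. A different tie-break would choose `Za` there and still end with a string of the same length.

```python
from collections import Counter

def most_frequent_pair(data):
    """Return (pair, count) for the most frequent adjacent pair of symbols."""
    pairs = Counter(zip(data, data[1:]))
    # Ties on count are broken by comparing the pairs (an arbitrary choice).
    return max(pairs.items(), key=lambda item: (item[1], item[0]))

def replace_pair(data, pair, new_symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out, i = [], 0
    while i < len(data):
        if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out

def compress(text, new_symbols):
    """Repeatedly replace the most frequent pair until no pair repeats."""
    data, rules = list(text), {}
    for symbol in new_symbols:
        if len(data) < 2:
            break
        pair, count = most_frequent_pair(data)
        if count < 2:  # nothing left worth replacing
            break
        data = replace_pair(data, pair, symbol)
        rules[symbol] = "".join(pair)
    return "".join(data), rules

compressed, rules = compress("aaabdaaabac", "ZYX")
print(compressed)  # XdXac
print(rules)       # {'Z': 'aa', 'Y': 'ab', 'X': 'ZY'}
```

When BPE is used for tokenization rather than compression, the same loop runs over characters of a training corpus and the learned merge rules become the subword vocabulary.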
1.2 WordPiece Tokenization
- WordPiece Tokenization : Similar to BPE in that it treats frequently occurring strings as tokens. However, whereas BPE merges the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training corpus.
- Ex) `Jet makers feud over seat width with big orders at stake` => WordPiece Tokenization => `_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake`
  - `_` marks a space in the input sentence
  - a space (` `) separates the tokens
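The difference between the two merge criteria can be made concrete with toy counts. This sketch assumes a commonly cited form of the WordPiece score, `freq(pair) / (freq(first) * freq(second))`, which approximates the likelihood gain of a merge; that formula, the toy corpus, and the variable names are assumptions for illustration and are not taken from the notes above.

```python
from collections import Counter

# Toy corpus: word -> frequency (hypothetical counts chosen for illustration).
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_freq, pair_freq = Counter(), Counter()
for word, freq in corpus.items():
    symbols = list(word)  # start from single characters
    for symbol in symbols:
        symbol_freq[symbol] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_freq[pair] += freq

# BPE criterion: merge the pair with the highest raw frequency.
bpe_choice = max(pair_freq, key=pair_freq.get)

def wordpiece_score(pair):
    """Pair frequency normalized by the frequencies of its two parts."""
    first, second = pair
    return pair_freq[pair] / (symbol_freq[first] * symbol_freq[second])

# WordPiece criterion (as assumed here): merge the pair with the highest score.
wordpiece_choice = max(pair_freq, key=wordpiece_score)

print(bpe_choice)        # ('u', 'g')
print(wordpiece_choice)  # ('g', 's')
```

Here `u` is so common on its own that pairs containing it score low, while `g` and `s` appear together in a large share of the places where `s` appears at all, so the normalized score prefers `gs` even though `ug` is more frequent.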
1.3 SentencePiece
- SentencePiece : an open-source subword tokenizer by Google that operates directly on the raw input stream — no pre-tokenization (no language-specific word splitting) is required. This makes it language-agnostic and especially well-suited to Korean, Japanese, and Chinese where word boundaries are not marked by spaces.
  - Spaces are preserved by replacing them with the meta character `▁` (U+2581) before encoding, so detokenization is lossless: a simple replace of `▁` with `" "` recovers the original text.
  - Supports two training algorithms:
    - BPE : same merge-by-frequency rule as section 1.1.
    - Unigram LM (default) : starts from a large seed vocabulary and iteratively prunes pieces that least hurt the corpus likelihood under a unigram model. Produces a probabilistic vocabulary that can sample multiple segmentations of the same input (useful for subword regularization).
  - Used by T5, ALBERT, mBART, XLNet, and most multilingual transformers.
  - Ex) `Hello world` → `▁He llo ▁world` (each token may start with `▁` to mark a leading space).
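The lossless round trip can be illustrated without the library itself. The sketch below only mimics the whitespace handling described above in plain Python; the helper names are hypothetical, the pieces are copied from the example rather than produced by a trained model, and the leading `▁` reflects the dummy-prefix behavior assumed here.

```python
META = "\u2581"  # the "▁" meta character

def escape_spaces(text):
    """Mark every space with the meta character and add one leading marker."""
    return META + text.replace(" ", META)

def detokenize(pieces):
    """Concatenate pieces, restore spaces, and drop the single leading space."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

# Pieces as in the example above (not the output of a real trained model).
pieces = ["▁He", "llo", "▁world"]

print(escape_spaces("Hello world"))  # ▁Hello▁world
print(detokenize(pieces))            # Hello world
```

Because the space information lives inside the pieces, no language-specific rule is needed to decide where spaces go when decoding.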
2. Lemmatization and Stemming
- Lemmatization : Reduces the vocabulary size by mapping words in different surface forms to their canonical root (lemma). e.g. the lemma of “am”, “are”, “is” is “be”.
- Stemming : Cuts off word endings according to a fixed set of rules (rule-based). The most well-known is the Porter algorithm, available as `nltk.stem.PorterStemmer`.
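The contrast between the two approaches can be sketched with toy versions of each. Both the lemma table and the suffix rules below are hypothetical and far smaller than what a real lemmatizer or the Porter algorithm uses; the point is only that lemmatization looks words up, while stemming applies rules blindly.

```python
# Hypothetical lemma table; real lemmatizers use a full dictionary and POS tags.
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be"}

# Hypothetical suffix rules, tried in order; much simpler than the Porter algorithm.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(word):
    """Dictionary lookup: return the lemma if known, else the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

def stem(word):
    """Apply the first matching suffix rule, keeping at least two characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([lemmatize(w) for w in ["am", "are", "is"]])
# ['be', 'be', 'be']
print([stem(w) for w in ["ponies", "caresses", "walking", "is"]])
# ['poni', 'caress', 'walk', 'is']
```

Note that the stemmer returns `poni`, which is not a dictionary word, and leaves `is` untouched, whereas the lemmatizer maps `is` to `be`.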
3. Stopword
- Stopword : Frequently occurring words (e.g. “I”, “my”, “me”, “the”, “a”) that contribute little to semantic analysis and are typically removed. You can also define a custom list.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stopword corpus
print(stopwords.words('english')[:10]) # >> ['i', 'me', 'my', 'myself' ...
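The removal step itself is a simple filter. A minimal sketch with a custom list follows; the word list and the function name `remove_stopwords` are assumptions for illustration, and the NLTK list could be passed in instead.

```python
# Hypothetical custom stopword list; stopwords.words('english') could be used instead.
CUSTOM_STOPWORDS = {"i", "my", "me", "the", "a", "is", "on"}

def remove_stopwords(tokens, stopword_set=CUSTOM_STOPWORDS):
    """Keep only tokens whose lowercase form is not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopword_set]

tokens = "The cat is sleeping on my keyboard".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['cat', 'sleeping', 'keyboard']
```

Using a set rather than a list keeps each membership check fast, which matters when filtering a large corpus.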
4. Regular Expression
- Regular Expression : A pattern syntax used to describe strings that follow specific rules.
- In Python, regular expressions are available through the `re` module, as shown below.
import re
text = """100 John PROF
101 James STUD
102 Mac STUD"""
# '\s' matches whitespace; '+' means one or more occurrences.
# A raw string (r'...') keeps the backslash from being read as a string escape.
re.split(r'\s+', text) # >> ['100', 'John', 'PROF', '101', 'James', 'STUD', ...
- RegexpTokenizer : Regex-based tokenization is also possible via `nltk.tokenize.RegexpTokenizer` or `pyspark.ml.feature.RegexTokenizer`.
from nltk.tokenize import RegexpTokenizer
# gaps=True: the pattern describes the separators between tokens, not the tokens.
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(tokenizer.tokenize("Don't be fooled by the dark sounding name"))
# >> ["Don't", 'be', 'fooled', 'by', 'the', 'dark', ...
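The same idea can be tried with only the standard library, which also shows what changes when the pattern describes the tokens instead of the gaps. This is a small sketch using `re.split` and `re.findall`; the variable names are illustrative.

```python
import re

sentence = "Don't be fooled by the dark sounding name"

# Pattern describes the gaps between tokens (comparable to gaps=True).
by_gaps = re.split(r"\s+", sentence)

# Pattern describes the tokens themselves: runs of word characters.
by_tokens = re.findall(r"\w+", sentence)

print(by_gaps)    # ["Don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name']
print(by_tokens)  # ['Don', 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name']
```

Because the apostrophe is not a word character, the token-matching pattern splits "Don't" into two pieces, so the choice of pattern directly decides how contractions are handled.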
5. Text Preprocessing Tools for Korean Text
- PyKoSpacing (pykospacing) : Package for correcting Korean word spacing.
- Py-Hanspell (hanspell) : Package for Korean spell-checking.
- SOYNLP (soynlp) : POS tagging and unsupervised word tokenization — useful for neologisms and out-of-dictionary words not registered in a morphological analyzer.
- Customized KoNLPy : Lets you use KoNLPy with user-defined words added to the morphological analyzer.