NLP #1 | Text Preprocessing


Summary

  • Tokenization
  • Lemmatization and Stemming
  • Stopword
  • Regular Expression
  • Text Preprocessing Tools for Korean Text


1. Tokenization

  • Tokenization : dividing a given corpus into tokens, which are units defined for a specific purpose.

    • Sentence Tokenization : The unit of the token is a sentence.

      • A period (.) usually marks a sentence boundary, but it also appears inside abbreviations such as ‘Ph.D.’, so splitting on periods alone is unreliable.
    • Word Tokenization : The criterion of the token is a word.

      • In English, word tokens can usually be obtained by splitting on spaces (" "), but this does not work for Korean.
    • Subword Tokenization (Morpheme Tokenization) : the unit of the token is a morpheme. For Korean, morpheme tokenization is preferred for the following reasons:

      1. Postposition (조사) : postpositions attach directly to a word without a space, so the same word is easily treated as several different tokens. ex) ‘그는’, ‘그가’
      2. Inaccurate spacing : Korean text tends to be spaced less consistently than English text.
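
The points above can be illustrated with a minimal sketch that uses only Python's standard library. The sample sentences and the uppercase-lookahead rule are assumptions made for illustration. The rule still fails when an abbreviation really does end a sentence, which is why dedicated sentence tokenizers exist.

```python
import re

text = "He earned a Ph.D. in physics. He now teaches NLP."

# Naive sentence tokenization: treat every period as a boundary.
# 'Ph.D.' is wrongly cut into 'Ph' and 'D'.
naive_sentences = [s.strip() for s in text.split(".") if s.strip()]
print(naive_sentences)

# Slightly better heuristic (an assumption, not a general solution):
# split only at whitespace that follows a period and precedes an uppercase letter.
sentences = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
print(sentences)

# Word tokenization by whitespace works reasonably well for English ...
english_words = sentences[1].split()
print(english_words)

# ... but in Korean the postposition stays glued to the word,
# so '그는' and '그가' come out as two unrelated tokens.
korean_words = "그는 학생이다 그가 왔다".split()
print(korean_words)
```

Note that whitespace splitting also leaves punctuation attached ('NLP.'), which is one reason practical word tokenizers apply further rules.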

1.1 Byte Pair Encoding (BPE)

  • Out of vocabulary (OOV) : a word that does not appear in the training set but occurs in real data cannot be represented by the model. This is the OOV problem; subword methods such as BPE mitigate it by splitting unseen words into known pieces.

  • BPE : a simple data compression algorithm in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data.

    • Ex) Suppose we have data aaabdaaabac which needs to be encoded (compressed).

      • The byte pair aa occurs most often, so we will replace it with Z as Z does not occur in our data. So we now have ZabdZabac where Z = aa.

      • The next most common byte pair is ab (tied with Za at two occurrences), so let’s replace it with Y. We now have ZYdZYac where Z = aa and Y = ab.

      • Applying the same step again, ZY is now the most common pair, so we encode it as X. Our data has now been transformed into XdXac where X = ZY, Y = ab, and Z = aa.
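
The compression steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names are hypothetical, and ties between equally frequent pairs are broken by first occurrence. Because Za and ab are tied after the first step, this sketch picks Za where the walkthrough picks ab, so the intermediate table differs, but the compressed result is the same XdXac.

```python
from collections import Counter

def most_frequent_pair(data):
    """Return the most frequent adjacent pair, or None if no pair repeats."""
    pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]  # ties: first pair encountered wins
    return pair if count > 1 else None

def bpe_compress(data, unused_symbols):
    """Repeatedly replace the most frequent pair with a symbol not in the data."""
    table = {}
    for symbol in unused_symbols:
        pair = most_frequent_pair(data)
        if pair is None:
            break
        data = data.replace(pair, symbol)
        table[symbol] = pair
    return data, table

def bpe_decompress(data, table):
    """Undo the replacements, newest first."""
    for symbol, pair in reversed(list(table.items())):
        data = data.replace(symbol, pair)
    return data

compressed, table = bpe_compress("aaabdaaabac", "ZYX")
print(compressed, table)

restored = bpe_decompress(compressed, table)
print(restored)
```

Decompression expands the symbols in reverse order of creation, which is why the replacement table has to be stored alongside the compressed data.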


1.2 Word Piece Tokenization

  • Word Piece Tokenization : similar to BPE in that it builds tokens from frequently occurring strings. The difference is the merge criterion: BPE merges the most frequent pair, whereas WordPiece merges the pair that most increases the likelihood of the training corpus.

    • Ex) Jet makers feud over seat width with big orders at stake => Word Piece Tokenization => _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

      • "_" marks a space in the input sentence

      • " " separates the output tokens
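
To make the difference in merge criteria concrete, here is a small sketch. It assumes one commonly described formulation of the WordPiece score, count(pair) / (count(first) × count(second)), as a stand-in for "the merge that most increases likelihood". The toy word counts and the function name are hypothetical, and real implementations differ in details such as continuation prefixes.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency.
word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in word_counts.items():
    symbols = list(word)  # start from single characters
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# BPE criterion: merge the most frequent adjacent pair.
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece-style criterion (assumed formulation): normalize the pair count
# by the counts of its two parts, so pairs of rare symbols are favored.
def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

wordpiece_choice = max(pair_counts, key=wordpiece_score)

print("BPE merges:", bpe_choice)
print("WordPiece-style merges:", wordpiece_choice)
```

On this corpus the two criteria disagree: the frequency rule picks ('u', 'g'), while the normalized score picks ('g', 's'), because 's' almost never appears without a preceding 'g'.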


1.3 SentencePiece

  • SentencePiece : an open-source subword tokenizer by Google that operates directly on the raw input stream — no pre-tokenization (no language-specific word splitting) is required. This makes it language-agnostic and especially well-suited to Korean, Japanese, and Chinese where word boundaries are not marked by spaces.
    • Spaces are preserved by replacing them with the meta character ▁ (U+2581) before encoding, so detokenization is lossless: simply replacing ▁ with " " recovers the original text.
    • Supports two training algorithms:
      • BPE : same merge-by-frequency rule as section 1.1.
      • Unigram LM (default) : starts from a large seed vocabulary and iteratively prunes pieces that least hurt the corpus likelihood under a unigram model. Produces a probabilistic vocabulary that can sample multiple segmentations of the same input (useful for subword regularization).
    • Used by T5, ALBERT, mBART, XLNet, and many multilingual transformer models.
    • Ex) Hello world => ▁He llo ▁world (each token may start with ▁ to mark a leading space).
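
The whitespace convention can be sketched without the library itself. The code below does not call sentencepiece; it only mimics how spaces are carried by the meta character ▁ (U+2581). The function names and the hand-picked segmentation are hypothetical, and the leading ▁ mirrors the library's default behavior of prefixing the input (its add_dummy_prefix option).

```python
META = "\u2581"  # the meta character that stands in for a space

def to_meta(text):
    """Replace spaces with the meta character and mark the start of the text."""
    return META + text.replace(" ", META)

def detokenize(pieces):
    """Join pieces, turn the meta character back into spaces, drop the leading one."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

# Hand-picked segmentation for illustration only;
# a trained model would choose its own pieces.
pieces = [META + "He", "llo", META + "world"]

print(to_meta("Hello world"))
print(detokenize(pieces))
```

Because the space information lives inside the pieces, detokenization needs no language-specific rules about where spaces belong.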



2. Lemmatization and Stemming

  • Lemmatization : Reduces the vocabulary size by mapping words in different surface forms to their canonical root (lemma). e.g. the lemma of “am”, “are”, “is” is “be”.
  • Stemming : Cuts off word endings according to a fixed set of rules (rule-based). The most well-known is the Porter algorithm, available as nltk.stem.PorterStemmer.
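
The contrast between the two approaches can be sketched with toy versions of each: lemmatization as a dictionary lookup, stemming as suffix stripping. The lemma table and suffix rules below are hypothetical and far simpler than a real lemmatizer or the Porter algorithm; they only show why a stem need not be a real word.

```python
# Lemmatization as lookup (tiny hypothetical lemma table).
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "better": "good"}

def lemmatize(word):
    word = word.lower()
    return LEMMAS.get(word, word)

# Stemming as rule-based suffix stripping (toy rules, NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["is", "are", "running", "walked", "cats"]:
    print(w, lemmatize(w), toy_stem(w))
```

The lookup maps "is" and "are" to "be", which no suffix rule could do, while the rules turn "running" into the non-word "runn" without needing any dictionary.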


3. Stopword

  • Stopword : Frequently occurring words (e.g. “I”, “my”, “me”, “the”, “a”) that contribute little to semantic analysis and are typically removed. You can also define a custom list.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stopword corpus
print(stopwords.words('english')[:10]) # >> ['i', 'me', 'my', 'myself' ...
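
Removing stopwords is then a simple filter over the tokens. The sketch below uses a small hand-written stopword set so that it runs without any download; the set and the sample sentence are assumptions for illustration.

```python
# Hypothetical custom stopword list; in practice you might start from
# a library list and add or remove words for your task.
custom_stopwords = {"i", "my", "me", "the", "a", "from"}

sentence = "I like the view from my window"
tokens = sentence.lower().split()

# Keep only the tokens that are not stopwords.
filtered = [t for t in tokens if t not in custom_stopwords]
print(filtered)
```

Using a set rather than a list keeps each membership check fast, which matters on large corpora.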


4. Regular Expression

  • Regular Expression : A pattern syntax used to describe strings that follow specific rules.
  • In Python, regular expressions are available through the re module, as shown below.
import re
text = """100 John  PROF
                    101 James STUD
                    102 Mac   STUD"""
# r'\s' matches whitespace; '+' means one or more occurrences.
# A raw string (r'...') keeps the backslash from being read as a string escape.
re.split(r'\s+', text) # >> ['100', 'John', 'PROF', '101', 'James', 'STUD'..
  • RegexpTokenizer : Regex-based tokenization is also possible via nltk.tokenize.RegexpTokenizer or pyspark.ml.feature.RegexTokenizer.
from nltk.tokenize import RegexpTokenizer
# gaps=True: the pattern describes the separators between tokens, not the tokens.
tokenizer = RegexpTokenizer(r"[\s]+", gaps=True)
print(tokenizer.tokenize("Don't be fooled by the dark sounding name"))
# >> ["Don't", 'be', 'fooled', 'by', 'the', 'dark', ...


5. Text Preprocessing Tools for Korean Text

  • PyKoSpacing (pykospacing) : Package for correcting Korean word spacing.
  • Py-Hanspell (hanspell) : Package for Korean spell-checking.
  • SOYNLP (soynlp) : POS tagging and unsupervised word tokenization — useful for neologisms and out-of-dictionary words not registered in a morphological analyzer.
  • Customized KoNLPy : Lets you use KoNLPy with user-defined words added to the morphological analyzer.