NLP #1 | Text Preprocessing

2021-01-05 3. Natural Language Comments

Summary

Tokenization
Lemmatization and Stemming
Stopword
Regular Expression
Text Preprocessing Tools for Korean Text

1. Tokenization

Tokenization : dividing a given corpus into tokens, which are units defined for a specific purpose.
- Sentence Tokenization : The unit of the token is a sentence.
  - A period (.) usually marks a sentence boundary, but it also appears in abbreviations such as ‘Ph.D.’, so splitting on periods alone is unreliable.
- Word Tokenization : The unit of the token is a word.
  - In English, word tokens can usually be obtained by splitting on spaces (" "), but this does not work well for Korean.
- Subword Tokenization (Morpheme Tokenization) : The unit of the token is a morpheme. In the case of Korean, morpheme tokenization is preferred for the following reasons:
  1. Postposition (조사) : Postpositions attach directly to a word without a space, so the same word with different postpositions is treated as different tokens. ex) ‘그는’, ‘그가’
  2. Inaccurate spacing : Compared to English, spacing in Korean text tends to be inconsistent.
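
The points above can be illustrated with a minimal sketch that uses only the standard library. The function names, the abbreviation list, and the postposition list are hypothetical and chosen only for this illustration; a real pipeline would use a trained sentence splitter and a morphological analyzer instead.

```python
import re

text = "He earned a Ph.D. in 2020. Then he moved to Seoul."

# Naive sentence tokenization: treat every period followed by whitespace
# as a boundary. The abbreviation 'Ph.D.' is wrongly split off.
naive_sentences = re.split(r"(?<=\.)\s+", text)

def split_sentences(text, abbreviations=("Ph.D.",)):
    """Toy splitter: a chunk ending in '.' closes a sentence unless it is
    a known abbreviation (the list is an assumption for this example)."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk.endswith(".") and chunk not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

# Word tokenization by whitespace works reasonably for English ...
english_tokens = "He moved to Seoul".split()

# ... but in Korean the postposition stays glued to the word, so the same
# word ('그') shows up as two different tokens ('그는', '그가').
korean_tokens = "그는 왔다 그가 갔다".split()

def strip_postposition(token, postpositions=("는", "가")):
    """Toy stand-in for morpheme analysis: drop one trailing postposition."""
    for p in postpositions:
        if len(token) > len(p) and token.endswith(p):
            return token[: -len(p)]
    return token

print(naive_sentences)
print(split_sentences(text))
print(sorted(set(korean_tokens)))
print(sorted({strip_postposition(t) for t in korean_tokens}))
```

Stripping suffixes by string matching is only a stand-in here; it would also damage ordinary words that happen to end in 는 or 가, which is why a morphological analyzer is used in practice.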

1.1 Byte Pair Encoding (BPE)

Out of vocabulary (OOV) : When a word that is not in the training vocabulary appears in real data, the model cannot represent it. This is the OOV problem. Subword tokenizers such as BPE mitigate it by splitting unknown words into known pieces.
BPE : A simple data compression algorithm in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data.
- Ex) Suppose we have data aaabdaaabac which needs to be encoded (compressed).
  - The byte pair aa occurs most often, so we will replace it with Z as Z does not occur in our data. So we now have ZabdZabac where Z = aa.
  - The next most common byte pair is ab, so let’s replace it with Y. We now have ZYdZYac where Z = aa and Y = ab.
  - We can use recursive byte pair encoding to encode ZY as X. Our data has now transformed into XdXac where X = ZY, Y = ab, and Z = aa.
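
The walkthrough above can be reproduced with a short sketch. The function names and the pool of replacement symbols are assumptions made for this example, and ties between equally frequent pairs are broken by lexicographic order, an arbitrary choice that happens to select ab over Za in the second step.

```python
from collections import Counter

def bpe_compress(data, symbols="ZYXWVU"):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol.

    Assumes none of the replacement symbols occur in the input data.
    Returns the compressed string and the list of (symbol, pair) merges.
    """
    merges = []
    for symbol in symbols:
        # Count every adjacent two-character window (overlaps included).
        pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically larger pair.
        pair, count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break  # nothing repeats any more, so no further gain
        data = data.replace(pair, symbol)
        merges.append((symbol, pair))
    return data, merges

def bpe_decompress(data, merges):
    """Undo the merges in reverse order to recover the original string."""
    for symbol, pair in reversed(merges):
        data = data.replace(symbol, pair)
    return data

compressed, merges = bpe_compress("aaabdaaabac")
print(compressed)  # XdXac
print(merges)      # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
print(bpe_decompress(compressed, merges))  # aaabdaaabac
```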

1.2 Word Piece Tokenization

Word Piece Tokenization : Similar to BPE in that it treats frequently occurring strings as tokens. However, whereas BPE merges the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data.
- Ex) Jet makers feud over seat width with big orders at stake => Word Piece Tokenization => _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
  - "_" marks a space in the input sentence (the start of a word).
  - " " separates the output tokens.
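
To make the difference in merge criteria concrete, here is a minimal sketch on a toy corpus. It assumes the commonly cited approximation of the WordPiece criterion, score(a, b) = count(ab) / (count(a) * count(b)). The corpus, the variable names, and the omission of word-internal prefix markers are simplifications for this example, not a description of any particular implementation.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in corpus.items():
    for ch in word:
        symbol_counts[ch] += freq
    for left, right in zip(word, word[1:]):
        pair_counts[(left, right)] += freq

# BPE criterion: merge the most frequent adjacent pair.
bpe_choice = max(pair_counts, key=pair_counts.get)

def wordpiece_score(pair):
    """Pair frequency normalized by the frequencies of its two parts."""
    left, right = pair
    return pair_counts[pair] / (symbol_counts[left] * symbol_counts[right])

# WordPiece-style criterion: merge the pair with the highest score.
wordpiece_choice = max(pair_counts, key=wordpiece_score)

print(bpe_choice)        # ('u', 'g'): the most frequent pair (count 20)
print(wordpiece_choice)  # ('g', 's'): rarer, but 's' almost always follows 'g'
```

The normalization penalizes pairs whose parts are individually common, so the two criteria can pick different merges from the same counts.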

1.3 SentencePiece

SentencePiece : An open-source subword tokenizer by Google that operates directly on the raw input text, so no pre-tokenization (language-specific word splitting) is required. This makes it language-agnostic and well suited to languages such as Japanese and Chinese, where word boundaries are not marked by spaces, and to Korean, where spacing is inconsistent.
- Spaces are preserved by replacing them with the meta character ▁ (U+2581) before encoding, so detokenization is lossless: a simple replace of ▁ with " " recovers the original text.
- Supports two training algorithms:
  - BPE : same merge-by-frequency rule as section 1.1.
  - Unigram LM (default) : starts from a large seed vocabulary and iteratively prunes pieces that least hurt the corpus likelihood under a unigram model. Produces a probabilistic vocabulary that can sample multiple segmentations of the same input (useful for subword regularization).
- Used by T5, ALBERT, mBART, XLNet, and many multilingual transformers.
- Ex) Hello world → ▁He llo ▁world (each token may start with ▁ to mark a leading space).
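
The whitespace handling can be sketched without the library itself. The helper names below are hypothetical, and the pieces are copied from the example above rather than produced by a trained model. The leading ▁ mirrors the dummy prefix that SentencePiece adds by default.

```python
META = "\u2581"  # the '▁' meta character used in place of spaces

def escape_spaces(text):
    """Replace spaces with the meta character and add a leading one."""
    return META + text.replace(" ", META)

def detokenize(pieces):
    """Join pieces, turn the meta character back into spaces, and drop
    the single leading space introduced by the dummy prefix."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

pieces = ["\u2581He", "llo", "\u2581world"]  # pieces from the example above

print(escape_spaces("Hello world"))  # ▁Hello▁world
print(detokenize(pieces))            # Hello world
```

Because the space information lives inside the pieces, no language-specific rules are needed to reassemble the text.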

2. Lemmatization and Stemming

Lemmatization : Reduces the vocabulary size by mapping words in different surface forms to their canonical root (lemma). e.g. the lemma of “am”, “are”, “is” is “be”.
Stemming : Cuts off word endings according to a fixed set of rules (rule-based). The most well-known is the Porter algorithm, available as nltk.stem.PorterStemmer.
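
The contrast between the two can be shown with a toy sketch. This is not the Porter algorithm and not a real lemmatizer: the lookup table, the suffix list, and the function names are assumptions made only to show dictionary lookup versus rule-based suffix stripping.

```python
# Lemmatization: map a surface form to its dictionary form via lookup.
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be"}

def toy_lemmatize(word):
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

# Stemming: strip a suffix by rule, with no knowledge of the vocabulary.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_lemmatize(w) for w in ["am", "are", "is"]])
# ['be', 'be', 'be']
print([toy_stem(w) for w in ["playing", "played", "plays", "studies", "is"]])
# ['play', 'play', 'play', 'studi', 'is']
```

The stemmer cannot map "is" to "be" and produces the non-word "studi", which is typical of rule-based stemming. The lemmatizer returns real words but only for forms it knows.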

3. Stopword

Stopword : Frequently occurring words (e.g. “I”, “my”, “me”, “the”, “a”) that contribute little to semantic analysis and are typically removed. You can also define a custom list.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True) # fetch the stopword corpus on first use
print(stopwords.words('english')[:10]) # >> ['i', 'me', 'my', 'myself' ...
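
A custom list works the same way. The sketch below assumes a hand-written stopword set and plain whitespace tokenization, both chosen only for illustration.

```python
# Hypothetical custom stopword list (a set gives fast membership tests).
custom_stopwords = {"i", "my", "me", "the", "a"}

sentence = "I left my keys on the table"
tokens = sentence.lower().split()

# Keep only the tokens that are not in the stopword list.
filtered = [t for t in tokens if t not in custom_stopwords]
print(filtered)  # ['left', 'keys', 'on', 'table']
```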

4. Regular Expression

Regular Expression : A pattern syntax used to describe strings that follow specific rules.
In Python, regular expressions are available through the re module, as shown below.

import re
text = """100 John  PROF
101 James STUD
102 Mac   STUD"""
# '\s' matches whitespace; '+' means one or more occurrences.
re.split(r'\s+', text) # >> ['100', 'John', 'PROF', '101', 'James', 'STUD'..
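
Beyond splitting, capture groups can pull structured fields out of the same text. This is a small sketch that assumes each record has the shape "number name role"; the variable names are illustrative.

```python
import re

text = """100 John  PROF
101 James STUD
102 Mac   STUD"""

# Each pair of parentheses is a capture group; findall returns one tuple
# per match, holding the ID, the name, and the role.
records = re.findall(r"(\d+)\s+(\w+)\s+(\w+)", text)

# Filter on the captured role field.
students = [name for _, name, role in records if role == "STUD"]

print(records)   # [('100', 'John', 'PROF'), ('101', 'James', 'STUD'), ('102', 'Mac', 'STUD')]
print(students)  # ['James', 'Mac']
```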

RegexpTokenizer : Regex-based tokenization is also possible via nltk.tokenize.RegexpTokenizer or pyspark.ml.feature.RegexTokenizer.

from nltk.tokenize import RegexpTokenizer
# gaps=True: the pattern describes the separators (whitespace), not the tokens.
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(tokenizer.tokenize("Don't be fooled by the dark sounding name"))
# >> ["Don't", 'be', 'fooled', 'by', 'the', 'dark', ...

5. Text Preprocessing Tools for Korean Text

PyKoSpacing (pykospacing) : Package for correcting Korean word spacing.
Py-Hanspell (hanspell) : Package for Korean spell-checking.
SOYNLP (soynlp) : POS tagging and unsupervised word tokenization. Useful for neologisms and other words not registered in a morphological analyzer's dictionary.
Customized KoNLPy : Lets you use KoNLPy with user-defined words added to the morphological analyzer.