Tokenization, lemmatization, frequency calculation
This module covers three steps: tokenization, lemmatization, and frequency calculation.
1. Tokenization
- Tokenization is the process of converting a raw text string into a sequence of smaller units called tokens. Tokens typically include words, punctuation marks, numbers, and other meaningful elements in a text.
- This step is essential in text processing because it allows computers to break down and analyze text in a structured way.
- There are many methods for tokenizing text, and the choice of method can vary depending on the goals of your analysis or the linguistic features you want to investigate.
▷ Simple .split()
text = "Hello, world! This is Python."
# Naïve split:
tokens = text.split(" ")
print(tokens)
# → ['Hello,', 'world!', 'This', 'is', 'Python.']
Notice that the punctuation sticks to the words. One simple fix is to insert spaces around punctuation marks and then split:
def tokenize_simple(text):
    for p in [".", ",", "!", "?", ";", ":"]:
        text = text.replace(p, f" {p} ")
    return text.lower().split()
print(tokenize_simple(text))
# → ['hello', ',', 'world', '!', 'this', 'is', 'python', '.']
▷ Regex-based tokenization (optional)
Regular expressions (regex) are patterns that describe sets of strings. They let you match, search, and split text using rules (e.g., “match one or more word characters” or “match any punctuation mark”).
import re
def tokenize_regex(text):
    # \w+ matches a run of word characters (letters, digits, underscore);
    # [^\w\s] matches a single character that is neither a word character
    # nor whitespace, i.e. punctuation or symbols
    return re.findall(r"\w+|[^\w\s]", text.lower())
print(tokenize_regex(text))
# → ['hello', ',', 'world', '!', 'this', 'is', 'python', '.']
So with regex, you don’t need to manually insert spaces—punctuation and words can be separated using pattern matching rules.
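One thing to watch: this pattern treats the apostrophe as punctuation, so contractions are split apart, while digits count as word characters. A quick illustration on a made-up sentence:
print(tokenize_regex("It's 5 o'clock, don't be late!"))
# → ['it', "'", 's', '5', 'o', "'", 'clock', ',', 'don', "'", 't', 'be', 'late', '!']
If you want contractions handled more gracefully, the toolkit tokenizers below apply more linguistically informed rules.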
▷ Toolkit-based tokenization 1 (NLTK)
NLTK includes a robust tokenizer that handles edge cases like abbreviations and contractions:
import nltk
from nltk.tokenize import word_tokenize
# On first run, download the tokenizer models:
# nltk.download('punkt')
def tokenize_nltk(text):
    return word_tokenize(text.lower())
text = "Hello, world! This is Python."
print(tokenize_nltk(text))
# → ['hello', ',', 'world', '!', 'this', 'is', 'python', '.']
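To see what these edge cases look like, here is word_tokenize on a sentence with an abbreviation and a contraction (an illustrative example; the exact output can vary slightly across NLTK versions):
print(word_tokenize("Dr. Smith doesn't approve."))
# → ['Dr.', 'Smith', 'does', "n't", 'approve', '.']
The contraction is split into 'does' and "n't", and the period stays attached to the abbreviation 'Dr.' instead of being treated as sentence-final punctuation.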
▷ Toolkit-based tokenization 2 (spaCy)
spaCy is another NLP library that provides fast, accurate tokenization (and full pipelines for tagging, parsing, and more). Since we will continue to use spaCy in later tutorials for additional tasks, I recommend getting comfortable with this approach now.
# Install spaCy and the small English model (run these two commands in a terminal, not inside Python):
pip install spacy
python -m spacy download en_core_web_sm
import spacy
# Load the English model
nlp = spacy.load("en_core_web_sm")
def tokenize_spacy(text):
    """
    Tokenize text using spaCy's rule-based tokenizer.
    Returns a list of token strings.
    """
    doc = nlp(text)
    return [token.text for token in doc]
text = "Hello, world! This is Python."
print(tokenize_spacy(text))
# → ['Hello', ',', 'world', '!', 'This', 'is', 'Python', '.']
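spaCy's tokenizer handles similar edge cases through built-in exception rules; for example (output from en_core_web_sm; details may vary between model versions):
print(tokenize_spacy("We're visiting the U.K. in 2026."))
# → ['We', "'re", 'visiting', 'the', 'U.K.', 'in', '2026', '.']
Here the contraction is split into 'We' and "'re", while 'U.K.' is kept as a single token.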
2. Lemmatization
- Lemmatization means converting each token to its base (dictionary) form, known as its lemma (for example, running → run, mice → mouse).
▷ Dictionary-based lemmatizer
# Suppose lemma_dict = {"is": "be", "ran": "run", ...}
def lemmatize_dict(tokens, lemma_dict):
    # Look up each token; fall back to the token itself if it is not in the dictionary
    return [lemma_dict.get(tok, tok) for tok in tokens]
tokens = ["this", "is", "running"]
lemma_dict = {"is":"be", "running":"run"}
print(lemmatize_dict(tokens, lemma_dict))
# → ['this', 'be', 'run']
▷ NLTK’s WordNetLemmatizer
import nltk
from nltk.stem import WordNetLemmatizer
# On first run, download the WordNet data:
# nltk.download('wordnet')
wnl = WordNetLemmatizer()
print(wnl.lemmatize("running", pos="v")) # → 'run'
print(wnl.lemmatize("better", pos="a")) # → 'good'
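One detail to keep in mind: lemmatize() defaults to pos="n" (noun), so words that are not nouns come back unchanged unless you supply the right part of speech:
print(wnl.lemmatize("running"))            # → 'running' (treated as a noun)
print(wnl.lemmatize("cats"))               # → 'cat'
print(wnl.lemmatize("running", pos="v"))   # → 'run'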
▷ spaCy Lemmatizer
import spacy
# Load English model (once)
nlp = spacy.load("en_core_web_sm")
def lemmatize_spacy(text):
    """
    Tokenize and lemmatize with spaCy.
    Returns list of lemmas in input order.
    """
    doc = nlp(text)
    return [token.lemma_ for token in doc]
sample = "Running faster makes you better."
print(lemmatize_spacy(sample))
# → ['run', 'fast', 'make', 'you', 'good', '.']
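It is often helpful to look at tokens and their lemmas side by side; a quick check (output from en_core_web_sm; other model versions may differ in small ways):
doc = nlp("The children were running.")
print([(token.text, token.lemma_) for token in doc])
# → [('The', 'the'), ('children', 'child'), ('were', 'be'), ('running', 'run'), ('.', '.')]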
3. Frequency Calculation
Text analysis often starts with frequency counts: for each word token (or lemma), how many times does it appear in the text?
▷ Using collections.Counter
from collections import Counter
tokens = ["a", "b", "a", ".", "a", "b"]
freq = Counter(tokens)
print(freq)
# → Counter({'a': 3, 'b': 2, '.': 1})
# Top 3 most common:
print(freq.most_common(3))
# → [('a', 3), ('b', 2), ('.', 1)]
▷ Simple loop version
def freq_simple(tokens):
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts
print(freq_simple(tokens))
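The plain-dict version produces the same counts ({'a': 3, 'b': 2, '.': 1}) but has no most_common method; to get the top items you can sort by count yourself:
counts = freq_simple(tokens)
top3 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
print(top3)
# → [('a', 3), ('b', 2), ('.', 1)]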
End-to-End: Building a corpus_freq function
Combine all the steps above to process every .txt file that matches a filename pattern.
import glob
from collections import Counter
import spacy
# Load English spaCy model once
nlp = spacy.load("en_core_web_sm")
def tokenize_and_lemmatize(text):
    """
    Tokenize and lemmatize using spaCy.
    Returns a list of lemmas, skipping spaces and punctuation.
    """
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc
            if not token.is_space and not token.is_punct]
def corpus_freq(pattern):
    """
    Reads every .txt file matching the glob pattern (e.g., "corpus/*.txt")
    and builds a frequency Counter of lemmas across all of them.
    """
    total = Counter()
    for filepath in glob.glob(pattern):
        with open(filepath, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        lems = tokenize_and_lemmatize(text)
        total.update(lems)
    return total
# Example usage:
freq = corpus_freq("freq_test.txt")
print(freq.most_common(10))
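A literal filename is itself a valid glob pattern, which is why the call above reads just freq_test.txt. To combine frequencies across a whole folder of texts, pass a wildcard pattern instead (the corpus/ folder below is only a placeholder for wherever your files live):
freq_all = corpus_freq("corpus/*.txt")
print(freq_all.most_common(10))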
4. Exercises
In this exercise, you’ll learn how to download a raw text from Project Gutenberg, clean it, tokenize it, and compute basic statistics like word frequencies. You’ll be working with Text #2554 (Crime and Punishment). Follow the steps below and answer the required questions. Submit your work as a Python notebook, including all code and processed outputs.
[1] Download the text
Use the following code to download the full text from Project Gutenberg.
from urllib.request import urlopen
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read().decode('utf-8')
print(type(raw))
print(len(raw))
print(f"This text is about: {raw[54:75]}")
[2] Trim the header and footer
Remove the Project Gutenberg metadata and keep only the actual story content.
start = raw.find("PART I")
end = raw.rfind("End of Project Gutenberg's Crime")
raw = raw[start:end]
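To confirm the slicing worked, you can peek at the start of the trimmed string and compare its length with the value printed earlier (a quick optional check):
print(raw[:20])   # should now begin with 'PART I'
print(len(raw))   # noticeably shorter than the full download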
[3] Tokenize the text
[4] Create a text object (for example, an nltk.Text object built from your tokens)
[5] Calculate lemmatized word frequencies
[6] Print (1) the 20 most frequent lemmas and (2) the 20 least frequent lemmas
[7] What observations can you make by comparing the most and least frequent lemmas?