Tokenization, lemmatization, frequency calculation
This module covers three steps: tokenization, lemmatization, and frequency calculation.
1. Tokenization
- Tokenization is the process of converting a raw text string into a sequence of smaller units called tokens. Tokens typically include words, punctuation marks, numbers, and other meaningful elements in a text.
- This step is essential in text processing because it allows computers to break down and analyze text in a structured way.
- There are many methods for tokenizing text, and the choice of method can vary depending on the goals of your analysis or the linguistic features you want to investigate.
▷ Simple .split()
text = "Hello, world! This is Python."
# Naïve split:
tokens = text.split(" ")
print(tokens)
# → ['Hello,', 'world!', 'This', 'is', 'Python.']
Notice that the punctuation sticks to the words. One simple fix is to insert spaces around punctuation marks and then split:
def tokenize_simple(text):
    for p in [".", ",", "!", "?", ";", ":"]:
        text = text.replace(p, f" {p} ")
    return text.lower().split()
print(tokenize_simple(text))
# → ['hello', ',', 'world', '!', 'this', 'is', 'python', '.']
▷ Regex-based tokenization (optional)
Regular expressions (regex) are patterns that describe sets of strings. They let you match, search, and split text using rules (e.g., “match one or more word characters” or “match any punctuation mark”).
import re
def tokenize_regex(text):
    # \w+ matches a run of word characters (letters, digits, underscore);
    # [^\w\s] matches a single character that is neither a word character
    # nor whitespace, i.e. punctuation or symbols
    return re.findall(r"\w+|[^\w\s]", text.lower())
print(tokenize_regex(text))
# → ['hello', ',', 'world', '!', 'this', 'is', 'python', '.']
So with regex, you don’t need to manually insert spaces—punctuation and words can be separated using pattern matching rules.
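One thing to watch: this pattern treats the apostrophe as punctuation, so contractions are split apart, while digits count as word characters. A quick illustration on a made-up sentence:
print(tokenize_regex("It's 5 o'clock, don't be late!"))
# → ['it', "'", 's', '5', 'o', "'", 'clock', ',', 'don', "'", 't', 'be', 'late', '!']
If you want contractions handled more gracefully, the toolkit tokenizers below apply more linguistically informed rules.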
▷ Toolkit-based tokenization 1 (NLTK)
NLTK includes a robust tokenizer that handles edge cases like abbreviations and contractions:
import nltk
from nltk.tokenize import word_tokenize
# On first run, download the tokenizer models:
# nltk.download('punkt')
def tokenize_nltk(text):
    return word_tokenize(text.lower())
text = "Hello, world! This is Python."
print(tokenize_nltk(text))
# → ['hello', ',', 'world', '!', 'this', 'is', 'python', '.']
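To see what these edge cases look like, here is word_tokenize on a sentence with an abbreviation and a contraction (an illustrative example; the exact output can vary slightly across NLTK versions):
print(word_tokenize("Dr. Smith doesn't approve."))
# → ['Dr.', 'Smith', 'does', "n't", 'approve', '.']
The contraction is split into 'does' and "n't", and the period stays attached to the abbreviation 'Dr.' instead of being treated as sentence-final punctuation.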
▷ Toolkit-based tokenization 2 (spaCy)
spaCy is another NLP library that provides fast, accurate tokenization (and full pipelines for tagging, parsing, and more). Since we will continue to use spaCy in later tutorials for additional tasks, I recommend getting comfortable with this approach now.
# Install spaCy and the small English model (run these two commands in a terminal, not inside Python):
pip install spacy
python -m spacy download en_core_web_sm
import spacy
# Load the English model
nlp = spacy.load("en_core_web_sm")
def tokenize_spacy(text):
    """
    Tokenize text using spaCy's rule-based tokenizer.
    Returns a list of token strings.
    """
    doc = nlp(text)
    return [token.text for token in doc]
text = "Hello, world! This is Python."
print(tokenize_spacy(text))
# → ['Hello', ',', 'world', '!', 'This', 'is', 'Python', '.']
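spaCy's tokenizer handles similar edge cases through built-in exception rules; for example (output from en_core_web_sm; details may vary between model versions):
print(tokenize_spacy("We're visiting the U.K. in 2026."))
# → ['We', "'re", 'visiting', 'the', 'U.K.', 'in', '2026', '.']
Here the contraction is split into 'We' and "'re", while 'U.K.' is kept as a single token.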
2. Lemmatization
- Lemmatization means converting each token to its base (dictionary) form, known as its lemma (for example, running → run, mice → mouse).
▷ Dictionary-based lemmatizer
# Suppose lemma_dict = {"is": "be", "ran": "run", ...}
def lemmatize_dict(tokens, lemma_dict):
    # Look up each token; fall back to the token itself if it is not in the dictionary
    return [lemma_dict.get(tok, tok) for tok in tokens]
tokens = ["this", "is", "running"]
lemma_dict = {"is":"be", "running":"run"}
print(lemmatize_dict(tokens, lemma_dict))
# → ['this', 'be', 'run']
▷ NLTK’s WordNetLemmatizer
import nltk
from nltk.stem import WordNetLemmatizer
# On first run, download the WordNet data:
# nltk.download('wordnet')
wnl = WordNetLemmatizer()
print(wnl.lemmatize("running", pos="v")) # → 'run'
print(wnl.lemmatize("better", pos="a")) # → 'good'
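One detail to keep in mind: lemmatize() defaults to pos="n" (noun), so words that are not nouns come back unchanged unless you supply the right part of speech:
print(wnl.lemmatize("running"))            # → 'running' (treated as a noun)
print(wnl.lemmatize("cats"))               # → 'cat'
print(wnl.lemmatize("running", pos="v"))   # → 'run'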
▷ spaCy Lemmatizer
import spacy
# Load English model (once)
nlp = spacy.load("en_core_web_sm")
def lemmatize_spacy(text):
    """
    Tokenize and lemmatize with spaCy.
    Returns list of lemmas in input order.
    """
    doc = nlp(text)
    return [token.lemma_ for token in doc]
sample = "Running faster makes you better."
print(lemmatize_spacy(sample))
# → ['run', 'fast', 'make', 'you', 'good', '.']
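It is often helpful to look at tokens and their lemmas side by side; a quick check (output from en_core_web_sm; other model versions may differ in small ways):
doc = nlp("The children were running.")
print([(token.text, token.lemma_) for token in doc])
# → [('The', 'the'), ('children', 'child'), ('were', 'be'), ('running', 'run'), ('.', '.')]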
3. Frequency Calculation
Text analysis often starts with frequency counts: for each word token (or lemma), how many times does it appear in the text?
▷ Using collections.Counter
from collections import Counter
tokens = ["a", "b", "a", ".", "a", "b"]
freq = Counter(tokens)
print(freq)
# → Counter({'a': 3, 'b': 2, '.': 1})
# Top 3 most common:
print(freq.most_common(3))
# → [('a', 3), ('b', 2), ('.', 1)]
▷ Simple loop version
def freq_simple(tokens):
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts
print(freq_simple(tokens))
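The plain-dict version produces the same counts ({'a': 3, 'b': 2, '.': 1}) but has no most_common method; to get the top items you can sort by count yourself:
counts = freq_simple(tokens)
top3 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
print(top3)
# → [('a', 3), ('b', 2), ('.', 1)]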
End-to-End: Building a corpus_freq function
Combine all the steps above to process every .txt file that matches a filename pattern.
import glob
from collections import Counter
import spacy
# Load English spaCy model once
nlp = spacy.load("en_core_web_sm")
def tokenize_and_lemmatize(text):
    """
    Tokenize and lemmatize using spaCy.
    Returns a list of lemmas, skipping spaces and punctuation.
    """
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc
            if not token.is_space and not token.is_punct]
def corpus_freq(pattern):
    """
    Reads every .txt file matching the glob pattern (e.g., "corpus/*.txt")
    and builds a frequency Counter of lemmas across all of them.
    """
    total = Counter()
    for filepath in glob.glob(pattern):
        with open(filepath, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        lems = tokenize_and_lemmatize(text)
        total.update(lems)
    return total
# Example usage:
freq = corpus_freq("freq_test.txt")
print(freq.most_common(10))
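A literal filename is itself a valid glob pattern, which is why the call above reads just freq_test.txt. To combine frequencies across a whole folder of texts, pass a wildcard pattern instead (the corpus/ folder below is only a placeholder for wherever your files live):
freq_all = corpus_freq("corpus/*.txt")
print(freq_all.most_common(10))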
4. Exercises
In this exercise, you’ll learn how to download a raw text from Project Gutenberg, clean it, tokenize it, and compute basic statistics like word frequencies. You’ll be working with Text #2554 (Crime and Punishment). Follow the steps below and answer the required questions. Submit your work as a Python notebook, including all code and processed outputs.
[1] Download the text
Use the following code to download the full text from Project Gutenberg.
from urllib.request import urlopen
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read().decode('utf-8')
print(type(raw))
print(len(raw))
print(f"This text is about: {raw[54:75]}")
[2] Trim the header and footer
Remove the Project Gutenberg metadata and keep only the actual story content.
start = raw.find("PART I")
end = raw.rfind("End of Project Gutenberg's Crime")
raw = raw[start:end]
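To confirm the slicing worked, you can peek at the start of the trimmed string and compare its length with the value printed earlier (a quick optional check):
print(raw[:20])   # should now begin with 'PART I'
print(len(raw))   # noticeably shorter than the full download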
[3] Tokenize the text
[4] Create a text object (for example, an nltk.Text object built from your tokens)
[5] Calculate lemmatized word frequencies
[6] Print (1) the 20 most frequent lemmas and (2) the 20 least frequent lemmas
[7] What observations can you make by comparing the most and least frequent lemmas?