Collocation
This module includes: (1) collocation analysis, (2) strength of association, and (3) an exercise.

1. Collocation analysis
-
As introduced in the concordance section, analyzing word sequences reveals how language is naturally used. This connects to collocations (i.e., word pairs or phrases that frequently occur together), which are often captured using n-grams: contiguous sequences of n words.
-
Studying collocations via n-gram analysis helps to (1) assess native-like fluency (e.g., make a decision vs. do a decision, which one sounds more natural?) and (2) extract informative n-grams for language modeling and information retrieval.
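To make the n-gram idea concrete, here is a minimal sketch of bigram (2-gram) extraction; the bigrams() helper is purely illustrative and is not one of the functions we build in this module.
# Illustration only: extract bigrams (2-grams) from a token list
def bigrams(tokens):
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
print(bigrams(["make", "a", "decision", "today"]))
# → [('make', 'a'), ('a', 'decision'), ('decision', 'today')]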
-
To build this function, similar to what we did for the concordance analysis, we can break the analysis down into the following steps:
- Tokenize the text
- Generate word frequency across the entire corpus
- Generate word frequency within contexts
- Calculate association strength between word pairs using statistical measures (e.g., Mutual information [MI])
▷ Tokenize
We will use an improved version of the tokenize() function that handles a more comprehensive set of punctuation marks and normalizes whitespace. In addition, we will implement a lemmatize() function using spaCy, which leverages part-of-speech information for more accurate lemmatization.
# setup (one-time)
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
def tokenize_and_lemmatize(text):
    # Normalize line breaks and tabs
    text = text.replace("\n", " ").replace("\t", " ").replace("\r", " ")
    # Process the text with spaCy
    doc = nlp(text.lower())
    # Return lemmatized tokens (excluding whitespace tokens)
    return [token.lemma_ for token in doc if not token.is_space]
text = "Cats are running! They're chasing mice."
lemmas = tokenize_and_lemmatize(text)
print("Lemmas:", lemmas) # → ['cat', 'be', 'run', '!', 'they', 'be', 'chase', 'mouse', '.']
▷ Calculate frequency
In the previous module, we built a simple frequency function that created a new dictionary and counted the occurrences of each word. In this tutorial, we’ll slightly modify that approach by defining a function that updates an existing frequency dictionary rather than creating a new one.
def freq_update(tok_list, freq_dict):
    for token in tok_list:
        if token not in freq_dict:
            freq_dict[token] = 1
        else:
            freq_dict[token] += 1
It works like this:
sampd = {"a": 1}
print(sampd) # → {'a': 1}
samp_list = ["a", "b"] # new list of items
freq_update(samp_list, sampd) # update the dictionary based on the new list
print(sampd) # → {'a': 2, 'b': 1}
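As a side note, Python's built-in collections.Counter can do the same job; we keep the plain-dictionary version so the counting logic stays explicit, but here is the equivalent for comparison:
from collections import Counter
sampc = Counter({"a": 1})
sampc.update(["a", "b"]) # same effect as freq_update(samp_list, sampd)
print(sampc) # → Counter({'a': 2, 'b': 1})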
▷ Calculate context frequency
Now, we will combine the tokenize_and_lemmatize() and freq_update() functions with parts of the concord() and concord_regex() functions from the previous module to create a single function: context_freq().
This function will calculate:
- Frequency of each target word (or pattern)
- Frequency of left and right collocates
- Combined collocate frequency
- Full corpus word frequency
We’ll use Python’s built-in isinstance() function to check whether the target is a string (treated as a regular expression pattern) or a list of words.
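As a small illustration of how the two target types behave (note that re.match() only anchors at the beginning of the token):
import re
print(isinstance("make.*", str)) # → True: a string target is treated as a regex pattern
print(bool(re.compile("make.*").match("makes"))) # → True: "makes" begins with "make"
print(bool(re.compile("make.*").match("remake"))) # → False: the token must begin with "make"
print(isinstance(["book"], list)) # → True: a list target is treated as a set of exact word forms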
import re
def context_freq(tok_list, target, nleft=10, nright=10):
    left_freq = {}
    right_freq = {}
    combined_freq = {}
    target_freq = {}
    corp_freq = {}
    for idx, token in enumerate(tok_list):
        # Update overall corpus frequency
        freq_update([token], corp_freq)
        # Check if token matches target
        hit = False
        if isinstance(target, str) and re.compile(target).match(token):
            hit = True
        elif isinstance(target, list) and token in target:
            hit = True
        if hit:
            # Left context
            left = tok_list[max(0, idx - nleft):idx]
            freq_update(left, left_freq)
            freq_update(left, combined_freq)
            # Target word
            freq_update([token], target_freq)
            # Right context
            right = tok_list[idx + 1:idx + nright + 1]
            freq_update(right, right_freq)
            freq_update(right, combined_freq)
    return {
        "left_freq": left_freq,
        "right_freq": right_freq,
        "combined_freq": combined_freq,
        "target_freq": target_freq,
        "corp_freq": corp_freq
    }
Let’s test it on a sample text (generated by ChatGPT) and explore collocates of the word “book”:
sample = """
Last weekend, I visited a small bookstore downtown. I picked up a mystery book and started reading it at a nearby café.
The book was so engaging that I lost track of time. I usually prefer reading fiction, but this nonfiction book about creativity really surprised me.
Later, I recommended the book to two of my friends who also enjoy thoughtful reads.
"""
# Run context-based frequency analysis
book_freqs = context_freq(tokenize_and_lemmatize(sample), ["book"], nleft=4, nright=4)
# View results
print("Target frequency:", book_freqs["target_freq"])
print("Left collocates:", book_freqs["left_freq"])
What did we find? In the sample passage, the word book appears four times. Among the words that occur within four words to the left of book, the most frequent collocate is “the”, which reflects common article use in noun phrases.
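We can inspect the right-hand context in the same way:
# Collocates occurring within four words to the right of "book"
print("Right collocates:", book_freqs["right_freq"])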
▷ Sort the results
As we continue building functions that return frequency dictionaries, we will often want to preview the top results. The head() function below allows you to (1) sort and preview the top n items in a dictionary and (2) optionally return the sorted list or save it to a file.
import operator
def head(stat_dict, hits=20, hsort=True, output=False, filename=None, sep="\t"):
    # Sort by value and keep the top `hits` items
    sorted_list = sorted(stat_dict.items(), key=operator.itemgetter(1), reverse=hsort)[:hits]
    # Default behavior: print the results to the screen
    if not output and filename is None:
        for item, value in sorted_list:
            print(f"{item}{sep}{value}")
    # If a filename is given, write the results to a file instead
    elif filename:
        with open(filename, "w") as outf:
            outf.write("item\tstatistic")
            for item, value in sorted_list:
                outf.write(f"\n{item}{sep}{value}")
    # Optionally return the sorted list for further processing
    if output:
        return sorted_list
# Print top 10 most frequent left-context collocates for "book"
head(book_freqs["left_freq"], hits=10)
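The same call can also write the results to disk and hand them back as a list; the file name below is just an example.
# Save the top 10 left collocates to a tab-separated file and keep the sorted list
top_left = head(book_freqs["left_freq"], hits=10, output=True, filename="book_left_collocates.txt")
print(top_left[:3]) # first three (collocate, frequency) pairs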
▷ Analyzing across a corpus
So far, the context_freq() function has worked on a single text. Now, we’ll extend it to handle entire corpora (collections of multiple .txt files in a folder) by creating a new function called corpus_context_freq().
This function will:
- Loop through all .txt files in a given folder
- Tokenize and lemmatize each file
- Identify matches to the target (a word list or regular expression)
- Count collocates to the left and right of each match
- Return 5 frequency dictionaries:
- left_freq: collocates in the left context
- right_freq: collocates in the right context
- combined_freq: collocates from both sides
- target_freq: all matched target items
- corp_freq: frequency of all words in the corpus
We also set default values (nleft=5, nright=5) to make the function easier to use.
import re
import glob
def corpus_context_freq(dir_name, target, nleft=5, nright=5):
    left_freq, right_freq, combined_freq = {}, {}, {}
    target_freq, corp_freq = {}, {}
    # Loop through all .txt files in the folder
    filenames = glob.glob(dir_name + "/*.txt")
    for filename in filenames:
        with open(filename, errors="ignore") as f:
            text = f.read()
        # Tokenize and lemmatize each file
        tok_list = tokenize_and_lemmatize(text)
        for idx, token in enumerate(tok_list):
            freq_update([token], corp_freq)
            # Check if the token matches the target (regex string or word list)
            hit = False
            if isinstance(target, str) and re.compile(target).match(token):
                hit = True
            elif isinstance(target, list) and token in target:
                hit = True
            if hit:
                left = tok_list[max(0, idx - nleft):idx]
                right = tok_list[idx + 1:idx + nright + 1]
                freq_update(left, left_freq)
                freq_update(right, right_freq)
                freq_update(left + right, combined_freq)
                freq_update([token], target_freq)
    return {
        "left_freq": left_freq,
        "right_freq": right_freq,
        "combined_freq": combined_freq,
        "target_freq": target_freq,
        "corp_freq": corp_freq
    }
Download the sample corpus folder mini_corpus, unzip it, and place it in your working directory.
mini_context_freq = corpus_context_freq("./mini_corpus", "make.*")
# View most frequent matched targets
head(mini_context_freq["target_freq"])
You can view the left and right context frequencies in the same way by calling the head() function on “left_freq” and “right_freq”, respectively.
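For example:
# Most frequent collocates to the left and to the right of the matched targets
head(mini_context_freq["left_freq"], hits=10)
head(mini_context_freq["right_freq"], hits=10)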
We can also check the overall frequency of words in the corpus:
# Get the ten most frequent words in the corpus
head(mini_context_freq["corp_freq"], hits=10)
2. Strength of association
In the previous section, we observed that many of the high-frequency collocates of our target (e.g., the, of) were also simply frequent words overall. This raises a key issue: raw co-occurrence alone doesn’t always reflect meaningful association.
To address this, we’ll introduce a function called soa() that calculates the strength of association between target words and their collocates, controlling for how often each word appears in the corpus.
▷ SOA function
The soa() function takes:
- freq_dict: the output from corpus_context_freq()
- cut_off: a minimum frequency threshold (default = 5) to filter out rare collocates

It returns a dictionary of dictionaries, each containing a different strength-of-association measure:
| Key | Description |
|---|---|
| mi | Mutual Information – favors rare but exclusive collocates |
| tscore | T-score – favors frequent and reliable collocates |
| deltap_coll_cue | Delta P (target / collocate) – how much more likely the target is to occur when the collocate is present |
| deltap_target_cue | Delta P (collocate / target) – how much more likely the collocate is to occur when the target is present |
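Before looking at the full function, here is a small worked example of the arithmetic behind these measures, using invented counts (a toy corpus of 10,000 tokens) purely for illustration:
import math
# Toy counts (invented for illustration only)
corpus_size = 10000    # total tokens in the corpus
target_freq = 50       # occurrences of the target
collocate_freq = 200   # occurrences of the collocate anywhere in the corpus
observed = 20          # co-occurrences of the collocate near the target
# Expected co-occurrence frequency if target and collocate were independent
expected = (target_freq * collocate_freq) / corpus_size # = 1.0
print(math.log2(observed / expected)) # MI ≈ 4.32
print((observed - expected) / math.sqrt(observed)) # T-score ≈ 4.25
# 2x2 contingency table cells
a = observed                  # collocate near the target
b = collocate_freq - a        # collocate elsewhere
c = target_freq - a           # target without this collocate
d = corpus_size - (a + b + c) # everything else
print((a / (a + b)) - (c / (c + d))) # Delta P (target / collocate) ≈ 0.097
print((a / (a + c)) - (b / (b + d))) # Delta P (collocate / target) ≈ 0.382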
import math
def soa(freq_dict, cut_off=5):
    mi = {}
    tscore = {}
    deltap_coll_cue = {}
    deltap_target_cue = {}
    corpus_size = sum(freq_dict["corp_freq"].values())
    target_freq = sum(freq_dict["target_freq"].values())
    for collocate in freq_dict["combined_freq"]:
        observed = freq_dict["combined_freq"][collocate]
        collocate_freq = freq_dict["corp_freq"][collocate]
        if observed >= cut_off:
            # Expected co-occurrence frequency under independence
            expected = (target_freq * collocate_freq) / corpus_size
            mi[collocate] = math.log2(observed / expected)
            tscore[collocate] = (observed - expected) / math.sqrt(observed)
            # 2x2 contingency table cells
            a = observed                  # collocate near the target
            b = collocate_freq - a        # collocate elsewhere
            c = target_freq - a           # target without this collocate
            d = corpus_size - (a + b + c) # everything else
            deltap_coll_cue[collocate] = (a / (a + b)) - (c / (c + d))
            deltap_target_cue[collocate] = (a / (a + c)) - (b / (b + d))
    return {
        "mi": mi,
        "tscore": tscore,
        "deltap_coll_cue": deltap_coll_cue,
        "deltap_target_cue": deltap_target_cue
    }
Let’s apply soa() to the mini corpus results from earlier:
mini_soa = soa(mini_context_freq)
# View top 10 collocates ranked by Mutual Information
head(mini_soa["mi"], hits=10)
hamster 4.8413926852368645
to 4.533947986242736
on 4.324552905435161
of 4.1599222036623615
then 4.050987832109198
and 3.8781512352469822
they 3.615948114392044
. 3.5492591926098553
the 3.411833231024445
, 3.3770203253292967
Key insights:
- "hamster" (MI = 4.84): This unusually high score likely reflects a small number of highly restricted co-occurrences. You may want to confirm this using a concordance search.
- Prepositions like "to" (4.53), "on" (4.32), and "of" (4.16): These are typical verb–preposition constructions (e.g., go to school, go on vacation), explaining their strong association.
- "then" (4.05): Likely occurs in sequential expressions such as go then stop.
- "they" (3.62): May reflect frequent subject–verb patterns (e.g., they go home).
- Punctuation and function words: Tokens like ".", ",", and "the" appear due to their high frequency and statistical exclusivity, though they may not be meaningful collocates.
- ⚠️ MI Caveat: Mutual Information favors rare but exclusive co-occurrences. The high MI for "hamster" may reflect quirky or meme-like usages rather than consistent collocational behavior.
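One quick way to follow up on these observations, using only the functions defined above, is to check how often "hamster" actually occurs in the corpus and to compare the MI ranking with the T-score ranking, which rewards frequent rather than rare collocates:
# Raw corpus frequency of "hamster"
print(mini_context_freq["corp_freq"].get("hamster", 0))
# Top collocates by T-score, for comparison with the MI ranking
head(mini_soa["tscore"], hits=10)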
3. Exercise
Use the Brown corpus (brown_corpus folder) for the following exercises. (If you have other text that you want to investigate, feel free to use it!)
- Investigate collocations for 3 targets of your choice and compute association scores.
- Choose your targets (a mixture of different POSs is encouraged), for example:
- Noun (e.g., decision, policy)
- Verb (e.g., make)
- A morphological pattern (recall how to use regex) (e.g., .*ation, investigat.*)
- You may also experiment with different sub-corpora of the Brown corpus:
- Use Brown (all categories) or a single category (e.g., news, fiction).
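As a starting point, a run for a single noun target might look like the sketch below; it assumes the brown_corpus folder is in your working directory, and you should swap in your own targets and settings.
# Example starter: collocates of "decision" in the Brown corpus
brown_freqs = corpus_context_freq("./brown_corpus", ["decision"], nleft=5, nright=5)
brown_soa = soa(brown_freqs)
head(brown_soa["mi"], hits=10)     # favors rare but exclusive collocates
head(brown_soa["tscore"], hits=10) # favors frequent, reliable collocates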