This module includes:


1. Concordance

  • Concordancing is a method in corpus linguistics that involves identifying and displaying occurrences of a target linguistic item (usually a word or phrase) along with its surrounding context.

  • It allows researchers to examine how language is used in authentic texts. For example, by seeing how the word “investigate” is used in different contexts, we can understand both its meaning and function in real-world discourse. (KWIC: Key Word in Context)

  • To build this function, we need to think through several steps:

    • Tokenize the text
    • Locate every hit of the target item
      • Support case-insensitive matching
      • Store the index of each matching token for later slicing
    • Extract context windows
      • For each hit, slice window tokens to the left and right.
      • Join the slice plus the keyword into a single KWIC line.
    • Handle large corpora gracefully
      • If the user requests a sample of size N, randomly select N hits before formatting (so results stay manageable and replicable)
    • Return and/or save the results
      • Return a list of KWIC strings so the notebook can display them immediately.
      • Save the output into a .txt file (one KWIC line per row) for later analysis or citation.

First, we’ll hand-code the functions; next, we’ll switch to NLTK’s built-in concordance tools.

▷ Tokenize

def tokenize(text):

    # You can expand these lists as needed
    punct_list   = [".", "?", "!", ",", "'"]
    replace_list = ["\n", "\t"]
    ignore_list  = [""]                 # drop empty strings after split

    # Separate punctuation with a leading space
    for p in punct_list:
        text = text.replace(p, " " + p)

    # Normalise line breaks, tabs, etc.
    for r in replace_list:
        text = text.replace(r, " ")

    # Lowercase (simple English-only normalisation)
    text = text.lower()

    # Split on spaces
    raw_tokens = text.split(" ")

    # Remove unwanted items
    tokens = [tok for tok in raw_tokens if tok not in ignore_list]

    return tokens
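Before moving on, it helps to see what tokenize() returns. A quick check on a made-up sentence (any short string will do) shows punctuation split off into separate tokens and everything lowercased:

sample_sentence = "Do you drink coffee, tea, or water?"
print(tokenize(sample_sentence))
# ['do', 'you', 'drink', 'coffee', ',', 'tea', ',', 'or', 'water', '?']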

▷ Locate every hit of the target item and extract context windows

Now that we can produce a tokenized list with the previous function, we can build a function that finds every occurrence of a target word and extracts its left and right context.

This function takes:

  • tok_list: a tokenized list of strings (e.g., output from tokenize())
  • target: a list of one or more target words
  • nleft: number of words to include on the left (preceding context)
  • nright: number of words to include on the right (following context)

def concord(tok_list, target, nleft, nright):
    hits = []  # empty list to store matches

    for idx, x in enumerate(tok_list):  # loop through tokens with index
        if x in target:  # if current token matches one of our targets

            # Get left context
            if idx < nleft:
                left = tok_list[:idx]  # if near the start, just take what's available
            else:
                left = tok_list[idx - nleft:idx]  # otherwise, take nleft tokens before

            t = x  # matched keyword (center item)

            # Get right context
            right = tok_list[idx + 1:idx + nright + 1]

            # Save the triple
            hits.append([left, t, right])

    return hits

Let’s test the function using example sentences. We’ll look for the word “coffee” and extract 4 words before and after each hit.

sample_list = (
    "Some people drink coffee every morning. "
    "I prefer tea, but I enjoy iced coffee in the summer."
).lower().split(" ")

samp_hits = concord(sample_list, ["coffee"], 4, 4)  # target must be a list

for hit in samp_hits:
    print(hit)
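Each hit is a [left, keyword, right] triple of token lists. If you would rather see a single KWIC-style line per hit (as described in the steps above), you can join the pieces yourself; kwic_line() below is just an illustrative helper, not part of the core functions:

def kwic_line(hit):
    left, target, right = hit
    return " ".join(left) + "\t" + target + "\t" + " ".join(right)

for hit in samp_hits:
    print(kwic_line(hit))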

▷ Using regex?

For some applications, it will make more sense to conduct concordance searches using regular expressions.

import re  # Regular expressions module

def concord_regex(tok_list, target_regex, nleft, nright):
    hits = []
    pattern = re.compile(target_regex)  # compile the pattern once instead of on every token

    for idx, token in enumerate(tok_list):
        if pattern.match(token):
            # Get left context
            left = tok_list[max(0, idx - nleft):idx]
            # Target token
            target = token
            # Get right context
            right = tok_list[idx + 1:idx + nright + 1]
            # Append as [left, target, right]
            hits.append([left, target, right])

    return hits

sample_text = """
In my free time, I enjoy playing soccer with friends. I used to play the guitar and often watched videos of famous guitarists on YouTube, but now I prefer watching movies or reading books. Still, I sometimes play board games on weekends.
""".lower().split()

# Look for words that start with "play" using regex
play_hits = concord_regex(sample_text, "play.*", 4, 4)

# Print results
for hit in play_hits:
    print(hit)
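One thing to keep in mind: re.match() only anchors the pattern at the start of the token, so “play.*” would also catch words like “player” or “playground”. If you want the pattern to cover the entire token, re.fullmatch() is one option; the lines below simply illustrate the difference:

word = "playground"
print(re.match("play", word) is not None)        # True: matches at the start of the token
print(re.fullmatch("play", word) is not None)    # False: the whole token must match
print(re.fullmatch("play.*", word) is not None)  # True: ".*" covers the rest of the token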

▷ Handle large corpora gracefully

When dealing with multiple documents (or a large corpus), it’s important to keep your concordance results manageable and reproducible. One way to do this is by generating a random sample of hits using the random.sample() function.

import random

The random module lets us take a random sample from any list. We’ll use random.sample(population, k), where population is the list to draw from and k is the number of items we want.

By default, Python seeds its random-number generator with the current time, so each run yields a different sample. To get the same sample every time (handy for debugging or sharing results), fix the seed once at the start:

random.seed(42)   # choose any integer you like
sample_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]  # list to sample from

r_samp = random.sample(sample_list, 5)  # draw a random sample of five items from sample_list
print(r_samp)           # → [4, 1, 9, 8, 17]

Now, let’s combine everything we’ve learned into a larger function called corp_conc(). This function does the following:

  • Reads all .txt files from a specified folder (our corpus)
  • Tokenizes the text in each file
  • Locates every hit of the target word(s) using concord()
  • Returns a random sample of those hits (or all of them, if there are fewer than requested)

import glob  # required to list files in a folder

def corp_conc(corp_folder, target, nhits, nleft, nright):
    hits = []

    filenames = glob.glob(corp_folder + "/*.txt")  # get all .txt files in the folder

    for filename in filenames:
        with open(filename, encoding="utf-8") as f:
            text = tokenize(f.read())  # read and tokenize each file

        # Use concord() to find hits and add them to the global list
        for x in concord(text, target, nleft, nright):
            hits.append(x)

    # Handle sampling logic
    if len(hits) <= nhits:
        print("Search returned " + str(len(hits)) + " hits.\nReturning all hits.")
        return hits
    else:
        print("Search returned " + str(len(hits)) + " hits.\nReturning a random sample of " + str(nhits) + " hits.")
        return random.sample(hits, nhits)

▷ Return the results

Download the sample corpus folder brown_corpus, unzip it, and place it in your working directory.

Then run the following code to search for 25 random hits of the word “develop” and its variants, with 5 words of context on each side:

target_list = [
    "develop", "develops", "developed",
    "developing", "development", "developments"
]

develop_conc_25 = corp_conc("brown_corpus", target_list, 25, 5, 5)

for hit in develop_conc_25:
    print(hit)

▷ Save the results

def write_concord(outname, conc_list):
    with open(outname, "w", encoding="utf-8") as outf:
        outf.write("left_context\tnode_word\tright_context")  # optional header
        for x in conc_list:
            left = " ".join(x[0])     # left context as string
            target = x[1]             # the keyword
            right = " ".join(x[2])    # right context as string
            outf.write(f"\n{left}\t{target}\t{right}")

write_concord("develop_conc.txt", develop_conc_25)
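As a quick, optional sanity check, you can read the file back in and print the first few rows to confirm the tab-separated layout:

with open("develop_conc.txt", encoding="utf-8") as f:
    for line in f.readlines()[:3]:  # header plus the first two hits
        print(line.rstrip("\n"))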

2. Exercises

Use the Brown corpus (brown_corpus folder) for the following exercises.

[1] Exploring polysemy

  • Create a sample of 50 concordance lines for the word “record” using the corp_conc() function. Set the context window to 10 words on each side. Then write the results to a file.

  • After saving the file, open it and identify at least two distinct senses of the word “record” (e.g., noun vs. verb). Include at least two examples of each sense. Report your findings using # comments.

[2] Regex concordance

  • Update your corp_conc() function so that it uses the concord_regex() function instead of concord(). Call the new version corp_conc_regex().

  • Then, generate 50 concordance lines for any words that begin with “repe” (e.g., repeat, repeated, repetition). Write your output to a file.

  • Report the most frequent root word in your sample. Use comments to explain.

[3] Nominalization Search — Words ending in “ation”

  • Using regex and the corp_conc_regex() function, create a concordance sample of 50 words that contain the “-ation” nominalizing suffix. Include plural forms (e.g., “narrations”), but try to exclude non-transparent forms like “nation.”

3. Search word usage with concordance

Now that we’ve built our own concordance functions, we’ll switch to NLTK’s built-in concordancing tools (NLTK, p. 40).

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
from nltk.text import Text

# Load the tokenized words and wrap them as an NLTK Text object
emma_words = gutenberg.words('austen-emma.txt')
emma_text = Text(emma_words)

  • Once you have a Text object, you can use .concordance() to search for any word and display it in context.

emma_text.concordance("surprize")
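By default, .concordance() prints up to 25 lines at a display width of 79 characters; both can be adjusted with the lines and width keyword arguments (check your installed NLTK version’s documentation if the defaults differ):

emma_text.concordance("surprize", width=100, lines=10)  # wider context window, fewer lines shown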

▷ Apply it to our own corpus

from nltk.text import Text
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

# Read and tokenize one file
with open("brown_corpus/editorial.txt", encoding="utf8") as f:
    raw_text = f.read()

tokens = word_tokenize(raw_text)  # or use your own tokenizer
brown_local_text = Text(tokens)  # Tokenized words are wrapped in an nltk.Text object, you can search just like before
brown_local_text.concordance("record")
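If you want NLTK’s concordance over the whole corpus rather than a single file, one option is to read and tokenize every .txt file first and wrap the combined token list in a single Text object. The sketch below reuses the glob module imported earlier and assumes the same brown_corpus folder:

all_tokens = []
for filename in glob.glob("brown_corpus/*.txt"):
    with open(filename, encoding="utf8") as f:
        all_tokens.extend(word_tokenize(f.read()))

brown_corpus_text = Text(all_tokens)
brown_corpus_text.concordance("record", lines=10)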