Solution Exercise 43

To solve this exercise you will first need to download this text file. It is the full text of Alice in Wonderland by Lewis Carroll. Write a program that reads the entire file and counts the number of times each word occurs in the text. You should remove punctuation characters such as periods and commas from the text, without touching apostrophes found inside a word. When finished, the program should present a sorted list of the 100 most frequent words in the text. It should also present the longest word in the text.

Hint:
Consider using Counter from the collections module and a regular expression to remove all unwanted punctuation marks.
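
As a minimal sketch of the hint, the snippet below cleans a short sample string instead of the book and counts its words. The sample text and the pattern are illustrative assumptions, not part of the exercise. The pattern replaces every character that is not a letter, whitespace or apostrophe with a space, and also replaces apostrophes that do not sit between two letters.

```python
from collections import Counter
import re

# Hypothetical sample text; the real exercise reads the book from a file.
sample = "Don't stop, Alice! 'Stop,' said the Queen. Alice didn't stop."

# Replace punctuation with spaces, but keep apostrophes that have a
# letter on both sides (as in "don't").
cleaned = re.sub(r"[^a-z'\s]+|(?<![a-z])'|'(?![a-z])", " ", sample.lower())

counts = Counter(cleaned.split())
print(counts.most_common(2))   # [('stop', 3), ('alice', 2)]
print(max(counts, key=len))    # didn't
```

Counter.most_common(n) returns the n most frequent items together with their counts, so no separate sorting step is needed.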

When you have solved it you can, for some extra challenge, try to make your program more compact. The solution is only 26 lines long including code that formats a nice output. Can you beat that?

from collections import Counter
from re import findall
def main():
    counter = Counter()
    with open("alice.txt", "r") as f:  # the "U" mode flag was removed in Python 3.11
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(findall(r"[a-z]+(?:'[a-z]+)*", line))  # keeps apostrophes inside words
    freqSorted = sorted(counter, key=counter.get, reverse=True)
    longestWord = max(counter, key=len)
    wordLength = len(longestWord)
    
    count = 0
    print("Word".rjust(16),"|","Count".ljust(16))
    print("="*35)
    for word in freqSorted:
        print(word.capitalize().rjust(16),"|",str(counter[word]).ljust(16))
        count += 1
        if count >= 100:
            break
    print("The longest word in the text is", longestWord.title(), "with", wordLength, "characters")

if __name__ == '__main__':
    main()
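
For the extra challenge, one possible more compact variant is sketched below. It is not the reference solution. The names summarize and report are hypothetical, and the sketch assumes the file is UTF-8 encoded, uses straight apostrophes, and is small enough to read into memory at once.

```python
from collections import Counter
from pathlib import Path
import re

def summarize(path, top=100):
    """Return (most frequent words with counts, longest word) for a text file."""
    # Assumption: UTF-8 text with straight apostrophes (').
    text = Path(path).read_text(encoding="utf-8").lower()
    # A word is a run of letters, optionally joined by apostrophes inside it.
    counts = Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text))
    return counts.most_common(top), max(counts, key=len, default="")

def report(path):
    """Print the frequency table and the longest word."""
    frequent, longest = summarize(path)
    print("Word".rjust(16), "|", "Count".ljust(16))
    print("=" * 35)
    for word, n in frequent:
        print(word.capitalize().rjust(16), "|", str(n).ljust(16))
    print("The longest word in the text is", longest.capitalize(),
          "with", len(longest), "characters")
```

Calling report("alice.txt") prints the table. Matching the words directly with findall avoids a separate cleaning pass, and most_common(top) replaces both the manual sort and the loop counter.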