Reading Through Shakespeare's Text Files

Practicing the ins-and-outs of managing files and reading through them, featuring the works of Shakespeare.
This assignment is due on Tuesday, January 26
8 exercises
6.0 possible points
Create a subfolder named 0004-shakefiles inside your compciv-2016/exercises folder.

Summary

This set of exercises is less meant as a challenge and more a test to make sure you know how to navigate your own file system, and also a tour of how file operations work “under the hood”. Virtually all of the answers to the exercises are given and explained in this separate guide.

The Checklist

In your compciv-2016 Git repository create a subfolder and name it:

     exercises/0004-shakefiles

The folder structure will look like this (not including any subfolders such as `tempdata/`):

        compciv-2016
        └── exercises
            └── 0004-shakefiles
                ├── a.py
                ├── b.py
                ├── c.py
                ├── d.py
                ├── e.py
                ├── f.py
                ├── g.py
                └── h.py
    
a.py 0.5 points Create the `tempdata` directory idempotently
b.py 0.5 points Download the zip file of Shakespearean texts to the tempdata directory
c.py 0.5 points Unzip the contents of the Shakespearean zip file into tempdata
d.py 0.5 points Print the first 5 lines of the Hamlet text
e.py 0.5 points Read through and count each line in the Hamlet text, then print the total number of lines
f.py 1.0 points Print the final 5 lines of Romeo and Juliet
g.py 1.5 points Print the final 5 lines for all of Shakespeare's tragedies
h.py 1.0 points Count and print the number of non-blank lines for each of Shakespeare's tragedies

Background information

This is an expansive, overly verbose set of exercises that not only covers a fairly boring topic – how to organize and read files – but also attempts to introduce software design concepts, such as how to write a program by tackling its smallest problem first and then stepping backwards through the process, rather than taking the typical top-down approach.

Although these exercises follow basic patterns that will apply to virtually everything else we'll do, don't worry about memorizing the details. Make sure that you can actually get the code to work on your computer. And make sure you can reason through it. Because all of the finished code and answers are basically just given to you, I'm expecting that you actually take the time to write it out, and not just copy-paste it.

The finished programs are fairly intimidating at first glance. However, even just typing out the code and changing up the variables will slow you down enough to see how everything fits together. Try re-arranging or tidying up the code on your own.

For example, I'm often verbose in my solutions so that you can follow the process, line-by-line:

from glob import glob
from os.path import join
DATA_DIR = 'tempdata'
filepattern = join(DATA_DIR, '**', '*')
filenames = glob(filepattern)

But if you're wondering, "Well, that seems like it could all be one line" – then do it yourself, and see what happens:

filenames = glob(join('tempdata', '**', '*'))
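
If you want to convince yourself that the two versions really are equivalent, run both and compare the results. Here's a minimal sketch of that check. It builds its own throwaway directory (the `demo_dir` and `demo_file` names are made up for illustration) so that it works even before you've downloaded anything:

```python
import tempfile
from glob import glob
from os import makedirs
from os.path import join

# Build a throwaway directory laid out like tempdata/<subfolder>/<file>
demo_dir = tempfile.mkdtemp()
makedirs(join(demo_dir, 'tragedies'))
demo_file = join(demo_dir, 'tragedies', 'hamlet')
with open(demo_file, 'w') as f:
    f.write('HAMLET\n')

# The verbose, step-by-step version
filepattern = join(demo_dir, '**', '*')
verbose_names = glob(filepattern)

# The one-line version
oneline_names = glob(join(demo_dir, '**', '*'))

# Both approaches should find exactly the same filenames
print(verbose_names == oneline_names)
```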

Don't take my solutions as gospel – that's not the way programming works. You should try things out that seem to make sense to you. In later exercises, I will not be doing as much hand-holding, and it's going to be more of a free-for-all in terms of how I feel like naming things and organizing my code. Since the provided code and answers should just "work", you should take the time to be confident with not just rewriting the code, but altering it to your tastes.

Before you start this lesson

Create a .gitignore

Your Git repository should be properly configured. In your compciv-2016 folder, create a text file named .gitignore. It should contain this:

.DS_Store
creds_*
tempdata/
__pycache__/
*.py[cod]

Here's an example of what it would look like in your GitHub repo.

The point of this file is to keep you from pushing tempdata into your GitHub repository. During the exercise, the tempdata directory contains downloaded files that you work with but never actually alter. Thus, I don't ever need to see this directory, as I can recreate it on my own.

Basically, there's no point in everyone pushing Shakespeare's complete works into their online repos. The upshot is that you will never see tempdata when running any of the git commands, such as git status. This is the point of .gitignore.

Reading about the fundamentals

Though I've created guides on how to complete every one of these exercises, it's expected that you've read the following guides so that you're familiar with the basics:

Conventions for depicting system shell and Python commands
A reference for the conventions I use to differentiate between system shell commands, Python code, and the interactive Python prompt.
For-loop fundamentals
How to repeatedly execute code, over and over, for a specified number of times.
Conditional branching fundamentals
How to use if/else statements to create branches of code in your program that may or may not actually execute.
Opening files and reading from files
How to open files and read from files, and avoid annoying mistakes when reading files.
Downloading files with the Requests library
Using the Requests library for 95% of the kinds of files that we want to download.
Opening files and writing to files
How to open files and write to files and avoid catastrophic mistakes when writing to files.

An overview of the new functions

Here are the specific modules and functions you'll practice in these exercises:

The Exercises

0004-shakefiles/a.py » Create the `tempdata` directory idempotently

0004-shakefiles/a.py
Create the `tempdata` directory idempotently
0.5 points

For many of the assignments, you will be stashing downloaded files and data into a local directory named tempdata. Write a Python program to create that directory. The program should be “smart” enough not to crash/error-out if the tempdata directory already exists.

Expectations

When you run a.py from the command-line:

0004-shakefiles $ python a.py
  • The program should not output anything to screen.
  • The program creates this file path: tempdata (directory)
  • The program must not crash if the tempdata directory already exists.

Some takeaways from this exercise:
  • idempotent is a fun word to use. An idempotent operation leaves things in the same state whether you run it once or many times. It’s also a “feature” that is useful to design towards, as a programmer. You never know how many times your program will be executed, or under what circumstances.

  • It’s kind of neat how os.makedirs() will throw an error if you try to use it to create an existing directory, and you leave out the exist_ok argument. However, other file-system changing functions will not be nearly as careful by default…
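
Here's a minimal sketch of what that idempotent behavior looks like with os.makedirs() and its exist_ok argument. The DATA_DIR name is just my own variable, and the second call is there only to show that repeating the operation is harmless:

```python
from os import makedirs

DATA_DIR = 'tempdata'

# exist_ok=True tells makedirs() not to raise FileExistsError
# when the directory is already there
makedirs(DATA_DIR, exist_ok=True)

# Running it a second time changes nothing and raises no error
makedirs(DATA_DIR, exist_ok=True)
```

Try removing the exist_ok argument and running the script twice to see the error for yourself.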

0004-shakefiles/b.py » Download the zip file of Shakespearean texts to the tempdata directory

0004-shakefiles/b.py
Download the zip file of Shakespearean texts to the tempdata directory
0.5 points

Write the Python commands to download the file from the following URL:

http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz

And save it to:

tempdata/matty.shakespeare.tar.gz

You don’t need to unzip it, just worry about downloading it and saving it to disk.

Expectations

When you run b.py from the command-line:

0004-shakefiles $ python b.py
  • The program's output to screen should be:
    Downloading: http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
    Writing file: tempdata/matty.shakespeare.tar.gz
    
  • The program creates this file path: tempdata/matty.shakespeare.tar.gz
  • The program accesses this remote file: http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
Some takeaways from this exercise:
  • Downloading a file and then saving it to disk is significantly more complicated in code than it is through the browser.

  • This program is idempotent in its end result: if the file has already been downloaded, it will just be re-downloaded and written over the old copy. Sometimes, that’s a good thing. Later on, for truly massive files that just never change, we will probably introduce a conditional statement so that our programs download files only when needed.
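
To give a feel for that "download only when needed" idea, here's a hedged sketch. It goes beyond what b.py asks for, the function name download_if_missing and the "Already have:" message are my own inventions, and it assumes the Requests library is installed:

```python
from os import makedirs
from os.path import dirname, exists


def download_if_missing(url, dest_name):
    """Download url to dest_name unless dest_name already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if exists(dest_name):
        print("Already have:", dest_name)
        return False

    # Imported here so that nothing network-related is touched
    # when the file is already on disk
    import requests

    print("Downloading:", url)
    resp = requests.get(url)
    resp.raise_for_status()  # stop here on a 404 or similar error

    # Make sure the destination folder exists before writing into it
    makedirs(dirname(dest_name) or '.', exist_ok=True)
    print("Writing file:", dest_name)
    with open(dest_name, 'wb') as f:  # 'wb' because the archive is binary
        f.write(resp.content)
    return True
```

You would call it with the URL and the path from this exercise, e.g. download_if_missing('http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz', 'tempdata/matty.shakespeare.tar.gz'). For b.py itself, the unconditional download is what's expected.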

0004-shakefiles/c.py » Unzip the contents of the Shakespearean zip file into tempdata

0004-shakefiles/c.py
Unzip the contents of the Shakespearean zip file into tempdata
0.5 points

Like downloading files, unzipping files is more complicated when you do it programmatically. The zip file might not unpack its contents where you thought it would…

Expectations

When you run c.py from the command-line:

0004-shakefiles $ python c.py
  • The program's output to screen should be:
    Unpacked tempdata/matty.shakespeare.tar.gz into: tempdata
    
  • The program creates this file path: tempdata/comedies (directory)
  • The program creates this file path: tempdata/histories (directory)
  • The program creates this file path: tempdata/poetry (directory)
  • The program creates this file path: tempdata/tragedies (directory)
Some takeaways from this exercise:
  • You might have assumed that unzipping tempdata/matty.shakespeare.tar.gz would unpack the contents of the zip file into tempdata. But when you execute this particular program (i.e. c.py), you are outside the tempdata directory. Unless you tell it otherwise, Python assumes you want things done relative to where you executed the script.

  • We’ve been keeping things simple, but it is very easy to lose track of where “you” are when you execute a script.
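
One way to "tell it otherwise" is to pass the destination directory explicitly. This is a minimal sketch using shutil.unpack_archive() from the standard library. The wrapper function unpack() is my own naming, and the separate guide may well do it differently:

```python
from shutil import unpack_archive


def unpack(archive_name, dest_dir):
    """Unpack archive_name into dest_dir rather than the current directory."""
    # Without the second argument, unpack_archive() extracts into
    # the current working directory
    unpack_archive(archive_name, dest_dir)
    print("Unpacked", archive_name, "into:", dest_dir)
```

For this exercise, the call would be unpack('tempdata/matty.shakespeare.tar.gz', 'tempdata').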

0004-shakefiles/d.py » Print the first 5 lines of the Hamlet text

0004-shakefiles/d.py
Print the first 5 lines of the Hamlet text
0.5 points

From the text file at tempdata/tragedies/hamlet, read and print the first 5 lines of text.

Expectations

When you run d.py from the command-line:

0004-shakefiles $ python d.py
  • The program's output to screen should be:
    HAMLET
    
    
    DRAMATIS PERSONAE
    
Some takeaways from this exercise:
  • A filename is not an actual file. It’s just a string that represents the human-readable name of a file, e.g. tempdata/tragedies/hamlet

  • Opening a file, by calling the open() function on a filename, does not actually read the file. It just gives us access to a stream object, which has several methods for reading data from the “stream”, including all-at-once or line-by-line.

  • By default, the open() function opens a file for reading and will throw an error if that file doesn’t exist. This is much, much preferable to what happens when you open an existing file for writing, which immediately wipes out that file’s contents.

  • Each line of text in a file ends with a newline character. That’s what separates it from the next line. Keeping in mind that a line of text is, well, a string – you can use its strip() method to remove whitespace from both sides of the text, including newlines.

  • It’s considered good manners to invoke a file stream’s close() method when you’re done with the file. Imagine a scenario in which other programs are trying to open that file…
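
The takeaways above can be sketched as a short function. This is only one way to do it; the name print_first_lines and its num_lines parameter are hypothetical, chosen for illustration:

```python
def print_first_lines(filename, num_lines=5):
    """Print the first num_lines lines of the text file at filename."""
    infile = open(filename, 'r')   # opens the stream; nothing is read yet
    for _ in range(num_lines):
        line = infile.readline()   # reads one line, newline included
        print(line.strip())        # strip() removes the trailing newline
    infile.close()                 # release the file when we're done
```

Calling print_first_lines('tempdata/tragedies/hamlet') reads just five lines and leaves the rest of the file untouched.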

0004-shakefiles/e.py » Read through and count each line in the Hamlet text, then print the total number of lines

0004-shakefiles/e.py
Read through and count each line in the Hamlet text, then print the total number of lines
0.5 points

Re-open the tempdata/tragedies/hamlet file as before, but read through the entire file, line-by-line, and print the total count of the number of lines in the file.

Expectations

When you run e.py from the command-line:

0004-shakefiles $ python e.py
  • The program's output to screen should be:
    tempdata/tragedies/hamlet has 6045 lines
Some takeaways from this exercise:
  • Opening and reading files via programming is so cumbersome at first. But it’s worth doing, over-and-over, until it becomes routine and reflex, as there is a lot of nuance that can come into play. Think about how Excel, or even your plain text editor, will bring your system down to a halt when you have it open a massive file. You don’t want that happening in your scripts.
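
As a rough sketch of reading line-by-line without loading the whole file at once (the function name count_lines is mine, not something from the guide):

```python
def count_lines(filename):
    """Return the number of lines in the text file at filename."""
    line_count = 0
    # The with-block closes the file automatically when it ends
    with open(filename, 'r') as infile:
        # Looping over the stream reads one line at a time, so the
        # whole file never has to sit in memory at once
        for line in infile:
            line_count += 1
    return line_count
```

From there, something like print(fname, 'has', count_lines(fname), 'lines') produces a message in the expected format.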

0004-shakefiles/f.py » Print the final 5 lines of Romeo and Juliet

0004-shakefiles/f.py
Print the final 5 lines of Romeo and Juliet
1.0 points

Open the file at tempdata/tragedies/romeoandjuliet and read and print the final 5 lines.

This seems like the same exercise as d.py – except that we read from Romeo and Juliet instead of Hamlet. And that we read the final 5 lines instead of the first 5 lines.

That first difference is easy to do; that second one is a much different problem to tackle.

This tutorial walks through the process.

Pay special attention to the expected output, particularly:

  • There is no space between the line number and the colon, e.g. 2: not 2 :
  • The last line is numbered 4766. Make sure you’re not off-by-one.

Having trouble with adding a number to a string, i.e. 1 and ":" to make "1:"? Try using the str() function to convert a number to a string.

Expectations

When you run f.py from the command-line:

0004-shakefiles $ python f.py
  • The program's output to screen should be:
    4762: Some shall be pardon'd, and some punished:
    4763: For never was a story of more woe
    4764: Than this of Juliet and her Romeo.
    4765:
    4766: [Exeunt]
    
Some takeaways from this exercise:
  • The range() function is an easy way to generate a sequence of numbers to loop through.

  • Combining strings and other data values in order to generate a pre-defined format of string is a common situation, and extremely annoying if all you know is how to add strings together via the + operator. Stay on the lookout for other methods, as complicated as they first seem.
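
Here's a hedged sketch that puts range() and string formatting together. It is not necessarily the approach the tutorial takes: it simply reads every line into a list first, and the name format_final_lines is made up for illustration:

```python
def format_final_lines(filename, num_lines=5):
    """Return the last num_lines lines as strings like '4766: [Exeunt]'."""
    with open(filename, 'r') as infile:
        lines = infile.readlines()

    total = len(lines)
    start = max(0, total - num_lines)  # guard against very short files
    formatted = []
    for idx in range(start, total):
        # idx counts from 0, but people count lines from 1
        line_number = idx + 1
        text = lines[idx].strip()
        # str.format() as an alternative to: str(line_number) + ': ' + text
        formatted.append('{}: {}'.format(line_number, text))
    return formatted
```

Note that with this formatting a blank line comes out as the number, a colon, and a trailing space. Loop through the returned list and print each item to get the output on screen.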

0004-shakefiles/g.py » Print the final 5 lines for all of Shakespeare's tragedies

0004-shakefiles/g.py
Print the final 5 lines for all of Shakespeare's tragedies
1.5 points

For each file in tempdata/tragedies/:

  • Count and print the number of lines in the file.
  • Print the text of the final 5 lines, along with the corresponding line number.
Expectations

When you run g.py from the command-line:

0004-shakefiles $ python g.py
  • The program's first 6 lines of output to screen should be:
    tempdata/tragedies/antonyandcleopatra has 5998 lines
    5994: In solemn show attend this funeral;
    5995: And then to Rome. Come, Dolabella, see
    5996: High order in this great solemnity.
    5997:
    5998: [Exeunt]
    
  • The program's last 6 lines of output to screen should be:
    tempdata/tragedies/titusandronicus has 3767 lines
    3763: By whom our heavy haps had their beginning:
    3764: Then, afterwards, to order well the state,
    3765: That like events may ne'er it ruinate.
    3766:
    3767: [Exeunt]
    
Some takeaways from this exercise:
  • The syntax for glob.glob() seems awkward, doesn’t it? Consider using the from glob import glob style of import statement.

  • Repeating this exercise for all of the Shakespeare files would be very easy.
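
To see how little changes when you go from one file to a whole directory, here's a minimal sketch. The function name summarize_directory is hypothetical, it assumes the directory contains only text files (no subfolders), and the call to sorted() is my own addition, since glob() makes no promise about the order of its results:

```python
from glob import glob
from os.path import join


def summarize_directory(dirname, num_lines=5):
    """For each file in dirname, print its line count and final lines."""
    # sorted() gives a predictable, alphabetical order of filenames
    for fname in sorted(glob(join(dirname, '*'))):
        with open(fname, 'r') as infile:
            lines = infile.readlines()
        total = len(lines)
        print(fname, 'has', total, 'lines')
        # Same final-lines logic as before, just inside another loop
        for idx in range(max(0, total - num_lines), total):
            print('{}: {}'.format(idx + 1, lines[idx].strip()))
```

Pointing it at join('tempdata', 'tragedies') covers this exercise; pointing it at a different folder is a one-word change.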

0004-shakefiles/h.py » Count and print the number of non-blank lines for each of Shakespeare's tragedies

0004-shakefiles/h.py
Count and print the number of non-blank lines for each of Shakespeare's tragedies
1.0 points

You already know how to collect the list of filenames in a single directory. And you know how to read through a file and count the lines. This exercise combines both problems and adds a few twists:

  • Loop through all of the Shakespeare files (there are 42 of them)
  • When reading the lines, keep track of how many of the lines are non-blank.
  • Print the number of non-blank lines versus total lines in each text.
  • At the end, print the total count of non-blank lines and all lines.

“Non-blank” for this exercise is defined as: a line that has at least one non-whitespace character. However, it’s hard to tell a completely empty line apart from one that is full of invisible whitespace characters. So use the string’s strip() method: a line is blank if nothing is left after stripping it.
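
The strip() test can be sketched like this. The function name count_nonblank is my own; it handles a single file and leaves the looping over filenames and the running totals to you:

```python
def count_nonblank(filename):
    """Return (nonblank_count, total_count) for the text file at filename."""
    nonblank_count = 0
    total_count = 0
    with open(filename, 'r') as infile:
        for line in infile:
            total_count += 1
            # strip() on a whitespace-only line leaves '', which is falsy
            if line.strip():
                nonblank_count += 1
    return nonblank_count, total_count
```

Because the function returns both numbers, you can print them for each file and also add them to two running totals for the final summary.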

Expectations

When you run h.py from the command-line:

0004-shakefiles $ python h.py
  • The program's first 3 lines of output to screen should be:
    tempdata/comedies/allswellthatendswell has 3164 non-blank lines out of 4515 total lines
    tempdata/comedies/asyoulikeit has 2904 non-blank lines out of 4122 total lines
    tempdata/comedies/comedyoferrors has 2112 non-blank lines out of 2937 total lines
    
  • The program's last 3 lines of output to screen should be:
    tempdata/tragedies/titusandronicus has 2837 non-blank lines out of 3767 total lines
    All together, Shakespeare's 42 text files have:
    125097 non-blank lines out of 172948 total lines