Text interpretation in Bash

An overview of how Bash interprets text, both literally and symbolically.

Text is considered a "universal interface" for Unix systems. As you can already tell, Bash has a particular way of interpreting the text that we send it.

We can't, for instance, just type, "Create a new directory named 'Documents'", and expect Bash to know what's going on:

user@host:~$ Create a new directory named 'Documents'
Create: command not found

Bash expects text to come in a particular form. Some words, like mkdir, refer to programs. And some symbols, such as $, ~, and *, will be interpreted by Bash to mean something much more expansive than just single characters.

How can Bash tell the difference between commands, symbols, and "just text"?

It just does. Bash has a syntax that defines how it will interpret the text characters we send it, just as English has a syntax in which the following two phrases have the same words but different interpretations based on the punctuation:

"That's what," he said.

That's what he said.

But just as it's a bad idea to teach children their first language by focusing on the rules of grammar, it's not productive to just learn Bash through memorizing its particular grammar and syntax – you should be writing programs and seeing what happens.

However, it's helpful to explain some of the initial concepts of how Bash interprets our commands and data, as a way to prepare you for the seemingly rudimentary way that Unix handles text. Most of these concepts will make more sense after you've read about pipes and redirection and variables.

Literal values

In programming, a literal value can be thought of as: what you see is what you get.

In the sequence of commands below, I call the mkdir command three separate times. However, those three calls create four directories, not three:

user@host:~$ mkdir 42
user@host:~$ mkdir apples oranges
user@host:~$ mkdir "42 bottles of beer"

[Animated GIF: the commands above run on OS X, showing their effect on the filesystem graphically.]

So what were the characters, or strings of text, that were interpreted by the shell as literal values?

42
apples
oranges
42 bottles of beer (including the space characters)

And which text characters were not interpreted as literal values?

mkdir - this was interpreted as the command to make a new directory
The space characters between mkdir and the directory names passed to it
The space character between apples and oranges, hence, the creation of two separate directories
The quotation marks around 42 bottles of beer

Space-separated values

If you're coming from a modern operating system, like Windows or OS X, you've probably seen that it's possible to make files or directories with space characters in the name, e.g. the Documents and Settings directory on your C:\ drive.

So how does mkdir know that I wanted to make two separate directories instead of one called apples oranges? It doesn't. Bash splits the text on spaces before mkdir ever sees it, so to get a single directory whose name contains spaces, we have to enclose the name in quotes, either single or double:

user@host:~$ mkdir 'apples and oranges' "sunshine and lollipops"

Without quotes, Bash will treat each space-separated piece of text as a separate "word", or token. So mkdir dogs cats will be treated as three different tokens: the command mkdir, and the two arguments dogs and cats.
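
To see the splitting for yourself without creating any directories, here is a minimal sketch. The count_args function is a hypothetical helper (not a standard command) that simply reports how many arguments the shell handed to it:

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper for illustration only:
# it prints the number of arguments it received.
count_args() {
  echo "$#"
}

unquoted=$(count_args apples oranges)       # two tokens, so two arguments
quoted=$(count_args 'apples and oranges')   # one quoted string, so one argument

echo "unquoted: $unquoted"
echo "quoted: $quoted"
```

The unquoted call reports 2 and the quoted call reports 1, which is the same decision Bash makes before handing names to mkdir.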

Quotes for enclosing literal values

Both apostrophes (single quotes) and quotation marks (double quotes) can be used to denote a text string (whether it contains spaces or new lines) as a single literal value. Whichever one you start with, make sure to end with it:

user@host:~$ echo 'Jimmy says "Hello"'
Jimmy says "Hello"
user@host:~$ echo "Jimmy's friend does not respond"
Jimmy's friend does not respond

Single vs double quotes

The two kinds of quotes are not interchangeable, however. Inside double quotes, certain special characters, such as the dollar sign that denotes a variable, will still be interpreted by the shell and expanded.

In the single-quote version, the entire text string passed into echo is interpreted literally:

user@host:~$ some_number=42
user@host:~$ echo 'There are $some_number bottles of beer'
There are $some_number bottles of beer

In the double-quote version, the shell sees the $ and replaces the variable some_number with its actual value, 42:

user@host:~$ some_number=42
user@host:~$ echo "There are $some_number bottles of beer"
There are 42 bottles of beer

Space comedy

A technical aside: In the olden days of computing, it was easy to assume that filenames (and the names of programs and commands) would never have a space in them. That has changed. Still, most programs and commands designed for Unix-like systems adhere to this "no fancy filenames" mindset – quite sensibly, in my opinion – while allowing users to use the aforementioned quotation marks to delineate fancy filenames.

For the most part, the exercises in this course will work on filenames and references that are safe and simple. But keep in mind that the real world is not so simple, and not knowing that can lead to a lot of problems. For example, watch me create four new directories on my OS X system via mkdir:

~ $ mkdir dogs cats
~ $ mkdir "This is the end, my friend"
~ $ mkdir "Don't ever
> ever
> ever name a directory like
> this."
~ $ ls
Dont ever?ever?ever name a directory like?this.
This is the end, my friend
cats
dogs
# Note: I've removed the apostrophe from the output here for 
# formatting purposes

Suffice it to say, most programmers do not expect a filename to contain newlines, and that assumption is the source of many comical or critical (and sometimes both) system errors. That is why, later on in this course, we move to more sophisticated text-handling environments, e.g. Python.
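
As a rough sketch of how that assumption bites, the script below creates two directories in a throwaway location (via mktemp, so nothing in your home directory is touched), one of which has a newline in its name. Counting the lines of ls output will typically overcount, since the newline inside the name looks like a separator, whereas letting the shell expand * hands over each name as one intact word. The variable names are my own, chosen for illustration:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)            # scratch directory, removed at the end

# Two directories; the first has a newline inside its name.
mkdir "$workdir/first
second" "$workdir/third"

# Line-based counting: each newline in the listing is assumed to end a name.
ls_count=$(ls "$workdir" | wc -l | tr -d ' ')

# Glob-based counting: the shell passes each name as a single argument.
set -- "$workdir"/*
glob_count=$#

echo "counted from ls output: $ls_count"
echo "counted from the glob:  $glob_count"

rm -rf "$workdir"
```

The glob count is 2, the true number of directories. The ls-based count depends on how your version of ls prints unusual characters, which is part of the problem.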

The importance of double quotes

One vital purpose of double quotes will be evident in later examples of variable usage. If a variable contains a space-separated value, such as Documents and Settings, wrapping the variable in double quotes prevents the shell from splitting that value into separate words, a split that can lead to nasty, unexpected effects.

Again, this will make more sense when we look at how variables are used. But pretend that the variable dir_name has been set to "Documents and Settings". And compare the effects of the three mkdir calls below:

user@host:~$ dir_name='Documents and Settings'
user@host:~$ echo $dir_name
Documents and Settings
user@host:~$ mkdir '$dir_name'
user@host:~$ mkdir "$dir_name"
user@host:~$ mkdir $dir_name

The first call is just plain wrong: wrapping a variable in single quotes causes mkdir to create a directory with the literal name of $dir_name.
The second call, with $dir_name inside double quotes, behaves as expected. The shell expands $dir_name to the string Documents and Settings, and a single directory with that name is created.
The third call, with $dir_name being passed as an unquoted argument to mkdir, causes three directories to be made: Documents, and, Settings.
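
The second and third calls can be compared safely in a scratch directory. This is only a sketch: workdir, quoted_count, and unquoted_count are names I've made up, and mktemp is used so that nothing is created in your real home directory:

```shell
#!/usr/bin/env bash
dir_name='Documents and Settings'
start_dir=$PWD
workdir=$(mktemp -d)
mkdir "$workdir/quoted" "$workdir/unquoted"

# Double-quoted: the value stays one word, so one directory is made.
cd "$workdir/quoted" && mkdir "$dir_name"

# Unquoted: the value is split on spaces, so three directories are made.
cd "$workdir/unquoted" && mkdir $dir_name

# Count what each call created by letting the shell expand the glob.
set -- "$workdir"/quoted/*
quoted_count=$#
set -- "$workdir"/unquoted/*
unquoted_count=$#

echo "quoted: $quoted_count, unquoted: $unquoted_count"

cd "$start_dir" && rm -rf "$workdir"
```

This prints quoted: 1, unquoted: 3.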

[Animated GIF: the directories unexpectedly created when a variable's value contains spaces.]

Line-by-line interpretation

So with the interactive command-line, the shell typically expects to execute a command every time you press Enter (i.e. send a newline character).

sunet_id@corn30:~$ echo Hello
Hello
sunet_id@corn30:~$

There are a few exceptions, such as when quoted values include newline characters (i.e., what happens when you press Enter). And there are special characters we can use to change up the line-by-line interaction, though these are more or less for human-readability purposes.

Using backslashes to split a command over multiple lines

For a single command that contains so many characters that it causes a line wrap, it's helpful – again, for human-readability, as the computer doesn't care either way – to split it over multiple lines.

Ending a line with a backslash will tell the shell that the command continues onto the next line (notice how the prompt changes into a right-angle-bracket):

sunet_id@corn30:~$ echo Hello \
> world
Hello world

Note: make sure that the backslash is the very last character of the line you wish to continue, i.e. hit Enter immediately after the backslash, don't put a space or any other character after the backslash on the same line.
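
The same continuation works inside a script file, where there is no prompt to change. A small sketch (the greeting variable is just for illustration):

```shell
#!/usr/bin/env bash
# One command written across three lines; each backslash is the very
# last character on its line.
greeting=$(echo Hello \
  multi-line \
  world)

echo "$greeting"
```

Bash joins the three lines before running echo, so the script prints Hello multi-line world.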

Unintended multi-line commands

Using the backslash at the end of a line is how we explicitly tell Bash, "Hey, don't do anything yet, we're continuing this command on the next line". However, it's fairly easy for typos to make us accidentally carry-over commands. This happens most often with unclosed quote-marks or parentheses:

sunet_id@corn30:~$ echo "How are you world?
> 
> ksdfljsadklfj
> "
How are you world?

ksdfljsadklfj

Tip: If you unintentionally run into this situation and can't figure out how to get out, hit Ctrl-C to break out of the limbo and return to the standard prompt.

Semicolons to separate short commands in a single line

When you have multiple commands that are so short that they don't seem to merit their own lines, you can use the semicolon to separate the commands, and Bash will still execute the commands as if you had put them on their own lines:

user@host:/tmp$ pwd; mkdir stuff; cd stuff; pwd
/tmp
/tmp/stuff

Double-ampersands to run commands conditionally

The double ampersand also lets you join commands on a single line. Where && differs from ; is that if a command fails, the commands after it will not run:

user@host:/$ pwd && mkdir stuff && cd stuff && pwd
/
mkdir: cannot create directory 'stuff': Permission denied

The use of double-ampersands is considered a good practice when doing something destructive right after a command that may not succeed. Consider these two commands (but do not run them on your own system):

# Dangerous:
user@host:/$ cd junk; rm -f *
# Safe:
user@host:/$ cd junk && rm -f *

What happens when the junk directory exists? The cd (change directory) command will be successful and then the rm command will remove all files in it. But what happens when junk doesn't exist? Where is the shell when cd fails? And where will rm unexpectedly be doing its business?
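
Here is a harmless way to check the safe form, sketched in a scratch directory created by mktemp. The file keep_me and the variable names are placeholders of my own:

```shell
#!/usr/bin/env bash
start_dir=$PWD
workdir=$(mktemp -d)
touch "$workdir/keep_me"
cd "$workdir" || exit 1

# There is no junk directory here, so cd fails and, thanks to &&,
# rm is never run. With ; instead, rm would have run right here in $workdir.
cd junk 2>/dev/null && rm -f ./*

if [ -e "$workdir/keep_me" ]; then
  result="survived"
else
  result="deleted"
fi
echo "keep_me $result"

cd "$start_dir" && rm -rf "$workdir"
```

Because the failed cd leaves the shell where it was, the semicolon version would have deleted keep_me. With &&, the script prints keep_me survived.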

Comments with the pound sign

This feature won't be particularly helpful to you until you start writing shell script files. But the pound sign can be used to tell Bash to ignore every character to the right of the pound sign. This can be used to annotate your code:

user@host:/tmp/x$ # I hope this works
user@host:/tmp/x$ mkdir new_dir
user@host:/tmp/x$ # hopefully that worked
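
In a script file, comments can sit on their own line or trail a command. One detail worth a quick sketch: a pound sign inside quotes is ordinary text, not the start of a comment. The variable names below are made up for the example:

```shell
#!/usr/bin/env bash
# A whole-line comment: Bash skips this line entirely.
label="draft"          # a trailing comment, ignored from the # onward
ticket="Item #1"       # inside quotes, the # is just a literal character

echo "$label"
echo "$ticket"
```

The second echo prints Item #1 in full.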

Multi-line data

The line-by-line nature of how Bash processes data makes it an inelegant system for processing data that spans more than one line.

For example, take the HTML snippet below:

<h1>This is a headline</h1>

It is trivial (though clunky) to extract the text, This is a headline, between the h1 tags using grep (with Perl-style regular expressions, enabled by the -P option in GNU grep):

echo '<h1>This is a headline</h1>' | grep -oP '(?<=<h1>)(.+?)(?=</h1>)'

However, if the data looks like this:

<h1>This is 
    a headline
</h1>

Then things get trickier. Standard grep, for instance, matches one line at a time, so it won't find a pattern that spans newline characters, though you do have access to the awk and sed text-processing tools.
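
One possible workaround, sketched here under the assumption that the snippet is small and contains a single h1, is to flatten the newlines into spaces first and then strip the tags with sed. Note that tr is an extra tool I've brought in for the flattening step, and that this is an illustration, not a robust way to parse HTML:

```shell
#!/usr/bin/env bash
html='<h1>This is
    a headline
</h1>'

# 1. tr turns each newline into a space and squeezes runs of spaces into one.
# 2. sed removes everything up to <h1>, everything from </h1> on,
#    and any trailing spaces left behind.
headline=$(printf '%s\n' "$html" \
  | tr -s '\n' ' ' \
  | sed -e 's/.*<h1>//' -e 's/<\/h1>.*//' -e 's/ *$//')

echo "$headline"
```

This prints This is a headline on a single line.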

Heredocs

While it's possible to use quotation marks to enclose multi-line strings:

echo "hey
you
what's going on?"

– this quickly becomes cumbersome when the strings themselves contain literal quotation marks, as in the case of HTML:

echo "
<p class=\"note\">
  John told me, \"This site is the 
  <a href=\"http://example.com\" target=\"_blank\">
    best\" 
  </a>
</p>
"

By using a "Heredoc string", we can specify that some other delimiter be used to denote the beginning and the end of a string (note that we use cat now, instead of echo). Heredocs are a great way to include multi-line text, such as data rows, right alongside our script file.

cat <<EOF
<p class="note">
  John told me, "This site is the
  <a href="http://example.com" target="_blank">
    best"
  </a>
</p>
EOF

The "limit string", which in the above case is EOF, marks where the string begins and ends. EOF is the traditional choice, though any sequence of characters can be used, as long as these conditions are met:

The limit string follows the <<. By convention nothing comes between them, though Bash tolerates a space there.
At the end of the Heredoc string, the limit string is on its own line, with no whitespace between it and the beginning of the line.
The limit string is unique enough that it doesn't have a chance of appearing in the Heredoc string itself.

So this is good:

cat <<THISISMYHEREDOC
 hello
   there
THISISMYHEREDOC

The following example is wrong, because the closing EOF is indented and so is not recognized as the end of the Heredoc:

cat <<EOF
 hello
   there
 EOF

This one, with a space between << and EOF, does work in Bash, but the no-space form is the convention used in the rest of this guide:

cat << EOF
 hello
   there
EOF

Sending a Heredoc to a file

The notation is a little weird, but think of it as cat feeding into stuff.html what it gets from the << operator:

cat > stuff.html <<EOF
<html>
    <h1>
      <a href="http://example.com">An example</a>
    </h1>
</html>
EOF
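
To confirm that the file really received the Heredoc, here is a small sketch that writes to a temporary file (a stand-in for stuff.html, created with mktemp) and then inspects it. The variable names are illustrative only:

```shell
#!/usr/bin/env bash
outfile=$(mktemp)      # stand-in for stuff.html

cat > "$outfile" <<EOF
<html>
    <h1>
      <a href="http://example.com">An example</a>
    </h1>
</html>
EOF

# Read the file back: count its lines and grab the first one.
line_count=$(wc -l < "$outfile" | tr -d ' ')
first_line=$(head -n 1 "$outfile")
echo "wrote $line_count lines, starting with $first_line"

rm -f "$outfile"
```

The file ends up with the five lines of the Heredoc, the first being <html>.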

Making a literal Heredoc that doesn't get interpreted

By default, a Heredoc that contains special symbols and sequences, such as $ before a variable name, will have those sequences expanded, just as they would be in a normal double-quoted string. To prevent this, put the opening EOF inside single quotes:

world="LADEEDAH"
# This gets interpreted

cat <<EOF
    Hello $world
EOF
# Output:
#         Hello LADEEDAH

# Prevent interpretation:
cat <<'EOF'
    Hello $world
EOF

# Output:
#         Hello $world

Previously, I said that the limit string, e.g. EOF, has to be the same at the start and the end of the heredoc. Quotes are the exception: they are not part of the limit string itself, just a signal to Bash not to expand the contents. In other words, you can begin a heredoc with 'EOF' and end it with a plain EOF.

Assigning a Heredoc to a variable

Use the read command (read this elaboration on StackOverflow):

read -r -d '' some_variable <<'EOF'
<html>
    <h1>
      <a href="http://example.com">An example</a>
    </h1>
</html>
EOF
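
One caveat, as far as I understand read's behavior: with -d '' it keeps reading until a NUL character that never arrives, so it fills the variable but reports a non-zero exit status when the input runs out. That matters if your script checks exit codes. The sketch below, using made-up variable names, records that status and shows an alternative that captures the Heredoc with command substitution instead:

```shell
#!/usr/bin/env bash
# Approach 1: read with an empty delimiter. The variable is filled,
# but read's exit status is non-zero because no delimiter was found.
read_status=0
read -r -d '' via_read <<'EOF' || read_status=$?
<h1>Hello $world</h1>
EOF

# Approach 2: capture the output of cat with command substitution.
via_cat=$(cat <<'EOF'
<h1>Hello $world</h1>
EOF
)

echo "read exit status: $read_status"
echo "via_read: $via_read"
echo "via_cat:  $via_cat"
```

Both variables end up holding the same literal text, $world included, since the limit string is quoted in both cases.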