Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 1520 samples that she’s run through an assay machine to measure the relative abundance of 300 proteins. She needs to run these 1520 files through an imaginary program called goostats that she inherited. On top of this huge task, she has to write up results by the end of the month so her paper can appear in a special issue of Aquatic Goo Letters.
The bad news is that if she has to run goostats by hand using a GUI, she’ll have to select and open a file 1520 times. If goostats takes 30 seconds to run each file, the whole process will take more than 12 hours of Nelle’s attention. With the shell, Nelle can instead assign her computer this mundane task while she focuses her attention on writing her paper.
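The time estimate can be checked with shell arithmetic. This is a minimal sketch using only the figures quoted above (1520 files, 30 seconds each); the variable names are made up for illustration.

```shell
# Rough estimate of how long running goostats by hand would take.
files=1520              # number of sample files (figure from the text)
seconds_per_file=30     # assumed run time per file (figure from the text)

total_seconds=$(( files * seconds_per_file ))
hours=$(( total_seconds / 3600 ))             # whole hours, rounded down
minutes=$(( (total_seconds % 3600) / 60 ))    # leftover whole minutes

echo "$total_seconds seconds is about $hours hours and $minutes minutes"
```

The result is 12 hours and 40 minutes, which matches the "more than 12 hours" figure.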
The next few lessons will explore the ways Nelle can achieve this. More specifically, they explain how she can use a command shell to run the goostats program, using loops to automate the repetitive steps of entering file names, so that her computer can work while she writes her paper.
The general form of a loop:
for thing in list_of_things
do
    operation_using $thing  # Indentation within the loop is not required, but aids legibility
done
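As a concrete instance of this general form, here is a minimal sketch that loops over a fixed list of words. The list, the variable name fruit, and the counter are hypothetical and chosen only for illustration.

```shell
count=0    # hypothetical counter, used only to show how many times the body ran

for fruit in apple banana cherry
do
    echo "I like $fruit"      # the operation, using the loop variable
    count=$(( count + 1 ))
done

echo "The loop body ran $count times"
```

The body between do and done runs once per item in the list, with $fruit holding a different item each time.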
This exercise refers to the data-shell/molecules directory. ls gives the following output:
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
What is the output of the following code?
$ for datafile in *.pdb
> do
> ls *.pdb
> done
Now, what is the output of the following code?
$ for datafile in *.pdb
> do
> ls $datafile
> done
Why do these two loops give different outputs?
Solution
The first code block gives the same output on each iteration through the loop. Bash expands the wildcard *.pdb within the loop body (as well as before the loop starts) to match all files ending in .pdb and then lists them using ls. The expanded loop would look like this:
$ for datafile in cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
> do
> ls cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
> done
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
The second code block lists a different file on each loop iteration. The value of the datafile variable is evaluated using $datafile, and then listed using ls.
cubane.pdb
ethane.pdb
methane.pdb
octane.pdb
pentane.pdb
propane.pdb
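One way to see the difference between the two loops is to put echo in front of ls, so the shell prints each command after expansion instead of running it. The sketch below assumes a scratch directory created with mktemp and uses only three empty stand-in files, not the real data-shell/molecules directory.

```shell
# Scratch directory with empty stand-in files (an assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir"
touch cubane.pdb ethane.pdb methane.pdb

# Loop 1: the wildcard in the body is expanded again on every iteration.
first=$(
    for datafile in *.pdb
    do
        echo ls *.pdb
    done
)

# Loop 2: the variable holds exactly one file name per iteration.
second=$(
    for datafile in *.pdb
    do
        echo ls "$datafile"
    done
)

echo "Loop 1 would run:"
echo "$first"
echo "Loop 2 would run:"
echo "$second"
```

Loop 1 prints the same full ls command three times, while loop 2 prints three different commands, one per file.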
What would be the output of running the following loop in the data-shell/molecules directory?
$ for filename in c*
> do
> ls $filename
> done
1. No files are listed.
2. All files are listed.
3. Only cubane.pdb, octane.pdb and pentane.pdb are listed.
4. Only cubane.pdb is listed.
Solution
4 is the correct answer. * matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters, will be matched.
How would the output differ from using this command instead?
$ for filename in *c*
> do
> ls $filename
> done
1. The same files would be listed.
2. All the files are listed this time.
3. No files are listed this time.
4. The files cubane.pdb and octane.pdb will be listed.
5. Only the file octane.pdb will be listed.
Solution
4 is the correct answer. * matches zero or more characters, so a file name with zero or more characters before a letter c and zero or more characters after the letter c will be matched.
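The two patterns can be compared without a loop, because echo prints whatever the shell expands a wildcard to. This is a minimal sketch, assuming a scratch directory that holds empty files with the same names as those in data-shell/molecules.

```shell
# Scratch directory with empty stand-in files (an assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir"
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

starts_with_c=$(echo c*)     # names that begin with the letter c
contains_c=$(echo *c*)       # names with a c anywhere in them

echo "c*  matches: $starts_with_c"
echo "*c* matches: $contains_c"
```

Previewing a pattern with echo like this is a cheap check before using it as the list in a for loop.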
In the data-shell/molecules directory, what is the effect of this loop?
for alkanes in *.pdb
do
echo $alkanes
cat $alkanes > alkanes.pdb
done
1. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
2. Prints cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called alkanes.pdb.
3. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
4. None of the above.
Solution
1 is the correct answer. The text from each file in turn gets written to the alkanes.pdb file. However, the file gets overwritten on each loop iteration, so the final content of alkanes.pdb is the text from the propane.pdb file.
Also in the data-shell/molecules directory, what would be the output of the following loop?
for datafile in *.pdb
do
cat $datafile >> all.pdb
done
1. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.
2. The text from ethane.pdb will be saved to a file called all.pdb.
3. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.
4. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.
Solution
3 is the correct answer. >> appends to a file, rather than overwriting it with the redirected output from a command. Given the output from the cat command has been redirected, nothing is printed to the screen.
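The difference between > and >> inside a loop can be seen side by side. The sketch below assumes a scratch directory with three small stand-in files; the file contents and the output names overwritten.txt and appended.txt are hypothetical.

```shell
# Scratch directory with small stand-in files (an assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir"
echo "cubane data"  > cubane.pdb
echo "ethane data"  > ethane.pdb
echo "methane data" > methane.pdb

for datafile in *.pdb
do
    cat "$datafile" >  overwritten.txt    # > replaces the file on every iteration
    cat "$datafile" >> appended.txt       # >> adds to the end on every iteration
done

overwritten=$(cat overwritten.txt)
appended=$(cat appended.txt)

echo "overwritten.txt contains:"
echo "$overwritten"
echo "appended.txt contains:"
echo "$appended"
```

Only the last file survives in overwritten.txt, while appended.txt holds the text of all three files in loop order.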
A loop is a way to do many things at once, or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them.
Suppose we want to preview the commands the following loop will execute without actually running those commands:
$ for datafile in *.pdb
> do
> cat $datafile >> all.pdb
> done
What is the difference between the two loops below, and which one would we want to run?
# Version 1
$ for datafile in *.pdb
> do
> echo cat $datafile >> all.pdb
> done
# Version 2
$ for datafile in *.pdb
> do
> echo "cat $datafile >> all.pdb"
> done
Solution
The second version is the one we want to run. This prints to screen everything enclosed in the quote marks, expanding the loop variable name because we have prefixed it with a dollar sign.
The first version appends the output from the command echo cat $datafile to the file all.pdb. This file will just contain the list: cat cubane.pdb, cat ethane.pdb, cat methane.pdb, etc.
Try both versions for yourself to see the output! Be sure to open the all.pdb file to view its contents.
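Both versions can be tried safely in a throwaway directory. This is a minimal sketch, assuming a scratch directory created with mktemp that holds only two empty stand-in files; the variable names version1_file and version2_screen are made up for illustration.

```shell
# Scratch directory with empty stand-in files (an assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir"
touch cubane.pdb ethane.pdb

# Version 1: >> is outside any quotes, so echo's output is redirected into all.pdb.
for datafile in *.pdb
do
    echo cat $datafile >> all.pdb
done
version1_file=$(cat all.pdb)
rm all.pdb    # clean up so Version 2 starts from the same two files

# Version 2: >> is inside the quotes, so it is printed as plain text.
version2_screen=$(
    for datafile in *.pdb
    do
        echo "cat $datafile >> all.pdb"
    done
)

echo "Version 1 wrote this into all.pdb:"
echo "$version1_file"
echo "Version 2 printed this to the screen:"
echo "$version2_screen"
```

After Version 2 runs, no all.pdb file exists, which is what a preview should guarantee.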
Suppose we want to set up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures. What would be the result of the following code?
$ for species in cubane ethane methane
> do
> for temperature in 25 30 37 40
> do
> mkdir $species-$temperature
> done
> done
Solution
We have a nested loop, i.e. a loop contained within another loop. For each species in the outer loop, the inner loop iterates over the list of temperatures and creates a new directory for each combination.
Try running the code for yourself to see which directories are created!
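To try the nested loop without cluttering a real project, it can be run in a throwaway directory. The sketch below assumes a scratch directory created with mktemp; the counter variable is hypothetical and only tallies how many times mkdir ran.

```shell
# Scratch directory so the new directories are easy to discard (an assumption).
workdir=$(mktemp -d)
cd "$workdir"

count=0    # hypothetical counter for the number of directories created
for species in cubane ethane methane
do
    for temperature in 25 30 37 40
    do
        mkdir "$species-$temperature"    # one directory per combination
        count=$(( count + 1 ))
    done
done

ls
echo "Created $count directories"
```

Three species and four temperatures give twelve directories, named from cubane-25 through methane-40.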