Analyze Song Lyrics with a Shell Script, Part II
In my last article, I began exploring song lyrics. Not so you could have an epic Karaoke night, but more in the sense of analyzing song lyrics and word usage therein. What sparked my curiosity was an article claiming that the prolific songwriting duo of Paul McCartney and John Lennon used the word "love" in Beatles songs 160 times.
How do you test that assertion? You do it by pulling the lyrics from a Web site that specializes in song lyrics—in this case MLDb—and analyzing them with a shell script.
I wrote the first part in my last article, which was a script that extracted links for every published song lyric attributed to The Beatles, stepping through the every-30 pagination structure of the site. In total, the site lists 240 songs by the band. Out of 240 songs, they mentioned "love" only 160 times? I'm skeptical.
In this article, I expand on the idea by downloading the lyrics to each and every one of those songs, then use some basic command-line tools to analyze word usage and frequency.
Tell Me What You See
The output of the script from my last article is a set of files that have the following contents:
<a href="song-32476-i-am-the-walrus.html">I Am The Walrus
<a href="song-32520-come-together.html">Come Together
<a href="song-32461-yellow-submarine.html">Yellow Submarine
<a href="song-32585-day-tripper.html">Day Tripper
<a href="song-32557-let-it-be.html">Let It Be
Prepend the site domain to make a fully qualified URL, and each song page address looks like this: http://www.mldb.org/song-32520-come-together.html.
Let's go back into the source code and see how the lines are being extracted, because stitching together a better URL and saving its output as a song lyric source file should be easy, right?
Here's the line in question:
curl -s "$url&from=$start" | sed 's/</\
</g' | grep 'href="song-' > $output$start
Instead of just writing it to the output file, however, what if I built a proper URL and handed it to a subroutine that could use that to extract lyrics? Sounds easy, but keep in mind that the above produces a list of 30 songs, not a single song match.
In fact, the easiest solution is to change the code to stick with the output file, but make it a temp file, as it's just for internal use. Then I can step through the file line by line as desired.
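The script needs $tempfile to point at a real file before the curl line runs. Here is a minimal sketch of one way to set that up, assuming mktemp is available (it is on most Linux and macOS systems). The "songlist" name pattern is my own placeholder, not something from the original script:

```shell
# Create a private scratch file; the "songlist" prefix is an arbitrary choice.
tempfile=$(mktemp "${TMPDIR:-/tmp}/songlist.XXXXXX") || exit 1

# Remove the scratch file whenever the script exits, even on an error.
trap 'rm -f "$tempfile"' EXIT
```

Using mktemp rather than a fixed name like /tmp/songs means two copies of the script can run at once without overwriting each other's data.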
First, the simple change in the curl statement:
curl -s "$url&from=$start" | sed 's/</\
</g' | grep 'href="song-' > $tempfile
Next, here's code that can go through the output file, making line-by-line calls to a shell script function:
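while read -r lineofdata
do
  # field 2 between the double quotes is the href; its second
  # dash-separated piece is the numeric song ID
  songnum=$(echo "$lineofdata" | cut -d\" -f2 | cut -d- -f2)
  fullurl="http://www.mldb.org/$(echo "$lineofdata" | cut -d\" -f2)"
  savelyrics "$songnum" "$fullurl"
done < "$tempfile"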
Why am I saving the song number separately? Because it makes for an easy file output name, as I want to save the lyrics to each and every one of the matching songs. Yes, I could put them in one massive file, but somehow that doesn't seem right.
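To see what those two cut invocations pull out, here is a small sketch that runs them against a single sample line copied from the link list shown earlier. The href variable is an extra name I've introduced for readability:

```shell
# One line as it appears in the link list.
lineofdata='<a href="song-32520-come-together.html">Come Together'

# Splitting on double quotes, field 2 is the href value.
href=$(echo "$lineofdata" | cut -d\" -f2)

# Splitting the href on dashes, field 2 is the numeric song ID.
songnum=$(echo "$href" | cut -d- -f2)
fullurl="http://www.mldb.org/$href"

echo "$songnum"    # 32520
echo "$fullurl"    # http://www.mldb.org/song-32520-come-together.html
```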
The work is all done by the savelyrics function, and here's how I've written it, having spent some time fine-tuning the filtering and transformation:
function savelyrics
{
  # extract just the lyrics and save them
  songnum="$1"
  fullurl="$2"
  # "<" is left unescaped below: GNU sed treats \< as a word
  # boundary rather than a literal angle bracket
  curl -s "$fullurl" | sed -n '/songtext/,/\/table/p' | \
    sed 's/>/\
/g;s/<\/p>//g' | grep -E "(<br|</p)" | \
    sed 's/<br \///g;s/<\/p//g' | uniq > "$output$songnum.txt"
  return 0
}
The curl statement gets the web page with the full song lyrics, which are roughly delineated by a CSS class ID of songtext and are contained in a crude HTML table, so the last line of the lyric appears prior to the table closing: </table>.
As I've mentioned before, sed is your friend when you want to extract well-delineated passages of text. Use sed -n to stop its usual behavior of echoing everything seen and /start/,/end/p to print just the lines from the first pattern through the second.
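As a quick offline illustration of that range print, the sketch below feeds sed a tiny made-up page fragment instead of a real download. The markup in samplepage is hypothetical and far simpler than the real site's HTML:

```shell
# Hypothetical page fragment standing in for a downloaded lyrics page.
samplepage='<html><body>
<div>site navigation</div>
<table>
<tr><td class="songtext">Try to see it my way<br />
While you see it your way</p></td></tr>
</table>
<div>footer</div>
</body></html>'

# Print only from the first "songtext" line through the next "/table" line.
lyricblock=$(printf '%s\n' "$samplepage" | sed -n '/songtext/,/\/table/p')
printf '%s\n' "$lyricblock"
```

Three lines come out: the two lyric lines and the closing </table> line. The navigation and footer lines are never printed.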
The problem is that even when you convert every closing angle bracket into a newline (to break the source file into a ton of separate lines for further processing), it's still a bit messy. Almost all lyric lines end with the sequence <br />, but the very last line of the lyrics has a </p> instead.
To catch both those lines and screen out everything else, grep has the handy -E flag, which turns on extended regular expressions. Regular expressions are a world unto themselves (which I've delved into in prior columns), but suffice it to say a pattern of the form (A|B) produces lines that have either pattern A or pattern B, exactly as you'd hope.
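Here is a small sketch of that filter in isolation. The fragments variable holds made-up lines of the kind left over once each ">" has been turned into a line break:

```shell
# Sample fragments: two lyric lines mixed in with leftover markup.
fragments='<td class="songtext"
Try to see it my way<br /
While you see it your way</p
</td
</table'

# Keep only lines containing "<br" or "</p"; everything else is dropped.
lyriclines=$(printf '%s\n' "$fragments" | grep -E '(<br|</p)')
printf '%s\n' "$lyriclines"
```

Only the two lyric lines survive, each still carrying its trailing tag fragment.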
That's really all the work. The third sed in the pipe simply removes the fragmentary remnant HTML code:
sed 's/<br \///g;s/<\/p//g'
(Remember, the format is s/old/new/g for a global substitution. This just looks more complex because "/" is part of the source pattern and has to be escaped. The angle brackets are deliberately left unescaped: GNU sed reads \< as a word-boundary anchor rather than a literal "<". The ";" lets you put two sed command sequences on the same line for convenience.)
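If the escaped slashes are hard to read, sed accepts other delimiter characters for the s command. This is an alternative spelling rather than what the script uses. The sketch below does the same cleanup with "|" as the delimiter, on two sample lines:

```shell
# With "|" as the delimiter, the slashes in the pattern need no escaping.
cleaned=$(printf '%s\n' 'Try to see it my way<br /' 'While you see it your way</p' |
  sed 's|<br /||g;s|</p||g')

printf '%s\n' "$cleaned"
# Try to see it my way
# While you see it your way
```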
Do a quick uniq to minimize blank lines, and you're done, ready to save. A sample song lyric output:
$ head lyrics.32586.txt
Try to see it my way
Do I have to keep on talking till I can't go on
While you see it your way
Run the risk of knowing that our love may soon be gone
We can work it out, we can work it out
Think of what you're saying
You can get it wrong and still you think that it's alright
Think of what I'm saying
Know the song? Hear it in your head now? I could definitely keep going with the rest of the lyrics if we were switching to Karaoke at this point.
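One caveat about the uniq step: it collapses any run of identical adjacent lines, not just blank ones, so a lyric line sung twice in a row would be saved only once. The sketch below compares uniq with a small awk filter that squeezes only blank lines. The squeeze_blank function is a hypothetical helper of mine and the sample text is made up:

```shell
# Made-up sample: a line repeated back to back, then two blank lines.
sample=$(printf '%s\n' 'la la la' 'la la la' '' '' 'the end')

# uniq merges the repeated lyric line as well as the blank lines.
with_uniq=$(printf '%s\n' "$sample" | uniq | wc -l | tr -d ' ')

# Print every non-blank line, but only the first of each run of blank lines.
squeeze_blank() {
  awk 'NF > 0 { print; blank = 0; next } !blank { print; blank = 1 }'
}
with_awk=$(printf '%s\n' "$sample" | squeeze_blank | wc -l | tr -d ' ')

echo "uniq keeps $with_uniq lines, squeeze_blank keeps $with_awk"
# uniq keeps 3 lines, squeeze_blank keeps 4
```

Whether this matters depends on how often a song repeats a line immediately. For word counts, though, every dropped line means dropped words.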
Try to See It My Way
I made one more tweak to the script so that the status output as it runs would be interesting. This now appears just before the call to savelyrics:
echo "$lineofdata ($songnum)" | cut -d\> -f2
And so, when run, the script has this sort of output:
$ sh getsongs.sh
I Am The Walrus (32476)
Across The Universe (32554)
Come Together (32520)
Yellow Submarine (32461)
Day Tripper (32585)
. . .
Maggie Mae (61310)
Back In The USSR (61300)
When I'm Sixty-Four (61299)
Good Morning Good Morning (61286)
Got To Get You Into My Life (61285)
Looks good. Here's a quick double-check:
$ ls lyrics.* | wc -l
240
Got all 240 songs, so let's do some analysis. First off, how many songs have the word "love" in their title? With the new improved script output, that's easy:
$ sh getsongs.sh | grep -i love | wc -l
13
Looking across all the songs, how many lyric lines have the word "love"?
$ cat lyrics.* | grep -i love | wc -l
445
That's a whole lot more than 160! But, what about lines that have the word love more than once? They'd be counted only once. In fact, a more traditional word analysis could be fun and interesting. Let's start with just a single song, however, the cheerily titled "I'm A Loser":
$ cat lyrics.61278.txt | tr ' ' '\n' | \
tr '[:upper:]' '[:lower:]' | sort | \
uniq -c | sort -rn | head
17 i
13 a
12 i'm
9 and
8 to
8
7 loser
6 have
5 what
4 not
Notice that the first tr translates all spaces to newlines, the second ensures everything's in lowercase (using POSIX character-class notation for portability), then I simply sort all the words, use uniq -c to generate counts, then reverse sort by numeric count and examine the top ten matches.
"I" is the most common word in this song lyric, followed by
"a". Not surprising. Notice that "loser" shows up
only seven times in the song (all in the reprise, actually).
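The entry with a count of 8 and no word beside it is most likely the empty "word" left by blank lines between verses. That is my assumption; doubled spaces would produce the same thing. Here is a sketch of the same pipeline wrapped in a function that drops those empty entries. The wordfreq name and the sample text are made up for illustration:

```shell
# Read text on stdin, print "count word" pairs, most frequent first.
wordfreq() {
  tr ' ' '\n' |                     # one word per line
    tr '[:upper:]' '[:lower:]' |    # fold everything to lowercase
    grep -v '^$' |                  # drop empty "words" from blank lines
    sort | uniq -c | sort -rn
}

# Sample input with a blank line in the middle.
printf '%s\n' 'Love love me do' '' 'You know I love you' | wordfreq | head
```

With this sample, "love" tops the list with a count of 3, and no blank entry appears.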
And what if I examine every single song lyric en masse? Here's a surprisingly similar command-line invocation:
$ cat lyrics.*.txt | tr ' ' '\n' | \
tr '[:upper:]' '[:lower:]' | sort | \
uniq -c | sort -rn | head
5990
1728 you
1475 i
1060 the
862 to
781 me
769 and
765 a
438 in
432 my
These are all what are generally considered "noise words" in semantic analysis, so let's expand the head to include more matches and I'll hand-edit this final result for your reading pleasure:
1728 you
781 me
399 love
366 know
250 she
205 her
There are lots more, but now there's an answer, ladies and gentlemen! I now can say definitively that the word love occurs exactly 399 times in The Beatles songs and 13 times in the group's song titles too (as revealed earlier).
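One thing a space-only split does not handle is punctuation: "love," and "love?" are tallied as different words from "love", so the totals could shift once punctuation is stripped. Here is a sketch of a punctuation-aware count, assuming that letters and apostrophes are the only characters that belong in a word. The countword function and the sample text are made up for illustration:

```shell
# Count exact occurrences of word $1 in the text arriving on stdin.
countword() {
  tr '[:upper:]' '[:lower:]' |       # fold case
    tr -cs "[:alpha:]'" '\n' |       # anything but letters/apostrophes ends a word
    grep -c -x -- "$1"               # count lines that are exactly the word
}

sample='All my love, all my LOVE.
Love? Lovely love!'

# Space-only split: every "love" here has punctuation attached, so none match.
naive=$(printf '%s\n' "$sample" | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' |
  grep -c -x love || true)
strict=$(printf '%s\n' "$sample" | countword love)

echo "space-only split: $naive, punctuation-aware: $strict"
# space-only split: 0, punctuation-aware: 4
```

The -x flag makes grep match whole lines only, so "lovely" is not mistaken for "love".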
Hello Goodbye
It took a while to get to the solution, but this analysis is a splendid example of what computer scientists call divide and conquer. Take a big problem and keep breaking it down into smaller and smaller parts until you can start to understand how to solve the little pieces. Then build it all back up so you can solve the big challenge.
Now, what about The Monkees? How often did they actually reference monkeys in their song lyrics? Hmm....