Work the Shell - Our Twitter Autoresponder Goes Live!
I can't believe it, this is my 52nd column. That means I've been writing for Linux Journal for almost four and a half years. Hopefully, you've been reading my column just as long and enjoying our monthly forays into the world of shell script programming. On the tech side, quite a bit has changed in the last four and a half years. But on the Linux/shell side, it's surprisingly similar to how it was when I wrote my first column.
Last month, we continued to build a Twitter autoresponder script that could read and parse Twitter messages (aka tweets). We got it working and wrapped up the column by realizing we actually needed to capture the unique tweet ID in addition to name and message, so we could ensure that the script kept track of what it had or hadn't answered.
The script keeps track of tweets by ID and knows both how to parse the incoming Twitter stream and how to remember if it has seen a one-word tweet request or not. Run it once, and I see:
Twitter user @jlight asked for the time
@jlight the time on our server is LOCALTIME
The next time I run it, just a few minutes later, I see:
Twitter user @truss asked for the time
@truss the time on our server is LOCALTIME
Twitter user @tlady asked what our address in tweet 7395272164
@tlady we're located at 123 University Avenue, Anywhere USA
It looks good, but there's a problem in the script, because one of the output diagnostic lines is:
Twitter user @ asked for the time
@ the time on our server is LOCALTIME
Somehow it's not identifying the user name for this particular user. After a quick analysis of the actual Twitter.com data, it appears that the first tweet comes out of the parser section without an associated user name.
To debug this, first get a copy of the script to follow along (the script from last month is at ftp.linuxjournal.com/pub/lj/listings/issue191/10695.tgz). In the while loop, I'll add this line to aid in debugging:
echo got name = $name, id = $id, and msg = $msg
Now when I run the script, here's what I see:
got name = , id = 7395437583, and msg = VERY cool
got name = spin, id = 7395333666, and msg = time
got name = astrong, id = 7395281516, and msg = time
got name = truss, id = 7395281011, and msg = time
Clearly something's wrong, but what?
One reason I like to use temp files in scripts rather than having incredibly long and complicated pipes is for debugging this sort of problem.
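As a minimal sketch of that approach, the following splits a small pipeline into stages with a temp file between each, so any stage can be inspected on its own. The canned input and the names stage1 and stage2 are made up for illustration; they are not taken from the actual script.

```shell
#!/bin/sh
# Sketch: stage a pipeline through temp files so each step can be inspected.
# The sample data and the stage1/stage2 names are illustrative assumptions.

stage1=$(mktemp) || exit 1
stage2=$(mktemp) || exit 1
trap 'rm -f "$stage1" "$stage2"' EXIT

# Stage 1: keep only the lines we care about (stand-in for curl | grep).
printf '%s\n' '<id>1001</id>' '<junk>x</junk>' '<text>time</text>' |
  grep -E '(<text>|<id>)' > "$stage1"

# Stage 2: strip the tags.
sed 's/<[^>]*>//g' "$stage1" > "$stage2"

# While debugging, any intermediate file can be eyeballed, for example:
#   more "$stage1"
cat "$stage2"
```

The cost is a little extra disk I/O; the payoff is that when the final output looks wrong, you can check each stage's file instead of guessing which part of a long pipe misbehaved.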
Recall that the main parsing work is done by curl feeding its output to grep, then a sequence of sed invocations and finally a quick call to awk:
$curl -u "davetaylor:$pw" $inurl | \
  grep -E '(<screen_name>|<text>|<id>)' | \
  sed 's/@DaveTaylor //;s/ <text>//;s/<\/text>//' | \
  sed 's/ *<screen_name>//;s/<\/screen_name>//' | \
  sed 's/ *<id>//;s/<\/id>//' | \
  awk '{ if (NR % 4 == 0) { printf ("name=%s; ", $0) }
    else if (NR % 4 == 1) { printf ("id=%s; ",$0) }
    else if (NR % 4 == 2) { print "msg=\"" $0 "\"" }
  }' > $temp
Adding the command more $temp immediately after this means we can eyeball the data stream and see what's different about the first and second lines (as the second is parsed properly). Here's what I see:
id=7395681235; msg="African or European?"
name=jeffrey; id=7395672894; msg="North Hall IStage"
Note that there's no name= field on the first message. My theory? There's a logic error in the awk statement that's causing it to skip the first entry somehow.
To test that assumption, I'll temporarily replace the entire awk script with another that outputs the record number (mod 4) followed by the data line:
awk '{ print (NR % 4), $0 }' > $temp
The result is exactly what we were expecting, which is a bit confusing:
1 7395934047
2 we are at the MGM as well!
3 14171725
0 sideline
1 7395681235
2 African or European?
3 14712874
0 jeffrey
Here, Twitter user sideline has sent “we are at the MGM as well!”, and jeffrey sent the message “African or European?”.
The problem isn't that the data is being eaten; it's that the awk script is pairing the name information with the wrong tweet. Let's re-examine the awk script:
awk '{ if (NR % 4 == 0) { printf ("name=%s; ", $0) }
  else if (NR % 4 == 1) { printf ("id=%s; ",$0) }
  else if (NR % 4 == 2) { print "msg=\"" $0 "\"" }
}'
NR%4=0 is correctly tagged as the name, NR%4=1 as the message ID, NR%4=2 as the msg, and NR%4=3 is skipped. (It's the Twitter user ID, not the tweet ID. It might be useful in a different context, but not for what we're doing.)
The problem is subtle, but it becomes obvious when you compare what the parser is generating against the actual tweets in the Twitter stream. We saw the first two like this:
id=7395681235; msg="African or European?"
name=jeffrey; id=7395672894; msg="North Hall IStage"
But in fact, the tweet “African or European?” was sent by jeffrey, and the “North Hall IStage” was sent by the user identified in the subsequent line of parsed and formatted data.
Conclusion? We're splitting the data lines in the wrong place. Instead of adding the newline after NR%4==2 (it's subtle: we use print instead of printf, and print appends a newline), we actually should be adding it after the match for NR%4==0, like this:
awk '{ if (NR % 4 == 0) { printf ("name=%s;\n", $0) }
  else if (NR % 4 == 1) { printf ("id=%s; ",$0) }
  else if (NR % 4 == 2) { printf ("msg=\"%s\"; ", $0) }
}'
Now, let's try that statement again:
id=7395681235; msg="African or European?"; name=jeffrey;
id=7395672894; msg="North Hall IStage"; name=sideline;
Ah, that's the ticket!
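If you want to reproduce the fix without hitting Twitter, here is a small sketch that replays the eight sample lines shown earlier through the corrected awk logic. The function name format_tweets is hypothetical, and the input is canned rather than live.

```shell
#!/bin/sh
# Sketch: replay the sample lines from the column through the corrected awk.
# format_tweets is a hypothetical helper name; the input is canned, not live.

format_tweets() {
  awk '{
    if (NR % 4 == 1)      { printf("id=%s; ", $0) }
    else if (NR % 4 == 2) { printf("msg=\"%s\"; ", $0) }
    else if (NR % 4 == 0) { printf("name=%s;\n", $0) }
    # NR % 4 == 3 is the Twitter user ID, which we skip
  }'
}

sample='7395934047
we are at the MGM as well!
14171725
sideline
7395681235
African or European?
14712874
jeffrey'

result=$(printf '%s\n' "$sample" | format_tweets)
printf '%s\n' "$result"
```

Each output line now ends with the name field, so the ID, message and name on a line all belong to the same tweet.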
With the problem solved, I'll remove the added debug statements and unleash the listener beast:
got name = jeffrey, id = 7395681235, and msg = African or European?
got name = sideline, id = 7395672894, and msg = North Hall IStage
got name = Genuine, id = 7395669466, and msg = ummmmm I know
Perfect. Bug debugged!
Now when we run the script, it correctly sees only the new tweets since it was last run, and it responds only to those it understands:
Twitter user @Larkin asked for the time
@Larkin the time on our server is LOCALTIME
Twitter user @jennyj asked for the time
@jennyj the time on our server is LOCALTIME
Run the script again, and it sees only what's newer yet:
Twitter user @NoA asked for directions in tweet 7396527668
@NoA directions to our office are here: SOMEADDRESS
Perfect! Now, a tiny tweak. As we've debugged things, I have set the variable tweet to /bin/echo, so as not to flood my followers with unnecessary messages. Change it back to the tweet.sh script (as developed in an earlier series of columns), and the script actually responds with tweets.
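The same trick can be wrapped in a switch so you don't have to edit the script each time. This is only a sketch: DRYRUN and send_reply are hypothetical names, and ./tweet.sh is an assumed location for the posting script from the earlier columns, which is not reproduced here.

```shell
#!/bin/sh
# Sketch: a dry-run switch for the command that posts replies.
# DRYRUN and send_reply are hypothetical names; tweet.sh is the
# posting script from the earlier columns and is not included here.

DRYRUN=${DRYRUN:-1}

if [ "$DRYRUN" = "1" ]; then
  tweet="/bin/echo"        # print the reply instead of posting it
else
  tweet="./tweet.sh"       # assumed location of the real posting script
fi

send_reply() {
  # $1 = user name, $2 = reply text
  $tweet "@$1 $2"
}

send_reply jlight "the time on our server is $(date)"
```

Defaulting to the dry run means a forgotten setting produces harmless output on the terminal rather than a burst of real tweets.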
The first run looks like this:
$ sh tweet-listen.sh
Twitter user @mosa asked for directions in tweet 7396566048
(sent tweet @mosa directions to our office are here: SOMEADDRESS)
Twitter user @xwatch asked for the time
(sent tweet @xwatch the time on our server is TIME)
Twitter user @NoA asked for directions in tweet 7396527668
(sent tweet @NoA directions to our office are here: SOMEADDRESS)
To ensure that it won't answer more than once to a tweet query, I'll run the script again:
$ sh tweet-listen.sh
$
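The ID-tracking code itself isn't reproduced in this column, so the following is just one way the idea could be sketched: keep answered tweet IDs in a flat file, one per line, and check it before responding. The names seen_file, already_seen and mark_seen are illustrative, and a real bot would use a fixed path rather than mktemp so the list survives between runs.

```shell
#!/bin/sh
# Sketch: remember answered tweet IDs in a flat file, one per line.
# seen_file, already_seen and mark_seen are illustrative names only.

seen_file=$(mktemp) || exit 1

already_seen() {
  # -x matches the whole line, -F treats the ID as a fixed string
  grep -qxF "$1" "$seen_file"
}

mark_seen() {
  echo "$1" >> "$seen_file"
}

# The first ID appears twice to simulate a second run seeing an old tweet.
for id in 7396566048 7396527668 7396566048; do
  if already_seen "$id"; then
    echo "skipping $id"
  else
    echo "answering $id"
    mark_seen "$id"
  fi
done
```

Matching the whole line matters here: without -x, a short ID that happens to be a substring of a longer one would be treated as already answered.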
That's it! Now one tiny additional task is left, to add it to crontab so that it'll be an active listener, which is done by having it run every two minutes with the line:
*/2 * * * * bash $SCRIPTS/davesbot/tweet-listen.sh
That's all there is to it. Congratulations, we've just built a fully featured Twitterbot.
If you'd like to test it, it has its own account on Twitter, @davesbot. Start by sending a 2–3 word message, and it'll tell you what it can do. Grab the final source code from the LinuxJournal.com site at ftp.linuxjournal.com/pub/lj/listings/issue192/10711.tgz.
Dave Taylor has been hacking shell scripts for a really long time. He's the author of the popular Wicked Cool Shell Scripts and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.