Normalizing Filenames and Data with Bash
URLify: convert letter sequences into safe URLs with hex equivalents.
This is my 155th column. That means I've been writing for Linux Journal for:
$ echo "155/12" | bc
12
No, wait, that's not right. Let's try that again:
$ echo "scale=2;155/12" | bc
12.91
Yeah, that many years. Almost 13 years of writing about shell scripts and lightweight programming within the Linux environment. I've covered a lot of ground, but I want to go back to something that's fairly basic and talk about filenames and the web.
It used to be that if you had filenames that had spaces in them, bad things would happen: "my mom's cookies.html" was a recipe for disaster, not good cookies—um, and not those sorts of web cookies either!
As the web evolved, however, encoding of special characters became the norm, and every Web browser had to be able to manage it, for better or worse. So spaces became either "+" or %20 sequences, and everything else that wasn't a regular alphanumeric character was replaced by its hex ASCII equivalent.
In other words, "my mom's cookies.html" turned into "my+mom%27s+cookies.html" or "my%20mom%27s%20cookies.html". Many symbols took on a second life too, so "&" and "=" and "?" all got their own meanings, which meant that they needed to be protected if they were part of an original filename too. And what about if you had a "%" in your original filename? Ah yes, the recursive nature of encoding things....
So purely as an exercise in scripting, let's write a script that converts any string you hand it into a "web-safe" sequence. Before starting, however, pull out a piece of paper and jot down how you'd solve it.
Normalizing Filenames for the Web

My strategy is going to be easy: pull the string apart into individual characters, analyze each character to identify if it's an alphanumeric, and if it's not, convert it into its hexadecimal ASCII equivalent, prefacing it with a "%" as needed.
There are a number of ways to break a string into its individual letters, but let's use Bash string variable manipulations, recalling that ${#var} returns the number of characters in variable $var, and that ${var:x:1} will return just the letter in $var at position x. Quick now, does indexing start at zero or one?
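A quick experiment answers the indexing question: position 0 is the first character. (The variable here is a throwaway of my own, not part of the script.)

```shell
#!/bin/bash
# Bash substring indexing starts at zero.
var="linux"
echo "${#var}"      # string length: 5
echo "${var:0:1}"   # position 0 is the first character: l
echo "${var:4:1}"   # position 4 is the fifth (last) character: x
```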
Here's my initial loop to break $input into its component letters:
input="$*"
echo $input

for (( counter=0 ; counter < ${#input} ; counter++ ))
do
    echo "counter = $counter -- ${input:$counter:1}"
done
Recall that $* is a shortcut for everything from the invoking command line other than the command name itself—a lazy way to let users quote the argument or not. It doesn't address special characters, but that's what quotes are for, right?
Let's give this fragmentary script a whirl with some input from the command line:
$ sh normalize.sh "li nux?"
li nux?
counter = 0 -- l
counter = 1 -- i
counter = 2 --
counter = 3 -- n
counter = 4 -- u
counter = 5 -- x
counter = 6 -- ?
There's obviously some debugging code in the script, but it's generally a good idea to leave that in until you're sure it's working as expected.
Now it's time to differentiate between characters that are acceptable within a URL and those that are not. Turning a character into a hex sequence is a bit tricky, so I'm using a sequence of fairly obscure commands. Let's start with just the command line:
$ echo '~' | xxd -ps -c1 | head -1
7e
Now, the question is whether "~" is actually the hex ASCII sequence 7e or not. A quick glance at http://www.asciitable.com confirms that, yes, 7e is indeed the ASCII for the tilde. Preface that with a percentage sign, and the tough job of conversion is managed.
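As an aside, if xxd isn't available, Bash's own printf can produce the same hex value: a leading single quote on the argument makes printf use the character's numeric code. This is a standard printf feature, offered here as an alternative, not part of the column's script:

```shell
#!/bin/bash
# printf with a leading single quote on the argument yields the
# character's numeric value; %02x formats it as two hex digits.
char='~'
printf '%02x\n' "'$char"    # prints 7e
```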
But, how do you know what characters can be used as they are? Because of the weird way the ASCII table is organized, that's going to be three ranges: 0–9 is in one area of the table, then A–Z in a second area and a–z in a third. There's no way around it, that's three range tests.
There's a really cool way to do that in Bash too:
if [[ "$char" =~ [a-z] ]]
What's happening here is that this is actually a regular expression test (the =~) with the range [a-z] as the pattern. Since the action I want to take after each test is identical, it's easy now to implement all three tests:
if [[ "$char" =~ [a-z] ]]; then
    output="$output$char"
elif [[ "$char" =~ [A-Z] ]]; then
    output="$output$char"
elif [[ "$char" =~ [0-9] ]]; then
    output="$output$char"
else
As is obvious, the $output string variable will be built up to have the desired value.
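Since all three branches append the same thing, the tests could also be collapsed into a single bracket expression. This is an equivalent alternative, not what the script above does:

```shell
#!/bin/bash
# One bracket expression covers all three ASCII ranges at once.
char="q"
output=""
if [[ "$char" =~ [a-zA-Z0-9] ]]; then
    output="$output$char"   # safe character: pass through unchanged
fi
echo "$output"
```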
What's left? The hex output for anything that's not an otherwise acceptable character. And you've already seen how that can be implemented:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1)"
output="$output%$hexchar"
A quick run-through:
$ sh normalize.sh "li nux?"
li nux? translates to li%20nux%3f
See the problem? Without converting the hex into uppercase, it's a bit weird looking. What's "nux"? That's just another step in the subshell invocation:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1 | \
tr '[a-z]' '[A-Z]')"
And now, with that tweak, the output looks good:
$ sh normalize.sh "li nux?"
li nux? translates to li%20nux%3F
What about a non-ASCII character like an umlaut or an n-tilde? Let's see what happens:
$ sh normalize.sh "Señor Günter"
Señor Günter translates to Se%C3B1or%20G%C3BCnter
Ah, there's a bug in the script when it comes to these two-byte character sequences, because each special letter should produce two hex byte sequences. The string should be converted to se%C3%B1or g%C3%BCnter (I restored the space to make it a bit easier to see what I'm talking about). In other words, the script gets the right byte values, but it's missing a percentage sign: %C3B1 should be %C3%B1, and %C3BC should be %C3%BC.
Undoubtedly, the problem is in the hexchar assignment subshell statement:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1 | \
tr '[a-z]' '[A-Z]')"
Is it the -c1 argument to xxd? Maybe. I'm going to leave identifying and fixing the problem as an exercise for you, dear reader. And while you're fixing up the script to support two-byte characters, why not replace "%20" with "+" too?
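As a hint toward the fix, watch what xxd emits for a two-byte character when head doesn't throw the later lines away. (echo -n keeps the trailing newline out of the byte stream, and the ñ is just an illustrative input.)

```shell
#!/bin/bash
# A UTF-8 ñ is two bytes, and xxd -ps -c1 prints one hex pair per line:
# c3 on the first line, b1 on the second. Keeping only one of those
# lines, or gluing them together without a "%", is where the bug lives.
echo -n 'ñ' | xxd -ps -c1
```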
Finally, to make this maximally useful, don't forget that there are a number of symbols that are valid within URLs and don't need to be converted, notably the set "-_./!@#=&?", so you'll want to ensure that they don't get hexified (is that a word?).
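Here's one way to fold those symbols into the earlier test, keeping the pattern in a variable so the shell doesn't trip over characters like "&". The symbol list is the one above, not a complete treatment of which URL characters are reserved:

```shell
#!/bin/bash
# Whitelist alphanumerics plus the safe symbols called out above. The
# "-" sits first in the bracket expression so it's taken literally,
# not as a range operator.
safe='[-_./!@#=&?a-zA-Z0-9]'
for char in 'a' '-' '?' ' ' '%'; do
    if [[ "$char" =~ $safe ]]; then
        echo "'$char' : safe"
    else
        echo "'$char' : needs encoding"
    fi
done
```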