Internationalizing Those Bash Scripts
The first software that I was actually paid to develop was a 2-page shell script that prompted the user for a dozen or so pieces of information, before launching a set of cooperating processes. Those processes formed the core of a performance evaluation suite for the public telephone network - a rather sizable system for its day with high visibility.
Thinking through that assignment and the greater application, I can say with complete certainty that none of its stakeholders were contemplating [human] language independence - that is, how to render prompts, error messages, progress diagnostics, etc. in a language other than US-English. Even if we had been thinking that progressively, the level of facilitation provided by development languages/platforms was either very limited or non-existent.
Fast forward to 2010, language independence - or Internationalization as it has come to be known - is something that is now expected of commercial grade software. That shell script that I had proudly written back in 1982 was one of a few application modules that interacted directly with the user or generated progress diagnostics. That is exactly the sort of shell script that would compel us to consider Internationalization.
My motivation to offer up this column is grounded in a recent experience. Our team was asked to assess the 'Internationalization readiness' of a large-scale legacy system - that is, to identify modules that were not internationalized and needed to be, and to estimate the effort to apply all required changes. The gap was mainly found in modules implemented in interpreted languages such as Rexx, TCL and the bash shell. I found that while documentation on Internationalization was generally available for most programming languages used in this application, there wasn't much to be found for shell scripting (at least nothing that provided a "how-to" with code samples). One of the more complete online resources I found was an appendix to a bash-scripting guide, which started out with the following sentence: "Localization is an undocumented Bash feature." Well, at least it offered some hope, basic information and code fragments. This column goes on to distill what I thought was missing in a complete but summary form.
The Big Picture (in a small frame)

First, let's agree on a common vocabulary - terms that begin to lay out a framework for the effort and code samples presented thereafter.
- Message Catalog: is an indexed repository of natural language messages used by Internationalized applications. The Message Catalog provides for the decoupling of the [human] language content and the application code. When an application needs to access a message at run time, something in the underlying processing stack knows how to retrieve it based on a unique key. The format and maintenance details of a Message Catalog are typically development-platform specific, but the goal is always the same - decouple and centralize the application's natural language text.
- Internationalization: the term Internationalization (hereafter referred to as its commonly known abbreviation: I18N - "I - eighteen letters - N") applies to the steps that software designers/developers take in order to make an application language-independent. At the coding level, user readable text is never compiled into the application or intermixed with a markup language. Instead, the application code refers to such content through unique message-catalog keys.
- Localization: (sometimes abbreviated as "L10N") applies to the process of adapting an application to specific target languages. If I18N has been applied, Localization should not involve re-coding, but rather focuses on language translation and re-deployment. Stated another way, Localization is simply the process of adding support for a new language - translating Message Catalog content from one language to another.
- Locale: is the part of a user's environment that defines location, country and culture information - most noticeably, the user's language preference. The Locale is typically installed and configured as part of the underlying operating system or rendering application such as a browser.
So let's summarize. We internationalize so that we can localize. I18N is a design and coding time effort that requires developers to adhere to certain design and coding practices with one primary goal in mind - decoupling language sensitive content from the source code. For every language that needs to be supported, a Localization effort is performed - creating a new Message Catalog for that language.
The good news here is that the I18N process need not start from first principles. Most modern development languages, including Bash, offer features that facilitate the basics - leaving the developer with the task of deciding how to integrate these basics into the lifecycle process and the code base.
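To make the idea of key-based retrieval concrete, here is a toy sketch. It is not how gettext stores or finds messages; it only mimics the decoupling described above with a bash associative array (bash 4 or later is assumed). The array name CATALOG, the function name lookup and the sample messages are all made up for illustration.

```bash
#!/bin/bash
# Toy illustration only: NOT how gettext stores messages.
# Requires bash 4+ for associative arrays.

# One array plays the role of all catalogs; keys are "<language>:<message key>".
declare -A CATALOG=(
    ["en:Greeting"]="Hello"
    ["en:Farewell"]="Goodbye"
    ["it:Greeting"]="Ciao"
)

# lookup LANGUAGE KEY
# Print the message for KEY in LANGUAGE; fall back to English,
# then to the key itself, so the caller always gets some text.
lookup() {
    local id="$1:$2" fallback="en:$2"
    if [[ -n "${CATALOG[$id]+set}" ]]; then
        printf '%s\n' "${CATALOG[$id]}"
    elif [[ -n "${CATALOG[$fallback]+set}" ]]; then
        printf '%s\n' "${CATALOG[$fallback]}"
    else
        printf '%s\n' "$2"
    fi
}

lookup it Greeting    # Italian entry exists
lookup it Farewell    # no Italian entry, the English one is used
lookup it Unknown     # no entry at all, the key is printed
```

The application code only ever mentions keys such as Greeting; adding a language means adding entries, not touching the lookup logic.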
In and Out of Scope

The only soft prerequisites to getting the most out of this material are a general understanding of I18N (independent of programming language, as presented above) and a basic familiarity with shell scripting.
In the grand scheme, I18N/L10N goes beyond natural language independence. Although not the focus of this column, a Locale can include preferences that define date/time format, currency symbols, time zones, non-working days, … which all serve to drive aspects of processing and presentation. The process, coding and testing examples presented here only focus on language preference. It also should be noted that a rather simple example of Localization is presented - US English to Italian - languages that share the same alphabet (more or less). This precludes the need to cover details like extended character sets and the role of localized I/O devices such as keyboards. Other deeper and broader areas of I18N can be researched for further study through online and other resources. Here are some examples:
Unicode character encoding standards | http://www.unicode.org
Decent intro to I18N | http://www.debian.org/doc/manuals/intro-i18n/
W3C related I18N Material | http://www.w3.org/International/
Advanced Bash Scripting Guide | http://www.tldp.org/LDP/abs/html/
Building on the fundamentals outlined above, let's move on to a real example. This section demonstrates how I18N and Localization are supported and applied in a bash environment, using a simple bash script to drive home concepts and details.
First, what sort of shell script elements are sensitive to natural language support? Well, the short answer is anything that a human user visually reviews as part of using an application. So that would include:
- Textual prompts to the user
- Error messages
- Progress or error diagnostics diverted to log files or presented on a console
- Help text, and other usage information and interactive documentation.
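As a rough first pass at locating such elements in an existing script, a plain text search can flag candidate lines. The sketch below is a heuristic, not a parser: the names I18N_PATTERN and find_i18n_candidates are made up for this example, and the pattern only catches echo, printf and read commands that are followed by a double-quoted string on the same line.

```bash
#!/bin/bash
# Heuristic: echo/printf/read followed by whitespace and, later on the
# same line, a double-quoted string. Illustrative only; expect both
# false positives and misses (here-documents, single quotes, ...).
I18N_PATTERN='(echo|printf|read)[[:space:]].*"'

# find_i18n_candidates FILE...
# Print "line-number:line" for every candidate line (grep also prefixes
# the file name when several files are given).
find_i18n_candidates() {
    grep -nE "$I18N_PATTERN" "$@"
}

# Example:
#   find_i18n_candidates orig-rand.sh
```

Every hit still needs a human decision, since a quoted string is not necessarily text that a user will read.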
Just how does Bash facilitate I18N and Localization? We'll begin answering that question by presenting a shell script that cannot be considered internationalized. The short script below doesn't have much of a commercial value, but that "quality" will allow us to focus on the task at hand - identifying and applying changes to language sensitive areas. This script generates and displays a random number within a range provided by the user, and logs its activity.
File: orig-rand.sh

    #!/bin/bash

    function random {
        typeset low=$1 high=$2
        echo $(( ($RANDOM % ($high - $low) ) + $low ))
    }

    #(1)
    echo "Hello, I can generate a random number between 2 numbers that you provide"
    #(2)
    echo -n "What is your low number? "
    read low
    #(3)
    echo -n "What is your high number? "
    read high

    if [[ $low -ge $high ]]
    then
        #(4)
        echo "1st number should be lower than the second - leaving early." >&2
        exit 1
    fi

    rand=$(random $low $high )
    #(5)
    echo "from/to generated (by/at): $low / $high $rand (${LOGNAME} / $(date))" >> /tmp/POC
    #(6)
    echo "Your Random Number Is: $rand "
    exit 0
Running the script produces the expected output.
    $: orig-rand.sh
    Hello, I can generate a random number between 2 numbers that you provide
    What is your low number? 50
    What is your high number? 125
    Your Random Number Is: 95
    $:

Commented lines (1) through (6) have been flagged as requiring change - as they contain natural language. With this content identified, we can move on to creating a Message Catalog that can be used by an altered, internationalized script. To introduce the format, here's an example Message Catalog. It contains 2 messages - a greeting and an error message. The general format of the file consists of key/value line pairs. In each pair, the "msgid" line names a key and the "msgstr" line associates a natural language value with it. Each Message Catalog supports exactly one language - in this case, US-English.
File: en.po
    msgid "Main Greeting"
    msgstr "Welcome, what do you want to do today?"
    msgid "Missing File Error"
    msgstr "File Not Found"

Message Catalogs like this can be constructed manually, post processed and installed in the environment to support one or more applications. (These Message Catalogs reside in files that are otherwise referred to as Portable Object files, and by convention, are named with a .po suffix).
Now let's construct a Message Catalog to maintain the user viewable content found in the example script above. Notice there are 6 distinct messages that line up with the content that was embedded in the original script.
File: en.po
    msgid "Greeting"
    msgstr "Hello, I can generate a random number between 2 numbers that you provide"
    msgid "Low Number Prompt"
    msgstr "What is your low number"
    msgid "High Number Prompt"
    msgstr "What is your high number"
    msgid "Input Error"
    msgstr "1st number should be lower than the second - leaving early."
    msgid "Result Title"
    msgstr "Your Random Number Is: "
    msgid "Activity Log"
    msgstr "from/to generated (by/at): "
Okay, at least as far as the Message Catalog is concerned, we now have US English content covered. Now let's assemble one for another language - Italian.
File: it.po
    msgid "Greeting"
    msgstr "Ciao, posso generare un numero casuale fra il numero 2 che assicurate"
    msgid "Low Number Prompt"
    msgstr "Che cosa il vostro numero basso"
    msgid "High Number Prompt"
    msgstr "Che cosa il vostro alto numero"
    msgid "Input Error"
    msgstr "il primo numero dovrebbe essere più basso del secondo - andando presto."
    msgid "Result Title"
    msgstr "Il vostro numero casuale :"
    msgid "Activity Log"
    msgstr "da/al generato a (da/a):"
Notice that the "msgid" values are constant and have not changed. They will be used by a modified script - an internationalized script. Now that the language catalogs exist, what needs to be done to make them accessible by Internationalized scripts? Linux provides a utility called "msgfmt" that creates 'message object files' (*.mo) from portable object files (*.po), without changing the portable object files. Refer to the installed or online manual page for complete command line usage details. Executing the following commands will generate and install the message object files for both US-English and Italian.
    msgfmt -o rand.sh.mo it.po
    cp -p rand.sh.mo $HOME/locale/it/LC_MESSAGES/
    msgfmt -o rand.sh.mo en.po
    cp -p rand.sh.mo $HOME/locale/en/LC_MESSAGES/
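The commands above assume that the LC_MESSAGES directories already exist. A small installer along the following lines could create them on demand. This is a sketch, not the procedure used for this column: the function names catalog_dir and install_catalog are hypothetical, and the only msgfmt option relied upon is -o, as in the commands above.

```bash
#!/bin/bash
# catalog_dir BASE LANGUAGE
# Print the directory that holds the message object files for LANGUAGE.
catalog_dir() {
    printf '%s/%s/LC_MESSAGES\n' "$1" "$2"
}

# install_catalog BASE LANGUAGE DOMAIN PO_FILE
# Compile PO_FILE with msgfmt and install it as DOMAIN.mo under BASE,
# creating the directory tree if needed.
install_catalog() {
    local base="$1" lang="$2" domain="$3" po="$4" dir
    if ! command -v msgfmt >/dev/null 2>&1; then
        echo "msgfmt not found (it ships with GNU gettext)" >&2
        return 127
    fi
    dir=$(catalog_dir "$base" "$lang")
    mkdir -p "$dir" || return 1
    msgfmt -o "$dir/$domain.mo" "$po"
}

# Example, mirroring the commands above:
#   install_catalog "$HOME/locale" it rand.sh it.po
#   install_catalog "$HOME/locale" en rand.sh en.po
```

Writing the .mo file straight into its final location also avoids the intermediate copy in the current directory.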
Now that the Message Catalogs for two languages are installed, how can a bash script leverage them? The other Linux utility critical to our example is called "gettext".
Given a directory and file naming organization for the Message Catalogs, gettext provides access to the messages stored in the catalog. First, depicting how Message Catalogs must be stored on the file system, see the listing below. For each 2 letter language code ('en' and 'it' in our example), some number of "text domain" message object files are stored under a subdirectory called LC_MESSAGES. By convention, a text domain is related to a single application, but this is an organizational decision to be made when localizing.
Directory/file listing:
    en
    en/LC_MESSAGES
    en/LC_MESSAGES/rand.sh.mo
    it
    it/LC_MESSAGES
    it/LC_MESSAGES/rand.sh.mo

As shown above, we chose to install the Message Catalogs under the user's HOME directory under a subdirectory called locale. The system locale data distributed with Linux is normally found under /usr/lib/locale (the message catalogs of installed applications typically live under /usr/share/locale). Here's what some of the /usr/lib/locale directory listing looks like on my distribution:

    aa_DJ
    aa_DJ/LC_MESSAGES
    aa_DJ.utf8
    aa_DJ.utf8/LC_MESSAGES
    aa_ER
    aa_ER/LC_MESSAGES
    aa_ER@saaho
    ... many others not shown

Retrieving a message stored in a Message Catalog is very straightforward - the following 2 lines demonstrate basic access. See the installed or online manual page for complete command line usage. Setting the environment variable TEXTDOMAINDIR to the base of the Message Catalog directory is required. gettext also needs to know which text domain to search; the session below assumes that TEXTDOMAIN has already been exported as rand.sh.

    $: export TEXTDOMAINDIR=/home/lji/locale
    $: gettext -s "Greeting"
    Hello, I can generate a random number between 2 numbers that you provide
    $:
Notice that the invocation above compelled the 'gettext' utility to present the US-English copy of the message. This was driven by the language preference value assigned to the user's Locale. Without elaborating on the details, the 'locale' Linux utility displays the following values. Of course, the first value drives language preference.
    $: locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    ... other values not shown.
    $:

So if you're following along, the next natural question to ask is how to alter language preference. How can we test access to our Italian Message Catalog? Once again, without elaborating on the details, setting the environment variable LC_ALL to a value that includes language and country codes will reset every Locale attribute. Notice the updated output from the 'locale' utility after Italian/Italy (it/IT) has been assigned as the language/country.

    $: export LC_ALL="it_IT.UTF-8"
    $: locale
    LANG=it_IT.UTF-8
    LC_CTYPE="it_IT.UTF-8"
    LC_NUMERIC="it_IT.UTF-8"
    LC_TIME="it_IT.UTF-8"
    ... other values not shown.
    $:
Now if the same 'gettext' command is executed, we would expect to display the equivalent Italian content, and we do as shown below.
    $: gettext -s "Greeting"
    Ciao, posso generare un numero casuale fra il numero 2 che assicurate
    $:

So if the 'msgfmt' and 'gettext' utilities are the core of basic I18N and Localization in the bash shell, what's the best way of internationalizing the original example script and other scripts like it? The first step I took was to build a thin convenience library, which offers 4 useful functions. I chose this general approach for two reasons: it insulates the lowest level details from the application code, and promotes code reuse by offering developers a straightforward way of dealing with these common natural-language sensitive operations:
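Two practical details are worth a short sketch. First, the locale can be overridden for a single command instead of being exported for the whole session. Second, on systems using the GNU C library, translation generally only takes place when the requested locale is actually installed; otherwise the key tends to come back untranslated. The helper names normalize_locale and locale_installed below are hypothetical, and the check simply compares against the output of 'locale -a' (bash 4 or later is assumed for the lowercase expansion).

```bash
#!/bin/bash
# normalize_locale NAME
# Lowercase NAME and drop hyphens, so "it_IT.UTF-8" and "it_IT.utf8"
# (the spelling `locale -a` usually prints) compare equal. Bash 4+.
normalize_locale() {
    local name="${1,,}"
    printf '%s\n' "${name//-/}"
}

# locale_installed NAME
# Succeed if NAME appears in the output of `locale -a`.
locale_installed() {
    local want have
    want=$(normalize_locale "$1")
    while IFS= read -r have; do
        [[ $(normalize_locale "$have") == "$want" ]] && return 0
    done < <(locale -a 2>/dev/null)
    return 1
}

# Example: override the locale for one command only, without export.
#   if locale_installed it_IT.UTF-8; then
#       LC_ALL=it_IT.UTF-8 gettext -s "Greeting"
#   fi
```

Checking first makes a missing locale visible, rather than leaving an untranslated message as the only symptom.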
- displaying text to standard output
- displaying an error message
- prompting a user for a response
- logging a message to a file
The library code below sets the TEXTDOMAINDIR environment variable and implements 4 functions.
Source code for i18n-lib.sh
    #!/bin/bash
    ##
    # Thin library around basic I18N facilitated functions:
    # basic text display, file logging, error display, and prompting

    export TEXTDOMAINDIR=/home/lji/locale

    ###############################################
    ##
    ## Display some text to stderr
    ## $1 is assumed to be the Message Catalog key
    function i18n_error {
        echo "$(gettext -s "$1")" >&2
    }

    ###############################################
    ##
    ## Display some text to stdout
    ## $1 is assumed to be the Message Catalog key
    ## rest of args are used as misc information
    function i18n_display {
        typeset key="$1"
        shift
        echo "$(gettext -s "$key") $*"
    }

    ###############################################
    ## Append a log message to a file.
    ## use $1 as target file to append to
    ## use $2 as catalog key
    ## rest of args are used as misc information
    function i18n_fileout {
        [[ $# -lt 2 ]] && return 1
        typeset file="$1"
        typeset key="$2"
        shift 2
        echo "$(gettext -s "$key") $*" >> "${file}"
    }

    ###############################################
    ## Prompt the user with a message and echo back the response.
    ## $1 is assumed to be the Message Catalog key
    function i18n_prompt {
        typeset rv
        [[ $# -lt 1 ]] && return 1
        read -p "$(gettext "$1"): " rv
        echo "$rv"
    }
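The display and logging functions above always append the extra arguments after the translated text. One possible refinement, offered as a sketch rather than as part of the library used in this column, is to store a printf format in the catalog so that the translator can place the values inside the sentence. The function name i18n_printf and the sample message are hypothetical, and the fallback gettext function exists only so that the sketch runs on a machine without the gettext utility. Note that bash's printf consumes its arguments in order, so this does not let a translation reorder the values.

```bash
#!/bin/bash
# Fallback (assumption for this sketch): without the gettext utility,
# return the key unchanged, which is also what gettext prints when it
# finds no translation.
if ! command -v gettext >/dev/null 2>&1; then
    gettext() { printf '%s' "$1"; }
fi

# i18n_printf KEY [VALUE...]
# Look KEY up in the catalog and use the result as a printf format.
# The msgstr is trusted as a format string, so catalogs must come
# from a trusted source.
i18n_printf() {
    [[ $# -lt 1 ]] && return 1
    local format
    format=$(gettext "$1")
    shift
    printf -- "$format\n" "$@"
}

# With no catalog entry the key itself serves as the format:
i18n_printf "Random number between %s and %s: %s" 50 125 95
```

A catalog entry for this key would carry the same three %s placeholders in its msgstr, surrounded by translated text.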
So how can we transform the original sample script to leverage this library - that is, internationalize it? See the re-implemented script below. There are 4 noticeable changes:
- The TEXTDOMAIN environment variable is set to the base application value
- Our I18N library file is sourced in.
- The user is given the opportunity to select Italian as the preferred language.
- All "echo" statements that directed natural-language content were replaced by calls to functions offered by the I18N library.
File: i18n-rand.sh
    #!/bin/bash
    ##
    # POC around i18n/Localization in a bash script

    #(1)
    export TEXTDOMAIN=rand.sh
    I18NLIB=i18n-lib.sh

    #(2)
    # source in I18N library - shown above
    if [[ -f $I18NLIB ]]
    then
        . $I18NLIB
    else
        echo "ERROR - $I18NLIB NOT FOUND"
        exit 1
    fi

    ## Start of example script
    function random {
        typeset low=$1 high=$2
        echo $(( ($RANDOM % ($high - $low) ) + $low ))
    }

    #(3)
    ## ALLOW USER TO SET LANG PREFERENCE
    ## assume lang and country code follows
    if [[ "$1" = "-lang" ]]
    then
        export LC_ALL="${2}_${3}.UTF-8"
    fi

    #(4)
    # Display initial greeting
    i18n_display "Greeting"

    # ask for input
    low=$(i18n_prompt "Low Number Prompt" )
    high=$(i18n_prompt "High Number Prompt" )

    # check for error condition and display error if found
    if [[ $low -ge $high ]]
    then
        i18n_error "Input Error"
        exit 1
    fi

    rand=$(random $low $high )

    # Log what was just done
    i18n_fileout "/tmp/POC" "Activity Log" "$low / $high $rand (${LOGNAME} / $(date))"

    # Display Results
    i18n_display "Result Title" $rand
    exit 0
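The -lang handling in the script trusts its two arguments as given. A stricter variant might look like the sketch below. The helper name build_locale is hypothetical, and the shape it enforces (two lowercase letters for the language, two uppercase letters for the country) is an assumption that matches the it/IT example rather than the full range of valid locale names.

```bash
#!/bin/bash
# build_locale LANGUAGE COUNTRY
# Print "<language>_<COUNTRY>.UTF-8", or fail if either code has an
# unexpected shape.
build_locale() {
    local lang="$1" country="$2"
    [[ $lang =~ ^[[:lower:]]{2}$ ]] || return 1
    [[ $country =~ ^[[:upper:]]{2}$ ]] || return 1
    printf '%s_%s.UTF-8\n' "$lang" "$country"
}

# Possible use in the script's option handling:
#   if [[ "$1" = "-lang" ]]; then
#       LC_ALL=$(build_locale "$2" "$3") || {
#           echo "usage: $0 [-lang ll CC]" >&2
#           exit 1
#       }
#       export LC_ALL
#   fi
```

Rejecting malformed codes early keeps a typo from silently leaving the script in its default language.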
Now we can prove that it all works. Two test runs appear below - one using the English content and the other the Italian content.
    $: i18n-rand.sh
    Hello, I can generate a random number between 2 numbers that you provide
    What is your low number? 100
    What is your high number? 1000
    Your Random Number Is: 615
    ## now specify Italian as language preference
    $: i18n-rand.sh -lang it IT
    Ciao, posso generare un numero casuale fra il numero 2 che assicurate
    Che cosa il vostro numero basso? 500
    Che cosa il vostro alto numero? 1000
    Il vostro numero casuale : 601
    $:

The content of the log file is as expected. Notice that this script was not the only processing affected by changing the Locale. The output of the 'date' command shows the Italian abbreviation of Sunday (dom) and June (giu). Yes, Linux and all of its utilities are to be considered internationalized.

    from/to generated (by/at): 50 / 125 95 (lji / Sun Jun 10 12:57:38 EDT 2010)
    from/to generated (by/at): 100 / 1000 615 (lji / Sun Jun 10 12:57:59 EDT 2010)
    da/al generato a (da/a): 500 / 1000 601 (lji / dom giu 10 12:58:48 EDT 2010)

Summary/Conclusions
Just as information exchange standards such as XML allow systems to be more interoperable, at its core, I18N allows applications to be more usable - by a broader, more global user base. I'm not suggesting that every trivial shell script necessarily warrants I18N, but because all commercial software is potentially a global commodity, language independence is something that needs to be considered - and considered early in the design/development process. The lack of such planning would be quite shortsighted in 2010. As with all core application services, I18N is much less expensive (overwhelmingly so) to address at the outset of a project rather than to shoehorn in a solution deep into a product lifecycle.
Every modern development language supports I18N / Localization in a unique way. But whether your application is a major web site or a 2-page shell script, the same general concepts always apply. Optimally, architects and designers set the tone by providing a convenient way for developers to leverage the existing I18N and Localization tools/APIs. Lead developers can and should implement a thin convenience wrapper around the low level details of obtaining content from a Message Catalog. Offering functionality at this level goes a long way to encourage developers to apply a common solution across all applications and prevent code bloat.
It may be sparsely documented, but there is real support in Linux and its bash shell for creating and using Message Catalogs. As a relatively small part of large-scale applications, shell scripts that present a textual interface, or control progress and error logging, are often forgotten in a sea of browser accessible content. It's just easy to forget the shell scripts. My hope is that the minor investment in time and effort put into assembling this material can be leveraged on development efforts that include shell scripts.
Miscellaneous Notes

- The code samples used here were built and tested on SUSE Linux 10.
- The Google translator (http://www.google.com/translate_t) was used to translate the base English Message Catalog into Italian, so the results may not be the most appropriate, in-context translations. More often than not, language translation for Localization is performed by a human translator who is familiar with the application and its customer base.