ViaVoice and XVoice: Providing Voice Recognition

by Rob Spearman

Conversing with a computer has long been a staple of science fiction. Such conversations are still largely in the realm of fiction, but voice recognition technology has improved significantly over the last decade. A number of voice recognition and control products are available on various platforms. Many people don't realize, however, that it is possible to control the Linux desktop by voice, and it has been possible for some time.

Voice control can provide computer access for those with overuse syndromes or other arm injuries--users who in the past had to switch platforms to find voice support. Aside from the geek factor, ordinary users can benefit from reduced arm stress and improved ease-of-use and speed for some tasks. Although the software discussed in this article has an uncertain future and does not provide a completely hands-free environment, it does work. All that is required is a modest investment of time and money.

Voice control on Linux is possible by using two software packages. IBM ViaVoice for Linux supplies the basic voice recognition engine. XVoice, available under the GPL, uses the ViaVoice libraries to provide control of the desktop and applications.

IBM offers ViaVoice for Linux (for US English) in the United States and Canada. It is available for around $40, plus shipping, and includes a headset. It also can be downloaded from the IBM web site for a small discount. A slightly newer version of ViaVoice also is available as part of the Mandrake 8.0 PowerPack and ProSuite editions. The Mandrake ViaVoice apparently offers language support for both British and American English, French and German. Mandrake versions later than 8., however, no longer include ViaVoice. This article focuses solely on installing and using the version available from IBM.

Installing ViaVoice

ViaVoice for Linux requires a 233MHz Pentium MMX or better, with at least 128MB of RAM and a 16-bit sound card. It was designed to install on Red Hat 6.2, but I am using it successfully on Red Hat 7.3. Others also have had success installing it on non-Red Hat systems. Be prepared to experience some installation problems, though.

The first step is to install a Java Runtime Environment. ViaVoice 1.0.1.1 was tested with JRE-1.2.2 revision RC4 from blackdown.org. Using this exact revision will avoid incompatibilities with a different JRE.

After the JRE is installed, mount the CD and run vvsetup in the CD root directory as root. Once installed, run vvstartuserguru as yourself to set up as a ViaVoice user, configure the audio levels and begin training ViaVoice for your voice. I could not get myself set up as a user until I deleted the viavoice directory in my home directory (created during installation) and then reran the user guru. This fixed the problem, but it's rather disappointing that the installation script is so fragile. Judging by the accounts of other people trying to install ViaVoice, I had an easy installation.

Training ViaVoice

A base installation of ViaVoice, like other voice recognition software, does not provide great accuracy at first. Each user must train ViaVoice to better recognize his or her own idiosyncratic voice.

One training method is to read back text that ViaVoice displays in the user guru. This process is fairly easy to do, but it may not reflect the type of words and phrases that you tend to use a lot, making it less effective.

A better alternative is to use the ViaVoice Dictation Java application when working on actual documents. As you dictate, some words or phrases are recognized incorrectly. When this occurs, you use the correction facilities within Dictation to correct the errors. ViaVoice then tunes its voice models to better fit your voice. This method is more labor-intensive, but usually these corrections can be done with voice commands. A word of warning: save your work often, as Dictation is prone to crash.

An industry consultant told me that with 10 to 60 hours of training, current voice recognition technology should reach 98% accuracy. I have lost track of how much time I've spent on training, but my accuracy is only about 92-95% on arbitrary text. This may be because ViaVoice for Linux is much older than the Mac and Windows versions, or it could be for any number of other reasons. Fortunately, spoken commands are much more accurately recognized because there are fewer valid possibilities to match.
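
To put those percentages in perspective, here is a rough back-of-the-envelope calculation in Python. The accuracy figures are the ones quoted above; the 1,000-word document length is an assumption chosen only for illustration.

```python
def expected_errors(word_count, accuracy):
    """Return the expected number of misrecognized words, rounded."""
    return round(word_count * (1.0 - accuracy))

WORDS = 1000  # hypothetical document length, for illustration only

for accuracy in (0.92, 0.95, 0.98):
    errors = expected_errors(WORDS, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} errors per {WORDS} words")
```

At 92% accuracy that works out to roughly 80 corrections per 1,000 words, against about 20 at 98%, which is why a few percentage points matter so much for dictation.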

Even with only a couple of hours of training, you should notice improved accuracy. One thing I found is I needed to be more careful with my pronunciation. Bad microphones or background noise also can cause accuracy problems.

Installing XVoice

Once you have ViaVoice installed and at least partially trained, you are ready to install XVoice to allow voice control of your desktop and applications. On its own, ViaVoice for Linux does not give you these capabilities.

XVoice can be downloaded from xvoice.sourceforge.net. Be sure to download and install the RPM, as the source requires a discontinued ViaVoice for Linux SDK (more on this later).

Once installed, simply type xvoice -m in a terminal window (make sure that Dictation is not running, as they cannot run at the same time). As a simple test, say "next window", which should change focus to another window on your desktop.

XVoice Overview

XVoice allows a user to associate a set of actions with a predefined spoken command. A set of commands is called a grammar. Grammars can be associated with specific applications, windows or modes within an application. They also can be general and accessible from any context. Actions invoked can include generated keystrokes, mouse events, calls to external programs or any combinations of these.

XVoice uses the ViaVoice libraries to recognize commands or regular text. Commands are defined in an xvoice.xml configuration file. XVoice uses a standard configuration file, /usr/share/xvoice/xvoice.xml, until you create your own in ~/.xvoice/xvoice.xml.
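
The lookup order just described can be sketched in a few lines of Python. This is only an illustration of the rule (personal file first, system default otherwise), not XVoice's actual code; the function name pick_config is made up for the example.

```python
import os

SYSTEM_CONFIG = "/usr/share/xvoice/xvoice.xml"
USER_CONFIG = os.path.join("~", ".xvoice", "xvoice.xml")

def pick_config(user_path=USER_CONFIG, system_path=SYSTEM_CONFIG):
    """Return the personal config if it exists, else the system default."""
    user_path = os.path.expanduser(user_path)
    return user_path if os.path.isfile(user_path) else system_path

# Shows which file would be used on this machine.
print(pick_config())
```

A practical consequence is that copying the standard file to ~/.xvoice/xvoice.xml gives you a working starting point to edit.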

The XVoice window displays which command grammars are active and includes a pane showing the most recently dictated words. If XVoice thinks that something you said was close to a command but isn't sure, then the text shows up gray in this pane to alert you, and the command actions will not be executed.

XVoice can be in one of four states for any given application window. In command mode, XVoice listens only for commands. In dictate mode, XVoice doesn't listen for application-specific commands (although it does listen for more general commands) and simply types whatever it thinks you have spoken. In idle mode, only general commands are listened for. Finally, in command and dictate mode, both are on simultaneously, so both dictation and commands are listened for. Commands are distinguished from plain text by pausing slightly before and after speaking a command.
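
The four states can be summarized as a small table of what is listened for in each. The Python sketch below is my own summary of the description above, not XVoice internals; it assumes that general commands stay active in command mode, and the names Mode and listens_for are invented for the example.

```python
from enum import Enum

class Mode(Enum):
    IDLE = "idle"
    COMMAND = "command"
    DICTATE = "dictate"
    COMMAND_AND_DICTATE = "command and dictate"

# What each state listens for. Assumption: general commands
# are available in every state, including command mode.
LISTENS_FOR = {
    Mode.IDLE: {"general"},
    Mode.COMMAND: {"general", "application"},
    Mode.DICTATE: {"general", "dictation"},
    Mode.COMMAND_AND_DICTATE: {"general", "application", "dictation"},
}

def listens_for(mode, kind):
    """Return True if the given kind of speech is handled in this mode."""
    return kind in LISTENS_FOR[mode]

for mode in Mode:
    print(f"{mode.value}: {', '.join(sorted(LISTENS_FOR[mode]))}")
```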

When you first focus an application, XVoice automatically starts in command mode. To turn on dictate mode as well, you simply say "dictate mode". To stop dictating, say "stop dictation".

For optimal utility, make the XVoice window sticky in your window manager so you are always able to see how it has interpreted your speech. To have XVoice start automatically and listen for input, put xvoice -m in your window manager's startup programs.

Controlling Your Applications

Let's look at the sample application grammar definition in Listing 1 to understand how to define a grammar for an application. First we define the application name for human readability, and then we define an expression to match the window title for this application (line 1). This is how XVoice determines which grammar to activate. In line 1 we're looking at a special built-in application name, so this isn't a real window title. The commands in this special grammar are accessible from any context.

Listing 1. Sample Application Grammar Definition

An application tag also can have a dictation attribute. If true, this places XVoice into dictation mode when first activating this grammar. On line 2, we include some definitions that have been defined earlier in a <define name='numbers'> section. Define sections let you define your own tags for use throughout your configuration file.

Line 3 is an example of what might be included in a define section, although here the direction tag can be used only in the scope of this grammar. This line associates spoken directions with their respective arrow keys. When evaluated in a command, the spoken direction is substituted with its corresponding key. XVoice allows any character name from /usr/X11R6/include/X11/keysymdef.h to be escaped in the &name; style, such as &F3;. Note the closing period at the end.

The mapping of spoken commands to actions begins at line 4. Saying "last window" produces a simulated Alt-Tab keystroke. This is because \t is the escape sequence for the Tab key, and the Alt key is simulated because the alt attribute is true. Control and Shift are other possible attributes.

The char attribute actually can include a string, as seen in line 6. Commands like this really can save you time filling out forms.

Line 7 uses a more complex command expression. When evaluated, {1} on the right side of the arrow ("->") is replaced with the content of the first braces in the spoken command on the left, {2} with the second and so on. So saying "move to view port 3" results in the keystroke alt-F3 (alt + &F3;), which in my window manager configuration switches me to the third desktop view port.
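
The placeholder substitution is easy to mimic outside XVoice. The Python sketch below only illustrates the idea; the regular expression stands in for the spoken-command side of the arrow and is not XVoice grammar syntax, and the function name expand is hypothetical.

```python
import re

def expand(action_template, spoken_groups):
    """Replace {1}, {2}, ... with the matching spoken groups (1-based)."""
    def lookup(match):
        return spoken_groups[int(match.group(1)) - 1]
    return re.sub(r"\{(\d+)\}", lookup, action_template)

# Hypothetical pattern standing in for the left side of the arrow.
spoken = re.fullmatch(r"move to view port (\d+)", "move to view port 3")
print(expand("alt + &F{1};", list(spoken.groups())))  # prints: alt + &F3;
```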

Before listening for commands, custom defined tags are substituted with their definitions. Line 8 works exactly as if the definition of <direction> on line 3 appeared in place of the tag itself. The same is true of the raw-number tag, which has been defined as a positive whole number in the numbers definition mentioned above.

Line 8 also introduces the repeat tag. It repeats the enclosed events a defined number of times. Here it is repeating an arrow key press (defined on line 3). The number of times specified is the number spoken after the direction. In other words, saying "go up 10" results in 10 arrow up key presses.
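
As a rough model of what repeat does, the following Python sketch turns a phrase like "go up 10" into a list of key presses. The key names are the X keysym names for the arrow keys; the parsing and the function name go_command are assumptions made for the example, not how XVoice is implemented.

```python
# Spoken direction -> X keysym name for the matching arrow key.
DIRECTION_KEYS = {"up": "Up", "down": "Down", "left": "Left", "right": "Right"}

def go_command(phrase):
    """Expand 'go <direction> <count>' into a list of key presses."""
    verb, direction, count = phrase.split()
    if verb != "go":
        raise ValueError(f"not a go command: {phrase!r}")
    return [DIRECTION_KEYS[direction]] * int(count)

presses = go_command("go up 10")
print(len(presses), presses[0])  # prints: 10 Up
```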

The mouse event tag can be seen on lines 9-15. This event tag allows you to reposition the mouse pointer and simulate mouse clicks. The x and y attributes take pixel values. The mouse origin attribute can be root (absolute), window (from the top left corner of the application window), relative (to the current pointer position) or even widget (an experimental option for difficult-to-automate applications). The XVoice application allows mouse events to be easily recorded for pasting into your configuration file.

Lines 11 and 12 allow horizontal voice movement of the mouse pointer. Line 13 does the same for vertical movement but in a single line. Notice how the sign of the pixel movement amount is being determined: {2} will be either a + or - depending on the direction spoken.
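
The sign logic can be pictured with a small Python sketch. It assumes the usual X screen coordinates, where x grows to the right and y grows downward; the function name relative_move is invented for the example.

```python
# Sign of the pixel offset for each spoken direction
# (x grows rightward, y grows downward on an X display).
SIGN = {"left": -1, "right": +1, "up": -1, "down": +1}

def relative_move(direction, pixels):
    """Return the (dx, dy) offset for a relative pointer move."""
    offset = SIGN[direction] * pixels
    if direction in ("left", "right"):
        return (offset, 0)
    return (0, offset)

print(relative_move("left", 50))  # prints: (-50, 0)
print(relative_move("down", 20))  # prints: (0, 20)
```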

XVoice also can execute other programs, as on lines 16-22. What could be easier than simply saying "x term" to get a new terminal window? I added Mozilla to the ViaVoice dictionary using Dictation to allow it to be recognized.

Look at the expr attribute on line 18. This is a window title matching expression. If I say "pine" and a window titled Pine is already open, focus is switched to the existing window rather than starting a new instance. The only problem is that your window manager (Sawfish, for example) may not switch you to the correct view port or workspace to actually use the newly focused window.
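
The focus-or-launch decision amounts to the following, sketched here in Python. Treating the expression as a regular expression is an assumption on my part, and focus_or_launch is a hypothetical helper that only reports the decision rather than touching any windows.

```python
import re

def focus_or_launch(expr, open_titles, command):
    """Decide whether to focus an existing window or start the program."""
    for title in open_titles:
        if re.search(expr, title):  # assumption: expr is a regular expression
            return ("focus", title)
    return ("launch", command)

titles = ["xterm", "Pine", "Mozilla"]
print(focus_or_launch("Pine", titles, "pine"))      # ('focus', 'Pine')
print(focus_or_launch("Pine", ["xterm"], "pine"))   # ('launch', 'pine')
```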

The calls to xmms on lines 19-22 illustrate a benefit of server-based applications. These lines allow me to control music playback from any context--I don't need to find the xmms window. In fact, the screen even can be locked, which could be a security issue for you.

Line 23 finishes the application grammar definition. Be sure not to forget the period to close the <<root>> section. Simple mistakes like adding an extra character or leaving one out can lead to error messages of varying usefulness or to lengthy delays at start up. Unfortunately, XVoice does not provide good error messages. Because the heart of the configuration lives in CDATA sections, XML validators probably cannot help you catch errors. Be careful when changing your configuration file, and make frequent backups.
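
One low-effort way to keep those backups is a timestamped copy made before each edit. The Python sketch below is just one possible approach; the function name backup_config and the .bak naming scheme are my own choices, not anything XVoice provides.

```python
import os
import shutil
import time

def backup_config(path):
    """Copy the config to a timestamped .bak file and return the new path."""
    path = os.path.expanduser(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_path = f"{path}.{stamp}.bak"
    shutil.copy2(path, backup_path)  # copy2 also preserves the modification time
    return backup_path

# Example (run before editing your personal configuration):
#   backup_config("~/.xvoice/xvoice.xml")
```

If a change makes XVoice fail at start up, copying the most recent .bak file back over xvoice.xml restores the last working configuration.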

By editing your personal configuration file, you now should be able to automate almost any task that previously required the use of a keyboard or a mouse. Grammars for many common applications are already included in the default configuration file, and they provide good study examples. If you do a lot of repetitive tasks, this can really save your muscles and your time.

Issues and the Future

Some applications, mainly games such as TuxRacer, bypass X for key presses, leaving XVoice unable to control them. Mouse-heavy applications, such as The GIMP or Netscape, can be automated, but it's extremely tedious to try to control the mouse by voice. Fortunately, Mozilla 1.2a has "type ahead find", which, in conjunction with XVoice, lets you speak text-within-a-text-link to navigate web pages by voice.

Voice recognition in general works great for commands and fine for casual text. However, even small error rates can be quite annoying for some uses. Be advised that it can be exasperatingly difficult to program by voice. Another issue to be aware of is the possibility of straining your voice, much as it is possible to overuse your arms.

While XVoice and ViaVoice put a lot of power at your control, it is not quite possible to control the Linux desktop entirely by voice. This is disappointing to anyone needing hands-free accessibility. Sadly, the weak link is IBM. At least with the version shipped by IBM, Dictation requires keystrokes for unavoidable dialog windows, for example, and Dictation is the only way to train ViaVoice. Of course, if you don't need any additional training, can automate all your applications and aren't concerned about security, you're in good shape.

IBM has released new versions of ViaVoice for Mac and Windows but not for Linux. Despite all the money they're spending on other areas of Linux, they don't even actively market ViaVoice, and their future support is unclear. In March they pulled without comment the ViaVoice Linux SDK, which XVoice needs to compile. With this cloud over the future, the XVoice developers are currently trying to find a viable open-source alternative instead of adding new features. A group of developers and users is out there wanting to make ViaVoice for Linux a success, but without even minor support from IBM the opportunity will be missed.

Rob Spearman is a Seattle software architect recovering from an overuse syndrome. This article was written using voice recognition on Linux.

email: rob@smeg.com
