Text Processing in Rust
Create handy command-line utilities in Rust.
This article is about text processing in Rust, but it also contains a quick introduction to pattern matching, which can be very handy when working with text.
Strings are a big subject in Rust, as is evident from the fact that Rust has two data types for representing strings as well as macros for formatting them. All of this also shows how capable Rust is at string and text processing.
Apart from covering some theoretical topics, this article shows how to develop some handy yet easy-to-implement command-line utilities that let you work with plain-text files. If you have the time, it'd be great to experiment with the Rust code presented here, and maybe develop your own utilities.
Rust and Text
Rust supports two data types for working with strings: String and str. The String type is for working with mutable strings that belong to you, and it has both a length and a capacity. On the other hand, the str type is for working with immutable strings that you want to pass around. You most likely will see a str used as &str. Put simply, a str is accessed as a reference to some UTF-8 data. A str is usually called a "string slice" or, even simpler, a "slice". Due to its nature, you can't add data to or remove data from an existing str. Moreover, if you try to call the capacity() function on an &str variable, you'll get an error message similar to the following:
error[E0599]: no method named `capacity` found for type `&str` in the current scope
Generally speaking, you'll want to use a str when you pass a string as a function parameter or when you want a read-only view of a string, and a String when you want a mutable string that you own.
The good thing is that a function that accepts &str parameters can also be called with a reference to a String. (You'll see such an example in the basicOp.rs program presented later in this article.)
Additionally, Rust supports the char type, which represents a single Unicode character, as well as string literals, which are strings that begin and end with double quotes.
Finally, Rust supports what is called a byte string, which holds raw bytes rather than UTF-8 text. You can define a new byte string as follows:
let a_byte_string = b"Linux Journal";
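To make the differences between these types concrete, here is a minimal sketch. The helper byte_and_char_counts() and the variable names are hypothetical and exist only for illustration; the sketch shows that a str, a String, a char and a byte string are distinct types, and that the length of a string is measured in bytes rather than characters.

```rust
// Hypothetical helper: returns the number of bytes and the number of
// Unicode scalar values (chars) in a string slice.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // A string literal is a &'static str.
    let slice: &str = "Linux Journal";

    // An owned, growable String built from the slice.
    let mut owned: String = String::from(slice);
    owned.push('!');

    // A single Unicode character (U+00E9, e with acute accent).
    let letter: char = '\u{e9}';

    // A byte string literal is a reference to a fixed-size byte array.
    let a_byte_string: &[u8; 13] = b"Linux Journal";

    println!("{} {} {} {}", slice, owned, letter, a_byte_string.len());

    // U+00E9 is one char but takes two bytes in UTF-8.
    println!("{:?}", byte_and_char_counts("\u{e9}"));
}
```

Note that len() on both str and String reports bytes, which matters as soon as the text contains non-ASCII characters.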
unwrap()
You almost certainly cannot write a Rust program without using the unwrap() function, so let's take a look at that here. Rust does not have support for null, nil or Null; it uses the Option type for representing a value that may or may not exist, and the Result type for an operation that may succeed or fail. If you're sure that some Option or Result variable that you want to use has a value, you can use unwrap() and get that value from the variable.
However, if that value doesn't exist, your program will panic. Take a look at the following Rust program, which is saved as unwrap.rs:
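use std::net::IpAddr;

fn main() {
    // A valid IPv4 address: parse() returns Ok, so unwrap() succeeds
    let my_ip = "127.0.0.1";
    let parsed_ip: IpAddr = my_ip.parse().unwrap();
    println!("{}", parsed_ip);

    // An invalid address: parse() returns Err, so unwrap() panics here
    let invalid_ip = "727.0.0.1";
    let try_parsed_ip: IpAddr = invalid_ip.parse().unwrap();
    println!("{}", try_parsed_ip);
}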
Two main things are happening here. First, as my_ip is a valid IPv4 address, parse().unwrap() will be successful, and parsed_ip will have a valid value after the call to unwrap(). However, as invalid_ip is not a valid IPv4 address, the second attempt to call parse().unwrap() will fail, the program will panic and the second println!() macro will not be executed. Executing unwrap.rs verifies all of this:
$ ./unwrap
127.0.0.1
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: AddrParseError(())', libcore/result.rs:945:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
This means you should be extra careful when using unwrap() in your Rust programs. Unfortunately, going into more depth on unwrap() and how to avoid panic situations is beyond the scope of this article.
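As a brief pointer only, one common alternative to unwrap() is to match on the Result and handle both outcomes. The following is a minimal sketch, not a full treatment of error handling; the helper name describe_ip() is hypothetical.

```rust
use std::net::IpAddr;

// Hypothetical helper: describes the outcome of parsing instead of panicking.
fn describe_ip(input: &str) -> String {
    match input.parse::<IpAddr>() {
        // Parsing succeeded, so the value can be used safely.
        Ok(ip) => format!("{} is a valid IP address", ip),
        // Parsing failed; report the error rather than crash.
        Err(e) => format!("{} is not valid: {}", input, e),
    }
}

fn main() {
    println!("{}", describe_ip("127.0.0.1"));
    println!("{}", describe_ip("727.0.0.1"));
}
```

With this approach, the invalid address produces a message and the program keeps running.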
println! and format! Macros
Rust supports macros, including println! and format!, which are related to strings.
A Rust macro lets you write code that writes other code, which is also known as metaprogramming. Although macros look a lot like Rust functions, they have a fundamental difference from Rust functions: macros can have a variable number of parameters, whereas the signature of a Rust function must declare its parameters and define the exact type of each one of those function parameters.
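To illustrate the point about a variable number of parameters, here is a small sketch of a macro that accepts any number of comma-separated expressions. The macro name sum_all and the function total_of_three() are hypothetical and are not part of the programs in this article.

```rust
// Hypothetical macro: sums any number of comma-separated expressions.
// The $(...),* pattern matches zero or more expressions, and the
// $(+ $x)* part of the expansion repeats once per matched expression.
macro_rules! sum_all {
    ($($x:expr),*) => { 0 $(+ $x)* };
}

// A function, by contrast, must declare each parameter and its type.
fn total_of_three() -> i32 {
    sum_all!(1, 2, 3)
}

fn main() {
    println!("{}", sum_all!());
    println!("{}", sum_all!(10));
    println!("{}", total_of_three());
}
```

The same macro is called with zero, one and three arguments, which a single Rust function signature cannot express.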
As you might already know, the println! macro is used for printing output to the UNIX standard output, whereas the format! macro, which works in the same way as println!, returns a new String instead of writing any text to standard output.
The Rust code of macros.rs will try to clarify things:
macro_rules! hello_world {
    () => {
        println!("Hello World!")
    };
}

fn double(a: i32) -> i32 {
    a + a
}

fn main() {
    // Using the format!() macro
    let my_name = "Mihalis";
    let salute = format!("Hello {}!", my_name);
    println!("{}", salute);

    // Using hello_world
    hello_world!();

    // Using the assert_eq! macro; the second assertion fails on purpose
    assert_eq!(double(12), 24);
    assert_eq!(double(12), 26);
}
What can you learn from macros.rs? First, macro definitions begin with macro_rules! and can contain other macros in their implementation. Note that hello_world is a very naïve macro that does nothing really useful. Second, you can see that format! can be very handy when you want to create your own strings using your own format. Third, the hello_world macro created earlier should be called as hello_world!(). And finally, this shows that the assert_eq!() macro can help you test the correctness of your code.
Compiling and running macros.rs produces the following output:
$ ./macros
Hello Mihalis!
Hello World!
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `24`,
right: `26`', macros.rs:22:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
Additionally, you can see an advantage of the assert_eq! macro here: when an assert_eq! macro fails, it also prints the line number and the filename of the assertion, which can't be done using a function.
Now let's look at how to perform basic text operations in Rust. The Rust code for this example is saved in basicOp.rs and is the following:
fn accept(s: &str) {
    println!("{}", s);
}

fn main() {
    // Define a str
    let l_j: &str = "Linux Journal";
    // Or
    let magazine: &'static str = "magazine";

    // Use format! to create a String
    let my_str = format!("Hello {} {}!", l_j, magazine);
    println!("my_str L:{} C:{}", my_str.len(), my_str.capacity());

    // String character by character
    for c in my_str.chars() {
        print!("{} ", c);
    }
    println!();

    for (i, c) in my_str.chars().enumerate() {
        print!("{}:{} ", c, i);
    }
    println!();

    // Convert string to number
    let n: &str = "10";
    match n.parse::<i32>() {
        Ok(n) => println!("{} is a number!", n),
        Err(e) => println!("{} is NOT a number: {}", n, e),
    }

    let n1: &str = "10.2";
    match n1.parse::<i32>() {
        Ok(n1) => println!("{} is a number!", n1),
        Err(e) => println!("{}: {}", n1, e),
    }

    // accept() works with both str and String
    let my_str = "This is str!";
    let mut my_string = String::from("This is string!");
    accept(my_str);
    accept(&my_string);

    // my_string has capacity
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());
    my_string.push_str("OK?");
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());

    // Convert String to str
    let s_str: &str = &my_string[..];
    // Convert str to String
    let s_string: String = s_str.to_owned();
    println!("s_string: L:{} C:{}", s_string.len(), s_string.capacity());
}
So, first you can see two ways of defining str variables and how to create a String variable using the format! macro. Then, you can see two techniques for iterating over a string character by character. The second technique also returns the position of each character in the string that you process. After that, this example shows how to convert a string into an integer, if it's possible, with the help of parse::<i32>(). Next, you can see that the accept() function accepts both an &str and a reference to a String even though its definition mentions an &str parameter. Following that, this shows the capacity and the length properties of a String variable, which are two different things. The length of a String is the number of bytes it currently holds, whereas the capacity of a String is the room that is currently allocated for that String. Finally, you can see how to convert a String to str and vice versa. Other ways for getting a String from an &str variable include the use of .to_string(), String::from(), String::push_str(), format!() and .into().
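The conversions listed above can be sketched side by side as follows. The helper all_conversions() is hypothetical and only collects the results so they can be compared; each entry is built with one of the techniques just mentioned.

```rust
// Hypothetical helper: builds the same String from a &str in several ways.
fn all_conversions(slice: &str) -> Vec<String> {
    // push_str() appends to an existing (here empty) String.
    let mut pushed = String::new();
    pushed.push_str(slice);

    // into() needs a type annotation to know the target type.
    let converted: String = slice.into();

    vec![
        slice.to_string(),
        String::from(slice),
        slice.to_owned(),
        converted,
        format!("{}", slice),
        pushed,
    ]
}

fn main() {
    for s in all_conversions("Linux Journal") {
        println!("{} L:{}", s, s.len());
    }
}
```

All of these produce an owned String with the same contents, so the choice between them is mostly a matter of style and context.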
Executing basicOp.rs generates the following output:
$ ./basicOp
my_str L:29 C:32
H e l l o   L i n u x   J o u r n a l   m a g a z i n e !
H:0 e:1 l:2 l:3 o:4  :5 L:6 i:7 n:8 u:9 x:10  :11 J:12 o:13 u:14 r:15 n:16 a:17 l:18  :19 m:20 a:21 g:22 a:23 z:24 i:25 n:26 e:27 !:28
10 is a number!
10.2: invalid digit found in string
This is str!
This is string!
my_string L:15 C:15
my_string L:18 C:30
s_string: L:18 C:18
Finding Palindrome Strings
Now, let's look at a small utility that checks whether a string is a palindrome. The string is given as a command-line argument to the program. The logic of palindrome.rs is found in the implementation of the check_palindrome() function, which is implemented as follows:
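pub fn check_palindrome(input: &str) -> bool {
    // An empty string counts as a palindrome
    if input.is_empty() {
        return true;
    }

    // Indexes of the last and the first byte
    let mut last = input.len() - 1;
    let mut first = 0;

    // Copy the bytes of the string so they can be indexed
    let my_vec = input.as_bytes().to_owned();

    // Compare from both ends, moving toward the middle
    while first < last {
        if my_vec[first] != my_vec[last] {
            return false;
        }
        first += 1;
        last -= 1;
    }
    true
}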
The key point here is that you convert the string to a vector of bytes using a call to as_bytes().to_owned() in order to be able to access it as an array. After that, you keep processing the input string from both its left and its right side, one byte from each side (which is one character for ASCII input), for as long as both bytes are the same or until you pass the middle of the string. If you reach the middle without finding a mismatch, you are dealing with a palindrome, so the function returns "true"; otherwise, the function returns "false".
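Because check_palindrome() compares bytes, it can reject non-ASCII palindromes whose characters take more than one byte in UTF-8. The following sketch is a hypothetical variant, named check_palindrome_chars() here, that compares chars instead. It is an alternative under that assumption, not the implementation used in palindrome.rs.

```rust
// Hypothetical variant of check_palindrome() that compares Unicode
// scalar values (chars) instead of raw bytes.
pub fn check_palindrome_chars(input: &str) -> bool {
    // chars() can be walked from both ends, so the forward sequence
    // is compared with the reversed one.
    input.chars().eq(input.chars().rev())
}

fn main() {
    for word in ["abccba", "acba", "\u{e5}b\u{e5}"] {
        if check_palindrome_chars(word) {
            println!("{} is a palindrome!", word);
        } else {
            println!("{} is not a palindrome!", word);
        }
    }
}
```

This still treats combining marks as separate characters, so text that relies on them would need grapheme-aware handling from an external crate.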
Executing palindrome.rs with various types of input generates the following kind of output:
$ ./palindrome 1
1 is a palindrome!
$ ./palindrome
Usage: ./palindrome string
$ ./palindrome abccba
abccba is a palindrome!
$ ./palindrome abcba
abcba is a palindrome!
$ ./palindrome acba
acba is not a palindrome!
Pattern Matching
Pattern matching can be very handy, but you should use it with caution, because it can create nasty bugs in your software. Pattern matching in Rust happens with the help of the match keyword. A match statement must catch all the possible values of the used variable, so having a default branch at the end of the block is a very common practice. The default branch is defined with the help of the underscore character, which is a synonym for "catch all". In some rare situations, such as when you examine a condition that can be either true or false, a default branch is not needed. A pattern-matching block can look like the following:
let salute = match a_name {
    "John" => "Hello John!",
    "Jim" => "Hello Boss!",
    "Jill" => "Hello Jill!",
    _ => "Hello stranger!",
};
What does that block do? It matches one of the three distinct cases, if there is a match; otherwise, it falls through to the catch-all case, which comes last. If you want to perform more complex tasks that require the use of regular expressions, the regex crate might be more appropriate.
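For experimenting, the match block can be wrapped in a complete program. This is a minimal sketch: the function names salute() and describe() are hypothetical, and describe() illustrates the case mentioned earlier where a boolean needs no default branch because true and false cover every possible value.

```rust
// Wraps the match block from the text in a function.
fn salute(a_name: &str) -> &'static str {
    match a_name {
        "John" => "Hello John!",
        "Jim" => "Hello Boss!",
        "Jill" => "Hello Jill!",
        // The underscore catches every other value.
        _ => "Hello stranger!",
    }
}

// A bool has only two values, so no default branch is required.
fn describe(flag: bool) -> &'static str {
    match flag {
        true => "yes",
        false => "no",
    }
}

fn main() {
    println!("{}", salute("Jim"));
    println!("{}", salute("Mihalis"));
    println!("{}", describe(1 + 1 == 2));
}
```

Removing the underscore branch from salute() makes the compiler reject the program, because a &str can hold values other than the three listed.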
A Version of wc in Rust
Now let's look at the implementation of a simplified version of the wc(1) command-line utility. The Rust version of the utility will be saved as wc.rs, will not support any command-line flags, will consider every command-line argument as a file, and can process multiple text files. The Rust version of wc.rs is the following:
use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    // Totals across all files
    let mut lines = 0;
    let mut words = 0;
    let mut chars = 0;

    let args: Vec<_> = env::args().collect();
    if args.len() == 1 {
        println!("Usage: {} text_file(s)", args[0]);
        return;
    }

    let n_args = args.len();
    for x in 1..n_args {
        // Per-file counters, reset for every file
        let mut total_lines = 0;
        let mut total_words = 0;
        let mut total_chars = 0;

        let input_path = &args[x];
        let file = BufReader::new(File::open(input_path).unwrap());
        for line in file.lines() {
            let my_line = line.unwrap();
            total_lines += 1;
            total_words += my_line.split_whitespace().count();
            // len() counts bytes; add 1 for the stripped newline
            total_chars += my_line.len() + 1;
        }

        println!("\t{}\t{}\t{}\t{}", total_lines, total_words, total_chars, input_path);

        lines += total_lines;
        words += total_words;
        chars += total_chars;
    }

    if n_args - 1 != 1 {
        println!("\t{}\t{}\t{}\ttotal", lines, words, chars);
    }
}
First, you should know that wc.rs is using buffered input for processing its text files. Apart from that, the logic of the program is found in the inner for loop that reads each input file line by line. For each line it reads, it counts the characters and words. Counting the characters of a line is as simple as calling the len() function, which returns the length of the line in bytes, and adding one for the newline character that lines() strips. This matches the byte count that wc(1) prints by default. Counting the words of a line requires splitting the line using split_whitespace() and counting the number of elements in the generated iterator.
The other thing you should think about is resetting the total_lines, total_words and total_chars counters after processing a file. The lines, words and chars variables hold the total number of lines, words and characters read from all processed text files.
Executing wc.rs generates the following kind of output:
$ rustc wc.rs
$ ./wc
Usage: ./wc text_file(s)
$ ./wc wc.rs
40 124 1114 wc.rs
$ ./wc wc.rs palindrome.rs
40 124 1114 wc.rs
39 104 854 palindrome.rs
79 228 1968 total
$ wc wc.rs palindrome.rs
40 124 1114 wc.rs
39 104 854 palindrome.rs
79 228 1968 total
The last command executed wc(1) in order to verify the correctness of the output of wc.rs.
As an exercise, you might try creating a separate function for counting the lines, words and characters of a text file.
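One possible starting point for that exercise is sketched below. The function name count() and its signature are assumptions made for illustration: it accepts any buffered reader, so it can be tried on in-memory text as well as on a file wrapped in a BufReader, and it returns errors to the caller instead of calling unwrap().

```rust
use std::io::BufRead;

// Hypothetical helper: returns (lines, words, bytes) for any buffered reader.
fn count<R: BufRead>(reader: R) -> std::io::Result<(usize, usize, usize)> {
    let (mut lines, mut words, mut bytes) = (0, 0, 0);
    for line in reader.lines() {
        // Propagate read errors to the caller.
        let line = line?;
        lines += 1;
        words += line.split_whitespace().count();
        // len() counts bytes; add 1 for the newline that lines() strips.
        bytes += line.len() + 1;
    }
    Ok((lines, words, bytes))
}

fn main() -> std::io::Result<()> {
    // A byte slice implements BufRead, so no file is needed to try it out.
    let text = "one two\nthree\n";
    let (l, w, b) = count(text.as_bytes())?;
    println!("\t{}\t{}\t{}", l, w, b);
    Ok(())
}
```

Like wc.rs, this sketch assumes every line ends with a single newline byte, so files with other line endings or without a final newline may report a different byte count than wc(1).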
Matching Lines That Contain a Given String
In this section, you'll see how to show the lines of a text file that match a given string. Both the filename and the string will be given as command-line arguments to the utility, which is named match.rs. Here's the Rust code for match.rs:
use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    let mut total_lines = 0;
    let mut matched_lines = 0;

    let args: Vec<_> = env::args().collect();
    if args.len() != 3 {
        println!("{} filename string", args[0]);
        return;
    }

    let input_path = &args[1];
    let string_to_match = &args[2];

    let file = BufReader::new(File::open(input_path).unwrap());
    for line in file.lines() {
        total_lines += 1;
        let my_line = line.unwrap();
        // Print only the lines that contain the given string
        if my_line.contains(string_to_match) {
            println!("{}", my_line);
            matched_lines += 1;
        }
    }

    println!("Lines processed: {}", total_lines);
    println!("Lines matched: {}", matched_lines);
}
All the dirty work is done by the contains() function that checks whether the line that is currently being processed contains the desired string. Apart from that, the rest of the Rust code is pretty trivial.
Building and executing match.rs generates output like this:
$ ./match tabSpace.rs t2s
fn t2s(input: &str, n: i32) {
t2s(&input_path, n_space);
Lines processed: 56
Lines matched: 2
$ ./match tabSpace.rs doesNotExist
Lines processed: 56
Lines matched: 0
Converting between Tabs and Spaces
Next, let's develop a command-line utility that can convert tabs to spaces in a text file and vice versa. Each tab is replaced with four space characters and vice versa.
This utility requires at least two command-line parameters: the first one should indicate whether you want to replace tabs with spaces or the other way around. After that, you should give the path of at least one text file. The utility will process as many text files as you want, just like the wc.rs utility presented earlier in this article.
You can find tabSpace.rs's logic in the following two Rust functions:
// Replace each tab with four spaces
fn t2s(input: &str) {
    let file = BufReader::new(File::open(input).unwrap());
    for line in file.lines() {
        let my_line = line.unwrap();
        let new_line = my_line.replace("\t", "    ");
        println!("{}", new_line);
    }
}

// Replace each run of four spaces with a tab
fn s2t(input: &str) {
    let file = BufReader::new(File::open(input).unwrap());
    for line in file.lines() {
        let my_line = line.unwrap();
        let new_line = my_line.replace("    ", "\t");
        println!("{}", new_line);
    }
}
All the work is done by replace(), which replaces every occurrence of the first pattern with the second one. The return value of the replace() function is the altered version of the input string, which is what's printed on your screen.
Executing tabSpace.rs creates output like the following:
$ ./tabSpace -t basicOp.rs > spaces.rs
Processing basicOp.rs
$ mv spaces.rs basicOp.rs
$ ./tabSpace -s basicOp.rs > tabs.rs
Processing basicOp.rs
$ ./tabSpace -t tabs.rs > spaces.rs
Processing tabs.rs
$ diff spaces.rs basicOp.rs
The previous commands verify the correctness of tabSpace.rs. First, any tabs in basicOp.rs are converted into spaces and saved as spaces.rs, which afterward becomes the new basicOp.rs. Then, the spaces of basicOp.rs are converted into tabs and saved in tabs.rs. Finally, the tabs.rs file is processed, and all of its tabs are converted into spaces (spaces.rs). The last version of spaces.rs should be exactly the same as basicOp.rs, which the empty output of diff confirms.
It would be a very interesting exercise to add support for tabs of variable size in tabSpace.rs. Put simply, the number of spaces of a tab should be a variable that will be given as a command-line parameter to the utility.
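A possible starting point for that exercise is sketched below. The function names tabs_to_spaces() and spaces_to_tabs(), the width parameter and the way the width is read from the command line are all assumptions made for illustration; they are not part of tabSpace.rs. The functions work on a single line, so they could be called from the loops of t2s() and s2t().

```rust
// Hypothetical helpers; `width` is the assumed number of spaces per tab.
fn tabs_to_spaces(line: &str, width: usize) -> String {
    // repeat() builds a string of `width` space characters.
    line.replace('\t', " ".repeat(width).as_str())
}

fn spaces_to_tabs(line: &str, width: usize) -> String {
    // An empty pattern would match between every character, so skip it.
    if width == 0 {
        return line.to_string();
    }
    line.replace(" ".repeat(width).as_str(), "\t")
}

fn main() {
    // Assumption: the width arrives as the first command-line argument
    // and falls back to 4 when it is missing or not a number.
    let width: usize = std::env::args()
        .nth(1)
        .and_then(|w| w.parse().ok())
        .unwrap_or(4);

    let line = "\tlet x = 1;";
    let spaced = tabs_to_spaces(line, width);
    println!("{}", spaced);
    // Check whether converting back restores the original line.
    println!("{}", spaces_to_tabs(&spaced, width) == line);
}
```

As with the original utility, converting spaces back to tabs replaces every run of the given width, including runs that were never tabs in the first place.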
So, is Rust good at text processing and working with text in general? Yes it is! Additionally, it should be clear that text processing is closely related to file I/O and (sometimes) to pattern matching and regular expressions.
The only rational way to learn more about text processing in Rust is to experiment on your own, so don't waste any more time, and give it a whirl.