Finding Files and More

by Eric Goebelbecker

Not long after getting their first Linux system, new users usually need to locate a file somewhere on their system. So they learn the following command from a friend, or maybe a book or magazine:

$ find / -name filename -print

Now while this command does work perfectly fine, the syntax does seem awkward to people unfamiliar with the find command. Why should we have to specify print? [Note: On Linux systems, and other systems that use GNU find, we don't. But standard Unix find insists on it, so you might as well get used to it if you use Unix as well as Linux.]

For that matter, why should we have to specify name? Why not just find filename? It's this seemingly cryptic structure that makes find one of the most underused commands in the Unix toolbox.

A look at the find man page (on any system, not just Linux) completes the confusing picture. For someone not familiar with Unix, find's “operators” and “expressions” make it an awfully complicated program just for locating files.

If all you want to do is locate a file, there is a better way to do that:

locate filename

This will work on a properly set-up Linux system with GNU find. (Good Linux distributions come with updatedb properly set up. If yours isn't, you can run updatedb as root to update the database that locate uses, or simply use find as shown above.) Why have a complicated command like find when we already have a simple command like locate? Because find is good for much more than just finding files.

The Caldera/Red Hat system that I use at home has several entries in the crontab that run this command:

find /tmp/* -atime +10 -exec rm -f {} \;

This command deletes any files in /tmp that haven't been accessed in the past ten days. The fact that find only deletes files that haven't been accessed in the past ten days rather than files that were created that long ago is a subtle, but very important point. Find gives us access to the very valuable set of information stored about files and directories in Unix filesystems.

Like most Unix filesystems, the second extended filesystem (“ext2”) that is used on most Linux systems stores a more extensive set of data about files than just their name, size and last-change-date the way systems such as DOS do. It also stores an owner and group, access mode, the dates that the file was last modified and accessed, the date that the file last changed status, and the type. (Don't worry, we'll explain these as we go).

With the exception of the names, all this information is stored for each file and directory in a structure called an inode. In Unix filesystems, directories are simply files that contain a list of filenames with inode numbers.

Table 1 has a list of inode entry fields and how they are “translated” for the different filesystem types supported by Linux. While this table may not mean much to you yet, it should be self-explanatory by the time you finish reading this article.

The Command Line

Let's analyze the find command line:

find starting-point options criteria action
  • starting-point One or more directories from which to start searching. The default is the current directory.

  • options Modify the methods used for searching in several ways.

  • criteria Specify which files are chosen, and which are ignored. All files found are chosen by default.

  • action What to do with the files that are chosen. GNU find has a default action of -print, but standard Unix find has no default action, and will abort and complain unless an action is explicitly provided.

The Starting Point

The starting-point parameter has two effects on find's actions. The most obvious is that it specifies in which directory (or directories; there can be more than one starting point) to start looking for files. The other effect is on how the chosen filenames are treated, as this example shows:

$ cd /usr/X11/man
$ find man5 -print
man5
man5/XF86Config.5x
man5/pbm.5
man5/pgm.5
man5/pnm.5
man5/ppm.5
$ find /usr/X11/man/man5 -print
/usr/X11/man/man5
/usr/X11/man/man5/XF86Config.5x
/usr/X11/man/man5/pbm.5
/usr/X11/man/man5/pgm.5
/usr/X11/man/man5/pnm.5
/usr/X11/man/man5/ppm.5

When a user is simply looking for a file, this difference in behavior does not matter very much. But when you want to use the output from find to drive another program, it can be very important, depending on the program being driven.
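
To see the difference without touching /usr/X11/man, here is a minimal sketch that assumes a scratch directory made with mktemp. The directory and the two file names are stand-ins invented for illustration.

```shell
# Hypothetical scratch tree standing in for /usr/X11/man.
top=$(mktemp -d)
mkdir "$top/man5"
touch "$top/man5/pbm.5" "$top/man5/pgm.5"

cd "$top"
# Relative starting point: every name begins with "man5".
relative=$(find man5 -print | LC_ALL=C sort)
# Absolute starting point: every name begins with the full path.
absolute=$(find "$top/man5" -print | LC_ALL=C sort)

printf '%s\n' "$relative"
printf '%s\n' "$absolute"
```

A program that expects names relative to its own working directory would want the first form; one that may be run from anywhere would want the second.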

In addition to the starting point, we have control over some other aspects of find's behavior, such as how it should handle soft links, how to evaluate file timestamps and how deep to follow directory structures. These are controlled by options.

The -follow option tells find to follow soft (or symbolic) links to the actual file. A soft link is a file that “points” to another file. To demonstrate this option, create (as a normal user, not as root) a soft link with ln in your home directory that points to a file that belongs to root.

$ cd
$ ln -s /vmlinuz ./kernel

Now use ls to produce a long listing for the file.

$ ls -l kernel
lrwxrwxrwx ... kernel -> /vmlinuz

The first column of the mode, l, tells us it is a soft link. We also are told what file the link “points” to.

Now let's demonstrate the effect of find's -follow option by searching through the directory for files belonging to root, first without the option and then with it. (uid 0 is root; we'll cover the -uid option in more detail later.)

$ find . -uid 0 -print
(nothing is printed)
$ find . -follow -uid 0 -print
./kernel

You created the link to the kernel, so you own the link, called ./kernel. But the file /vmlinuz is owned by root.
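
If your system has no /vmlinuz, or you would rather not depend on who owns what, the same effect can be sketched with a test on file type instead of owner. This variation is an assumption of mine, not the example above: it uses a scratch directory and a temporary target file, both created with mktemp.

```shell
work=$(mktemp -d)
target=$(mktemp)            # a regular file outside the search tree
ln -s "$target" "$work/link"

cd "$work"
# Without -follow the link itself is examined, and it is not a regular file.
plain=$(find . -type f -print)
# With -follow the file the link points to is examined instead.
followed=$(find . -follow -type f -print)

echo "without -follow: [$plain]"
echo "with -follow:    [$followed]"
```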

The -daystart option modifies the behavior of find when it comes to evaluating time. When -daystart is specified, find will measure days from the beginning of the day instead of from 24 hours ago. (We will cover the parameters related to time later.)

Frequently a user will need to find a file that he or she knows is somewhere on a local hard disk, and not on a mounted CD-ROM or network volume. An easy way to keep find from straying to these other disks is with the -xdev option.

$ find / -name document -print

will cause find to search for the file “document” in every directory under /, which can be very slow with a CD-ROM or network filesystem mounted.

$ find / -xdev -name document -print

will instead cause find to limit its search to the device that / is mounted on. (An alias for -xdev is -mount.) Of course, if you have more than one local filesystem, you will need to name each of them as a starting point. Perhaps

$ find / /usr -xdev -name document -print

if you have two partitions, one for / and one for /usr. Alternatively, you can say

$ find / -fstype ext2 -name document -print

if all your local partitions are ext2 filesystems.

Another way to save time on searches is to use the options related to directory depth.

$ find /usr -maxdepth 4 -name document -print

will limit find's search for document to directories four levels deep or less “under” /usr.
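
A small sketch of how the depth limit is counted, assuming GNU find and an invented scratch tree. The starting point's own contents are at depth 1, and each directory level below adds one.

```shell
top=$(mktemp -d)
mkdir -p "$top/a/b/c"
touch "$top/document" "$top/a/document" "$top/a/b/c/document"

cd "$top"
# ./document is at depth 1, ./a/document at depth 2, ./a/b/c/document at depth 4.
shallow=$(find . -maxdepth 2 -name document -print | LC_ALL=C sort)
everything=$(find . -name document -print | LC_ALL=C sort)

printf '%s\n' "$shallow"
```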

Another option related to directory depth is -depth, which causes the contents of each directory to be processed before the directory itself. We'll see later why this is useful.

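The -noleaf option is used for searching filesystems that aren't Unix-like. It switches off an optimization that relies on Unix-style directory link counts, so that find does not miss anything on those filesystems. Table 1 tells which filesystems call for -noleaf.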

We already had an example of finding a file by name. Other mechanisms for matching filenames are -path, which matches the pattern against the whole path rather than just the last component; -iname, which is similar to -name but case insensitive; and -ipath, which is the case-insensitive form of -path.

Pick and Choose

Criteria allow you to select files.

Each file has access, status, and modification times, and find provides three time-based criteria, one for each of these values. They can be checked in increments of days or minutes, and files can be compared based on these times.

The modification time is set every time the file's contents are changed.

$ find . -mtime +10 -print

will print out files that have not been modified in the past ten days, similar to our second example.

In the previous example we used the plus sign to signify “more than.” In addition to this, find also supports the minus sign to indicate less than.

$ find / -mtime -5 -print

will print out files that were modified less than 5 days ago. The absence of these operators will cause find to choose exact matches. As mentioned before, the -daystart option will modify the search so that the dates are based on the most recent midnight instead of 24 hours before now.
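
To try the plus and minus forms safely, the following sketch backdates a file in a scratch directory. It assumes GNU touch, whose -d option accepts a relative date such as '15 days ago'; the file names are invented.

```shell
work=$(mktemp -d)
cd "$work"
touch fresh
# GNU touch accepts relative dates with -d; other versions of touch may not.
touch -d '15 days ago' stale

old=$(find . -type f -mtime +10 -print)     # modified more than ten days ago
recent=$(find . -type f -mtime -5 -print)   # modified less than five days ago

echo "older than ten days:  $old"
echo "newer than five days: $recent"
```

The -type f test is only there to keep the scratch directory itself out of the results.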

To use minutes instead of days, use the -mmin criterion.

$ find . -mmin +10 -print

will output files that were last modified more than ten minutes ago.

The -newer criterion

$ find . -newer document -print

will output files that have been modified more recently than document.
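
A quick sketch of -newer with three files whose modification times are set an hour apart. As before, this assumes GNU touch for the relative dates, and the names older and younger are made up.

```shell
work=$(mktemp -d)
cd "$work"
touch -d '2 hours ago' older
touch -d '1 hour ago' document
touch younger

# Only files modified after "document" are selected; document itself is not.
newer=$(find . -type f -newer document -print)
echo "$newer"
```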

The touch command sets both the access and modification times on files. If the file does not exist, it will be created. We can use it for an example.

$ touch foo

will create a file named “foo” in the current directory, if there isn't already one there. Now,

$ find -mmin 1 -print

should output foo, but

$ find -mmin 2 -print

should not.

For access time, which indicates the last time the files were opened, find has similar options. For days there is -atime, for minutes -amin and for comparisons -anewer.

Status time initially indicates creation time, and then follows any modifications to the file or its inode. It can be used with -ctime, -cmin, and -cnewer. These criteria match files based on the last time a file's ownership, access mode, or other characteristics have been changed.

Find also has a -used option. It matches files by how long after their last status change they were last accessed:

$ find -used +2

will find files that were last accessed more than two days after their status was last changed.

I've mentioned file modes a few times throughout this article. File modes express which users may perform certain operations on a file, what type of file it is and also some other information about the file. find allows us to match files based on their mode.

Before I go over these options, I will explain file modes and how they are displayed and set.

Users most commonly come in contact with file modes when they concern file ownership and access. A file belongs to an owner and a group, therefore it follows that access is controlled with respect to three entities: owner, group and world. (“World” is made up of users that are not the owner and do not belong to the affiliated group.)

Access is controlled with respect to three actions: Reading, writing (which includes deletion) and execution. Let's look at the output of a long listing with ls.

$ ls -l foo
-rw-rw-r-- 1 eric staff  0 Sep  6 22:55 foo

(I've deleted some of the spaces ls normally creates in order to fit the entire output.) The leftmost column of the output has ten characters that show us foo's mode and file type. From the left, the first is used by ls to show us the type of file. For example, if it were a link or directory we would see an l or d there.

The remaining nine characters show us the access mode. In groups of three, they show us the rights for owner, group, and world, in that order. Each triplet has a field for read r, write w and execute x.

$ chmod 777 foo
$ ls -l foo
-rwxrwxrwx 1 eric staff  0 Sep  6 22:55 foo

We have turned on all permissions for all users on the file “foo”.

The chmod command can use two different kinds of notation, symbolic and octal. While symbolic notation is easier to remember for most people, I used octal notation, because it is the format for modes that find expects. With this notation each digit represents the permissions for one user class: owner, group and world, in that order.

The permissions are calculated by adding the following:

  • 4 Read

  • 2 Write

  • 1 Execute

So if you want to give the owner of a file full permissions and group and world only read and execute permissions, you want to “set” all bits for owner, and the read and execute bits for the others:

Owner = 4 + 2 + 1 = 7
Group = 4 + 1     = 5
World = 4 + 1     = 5

So the command would be:

$ chmod 755 program
$ ls -l program
-rwxr-xr-x 1 eric  staff 106410 Sep  6 22:55 program

The listing shows the mode we expected.
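
The same arithmetic can be sketched in the shell. The variable names are mine, the file is a temporary stand-in for “program”, and reading the mode back with -printf '%m' assumes GNU find.

```shell
# Bit values for read, write and execute.
read_bit=4; write_bit=2; exec_bit=1

owner=$((read_bit + write_bit + exec_bit))   # 4 + 2 + 1
group=$((read_bit + exec_bit))               # 4 + 1
world=$((read_bit + exec_bit))               # 4 + 1
mode="$owner$group$world"

program=$(mktemp)      # hypothetical stand-in for "program"
chmod "$mode" "$program"
# %m is GNU find's -printf directive for the octal mode.
shown=$(find "$program" -printf '%m\n')
echo "requested $mode, file now has $shown"
```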

Back to find: the -perm criterion accepts this type of notation.

$ find . -perm 777 -print

would find all of the files in and under the current directory that have read, write and execute permissions set for all users.

The -perm option also supports the + and - operators.

$ find . -perm +600 -print

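would output any files that are readable or writable by their owner. (Recent releases of GNU find spell this test -perm /600; the + form shown here is older syntax that they have dropped.)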

$ find . -perm -600 -print

would output any files that are readable and writable by their owner.

Therefore the + acts as a boolean “or” and the - acts as a boolean “and”.
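
Here is a sketch of the three forms side by side, using three invented files in a scratch directory. One assumption to flag: the “or” form is written -perm /600, which is how recent GNU find releases spell what this article writes as -perm +600.

```shell
work=$(mktemp -d)
cd "$work"
touch rw_only rw_plus ro
chmod 600 rw_only    # owner read+write, nothing else
chmod 644 rw_plus    # owner read+write, plus read for group and world
chmod 400 ro         # owner read only

# Exact match: the mode must be 600 and nothing else.
exact=$(find . -type f -perm 600 -print | LC_ALL=C sort)
# "and": both owner bits must be set, other bits are ignored.
all_bits=$(find . -type f -perm -600 -print | LC_ALL=C sort)
# "or": either owner bit is enough (newer GNU spelling of +600).
any_bit=$(find . -type f -perm /600 -print | LC_ALL=C sort)

printf 'exact:\n%s\nand:\n%s\nor:\n%s\n' "$exact" "$all_bits" "$any_bit"
```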

The ability to find files based on their permissions is an important security tool. Later, I will cover some important special file modes, and how find can help protect a system from attacks that use them.

File size is another option offered by find. File sizes may be specified in 512 byte blocks, two byte words, kilobytes or just bytes. Since size is a numeric option, + and - are also supported.

$ find . -size +4096k -print

will print the names of any files larger than four megabytes.

$ find . -size -1c -print

will print the names of any files smaller than one byte. The -empty option also matches empty files.

For 512 byte blocks the number should be followed by a “b”, for 2 byte words a “w”.
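
A sketch of the size tests on three files of known length, assuming GNU find and a scratch directory. The file names are invented, and the sizes are kept small so the example runs quickly.

```shell
work=$(mktemp -d)
cd "$work"
: > empty                               # zero bytes
printf '0123456789' > ten               # ten bytes
head -c 5000 /dev/zero > fivethousand   # 5000 bytes

zero_len=$(find . -type f -size -1c -print)    # smaller than one byte
also_empty=$(find . -type f -empty -print)     # the same file, found with -empty
over_4k=$(find . -type f -size +4k -print)     # larger than four kilobytes

echo "$zero_len $also_empty $over_4k"
```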

There is one caveat when searching for files by size. Some files, such as /var/adm/lastlog, have a nominal length that is larger than the disk space they actually occupy. These files are known as “sparse” or “holey” files. Like ls, find will report these files by their nominal length, not the space they are actually using. If you have any doubt about how much space a file is using, use the du command.

$ ls -l /var/adm/lastlog

reports a size of 16032 (15k) on my system;

$ du -k /var/adm/lastlog

reports only 3k.

Our first example showed us how to find a file when we know the exact name. Find will also accept the * wildcard, but the file name must then be quoted in order to prevent the shell from expanding the file name before passing it to find.

$ find / -name "*gif" -print

will output all of the files ending in “gif” on the entire system.

In addition to simple wildcards, find also supports regular expressions with the -regex option.

$ find . -regex './[0-9].*' -print

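will locate any files in the current directory whose names begin with a digit. Note that the regular expression is applied to the entire path, not just the file name, which makes the expression rather difficult to write: the pattern above has to include the leading ./, and because .* also matches /, it will also match anything beneath a directory whose name begins with a digit. For more information about regular expressions see the man pages for grep or the article in the October issue of Linux Journal.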
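
Both kinds of matching can be tried on a few invented files in a scratch directory. This sketch assumes GNU find, since -regex is a GNU extension.

```shell
work=$(mktemp -d)
cd "$work"
touch 1abc abc pic.gif

# Quoted, so that find (not the shell) expands the wildcard.
gifs=$(find . -name '*gif' -print)
# The pattern must match the whole path, including the leading "./".
numbered=$(find . -regex './[0-9].*' -print)

echo "$gifs"
echo "$numbered"
```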

Another search criterion is file type.

$ find / -type d -print

will list all of the directories. Here is a list of the file types and the appropriate letter to use to search for them.

  • b block special files such as a disk device.

  • c character special files such as a terminal device.

  • d directory

  • p named pipe

  • f regular file

  • l symbolic (soft) link

  • s socket

If you are unfamiliar with any of these file types, don't worry. You can learn as you go.
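
One way to get acquainted with a few of these types is to create one of each in a scratch directory and ask find to sort them out. The names below are invented, and mkfifo is assumed to be available for creating the named pipe.

```shell
work=$(mktemp -d)
cd "$work"
mkdir dir
touch file
ln -s file link
mkfifo pipe

dirs=$(find . -type d -print | LC_ALL=C sort)   # includes "." itself
files=$(find . -type f -print)
links=$(find . -type l -print)
pipes=$(find . -type p -print)

printf '%s\n' "$dirs" "$files" "$links" "$pipes"
```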

Files can also be matched by user or group ID. As demonstrated earlier,

$ find . -uid 0 -print

will output all files belonging to root.

$ find . -uid 120 -print

will output all files belonging to the user with UID 120.

To make things easier,

$ find -user eric -print

will output all files belonging to eric.

Find also has similar options for groups: -gid and -group.

More than printing!

Now that you know how to locate just about any file, what can you do with them besides print their names?

$ find . -fprint foo

sends a list of the files in and under the current directory to a file “foo”. If the file does not exist it is created. If it does, its contents are replaced.

Find also offers the -printf action. This allows output to be formatted.

$ find . -printf 'Name: %f Owner: %u %s bytes\n'

produces a table of files with their name, owner, and size in bytes.

The -printf action has many predefined fields that cover all of the information available for a file. See Table 2 for an incomplete list of options. Find also has a -fprintf switch which will send the output to a file, like -fprint.

Table 2. printf Options

Escape Sequences

  • \a - Alarm bell
  • \b - Backspace
  • \f - Form feed
  • \n - Newline (not provided automatically)
  • \r - Carriage return
  • \t - Horizontal tab
  • \v - Vertical tab
  • \\ - A literal backslash
  • \c - Stop printing and flush output

Formatting Sequences

  • %b - File size in 512 byte blocks
  • %k - File size in 1k blocks
  • %s - File size in bytes
  • %a - Access time in standard format
  • %A - Formatted access time (see man page for options)
  • %c - Status time in standard format
  • %C - Formatted status time (same as %A)
  • %F - Type of filesystem
  • %p - File name
  • %f - File name with path removed
  • %P - File name with find argument removed (file instead of ./file)
  • %u - User name
  • %g - Group name

(See the man page for complete listing)
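
As a small sketch of -printf at work, assuming GNU find and an invented five-byte file, two of the fields can be combined like this:

```shell
work=$(mktemp -d)
cd "$work"
printf 'hello' > a.txt      # five bytes

# %f is the name without its directory, %s the size in bytes;
# the trailing \n is needed because -printf adds no newline itself.
report=$(find . -type f -printf 'Name: %f %s bytes\n')
echo "$report"
```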

A third option for output is -ls. This option produces a listing of files that is the equivalent of the output from ls -idls. The -fls option will send this to a file.

Of course, simply producing formatted lists of files is not the limit to find's usefulness. Find also allows us to execute commands on them with -exec and -ok. -exec executes a command for each file that matches.

Our earlier example demonstrates a common use for the -exec option: deleting old and unused files.

$ find /tmp/* -atime +10 -exec rm -f {} \;

After the -exec switch itself, we specify the command, any options (such as the -f), and then {}, which represents the matched file. The command must then be terminated with ; (the \ keeps the shell from interpreting the semicolon itself).

$ find . -type f -exec grep -l linux {} \;

would execute the command grep -l linux on all regular files in and under the current directory.
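
The grep example can be tried on a throwaway tree first. The directory layout and file contents here are invented for illustration.

```shell
work=$(mktemp -d)
cd "$work"
mkdir sub
echo 'linux is fun' > sub/yes.txt
echo 'nothing here' > no.txt

# grep runs once per regular file; -l prints only the names of files that match.
hits=$(find . -type f -exec grep -l linux {} \;)
echo "$hits"
```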

The -ok switch operates the same way, but will prompt the user for confirmation before executing the command on each file.

$ find . -ok tar rvf backup {} \;

This command will descend through the current directory and below, asking the user which files should be added to the tar archive “backup”.

This leads us into some practical uses for find.

Sometimes it's necessary to duplicate a directory or directory structure. For this purpose many users utilize the cp command with the -r option. However, this command does not always create an exact copy!

Create a directory with a file and a link in it.

$ mkdir test
$ touch test/bar
$ ln -s /vmlinuz test/foo
$ ls -l test
-rwx--x--x eric staff 0 Sep  9 bar
lrwxrwxrwx eric staff 8 Sep  9 foo -> /vmlinuz

Now copy it with cp -r

$ cp -r test test1
$ ls -l test1
-rwx--x--x eric staff      0 Sep  9 11:18 bar
-rw-rw-r-- eric staff 318436 Sep  9 11:18 foo

The cp command followed the soft link and copied the kernel into the new directory!

Let's try a different approach:

$ rm -r test1
$ cd test
$ find -depth -print | cpio -pdmv ../test1
$ ls -l ../test1
-rwx--x--x eric staff 0 Sep  9 bar
lrwxrwxrwx eric staff 8 Sep  9 foo -> /vmlinuz

This method uses cpio to copy files to the new directory. Find produces the file list by descending the directory structure. Even though our example was only one directory deep, we know that find can descend an entire directory structure. We also know that we can also control which directories it descends and which files it outputs.

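In the above command I added the -depth option. It ensures that the files in a directory are output before the directory's own name. The -d flag lets cpio create directories as they are needed, and because each directory name arrives after its contents, cpio can set the directory's permissions and modification time only after the files have been copied into it.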
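
The effect of -depth on ordering is easy to check without involving cpio at all. This sketch uses an invented one-directory tree and compares the order of names with and without the option.

```shell
work=$(mktemp -d)
cd "$work"
mkdir dir
touch dir/file

# Default order: a directory is listed before what it contains.
default_order=$(find . -print)
# With -depth: what a directory contains is listed before the directory.
depth_order=$(find . -depth -print)

printf 'default:\n%s\n-depth:\n%s\n' "$default_order" "$depth_order"
```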

The cpio command is another multipurpose tool in the Unix toolbox. It can create archives in a variety of formats and also extract from them. It also handles the output of find's -print option perfectly. Combined, these tools could form a simple backup system. (Please note: I am presenting this purely as an example. Systems that support many users or that have irreplaceable data on them should use more extensive and robust backup systems.)

$ find . -depth -print \
  | cpio -ov --format=crc > /dev/fd0

find reads the contents of the current directory, and the filenames are piped to cpio, which copies the files to the floppy in the System V R4 archive format with CRC checksums. (This format is preferred to the default since it is platform independent, supports larger hard disks, and provides at least simple error checking.)

When cpio reaches the end of each floppy it prompts us with:

Found end of tape.  To continue, type device/file
name when ready.

In order to continue, type:

/dev/fd0 RETURN

Of course, if you are lucky enough to have a tape drive or other storage system, you may not have to do this, though cpio can also span tapes if the archive does not fit on one.

This system does have at least one drawback: if the data to be stored will not fit on one unit, the backup cannot be fully automated.

The first backup of my home directory spanned ten floppies. I reviewed the contents and noticed two subdirectories that probably were not worth backing up, so I altered find's arguments:

$ find . \
  \( -path ./.netscape-cache -o -path ./lg \) \
  -prune -o -print | \
  cpio -ov --format=crc > /dev/fd0

This introduces some more find options. The \( and the \) are parentheses, escaped with \ so that the shell passes them to find untouched. Find allows parentheses to logically group expressions. This was necessary since I have two expressions in the command

\( -path ./.netscape-cache -o -path ./lg \)

Inside the parentheses we have two -path statements separated by -o. This is a find “or” statement.

\( -path ./.netscape-cache -o -path ./lg \) -prune

Find's -prune option causes find to not enter a directory. Therefore, we can translate the above to “If the path is ./.netscape-cache or ./lg do not descend into the directory.”

After this clause we see another -o statement. If the file does not meet the criteria for pruning, it is printed instead.

So, my entire home directory with the exception of my Netscape cache and lg directory is now backed up.
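
Before pointing a pruning expression at a real home directory, it can be rehearsed on a scratch tree. In this sketch the directory names cache, lg and docs are stand-ins, and the output goes to the screen rather than to cpio.

```shell
home=$(mktemp -d)      # hypothetical stand-in for a home directory
cd "$home"
mkdir cache lg docs
touch cache/junk lg/old docs/keep

# Pruned directories are neither entered nor printed; everything else is printed.
kept=$(find . \( -path ./cache -o -path ./lg \) -prune -o -print | LC_ALL=C sort)
printf '%s\n' "$kept"
```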

This is fine for an initial backup. But what about next week when I want to back up my directory, but I've only really touched a few files?

$ find . \
  \( -path ./.netscape-cache -o -path ./lg \) \
  -prune -o \( -mtime -7 \) -print | \
  cpio -ov --format=crc > /dev/fd0

This adds one more clause: “If the file is not under the netscape cache or the lg directory, check if it has been modified in the past 7 days. If it has, then print the name.” The name is then sent to cpio to archive.

Obviously these command lines can get very complicated. It's usually a good idea to test them by piping the output through more before using cpio.

In addition to -o find also has an “and” operator, -and, and a negation operator -not. When multiple match criteria are specified, -and is implied.

$ find -mtime -5 -type f -print

prints files that have been modified during the last five days and are regular files.

$ find -mtime -5 -not -type f -print

prints things that have been modified during the last five days that are not regular files: directories, soft links, etc.

But wait, disaster has struck! Your (sister, son, daughter, little brother, mom, spouse, whoever) has deleted a very important file! Time to use that backup.

$ cpio -t < /dev/fd0

produces a table of contents from the archive. As it does during backup operations, cpio prompts for the next disk while it reads the table of contents.

$ cpio -i core < /dev/fd0

The -i switch tells cpio to extract the named file. The absence of a file name causes cpio to restore the entire archive.

System maintenance tasks can also be simplified with find. Our second example demonstrated using find to clean out older files.

$ find /home \( -name core -o -name foo \) \
  -exec rm -f {} \; 2> /dev/null

This command cleans out any core dumps or files named “foo” from home directories. (Although some files named “foo” can be very important!)
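
Grouping matters whenever -o is combined with an action, because the implied “and” binds more tightly than -o. The sketch below shows the difference on three invented files, with -print standing in for rm so that nothing is deleted.

```shell
work=$(mktemp -d)
cd "$work"
touch core foo other

# Parsed as: -name core  -o  ( -name foo -and -print ), so core is never printed.
ungrouped=$(find . -name core -o -name foo -print | LC_ALL=C sort)
# Parentheses make the action apply to both names.
grouped=$(find . \( -name core -o -name foo \) -print | LC_ALL=C sort)

printf 'ungrouped:\n%s\ngrouped:\n%s\n' "$ungrouped" "$grouped"
```

Trying a command with -print in place of -exec rm is a cheap way to confirm which files would be affected.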

$ find /var/adm/messages -size +32k \
  -exec Mail -s "{}" root < /var/adm/messages \; \
  -exec cp /dev/null {} \;

This is another example from the crontab on my Caldera/Red Hat system. It uses the implicit “and” function to mail the system messages file to root and then empty it.

Find also has an important security application. Two of the file modes that I did not cover earlier are SUID and SGID. These modes provide a user with the rights of the owner or group of a program when the program is executed.

An example of this is the passwd program. This program allows users to change their password. In order to do this the /etc/passwd (or /etc/shadow) file must be modified, which is a function only root should be able to perform. Since the passwd program belongs to root and has the SUID mode set, it can modify the necessary file. When passwd completes the user's rights return to normal. The passwd program is responsible for making sure the user can't do anything wrong while acting as root.

$ ls -l /usr/bin/npasswd
-r-s--x--x 1 root /usr/bin/npasswd

(/usr/bin/passwd is linked to /usr/bin/npasswd on my system.) The s in the execute field for owner signifies SUID. A SGID program would have s in the execute field for group.

This mechanism has obvious security implications. A user (or invader) who has compromised a system could install a program (such as a shell) with this mode set and then do whatever they wish whenever they want by running that program.

In octal notation SUID is expressed as 4000 and SGID is 2000, so, using the - operator to test for that bit regardless of the other permission bits,

$ find / -perm -4000 -print

produces a list of SUID files on a system.

$ find / -type f \( -perm -2000 -o -perm -4000 \) \
  -print

produces a list of regular files that have SGID or SUID mode set.
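
A harmless way to check such a search is to set the SUID bit on an empty file of your own in a scratch directory. The file names are invented. The sketch also shows why the leading - matters: an exact match on 4000 would require that no other permission bits be set.

```shell
work=$(mktemp -d)
cd "$work"
touch plain suid
chmod 755 plain
chmod 4755 suid     # SUID bit plus rwxr-xr-x

# Exact match: only a file whose entire mode is 4000 would be listed.
exact=$(find . -type f -perm 4000 -print)
# Bit test: any regular file with the SGID or SUID bit set is listed.
flagged=$(find . -type f \( -perm -2000 -o -perm -4000 \) -print)

echo "exact:   [$exact]"
echo "flagged: [$flagged]"
```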

This list could be saved to a file (with -fprint) and compared each day with the output from the previous day.

This article does not cover every option for find. This was also only a cursory explanation of filesystems and access modes. Hopefully, I was able to provide you with enough information to make using Linux a little easier and a lot more rewarding.

Resources

Essential System Administration by Æleen Frisch, O'Reilly and Associates

Practical Unix Security by Simson Garfinkel and Gene Spafford, O'Reilly and Associates

The manual pages.

Eric Goebelbecker is a systems analyst for Reuters America, Inc. He supports clients (mostly financial institutions) who use market data retrieval and manipulation APIs in trading rooms and back office operations. In his spare time (about 15 minutes a week...), he reads about philosophy and hacks around with Linux. He can be reached via e-mail at eric@cnct.com.
