Reading File Metadata with extract and libextractor
Modern file formats have provisions to annotate the contents of the file with descriptive information. This development is driven by the need to find a better way to organize data than merely by using filenames. The problem with such metadata is it is not stored in a standardized manner across different file formats. This makes it difficult for format-agnostic tools, such as file managers or file-sharing applications, to make use of the information. It also results in a plethora of format-specific tools used to extract the metadata, such as AVInfo, id3edit, jpeginfo and Vocoditor.
In this article, the libextractor library and the extract tool are introduced. The goal of the libextractor Project is to provide a uniform interface for obtaining metadata from different file formats. libextractor currently is used by evidence, the file manager for the forthcoming version of Enlightenment, as well as for GNUnet, an anonymous, censorship-resistant peer-to-peer file-sharing system. The extract tool is a command-line interface to the library. libextractor is licensed under the GNU General Public License.
libextractor shares some similarities with the popular file tool, which uses the first bytes in a file to guess the MIME type. libextractor differs from file in that it tries to obtain much more information than the MIME type. Depending on the file format, libextractor can obtain additional information, including the name of the software used to create the file, the author, descriptions, album titles, image dimensions or the duration of a movie.
libextractor achieves this information by using specific parser code for many popular formats. The list currently includes MP3, Ogg, Real Media, MPEG, RIFF (avi), GIF, JPEG, PNG, TIFF, HTML, PDF, PostScript, Zip, OpenOffice.org, StarOffice, Microsoft Office, tar, DVI, man, Deb, elf, RPM, asf, as well as generic methods such as MIME-type detection. Many other formats exist, and among the more popular formats only a few proprietary formats are not supported.
Integrating support for new formats is easy, because libextractor uses plugins to gather data. libextractor plugins are shared libraries that typically provide code to parse one particular format. At the end of this article, we demonstrate how to integrate support for new formats into the library. libextractor gathers the metadata obtained from various plugins and provides clients with a list of pairs, consisting of a classification and a character sequence. The classification is used to organize the metadata into categories such as title, creator, subject and description.
The simplest way to install libextractor is to use one of the binary packages available for many distributions. Under Debian, the extract tool is in a separate package, extract. Headers required to compile other applications against libextractor are contained in libextractor0-devel. If you want to compile libextractor from source, you need an unusual amount of memory: 256MB of system memory is roughly the minimum, as GCC uses about 200MB to compile one of the plugins. Otherwise, compiling by hand follows the usual sequence of steps, as shown in Listing 1.
Listing 1. Compiling libextractor requires about 200MB of memory.
$ wget http://ovmj.org/libextractor/ ↪download/libextractor-0.4.1.tar.gz $ tar xvfz libextractor-0.4.1.tar.gz $ cd libextractor-0.4.1 $ ./configure --prefix=/usr/local $ make # make install
After installing libextractor, the extract tool can be used to obtain metadata from documents. By default, the extract tool uses a canonical set of plugins, which consists of all file-format-specific plugins supported by the current version of libextractor, together with the mime-type detection plugin. Example output for the Linux Journal Web site is shown in Listing 2.
Listing 2. Extracting metadata from HTML.
$ wget -q http://www.linuxjournal.com/ $ extract index.html description - The Monthly Magazine of the Linux Community keywords - linux, linux journal, magazine
If you are a user of BibTeX, the option -b is likely to come in handy to create BibTeX entries automatically from documents that have been equipped properly with metadata, as shown in Listing 3.
Listing 3. Creating BibTeX entries can be trivial if the documents come with plenty of metadata.
$ wget -q http://www.copyright.gov/legislation/dmca.pdf $ extract -b ~/dmca.pdf % BiBTeX file @misc{ unite2001the_d, title = "The Digital Millennium Copyright Act of 1998", author = "United States Copyright Office - jmf", note = "digital millennium copyright act circumvention technological protection management information online service provider liability limitation computer maintenance competition repair ephemeral recording webcasting distance education study vessel hull", year = "2001", month = "10", key = "Copyright Office Summary of the DMCA", pages = "18" }
Another interesting option is -B LANG. This option loads one of the language-specific but format-agnostic plugins. These plugins attempt to find plain text in a document by matching strings in the document against a dictionary. If the need for 200MB of memory to compile libextractor seems mysterious, the answer lies in these plugins. In order to perform a fast dictionary search, a bloomfilter is created that allows fast probabilistic matching; GCC finds the resulting data structure a bit hard to swallow.
The option -B is useful for formats that currently are undocumented or unsupported. The printable plugins typically print the entire text of the document in order. Listing 4 shows the output of extract run on a Microsoft Word document.
Listing 4. libextractor can sometimes obtain useful information even if the format is unknown.
$ wget -q http://www.bayern.de/HDBG/polges.doc $ extract -B de polges.doc | head -n 4 unknown - FEE Politische Geschichte Bayerns Herausgegeben vom Haus der Geschichte als Heft der zur Geschichte und Kultur Redaktion Manfred Bearbeitung Otto Copyright Haus der Geschichte München Gestaltung fürs Internet Rudolf Inhalt im. unknown - und das Deutsche Reich. unknown - und seine. unknown - Henker im Zeitalter von Reformation und Gegenreformation.
This is a rather precise description of the text for a German speaker. The supported languages at the moment are Danish (da), German (de), English (en), Spanish (es), Italian (it) and Norwegian (no). Supporting other languages merely is a question of adding free dictionaries in an appropriate character set. Further options are described in the extract man page; see man 1 extract.
Listing 5 shows the code of a minimalistic program that uses libextractor. Compiling minimal.c requires passing the option -lextractor to GCC. The EXTRACTOR_KeywordList is a simple linked list containing a keyword and a keyword type. For details and additional functions for loading plugins and manipulating the keyword list, see the libextractor man page, man 3 libextractor. Java programmers should know that a Java class that uses JNI to communicate with libextractor also is available.
Listing 5. minimal.c shows the most important libextractor functions in concert.
#include <extractor.h> int main(int argc, char * argv[]) { EXTRACTOR_ExtractorList * plugins; EXTRACTOR_KeywordList * md_list; plugins = EXTRACTOR_loadDefaultLibraries(); md_list = EXTRACTOR_getKeywords(plugins, argv[1]); EXTRACTOR_printKeywords(stdout, md_list); EXTRACTOR_freeKeywords(md_list); EXTRACTOR_removeAll(plugins); /* unload plugins */ }
The most complicated thing about writing a new plugin for libextractor is writing the actual parser for a specific format. Nevertheless, the basic pattern is always the same. The plugin library must be called libextractor_XXX.so, where XXX denotes the file format of the plugin. The library must export a method libextractor_XXX_extract, with the following signature shown in Listing 6.
Listing 6. Signature of the function that each libextractor plugin must export.
struct EXTRACTOR_Keywords * libextractor_XXX_extract (char * filename, char * data, size_t size, struct EXTRACTOR_Keywords * prev);
The argument filename specifies the name of the file being processed. data is a pointer to the typically mmapped contents of the file, and size is the file size. Most plugins do not make use of the filename and simply parse data directly, starting by verifying that the header of the data matches the specific format.
prev is the list of keywords extracted so far by other plugins for the file. The function is expected to return an updated list of keywords. If the format does not match the expectations of the plugin, prev is returned. Most plugins use a function such as addKeyword (Listing 7) to extend the list.
Listing 7. The plugins return the metadata using a simple linked list.
static void addKeyword (struct EXTRACTOR_Keywords ** list, char * keyword, EXTRACTOR_KeywordType type) { EXTRACTOR_KeywordList * next; next = malloc(sizeof(EXTRACTOR_KeywordList)); next->next = *list; next->keyword = keyword; next->keywordType = type; *list = next; }
A typical use of addKeyword is to add the MIME type once the file format has been established. For example, the JPEG-extractor (Listing 8) checks the first bytes of the JPEG header and then either aborts or claims the file to be a JPEG. The strdup in the code is important, because the string will be deallocated later, typically in EXTRACTOR_freeKeywords(). A list of supported keyword classifications, in the example EXTRACTOR_MIMETYPE can be found in the extractor.h header file.
Listing 8. jpegextractor.c adds the MIME type to the list after parsing the file header.
if ( (data[0] != 0xFF) || (data[1] != 0xD8) ) return prev; /* not a JPEG */ addKeyword(&prev, strdup("image/jpeg"), EXTRACTOR_MIMETYPE); /* ... more parsing code here ... */ return prev;
libextractor is a simple extensible C library for obtaining metadata from documents. Its plugin architecture and broad support for formats set it apart from format-specific tools. The design is limited by the fact that libextractor cannot be used to update metadata, which more specialized tools typically support.
Resources for this article: /article/8207.
Christian Grothoff graduated from the University of Wuppertal in 2000 with a degree in mathematics. He currently is a PhD student in computer science at Purdue University, studying static program analysis and secure peer-to-peer networking. A Linux user since 1995, he has contributed to various free software projects and now is the maintainer of GNUnet and a member of the core team for libextractor. His home page can be found at grothoff.org/christian.