OpenOffice.org in the Limelight

by Cezary M. Kruk

OpenOffice.org is a great set of software, consisting of several useful components that offer a lot of options. It is customizable and introduces many open formats for documents. In order to adapt the basic configurations to your particular needs, OpenOffice.org allows you to prepare macros and additional scripts.

I work as an editor at a Polish free software magazine. At the beginning of the editorial process, the author supplies the text and the editor edits it. Editing means removing common content-related and formal mistakes or errors, as well as preparing the text in a standard form to make it easier to process at further stages. The proofreader then corrects the text and the editor looks through it again and makes the final changes. Finally, the typesetter prepares the text for printing, and the editor checks the entire work one last time.

The processed text is in a different format at each stage of this process. Our publishing house prefers open formats for documents, so our authors deliver the documents in text or HTML formats and the graphics in PNG or EPS formats. After editing the document, the editor sends a copy to the author; that copy is in HTML. Our proofreaders work on Microsoft Windows systems and use Microsoft Word, so they need the documents to be in the .doc file format. Our typesetters work on Macintosh systems and use QuarkXPress. They need two kinds of documents: Microsoft Word files for printing and checking the required formats for the article, and Macintosh text files for opening in Quark and processing.

When our quarterly started in autumn 2000, I was using StarOffice. Since then, I have switched to OpenOffice.org. The methods for working with authors' text files are similar in StarOffice and OpenOffice.org. I import the document in text or HTML format using StarWriter (previously) or OpenOffice.org Writer (at present) and, after processing it, export it to HTML, Microsoft Word or the corresponding SDW or SXW file formats.

Figure 1. The KillparZ macro facilitates preprocessing of the imported text files.

Importing Text and HTML Files

If a source file is prepared well, there should be no problems when importing it. If a file is damaged, it must be repaired. This is not difficult to do, because the documents are in open formats.

Once a file is imported, you need to change it to the proper format. The editors of Polish, German, French or other non-English language publications should change the codepage as well. A standard codepage for Polish documents, for example, is ISO-8859-2, and the standard codepage for all OpenOffice.org documents is UTF-8. To convert imported documents in a convenient way, you need a macro. The macros I've built for OpenOffice.org include several codepage converters, among them converters from ISO-8859-2 to UTF-8 and vice versa.
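The converters mentioned above are OpenOffice.org macros, and their source is not shown in this article. As a rough illustration of what such a conversion does, here is a minimal Perl sketch that uses the core Encode module. The helper name latin2_to_utf8 is hypothetical and is not part of the ooo-macro bundle.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the ooo-macro bundle):
# convert a byte string from ISO-8859-2 to UTF-8.
sub latin2_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('ISO-8859-2', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $text);             # characters -> UTF-8 bytes
}

# Byte 0xB3 is the Polish letter "l with stroke" in ISO-8859-2.
my $utf8 = latin2_to_utf8("\xB3");

# Print the resulting bytes in hexadecimal, separated by dots.
printf "%vX\n", $utf8;
```

Swapping the two encoding names gives the reverse conversion, from UTF-8 back to ISO-8859-2.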

Paragraphs in text files written in some text editors may be broken into a number of lines. To consolidate them, you need to use the KillparZ macro, which is an improved version of the killpars macro by Andrew Brown (Figure 1). KillparZ is a component of the ooo-macro bundle.
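KillparZ itself is a Writer macro and is not reproduced here. The idea of consolidating broken paragraphs can be sketched in Perl as follows, assuming that paragraphs are separated by one or more blank lines; the helper name join_paragraph_lines is hypothetical, and the real macro may use different rules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join the lines of each paragraph into a single line.
# Assumption: paragraphs are separated by one or more blank lines.
sub join_paragraph_lines {
    my ($text) = @_;
    my @paragraphs = split /\n\s*\n/, $text;
    for my $p (@paragraphs) {
        $p =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
        $p =~ s/\s*\n\s*/ /g;     # line breaks inside a paragraph become spaces
    }
    # Drop empty paragraphs and separate the rest with one blank line.
    return join("\n\n", grep { length } @paragraphs) . "\n";
}

print join_paragraph_lines(
    "First line\nsecond line.\n\nNext paragraph\ncontinues here.\n"
);
```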

Assuming the author of the document declared the appropriate charset, there shouldn't be a problem with the codepage when you import an HTML file. But another problem may arise—the shortcuts associated with your macros stop working in HTML documents. To make macros work, you need to create an empty OpenOffice.org Writer document, open the HTML file, copy it, close the HTML file and, finally, paste the content into the Writer document.

Codepages and DOCs

Our magazine is published in Polish, so I need to use more sophisticated methods when exporting files. Specifically, I need to use fonts with Polish diacritics. My tests of StarWriter and OpenOffice.org Writer have shown that if you want to avoid problems related to codepages in non-English language documents, you should use TrueType fonts instead of Type1 fonts. Moreover, you get the best results when exporting documents to the Microsoft Word format if you use the same fonts that are used in Microsoft Windows. The Microsoft fonts bundled in Microsoft FontPack, including Times New Roman, Arial and Courier New, are sufficient in most cases.

The authors of StarOffice and OpenOffice.org had to use some reverse engineering to discover how the Microsoft Word format is constructed. As a result, the export filter from Writer to Word works well but not perfectly. Therefore, if you want to exchange standard document types with other users, prepare one typical document using all the necessary formatting, including headers, italics and boldface. Then make the sample available to coworkers and ask them if everything works well.

The articles we publish are a simple kind of document. Our editorial office uses the three above-mentioned fonts, as well as italic and bold, two levels of headers and simple tables. We do not include the graphics in our documents; we simply list the names of the files in PNG or EPS format. Such documents can be exported from SDW or SXW formats to Microsoft Word without any problems.

Figure 2. An HTML file as exported by OpenOffice.org—it uses styles, classes and a lot of other unwanted formatting.

Figure 3. The same HTML file converted using the soffice2html filter—more standardized and more readable.

Figure 4. CHIP Special editorial staff, from left to right: Robert Bielecki (editor), Romek Gnitecki (editor in chief), Cezary M. Kruk (CHIP Special Linux) and Tomek Borukalo (editor).

HTML Format

Obtaining proper documents in HTML format is slightly more difficult. StarWriter and OpenOffice.org Writer produce sophisticated HTML, as shown in Figure 2. You can convert this HTML, however, by using a simple Perl script. I call mine soffice2html. At the beginning of the script, you should instruct it to replace line endings with spaces, like this:


s/\n/ /;

Next, you can replace some elements of the code with different ones. For example, using the commands:


s/<(\/?)B>/<$1STRONG>/g;
s/<(\/?)I>/<$1EM>/g;

you can replace all <B> ... </B> and <I> ... </I> tag pairs with <STRONG> ... </STRONG> and <EM> ... </EM> tag pairs, so bold and italic are marked up according to established standards. You then can remove redundant tags, such as doubled ones:


s/<EM><EM>/<EM>/g;
s/<\/EM><\/EM>/<\/EM>/g;

After this, it is a good idea to restore some line endings. Simple commands such as:


s/(.+?)</$1\n</g;
s/>(.+?)/>\n$1/g;

put a line break before and after each HTML tag. To make your script more professional, you can add the finishing touch by using the command:


print OUT "<!-- ", "soffice2html: ",
          scalar localtime, " -->\n";

This adds a comment to the processed HTML file, which is something like:


<!-- soffice2html: Wed Jul 23 17:34:35 2003 -->

Now, if you start with document.sxw and export it to document.html, you should process the latter using the command soffice2html document.html (Figure 3). Filtering HTML files in this way produces better code, that is, more standardized and more readable, and files that are 15% to 40% smaller. The current version of the ooo-macro bundle includes the soffice2html script.
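To show how the substitutions discussed in this section fit together, here is a minimal sketch that applies them to a string. It is not the soffice2html script from the ooo-macro bundle, which may differ. The helper name clean_html is hypothetical, and the line-break step uses one substitution of my own, followed by a cleanup of repeated newlines, in place of the two commands shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper assembling the substitutions discussed in this section.
sub clean_html {
    my ($html) = @_;
    $html =~ s/\n/ /g;                      # replace line endings with spaces
    $html =~ s/<(\/?)B>/<$1STRONG>/g;       # <B> ... </B> becomes <STRONG> ... </STRONG>
    $html =~ s/<(\/?)I>/<$1EM>/g;           # <I> ... </I> becomes <EM> ... </EM>
    $html =~ s/<EM><EM>/<EM>/g;             # collapse doubled opening tags
    $html =~ s/<\/EM><\/EM>/<\/EM>/g;       # collapse doubled closing tags
    # Assumption: a single substitution puts every tag on its own line.
    $html =~ s/\s*(<[^>]+>)\s*/\n$1\n/g;
    $html =~ s/\n+/\n/g;                    # squeeze repeated newlines
    $html =~ s/^\n//;                       # drop the leading newline
    return $html;
}

my $sample = "<P>Hello <B>world</B></P>\n";
print clean_html($sample);

# Closing comment with a timestamp, as in the command shown earlier.
print "<!-- ", "soffice2html: ", scalar localtime, " -->\n";
```

Keeping the timestamp outside the helper makes the filter's output repeatable, which is convenient when you compare two runs on the same file.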

To produce a simple Macintosh text file from a document, you should save it in the Text Encoded file type that uses the appropriate character set. For Polish documents, for example, the valid set is Eastern Europe.

This method of exporting is good enough for common tasks, but it's not so good for typographic purposes. Our articles often need to use symbols for keystrokes when discussing specific tasks and other special characters. When you use the standard method to produce Macintosh text files, you lose all those characters. To keep them, you need a macro to convert the characters from UTF-8 to the Macintosh codepage. The appropriate macro, recode_utf_8_to_apple_macintosh, is a part of the ooo-macro bundle.

To produce a text file using the above-mentioned macro, run it and then save the document as a Text Encoded file type, using the System character set and CR paragraph breaks. The file includes information that makes the typesetter's job faster and easier.
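The recode_utf_8_to_apple_macintosh macro is not listed in this article. For readers who want to try a similar conversion outside Writer, the following Perl sketch shows one possible approach. It assumes that the target is the Macintosh Central European character set, which Perl's Encode module names MacCentralEurRoman, and that paragraph breaks should be single CR characters. The helper name utf8_to_mac_ce is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes in, Macintosh bytes out.
# Assumption: the target character set is MacCentralEurRoman.
sub utf8_to_mac_ce {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);          # bytes -> Perl characters
    $text =~ s/\r?\n/\r/g;                       # LF or CRLF becomes CR
    return encode('MacCentralEurRoman', $text);  # characters -> Macintosh bytes
}

# "a", the Polish letter "l with stroke" (0xC5 0x82 in UTF-8), a line break, "b".
my $mac = utf8_to_mac_ce("a\xC5\x82\nb\n");

# Print the resulting bytes in hexadecimal, separated by dots.
printf "%vX\n", $mac;
```

By default, Encode replaces characters that have no equivalent in the target set with a substitution character, so it is worth checking the output for special symbols before sending it to the typesetter.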

In the Limelight

Using OpenOffice.org Writer as an editorial tool allows you to process documents and share them among authors, proofreaders and typesetters in a way that is transparent for everyone involved. You need only Writer, some TrueType fonts, a small bundle of macros and the Perl script for preparing nice HTML files.

Resources for this article: www.linuxjournal.com/article/7925.

Cezary M. Kruk lives in Wroclaw, Poland. He is an editor for the Polish quarterly, CHIP Special Linux.
