tom alison dot com


Command-line Document Conversion

If you're simply looking for a way to convert a one-off document, upload it to Google Docs or Scribd and let them take care of the conversion for you. Be mindful of your privacy settings so you don't accidentally share your document with the whole world.

For command-line conversion of documents, here's how to get going on Ubuntu:

Install OpenOffice (headless)

OpenOffice provides the core conversion facilities. It does a pretty respectable job of converting most document types, including spreadsheets and presentations.

You need to actually run an instance of OpenOffice in order to send it the request to convert the document. Installing the headless version means you can run it on a server without a windowing system, which you need to be able to do if you want to run a massive document conversion farm on EC2.

apt-get install openoffice.org-headless
apt-get install openoffice.org-java-common
apt-get install openoffice.org-writer

Simple conversion with CUPS-PDF

One simple way to convert a document to PDF is to print it to CUPS-PDF, a virtual PDF printer.

apt-get install cups-pdf

Use OpenOffice to print the doc to a PDF file using CUPS-PDF. Note that the output path is set in /etc/cups/cups-pdf.conf (I'm using cups-pdf v2.5.0).
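If you'd rather not hard-code the output path, one option is to read it from the config file. This is a minimal sketch, assuming the directory is set by an "Out" directive (check your own cups-pdf.conf to confirm). cups_pdf_outdir is a hypothetical helper name, not part of cups-pdf.

```shell
# Print the value of the "Out" directive from a cups-pdf config file.
# Assumption: the directive looks like "Out ${HOME}/PDF". Commented-out
# lines are skipped because their first field is not exactly "Out".
# Paths containing spaces are not handled by this sketch.
cups_pdf_outdir() {
    local conf="${1:-/etc/cups/cups-pdf.conf}"
    awk '$1 == "Out" { print $2; exit }' "$conf"
}
```

The value is printed as written in the file, so placeholders such as ${HOME} come back unexpanded.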

Here's a script to do the conversion and open the output PDF in evince.

to_pdf1.sh

#!/bin/bash
#
# Prints a file to PDF using OpenOffice and
# CUPS-PDF. Opens output file with evince doc
# viewer.
##

CUPS_HOME="$HOME/PDF"

file=$(basename "$1")
prefix=${file%.[^.]*}
outfile="${CUPS_HOME}/${prefix}.pdf"

echo Printing ${prefix}.pdf

soffice -norestore \
    -nofirststartwizard \
    -nologo \
    -headless \
    -pt PDF "$1"

echo Sleeping 10 seconds

sleep 10

echo Opening $outfile

evince "$outfile"

Note that in the -pt PDF option, "PDF" is the name of the CUPS-PDF printer. It may differ on your platform. Check /etc/cups/printers.conf, or run lpstat -p to list the configured printers.
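The fixed 10-second sleep in to_pdf1.sh is a guess: too short for a big document, wasted time for a small one. A possible alternative is to poll for the output file. The sketch below is one way to do that; wait_for_file is a hypothetical helper, and the 30-second default timeout is an arbitrary choice.

```shell
# Wait until a file exists and is non-empty, or give up after a timeout.
# Usage: wait_for_file PATH [TIMEOUT_SECONDS]
# Returns 0 once the file shows up, 1 if the timeout expires first.
wait_for_file() {
    local path="$1" timeout="${2:-30}" waited=0
    while [ ! -s "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In the script you would swap the sleep for something like: wait_for_file "$outfile" 30 || exit 1. Keep in mind that a non-empty file may still be in the middle of being written, so this is only a rough improvement.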

Better conversion with unoconv

The printing method works okay, but it doesn't detect the orientation of your document, so presentations may come out in portrait mode instead of landscape mode in the output PDF.

unoconv is a Python utility that talks to an OpenOffice process via a UNO (Universal Network Objects) bridge.

apt-get install unoconv

Because unoconv requires an OpenOffice instance, it's best to have a process running before doing a lot of document conversion. unoconv can start one for you:

unoconv --listener > /dev/null 2>&1 &

Verify the process started:

$ ps -ef | grep soffice

Look for something like this (line breaks inserted for legibility):

soffice.bin -nologo -nodefault 
-accept=socket,host=localhost,port=2002;urp;StarOffice.ComponentContext
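If you want to check this from a script rather than by eye, you can pull the port number out of the process's command line. This is a rough sketch that assumes the -accept argument has the shape shown above; listener_port is a hypothetical helper name.

```shell
# Read process command lines on stdin and print the port from any
# "-accept=socket,host=...,port=NNNN;" argument found.
listener_port() {
    sed -n 's/.*-accept=socket,host=[^,]*,port=\([0-9]*\);.*/\1/p'
}
```

Feed it the process list, for example: ps -eo args | listener_port. No output means no listener was found.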

Now you can bust out a script like this:

to_pdf2.sh

#!/bin/bash
#
# Converts a document to a PDF file using
# unoconv and opens it with the evince viewer.
##

file=$(basename "$1")
dir=$(dirname "$1")
prefix=${file%.[^.]*}
outfile="${prefix}.pdf"

echo Generating $outfile

unoconv -f pdf "$1"

echo Opening $outfile

evince "${dir}/$outfile"

This option is much nicer: there's no printer to configure and no sleeping while you wait for the output file to show up.
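With a listener already running, converting a whole batch of files is a small step from here. The sketch below relies on unoconv writing the PDF next to the input file, the same assumption to_pdf2.sh makes. The function names are hypothetical, and the "skip if up to date" check is my own addition.

```shell
# Path of the PDF that unoconv is expected to write next to its input.
pdf_path_for() {
    local dir base
    dir=$(dirname -- "$1")
    base=$(basename -- "$1")
    printf '%s/%s.pdf\n' "$dir" "${base%.*}"
}

# True if the PDF is missing or older than the source document.
needs_convert() {
    local pdf
    pdf=$(pdf_path_for "$1")
    [ ! -e "$pdf" ] || [ "$1" -nt "$pdf" ]
}

# Convert every file given as an argument, skipping up-to-date ones.
convert_all() {
    local src
    for src in "$@"; do
        if needs_convert "$src"; then
            echo "Generating $(pdf_path_for "$src")"
            unoconv -f pdf "$src"
        fi
    done
}
```

Call it with a glob, for example: convert_all ~/docs/*.doc ~/docs/*.ppt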

Java folks should check out the JODConverter project, which provides similar functionality.

Taking it further

Document thumbnails

Want to grab a thumbnail of the first page of your converted PDF? First install ImageMagick:

apt-get install imagemagick

Then run:

convert 'example.pdf[0]' -thumbnail '120x120!' -gravity center thumbnail.jpg

This will generate a fixed 120x120 thumbnail in JPG format of the first page of your PDF. Note the array-like syntax: example.pdf[0]. This constrains the thumbnail generation to the first page. If omitted, a thumbnail of every page is generated.
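One thing to watch: the ! in 120x120! forces the exact size and ignores the aspect ratio, so a portrait page gets squashed into a square. If you'd rather keep the proportions and pad the rest, something like the sketch below may work better. make_thumbnail and the CONVERT override are hypothetical names, and the white background is just my choice.

```shell
# Thumbnail of the first page, scaled to fit inside SIZE x SIZE and
# padded with white to exactly SIZE x SIZE.
# Usage: make_thumbnail INPUT.pdf OUTPUT.jpg [SIZE]
# CONVERT is a hypothetical override for the ImageMagick binary name.
make_thumbnail() {
    local pdf="$1" out="$2" size="${3:-120}"
    "${CONVERT:-convert}" "${pdf}[0]" \
        -thumbnail "${size}x${size}" \
        -background white -gravity center \
        -extent "${size}x${size}" \
        "$out"
}
```

Usage: make_thumbnail example.pdf thumbnail.jpg 120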

Indexing document content

Want to add document search? Run:

unoconv -f txt sample.doc

This generates a text version of the document that you can index with Lucene or the indexer of your choice.
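Before wiring up a real indexer, you can sanity-check the extracted text with plain grep. This is only a stand-in for proper search, not a replacement for Lucene. It assumes the .txt files sit together in one directory, and search_docs is a hypothetical helper name.

```shell
# List the .txt files in a directory that mention a term.
# -i: case-insensitive, -l: print file names only, -F: literal string.
search_docs() {
    local dir="$1" term="$2"
    grep -ilF -- "$term" "$dir"/*.txt
}
```

Usage: search_docs ~/docs "quarterly report"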

Read documents in the browser

You can generate an HTML version of your document to display on the web:

unoconv -f html sample.doc

For presentations and some other document formats, separate HTML files are generated for each page. A little messy...

You could get fancy like Scribd and render the document using Flash.

apt-get install swftools

Check out pdf2swf. Put in a little work to build navigation controls for your document and you've got yourself a pretty nice viewer.
