Monday, April 20, 2009

Converting scanned images into PDF

I have a huge collection of printed sheet music, and finding a particular piece in it has become more and more difficult and time-consuming. So I have started scanning much of it into PDF, which makes it easy to search. (There are other benefits too: I can transmit the music digitally, which matters because I so often need to retrieve sheet music from home remotely, and I can carry a single touchscreen notebook holding all the sheet music I need for a gig.)

One problem I run into is that when I scan a page from a thick book, it is very difficult to align the book squarely with the scanner, so the scanned images come out slanted. I use GIMP to rotate them straight, but the rotation introduces new empty, transparent areas along all four edges. If I convert such an image into a format that does not support transparency, such as JPG, the transparent areas turn black. The other problem is that after rotation the page's width/height ratio no longer matches a standard paper size.
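
To make the paper-size problem concrete, here is a minimal sketch of the arithmetic for padding a rotated page back to a standard ratio. The function name canvas_for_ratio is made up for illustration, and the ratio is given as two integers (for example 17 22 for US Letter, which is 8.5 x 11 inches). It is not part of what I actually ran; it only computes the canvas size.

```shell
#!/bin/sh
# canvas_for_ratio W H RW RH  (hypothetical helper, not an ImageMagick tool)
# Print the smallest canvas, as WIDTHxHEIGHT in pixels, that has the aspect
# ratio RW:RH and fully contains an image of W x H pixels.
canvas_for_ratio() (
    w=$1 h=$2 rw=$3 rh=$4
    if [ $((w * rh)) -gt $((h * rw)) ]; then
        # Image is too wide for the ratio: keep the width, grow the height.
        # Ceiling division, so the canvas is never smaller than required.
        h=$(( (w * rh + rw - 1) / rw ))
    else
        # Image is too tall (or already exact): keep the height, grow the width.
        w=$(( (h * rw + rh - 1) / rh ))
    fi
    echo "${w}x${h}"
)
```

For example, canvas_for_ratio 2600 3300 17 22 prints 2600x3365. The result could then presumably be handed to ImageMagick, along the lines of -gravity center -background white -extent 2600x3365, to pad the page with white up to that size.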

Why do I have to use JPG? Because when ImageMagick converts pictures into a PDF, it only embeds each picture, so the PDF ends up about as large as the images that go into it. For scanned images at practical resolutions, JPG takes much less disk space than PNG.

Here's what I had to do (convert is part of the ImageMagick package):

convert -fill white -opaque "rgba(100.0%, 0.0%, 0.0%, 0)" ../*.png ./page-.jpg
convert page-*.jpg -bordercolor white -border 50x50 -trim combined.pdf

The first command converts all the PNG images into JPG files, turning the transparent regions white along the way. The second command adds a white border to all four edges and then trims it back, which leaves the smallest possible border and so preserves as much of the page as possible for the original image; it then combines everything into a single PDF.
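
For repeated use, the two steps could be wrapped in a small script along these lines. This is only a sketch under some assumptions: the names run, scans_to_pdf and DRY_RUN are made up for illustration, and it swaps my -opaque color match for -background white -alpha remove, which needs a reasonably recent ImageMagick and does not depend on the exact transparent color GIMP leaves behind. It also numbers the pages with zero padding so they stay in order past page 9.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# scans_to_pdf OUT.pdf PAGE.png...  (hypothetical helper)
# Flatten each PNG onto white, save it as a zero-padded JPG in the current
# directory, then bundle the JPGs into one PDF.
scans_to_pdf() {
    out=$1
    shift
    n=0
    jpgs=""
    for png in "$@"; do
        n=$((n + 1))
        jpg=$(printf 'page-%03d.jpg' "$n")
        # Assumption: -alpha remove is available (newer ImageMagick only).
        run convert "$png" -background white -alpha remove "$jpg" || return 1
        jpgs="$jpgs $jpg"
    done
    # $jpgs is left unquoted on purpose: the generated names have no spaces.
    run convert $jpgs -bordercolor white -border 50x50 -trim "$out"
}
```

Calling scans_to_pdf combined.pdf ../*.png would then do the whole job, and setting DRY_RUN=1 first only prints the convert commands without running them, which is a cheap way to check the page order.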