Digitising Print Media – Briss and OCR

In almost any field one of the primary research tasks is reading and absorbing information. In some fields you may be blessed with having all your information accessible digitally, preferably natively digital. However, in every discipline I have worked in, many of the papers and materials have only been available in physical format, be they books, journals, or otherwise. Some friends, and indeed some commentators on this site, prefer to use the majority of their works in a physical format. If that is you, then this post may not pique your interest. But if you are like me and a significant proportion of my peers, then you will likely prefer digital media for your research, if only because it makes things easier to find in the long run (and you get to feel somewhat more environmentally friendly). So how do you get that information into your workflow and have it in a usable format? Welcome to the Monday toolkit post on Digitising Printed Media.

Now, if working with digital media is your thing, I am almost certain that you will have found out whether your office, library or other research institution of choice has the ability to scan documents. Most photocopiers come with this functionality these days, and the majority I have used in the past four years have supported scanning straight to PDF on a USB stick. All well and good, you might say: there is my digital media right there, job done. Well, not quite. You see, while natively digital media generally comes formatted for the screen, in either portrait or landscape, printed media comes in a wide variety of formats. Academic works in Psychology and Theology range from squiffy-near-A5 through to square-wannabe-A3 books (Hermeneia commentaries, I’m looking at you here…). However, your digital device likely comes in only one of two formats, both rectangles, at roughly 16:10 (Android and most PC monitors) or 4:3 (iPads and older monitors). How do you get your freshly scanned media to display nicely on your screen, in such a way that you don’t spend extra time scrolling or zooming, and you don’t go blind from eyestrain?
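To put rough numbers on that mismatch, here is a minimal Python sketch. It assumes the reader app simply scales the whole page to fit the screen without cropping; the function names are made up for illustration and are not part of any tool mentioned in this post.

```python
def fit_scale(page_w, page_h, screen_w, screen_h):
    """Largest uniform scale at which the whole page still fits on the screen."""
    return min(screen_w / page_w, screen_h / page_h)


def screen_fill(page_w, page_h, screen_w, screen_h):
    """Fraction of the screen area covered by the page once it is fitted whole."""
    scale = fit_scale(page_w, page_h, screen_w, screen_h)
    return (page_w * scale) * (page_h * scale) / (screen_w * screen_h)


# A square page on a 16:10 screen covers 0.625 of the display;
# the same page on a 4:3 screen covers 0.75.
print(screen_fill(1, 1, 16, 10))
print(screen_fill(1, 1, 4, 3))
```

Only the ratios matter here, so the units can be millimetres, pixels or anything else. The further the page shape is from the screen shape, the more of the display goes unused, and that is before counting the blank margins around the scanned text.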

Briss

This is where one little free app comes to the rescue: Briss (http://briss.sourceforge.net). While many apps have a plethora of options and capabilities, essentially playing the modern Swiss Army knife, it is quite refreshing that Briss exists for only one purpose: cutting up PDFs. Although it is possible to do this job natively on the photocopier, by manually programming it to output only the portion of the page that you are interested in, that is much more difficult than using Briss. Briss is a Java app, and is therefore happily cross-platform, running on both Windows and MacOS provided you have a Java runtime installed. Simply speaking, Briss has three steps to its workflow:

  1. Open File
  2. Select Page Zones
  3. Output File

When you first open a file in Briss it overlays all your PDF pages together, so you get a feel for where the text sits on the page. Selecting the new pages is a simple drag-and-select operation, with each selection shown as a translucent blue rectangle with the Odd/Even page number on it. Once you are happy with the locations of the pages you can simply output them to a new PDF. Briss on its own is an amazing timesaver, and makes for nice and easily readable PDFs, no matter whether you read them on a tablet or a monitor. In addition, if you want to format scans for later printing, it means that you can print cleaner files for better markup. In fact, I know of several people who scan to PDF, Briss, and then print the resultant file through their own printer because they prefer working on paper. This way they also have a digital copy in case they lose or clean out the hardcopy.

When scanning the documents in, I recommend using full platen scans, or at least one full size larger than the document you are scanning, and the highest resolution possible. With Briss you can easily cut down the page to suit the scanned document, and the higher resolution really helps in the next step. Plus, bad quality documents are no fun to read.

But wait, there is more. Now I feel like a cheesy tele-salesman, although I have wanted to say that for most of this blog series. A friend of mine, Rob, wrote a couple of minor upgrades to Briss a while ago. His version allows files to be opened via command line arguments, and adds automatic page resizing. What does that mean? The first mod allows a small script to start the Briss process; on MacOS you can easily implement this via the Automator app, and you can copy the script below if your Briss app is in the Applications folder. The second allows you to press a single key (V) and both pages spring to the same size, meaning that the pages don’t alternate in size on digital devices. Minor tweaks, but really valuable. His version can be downloaded here: briss-rob.jar. Just place it in with the rest of the Briss folder.
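The Automator script itself is not reproduced in this copy of the post, so as a stand-in here is a rough Python sketch of the same idea: start Briss with a PDF already loaded. It rests on several assumptions that you should check against your own setup: that `java` is on your PATH, that the modified jar accepts the PDF path as a single positional argument, and that it lives at the hypothetical location stored in `BRISS_JAR`.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: a hypothetical install location; point this at your own copy of the jar.
BRISS_JAR = Path("/Applications/briss/briss-rob.jar")


def build_briss_command(pdf_path, jar_path=BRISS_JAR, java="java"):
    """Build the argument list that launches the jar with one PDF to open.

    Assumes the modified Briss takes the file as a plain positional argument.
    """
    return [java, "-jar", str(jar_path), str(pdf_path)]


def open_in_briss(pdf_path, jar_path=BRISS_JAR):
    """Launch Briss on the given PDF without waiting for the window to close."""
    if shutil.which("java") is None:
        raise RuntimeError("No Java runtime found on PATH")
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)
    return subprocess.Popen(build_briss_command(pdf_path, jar_path))
```

Calling `open_in_briss("scan.pdf")` from a small wrapper script, or from an Automator "Run Shell Script" action that invokes Python, should then hand the file straight to Briss. Building the command as a list rather than a single string means file names with spaces need no extra quoting.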

OCR

Now you have nicely formatted PDFs, but they aren’t overly usable. Each PDF is simply a big image of the page, and it knows naught of the words on the page. Well, there is one simple way of fixing that problem: Optical Character Recognition tools. There are absolutely tons of them out there, both free and paid, although my recommendation is relatively mainstream and unfortunately costs money. Personally I have forked out money for Adobe Acrobat, and even though it costs a reasonable amount it works brilliantly. Using the fairly basic settings (300 dpi/ClearScan) it provides a relatively accurate transcription of the words on the page, and as a bonus it drastically reduces the file size. It is not uncommon for Acrobat to take 10 MB files down to 500 KB of OCRed output. The downside is of course the cost.

Whichever program you use for OCR work, it is important for the rest of the workflow that you get a good transcription of the document. If there are severe inaccuracies, or simply gibberish, then it won’t be as usable in the later stages of the workflow.
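One crude way to spot gibberish before it reaches the later stages is to measure how much of the recognised text looks like ordinary words. The sketch below is only a heuristic of my own devising, not a feature of Acrobat or any other OCR tool; it assumes you have already copied or exported the text layer into a Python string, and what counts as a worryingly low score is something you would need to judge against your own documents.

```python
import re

# A "word-like" token: letters only, optionally joined by apostrophes or hyphens.
WORDLIKE = re.compile(r"^[^\W\d_]+(?:['’-][^\W\d_]+)*$")

# Punctuation that is allowed to surround a word without counting against it.
EDGE_PUNCTUATION = ".,;:!?()[]\"'“”‘’"


def wordlike_ratio(text):
    """Return the fraction of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(
        1 for token in tokens if WORDLIKE.match(token.strip(EDGE_PUNCTUATION))
    )
    return wordlike / len(tokens)


print(wordlike_ratio("The quick brown fox."))  # every token is word-like: 1.0
print(wordlike_ratio("Th3 qu!ck br0wn f#x"))   # no token is word-like: 0.0
```

Numbers, page references and formulae all count as non-words here, so even a perfect transcription of a heavily footnoted page will score below 1.0. The ratio is more useful for comparing pages or OCR settings against each other than as an absolute measure.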

That is about it for this step of the workflow. The digitising operations may seem trivial and inconsequential, but if you want to work with digital media then this stage is critical. Getting good digitisation of your material really helps in the long term.

Please comment below on what you use for this stage; I’m always eager to re-evaluate my options.

Chris
