Digitising Print Media — Briss and OCR

In almost any field, one of the primary research tasks is reading and absorbing information. In some fields you may be blessed with having all your information accessible digitally, preferably natively digitised. However, in every discipline I have worked in, many of the papers and materials have only been available in physical format, be they books, journals, or otherwise. Some friends, and indeed some commentators on this site, prefer to use the majority of their works in a physical format. If that is you, then this post may not pique your interest. But if you are like me and a significant proportion of my peers, then you will likely prefer digital media for your research, if only because it makes things easier to find in the long run (and you get to feel somewhat more environmentally friendly). So how do you get that information into your workflow and have it in a usable format? Welcome to the Monday toolkit post on Digitising Printed Media.

Now, if working with digital media is your thing, I am almost certain that you will have found out whether your office, library or other research institution of choice has the ability to scan documents. Most photocopiers come with this functionality these days, and the majority I have used in the past four years have supported scanning straight to PDF on a USB stick. All well and good, you might say, there is my digital media right there, job done. Well, not quite. You see, while natively digital media generally comes formatted for the screen, in either portrait or landscape, printed media comes in a wide variety of formats. Academic works in Psychology and Theology come in a range from squiffy-near-A5 through to square-wannabe-A3 books (Hermeneia commentaries, I'm looking at you here...). However, your digital device likely only comes in one of two formats, both approximating a rectangle in either 16:10 (Android, and most PC monitors) or 4:3 (iPads and older monitors). How do you get your freshly scanned media to display nicely on your screen, in such a way that you don't spend extra time scrolling or zooming, and you don't go blind from eyestrain? Imagine reading an uncropped scan on that screen.
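
To put a rough number on the problem, here is a minimal Python sketch that estimates how much of a screen a page actually fills when it is shown whole. The page and screen sizes are hypothetical examples chosen for illustration, not measurements from any particular book or device.

```python
def fit_scale(page_w, page_h, screen_w, screen_h):
    """Return the largest scale at which the whole page fits on the screen."""
    return min(screen_w / page_w, screen_h / page_h)


def screen_coverage(page_w, page_h, screen_w, screen_h):
    """Fraction of the screen area covered by the page when fitted whole."""
    scale = fit_scale(page_w, page_h, screen_w, screen_h)
    return (page_w * scale) * (page_h * scale) / (screen_w * screen_h)


# Hypothetical example: a landscape A3 platen scan (420 x 297 mm) holding a
# two-page spread, versus a single cropped page (150 x 230 mm), both shown
# on a 4:3 tablet held in portrait (768 x 1024 pixels).
tablet = (768, 1024)
full_scan = screen_coverage(420, 297, *tablet)
cropped_page = screen_coverage(150, 230, *tablet)

print(f"full platen scan covers {full_scan:.0%} of the screen")
print(f"cropped page covers {cropped_page:.0%} of the screen")
```

Under these assumed sizes the uncropped scan leaves nearly half the screen empty, and the text on it is shrunk accordingly, which is where the scrolling and zooming come from.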

Briss

This is where one little free app comes to the rescue: Briss (http://briss.sourceforge.net). While many apps have a plethora of options and capabilities, essentially acting as the modern Swiss Army knife, it is quite refreshing that Briss exists for only one purpose: cutting up PDFs. Although it is possible to do this job natively on the photocopier, manually programming it to output only the portion of the page that you are interested in, doing so is much more difficult than using Briss. Briss is a Java app, and is therefore happily cross-platform, running on both Windows and macOS provided you have a Java runtime installed. Simply speaking, Briss has three steps to its workflow:

  1. Open File
  2. Select Page Zones
  3. Output File

As in the screenshots above, when you first open a file in Briss it overlays all your PDF pages together so you get a feel for where the text sits on the page. Selecting the new pages is a simple drag-and-select operation, with Briss displaying a translucent blue rectangle labelled with the Odd/Even page number. Once you are happy with the locations of the pages you can simply output them to a new PDF. Briss on its own is an amazing timesaver, and makes for nice and easily readable PDFs, no matter whether you read them on a tablet or a monitor. In addition, if you want to format scans for later printing, it means that you can print cleaner files for better markup. In fact I know of several people who scan to PDF, run the file through Briss, and then print the resultant file on their own printer, as they prefer working on paper. This way they also have a digital copy in case they lose or clean out the hardcopy.

When scanning the documents in, I recommend using full platen scans, or at least one full size larger than the document you are scanning, and the highest resolution possible. With Briss you can easily cut down the page to suit the scanned document, and the higher resolution really helps in the next step. Plus, having bad-quality documents to read leaves you feeling square-eyed.
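
For a sense of what "highest resolution possible" means in practice, the following Python sketch works out the pixel dimensions of a scan from the paper size and the scanner resolution. The paper sizes are the standard A-series dimensions; the list of resolutions is just a set of common scanner settings picked for illustration.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm, height_mm, dpi):
    """Pixel dimensions of a scan of the given physical size at `dpi`."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


# A4 is 210 x 297 mm; A3 is 297 x 420 mm.
for name, (w_mm, h_mm) in {"A4": (210, 297), "A3": (297, 420)}.items():
    for dpi in (150, 300, 600):
        w_px, h_px = scan_pixels(w_mm, h_mm, dpi)
        print(f"{name} at {dpi} dpi: {w_px} x {h_px} px "
              f"({w_px * h_px / 1e6:.1f} megapixels)")
```

Doubling the resolution roughly quadruples the pixel count, so a small page cropped out of a large platen scan still has plenty of detail left for the OCR step if the scan was made at a high setting.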

But wait, there is more (now I feel like a cheesy tele-salesman, although I have wanted to say that for most of this blog series). A friend of mine, Rob, wrote a couple of minor upgrades to Briss a while ago. His version allows for files to be opened via command line arguments, and for automatic page resizing. What does that mean? The first mod allows a small script to start the Briss process; on macOS you can easily implement this via the Automator app, and you can copy the script below if your Briss app is in the Applications folder. The second allows you to press a single key (V) and both pages spring to the same size, meaning that the pages don't alternate sizing on digital devices. Minor tweaks, but really valuable. His version can be downloaded here: briss-rob.jar. Just place it in the Briss folder alongside the rest of the files.
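
As a rough illustration of the first modification, a launcher could look something like the Python sketch below. This is not the Automator script itself: the jar path is hypothetical, and the sketch assumes the modified Briss accepts the PDF path as its only command-line argument, so check both against your own setup.

```python
import subprocess
import sys

# Assumption: the modified jar sits in a Briss folder inside Applications.
# Adjust this hypothetical path to wherever your copy actually lives.
BRISS_JAR = "/Applications/Briss/briss-rob.jar"


def briss_command(pdf_path, jar_path=BRISS_JAR):
    """Build the command that opens `pdf_path` in Briss.

    Assumes the jar takes the PDF path as its only command-line argument.
    """
    return ["java", "-jar", str(jar_path), str(pdf_path)]


if __name__ == "__main__":
    # Start one Briss process per PDF named on the command line.
    for pdf in sys.argv[1:]:
        subprocess.Popen(briss_command(pdf))
```

Saved as, say, `open_in_briss.py`, it would be run as `python open_in_briss.py scan.pdf`. Passing the command as a list rather than a single string means file names containing spaces need no extra quoting.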

OCR

Now you have nicely formatted PDFs, but they aren't overly usable. Each PDF is simply a big image of the page, and it knows naught of the words on the page. Well, there is one simple way of fixing that problem: Optical Character Recognition tools. There are absolutely tons of them out there, both free and paid, although my recommendation is relatively mainstream and unfortunately costs money. Personally I have forked out money for Adobe Acrobat, and even though it costs a reasonable amount it works brilliantly. Using the fairly basic settings (300 dpi/ClearScan) it provides a relatively accurate transcription of the words on the page, and as a bonus it drastically reduces the file size. It is not uncommon for Acrobat to take 10 MB files down to 500 KB of OCRed output. The downside is of course the cost.

Whichever program you use for OCR work, it is important for the rest of the workflow to get a good transcription of the document. If there are severe inaccuracies, or simply gibberish, then it won't be as usable in the later stages of the workflow.
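
If you want a quick sanity check on a transcription, one crude approach is to count how many tokens in the extracted text do not look like words or numbers. The Python sketch below is only a heuristic under loud assumptions: it expects plain English text (Greek or Hebrew passages would be flagged wholesale), the rules for a "plausible" token are made up for illustration, and it says nothing about how you get the text out of the PDF in the first place.

```python
import re
import string

WORD = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")
NUMBER = re.compile(r"\d+(?:[.,:]\d+)*")
VOWELS = set("aeiouyAEIOUY")


def looks_plausible(token):
    """True if a token resembles an English word, acronym or plain number."""
    core = token.strip(string.punctuation + "“”‘’")
    if not core:
        # Lone punctuation marks are normal; long runs of symbols are not.
        return len(token) <= 2
    if NUMBER.fullmatch(core):
        return True
    if not WORD.fullmatch(core):
        return False
    # Accept acronyms such as PDF; otherwise insist on at least one vowel.
    return core.isupper() or any(c in VOWELS for c in core)


def suspicious_ratio(text):
    """Fraction of whitespace-separated tokens that fail `looks_plausible`."""
    tokens = text.split()
    if not tokens:
        return 1.0  # an empty text layer is as bad as gibberish
    bad = sum(not looks_plausible(t) for t in tokens)
    return bad / len(tokens)


good = "Each PDF is simply a big image of the page."
garbled = "E4ch PDF 1s s|mply @ b1g irn@ge 0f tl1e p@ge."
print(f"clean sample:   {suspicious_ratio(good):.0%} suspicious")
print(f"garbled sample: {suspicious_ratio(garbled):.0%} suspicious")
```

A high ratio on a page or two is a hint that the scan is worth redoing at a higher resolution before it goes any further into the workflow.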

That is about it for this step of the workflow. The digitising operations may seem trivial and inconsequential, but if you want to work with digital media then this stage is critical. Getting good digitisation of your material really helps in the long term.

Please comment below on what you use for this stage; I'm always eager to re-evaluate my options.

Chris
