Monday, December 28, 2015

Quick and Dirty OCR on OS X for Free

Wait, stop! Don't spend that $80+ on fancy OCR software just yet.  There are free, open source tools out there, and with a bit of work you can build a very functional workflow with them on OS X. This guide will help you set up an applet you can drop images onto to OCR them to plain text.  It will not create PDFs or formatted or indented documents; it produces one large block of text that keeps the line breaks of the source.  The following tools are used, but instructions are provided below, so you don't need to download anything from their sites.
  • Homebrew The missing package manager for OS X
  • Tesseract Open Source OCR Engine
  • Automator

Install Tools

First off, let's install Homebrew to simplify managing custom packages.  It keeps everything under the /usr/local/ directory, so it won't interfere with OS X's normal system files.  To do this, open the Terminal application, copy and paste the following command, and press Return to execute it.  You will be prompted to authorize the installation and may be asked to install some OS X command line tools directly from Apple.

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Next we need to install Tesseract and some supporting libraries.  Again in the Terminal window, enter each of these lines and press Return to execute them separately.  They may take a few minutes to complete before the command prompt returns.

brew install imagemagick
brew install tesseract --all-languages
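
Before moving on, it can help to confirm that the tools actually landed where the applet will look for them. The following is a rough sketch rather than part of the original setup; the check_tool function is a hypothetical helper made up for this example, and it only reports whether each command can be found on the PATH.

```shell
# Add the Homebrew location to the PATH, as the applet script does.
PATH="$PATH:/usr/local/bin"

# check_tool is a hypothetical helper for this example:
# it prints whether the named command can be found on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# brew is Homebrew, convert comes with ImageMagick, tesseract is the OCR engine.
for tool in brew convert tesseract; do
  check_tool "$tool"
done
```

If all three are found, running tesseract --list-langs should also show which language data is installed, including the eng language used later in this guide.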

Configure Applet

Next we need to create a user-friendly way to do the OCR.  We can easily do this with Automator in OS X.  Open Automator and create a new Application.  On the left, search for and add Run Shell Script and Display Notification, in that order.  In the Run Shell Script step, change "Pass input" to "as arguments" and add the following code.

PATH="$PATH:/usr/local/bin"
for var in "$@"
do
convert "$var" -resize 400% -type Grayscale - | tesseract -l eng - - | pbcopy 
done

As an alternative, the following code converts the entire result to a single line of text if that is preferred, but it may cause issues if the image contains columns of text.

PATH="$PATH:/usr/local/bin"
for var in "$@"
do
text=`convert "$var" -resize 400% -type Grayscale - | tesseract -l eng - -`
ocr="$ocr$text"
done
echo $ocr | pbcopy
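
The single-line behavior comes from the unquoted echo $ocr: the shell splits the text on whitespace and echo joins the pieces back with single spaces. The sketch below shows that effect on made-up sample text instead of real OCR output. It assumes bash or sh semantics; zsh does not split unquoted variables by default, so the second half shows a tr-based alternative that does not depend on the shell. The collapse_whitespace function is a hypothetical name used only for this example.

```shell
# Sample text standing in for OCR output (made up for this example).
text=$(printf 'first line\nsecond   line\n')

# Unquoted expansion: bash and sh split the value on whitespace,
# and echo joins the resulting words with single spaces.
collapsed=$(echo $text)
echo "$collapsed"

# Shell-independent alternative: turn every run of whitespace
# (spaces, tabs, newlines) into a single space with tr.
collapse_whitespace() {
  tr -s '[:space:]' ' '
}
portable=$(printf '%s' "$text" | collapse_whitespace)
echo "$portable"
```

Both approaches throw away the line breaks, which is why multi-column images tend to come out jumbled.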

Next, add some text to the notification step so you know when the task is done processing.  I added a title of "OCR Finished" and a message of "Text was copied to your clipboard.".  Then just save the application and give it a name.

Using the Applet

To use the applet, first find it and drag it down to your Dock to make a shortcut.  Then you can drag image files onto the application in the Dock and it will do its magic.  Once you drop an image onto the applet, it will take a few seconds to process, and you should see the notification pop up in the corner when it is done.  At this point you can paste the resulting text into a program of your choice, clean it up, and do what you want with it.

Other things that can be done with a bit of tweaking to the above scripts:

  • Processing multiple input files at once.
  • Saving results to a text file on the desktop or source folder instead of the clipboard.
  • Opening the resulting file automatically.
  • Removing the notification step if desired.
  • Creating a Folder Action instead to automatically run on files added to a specified folder.
  • Passing advanced tesseract options in the script, although in my experience these were not needed.
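
As one example of these tweaks, here is a rough sketch of the Run Shell Script step changed to write each result to a text file next to the source image and then open it, instead of copying to the clipboard. It reuses the convert and tesseract pipeline from above unchanged; the txt_path_for function is a hypothetical helper made up for this example, and I have not tested this variation as thoroughly as the clipboard version.

```shell
PATH="$PATH:/usr/local/bin"

# txt_path_for is a hypothetical helper for this example:
# it maps /some/folder/scan.png to /some/folder/scan.txt.
txt_path_for() {
  dir=$(dirname "$1")
  base=$(basename "$1")
  echo "$dir/${base%.*}.txt"
}

# Each dropped image gets its own text file, so several files can be processed at once.
for img in "$@"
do
  out=$(txt_path_for "$img")
  convert "$img" -resize 400% -type Grayscale - | tesseract -l eng - - > "$out"
  # open is the OS X command that launches a file in its default application.
  open "$out"
done
```

To save to the desktop instead, the echo line in the helper could build the path from "$HOME/Desktop" rather than from the source folder.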

1 comment:

Dan said...

I had a problem installing imagemagick on Mav but this fixed it.
$ brew update
$ brew install imagemagick --disable-openmp --build-from-source