Wednesday, April 29, 2009

Human digitizer

When you do a lot of research on the Internet like I do, sooner or later you end up at Google Books. The idea is that Google wants to digitize every book ever printed and place it on the Internet so it can be searched. Now this has led to some controversy around copyright, but generally it seems to be proceeding.

The question is how you can actually go about digitizing every book. You have to scan in every page and then convert that page to text so it can be searched. As anyone who has ever used a scanner (or a fax for that matter) knows, scanning something into a computer can sometimes mean illegible results. Generally, the only way to overcome this is for a real live human being to decide what the scanned text actually says. That could take years, I hear you say. Ah ha, but you didn’t figure on the power of the Internet, did you?

A huge issue on the Internet is programs (known as bots) that scan through web sites looking for email addresses. The bots report these addresses back to spammers, who add them to their lists so they can send you more junk. Bots can also be used to automate the creation of web-based email accounts and to fill in forms. Ah, what a pain! To overcome this, many sites use something called Captcha. This means that before your entry is accepted you are presented with some text that is difficult for a machine to recognize but hopefully not for a human. In this way the web page knows that the entry most probably came from a real human (assuming the text was entered correctly).

So what have bots got to do with Google Books and text recognition? You can imagine that there are many pieces of scanned text from books that need to be viewed by a human being to determine what they say. So now it’s time for the Internet and Captcha to come to the rescue.


As you can see, a site I recently visited needed to verify I was a real person, so it threw up this Captcha. However, note that down the bottom it says the words come from scanned books. So by typing in the text you are helping to digitize scanned books.

So how does the Captcha know the correct word when the scanned word can’t be read, you may ask? On the reCaptcha site you’ll find the answer:

But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
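To make that a bit more concrete, here is a rough Python sketch of the idea as I read it from the quote above. It is not reCaptcha's actual code. The class name, the method names, the image ids and the `votes_needed` threshold are all made up by me for illustration; the real service may decide "higher confidence" quite differently.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Ignore case and surrounding spaces when comparing answers."""
    return text.strip().lower()


class WordPairPool:
    """Hypothetical model of the known-word / unknown-word pairing."""

    def __init__(self, known_answers, votes_needed=3):
        # known_answers: control image id -> text that is already known
        self.known_answers = known_answers
        # Assumption: a word is accepted once this many people agree on it.
        self.votes_needed = votes_needed
        # unknown image id -> tally of what trusted users typed for it
        self.votes = defaultdict(Counter)

    def submit(self, control_id, control_guess, unknown_id, unknown_guess):
        """Return True if the user passed the check on the known word."""
        expected = normalize(self.known_answers[control_id])
        if normalize(control_guess) != expected:
            # Failed the known word, so the other answer is not trusted.
            return False
        # Passed the known word, so count their reading of the new word.
        self.votes[unknown_id][normalize(unknown_guess)] += 1
        return True

    def resolved_text(self, unknown_id):
        """Return the agreed text for a new word, or None if undecided."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        text, count = counts.most_common(1)[0]
        return text if count >= self.votes_needed else None
```

Each visitor's pair of answers goes through `submit`, and once enough people who got the known word right also agree on the new word, `resolved_text` hands back the digitized text.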

If you want a more in-depth explanation, check out the reCaptcha site, where you’ll find the answers in much more detail.

A good example of how the power of the Internet is being put to good use. Pretty clever eh?