Re: proofing searchable pdf files

To: debian-user@lists.debian.org
Subject: Re: proofing searchable pdf files
From: Scott Ferguson <scott.ferguson.debian.user@gmail.com>
Date: Sun, 02 Nov 2014 12:35:58 +1100
Message-id: <54558A7E.9020005@gmail.com>
In-reply-to: <5452DC2D.2040502@verizon.net>
References: <5452DC2D.2040502@verizon.net>

On 31/10/14 11:47, Gary Roach wrote:
> Hi all,
> 
> Problem: I am working on an archiving project and wish to archive
> documents to searchable pdf files but can't seem to figure out how to
> proof read and correct the text overlay. Any suggestions.

I'm not sure what you mean by "text *overlay*"... but my usual approach
is to edit the text content of the final output only if the font is
unique - otherwise I feed the problematic text back into the training
data.
https://code.google.com/p/tesseract-ocr/wiki/AddOns

> 
> System: Debian Wheezy Intel i5-750 processor HP Officejet Pro 8600
> wireless all in one printer/fax/scanner gscan2pdf software with
> Tesseract ocr 300 to 600 dpi scans.
> 
> Tesseract seems to do a really great job but I have no good way of 
> proving this or correcting any mistakes.

Are those the only tesseract components you have installed?
What project constraints prevent you from using the traditional
toolsets for similar projects? (What you have listed is better suited
to scanning only a few pages.)

e.g. is there a reason you are not using Terese, YAGF or Lector (or any
of the other fine interfaces that allow proof-reading)?
http://terese.sourceforge.net/
http://code.google.com/p/yagf/
https://code.google.com/p/lector/

What about the standard box file editors and trainers:-
https://code.google.com/p/tesseract-ocr/wiki/AddOns

> Some of the documents are 100 years old and may not be in such great
> shape. I can always retype everything but would like to avoid this,
> as much as possible, for obvious reasons.
> 
> Gary R.
> 
> 

Given that the default output is a standard utf-8 text file.... why are
people proposing convoluted processes to edit the text in a pdf?
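
Since the text is plain utf-8, checking how well tesseract did is just a
text comparison. A rough sketch in Python (standard library only; the
function names are mine, not part of any tesseract tool): proof-read a
copy of a page, then compare it with the raw OCR output.

```python
import difflib
from typing import List


def ocr_similarity(raw: str, corrected: str) -> float:
    """Return a 0.0-1.0 similarity between raw OCR text and the proof-read text."""
    return difflib.SequenceMatcher(None, raw, corrected, autojunk=False).ratio()


def changed_lines(raw: str, corrected: str) -> List[str]:
    """List the lines that were corrected, prefixed '- ' (OCR) and '+ ' (fixed)."""
    diff = difflib.ndiff(raw.splitlines(), corrected.splitlines())
    # ndiff also emits '  ' (unchanged) and '? ' (hint) lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

A score near 1.0 on a sample of proof-read pages suggests the rest may
need only spot checks; the changed lines show which words go wrong.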

Do you /have/ to convert tif -> pdf immediately?

It would make more sense, from my experience of working with tesseract
and auto-bookscanners, to just generate the tif files, then proof-read,
then convert to the final output format (puzzled).
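
As a minimal sketch of the first two steps (assuming the scans are
already saved as .tif files in one directory, that the tesseract binary
is on the PATH, and that the language is English; the directory layout
and function names are hypothetical):

```python
import subprocess
from pathlib import Path
from typing import List


def ocr_command(tif: Path, lang: str = "eng") -> List[str]:
    """Build the tesseract call that writes <name>.txt next to <name>.tif."""
    # tesseract appends ".txt" to the output base name itself.
    return ["tesseract", str(tif), str(tif.with_suffix("")), "-l", lang]


def ocr_directory(scan_dir: Path, lang: str = "eng") -> List[Path]:
    """OCR every .tif in scan_dir and return the text files to proof-read."""
    outputs = []
    for tif in sorted(scan_dir.glob("*.tif")):
        txt = tif.with_suffix(".txt")
        if not txt.exists():  # never overwrite text that was already proof-read
            subprocess.run(ocr_command(tif, lang), check=True)
        outputs.append(txt)
    return outputs
```

The .txt files can then be corrected in any editor against the tif
images, and the final format generated only once the text is right.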

Kind regards
