Re: PDF rendering/extraction involving indic scripts
Quoting Ritesh Raj Sarraf (2017-01-09 08:51:04)
> On Mon, 2017-01-09 at 01:05 +0100, Jonas Smedegaard wrote:
>>> I don't recollect finding any such list, when I was running into
>>> problems with PDF. I remember talking to Vasudev and he suggested
>>> me your name, hoping you may have more insight into Fonts and PDF
>>> in general.
>>
>> Which problems did you run into, Ritesh, more concretely?
>
> Mostly with Indic text extraction from the PDF files. What is rendered
> in the PDF doesn't get exported as text.
Ah, extraction _from_ PDF. Yes, that is a pain, because it is
technically *not* possible to do reliably!
PDF is compiled output from drawing instructions, *not* a source format:
It was invented as the digital equivalent of paper - just as you can
scan a piece of paper but not be certain if you semantically got a
circle or the letter "o" or the digit "0", you can parse a PDF document
but not be certain if e.g. elements close to each other belong together.
PDF reverse engineering - a.k.a. PDF content extraction - is sometimes
possible, and more likely when the same tools are used to produce and
extract. That trick is (ab)used in particular by the inventor of PDF -
Adobe - and that has no doubt added to the confusion (if not caused it).
Always call it "PDF files" (not specific brands), and never _depend_ on
ability to extract content (only proper source is reliable)!
Here are console tools for all known¹ PDF extraction libraries, tested
on a single² PDF file containing English and Devanagari content:
* Successfully extracts some Devanagari:
  * pdftotext (lib:poppler pkg:poppler-utils)
  * pdftohtml (lib:poppler pkg:poppler-utils)
  * pdf2htmlex (lib:pdf.js pkg:pdf2htmlex)
  * pdf2txt (lib:pdfminer pkg:python-pdfminer)
* Extracts complete text streams (maybe decodable separately):
  * pdfextract (lib:origami pkg:origami-pdf)
  * mutool (lib:mupdf pkg:mupdf-tools)
* Fails to extract complete text - skipping Devanagari:
  * ps2ascii (lib:gs pkg:ghostscript)
  * pstotext (lib:gs pkg:pstotext)
  * podofotxtextract (lib:podofo pkg:libpodofo-utils)
* Fails to extract any text at all (or I used it wrongly):
  * pdftosrc (lib:poppler pkg:texlive-binaries)
  * getpdftext (lib:cam-pdf pkg:libcam-pdf-perl)
* Untested (and relevant: uses an untested library):
  * pdfsam (lib:itext pkg:pdfsam)
  * pdfbox (lib:pdfbox pkg:libpdfbox-java)
  * pkg:php-tcpdf
  * pkg:libcamlpdf-ocaml
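
For anyone wanting to repeat the test on other files, here is a minimal
Python sketch of the kind of check behind the first and third groups
above. It is not the procedure actually used for the list: the function
names and the three buckets are made up for illustration, and it only
assumes that pdftotext from poppler-utils is in PATH. Note that finding
Devanagari code points says nothing about whether they came out in the
right order.

```python
import subprocess

# Basic Devanagari Unicode block (U+0900..U+097F); extended blocks are
# ignored in this sketch.
DEVANAGARI = range(0x0900, 0x0980)


def count_devanagari(text: str) -> int:
    """Count code points that fall in the basic Devanagari block."""
    return sum(1 for ch in text if ord(ch) in DEVANAGARI)


def classify(text: str) -> str:
    """Put extracted text into one of three rough (hypothetical) buckets."""
    if not text.strip():
        return "no text"
    if count_devanagari(text) == 0:
        return "text, but no devanagari"
    return "some devanagari"


def extract_with_pdftotext(pdf_path: str) -> str:
    """Run pdftotext and return its output; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling classify(extract_with_pdftotext("sample.pdf")) on a file of your
own (the file name is a placeholder) gives the bucket for that tool.
Other tools can be swapped in by changing the command line.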
NB! The list only includes tools with varying _extraction_ features,
which are typically limited by a single underlying library. Popular
examples already covered that way are OpenOffice (lib:poppler) and
Scribus (lib:podofo).
I care about PDF rendering and extraction, but I lack knowledge on Indic
scripts and am unable to spot crucial flaws like misplaced or garbled
glyphs, or (for rendering) wrong spacing.
If anyone knows about alternative Free tools (with _different_
extraction features!), please let me know!
Please also share more sample texts with me - both source and rendered
PDFs - for multiple Indic scripts.
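
With both source and rendered PDF at hand, extraction can be checked
without reading the script: compare the source text with what a tool
extracts. Below is a rough Python sketch of that idea, using only the
standard library. The function names are made up, and the similarity
ratio is a crude measure under the assumption that whitespace and
layout do not matter, not a verdict on correctness.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace, since layout rarely survives."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def similarity(source: str, extracted: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize(source), normalize(extracted)
    )
    return matcher.ratio()
```

A ratio below 1.0 flags a difference that someone who reads the script
should look at, for example a vowel sign extracted before its consonant
instead of after it.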
- Jonas
¹ Only code in Debian is truly known; only Free code can become known.
² A sample text for a Free font authored by a friend of mine:
https://github.com/cyrealtype/Sumana/raw/master/Samples/Sumana%20Poster.pdf
--
* Jonas Smedegaard - idealist & Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/
[x] quote me freely [ ] ask before reusing [ ] keep private