Re: PDF rendering/extraction involving indic scripts

To: rrs@researchut.com, Ritesh Raj Sarraf <rrs@researchut.com>, "Abhijit A. M." <abhijit13@disroot.org>, debian-dug-in@lists.debian.org
Cc: Siri Reiter <siri@jones.dk>
Subject: Re: PDF rendering/extraction involving indic scripts
From: Jonas Smedegaard <jonas@jones.dk>
Date: Mon, 09 Jan 2017 16:09:11 +0100
Message-id: <148397455120.2347.3555704517203462018@auryn.jones.dk>
In-reply-to: <1483948264.12261.5.camel@researchut.com>
References: <90d222f7-22c9-0e31-38d8-e32d5d11b66d@disroot.org> <1483896846.12261.1.camel@researchut.com> <148389854251.2347.10605487073741548101@auryn.jones.dk> <1483899942.12261.3.camel@researchut.com> <148392032510.2347.14272967871766835336@auryn.jones.dk> <1483948264.12261.5.camel@researchut.com>

Quoting Ritesh Raj Sarraf (2017-01-09 08:51:04)
> On Mon, 2017-01-09 at 01:05 +0100, Jonas Smedegaard wrote:
>>> I don't recollect finding any such list, when I was running into
>>> problems with PDF. I remember talking to Vasudev and he suggested
>>> me your name, hoping you may have more insight into Fonts and PDF
>>> in general.
>>
>> Which problems did you run into, Ritesh, more concretely?
>
> Mostly with Indic text extraction from the PDF files. What is rendered 
> in the PDF doesn't get exported as text.

Ah, extraction _from_ PDF.  Yes, that is a pain, because it is 
technically *not* possible to do reliably!

PDF is compiled output from drawing instructions, *not* a source format: 
It was invented as the digital equivalent of paper - just as you can 
scan a piece of paper but not be certain if you semantically got a 
circle or the letter "o" or the digit "0", you can parse a PDF document 
but not be certain if e.g. elements close to each other belong together.
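
To illustrate why "close to each other" is only a guess, here is a 
minimal sketch in Python.  The glyph records and the group_glyphs 
helper are hypothetical, made up for this example (no PDF library 
exposes exactly this), and a real extractor has far more to juggle:

```python
# Hypothetical glyph records: (x_position, advance_width, character).
# A PDF page holds positioned drawing operations, not words, so an
# extractor has to guess word boundaries from the geometry.

def group_glyphs(glyphs, max_gap):
    """Join glyphs on one line into words, splitting wherever the
    horizontal gap between two glyphs exceeds max_gap."""
    words, current, prev_end = [], "", None
    for x, width, char in sorted(glyphs):
        if prev_end is not None and x - prev_end > max_gap:
            words.append(current)
            current = ""
        current += char
        prev_end = x + width
    if current:
        words.append(current)
    return words

# Four glyphs with a gap of 2 units between "o" and "d".
glyphs = [(0, 5, "t"), (5, 5, "o"), (12, 5, "d"), (17, 5, "o")]

print(group_glyphs(glyphs, max_gap=1))  # ['to', 'do']
print(group_glyphs(glyphs, max_gap=3))  # ['todo']
```

Same drawing instructions, two different readings, and nothing in the 
file says which threshold the author had in mind.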

PDF reverse engineering - a.k.a. PDF content extraction - is sometimes 
possible, and more likely when the same tools are used to produce and 
extract.  That trick is (ab)used in particular by the inventor of PDF - 
Adobe - and that has no doubt added to the confusion (if not caused it).

Always call it "PDF files" (not specific brands), and never _depend_ on 
the ability to extract content (only the proper source is reliable)!

Here are console tools for all known¹ PDF extraction libraries, tested 
on a single² PDF file containing English and Devanagari content:

  * Successfully extracts some Devanagari:
    * pdftotext (lib:poppler pkg:poppler-utils)
    * pdftohtml (lib:poppler pkg:poppler-utils)
    * pdf2htmlex (lib:pdf.js pkg:pdf2htmlex)
    * pdf2txt (lib:pdfminer pkg:python-pdfminer)
  * Extracts complete text streams (maybe decodable separately):
    * pdfextract (lib:origami pkg:origami-pdf)
    * mutool (lib:mupdf pkg:mupdf-tools)
  * Fails to extract complete text - skipping Devanagari:
    * ps2ascii (lib:gs pkg:ghostscript)
    * pstotext (lib:gs pkg:pstotext)
    * podofotxtextract (lib:podofo pkg:libpodofo-utils)
  * Fails to extract any text at all (or I use it wrongly):
    * pdftosrc (lib:poppler pkg:texlive-binaries)
    * getpdftext (lib:cam-pdf pkg:libcam-pdf-perl)
  * Untested (and relevant: uses untested library):
    * pdfsam (lib:itext pkg:pdfsam)
    * pdfbox (lib:pdfbox pkg:libpdfbox-java)
    * pkg:php-tcpdf
    * pkg:libcamlpdf-ocaml

NB! The list only includes tools whose _extraction_ features differ, 
which typically comes down to the single underlying library.  Popular 
applications already covered that way are OpenOffice (lib:poppler) and 
Scribus (lib:podofo).
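
For anyone wanting to repeat the test on other files, the first 
category above ("extracts some Devanagari") can be checked crudely by 
counting code points from the Unicode Devanagari block (U+0900 to 
U+097F) in whatever a tool prints.  A minimal sketch in Python, 
assuming pdftotext from poppler-utils is installed; the helper names 
are made up for this example:

```python
import subprocess

# Unicode block "Devanagari": U+0900 up to and including U+097F.
DEVANAGARI = range(0x0900, 0x0980)

def devanagari_count(text):
    """Count characters that fall inside the Devanagari block."""
    return sum(1 for ch in text if ord(ch) in DEVANAGARI)

def extract_with_pdftotext(pdf_path):
    """Run pdftotext and return its output as a string.
    The trailing "-" asks pdftotext to write to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

# Works on any string, so it can be tried without a PDF at hand:
print(devanagari_count("Sumana सुमन"))  # 4
```

A count above zero only tells that some Devanagari code points came 
out.  It says nothing about whether they are the right ones or in the 
right order, which still takes someone who reads the script.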

I care about PDF rendering and extraction, but I lack knowledge of Indic 
scripts and am unable to spot crucial flaws like misplaced or garbled 
glyphs, or (for rendering) wrong spacing.

If anyone knows about alternative Free tools (with _different_ 
extraction features!), please let me know!

Please also share more sample texts with me - both source and rendered 
PDFs - for multiple Indic scripts.

 - Jonas

¹ Only code in Debian is truly known; only Free code can become known.

² A sample text for a Free font authored by a friend of mine: 
https://github.com/cyrealtype/Sumana/raw/master/Samples/Sumana%20Poster.pdf

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private
