User:Xover/DjVu
DjVu stuff
General DjVu process
- Grab the jp2's from IA
  - The jp2's are typically higher resolution than the jpg's
- Adjust the image files to match with book pages
  - In particular, delete any scan-artifact first and last pages before or after the actual book covers
- Use GraphicsMagick to convert the jp2's to jpg's
  - Since DjVuLibre can't read JPEG2000
- Use DjVuLibre (c44) to generate single-page .djvu files from the page images
- Use Tesseract to do OCR of each page, spitting out .hocr files
- Write some custom code to:
  - Merge the individual .djvu's into a multi-page .djvu for the whole book
  - Parse the hOCR data from Tesseract and generate DjVuLibre s-expressions
  - Use djvused to add a hidden text layer to the book .djvu
- Upload to Commons
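A rough sketch of driving the per-page steps, in Perl since the library notes below assume Perl, by wrapping the command-line tools. The page-*.jp2 naming and the error handling are assumptions, not fixed decisions; merging and the text layer are sketched under the manual workflow further down.

#!/usr/bin/perl
# Per-page conversion: jp2 -> jpeg -> single-page DjVu + hOCR.
use strict;
use warnings;

my @jp2 = glob 'page-*.jp2';            # assumed naming for the IA page images
die "no jp2 pages found\n" unless @jp2;

for my $jp2 (sort @jp2) {
    (my $base = $jp2) =~ s/\.jp2$//;
    # DjVuLibre can't read JPEG2000, so convert with GraphicsMagick first
    system('gm', 'convert', $jp2, "$base.jpeg") == 0 or die "gm convert failed: $jp2\n";
    # Encode the page image as a single-page .djvu
    system('c44', "$base.jpeg", "$base.djvu") == 0 or die "c44 failed: $base\n";
    # OCR the page; tesseract writes $base.hocr on its own
    system('tesseract', "$base.jpeg", $base, '-l', 'eng', 'hocr') == 0
        or die "tesseract failed: $base\n";
}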
Rough outline algorithm notes
- Relevant libraries:
  - HTML::Parser (hOCR is an HTML-based microformat) (link to spec here)
  - Use LWP or something to do the download and upload steps?
  - Look for something to help with the parser logic or state machine?
  - Tesseract
    - Is there a decent library for this so we won't have to wrap the command-line?
  - GraphicsMagick
    - What happened to PerlMagick? Where are the bindings for this?
- Use a simple pseudo-state machine for each level of hOCR data (sketched below):
  - There's some overall OCR data that can probably be ignored (it'll be per-page in this case)
  - hOCR supports columns, but ignore these for now (too complicated)
  - First state will be HOCR_PAGE
    - Maybe ignore this for OCR purposes and just use it to determine the right DjVu page to add the hidden text layer to?
  - Second state will be HOCR_PARA
    - Is it worth mapping this to DjVuLibre's equivalent concept? Maybe just ignore it.
  - Third state will be HOCR_LINE
  - Fourth state will be HOCR_WORD
  - Fifth possible state will be HOCR_CHAR, which DjVuLibre supports, but I don't think it's worth dealing with
  - Each parsing state is a constant
  - Need a global var or lightweight object to keep track of current state
- HTML::Parser is event-driven
  - Need to catch start tag events and end tag events
  - Need to check for valid events in each given state (not too many: hOCR is strictly nested and general HTML can be ignored; no tagsoup)
- Build a tree in memory, or implement this as a streaming algorithm spitting out the sexprs as we go along?
  - Is it worthwhile to spend time on a generic data structure for this that can be serialized to many formats?
  - Maybe it makes more sense to write it as a straight hocr2sexpr converter and spit out per-page .sexpr files?
    - This would make the overall algorithm dumber-but-simpler
    - And given we'll be wrapping command-line utilities in any case, we can't avoid the "dumb" part. Maybe try to get the "simple" part too then?
  - Then again, a fully-streaming implementation is probably not that much more complicated, all things considered, provided we can rely on djvused not crapping out on us too much
    - That's a big if: for book-length djvu's, the DjVuLibre tools have crapped out rather a lot
    - Maybe it's better if we can operate on a "page-at-a-time" level?
    - Then again, we need to keep track of at least HOCR_PAGE and HOCR_LINE to generate working sexprs
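To make the "straight hocr2sexpr converter" idea concrete, a minimal sketch using HTML::Parser's event API. It only handles the ocr_page / ocr_line / ocrx_word elements (the HOCR_PAGE / HOCR_LINE / HOCR_WORD states above); paragraphs, columns, and character boxes are ignored, and the file naming and output layout are assumptions.

#!/usr/bin/perl
# hocr2sexpr: read one Tesseract .hocr page, emit a djvused (page ...) s-expression.
use strict;
use warnings;
use HTML::Parser;

my ($page_w, $page_h);   # page size; the height is needed to flip the y axis
my @lines;               # finished (line ...) sexprs
my $line;                # the line currently being assembled
my $in_word = 0;         # true while inside an ocrx_word span
my (@word_bb, $word_txt);

# Pull "bbox x0 y0 x1 y1" out of an hOCR title attribute
sub bbox { ($_[0] // '') =~ /bbox (\d+) (\d+) (\d+) (\d+)/ ? ($1, $2, $3, $4) : () }
# hOCR measures y from the top-left corner, DjVu from the bottom-left
sub flip { my ($x0, $y0, $x1, $y1) = @_; ($x0, $page_h - $y1, $x1, $page_h - $y0) }

my $p = HTML::Parser->new(api_version => 3);
$p->handler(start => sub {
    my ($attr) = @_;
    my $class = $attr->{class} // '';
    my @bb = bbox($attr->{title}) or return;
    if    ($class eq 'ocr_page')  { ($page_w, $page_h) = @bb[2, 3] }
    elsif ($class eq 'ocr_line')  { $line = { bb => [@bb], words => [] } }
    elsif ($class eq 'ocrx_word') { $in_word = 1; @word_bb = @bb; $word_txt = '' }
}, 'attr');
$p->handler(text => sub { $word_txt .= $_[0] if $in_word }, 'dtext');
$p->handler(end => sub {
    my ($tag) = @_;
    return unless $tag eq 'span';   # words and lines are both spans in Tesseract's hOCR
    if ($in_word) {
        $in_word = 0;
        (my $t = $word_txt) =~ s/^\s+|\s+$//g;
        $t =~ s/(["\\])/\\$1/g;     # escape for the sexpr string literal
        push @{ $line->{words} }, sprintf('(word %d %d %d %d "%s")', flip(@word_bb), $t)
            if $line and length $t;
    } elsif ($line) {
        push @lines, sprintf('(line %d %d %d %d %s)',
            flip(@{ $line->{bb} }), join(' ', @{ $line->{words} }));
        undef $line;
    }
}, 'tagname');

my $hocr = shift @ARGV or die "usage: $0 page.hocr > page.sexpr\n";
$p->parse_file($hocr) or die "cannot read $hocr\n";
die "no ocr_page element found in $hocr\n" unless defined $page_h;
printf "(page 0 0 %d %d\n %s)\n", $page_w, $page_h, join("\n ", @lines);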
Most manual workflow
gm mogrify -format jpeg '*.jp2'
tesseract inputpage.jpeg inputpage -l eng hocr
- At a minimum need custom code to convert .hocr to .sexpr
  - NB! hOCR (top-left origin) and sexpr (bottom-left origin) use different coordinate systems!
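  - E.g., if the page is H pixels tall, an hOCR bbox (x0 y0 x1 y1) maps to (x0, H - y1, x1, H - y0) in the sexpr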
c44 inputpage.jpeg inputpage.djvu
djvm -c output.djvu page1.djvu … pageN.djvu
djvused -s -e 'select N; set-txt [page sexpr file]' output.djvu
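A rough sketch of wrapping those last two steps from Perl, assuming the single-page files are named page-*.djvu (zero-padded so they sort correctly) and the hOCR converter above has left matching .sexpr files alongside them.

#!/usr/bin/perl
# Bundle the single-page DjVus and add the hidden text layer page by page.
use strict;
use warnings;

my @pages = sort glob 'page-*.djvu';    # assumed naming from the earlier steps
die "no single-page djvu files found\n" unless @pages;

# Merge the single-page files into one bundled book
system('djvm', '-c', 'output.djvu', @pages) == 0 or die "djvm failed\n";

# Page N in the bundle corresponds to the Nth single-page file
for my $n (1 .. @pages) {
    (my $sexpr = $pages[$n - 1]) =~ s/\.djvu$/.sexpr/;
    next unless -s $sexpr;              # skip pages with no OCR output
    system('djvused', '-s', '-e', "select $n; set-txt $sexpr", 'output.djvu') == 0
        or die "djvused failed on page $n\n";
}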