Help:DjVu files


Shortcut:
H:DV

This page explains how to create, use, and upload files in the DjVu format, which groups scanned images into a single container format.

Image extraction

Shortcut:
H:DJVUIMG

DjVu files generally have very heavy image compression that is optimised for text. This results in severe damage to image quality for illustrations and photographs. In general, it is better not to extract images from DjVu files and instead use more original files, for example, the page JP2s at the Internet Archive. Help:Image extraction contains more guidance.

Conversion

Images to DjVu

Windows

DjvuToy is a tool that provides several functions:

  • make a DjVu file
  • merge DjVu files
  • split DjVu files
  • edit DjVu files
  • generate a bundled file
  • export from DjVu to another format
  • extract text from a DjVu file
  • extract DjVu file structure information (e.g. OCR coordinates)
Images → virtual printer → DjVu

If the page scans are made available as a PDF file, e.g. Google Books scans, then this can be directly converted into a DjVu file using one of the following:

  • The free Any2DjVu online service; this can also OCR the text and embed it in the .djvu file.
  • The freeware Pdf To Djvu GUI. Note that it requires the Cygwin environment to be installed first.
  • The freeware Djvu-Spec Pdf 2 Djvu Converter from the djvu-spec.narod.ru software page, a command-line tool for Windows that also has a GUI. It offers many settings for tuning the quality and size of the resulting DjVu file.
  • The free software command-line pdf2djvu (available in repositories, also for Linux), which is usually as simple as pdf2djvu -o output.djvu input.pdf. There's also a GUI available.
  • If you need to crop the PDF document, you can use pdfcrop.pl (see below) for black margins or freeware Govert's PDF Cropper for white margins (it requires Ghostscript and .Net 2.0).

If the scanned images are made available as individual images, then the easiest option is to print them to a PDF document via one of the many "virtual printer" tools, such as the free PDFCreator; then convert the PDF document to DjVu as described above.

Note that there are many other options for converting pages to .djvu. One could convert using PostScript or multipage TIFF as the intermediate format, rather than PDF, but this would of course require different conversion tools. It is also possible to convert from .pdf or .ps to .djvu with the DjVuLibre software and its GSDjVu plug-in but due to licensing restrictions installing the plug-in is a fairly intricate process that involves compiling a patched version of Ghostscript.

Another free Windows tool that can come in handy for the images-to-PDF-to-DjVu process is ConcatPDF, a GUI tool for easily splitting and merging PDF files. It can also be used online. For example, if a 100-page document has already been scanned and converted to .djvu and page 42 alone needs to be re-scanned, ConcatPDF allows that one page to be inserted into the intermediate .pdf file without tracking down the other page images and re-composing the entire document. ConcatPDF version 1.1 requires the free Microsoft .NET Framework Version 1 and the corresponding Visual J# .NET Redistributable Package to be installed beforehand.

Images directly to DjVu

However, a far higher quality document can be achieved using the DjVuLibre software library. JPEG images can be encoded directly into individual DjVu pages using the c44 encoder. Images in lossless formats such as PNG should be converted to PPM (for colour scans) or PGM (for greyscale scans), then encoded using c44. For bitonal (i.e. black-and-white) scans, such as most page text images, a smaller DjVu file can be obtained by converting the page images to the monochrome PBM format, then encoding to DjVu using the cjb2 encoder. All of these image format conversions can be performed by the free ImageMagick library (in batch, with mogrify). Individual DjVu pages can be aggregated into a multi-page DjVu using the djvm program; this program can also be used to insert or delete pages from a DjVu file.

An important caveat of this process is that high-quality scans come at the cost of larger files, and there is currently a 100 MB limit on uploads to Commons. The size can be substantially reduced by applying foreground/background separation with didjvu and/or minidjvu.

Scripting DjVuLibre

This script takes a whole directory of image files (JPG, PNG, GIF, TIFF, or any other format that ImageMagick can convert to PPM) and automatically converts and collates them into a DjVu file. The script is currently written for Windows, but it can easily be adapted for Linux. To use it, you will need Python, ImageMagick and DjVuLibre.
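The script referred to above is not reproduced on this page. As a rough illustration only, the same idea could be sketched in Python as follows. The function names plan_commands and run_plan are hypothetical, the sketch assumes that convert, cjb2, c44 and djvm are on the PATH, and it always goes through a PBM or PPM intermediate file rather than feeding JPEGs to c44 directly.

```python
import subprocess
from pathlib import Path


def plan_commands(images, output, bitonal=False):
    """Return the commands that turn the given page images into one DjVu file.

    Hypothetical helper: bitonal scans go through PBM and cjb2,
    everything else goes through PPM and c44.
    """
    encoder, suffix = ("cjb2", ".pbm") if bitonal else ("c44", ".ppm")
    commands, pages = [], []
    for image in sorted(images):  # page order follows the file names
        stem = Path(image).with_suffix("")
        intermediate = f"{stem}{suffix}"
        page = f"{stem}.djvu"
        commands.append(["convert", str(image), intermediate])  # ImageMagick
        commands.append([encoder, intermediate, page])          # one DjVu page
        pages.append(page)
    # Bundle all single pages into the final document.
    commands.append(["djvm", "-c", str(output), *pages])
    return commands


def run_plan(commands):
    """Execute the planned commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_plan(plan_commands(["scan-001.png", "scan-002.png"], "book.djvu", bitonal=True)) would then run the whole chain; printing the plan first is a cheap way to check it before anything is executed.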

Linux

See also: User:GrafZahl/How to digitalise works for Wikisource
Method 0 - converting graphic files with foreground/background separation

Just use didjvu.

You may consider preprocessing the scans with Scan Tailor.

Method 1 - page at a time with DjVuLibre

You need the DjVuLibre software, which includes a viewer and some tools for creating and handling DjVu files. You will probably also need ImageMagick for converting scans from one format to another:

  • The tool cjb2 creates a DjVu file from a (bitonal) PBM or TIFF file.
  • The tool c44 creates a DjVu file from a PNM or JPEG file. It handles colour images, but the compression is lower.

Therefore you need to convert your scans if they are not already in one of these formats.

Conversion to intermediate format

The DjVu encoders cannot read JP2 or PNG directly, so you first need to convert your scans to a format they accept as input. Options include PBM (turns all pixels black or white, no shades of grey); PGM (greyscale, lossless); or JPEG (lossy compression optimized for photographs).

  • Conversion from PNG format to PBM format with the tool convert from Imagemagick
convert filename-000.png filename-000.pbm
  • Depending on the quality of the original scans, you may find it useful to process them with the unpaper utility, which deletes black borders around the pages and aligns the scanned text squarely on the page. Unpaper can also extract two separate page images where facing pages of a book have been scanned into a single image. Other utilities are mkbitmap and pdfcrop.pl (Perl-based free software that requires Ghostscript and, on Ubuntu, texlive-extra-utils; it uses the BoundingBox and can crop a whole multi-page PDF document in a single pass). PDFCrop (a different tool with the same name) deletes white margins.
Conversion to DJVU page file
  • Creation of a DjVu file from a PBM file (this command will not work for PGM or JPEG):
cjb2 -clean filename-000.pbm filename-000.djvu
  • Creation of a DjVu file from a PGM or JPEG file:
c44 -dpi 300 p100.jpg p100.djvu

(In this example, the JPEG is declared to have a resolution of 300 dpi. The -dpi argument may be left out; the default value is 100.)

Creating final DJVU document
  • Adding the DJVU file to the final document
djvm -i filename.djvu filename-000.djvu

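You need to repeat these steps for each page of the book, which is best done with a script. Example:

#!/bin/bash
# Convert filename-001.png ... filename-009.png and collect them in filename.djvu.
for n in $(seq -f '%03g' 1 9)
do
        j="filename-$n"
        convert "$j.png" "$j.pbm"
        cjb2 -clean "$j.pbm" "$j.djvu"
        if [ -f filename.djvu ]
        then
                djvm -i filename.djvu "$j.djvu"   # add the page to the existing document
        else
                djvm -c filename.djvu "$j.djvu"   # first page: create the document
        fi
done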

There is also another way to add all the *.djvu parts into one:

djvm -c filename.djvu filename-000.djvu filename-001.djvu filename-002.djvu

See the following section for an automated process for multiple pages.

Method 2 - PDF to DjVu bash script

Use this script, which converts a PDF document (single or multi-page) into images, automatically crops them with ImageMagick, converts them to DjVu and bundles them. This is very slow (a large PDF document can take days) but a little more efficient than the following method.

The resulting DjVu document is quite big and of low quality, probably because of poor font recognition. Newer versions of poppler (the library the script uses) may fix this: the version available in repositories is usually several months old.[1]

You can also remove the pdftoppm part and use the script to convert multiple images directly into a multi-page DjVu document. If the images are not in PBM format, you can convert them all with a single command using mogrify from ImageMagick.

Method 3 - pdf2djvu

Simply install the pdf2djvu tool from your distribution's repository to convert a PDF document (single or multi-page) directly into DjVu.

If the document contains the results of OCR (as is the case with, e.g., FineReader output), they are preserved in the DjVu document as the hidden text layer. Some other properties of the source document, including metadata, are also preserved. The quality and size of the output depend primarily on the features of the source document, but can also be controlled with several program parameters, such as the resolution of the foreground and background.[2] The program can use several threads to speed up the conversion.

As of 2019, file size on Wikimedia Commons is less important than image quality (although PDFs around 1 GiB in size can have problems with thumbnails). The simplest way to increase quality is to change --bg-subsample (default 3, max 12) to 2 or 1 (best quality).[3]

An example command may therefore be:

 pdf2djvu -j0 --bg-subsample=1 -o output.djvu input.pdf
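When the conversion is driven from a script, it can help to build the command line in one place and reject out-of-range values before pdf2djvu is started. The following is a minimal sketch under that assumption; the helper names pdf2djvu_command and convert_pdf are hypothetical, and only the options shown in the example command above are used.

```python
import subprocess


def pdf2djvu_command(pdf, djvu, bg_subsample=1, jobs=0):
    """Build a pdf2djvu command line (hypothetical helper).

    bg_subsample follows the range given in the text above (1 to 12);
    jobs is passed as -jN, as in the example command.
    """
    if not 1 <= bg_subsample <= 12:
        raise ValueError("bg_subsample must be between 1 and 12")
    return ["pdf2djvu", f"-j{jobs}", f"--bg-subsample={bg_subsample}",
            "-o", str(djvu), str(pdf)]


def convert_pdf(pdf, djvu, **options):
    """Run the conversion; raises CalledProcessError if pdf2djvu fails."""
    subprocess.run(pdf2djvu_command(pdf, djvu, **options), check=True)
```

With the defaults, pdf2djvu_command("input.pdf", "output.djvu") yields the same arguments as the example command.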
Note on cropping

With pdf2djvu, you need to crop the PDF itself before the conversion. On Linux this can be quite difficult. You could use ImageMagick's convert -crop, but be careful: with a big multi-page PDF document this can take several GB of memory (the limit is 16 TB!) and bring your computer to a halt unless you add the -limit area 1 option directly after -crop. This makes the conversion very slow.

When using ImageMagick, the resulting PDF document grows in size and loses quality because of rasterisation.[4]

See other crop tools above.

Method 4 - DjVuDigital

Use djvudigital,[5] which, like pdf2djvu, converts PDF directly into DjVu.[6] Because of licensing problems (the GSDjVu library has a different licence), you will need to compile it yourself; the included utilities make this step quite easy, but it is still long (about 1 hour) and a bit annoying.[7]

Once that is done, you can convert a PDF document into DjVu with a single command (see the previous section for cropping). The conversion is slow (I find it will complete a 300-page PDF document in about 30-40 minutes). The resulting DjVu is of higher quality and smaller than the output of either of the previous two methods.[1] Additionally, DjVuDigital can handle JPEG2000 (aka JPX) files embedded in PDF documents, which is a feature of many Google Books scans. pdf2djvu, Any2Djvu and Internet Archive conversions all fail to convert these files, leaving blank pages in the output.

DjVuDigital has many advanced options to improve results, but they can be difficult to master.[8] In general, altering the --dpi option can give you a quick reduction in file size without too much fiddling.

Online ([almost] all systems)

Any2Djvu

Another method to convert the images to DjVu is to zip them and use the Any2Djvu site to create the DjVu file. Any2Djvu will extract the images from the zip archive and create an OCRed DjVu. The OCR works only with English text.

Any2Djvu cannot handle huge files. Big files are best dealt with if you upload them by URL (e.g. by entering a link like ftp://ftp.bnf.fr/005/N0051165_PDF_1_-1DM.pdf). Conversion can take several hours. Any2Djvu will sometimes run out of memory on large or highly-detailed files and fail. It will also not convert "JPX" images embedded into PDF documents, which are common in Google Books scans.

The Internet Archive

Another method is to upload a PDF document (or an archive of image files) to the Internet Archive. You need to log in (don't use OpenID; it won't work[9]).

Uploading

Click "Upload" at the top-right corner. The Flash uploader (the standard "Share" button) does not work with Firefox (use Opera or Internet Explorer instead[10]) or on Linux. You can use the standard JavaScript non-Flash method instead (with Firefox there is a file size limit of 2 GB, but not with Chromium). FTP upload is deprecated because it is slower and prone to crashing, but it is the only easy-to-learn option if you have to upload many files (which shouldn't be the case here).

OCR tricks

When the upload has completed, the Internet Archive starts the "derive" work: OCR to create an XML document of the detected text from the uploaded PDF file, conversion of that to a DjVu file with embedded text, and creation of a plain-text dump file, among others.[11]

Don't forget to set the correct language in the metadata before the derive starts (it runs automatically after upload if there is something to derive); otherwise the OCR language will be set to English and results will be poor for works in any other language. It is not possible to set multiple OCR languages, but you can upload the same book twice, once for each language, to get two OCRs.[12] The processing time depends on the size and complexity of your file, as well as on the current Internet Archive backlog of conversion tasks.[13] You can check your progress in the queue here and more detailed information about jobs you submitted here (must be logged in).

The Internet Archive uses professional, proprietary, commercial ABBYY software[14] that gives quite good image and OCR output in many languages and fonts, together with aggressive compression[15] that maintains a high quality in the final DjVu file.[1] However, the Internet Archive sometimes produces over-compressed DjVu files of poor quality. If this happens, you can often download a PDF document and convert it manually. The resolution the derivation aims at is normally guessed automatically; you can lower it via the fixed-ppi field, setting it to 300 (dpi) or lower to reduce sizes, processing time and (sometimes) errors.

Image formats

Book scans split into several tiff, jpg, jp2 format images (other formats are not accepted) are converted ("derived") as well, if you put them in a properly created tar or zip archive.[16] It's usually better to upload uncompressed scans or JPEGs; the jp2 files produced in the derivation process are compressed in a way you won't be able to emulate without a lot of effort.

Troubleshooting

If you have severe problems with the derive process and need admin intervention (tasks shown in red in your task list), ask for help at info@archive.org; they are usually amazingly helpful. General requests for help should go to the forums, though: don't bother them for nothing!


DjVu to text

OCR via Any2DjVu

The OCR option of the free conversion service Any2DjVu does OCR the scanned image, but the resulting text is embedded within the .djvu file itself and must be extracted before it can be used on Wikisource.

One way to do this is to use the DjVuLibre software to extract the text, via a command like

djvused myfile.djvu -e 'print-pure-txt' > myfile.txt

or

 djvutxt myfile.djvu > myfile-ocr.txt 

JVbot can automatically upload the text layer of a DJVU to the pages on Wikisource. For example, Robert the Bruce and the struggle for Scottish independence - 1909.

OCR via the Internet Archive

See above: if you upload a DjVu file, the derive process will OCR it.

OCR with Tesseract

OCR can be done with Tesseract, a free OCR engine, and a script:

OCR with Tesseract 3.x and other free OCR engines

Use ocrodjvu.

DjVu to Images

Linux

To extract a page image from a DjVu file, you can use ddjvu:

ddjvu -page=8 -format=tiff myfile.djvu myfile.tif

If you extract all the pages (by omitting -page=...), you can split the resulting multi-page TIFF into single-page PNG files (or any other format):

convert -limit area 1 myfile.tif myfile.png

Extract all pages to single-page TIFF files with 80% quality:

ddjvu -format=tiff -eachpage -quality=80 myfile.djvu myfile-%03d.tiff

Manipulating

There's some advice about manipulating DjVu files or images to be used to generate DjVu elsewhere:

Splitting DjVu files

The DjVu documents come in two flavours: bundled and unbundled (indirect); the latter format stores every page in a separate file. The comment below made by the original author concerns only bundled documents, which should be avoided.

Large works cannot be uploaded to Wikimedia servers, which have a 100 MB upload limit. To split the DjVu, use DjVuLibre "Save as" and specify a page range that produces a file small enough to be uploaded. Some trial and error may be necessary.

The easiest way to split DjVu files from the command line is with djvmcvt:

 mkdir mydoc/ &&
 djvmcvt -i 'mydoc.djvu' 'mydoc/' 'new-mydoc-index.djvu'

Alternatively, djvused can be used from the command line:

 djvused myfile.djvu -e 'select 10; save-page-with p10.djvu'

This can be done for every page. To find out the number of pages in the file:

 djvused myfile.djvu -e 'n'
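Saving every page by hand quickly becomes tedious, so the two djvused calls above can be combined in a small loop. The sketch below is one possible way to do that in Python; the helper names and the p prefix for the output files are assumptions made for illustration, and djvused must be on the PATH.

```python
import subprocess


def page_count_command(document):
    """Command that prints the number of pages in the document."""
    return ["djvused", str(document), "-e", "n"]


def save_page_commands(document, page_count, prefix="p"):
    """One djvused command per page, writing p1.djvu, p2.djvu and so on."""
    return [
        ["djvused", str(document), "-e",
         f"select {page}; save-page-with {prefix}{page}.djvu"]
        for page in range(1, page_count + 1)
    ]


def split_document(document):
    """Ask djvused for the page count, then save every page separately."""
    result = subprocess.run(page_count_command(document),
                            capture_output=True, text=True, check=True)
    for command in save_page_commands(document, int(result.stdout.strip())):
        subprocess.run(command, check=True)
```

Calling split_document("myfile.djvu") would write one file per page into the current directory.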

Many of the already-created djvu files available at archive.org and other sites have the Google copyright page attached to the front of the document. Wikimedia policy, based on an analysis of the underlying law, does not accept that copyright is established on a public domain work simply by scanning or copying it or taking a two-dimensional photograph that faithfully represents its subject. See Wikimedia Commons for more information about scans, artwork and the position of the WMF.

Such copyright pages and other extraneous material can be removed with DjVuLibre, an open-source program maintained by the inventors of DjVu under the GNU General Public License. Binaries are available for Windows, Mac, Linux, Solaris, and IRIX. It includes djvm (djvm.exe on Windows), which is run as a command-line utility. If you cannot figure out how to do this, you can message Mkoyle (talk), and he will do it for your file and email the file to you for upload. The command line to delete (-d) the first page (1) is as follows:

djvm -d filename.djvu 1

Inserting new pages (e.g. a placeholder)

Page placeholder file

If a DJVU file is missing pages, you can insert placeholders, so that if the pages are found and inserted later, existing pages won't need to be moved. You can use File:Generic placeholder page.djvu for the placeholder.

djvm -i main_document.djvu placeholder_file.djvu <page_num>

Note: work backwards from the last missing page in the file, to avoid having to recalculate the page numbers as you insert pages.
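The advice to work backwards can be made mechanical by sorting the insertion points in descending order before any djvm command is issued. The sketch below assumes that each position is a page number counted in the file as it currently stands, before any placeholder has been inserted; placeholder_commands is a hypothetical helper name.

```python
def placeholder_commands(document, placeholder, positions):
    """Build djvm -i commands for the given insertion points, highest first.

    Inserting at the highest position first leaves the lower positions
    unchanged, so none of the page numbers needs to be recalculated.
    """
    return [
        ["djvm", "-i", str(document), str(placeholder), str(page)]
        for page in sorted(positions, reverse=True)
    ]
```

For the positions 12, 40 and 27, the commands come out in the order 40, 27, 12 and can be run one after another, for example with subprocess.run(command, check=True).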

Realigning shifted OCR

It often happens that the text layers of some pages of a DjVu file are invalid; the way that MediaWiki gets the DjVu text layer causes the text of all pages after it to be shifted towards the beginning of the file, which makes it useless. To solve this, first identify the invalid page. You can do that with

djvused file.djvu -e "output-all" > file.dsed

If the OCR is shifted, this should output an error. Look at file.dsed, and the last page number (indicated with # page) is the last valid page. The invalid page is the one after.

To fix this issue, you should remove the text of the invalid page, like so:

djvused file.djvu -e "select [invalid page number]; remove-txt; save"

(This will change file.djvu.) The OCR should now be realigned. Check with another output-all; if it still reports an error, repeat the process for the next invalid page.
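Finding the last "# page" marker in a long file.dsed by eye is error-prone, so the lookup can be scripted. The following sketch assumes that every page in the output-all dump is introduced by a comment of the form "# page N", as described above; the exact dump layout may differ between DjVuLibre versions, and both function names are hypothetical. It is only meaningful when output-all actually stopped with an error.

```python
import re

# Assumed marker format, based on the description above: "# page 12".
PAGE_MARKER = re.compile(r"#\s*page\s+(\d+)")


def first_invalid_page(dsed_text):
    """Return the page after the last one listed in the dump, or None."""
    pages = [int(number) for number in PAGE_MARKER.findall(dsed_text)]
    if not pages:
        return None  # no marker found: the format is not the one assumed
    return pages[-1] + 1


def remove_txt_command(document, page):
    """Command that drops the text layer of one page and saves the file."""
    return ["djvused", str(document), "-e",
            f"select {page}; remove-txt; save"]
```

A typical use would be to read file.dsed, pass its contents to first_invalid_page, and run the command returned by remove_txt_command for that page.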

See also: phab:T219376.

Displaying a particular page

The [[File:...]] link tag accepts a named parameter "page" so that, for example, this wiki code displays the image of page 164 of the file Emily Dickinson Poems (1890).djvu on the right, 150 pixels wide (the rear cover of the book, containing no text):

[[File:Emily Dickinson Poems (1890).djvu|right|150px|page=164]]



The page image can be displayed in the book's Wikisource main space, as with Personal Recollections of Joan of Arc/Book I/Chapter 2, using:

[[File:Personal_Recollections_of_Joan_of_Arc.djvu|page=27|right|thumbnail|200px|THE FAIRY TREE]]

Notes

  1. Example: this 205 MB PDF document of a 1691 book from Gallica is converted by the pdf2djvu.sh script into a hardly readable 382.4 MB DjVu, by djvudigital into a slightly more readable 316.7 MB DjVu, and by the Internet Archive into a better-quality 51.3 MB DjVu.
  2. The defaults are sensible for most cases: --dpi=300 (but requires the metadata about size to be correct) and --bg-slices=72+11+10+10, which the c44 manual recommends for higher quality photography: «74+13+10, for instance, would be appropriate for compressing a photographic image with three progressive refinements. More quality and more refinements can be obtained with option -slice 72+11+10+10.» (Checked in DjVuLibre 3.5.27.)
  3. From http://www.djvu-soft.narod.ru/scan/djvu_imager_en.htm : «BSF (Background Subsample Factor): The ratio of the foreground layer geometrical storage size (in pixels) to the background one (in DjVu). Ranges from 1 to 12. E.g. the background layer may be stored in a DjVu file downsampled to 1..12 times. [...] I recommend you to play only with BSF and not to touch the Background quality (because the latter almost doesn't make sense).»
  4. For instance, this 55 MB PDF document, when cropped with ImageMagick, gives a 100 MB PDF document, which pdf2djvu converts into an 86.2 MB DjVu, while the Internet Archive directly gives a 10.1 MB DjVu of better quality.
  5. Man page.
  6. A comparison here.
  7. Complete instructions here.
  8. Moreover, they can require the proprietary msepdjvu library instead of csepdjvu: see superhero pres: is it independently reproducible?.
  9. See forums: Authentication error; not a valid OpenID, Login problems when I click "Share" .
  10. See forum.
  11. If your original PDF has no embedded text-layer, the derive process will automatically create a second, text-rich, PDF for you by applying the same previously detected OCR generated text to create one.

    Please note, however, that if your PDF comes from Google Books and has a first-page disclaimer notice, the derive process will detect the disclaimer page's hidden text layer, assume that the rest of the pages in the PDF also have embedded hidden text layers (which they never do) and skip the automatic creation of the second PDF file altogether. Keeping the disclaimer page but stripping it of all hidden text is the optimal approach, for reasons having to do with the complementary creation of a DjVu file at the same time. Swapping it with a suitable null or blank page will do just as well, and the last resort is to delete the disclaimer page.

  12. See forum.
  13. Example: Vocabolario degli accademici della Crusca, 1691, took 5.1 days to derive.
  14. Version 9.0 since 2013.
  15. In the example, dimension is 1/6 compared to djvudigital output.
  16. FAQ; documentation of the format to use. Remember: put extensions in lowercase everywhere, use tif with a single f, and put the ppi value of the images in the metadata. If your archive of images is not recognized as such, it may help to edit the metadata and set its format to "Single Page Processed TIFF ZIP" (even if it's a TAR) and so on. You should probably try the _images.zip archive format first.
