Help:DjVu files/other pages

This is a quick rundown of how I usually create a Wikisource-ready DjVu scan of an old public domain or otherwise free work.

  • Often, I do not own the relevant works myself, so I visit a library which has them. I'm most interested in old mathematics journal articles, and most libraries provide those only for reference, not for borrowing. So I have to use a local photocopier. Normally, I let the machine send the copies directly to my private e-mail address as a 300 dpi or 400 dpi TIFF or PDF file. Not only are the fees much smaller than for creating hardcopies, but this also saves an additional A/D step, leading to higher output quality. Plus, library photocopiers are much faster than my personal scanner. Unfortunately, this option is not available in small libraries with old photocopiers.
  • When I work with the raw scans, I use the PBM file format (that's the format created by my personal scanner).
    • To convert TIFF files to PBM, I first create a subdirectory called tifdir and split the original TIFF into its individual pages with the tiffsplit program from libtiff:
       tiffsplit scan.tif tifdir/article
will create files named articleaaa.tif, articleaab.tif, and so on in the tifdir subdirectory. Then I use the convert program from ImageMagick to convert the files to PBM (and possibly rotate them in the process). For example,
       cd tifdir/
       for file in *.tif; do convert -rotate 90 "$file" "$file".pbm; done;
will create rotated PBM files. (Warning to mathematicians: the rotation algorithm uses left-handed (clockwise) rotation.)
  • To convert PDF files to PBM, I use the pdfimages utility from the Xpdf suite (see the sketch after this list). The output file format depends on the format of the images embedded in the PDF. If they're not already PBM, you can use convert as above to convert the files to PBM.
  • Sometimes, the PBMs need to be cropped before they are converted to DjVu. I use a quick-and-dirty home-brewed pbmextract program for that, which lets you specify the coordinates of the extraction rectangle (so you can read them off directly from some image manipulation program like The GIMP). The reason I don't use off-the-shelf image manipulation software is that it is often not sufficiently capable of handling bitonal files. (A similar crop can be done with Netpbm's pnmcut; see the sketch after this list.)
  • Once I have the PBM files ready, I convert them to DjVu using the cjb2 and djvm programs from the DjVuLibre suite:
       for file in *.pbm; do cjb2 -dpi 400 -clean "$file" "$file".djvu; done;
       djvm -c finished_work.djvu *.djvu
Obviously, you may have to change the -dpi option depending on your situation. The -clean option removes "flyspecks", leftover artefacts from the scanning process. Of course, this also means the compression is no longer lossless, so depending on your source material you may want to omit this option.
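
For the pdfimages and cropping steps above, here are minimal sketches. The file names (scan.pdf, page.pbm) are placeholders, and the pnmcut call is an off-the-shelf Netpbm stand-in for my home-brewed pbmextract, so check the man pages for the exact options of your versions:

       pdfimages scan.pdf page
       pnmcut -left 200 -top 150 -width 2400 -height 3300 page.pbm > cropped.pbm

The first command extracts the embedded images as page-000.pbm, page-001.pbm, and so on (or .ppm/.pgm, depending on the source); the second crops a rectangle whose top-left corner and size you have read off in The GIMP.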

DJVU help

Is there free software I can download to edit DJVU files? I've been using DjVu Solo 3.1, but lately most files are saying that version is too old and they can't be read, so I should update. But I can't find a more up-to-date version of DjVu Solo. So does anyone know where I can either get a recent version of DjVu Solo or a recent version of some other DJVU editing program? Angr 16:47, 4 December 2008 (UTC)[reply]

I use DjVuLibre to edit the DJVU files I have. It's a command-line utility and can take some time to learn (at least it took me some time), but it's not a bad set of programs to use for DJVU manipulation. Unfortunately I know of no DJVU editing software that has a nice GUI provided with it.—Zhaladshar (Talk) 18:25, 4 December 2008 (UTC)[reply]
I use DjVuLibre too. Its encoders are not the best, but overall it is very useful. It encodes from PBM/PGM/PPM format, so for most purposes you would need to use it in conjunction with ImageMagick. Hesperian 22:05, 4 December 2008 (UTC)[reply]
Have you seen the corresponding help page? → Help:DjVu files
Hesperian, I'm afraid I don't even understand your answer. I don't know what "It encodes from PBM/PGM/PPM format" means, nor what ImageMagick is. I've seen the help page, but it doesn't tell me what I want to know. Basically, what I (used to) use DjVu Solo for is removing individual images from a package. For example, I often download DjVu files from archive.org, but they often have a front page written by Google pretending Google has the right to limit use of the file to non-commercial purposes only or whatever. Since these front pages are not part of the original book (not to mention being blatant copyfraud), I want to remove them from the package before uploading it to Commons. If I download DjVuLibre, will I be able to figure out how to do that without first getting a degree in computer programming? Angr 07:02, 5 December 2008 (UTC)[reply]
DjVuLibre does simple things simply. For Windoze, if you are just looking to trim the front or back pages, then it does that well; it's a SAVE AS operation. If you want to do more complex bits, then it is a linux-centric application. If you are wanting to convert pages to .djvu form, then you may wish to consider the website Any2DjVu. -- billinghurst (talk) 10:43, 5 December 2008 (UTC)[reply]
I don't know about a SAVE AS operation. I use DjVuLibre in Windows, from the command line. The first page of a djvu file can be removed with the command
djvm -d file.djvu 1
Hesperian 13:04, 5 December 2008 (UTC)[reply]

double pages in djvu

With the help of Help:DjVu files I have now managed to create a djvu file from my png scans. But one issue remains: my scans were scans of double pages, always a left side and a right side on one png. So my scan of a 180-page book results in a djvu of 90 pages. Is there any convenient way to split the original pngs or the pages in the djvu so I will get a djvu with 180 pages? Does anybody know how to solve this problem? --Slomox (talk) 15:08, 18 February 2009 (UTC)[reply]

I do something like this all the time, but under Linux. Given files labeled 001.png through 999.png that are 3500 pixels across and 300 DPI:
 mkdir Output
 for i in `seq -w 1 999`
 do
     pngtopnm "$i".png > temp.pnm              # decode the PNG (bitonal scans give PBM, which cjb2 expects)
     pnmcut -right 1750 temp.pnm > temp1.pnm   # keep the left half (up to column 1750)
     cjb2 -dpi 300 temp1.pnm "$i"a.djvu        # encode the left page
     pnmcut -left 1750 temp.pnm > temp1.pnm    # keep the right half (from column 1750)
     cjb2 -dpi 300 temp1.pnm "$i"b.djvu        # encode the right page
     rm temp.pnm temp1.pnm
 done
 djvm -c book.djvu [0-9][0-9][0-9][ab].djvu    # bundle the pages in order: 001a, 001b, 002a, ...

If they aren't even pages, half the width (1750, in this case) may not work, and you may want to cut a bit off the edges, too. If the scans aren't totally even, you may need to change that value part way through the book. Probably less than helpful, but that's how I do it.--Prosfilaes (talk) 16:47, 18 February 2009 (UTC)[reply]

  • The unpaper utility, which I generally try to use when cleaning up scanned pages, will optionally convert a single scanned image of two side-by-side pages into two separate output files (see the --input-pages and --output-pages options in the documentation). It locates the proper content for each page semi-intelligently by searching for margins consisting of mostly white space. I have been happy with its output so far. Tarmstro99 (talk) 17:15, 18 February 2009 (UTC)[reply]
Unpaper looks good, but I couldn't find a pre-compiled download. Although personally I like GUI programs best, I'm fine with command-line tools. But if I have to compile the program myself, that's a bit too much for me ;-) Is there a pre-compiled Windows version of unpaper available? --Slomox (talk) 17:56, 18 February 2009 (UTC)[reply]
Google is your friend! :-) See http://www.abs.net/~donovan/pgdp.html. Tarmstro99 (talk) 18:26, 18 February 2009 (UTC)[reply]
Thank you. I still have one problem: If I provide a multi-page pbm as input, it will only handle the first page. Is there any special parameter I have to provide to handle all pages? --Slomox (talk) 20:44, 18 February 2009 (UTC)[reply]
I don’t believe so. The solution is to split document.pbm into doc0000.pbm, doc0001.pbm, doc0002.pbm, ... doc0099.pbm with pamsplit, then feed the resulting files into unpaper (which will accept an input parameter such as doc%04d.pbm to automatically start processing multiple files starting from doc0000.pbm). If you want to start processing at, say, page doc0004.pbm instead of counting from 0, just give unpaper the parameters -si 4 doc%04d.pbm. E-mail me if you have further problems with unpaper; I’ve used it for quite a few projects now, and the time spent mastering its idiosyncrasies is well worth it given the quality of its output. Tarmstro99 (talk) 21:02, 18 February 2009 (UTC)[reply]
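
For reference, a minimal sketch of that unpaper call, once the pages have been split into doc0000.pbm, doc0001.pbm, and so on (out%04d.pbm is a placeholder output pattern, and option names may vary between unpaper versions):

 unpaper --layout double --output-pages 2 doc%04d.pbm out%04d.pbm

This reads each two-page scan and writes two single-page output files per input sheet.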

Concerns about fidelity of Internet Archive DjVu files

Following on from my "mystery symbol" discussion above, I now have some serious concerns about the nature of the DjVu encoding used by the Internet Archive, and whether the results can be considered faithful scans.

Here is an image of a paragraph taken from the raw tif files provided to the Internet Archive by Google Books:

And here is the same paragraph after the Internet Archive has encoded it into a DjVu file:

If you look closely you will see that

  1. The R and E of "GREVILLEA" look quite different in the Google Books scan, but have been converted to exactly the same glyph in the Internet Archive DjVu file; and
  2. The u's in "frutices" and "aemulis" have both been converted to what look like small-caps N's.

What worries me is that I can't see how this would have happened unless the Internet Archive's DjVu encoder knows something about what kind of glyphs to expect to find on a page, and is willing to take a guess as to which one is correct—a process tantamount to low-level OCR. If it is the case that the Internet Archive's DjVu processing is guessing glyphs rather than faithfully reproducing whatever it sees, then this casts serious doubt upon how we do our work here. What is the point of using scans to ensure fidelity, if the scans themselves lack fidelity?

Hesperian 01:29, 10 September 2009 (UTC)[reply]

Come to think of it, the encoder need not know about particular glyphs in advance. This output is just as easily explained by the encoder assuming that there are a relatively small number of glyphs, and trying to cluster the glyph instances that it finds into that number of glyph classes. But this is largely irrelevant; infidelity is infidelity whatever the cause. Hesperian 01:53, 10 September 2009 (UTC)[reply]

Looks bad. Concern. Arlen22 (talk) 01:36, 10 September 2009 (UTC)[reply]

Notice, however, that it is correct where the same word appears at the bottom. Strange. If in doubt, throw it out. Arlen22 (talk) 01:39, 10 September 2009 (UTC)[reply]


Cygnis insignis has pointed out that w:JBIG2 probably explains what is happening here:

"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded, typically also using context-dependent arithmetic coding, and the regions are encoded by describing which symbols appear where. Typically, a symbol will correspond to a character of text, but this is not required by the compression method. For lossy compression the difference between similar symbols (e.g., slightly different impressions of the same letter) can be neglected."

I suppose the issue for us is: are we going to cop that? Hesperian 02:52, 10 September 2009 (UTC)[reply]

"The key to the compression method [JB2] is a method for making use of the information in previously encountered characters (marks) without risking the introduction of character substitution errors that is inherent in the use of OCR [1]. The marks are clustered hierarchically. Some marks are compressed and coded directly using arithmetic coding (this is similar to the JBIG1 standard). Others marks are compressed and coded indirectly based on previously coded marks, also using a statistical model and arithmetic coding. The previously coded mark used to help in coding a given mark may have been coded directly or indirectly."
— DjVu: Analyzing and Compressing Scanned Documents for Internet Distribution.[1] Haffner, et al. AT&T Labs-Research

— "So it goes", Vonnegut.
— Sigh, Cygnis insignis (talk) 03:52, 10 September 2009 (UTC)[reply]


Apparently the upshot of this is that this issue is inherent to DjVu, rather than specifically to the Internet Archive encoder. This is only the Internet Archive's fault inasmuch as they use very lossy compression. This is bad news all round. :-( Hesperian 04:10, 10 September 2009 (UTC)[reply]

It’s important to emphasize that this is a consequence of the specific compression settings IA has chosen for their djvu encoder. More reasonable settings can produce better results. I grabbed the topmost png image from this post, converted to PAM, and ran it through the c44 djvu encoder at the default settings. The result was:
Not perfect by any means—it would surely have been better to begin with the source TIFF, rather than a PNG; and tweaking the compression settings or using masks to isolate the foreground text could have produced a smaller file with comparable image quality. But a big improvement over IA’s scan, I think. I suppose the lesson here is to do our own djvu conversions whenever possible. Tarmstro99 (talk) 13:16, 11 September 2009 (UTC)[reply]
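
For reference, a minimal sketch of that kind of conversion, using PPM as the intermediate format (file names are placeholders):

 convert sample.png sample.ppm    # ImageMagick: PNG to portable pixmap
 c44 sample.ppm sample.djvu       # DjVuLibre wavelet encoder, default settings
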
I'm not much involved in this project, but would it behoove us to make our own DJVUs for these works? If we can't even proof the scans, they're not much use to us.—Zhaladshar (Talk) 16:21, 11 September 2009 (UTC)[reply]
support Arlen22 (talk) 18:00, 11 September 2009 (UTC)[reply]


Could you provide a link to the raw tiff and to the djvu at IA where you spotted this? ThomasV (talk) 18:52, 11 September 2009 (UTC)[reply]
I see that there are actually two versions of the file online at Commons. The newer one (31 August 2009) is IA’s and contains the compression errors discussed above. The older one (10 August 2008) is GB’s and, at least on the page referenced above, is error-free. Perhaps rather than re-djvu from TIFFs, we could simply revert to the older, error-free version of the document that is already online? Tarmstro99 (talk) 18:59, 12 September 2009 (UTC)[reply]
Once I've taken full advantage of the OCR, I'll manually generate a smik DjVu from the jp2 images, and upload over the top. Hesperian 01:53, 18 September 2009 (UTC)[reply]
  • I think I have perfected a process of taking the zip file of tiffs, uncompressing the zip, converting the images to an uncompressed format (necessary for the next step), and converting them to a nice high-quality djvu file using gscan2pdf. The djvu is a small size with higher quality than the one from archive.org. Can anyone give me a text that is really bad, or better yet, can we start making a list of texts that need replacement? --Mattwj2002 (talk) 14:16, 26 September 2009 (UTC)[reply]
    • What's the process you use? I'm interested, because I'm trying to create DJVU files that are high quality but smaller in size. Right now I'm only getting pretty large results.—Zhaladshar (Talk) 14:18, 26 September 2009 (UTC)[reply]
      • My process is pretty easy and involves using Linux. In my setup I use Ubuntu. The first step is to download the zip files from the Internet Archive. :) Once the download is complete, you'll have to unzip the file. This can be done either with the GUI or through the unzip console program. Then I use the following script to convert the tiffs to an uncompressed format (please excuse the messy coding):
#!/bin/bash
# number each TIFF sequentially (zero-padded so later tools sort the pages correctly)
i=0
ls -1 *.tif | while read -r f; do convert +compress "$f" "$(printf '%04d' "$i")".tiff; echo "$i"; let i++; done
mkdir tiff
mv *.tiff tiff/

Then I take the files in the tiff directory and use gscan2pdf to make a djvu file. The djvu appears to be roughly the same quality as the original tiffs and a good size. I hope this helps. --Mattwj2002 (talk) 18:42, 26 September 2009 (UTC)[reply]

        • One other point: some of the tifs are also bad quality (from the Internet Archive). A good source might be PDFs directly from Google. If you go that route, I recommend the following command (please bear in mind it takes a lot of RAM and time):
pdf2djvu -d 1200 -o file.djvu file.pdf

This can be done using Windows or Linux. I hope this helps. --Mattwj2002 (talk) 09:27, 27 September 2009 (UTC)[reply]

Another way

I do mine using ImageMagick and DjVuLibre, both freeware.

For typical black-text-on-white-paper pages, use ImageMagick to convert the tif/jp2/whatever to pbm format. PBM format is bitonal - every pixel is either fully black or fully white. Thus converting the bulk of a scan to this format gives you huge compression. Generally "convert page1.tif page1.pbm" gives you a sensible result, though you can fiddle around with manual thresholding if you want. It all depends on how much effort you are willing to invest in learning ImageMagick. DjVuLibre's cjb2 encoder will convert a PBM image into a DjVu file for you.
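
For example, a minimal sketch of the thresholding route (the 60% cut-off is just an assumption to tune per scan):

 convert page1.tif -threshold 60% page1.pbm   # manual black/white threshold
 cjb2 -dpi 300 page1.pbm page1.djvu           # encode the bitonal page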

For pages with illustrations, convert to PGM for greytone images, or PPM for colour images. Then use DjVuLibre's c44 encoder to encode to DjVu.
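
A minimal sketch of that step (the file name and 300 dpi are placeholders):

 convert plate.tif -colorspace Gray plate.pgm   # greytone version of an illustrated page
 c44 -dpi 300 plate.pgm plate.djvu              # DjVuLibre IW44 wavelet encoder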

Finally, use DjVuLibre's djvm to compile all the single-page djvu files into a single multi-page djvu. I find that listing all the files at once under the -c option doesn't work. You need to append one page at a time.
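
A minimal sketch of appending one page at a time (file names are placeholders; djvm -i with no page number appends at the end):

 djvm -c book.djvu page_0001.djvu            # start the bundle with the first page
 for f in page_0002.djvu page_0003.djvu; do  # ...and so on for the remaining pages
     djvm -i book.djvu "$f"                  # append each page in turn
 done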

As for how to manage it all, rather than scripting, I find I get much more control and much more flexibility by enumerating the pages in a spreadsheet, and using formulae to construct the desired commands; e.g. you can easily specify which pages should be treated as bitonal, which greyscale, and which colour, and define your formulae to produce the desired command for each case. Having done that, it is just a matter of copying a column of commands and pasting it to the command line. It is a bit lowbrow, but it really does work well.

Hesperian 11:51, 27 September 2009 (UTC)[reply]

What's the quality of the DjVu you get when you use a bi-tonal input? I've been using pdf2djvu because it doesn't reduce the colors of the PDF images when converting to DJVU and I get a nice, smooth-looking result. Do DJVUs from bi-tonal images look good (or at least decent and not choppy) when all is said and done?—Zhaladshar (Talk) 13:01, 27 September 2009 (UTC)[reply]
Make up your own mind:
  • Pages from the Internet Archive version of An introduction to physiological and systematical botany typically look like this.
  • The IA version was missing a couple of pages, which I obtained elsewhere and shoe-horned in, having converted them to bitonal and then encoded them into DjVu using the bitonal encoder. Those pages look like this and this.
Hesperian 13:17, 27 September 2009 (UTC)[reply]
That answers my question. Thanks.  :) The quality isn't bad at all.—Zhaladshar (Talk) 13:24, 27 September 2009 (UTC)[reply]
To give an idea of what can be achieved, I managed to fit File:History of West Australia.djvu into 69Mb—not bad for 652 physically large pages at 200 dpi, including about forty plates that had to be retained in greyscale. It works out to about 250 pixels per bit; that's some serious compression. Hesperian 13:43, 27 September 2009 (UTC)[reply]