User:GrafZahl/How to digitalise works for Wikisource
This is a quick rundown of how I usually create a Wikisource-ready DjVu scan of an old public domain or otherwise free work.
- Often, I do not own the relevant works myself, so I visit a library which has them. I'm most interested in old mathematics journal articles, and most libraries provide those only for reference, not for borrowing, so I have to use a local photocopier. Normally, I let the machine send the copies directly to my private e-mail address as a 300 dpi or 400 dpi TIFF or PDF file. Not only are the fees much lower than for hard copies, but this also avoids an additional analogue-to-digital step (printing on paper and then rescanning), leading to higher output quality. Plus, library photocopiers are much faster than my personal scanner. Unfortunately, this option is not available in small libraries with old photocopiers.
- When I work with the raw scans, I use the PBM file format (that's the format created by my personal scanner).
- To convert TIFF files to PBM, I first create a subdirectory called tifdir and split the original TIFF into its individual pages with the tiffsplit program from libtiff:
tiffsplit scan.tif tifdir/article
- will create files named articleaaa.tif, articleaab.tif, and so on in the tifdir subdirectory. Then I use the convert program from ImageMagick to convert the file to PBM (and possibly rotate the file in the process). For example:
cd tifdir/
for file in *.tif; do convert -rotate 90 "$file" "$file".pbm; done;
- will create rotated PBM files. (Warning to mathematicians: a positive angle rotates clockwise, i.e. the rotation is left-handed.)
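The two TIFF steps above can be wrapped into one small script. This is only a sketch, assuming tiffsplit and convert are installed. The function names pbm_name and tiff_to_pbm are made up for illustration, and unlike the loop above it replaces the .tif suffix instead of appending .pbm to it.

```shell
#!/bin/sh
# Sketch: split a multi-page TIFF and convert each page to PBM.
# Assumes tiffsplit (libtiff) and convert (ImageMagick) are installed.

# pbm_name FILE: print FILE with its .tif suffix replaced by .pbm
# (hypothetical helper; the loop in the text appends .pbm instead).
pbm_name() {
    printf '%s\n' "${1%.tif}.pbm"
}

# tiff_to_pbm SCAN DIR ANGLE: split SCAN into DIR, then convert each page,
# rotating clockwise by ANGLE degrees (0 leaves the page as it is).
tiff_to_pbm() {
    scan=$1 dir=$2 angle=${3:-0}
    mkdir -p "$dir" || return 1
    tiffsplit "$scan" "$dir/article" || return 1
    for page in "$dir"/article*.tif; do
        convert "$page" -rotate "$angle" "$(pbm_name "$page")" || return 1
    done
}
```

After loading these definitions into a shell, `tiff_to_pbm scan.tif tifdir 90` should leave files such as tifdir/articleaaa.pbm next to the split TIFF pages.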
- To convert PDF files to PBM, I use the pdfimages utility from the Xpdf suite. The output file format depends on the format of the image embedded in the PDF. If it's not already PBM, you can use convert like above to convert the files to PBM.
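The PDF route can be sketched in the same way, assuming pdfimages and convert are installed. The names is_pbm and pdf_to_pbm, the pdfdir directory and the page prefix are illustrative assumptions. How ImageMagick reduces a grey or colour image to two levels depends on its defaults, so the converted pages are worth checking by eye.

```shell
#!/bin/sh
# Sketch: extract embedded images from a PDF and make sure they end up as PBM.
# Assumes pdfimages (Xpdf) and convert (ImageMagick) are installed.

# is_pbm FILE: succeed if FILE already has a .pbm suffix (hypothetical helper).
is_pbm() {
    case $1 in
        *.pbm) return 0 ;;
        *)     return 1 ;;
    esac
}

# pdf_to_pbm PDF DIR: extract images into DIR with the prefix "page",
# then convert whatever is not PBM yet.
pdf_to_pbm() {
    pdf=$1 dir=$2
    mkdir -p "$dir" || return 1
    pdfimages "$pdf" "$dir/page" || return 1
    for img in "$dir"/page-*; do
        # Pages that pdfimages already wrote as PBM need no further work.
        is_pbm "$img" && continue
        # ${img%.*} strips the old suffix, e.g. .ppm
        convert "$img" "${img%.*}.pbm" || return 1
    done
}
```

A call such as `pdf_to_pbm scan.pdf pdfdir` would then leave one PBM per embedded image in pdfdir.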
- Sometimes, the PBMs need to be cropped before they are converted to DjVu. For that I use a quick-and-dirty home-brewed pbmextract program, which lets you specify the coordinates of the extraction rectangle (so you can read them off directly from an image manipulation program like The GIMP). The reason I don't use off-the-shelf image manipulation software for the cropping itself is that it is often not sufficiently capable of handling bitonal files.
- Once I have the PBM files ready, I convert them to DjVu using the cjb2 and djvm programs from the DjVuLibre suite:
for file in *.pbm; do cjb2 -dpi 400 -clean "$file" "$file".djvu; done;
djvm -c finished_work.djvu *.djvu
- Obviously, you may have to change the -dpi option depending on your situation. The -clean option removes "flyspecks", leftover artefacts from the scanning process. Of course, this also means the compression is no longer lossless, so depending on your source material you may want to omit this option.
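Since the resolution and the -clean switch vary from scan to scan, the DjVu step can be parameterised. The following is a sketch, assuming the DjVuLibre tools are installed. The names cjb2_flags and pbm_to_djvu and the pages/ directory are illustrative choices. Writing the single-page files into their own directory keeps the finished document out of the *.djvu glob when the script is run again.

```shell
#!/bin/sh
# Sketch: encode PBM pages with cjb2 and bundle them with djvm.
# Assumes the DjVuLibre tools are installed.

# cjb2_flags DPI CLEAN: print the cjb2 options for a given resolution;
# CLEAN=yes adds -clean (lossy flyspeck removal), anything else omits it.
cjb2_flags() {
    if [ "$2" = yes ]; then
        printf '%s\n' "-dpi $1 -clean"
    else
        printf '%s\n' "-dpi $1"
    fi
}

# pbm_to_djvu OUT DPI CLEAN: encode every *.pbm in the current directory
# into pages/ and bundle the result as OUT.
pbm_to_djvu() {
    out=$1 dpi=${2:-400} clean=${3:-yes}
    mkdir -p pages || return 1
    for page in *.pbm; do
        # $(cjb2_flags ...) is deliberately unquoted so it splits into words
        cjb2 $(cjb2_flags "$dpi" "$clean") "$page" "pages/${page%.pbm}.djvu" || return 1
    done
    djvm -c "$out" pages/*.djvu
}
```

For a 300 dpi scan that should stay lossless, the call would be `pbm_to_djvu finished_work.djvu 300 no`. Pages are bundled in glob order, which matches the alphabetical suffixes produced by tiffsplit. Stale files left in pages/ from an earlier run would be bundled too, so that directory should start out empty.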
- The finished DjVu can then be uploaded to the Commons. Don't forget to fill out the info template, specify a licence, and categorise. Example: commons:Image:Über die Vertauschung von Argument und Parameter in den Integralen der linearen Differentialgleichungen.djvu.
- Once the file is uploaded to the Commons, the text should be transcribed and proofread for Wikisource (or OCR'd, but OCR software does not work very well on texts with a lot of mathematical symbols). The ProofreadPage extension makes this process quite easy. See also Help:Side by side image view for proofreading.