Wikisource:DjVu vs. PDF

Which file format is best for proofreading? Why is DjVu preferred? For help with using DjVu files, see Help:DjVu files.

The ProofreadPage Extension is the backbone of Wikisource's workspace (the Index namespace and the Page namespace), in which proofreading takes place. The extension uses a scanned file of a physical work to create an Index page, from which the individual Page pages are created; these are eventually transcluded into the main namespace for anyone to read.

These days, the ProofreadPage Extension can use different formats for the scanned file. This was not always the case. When the extension was first written, only DjVu files were compatible with our needs, so only DjVu files were considered for the software. Since then, the PDF format has changed; it is now acceptable for Wikisource's purposes and covered by the extension.

For historical reasons, DjVu files are still preferred on Wikisource but either DjVus or PDFs can be used and there are advantages to both. This leaves the question: Which file format should be chosen when starting a new transcription project?

DjVu

DjVu (pronounced "day·zha·voo" like French: déjà vu) is a computer file format designed primarily to store scanned documents, especially those containing a combination of text, line drawings, indexed color images, and photographs. It uses technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows for high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.

DjVu has been promoted as an alternative to PDF, promising smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70 kB, black and white technical papers compress to 15–40 kB, and ancient manuscripts compress to around 100 kB; a satisfactory JPEG image typically requires 500 kB. Like PDF, DjVu can contain an OCR text layer, making it easy to perform copy and paste and text search operations.
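
To put these per-page figures in context, the short sketch below scales them up to a whole book. The 300-page count is a hypothetical example, not a figure from the DjVu developers, and the totals are only as reliable as the per-page estimates quoted above.

```python
# Rough whole-book size estimate from the per-page figures quoted above.
# PAGES is a hypothetical page count chosen purely for illustration.
PAGES = 300

DJVU_BW_KB = (15, 40)   # black and white technical paper, kB per page
JPEG_KB = 500           # "satisfactory" JPEG image, kB per page


def total_mb(kb_per_page: float, pages: int = PAGES) -> float:
    """Return the total size in MB (1 MB = 1000 kB) for a number of pages."""
    return kb_per_page * pages / 1000


djvu_low_mb = total_mb(DJVU_BW_KB[0])
djvu_high_mb = total_mb(DJVU_BW_KB[1])
jpeg_mb = total_mb(JPEG_KB)

print(f"DjVu: {djvu_low_mb}-{djvu_high_mb} MB, JPEG pages: {jpeg_mb} MB")
```

Under these assumptions the DjVu file comes to roughly 4.5 to 12 MB, while the same pages stored as JPEG images come to about 150 MB.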

DjVu is an open container format that holds page images and text to replicate scanned documents. Because it was an open format from the start, DjVu files could be hosted on, and supported by, Wikimedia. DjVu has been supported since the beginning of the ProofreadPage extension and held a monopoly on proofreading at Wikisource for a long time. This head start is part of why DjVu is still the most popular format on the project.

Advantages

  • Compatible philosophy: The DjVu format is and always has been an open format, while PDF was originally a proprietary format owned by Adobe. PDF became an open standard (ISO 32000) in 2008; it is no longer controlled by Adobe and does not require royalty payments from implementers, but it is often still considered less open than DjVu.
  • Smaller files: DjVu files are generally smaller than equivalent PDF files. Wikimedia Commons used to have a 100MB limit, and earlier it was even smaller than that, but the limit is now 4GB (with most uploading methods). Nevertheless, it can be easier to work with smaller files because various processes (extracting images etc.) are quicker.
  • Tried and tested: DjVus have been in use on Wikisource for longer than PDFs. It is more likely that any problems that can occur have already occurred and have been solved. DjVu files are less likely to cause problems with the ProofreadPage Extension or Wikisource in general.

Disadvantages

  • Lower resolution: DjVu files typically have a lower resolution than equivalent PDFs. For the most part, this is not a problem for proofreading as long as the text is legible. It can be a problem with smaller text or text on the borderline of legibility. It can also be a problem for illustrations and other images, if the images are being extracted from the file.[1]
  • Glyphs: Due to the compression system used in DjVu files, some glyphs (letters, numbers and other symbols) may be incorrect. The pixels of each page are divided into symbols and a dictionary of these symbols is then created. The pages are then encoded by describing which symbols (from the dictionary) appear where. If two similar-looking but different glyphs are matched to the same dictionary symbol, one is displayed in place of the other, so the page image may not always be an exact representation of the original work. Note that this should only happen with poor quality, lossy compression.
  • Less external support: The DjVu format is not as widely supported as PDF. There is less software available for creating and editing files in this format.
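
The symbol-dictionary behaviour described under "Glyphs" above can be illustrated with a toy sketch. This is not DjVu's actual bitonal encoder (JB2), which is considerably more sophisticated; the tiny bitmaps, the pixel-difference matching rule and the `tolerance` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# One small bitmap: rows of '#' (ink) and '.' (paper). Illustrative only.
Glyph = Tuple[str, ...]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels at which two same-sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs: List[Glyph], tolerance: int) -> Tuple[List[Glyph], List[int]]:
    """Build a symbol dictionary and describe the page as indexes into it.

    A glyph reuses an existing dictionary entry when it differs from that
    entry by at most `tolerance` pixels; otherwise it becomes a new entry.
    """
    dictionary: List[Glyph] = []
    indexes: List[int] = []
    for glyph in glyphs:
        for i, symbol in enumerate(dictionary):
            if hamming(glyph, symbol) <= tolerance:
                indexes.append(i)
                break
        else:  # no close enough symbol found: add a new one
            dictionary.append(glyph)
            indexes.append(len(dictionary) - 1)
    return dictionary, indexes


def decode(dictionary: List[Glyph], indexes: List[int]) -> List[Glyph]:
    """Rebuild the page by looking up each index in the dictionary."""
    return [dictionary[i] for i in indexes]


# Two hand-drawn glyphs, roughly a "c" and an "e", differing in 4 pixels.
C: Glyph = (".##.", "#...", "#...", "#...", ".##.")
E: Glyph = (".##.", "#..#", "####", "#...", ".##.")
page = [C, E, C]

# tolerance=0 only merges identical glyphs, so decoding is exact.
exact_dict, exact_idx = encode(page, tolerance=0)
# tolerance=4 is too lax: the "e" is matched to the "c" symbol.
lossy_dict, lossy_idx = encode(page, tolerance=4)

print(decode(exact_dict, exact_idx) == page)  # exact round trip
print(decode(lossy_dict, lossy_idx) == page)  # "e" now rendered as "c"
```

With the strict setting the decoded page matches the original; with the lax setting the dictionary shrinks to a single symbol and the "e" is reproduced as a "c", which is the kind of substitution that poor quality, lossy compression can introduce.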

PDF

Portable Document Format (PDF) is a file format used to represent documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it. In 1991, Adobe Systems co-founder John Warnock outlined a system called "Camelot" that evolved into PDF.

While the PDF specification has been available free of charge since at least 2001, PDF was originally a proprietary format controlled by Adobe. It was officially released as an open standard on July 1, 2008, and published by the International Organization for Standardization as ISO 32000-1:2008. In 2008, Adobe published a Public Patent License to ISO 32000-1 granting royalty-free rights for all patents owned by Adobe that are necessary to make, use, sell and distribute PDF-compliant implementations.

In the early days of Wikimedia, PDFs were not allowed to be hosted on Commons or Wikisource because the format was not an open standard. This ended in 2008 when Adobe released the format as an open standard. Since this change, PDFs can be freely uploaded to Wikimedia Commons and the ProofreadPage extension has been adapted to be compatible with the format.

Advantages

  • Higher resolution: PDFs typically have a higher resolution than equivalent DjVus. This may not make much difference for text, which only needs to be legible to be transcribed on Wikisource, unless the text is particularly small or on the borderline of legibility, in which case PDFs are clearer. It can also be an advantage for illustrations and other images, if the images are being extracted from the file.[1]
  • Wider external support: PDFs are more widely known and supported than DjVu files. Many users can find it easier to create and edit PDFs for this reason and PDFs may be easier to acquire than DjVu files.

Disadvantages

  • Bugs: As a later addition to the original software, PDF support is more likely to exhibit "buggy" behaviour and less likely to have been as fully tested as DjVu support. Known bugs include:
    • The inability of the ProofreadPage Extension to recognise accents and diacritics in PDFs.
    • Some recent versions of PDF cannot be read properly by Wikimedia's Ghostscript software. In these cases, all pages appear blank when viewed on a Wikimedia website.
  • Larger file size: PDFs are generally larger than equivalent DjVu files. Wikimedia Commons formerly had a 100MB file size limit, which could be a problem with very large documents (and certainly was a problem when the limit was even smaller). The limit is now 4GB with most uploading methods, but larger files are still slower to work with.
  • Expensive software: Editing a PDF can still require more expensive, proprietary software, whereas DjVu is free in both senses of the word.

Other options

In some cases, Index pages can be created from individual page images instead of a single source file. The only advantage is that available resources can be used without having to convert them to a different format. The Index page itself is more difficult to set up, with no other benefit in return. This approach is not recommended for works with more than a small handful of pages.

Notes

  1. With many sources of scanned files, the original page images are available (for example, on the Internet Archive). It is often better to capture illustrations and other images directly from these page scans and not from either the DjVu or PDF, as the originals will be of higher quality.