User:SnowyCinema/QuickTranscribe

From Wikisource

QuickTranscribe is a Wikisource transcription tool, started by PseudoSkull, that allows for transcription of entire works on only one page, with a shorthand markup language called "QT markup". A bot does all the tedious work semi-automatically. The software is public domain.

The QuickTranscribe project is currently a work in progress. I will make it generally available to the Wikisource community once it is more user-friendly and applicable to a wider range of transcription project types.

Here's an example of what QuickTranscribe is capable of. Keep in mind that most of this was generated automatically from a single page!

Milestones

Tutorial

To be added
  • Installation
  • Transcription preparation
  • Transcribing a work
    • QT markup documentation
  • Post-transcription maintenance

Support

Types of works supported

  • Basic chaptered novels
  • Front-matter-only prose works (such as children's books)
  • Collections (of short stories, poems, essays)

IA/Hathi/Books

  • Can download all necessary image/scan files from HathiTrust and the Internet Archive

OCR

Can automatically extract OCR from every page of any PDF file and return it to be processed by the proofreader.

Transcription cleanup

  • Finds hyphenation inconsistencies (such as "foot-ball" and "football" appearing in the same transcription, which is almost certainly an error)
  • Finds many probable scannos:
    • Odd symbols within words ("roUed", "<0uld")
    • Stray single characters (such as " f ")
    • Paragraphs not ending with punctuation
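
The checks listed above can be approximated with a few small heuristics. The following is only a minimal sketch, not QuickTranscribe's actual code: the function names, the list of accepted closing marks, and the ASCII-only word pattern are all assumptions made for illustration, and the heuristics will produce false positives (for example on names like "McDonald" or on accented letters).

```python
import re
from collections import defaultdict

# ASCII-only word pattern (an assumption; accented letters are not handled).
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
# Closing marks accepted at the end of a paragraph (hypothetical list).
END_MARKS = ".!?\"'\u201d\u2019)"
# Single letters that are legitimate English words on their own.
VALID_SINGLE_LETTERS = {"a", "A", "I", "O"}


def find_hyphenation_inconsistencies(text):
    """Map each de-hyphenated word to its variants when more than one spelling occurs."""
    variants = defaultdict(set)
    for word in WORD_RE.findall(text):
        lowered = word.lower()
        variants[lowered.replace("-", "")].add(lowered)
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}


def find_suspicious_tokens(text):
    """Flag tokens with odd symbols or digits, mid-word capitals, or stray single letters."""
    flagged = []
    for token in text.split():
        # Drop ordinary punctuation around the token before inspecting it.
        core = token.strip(".,;:!?\"'()")
        if not core:
            continue
        has_letter = any(ch.isalpha() for ch in core)
        mixed_case = re.search(r"[a-z][A-Z]", core) is not None
        odd_char = has_letter and re.search(r"[^A-Za-z'-]", core) is not None
        lone_letter = (
            len(core) == 1 and core.isalpha() and core not in VALID_SINGLE_LETTERS
        )
        if mixed_case or odd_char or lone_letter:
            flagged.append(core)
    return flagged


def find_unterminated_paragraphs(text):
    """Return paragraphs (separated by blank lines) that do not end with a closing mark."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if p[-1] not in END_MARKS]
```

Each function only reports candidates; deciding whether a flagged item is a real error is left to the proofreader.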

Wikidata

  • Can create Wikidata items for both the base work and the version, and can add the relevant data to those items
  • Very cool features, including:
    • Main work image (either cover or frontispiece), which is automatically detected by the software
    • Automatic parsing of a "dedications" page to populate the "dedicated to" property

Wikimedia Commons

  • Can get all work image data based on (1) placement in the transcription and (2) iterative name ("1.png", "2.png", etc.)
  • Can create a Commons category for the work, with the necessary parent categories
  • Can create a Creator page for an author if it doesn't exist
  • Can upload both the scan file and all the work images, with a good titling scheme and with valid file descriptions and categories
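
To illustrate how placement in the transcription can be paired with iterative file names, here is a minimal sketch. The placeholder marker `[[IMAGE]]` and the function name are hypothetical stand-ins, since the actual QT markup for images is not documented here; only the "1.png", "2.png" naming comes from the description above.

```python
def assign_image_files(transcription, marker="[[IMAGE]]", extension="png"):
    """Pair each image placeholder with the next sequential file name.

    Returns a list of (line_number, file_name) tuples in reading order.
    The marker is a hypothetical placeholder, not real QT markup.
    """
    assignments = []
    count = 0
    for line_number, line in enumerate(transcription.splitlines(), start=1):
        # A line may hold more than one placeholder; number them left to right.
        for _ in range(line.count(marker)):
            count += 1
            assignments.append((line_number, f"{count}.{extension}"))
    return assignments
```

Under this assumption, the third placeholder encountered in the transcription always corresponds to "3.png", regardless of which page it falls on.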

Transcription parsing

  • Can automatically convert the beginning text of a chapter into small caps ({{sc}}), or place a drop initial ({{di}}) at the beginning
  • Can automatically generate tables of contents, based on a provided format and the text entered in the chapter headers
  • Can automatically place images into the work after they are uploaded
  • Supports formatting continuations between pages ({{fine block/s}} to {{fine block/e}} in headers/footers, continuations of poems across pages with {{ppoem}} etc.)
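
As a rough illustration of the chapter-opening conversion, the sketch below wraps the first words of a paragraph in {{sc}} or turns the first letter into a {{di}} drop initial. The function name, the `style` switch, and the default of two small-caps words are assumptions for the example; how QuickTranscribe actually decides how much text to convert is not stated here.

```python
def format_chapter_opening(paragraph, style="sc", sc_words=2):
    """Apply small caps or a drop initial to the start of a chapter's first paragraph.

    style="sc" wraps the first `sc_words` words in {{sc|...}};
    style="di" wraps only the first character in {{di|...}}.
    """
    if not paragraph:
        return paragraph
    if style == "di":
        return "{{di|" + paragraph[0] + "}}" + paragraph[1:]
    words = paragraph.split(" ")
    head = " ".join(words[:sc_words])
    rest = " ".join(words[sc_words:])
    return "{{sc|" + head + "}}" + (" " + rest if rest else "")
```

For example, `format_chapter_opening("The morning was cold.")` yields `{{sc|The morning}} was cold.`, and the same call with `style="di"` yields `{{di|T}}he morning was cold.`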

Transclusion

  • Can automatically create Index pages
  • Automatically creates a default style sheet in the Index page based on templates used
  • Can input all pages properly into the Page namespace (assuming each page has status "Proofread" (3) or "Without text" (0), i.e. not needing to be proofread)
  • Can transclude an entire chaptered novel accurately
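
The status condition and the transclusion step can be sketched as follows. This is an illustrative outline under assumptions, not the tool's implementation: the function names and the dictionary of page numbers to quality levels are hypothetical, while the `<pages index=... from=... to=... />` tag is the standard ProofreadPage transclusion syntax.

```python
# Quality levels the description above treats as ready: 0 and 3.
READY_LEVELS = {0, 3}


def blocking_pages(page_levels):
    """Return the page numbers whose quality level is not yet ready, in order."""
    return sorted(n for n, level in page_levels.items() if level not in READY_LEVELS)


def pages_tag(index_name, first_page, last_page):
    """Build a ProofreadPage <pages /> tag transcluding a page range from an Index."""
    return f'<pages index="{index_name}" from={first_page} to={last_page} />'


def chapter_wikitext(index_name, first_page, last_page, page_levels):
    """Return the transclusion tag for a chapter, or None if any page is not ready."""
    in_range = {
        n: level for n, level in page_levels.items() if first_page <= n <= last_page
    }
    if blocking_pages(in_range):
        return None
    return pages_tag(index_name, first_page, last_page)
```

Checking readiness per chapter range, rather than for the whole work, is a design choice of this sketch; it lets finished chapters be transcluded while later pages are still being proofread.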

To be done

  • Disambiguation page creation/handling
  • Version page creation/handling
  • Semi-automated author page creation, updating, and disambiguation
  • Support for poetry collections
  • Support for film transcription (an improvement of the WikiProject Film draft system)
  • Support for periodicals
  • Support for newspapers
  • Support for dictionaries/directories/catalogs
  • Support for encyclopedias
  • Support for local (enwikisource) uploading of files, and categorization of those files

  • GitHub repository, the code base for QuickTranscribe, released into the public domain


I, the copyright holder of this work, hereby release it into the public domain. This applies worldwide.

In case this is not legally possible:

I grant anyone the right to use this work for any purpose, without any conditions, unless such conditions are required by law.
