User:Inductiveload/cleanup
This is an alpha-level tool. Feel free to file bugs, but do not expect it to be perfect, and avoid running it on pages you have already proofread: there are no guarantees it will not introduce errors, though most replacements are tested against a wordlist.
Installation
The basic script can be installed as follows:
mw.loader.load('//en.wikisource.org/w/index.php?title=User:Inductiveload/cleanup.js&action=raw&ctype=text/javascript');
Once installed, you will have access to the default cleanup tool configuration. You may wish to add more (work-specific) corrections, disable some, or add other configuration such as possible languages or long-s corrections.
Concept
The tool performs a list of common actions:
- Collapsing of lines and hyphens where appropriate
- Adding paragraphs where likely
- Typographic fixes like spaces before/after commas
- Removing obvious running headers and scanning watermarks
- Fixing OCR errors:
  - This uses a large (hundreds of entries) list of replacements, mostly tested against an English wordlist to avoid false positives. For example, not many words end in rcs, so comparcs is likely to be compares. However, arcs and orcs are not changed.
  - A separate (partially complete) list of long-s scannos is also included, but this is more likely to produce false positives, so it is not on by default.
- Extra user-defined functions
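The idea of testing a replacement against a wordlist can be sketched as follows. This is only an illustration, not the tool's actual code: the helper name falsePositives, the tiny sample wordlist and the "at least three letters before rcs" rule are all assumptions made for the example.

```javascript
// Hypothetical helper: list the real words that a candidate regex would touch.
function falsePositives( regex, wordlist ) {
	// Test with a non-global copy so lastIndex state cannot leak between words
	const probe = new RegExp( regex.source, regex.flags.replace( 'g', '' ) );
	return wordlist.filter( function ( word ) {
		return probe.test( word );
	} );
}

// A tiny stand-in for a real English wordlist (assumption)
const sampleWords = [ 'arcs', 'orcs', 'compares', 'circus' ];

// Too broad: it would also rewrite valid words
const broadHits = falsePositives( /rcs\b/, sampleWords );

// Narrower: require at least three letters before "rcs"
const narrowRegex = /\b(\w{3,})rcs\b/g;
const narrowHits = falsePositives( narrowRegex, sampleWords );

// Only the likely scanno is changed
const fixedText = 'comparcs arcs orcs'.replace( narrowRegex, '$1res' );
```

Here broadHits contains arcs and orcs, so the broad pattern is rejected, while the narrower pattern hits nothing in the sample list and can be used for the replacement.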
Configuration
Configuration is via a "standard" (for me) mw.hook, which is called with a configuration object for you to update.
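A hook handler might look roughly like the sketch below. The hook name used here is a placeholder and an assumption, since this page does not state it: check the script source for the real name. The handler name configureCleanup and the particular values are likewise only illustrative; the option names come from the default config object on this page.

```javascript
// ASSUMPTION: placeholder hook name, not confirmed by this page.
const HOOK_NAME = 'cleanup.config';

// Receives the config object and updates it in place.
function configureCleanup( cleanupConfig ) {
	cleanupConfig.doLongSReplacements = true;
	cleanupConfig.shortLineThreshold = 40;
	cleanupConfig.additionalOcrReplacements.push(
		[ /\bBouring\b/, 'Bowring' ]
	);
}

// mw is only defined on wiki pages, so guard the registration
if ( typeof mw !== 'undefined' ) {
	mw.hook( HOOK_NAME ).add( configureCleanup );
}
```

Because the handler mutates the object it is given, any option it does not touch keeps its default value.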
This is the default config object, which is what you will get if you do not add a config hook handler:
const Cleanup = {
	logLevel: ERROR,
	enable: true,
	testFunctions: [],
	enableTesting: mw.config.get( 'wgTitle' ).endsWith( 'cleanup-test' ),
	portletCategory: 'page',
	activeNamespaces: [ 'page' ],
	actionTitle: 'WsCleanup',
	additionalOcrReplacements: [],
	disabledReplacements: [],
	cleanupFunctions: [],
	italicWords: [],
	doLongSReplacements: false,
	doTemplateCleanup: true,
	remove_running_header: true,
	replaceSmartQuotes: true,
	collapseSuspiciousParagraphs: true,
	shortLineThreshold: 45,
	possibleLanguages: [ 'en', 'fr', 'es', 'de', 'zh-pinyin' ],
	italiciseForeign: true,
	smallAbbreviations: [],
	smallAbbrTemplate: 'smaller',
	editSummary: '/* Proofread */',
	markProofread: true,
	cleanupAccesskey: 'c'
};
logLevel
- The logging level of the Cleanup functions. Set to 0 for DEBUG, 1 for INFO and 2 for ERROR.
enable
- Enable the cleanup script (false prevents any of it from being added to the UI)
testFunctions, enableTesting
- For now, internal
portletCategory
- The portlet category to add the tool link to
activeNamespaces
- Namespaces to load in (does nothing in other namespaces)
actionTitle
- The name in the sidebar
additionalOcrReplacements
- A list of additional OCR fixes (see below)
disabledReplacements
- A list of disabled replacements (see below)
cleanupFunctions
- Additional functions to run at the end of the process (see below)
replaceSmartQuotes
- Convert “smart quotes” to "straight quotes".
doLongSReplacements
- Perform a set of fixes for badly-OCR'd texts using long-s. For example: affift → assist
collapseSuspiciousParagraphs
- Collapse paragraphs together if they look "suspect". For example, if one ends without punctuation, and the next starts with a lowercase letter.
shortLineThreshold
- Dirty hack to re-insert paragraphs lost in the DjVu round trip. Adds a paragraph break after a line shorter than this that appears to end a sentence, when the next line looks like the start of a sentence. The spiritual inverse of collapseSuspiciousParagraphs.
possibleLanguages
- The languages the work might contain. This disables some replacements that would be invalid in those languages. For example, in German und is valid, but in English it is more likely to be a scanno for and.
italiciseForeign
- Italicise foreign words. This is a very short list at present.
smallAbbreviations
- Abbreviations to put inside an {{asc}} template. Note, A.D. and B.C. are already templates.
smallAbbrTemplate
- The template to use for small abbreviations.
editSummary
- The edit summary to automatically add, if any
markProofread
- Also mark the page as Proofread. Note: you still have to actually proofread the page yourself, this isn't magic!
cleanupAccesskey
- The access key to use, e.g. c → Ctrl+Alt+c
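To make the shortLineThreshold heuristic more concrete, here is a minimal sketch of how such a rule could work. It is not the tool's actual implementation: the function name reinsertParagraphs and the exact "sentence end" and "sentence start" tests are assumptions chosen for illustration.

```javascript
// Hypothetical sketch of the short-line paragraph heuristic.
function reinsertParagraphs( text, shortLineThreshold ) {
	const lines = text.split( '\n' );
	const out = [];
	lines.forEach( function ( line, i ) {
		out.push( line );
		const next = lines[ i + 1 ];
		// Nothing to do at the end of the text or around existing breaks
		if ( next === undefined || line === '' || next === '' ) {
			return;
		}
		const isShort = line.length < shortLineThreshold;
		// Assumed tests: closing punctuation, then a capitalised next line
		const endsSentence = /[.!?]["')]?$/.test( line );
		const nextStartsSentence = /^["'(]?[A-Z]/.test( next );
		if ( isShort && endsSentence && nextStartsSentence ) {
			out.push( '' ); // a blank line is a paragraph break in wikitext
		}
	} );
	return out.join( '\n' );
}

const sample = [
	'It was a long line of text that runs to the full width.',
	'It ended here.',
	'The next paragraph starts on this line and continues on.'
].join( '\n' );
const withBreaks = reinsertParagraphs( sample, 45 );
```

With a threshold of 45, only the short second line triggers a break, so a blank line is inserted between the second and third lines. A heuristic like this will misfire on short lines that merely happen to end a sentence, which is why it is worth checking the result by eye.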
Additional replacements
This is a list of replacements to make. It is a list of tuples of [ /regex/, "replacement" ] entries.
Often, you will make special replacements only in certain works:
// _title is assumed to hold the page title, e.g. from mw.config.get( 'wgTitle' )
if ( _title.startsWith( 'The Chinese Review' ) ) {
	cleanupConfig.additionalOcrReplacements.push( ...[
		[ /\bYch\b/, 'Yeh' ],
		[ /\bBouring\b/, 'Bowring' ]
	] );
}
The g flag is always applied to these regexes. If you need a non-global regex, use a cleanup function (see below). Other flags are kept (e.g. i). Replacement references like $1 work.
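The flag behaviour described above can be sketched like this. The function applyReplacements is a hypothetical stand-in written for this example, not the tool's own code; it only shows how a forced g flag, preserved i flag and $1 references combine.

```javascript
// Hypothetical stand-in for how the replacement list could be applied.
function applyReplacements( text, replacements ) {
	return replacements.reduce( function ( result, pair ) {
		const regex = pair[ 0 ];
		// Force the global flag, keep any other flags (e.g. i)
		const flags = regex.flags.includes( 'g' ) ? regex.flags : regex.flags + 'g';
		return result.replace( new RegExp( regex.source, flags ), pair[ 1 ] );
	}, text );
}

const cleaned = applyReplacements( 'Ych met ych; tbe cat and tbe dog', [
	[ /\bYch\b/i, 'Yeh' ],       // i is kept, g is added
	[ /\btb(e|at)\b/, 'th$1' ]   // $1 refers to the captured ending
] );
```

Neither regex is written with g, yet every occurrence is replaced, giving 'Yeh met Yeh; the cat and the dog'.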
Disabled replacements
Disabled replacements are a list of replacements not to apply, even though they are part of the normal script:
cleanupConfig.disabledReplacements.push( ...[
	[ /\berery/ ],
	[ /\bhighfer/, 'higher' ],
] );
If only one element is given in an item, all replacements with that regex are disabled. If two are given, both the regex and the replacement must match for the disabling to happen. Only the "text" of the regex is compared; flags are ignored.
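The matching rule can be sketched as a small predicate. The name isDisabled is hypothetical, and comparing regex.source is an assumption about what comparing the "text" of the regex means; the built-in replacements used in the usage lines are also invented for the example.

```javascript
// Hypothetical predicate: is this built-in replacement switched off?
function isDisabled( replacement, disabledList ) {
	return disabledList.some( function ( entry ) {
		// Compare only the pattern text, never the flags
		if ( entry[ 0 ].source !== replacement[ 0 ].source ) {
			return false;
		}
		// One element: any replacement for that regex is disabled.
		// Two elements: the replacement string must match as well.
		return entry.length === 1 || entry[ 1 ] === replacement[ 1 ];
	} );
}

const disabledList = [
	[ /\berery/ ],
	[ /\bhighfer/, 'higher' ]
];

const a = isDisabled( [ /\berery/gi, 'every' ], disabledList );      // regex only
const b = isDisabled( [ /\bhighfer/g, 'higher' ], disabledList );    // regex and text
const c = isDisabled( [ /\bhighfer/g, 'highter' ], disabledList );   // text differs
```

Here a and b are true and c is false: the second entry only disables the replacement whose output is exactly 'higher'.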
Cleanup functions
Final functions to run. This is a list of functions that are given the main (body), header and footer editors (as in TemplateScript) as parameters. They run in order.
cleanupConfig.cleanupFunctions.push( ...[
	function ( editor, header, footer ) {
		header.set( 'Some custom header!' );
	},
	function ( editor, header, footer ) {
		footer.set( 'Some custom footer!' );
	},
] );
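To show what "run in order" means in practice, the sketch below runs two cleanup functions against stub editors. The stub (makeStubEditor) and the runner (runCleanupFunctions) are hypothetical helpers written for this example; the stub only mimics get and set and is not the real TemplateScript editor.

```javascript
// Hypothetical stub with just enough of an editor for the demo
function makeStubEditor( initial ) {
	let text = initial;
	return {
		get: function () {
			return text;
		},
		set: function ( value ) {
			text = value;
			return this;
		}
	};
}

// Hypothetical runner: call each function in list order
function runCleanupFunctions( functions, editor, header, footer ) {
	functions.forEach( function ( fn ) {
		fn( editor, header, footer );
	} );
}

const body = makeStubEditor( 'Some  text' );
const head = makeStubEditor( '' );
const foot = makeStubEditor( '' );

runCleanupFunctions( [
	function ( editor ) {
		// First: collapse runs of spaces in the body
		editor.set( editor.get().replace( / {2,}/g, ' ' ) );
	},
	function ( editor, header ) {
		// Second: sees the body as left by the first function
		header.set( 'Length: ' + editor.get().length );
	}
], body, head, foot );
```

The second function reports a length of 9, not 10, because it runs after the first has already collapsed the double space.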
See also
- TemplateScript: a general-purpose tool that allows you to add actions to the sidebar. The script is mostly one huge TemplateScript action.