User:Inductiveload/Parser migration

From Wikisource

MediaWiki is preparing to change the underlying parser that is used to interpret Wikitext and produce HTML. The old parser is called "Tidy" and the new one is "Remex".[1]

There are many cases (hundreds of thousands) where the Wikitext at Wikisource is invalid in some way. In some of them, this might cause different results when the new parser is used. However, in a lot of cases, there is no real visible difference.

There is a tool called a linter which can show many of these problems. They are shown at Special:LintErrors.
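The same data can also be fetched programmatically. The following is a minimal sketch, assuming the wiki exposes the Linter extension's `list=linterrors` API module with the `lntcategories`, `lntnamespace` and `lntlimit` parameters; `buildLintQueryUrl` is a hypothetical helper name, and `obsolete-tag` is used only as an example category.

```javascript
// Sketch only: builds a query URL for lint errors on English Wikisource.
// Assumption: the Linter extension's list=linterrors module is available.
// buildLintQueryUrl is a hypothetical helper, not part of MediaWiki.
function buildLintQueryUrl(category, namespace, limit) {
    var params = new URLSearchParams({
        action: "query",
        list: "linterrors",
        lntcategories: category,          // e.g. "obsolete-tag"
        lntnamespace: String(namespace),  // 0 is the main namespace
        lntlimit: String(limit),
        format: "json"
    });
    return "https://en.wikisource.org/w/api.php?" + params.toString();
}
```

The resulting URL can be opened in a browser or passed to `fetch()` to inspect the raw results.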

How to check for linter errors

Other than trawling Special:LintErrors, there is a tool, lintHint, which adds a linter checker to the top of pages. There are instructions at en:w:User:PerfektesChaos/js/lintHint, but to activate it on all WS pages, add this to your Special:MyPage/common.js:

// linter config object
var myLintHints = { };

//  lint in all namespaces
myLintHints.rooms = "*";

// communicate user defined object
mw.hook( "lintHint.config" ).fire( myLintHints );

// finally, load gadget
mw.loader.load( "https://en.wikipedia.org/w/index.php?title=User:PerfektesChaos/js/lintHint/r.js&action=raw&bcache=1&maxage=86400&ctype=text/javascript" );

A yellow button should appear on pages near the top and when clicked it will show the linter errors on the page. When you are editing a page, clicking it will check the current editor contents live.

Note: this currently doesn't seem to work in the Page: or Index: namespaces. The development version (as of 2.14) does work there; to use it, replace r.js with d.js above.

How to compare parser outputs

There is a tool that you can activate in your Preferences under Editing called "Parser Migration tool". This allows you to see a page as it is processed by each parser. Ideally, both sides will be identical.

Errors and templates

The error reporting at Special:LintErrors often includes the template the error is in. This can be useful, but it can also be misleading: it could indicate that the error is in a parameter of the template, that it is somewhere in the template code, or that it is an interaction of the two.

Error in parameter

In this case, the error has nothing to do with the template, it just happens to be in a parameter:

{{larger block|'''foobar}}

This will be reported as being through {{larger block}}, but the problem is not found in the template code.

Common linter errors

There is a description of each linter error reported at mw:Help:Extension:Linter.

Below is a quick description of some common ones in the context of Wikisource. Nearly all linter errors are harmless, in that the affected code does not generally render out differently, even if the two parsers might disagree about something. However, they do also often indicate low-quality markup or markup that has a typo or has been broken accidentally.

Misnested tags

Span vs block

HTML has "span" and "block" elements. Span elements generally look like <span>, <small>, <font> etc. Block elements are like <div> and <p>.

Span elements should not contain block elements. However, this is often inadvertently done by including block elements inside a template that represents a span element:

{{larger|Foo

bar}}

In this case, a block element is produced by the paragraph break and is put inside a span-based template. This produces either a "Misnested tag with different rendering in HTML5 and HTML4" or a "Misnested tag" linter error. To resolve it, either use a separate {{larger}} for each line, or use the {{larger block}} template, which can contain block elements.
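To hunt for this pattern in a page's wikitext, something like the following rough sketch could be used. `findParagraphBreaksInSpanTemplates` is a hypothetical helper; it only looks for a blank line inside a call to one of the named templates, and it does not cope with templates nested inside the parameter.

```javascript
// Sketch only: find calls to span-based templates whose parameter
// contains a blank line (which the parser turns into a paragraph break).
// Simplification: the match stops at the first "}}", so nested
// templates inside the parameter are not handled.
function findParagraphBreaksInSpanTemplates(wikitext, templateNames) {
    var hits = [];
    templateNames.forEach(function (name) {
        // Escape any regex metacharacters in the template name.
        var escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
        var pattern = new RegExp("\\{\\{\\s*" + escaped + "\\s*\\|([\\s\\S]*?)\\}\\}", "g");
        var match;
        while ((match = pattern.exec(wikitext)) !== null) {
            if (/\n\s*\n/.test(match[1])) {
                hits.push(match[0]);
            }
        }
    });
    return hits;
}
```

Calling it with `["larger"]` flags `{{larger|Foo`, blank line, `bar}}`, but not `{{larger block|...}}`, because the name must be followed directly by the pipe.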

If the misnested tag is a <font> tag, consider replacing with a block sizing template like {{larger block}}, as the font tag is deprecated anyway.

The HTML4/5 variants of these errors are marked as "high-priority", as they will cause differences in rendering.[2]

The others are "medium-priority", and do not generally appear to cause much visual disturbance between the two parsers.

Tag interleaving

This error can also happen when tags are interleaved rather than nested:

<b> fdsdfsf <i> sddsf </b>  adsafd </i>

In this case, the bold tag is closed before the italic tag, even though it was opened first. In HTML, the tags must be nested, so that if you open a tag inside another one, you close the inner one before you close the containing one. Exactly how you do this will depend somewhat on what the markup was trying to achieve but might be like this:

<b> fdsdfsf <i> sddsf </i></b> <i> adsafd </i>

This kind of mis-nesting is fairly rare as it's harder to do with templates, and generally the two parsers come up with the same output.

It can still happen with wikitext:

'''dasdas ''asdads''' adad''
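The nesting rule for HTML tags can be illustrated with a small stack-based check. This is a sketch, not how either parser works: `findMisnestedTags` is a hypothetical helper that only understands plain HTML tags (not wikitext quotes) and ignores void elements such as <br>.

```javascript
// Sketch only: report closing tags that do not match the most recently
// opened tag. Void elements (e.g. <br>) and wikitext markup are ignored
// here, so real pages may produce false positives.
function findMisnestedTags(html) {
    var tagPattern = /<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi;
    var open = [];      // stack of currently open tag names
    var problems = [];
    var match;
    while ((match = tagPattern.exec(html)) !== null) {
        var isClosing = match[1] === "/";
        var name = match[2].toLowerCase();
        if (!isClosing) {
            open.push(name);
        } else if (open.length > 0 && open[open.length - 1] === name) {
            open.pop();  // properly nested close
        } else if (open.indexOf(name) !== -1) {
            problems.push("</" + name + "> closed while <" +
                open[open.length - 1] + "> was still open");
            open.splice(open.lastIndexOf(name), 1);
        } else {
            problems.push("</" + name + "> has no matching opening tag");
        }
    }
    return problems;
}
```

Run on the interleaved example above, it reports one problem for the early </b>; run on the corrected version, it reports none.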

Unterminated markup

Markup like the following confuses the parser as it has to guess where to put the closing tag. When this happens, it often makes the right choice (which is why editors don't notice). In any case, it raises a Missing end tag error.

''italic, but where is the end?
<b>bold, but where is the end?

There are many tags that can be left unterminated (upsetting the linter but maybe producing valid output). The linter will tell you what the tag is. The majority are italic or bold markup, either HTML tags or Wikitext.
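For the wikitext quote case, a rough first pass is to count the quote runs on each line. The sketch below assumes the simplified reading that '' toggles italic, ''' toggles bold and ''''' toggles both; the real parser has extra special cases, so `findUnbalancedQuoteLines` (a hypothetical helper) should be treated as a hint for where to look, not as a replacement for the linter.

```javascript
// Sketch only: flag lines with an odd number of italic or bold toggles.
// Simplified model of quote handling; the real parser applies further
// heuristics, so results are indicative only.
function findUnbalancedQuoteLines(wikitext) {
    var flagged = [];
    wikitext.split("\n").forEach(function (line, index) {
        var italicToggles = 0;
        var boldToggles = 0;
        (line.match(/'{2,}/g) || []).forEach(function (run) {
            if (run.length === 2) {
                italicToggles += 1;
            } else if (run.length === 3 || run.length === 4) {
                boldToggles += 1;   // treat '''' as ' plus '''
            } else {
                italicToggles += 1; // five or more: bold and italic
                boldToggles += 1;
            }
        });
        if (italicToggles % 2 === 1 || boldToggles % 2 === 1) {
            flagged.push(index + 1); // 1-based line numbers
        }
    });
    return flagged;
}
```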

Obsolete tags

Some tags are deprecated in HTML, mostly because they violate the separation between content and layout. The most common are <center> and <font>.

<center> can generally be easily replaced by {{center}}.

<font> can usually be replaced with a colour template like {{red}} or one of the text size templates like {{larger}}, depending on what it's being used for.
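For the simple <center> case, the replacement is mechanical enough to sketch. `replaceCenterTags` is a hypothetical helper and is deliberately conservative: it leaves nested <center> tags and any content containing a pipe untouched, since those need a human to check, and it uses `1=` when the content contains an equals sign so the template does not read it as a named parameter.

```javascript
// Sketch only: rewrite simple <center>...</center> pairs as {{center|...}}.
// Conservative by design: nested <center> tags and content containing "|"
// are left alone for manual review.
function replaceCenterTags(wikitext) {
    return wikitext.replace(/<center>([\s\S]*?)<\/center>/gi, function (whole, content) {
        if (/<center>/i.test(content) || content.indexOf("|") !== -1) {
            return whole;
        }
        // An "=" in an unnamed parameter would be read as name=value.
        var prefix = content.indexOf("=") !== -1 ? "1=" : "";
        return "{{center|" + prefix + content + "}}";
    });
}
```

Any automated change like this should still be previewed before saving.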

Stripped tags

These are when the parser doesn't know what to do with a tag and discards it. Very often it's due to a superfluous closing tag, possibly left when the opening tag was removed in the past:

<div class="foobar">
Lorem ipsum
</div>
</div>

These errors are probably harmless, as the parser is discarding the tags anyway, but they might be a sign that some formatting has been broken.

Tidy bug affecting font tags wrapping links

This is a very specific error caused by code like this:

<font color="#CC99CC">[[A link]]</font>

When the new parser interprets this, the link will not be coloured as expected, because the colour set on the wrapping tag is overridden by the link's own colour. Generally, at WS, this only happens in user signatures, so it's not a critical problem affecting content. As <font> tags are deprecated anyway, the correct markup puts a <span> inside the link label instead:

[[A link|<span style="color:#CC99CC;">A link</span>]]

Error in template

Any parser error that happens in the template will happen on all pages that use the template.

Imagine a template like this that used <center> to center the only parameter:

<center>{{{1}}}</center>

Every page that uses this template will show up with an "Obsolete tag" lint error. Each error will be reported as being through this template. Changing <center> to {{center}} would fix every page the template is used on. It can take a while for Special:LintErrors to update when these errors are fixed.

Error caused by interaction of parameters and template

Imagine a template that makes text red:

<span style="color:red;">{{{1}}}</span>

As this is a span-based template, if you feed it block elements, it will cause "Misnested tag" errors. Again, it will be reported as being through the template, but it's not fully the template's fault, and it's not fully the parameter's fault. The errors can be avoided in two ways:

  • Fix the parameters to not cause the errors. This may mean you can't format something quite how you wanted if the template isn't written to allow it (e.g. paragraph breaks in a template that wraps the input in a span)
  • Change the template to accommodate the input you want to give (in the above case, use a div). Bear in mind, this might break existing users of the template (in the above example, you would no longer be able to use the template within a line of text without a line break).

Another example of this could be a template like this:

''{{{1}}}''

If you call this template like this:

{{mytemplate|<span>''text''</span>}}

You will get an error, because the wikitext expands to:

''<span>''text''</span>''

Which is invalid, as the span and italic tags will be interleaved rather than nested as you might expect (as '' stands for <i> and </i>, depending on context). In this case, you could:

  • replace one or both pairs of '' with <i></i>
  • rethink why you are italicising text that's already being italicised by a template - are you mistaken or is the template too inflexible?
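The expansion step can be reproduced with a toy substitution, which makes it easier to see what the parser is actually given. This is only a sketch: `expandTemplate` is a hypothetical helper that replaces numbered parameters like {{{1}}} and nothing else (no defaults, named parameters or parser functions).

```javascript
// Sketch only: substitute numbered parameters ({{{1}}}, {{{2}}}, ...)
// into template source. Real template expansion does far more than this.
function expandTemplate(templateSource, args) {
    return templateSource.replace(/\{\{\{(\d+)\}\}\}/g, function (whole, n) {
        var value = args[Number(n) - 1];
        // Leave the placeholder in place if no argument was supplied.
        return value === undefined ? whole : value;
    });
}
```

For example, `expandTemplate("''{{{1}}}''", ["<span>''text''</span>"])` gives the interleaved string shown above, which can then be inspected before it ever reaches a page.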

  1. More details here: mw:Parsing/Replacing Tidy
  2. See mw:Help:Extension:Linter/html5-misnesting for details of why this happens.