Wikisource talk:Community collaboration/Monthly Challenge/November 2021
@Inductiveload, @TE(æ)A,ea., @Zoeannl: I think that I’ve almost finished setting up the MC for November. From the October experience, it seems that having more works increases the total number of pages proofread. From my observations, the number of pages proofread is high at the start of the month, holds steady for a while, and then declines rapidly. To me, this indicates that users proofread what they find interesting, and then the rest of the works either stall out or are proofread at a slower pace. The goal is to try to maintain interest. My hypothesis is that fewer works would lead to a shorter period of sustained interest. Therefore, the strategy should be to introduce more works and try to continue to recruit new users who might be interested in the stalled works.
One of the new features for next month is an indication of difficulty thanks to a suggestion from Zoeannl. In total, there are 17 works divided into the following categories of difficulty. As always, I’ve tried to select key works from English literature or scholarship to help build the core collection of enWS.
Easy – These works have good OCR and have simple formatting consisting mainly of font sizes, bold, italics, centering, and a TOC.
- The Common Reader by Virginia Woolf
- The Siege of London by Posteritas
- The American Novel by Carl Van Doren
- Tess of the d'Urbervilles by Thomas Hardy
- The Posthumous Papers of the Pickwick Club (First Edition) by Charles Dickens
Average – These works have references and indexes in addition to the formatting of Easy texts. Their language also tends to be more academic or to contain dialectal English.
- Colonization and Christianity: a popular history of the treatment of the natives by the Europeans in all their colonies by William Howitt
- The origin of continents and oceans by Alfred Wegener
- Middlemarch: a Study of Provincial Life by George Eliot
Poor OCR – These works would otherwise be in the Easy category, but the OCR is fairly terrible and will require significant correction.
- This Side of Paradise by Francis Scott Fitzgerald
Last Bits – This category could use a better name, but the idea is that this is an Index that has mostly been proofread, with a few more challenging pages remaining.
- The Gift of Black Folk by W. E. B. Du Bois
Second Glance – This category is for works that appear to have been proofread once, but require a careful look to make sure that the text matches the scan.
- The Secret Garden by Frances Hodgson Burnett
Formatting – This category is for Indexes imported from another site that also does proofreading, but wishes not to be named. They are fully proofread, but require checking to make sure that all the formatting is there. All pages will require the addition of headers and footers; however, most pages will require no other work.
- The Elizabethan stage (Volume 1) by E.K. Chambers
- The Jade Story Book: Stories From the Orient by Penrhyn W. Coussens (Fairy Tales seem to be in fashion recently and it's good to have more non-US/UK works)
- The collected works of Henrik Ibsen (Volume 1) by Henrik Ibsen
- Fragment of a novel written by Jane Austen by Jane Austen (Might have a few more errors than the previous ones)
Transclusion – These works have been fully proofread, but remain untranscluded. They are a great place for a user to learn about transclusion and reduce the backlog.
- Southern Historical Society Papers, Volume 3
- Whole prophecies of Scotland, England, Ireland, France & Denmark
Languageseeker (talk) 01:48, 29 October 2021 (UTC)
- I was skeptical at first, but I see now that more works tend to have more work done to them. This Side of Paradise will take Tesseract well; a note should be left for users to use that OCR rather than the default. The Jade Story Book is a nice find, but the text layer is foreign and no good. Also, Coussens only wrote one of the three books you attribute to him. Otherwise, I have no concern with this selection, other than the length of some of these works. The introduction of categories is a good idea, I think; especially the “easy” category. TE(æ)A,ea. (talk) 02:46, 29 October 2021 (UTC)
- @TE(æ)A,ea.: Happy to hear that you think that things are working out. What do you mean that "the text layer is foreign and no good" for The Jade Story Book? Languageseeker (talk) 03:01, 29 October 2021 (UTC)
- It is, right? The text layer should be OCR or match-and-split, but this text layer was split from one of your user sub-pages, without a source. I don’t like working with such source text, where I don’t know where the text came from and where I can’t be sure it matches the OCR. From what I have heard, others share my objection. I have been interested in a number of indexes you’ve created, but have been dissuaded by the pre-created text layer. TE(æ)A,ea. (talk) 12:49, 29 October 2021 (UTC)
- @TE(æ)A,ea.: Ah, let me clear up the mystery. The texts are imported from PGDP after they complete the F2 stage. On PGDP, the texts are proofread against the original scans and no corrections are made to them. So, these Indexes have been triple-proofread and formatted against a scan. Inductiveload wrote a script to use match-and-split to import these files into enWS. I use this script to replace the OCR of the scan with the PGDP text. Languageseeker (talk) 14:50, 29 October 2021 (UTC)
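(For anyone curious about the mechanics, the core of that replacement step amounts to something like the pywikibot sketch below. The page title and the f2.txt file are placeholders of my own; the actual match-and-split script presumably also handles the proofread-page headers, quality status, and batching across a whole Index.)

    import pywikibot

    # Sketch: push one page of PGDP F2 text into the corresponding
    # Page:-namespace page on English Wikisource.
    site = pywikibot.Site("en", "wikisource")
    page = pywikibot.Page(site, "Page:The Jade Story Book.djvu/15")  # placeholder title

    # f2.txt stands in for the per-page text exported from the PGDP F2 round.
    with open("f2.txt", encoding="utf-8") as f:
        page.text = f.read()

    page.save(summary="Replace OCR text layer with PGDP F2 proofread text")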
- (It seems my previous message got deleted by a server error; how unfortunate.) I don’t think it acceptable to import text of unknown origin from other websites to use as the basis of a scan-backed text. At that point, we cannot be sure of the value of local proofreading efforts, and the text might as well be imported directly from elsewhere; this is my concern. I had other comments, but I have forgotten them. Also, you have “Whole prophecies of Scotland, England, Ireland, France & Denmark” and “Southern Historical Society Papers, Volume 3” listed as being written by Coussens; could you fix that, please? TE(æ)A,ea. (talk) 16:22, 29 October 2021 (UTC)
- @TE(æ)A,ea.: The other site also uses a scan-backing system. They import their books from the IA and HT in the same way that enWS does, proofread them page by page 3 times, and then do two rounds of basic formatting. I'm importing a scan-backed text from one site to another. Languageseeker (talk) 19:09, 29 October 2021 (UTC)
- Yes, I understand that you are importing the text layer from elsewhere; it is this I oppose. TE(æ)A,ea. (talk) 19:16, 29 October 2021 (UTC)
- Can you tell me why you oppose? If the text is corrected from the same scan, is there any drawback? Genuinely curious. Languageseeker (talk) 19:18, 29 October 2021 (UTC)
- Yes. While I am not of that opinion, it is a generally held belief that works should not be imported en masse from other locations, such as Project Gutenberg. I do not believe that the text layer you have imported will be conducive to proofreading on Wikisource; thus, the end result will be a more or less direct importation of the text from Project Gutenberg (or whichever location) here. TE(æ)A,ea. (talk) 19:21, 29 October 2021 (UTC)
- I think that the problem with Project Gutenberg and similar sites is that they silently correct errata and printer mistakes. This leads to the creation of a new edition, and to errors that are much harder to catch. Distributed Proofreaders (PGDP), which creates many of the texts on Project Gutenberg, uses a scan-backed system that forbids any corrections to the original text during the proofreading stages P1–F2, so our two processes are identical. Then, after the F2 step is completed, the work is checked out by a Post-Processor, who creates the Project Gutenberg text and can also alter it. I import the text from the F2 stage, before any corrections are made.
- @TE(æ)A,ea.: I'm also generally against importing works in bulk, but PGDP archives the proofreading files after around 100 days, and at that point they are only available to administrators, who have made it clear that they will never share those files with enWS. Their stance is that the administrators own the work of their volunteers. Their sole purpose is to create PG texts, and after that the unaltered text has no further use. Personally, I don't think that it makes any sense to create two scan-backed copies of the same work. There are enough works out there that there is absolutely no need to waste volunteer time repeating the same work, which is why I am bulk-importing them before the uncorrected text gets deleted. Honestly, I wish that they would just share the work of their volunteers so that we could import the texts on an as-needed basis, but that's just not possible. Languageseeker (talk) 19:44, 29 October 2021 (UTC)
- Thank you for catching the two errors with the author attributions. I have fixed them now. Languageseeker (talk) 19:47, 29 October 2021 (UTC)
- As a proofreader, I don't feel there is any difference whether the text to be proofread comes from OCR or is imported from PG; I still proofread against the scan. From a brief glance, the scans match the text, so no problem. Also, since one of the main benefits of WS over DP is that we always have the scan, we will always be able to compare the text to the scan. Zoeannl (talk) 21:28, 29 October 2021 (UTC)
- I think "Last bits" should be named "Problematic" as this is the Page Status used for missing tables, images etc. Zoeannl (talk) 21:33, 29 October 2021 (UTC)