User talk:Harris7

What to do when a current Wikisource work has no source

Hi, I just started on Wikisource recently, and validated/corrected several pages of An Indiscretion in the Life of an Heiress by editing pages in the source text, for example: Page:Littell's Living Age - Volume 139.pdf/90

Now I wanted to validate/correct several errors I noticed in Ardessa, but there is no source.

I assume that is because there is no <pages> element in the article?

I have searched Wikimedia Commons for "The Century Magazine", 1918, but didn't find it.

Can I just edit the pages of Ardessa directly, or do I need to find a source PDF/DJVU to upload first?

Thanks! Harris7 (talk) 20:39, 19 August 2022 (UTC)

Harris7: The story “Ardessa” was transcribed from an off-Wikisource scan of the text (here), without using Index: or Page: (and thus no <pages>). You can either check the text here against the text at that scan, or make a request at the Wikisource:Scan Lab for someone to upload the index for you, so that you can check it against an index here. In short (to answer your question), it is your choice between the two options I gave (although it will probably be easier to work with a PDF/DJVU uploaded here). TE(æ)A,ea. (talk) 20:58, 19 August 2022 (UTC)
TE(æ)A,ea.: Thanks!
I'm curious: How did you find that link to the (off-Wikisource) scan file on Internet Archive? Harris7 (talk) 21:04, 19 August 2022 (UTC)
- Harris7: It was listed on the talk page (Talk:Ardessa) by the person who posted the transcription. TE(æ)A,ea. (talk) 21:21, 19 August 2022 (UTC)

Hyphenated words across pages

When a hyphenated word is split across two pages, simply leave it "as is". The software automatically joins the word when the Pages are transcluded. --EncycloPetey (talk) 20:35, 19 July 2023 (UTC)

EncycloPetey: Thanks, but when I exported Middlemarch to an epub (before I made this change), it left the hyphenated word in the text as "ambi- tion". Is this possibly a bug in the epub generation software? Harris7 (talk) 20:59, 19 July 2023 (UTC)
- EncycloPetey: Another data point: when I view chapter 1 here: Middlemarch_(1874)/Chapter_1 it also renders this as "ambi- tion" (about midway through the first paragraph).

If you are seeing an error, then you should report it in the Wikisource:Scriptorium, as that bug would affect hundreds of works. --EncycloPetey (talk) 20:44, 20 July 2023 (UTC)
There are two possible issues I can see that might be causing the problem:

(1) The first page is a transcluded section, not the full page, and that might be a unique problem to be solved.

(2) The header tag is being used instead of the full {{header}} template. --EncycloPetey (talk) 20:48, 20 July 2023 (UTC)
I've tried a number of things to try to identify what is causing the issue, but can't find it. A Scriptorium discussion might turn up the cause. --EncycloPetey (talk) 20:50, 20 July 2023 (UTC)

User:ShakespeareFan00 determined how to solve the issue. Apparently a carriage return before the end of section caused the hyphenation to fail to collapse. Once that carriage return was removed, everything worked as it should. --EncycloPetey (talk) 21:38, 21 July 2023 (UTC)
EncycloPetey: Thanks to you & ShakespeareFan00 for resolving this! I found another one at the end of this page: Middlemarch (1874)/Chapter 14. I tried the same fix, which fixed the page-end-hyphen problem, but broke the italics: it changed the last italic section to bold instead of italics (i.e. interpreted three single quotes as bold instead of single quote-italic text...) Help? Harris7 (talk) 20:24, 22 July 2023 (UTC)
There are a couple of issues there. (1) There is italicized text enclosed in single quotes, and with three quote marks in a row like that, the software interprets it as bold. The fix to that issue is to use {{'}}, which breaks up the group. (2) There were some carriage returns at the end of the first page. Sometimes the system automatically inserts a carriage return when you start editing a page, so if you spot odd behavior, going back to remove the inserted carriage return can fix the problem. It's all working now. --EncycloPetey (talk) 20:32, 22 July 2023 (UTC)
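The first fix can be illustrated with a small Python sketch. The helper name `quoted_italic` is hypothetical, and the sketch assumes the whole phrase is both italicized and enclosed in single quotes; it simply builds the wikitext by hand so that no run of three apostrophes appears.

```python
def quoted_italic(phrase: str) -> str:
    """Return wikitext for an italic phrase enclosed in single quotes.

    Writing 'phrase' in italics naively gives '''phrase''' which the
    software reads as bold. Using the {{'}} template for the quote marks
    breaks up the run of apostrophes.
    """
    return "{{'}}''" + phrase + "''{{'}}"


print(quoted_italic("Times"))
```

The output contains only pairs of apostrophes, so the italics markup is no longer ambiguous.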
On the issue of the hyphen-section space glitch, how quickly could a rule be found to run AWB over the Page: pages with the issue? ShakespeareFan00 (talk) 21:33, 22 July 2023 (UTC)
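One possible shape for such a rule, as a rough Python sketch rather than an actual AWB configuration. It assumes the affected pages end with a hyphenated word, then one or more line breaks, then a closing tag written as `<section end="..."/>`; the pattern and function names are hypothetical, and the regular expression would need adapting to AWB's own regex settings.

```python
import re

# Assumed shape of the glitch: a word ending in a hyphen, one or more
# line breaks, then the closing section tag at the very end of the page.
HYPHEN_NEWLINE_SECTION_END = re.compile(
    r"(?P<word>\w-)[ \t]*(?:\r?\n)+[ \t]*(?P<tag><section\s+end=[^>]*/>)\s*\Z"
)


def has_glitch(page_text: str) -> bool:
    """Report whether the page ends with hyphen + line break + section end."""
    return HYPHEN_NEWLINE_SECTION_END.search(page_text) is not None


def remove_line_break(page_text: str) -> str:
    """Move the section-end tag up so it directly follows the hyphen."""
    return HYPHEN_NEWLINE_SECTION_END.sub(
        lambda m: m.group("word") + m.group("tag"), page_text
    )


sample = 'the height of her ambi-\n<section end="chapter1"/>'
print(has_glitch(sample))
print(remove_line_break(sample))
```

Pages where the tag already follows the hyphen directly, or where the last word is not hyphenated, are left untouched by this sketch.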

Thank you

For the clearly very meticulous revision of pages in Solo. Many of your corrections are now being used in the OCR correction software I use to help proofread works. Would you be interested in just validating the pages, so that they can rise up in status? PseudoSkull (talk) 21:28, 1 November 2023 (UTC)

Hi PseudoSkull, thanks for the kind words! Sure, will do. I'm still a newbie, and clumsy/unsure with the Wikisource process; I'm hesitant to mark things "validated" when I feel I may have missed something while reading/proofreading.

I see from your talk page that you do software - are you a full-time developer too? I retired from SW dev a couple years ago, and still really enjoy working on code - mainly C/C++/C#/JavaScript. Regarding the OCR correction SW you mentioned - I assume this is your own app? Is it a Windows app, or ... ?

Also - speaking of OCR, I encountered and corrected hundreds of errors in my proofreading of Middlemarch (1874) this summer; I posted some followup questions/notes to Stamlou, who I thought was the person that did the OCR, but that was probably an incorrect assumption. Could you take a quick look at my questions here and post a reply here on my Talk page? Any guidance would be greatly appreciated! Harris7 (talk) 12:43, 2 November 2023 (UTC)

Hey, thanks for the long-winded response.

I am indeed a software developer by job and by hobby. Though, I wouldn't call myself full-time, more like a freelancer. I essentially do business for myself. My primary areas of interest in software are things like automated data entry, automated testing (especially with the Selenium package), APIs, data scraping, and data wrangling. I think my coding skills and my work on Wikisource are very well-aligned as well, which is another advantage. My time at Wikisource has actually inspired most of the learning I've done in terms of coding as well.

Since May of this year, I've been building an application (what is now more like a "power-user" app), called QuickTranscribe. The application was used to proofread Solo, the entire work that you're validating right now. Basically, the process of proofreading, processing, and submitting a work to Wikisource is an exceedingly time-consuming task, and honestly needlessly so. This combined with the fact that there's an endless sea of works that in theory need to be completed for this to become even close to a complete site means that the very long amount of time it takes to even get one work done is a huge problem.

So, my QuickTranscribe system aims to cut out as much of the tedious and repetitive work as possible, making the whole process dynamic, leaving the proofreader to actually proofread for 95% of their time on each project. I would even go as far as to say that QT has split the total work time on a transcription in half, at least. I went from getting one novel done every 3 days, to getting two novels done every day, which is something I honestly never thought possible until now.

The OCR correction software is part of the QT application I've been developing. It uses almost 2,000 lines of code just to correct the OCR that's there now, and that particular script is something I've been adding to for several years. Unfortunately, it can't detect everything. So, to answer your first question on Stamlou's talk page, the fact that there are hundreds of minor errors to be found in a single proofread work is a bit much, but not by that far. The human eye and purely logically-oriented software can only detect so much, out of the ~ 100,000 words in any given novel. If you're only finding one or two errors out of every five pages, that's fairly normal here.

The thing about OCR is that it's extremely unreliable for correctness, and it's really annoying to work with. I would never use it in work outside of Wikisource unless I absolutely had to. No matter what technology you're using, whether it's Tesseract, Google OCR, or ABBYY FineReader, it's all going to produce some very ugly results in specific spots. We only use it here because it's faster than typing the pages out manually...

So, there being a few errors in the work doesn't mark bad proofreading. On the other hand, if all the pages are littered with errors and marked proofread (and you'd know what that looks like if you saw it, because of how bad OCR is as previously mentioned), then that would be a problem. This is a user-generated project that is always by its nature a work-in-progress, so any transcription project sitting around always has room to be improved for accuracy and presentation, even if it's been fully validated. PseudoSkull (talk) 19:51, 2 November 2023 (UTC)

PseudoSkull: Hi again, I saw your 'thank you' for my "sic" marking of "Ada" (instead of "Aïda"), but after encountering several more instances of "Ada", I got the impression that the author did this intentionally, and that it is not a typo. It appears that he has the dog's owner using "Ada" in dialog, but the author still refers to it as "Aïda" in non-dialog text. So I removed the "sics": here and here. :-)

Speaking of Solo, could you add the running headers when you validate the text? That’s the main problem with the QuickTranscribe approach. TE(æ)A,ea. (talk) 20:28, 8 November 2023 (UTC)
TE(æ)A,ea.: I'm still a novice here, so before I make a mess, would you please confirm this is an appropriate running header? Harris7 (talk) 21:02, 8 November 2023 (UTC)
- Because the main text is the same, it should be {{rvh|{{{pagenum}}}|SOLO|SOLO}} if you were adding it in the index, and {{rvh|#|SOLO|SOLO}} (with the appropriate number) for manual addition. TE(æ)A,ea. (talk) 22:20, 8 November 2023 (UTC)
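For the manual case, the header text for a given page could be generated along these lines. This is only a sketch: the function name `running_header` is hypothetical, and it does nothing more than substitute the page number into the {{rvh|#|SOLO|SOLO}} form given above.

```python
def running_header(page_number: int, title: str = "SOLO") -> str:
    """Build the manual running-header wikitext for one page.

    Mirrors the {{rvh|#|SOLO|SOLO}} form, with # replaced by the
    printed page number.
    """
    return "{{rvh|" + str(page_number) + "|" + title + "|" + title + "}}"


print(running_header(12))
```

The result would be pasted into the header field of the corresponding Page: when validating it.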