Wikisource:Scan Lab/Archives/2021-11
Please do not post any new comments on this page.
This is a discussion archive first created in November 2021, although the comments contained were likely posted before and after this date. See current discussion or the archives index.
The pages are off, probably by more than one--I did not thoroughly investigate the offness.--RaboKarbakian (talk) 21:36, 9 November 2021 (UTC)
- IA-Upload's phab:T194861 strikes again. I just replaced it with the DJVU directly from the IA. Inductiveload—talk/contribs 21:54, 9 November 2021 (UTC)
- Thanks!! --RaboKarbakian (talk) 23:59, 9 November 2021 (UTC)
- This section was archived on a request by: RaboKarbakian (talk) 23:59, 9 November 2021 (UTC)
The following discussion is closed:
Resolved
Notifying all members of Scan Lab (more info · opt out): DJVU pages 218–257 are duplicates. Please remove them, delete DJVU page 255, and shift the pages and OCR layer accordingly. TE(æ)A,ea. (talk) 20:38, 10 November 2021 (UTC)
- @TE(æ)A,ea. can't you just mark them "×" or "dup" in the page list? Inductiveload—talk/contribs 20:39, 10 November 2021 (UTC)
- Inductiveload: I have done that; but please delete this duplicate page. TE(æ)A,ea. (talk) 20:43, 10 November 2021 (UTC)
- @TE(æ)A,ea. please don't omit pages from the page list by using the "from" and "to" fields. That just makes it possible to "lose" pages into the abyss of Special:LonelyPages. Just mark them out in the normal way and optionally set them to "without text". Inductiveload—talk/contribs 20:47, 10 November 2021 (UTC)
- This section was archived on a request by: Inductiveload—talk/contribs 10:46, 12 November 2021 (UTC)
Notifying all members of Scan Lab (more info · opt out): Can pages 145-148 be inserted after Page:The rise, progress, and phases of human slavery.djvu/156 from HT? Languageseeker (talk) 02:45, 11 November 2021 (UTC)
- @Languageseeker Done Inductiveload—talk/contribs 14:48, 11 November 2021 (UTC)
- Thanks!
- This section was archived on a request by: Languageseeker (talk) 01:05, 12 November 2021 (UTC)
The Millbank case
The following discussion is closed:
Fulfilled
Notifying all members of Scan Lab (more info · opt out): Can The Millbank Case be imported from HT? Languageseeker (talk) 02:52, 11 November 2021 (UTC) https://babel.hathitrust.org/cgi/pt?id=hvd.hxdj3y&view=1up&seq=11&skin=2021
- Doing… Inductiveload—talk/contribs 18:25, 11 November 2021 (UTC)
- Done Index:The Millbank Case - 1905 - Eldridge.djvu - needs pagelist, etc. Inductiveload—talk/contribs 19:14, 11 November 2021 (UTC)
- This section was archived on a request by: Inductiveload—talk/contribs 08:16, 18 November 2021 (UTC)
Notifying all members of Scan Lab (more info · opt out): Insert 470 and 471 from [1] after Page:Middlemarch (Second Edition).djvu/481. Languageseeker (talk) 00:35, 20 November 2021 (UTC)
- @Languageseeker Done Inductiveload—talk/contribs 11:04, 20 November 2021 (UTC)
- Thank you! Languageseeker (talk) 11:58, 20 November 2021 (UTC)
- This section was archived on a request by: Languageseeker (talk) 11:58, 20 November 2021 (UTC)
Notifying all members of Scan Lab (more info · opt out): Could this PDF be turned into a DJVU, please? TE(æ)A,ea. (talk) 21:06, 5 November 2021 (UTC)
- @TE(æ)A,ea. Where did it come from? The file info page is missing the source. Inductiveload—talk/contribs 21:59, 5 November 2021 (UTC)
- Inductiveload: I just finished scanning it before I uploaded it. TE(æ)A,ea. (talk) 22:03, 5 November 2021 (UTC)
- I have one word: xpdftools. That is all.--RaboKarbakian (talk) 22:33, 5 November 2021 (UTC)
- Well, pdfimages in this case (but that can go very wrong for some PDFs which are not simply one image per page)
- @TE(æ)A,ea. the reason I asked is that going to the source sometimes gets a better result and avoids piling compression on compression. The file is now at Index:Poor Cecco - 1925.djvu and needs page-listing and other tidying up. Because of the dark margins, the OCR will be poor in places. Doing better will probably need the original colour scan files and running them through, e.g., Scan Tailor. Inductiveload—talk/contribs 22:56, 5 November 2021 (UTC)
- Inductiveload: These are the original scans, collated from 10 parts. The tight binding pushes text into the gutter, so I had to use grayscale for scanning most of the pages. TE(æ)A,ea. (talk) 23:11, 5 November 2021 (UTC)
- @TE(æ)A,ea. tight bindings are sometimes just like that. If you are going to scan lots of such books, an "edge scanner" like an Opticbook 3600 can sometimes be scored for cheap on eBay which can help (you only need to open the book to 90 degrees).
- But even then, why does the binding necessitate greyscale specifically? It's not guaranteed colour would be better but it can sometimes be helpful when adjusting the thresholds to pull the dark areas "up". Obviously, it depends strongly on the images in question. Inductiveload—talk/contribs 23:24, 5 November 2021 (UTC)
- Inductiveload: With the scanner I was using, when I scanned in black-and-white (as opposed to grayscale), the guttered text was unreadable, because the background (which should have been white) was also displayed black. For example, on this page, the text in the gutter (which is quite a lot) would be completely unreadable if I scanned in black-and-white. I have created the pagelist for the index. TE(æ)A,ea. (talk) 23:33, 5 November 2021 (UTC)
- @TE(æ)A,ea. oh, it's a greyscale scanner? Makes sense then. Definitely scanning as bitonal will drastically reduce your conversion options, to say the least. As with all data processing, aim to keep as much information content as possible until the last possible minute, because once you reduce the resolution or colour depth, or compress it, that information is gone. Inductiveload—talk/contribs 23:39, 5 November 2021 (UTC)
- Inductiveload: That was my intention. There’s only a limited amount of storage space available, so I don’t want to waste it on highest-quality scans if that’s not strictly necessary (as it is for scanning images). For those, this scanner can manage 70–75 kb a page. TE(æ)A,ea. (talk) 23:51, 5 November 2021 (UTC)
- If your issue is local storage on your system, then I understand. But you should not worry too much about Commons storage, there's no shortage of that, and if there were it would not be WS books using it all (DjVus account for under 0.8% of file storage, and PDFs about 10%, and that's after bulk importing of literally millions of IA PDFs, which is 6 times less than JPEGs). Even if you end up making a bitonal scan of only 5MB (which is more than possible for a text-only book, maybe not this one), starting colour scans at a decent resolution makes it much more likely you can get an OCR-able result after post-processing. The scanner's own processing is usually extremely rudimentary. Even if you don't actually upload the images to Commons (which is a huge pain), I could receive via Dropbox, the Internet Archive, or something and run them through Scan Tailor when making the scan. Inductiveload—talk/contribs 23:59, 5 November 2021 (UTC)
- Inductiveload: I’m sorry, I meant 70–75 MB, making this PDF 16 GB (several times my available storage). For this work, I scanned the entire work in “standard”-quality black-and-white PDF, although I had to switch to grayscale to be able to read the guttered text. I then re-scanned pages with black-and-white images in maximum quality grayscale (~10 MB per page) and pages with color plates in maximum quality color (~70 MB per page). TE(æ)A,ea. (talk) 00:15, 6 November 2021 (UTC)
- Still, in terms of ability to post-process, scanning to a reduced-size colour scan is probably better than a greyscale file of the same size with higher resolution, because the information content of the colour channels is (probably) more useful for isolating text content than an incremental resolution bump.
- If you can, I'd say targeting, say, 2MB colour image would leave more options open than a 10MB greyscale, because most of that greyscale resolution will be simply thrown away when finally compressed. Inductiveload—talk/contribs 11:19, 6 November 2021 (UTC)
- Hmm. My assumption would be the opposite: resolution über alles, in the general case. But actual results will of course vary based on the ability to actually separate the text from the background, where colour channels may trump resolution. Xover (talk) 17:37, 6 November 2021 (UTC)
- @Xover up to a point, but when it's a text page, the OCR won't care much at all if the file is 1000x2000 or 4000x8000 (with 16x the pixels), but it will care a lot if you're feeding it images that don't have the background separable from the text. In pathological cases like this, you may need to separate the text from background manually anyway, rather than letting Tesseract do an auto threshold.
- Remember, at the end, you crush it into a DjVu or PDF at a few hundred kB tops per page, so all those pixels are lost anyway; they didn't actually get used if they didn't help the OCR. So while, yes, keep the data as long as possible, when there's a practical limit, choose the useful data first.
- For images, especially monochrome ones, then it's more likely that pixel count is more important. However, if there's a strong paper colour, the colour information is very useful at removing that background, whereas removing a mid-grey background is much harder, because you can no longer exploit the fact that "yellowish-orange" is to be removed. Removing a grey background will probably end up being pretty brutal to the edges of lines.
- Of course, it all depends on various things and it's very hard to make general statements, and for books without strong background or dark gutters it might not even matter. Inductiveload—talk/contribs 19:09, 6 November 2021 (UTC)
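As a rough illustration of the colour-versus-greyscale point argued in this thread (a minimal sketch, not anything that was actually run on these scans): the two pixel values, the 0.4 saturation cutoff, and the helper names `luminance` and `is_paper` are all assumptions picked for the example. It shows a dark yellowish paper pixel and a mid-grey ink pixel that end up at almost the same grey level, so no greyscale threshold can tell them apart, while the saturation of the colour pixel still can.

```python
import colorsys


def luminance(rgb):
    """Grey level of an 8-bit RGB pixel using the common Rec. 601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def is_paper(rgb, min_saturation=0.4):
    """Guess whether a pixel is coloured paper background.

    min_saturation is an assumed cutoff for this example only; real scans
    would need it tuned (or a smarter method entirely).
    """
    r, g, b = (channel / 255 for channel in rgb)
    _hue, saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return saturation >= min_saturation


# Hypothetical sample pixels, not measured from the Poor Cecco scans.
dark_yellow_paper = (150, 120, 40)
faded_grey_ink = (120, 120, 120)

# After greyscale conversion the two are nearly the same shade of grey...
grey_gap = abs(luminance(dark_yellow_paper) - luminance(faded_grey_ink))

# ...but in colour, the paper is strongly saturated and the ink is not.
paper_detected = is_paper(dark_yellow_paper)
ink_detected_as_paper = is_paper(faded_grey_ink)
```

The conversion only works in one direction: `luminance` can always be computed from the colour pixel, but the saturation cannot be recovered once only the grey level has been stored.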
the scans for Poor Cecco
I read some of the discussion about scans and scans for image work. I quit reading when I saw IL's justification was based on his way of restoring images. I took the scans: the pdf and the very large (width x length large) tifs for this project and worked with them.
As a restorer of images, from scans of aged printed texts to clean, ready to go electronic files those scans made me EEEEEK! and want to cry and wish I was home (some tapping together of heels may or may not have happened, but no red shoes were involved). The shared truth among all of the image restorers, a something we can agree on perhaps, is that we want to be the person to reduce the colors of the image. There are almost as many ways to reduce the colors as there are colors in the r,g,b palette and we want to choose the method of color reduction.
The tiffs provided for restoration purposes were indexed (mostly). Even though it worked and the restored images were fine, it is not acceptable. From 2 billion+ colors to 255 colors is too much for the image provider to be reducing the files to. That it is not acceptable can be seen on scans of the same images as downloaded from hathitrust (that were published in a magazine). I will not waste my time with those indexed images. Those have been additionally posterized and "optimized" for tesseract/ocr. To know the differences between those indexed tiff and the ones provided for this project is too much understanding to require the provider to have.
The non-indexed tiffs that were in the provided tiffs were grayscale. There are 3 image modes, this "mode" thing is about the palette being used. If the maker of the scans is concerned about file size, grayscale files are larger than rgb. I don't know the reason for this. png, tiff and jpeg will all save as this grayscale. It is better than indexed for working with, but it has the same quality as the indexed in that the color reduction has been accomplished using the person doing the scanning's methods and not using the person doing the restoration's methods. Grayscale, as nice as it is for monochrome files, is never the right choice. It is larger and who knows how the scan people reduced the colors to make it.
At the point that I stopped reading IL's explanation, it occurred to me that had the images been indexed with their colors (the yellow/pinkish paper, the flaws, etc.) he would have been able to select the yellows and manipulate his images the way he likes to. In fact, going from 10000s of yellows to 10s of yellows might have made his methods easier!
My reason, that image restorers want to reduce the colors their own way, is a truth that is true for all of us.
The ocr probably really loved those scans. Thank you for your time if you got to here.--RaboKarbakian (talk) 14:31, 17 November 2021 (UTC)
- This is the point - it's quite possible that it would be easier to remove "dark yellow" as background than "mid grey", even though the colour is actually darker, because you can exploit the extra information that "yellow = background". It's not always that simple, or even possible, but you can always go colour → greyscale if you want, but you can not go from greyscale → colour (that's against a law of nature). Inductiveload—talk/contribs 10:50, 20 November 2021 (UTC)
- This section was archived on a request by: Inductiveload—talk/contribs 11:50, 23 November 2021 (UTC)
Notifying all members of Scan Lab (more info · opt out): Please redact this scan and convert to DJVU, as follows: redact the West key on p. 277; redact from “Background:” on p. 277 to the end of the page; redact all of p. 278 except for the running header; redact from “12. Copyrights” to above the rule in the second column on p. 279; redact the West key on p. 288; and redact from “Background:” on p. 288 to the end of the page. Thank you. (This file is to be used in a copyright discussion on Wikimedia Commons.) TE(æ)A,ea. (talk) 16:48, 11 November 2021 (UTC)
- Done Index:Darden v. Peters - 2007.djvu. Please fill in the file info and the pagelists as usual.
- Also, please don't upload files with copyright content, even temporarily for the purposes of redaction. Wikisource doesn't accept copyrighted content for any reason (not even fair use), and knowingly uploading such content is not allowed. Any other file host (Dropbox, Google Drive, OneDrive, a small infinity of free file hosts, etc etc) would do in this case. Inductiveload—talk/contribs 07:52, 12 November 2021 (UTC)
- This section was archived on a request by: Inductiveload—talk/contribs 11:49, 23 November 2021 (UTC)
Selected Czech Tales
Notifying all members of Scan Lab (more info · opt out): The scan is available at https://catalog.hathitrust.org/Record/001370293. I have already downloaded pdf of the scan and converted it to djvu using an online converter, but after I uploaded it to Commons I found out that several pages were lost during the conversion for some reason. Can the file be replaced, please? --Jan Kameníček (talk) 15:25, 20 November 2021 (UTC)
- Doing… Inductiveload—talk/contribs 15:35, 20 November 2021 (UTC)
- Done: Index:Selected Czech tales - 1925.djvu. I uploaded locally because it seems like Pick didn't die until 1958 and it's a UK publisher. Inductiveload—talk/contribs 00:34, 21 November 2021 (UTC)
- @Inductiveload:Thanks very much! As for copyright, it imo does not matter where the publisher is from, but where it was published, and this work was published simultaneously in both the US and the UK. --Jan Kameníček (talk) 11:48, 21 November 2021 (UTC)
- @Jan.Kamenicek oh, OK, I thought it was a UK publication and I was too lazy to check. I'll move to Commons in that case. No action should be needed from you for that! :-) Inductiveload—talk/contribs 16:11, 21 November 2021 (UTC)
- This section was archived on a request by: Inductiveload—talk/contribs 08:42, 3 December 2021 (UTC)
Autobiography of a Catholic Anarchist
The scans are on the IA (Internet Archive identifier: autobiographyofc0000henn) and should be PD-US-No Notice. However, the IA has them locked behind a wall without a 14 day loan period, which would allow for a PDF download. Is there any way to retrieve this scan? Languageseeker (talk) 13:40, 20 November 2021 (UTC)
- @Languageseeker I'll have a go. There are two photos that need redacting, right? Inductiveload—talk/contribs 15:35, 20 November 2021 (UTC)
- @Inductiveload: Thank you. I'm not too sure about the photos because there is no copyright notice on the book. Unless there is a notice on the images, would this still fall under PD-No-Notice? Languageseeker (talk) 20:34, 20 November 2021 (UTC)
- @Languageseeker hmm, so the illustrations are "with permission of the Catholic Worker". Lots of renewals for Eichenberg and not many for Bethune. So I think to be sure, you'd have to find where the illustrations were first published and make sure it wasn't renewed. Not impossible, but I think I'll just redact them and if you (or anyone) want/s to do the research I'll put them back later. Inductiveload—talk/contribs 21:22, 20 November 2021 (UTC)
- @Inductiveload: Sounds like a plan. Why make our lives more difficult. :) Languageseeker (talk) 21:30, 20 November 2021 (UTC)
- @Languageseeker Done Index:The Autobiography of a Catholic Anarchist - 1954 - Hennacy.djvu. FYI, this wasn't especially easy and I do not (yet) have an automated push-button way to get such a work, so please go steady on requests for PD scans that are behind the IA loan-wall.
- Pagelists/verification, etc all required as usual.
- I'll try to keep the images around so I can regenerate with the illustrations (which are indeed quite interesting) if someone comes along with evidence that Bethune and/or Eichenberg's works are PD. Inductiveload—talk/contribs 11:28, 22 November 2021 (UTC)
- @Inductiveload: Huge kudos! I'm going to run it in the next MC :). Don't worry, I don't plan on flooding you with requests for such scans. Languageseeker (talk) 12:32, 22 November 2021 (UTC)
- This section was archived on a request by: Inductiveload—talk/contribs 08:42, 3 December 2021 (UTC)
Notifying all members of Scan Lab (more info · opt out): Insert 316 and 317 after Page:The history of silk, cotton, linen, wool, and other fibrous substances 2.djvu/355 from [2]. Don't shift text layer on enWS. Languageseeker (talk) 11:26, 23 November 2021 (UTC)
- Doing… Inductiveload—talk/contribs 11:36, 23 November 2021 (UTC)
- Done, but it looks like there's still misalignment in there from 443 onwards. I'll do the shift if you confirm what it is. Inductiveload—talk/contribs 11:45, 23 November 2021 (UTC)
- @Inductiveload For pages 445 - 517 shift by +2 Languageseeker (talk) 11:50, 23 November 2021 (UTC)
- Thanks, Done Inductiveload—talk/contribs 11:56, 23 November 2021 (UTC)
- This section was archived on a request by: Inductiveload—talk/contribs 08:49, 3 December 2021 (UTC)
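The "shift by +2" request in this section can be sketched as a small planning step. This is only an illustration under stated assumptions, not the tool that was actually used for the move: the function name `plan_page_shift` is hypothetical, and it assumes each page is renumbered one at a time, so the order has to guarantee that a page is never moved onto a number that is still occupied.

```python
def plan_page_shift(first, last, offset):
    """Return (old, new) page-number pairs in an order safe to apply one by one.

    For a positive offset the highest page must move first, otherwise page N
    would land on page N + offset before that one has been moved away.
    For a negative offset the lowest page goes first for the same reason.
    """
    pages = range(first, last + 1)
    ordered = sorted(pages, reverse=offset > 0)
    return [(page, page + offset) for page in ordered]


# The range and offset quoted in the request: pages 445 - 517, shift by +2.
moves = plan_page_shift(445, 517, 2)
```

Applying `moves` in list order starts with 517 → 519 and ends with 445 → 447, so no step overwrites a page that is still waiting to be moved.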