Index talk:History of Oregon volume 1.djvu


Better scan (?): Internet Archive identifier: historyoforegon00bancroft

Pages aligned and uploaded June 18, 2018. -Pete (talk) 22:21, 18 June 2018 (UTC)

Ref naming and page numbers


This work has many footnotes that span two or more pages. In order to avoid collisions (i.e., the same name being used for two different footnotes), it's essential to use unique names. In the months since I started working on this, I've consistently used the naming convention p<scan page number> -- for instance, <ref name=p123> for the 123rd page of the scan (even if the page number in the original book, printed at the top of the page, is 63). I don't know whether Wikisource has a requirement about this, but even if it does, I'd urge that other editors adopt the same convention I've used, so we don't have to redo all the pages I've already worked on. Let me know if you disagree. -Pete (talk) 18:41, 31 January 2020 (UTC)

@Peteforsyth: There is no set policy or practice on this: use whatever works. I, personally, tend to use the printed page numbers and find it has less potential for confusion (i.e. what happens to scan page order when the scan is updated?), but this is really up to the contributors on each work. --Xover (talk) 07:36, 9 February 2020 (UTC)
@Xover: Yes, I guess I wasn't as clear as I could have been. We've agreed to adopt the convention described above (using the scan number of the page) for naming refs. -Pete (talk) 20:19, 9 February 2020 (UTC)
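
For anyone who wants to check a page against the naming convention discussed above, here is a minimal sketch, not an established Wikisource tool. It assumes the wikitext of one page has been saved to a local file (the file name below is hypothetical) and that continued footnotes use the `follow` attribute. It lists every ref name or follow target that is not of the form p<number>.

```shell
# List ref names in a saved page's wikitext that do not follow
# the p<scan page number> convention.
# $1: path to a local file holding the wikitext of one page (hypothetical layout)
nonconforming_ref_names() {
    # Pull out each <ref name=...> or <ref follow=...>, quoted or not,
    # strip everything up to the value, then drop the conforming names.
    grep -oE '<ref +(name|follow) *= *"?[^" />]+' "$1" \
        | sed -E 's/^<ref +(name|follow) *= *"?//' \
        | grep -vE '^p[0-9]+$' \
        || true   # printing nothing is the good case, not an error
}
```

Running `nonconforming_ref_names page_123.wikitext` prints nothing when every name fits the convention.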

Reply to user talk note


Replying to this comment from @Shenme: (because we now have several of us working actively on this volume -- which makes me very happy -- also including User:Ernst76).

  • Thanks for acknowledging my comment above re: page numbering in ref names. I think if we all keep that in mind, we won't end up with any broken continuity.
  • As for closing up lines into paragraphs, it isn't required, and it doesn't bother me a bit if you skip it. I may do it when I validate a page, but as far as I'm concerned it doesn't really matter -- and I don't think Wikisource has any policies about it. I use the "TemplateScript" code, which gives me a handy link to do it in one click...so I generally do it out of habit.
  • For the footers, to reproduce the printed page in this particular work, just using {{smallrefs}} without any rule seems like the most accurate approach. But it doesn't bother me if there's a little inconsistency, since this is something that won't show up on the transcluded work anyway. If we want to be really perfect, maybe we could run a search/replace on AWB when we're done. I hadn't realized that four hyphens produced a different colored line...cool! I think I used that long ago, before I knew of the existence of the {{rule}} template. But like I said, I have no strong preference.

Thanks for the good questions, and for all the effort...this is a pretty important work of history, so I'm really pleased to see so much effort going into it. -Pete (talk) 09:42, 1 February 2020 (UTC)

@Shenme: I notice you've been doing a lot of routine (manual?) search-and-replaces for common scannos. Are you familiar with AWB (linked above)? It can be a great tool for taking care of them in an automated or semi-automated way. If you could come up with a decent list of what some of the common ones are, I could run search-and-replace using AWB at scale, and take care of a whole series of pages at once. Of course, some scannos are more easily rectified this way than others...for instance, sometimes "lie" should be "he", and other times it really should be "lie". But, if we can find the ones that are less ambiguous, we could make lighter work of these corrections. -Pete (talk) 21:54, 22 February 2020 (UTC)
I am definitely interested in the scannos that reveal the limitations of the conversions (and with particular text sources), but am more interested in those that are 'optical illusions' that we can't seem to see. Plus, as mentioned elsewhere, I'm interested in the 'quality' of the final results.
I do take particular typo instances and experimentally search for them, primarily to see how many of that one there are, and whether they do cluster on particularly bad pages. And so this is manual searches, as I'm trying to refine to exact patterns.
But, I don't see searching for chosen typos as a tool for use _now_, in correcting errors in original scanned pages before they've gone through the stages. It is too often that there are multiple errors on a page, and original pages need to be scrutinized, each as a whole, to catch all errors.
Rather, I am excited at the idea of using searches _after_ all review stages, when otherwise we'd say all is done.
Consider that I've found that *I've* missed some optical illusions, even being well acquainted with them. Here's one [1] and I know there have been a couple others. I do scrutinize, yet goof anyway! That's a wow!
So... consider that doing searches after all the reviews will both correct those and then give a feeling for how well all the pages have been checked. It will also catch pages reviewed while sleepy, distracted, drunk, etc. It is as a final 'semi-machine-powered' review that I'm thinking searches will be best used. Adding an extra bit of review is great. But not as busy-work now, but as an extra boost to quality before release.
Plus, I confess, I detest the use of tools when not accompanied with due regard for accuracy and possible noxious side-effects. I've seen too many instances of abuse over at WP. I have used AWB only twice, when someone misspelled parts of book titles in refs and copy-n-pasted them all over. One series was about 150 instances, another about 100 instances. I could be very specific in my searches then.
One reason I've been mentioning the errors - the frequently occurring patterns - is so that they could be harvested later and used in that search review. Oh, and if caution is given to others, great. ;-)
Sound reasonable? Shenme (talk) 00:03, 24 February 2020 (UTC)
@Shenme: Sorry to lose track of this discussion. Your approach does seem reasonable, yes. I don't tend to read a page too closely on the first proofread; for me the primary initial goal is to get the overall work to the point that it's pretty readable for the general public, and then improve it from there. The fewer errors there are to fix on a page, the more enjoyable the proofreading is to me, and the more likely I am to get drawn into reading it closely and catching more errors...so my suggestion was, I guess, tailored to my own workflow. For instance, if we could fix (say) 40% of common scannos by an automated process (while leaving the status at "not proofread"), then I could be more efficient and get greater enjoyment from manually fixing the remaining errors. But, I see your point, and I appreciate your keeping a record in the edit summaries -- yes, I'm game to try running AWB after pages are validated, too. Thanks for explaining your thinking. -Pete (talk) 19:02, 9 March 2020 (UTC)
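
The kind of count described in this thread (how often a suspected scanno occurs, and whether it clusters on particular pages) can be sketched in shell. This is an illustration only: it is not AWB and not anyone's actual workflow, and it assumes the page texts have been saved locally as one .txt file per page, which is a hypothetical layout. It only counts; it changes nothing.

```shell
# Count whole-word occurrences of a suspected scanno in each page file
# and print "count<TAB>file", worst pages first.
# $1: the suspect word, e.g. lie
# $2: directory of per-page .txt files (hypothetical layout)
scanno_counts() {
    for f in "$2"/*.txt; do
        [ -f "$f" ] || continue                 # no .txt files: glob stays literal
        # -o prints each match on its own line, -w restricts to whole words
        n=$(grep -ow -- "$1" "$f" | wc -l | tr -d ' ')
        [ "$n" -gt 0 ] && printf '%d\t%s\n' "$n" "$f"
    done | sort -rn
}
```

For example, `scanno_counts lie pages/` would show which pages contain the most whole-word instances of "lie". Each hit still needs a human decision, since some of them really should be "lie".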

Image uploads


@Ernst76: Thanks for uploading this map. I've been intending to do that myself, but waiting until we've located most of the maps. Great if you want to work on them in the meantime -- do you know how to access the full-resolution JP2 files from Internet Archive, and convert to PNG? I've done it for this one; let me know if you'd like some tips. -Pete (talk) 22:09, 9 February 2020 (UTC)

@Peteforsyth: Yes, I'd like to know the method of image insertion you're talking about. Please tell me where to read about the process, or describe it here if no clear help page exists. Thanks Ernst76 (talk) 02:26, 10 February 2020 (UTC)

@Ernst76: I have cobbled together my own process, which is not yet documented in any one place, but I've been meaning to write it down. So this is timely. I did create a video (here, or search "ikGjWKCmghY" on YouTube) but that was 5 years ago, and I've learned a few new things since then.

Here's a basic outline; I can flesh this out with screenshots and so forth in the future but hopefully this will get you started.

How to capture high quality image scans from Internet Archive


These instructions are for the following scenario:

  • A scanned version of a volume exists on the Internet Archive website
  • The volume contains images (photographs, maps, etc.) that you want to capture for use on a Wikimedia project
  • You have access to free Linux-based tools. This could mean you're using a Linux desktop computer, or that you've installed an operating system like Ubuntu Linux within Windows via the Windows Subsystem for Linux. (Fairly easy to do, but I'll let you find instructions elsewhere.) I believe the commands discussed below can be easily installed on a Mac as well, but I haven't tried it.

Note, there may be easier, graphical program-based approaches if you're just working on one or two images; this process is intended for situations where you want to process, say, a dozen or more files, where it would be tedious to do every step for each individual file.

Rough draft instructions

Steps:

  1. Determine which pages you want to capture. Possible methods:
    • While transcribing the volume on Wikisource, mark every page that contains an image as "problematic." Then make a note of the scan page number of every such page.
    • For a work that has images on most pages, you may simply wish to download and convert every page.
    • Work on one image at a time (maybe the best option for your first effort)
  2. On the Internet Archive page for the work (example), look for the section entitled "Download Options" and click the link "Show All". This will take you to a page that shows you all the different files and formats for that work available from Internet Archive.
  3. For many works, there will be two lines that contain the file extension ".jp2". This is the JPEG2000 format, which is good for creating really small files that preserve high quality images. The downside of this format is that there are few programs that work well with it. If you do not see that file extension, the bad news is, Internet Archive probably doesn't have very high quality versions of the images; but the good news is, the formats they're using (maybe TIFF, JPG etc.) are going to be easier to work with, so you won't need to follow all the steps here.
  4. Each of those two lines will contain a link that says "View Contents". Click one of those links. (It doesn't matter which one.) You will now see a list of scans of every page of the book. Each line contains a link ending ".jp2" which will allow you to download the high-res scan; next to it is a "jpg" link. That "jpg" link will allow you to preview the file in your web browser (at lower resolution), which is useful for confirming that you're downloading the correct page. Download every file that interests you. You will want these files to go into their own directory for later processing; you can either do that now, or create a directory and move them all from your "Downloads" folder when you finish downloading.
  5. Your next step will be on the Linux command line. You will need to install the package that contains the "opj_decompress" command (I will expand these instructions to include this part later). Navigate to the directory with your JP2 images, and run this command:
    opj_decompress -ImgDir . -OutFor PNG
    Now, wait a while; depending on many factors, each image will probably take a few minutes to convert.
  6. You will now have a directory containing a PNG and a JP2 for every page. (Notice how much more space the PNGs take up!) You can delete the JP2s if you like. Open one of the PNGs to confirm it came out how you expected.
  7. I recommend uploading to Wikimedia Commons in two stages. The first upload will contain the widest crop and the full resolution, so that those wishing to reuse the file in the future have the ability to make their own image adjustments, without having to go back to the original JP2. (example) The second upload (under the same file name) will be more tightly cropped and optimized for displaying on Wikimedia projects. (example)
  8. First upload (wide crop/minimal adjustments): Sometimes the Internet Archive version will contain a large portion of the scanning table behind the book, or will be rotated 90 degrees. If that is the case, it's best to address these basic issues before uploading to Commons. Do so in any image editor.
    • Optional: The PNG file saved by most image editors is not optimized, meaning it takes up a lot more space than it needs to. Prior to uploading to Commons, you may wish to run the following command (which can be run either on an individual image, or on all PNGs in a directory):
      optipng <filename>
      If you want to run it on all PNGs in the directory, simply replace "<filename>" with "*.png".
    • Considerations when uploading: File name, description, source, date, wikidata, categorization. I will expand these instructions to address this later.
  9. Second upload (more heavily edited): After uploading the first version to Commons, consider doing any or all of the following:
    • crop more tightly
    • adjust the levels (high contrast for black-and-white images, or with careful white and black point balance for photos)
    • convert to greyscale (or in rare cases, black and white)
    • eliminate moire by (optionally) increasing the resolution and then applying a gaussian blur
    • Consider using the "optipng" command described above.
  10. Upload the second version of the file using the "Upload a new version of this file" link on Commons.
  11. Transclude the file on Wikisource. Consider using a template like FI, which will scale the image nicely for different-sized screens. I will expand these instructions to address this later.

-Pete (talk) 19:14, 10 February 2020 (UTC)
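
One possible way to script the conversion in step 5, offered as a sketch under stated assumptions rather than as part of the process described above. It converts one file at a time with the `-i`/`-o` options of opj_decompress, so an interrupted run can be resumed without redoing finished pages, whereas the command in step 5 handles the whole directory in one go. On Debian and Ubuntu, opj_decompress is normally provided by the libopenjp2-tools package. The function names and the OPJ override variable are made up for this example.

```shell
# Name of the PNG that corresponds to a given JP2 path.
png_name_for() {
    printf '%s\n' "${1%.jp2}.png"
}

# Convert every .jp2 in a directory to .png, skipping pages already converted.
# $1: directory holding the downloaded JP2 files (defaults to the current one)
# OPJ (hypothetical variable) can name a different converter command,
# e.g. for a dry run; it defaults to opj_decompress.
convert_jp2_dir() {
    dir=${1:-.}
    for src in "$dir"/*.jp2; do
        [ -f "$src" ] || continue      # no .jp2 files: the glob stays literal
        dst=$(png_name_for "$src")
        [ -f "$dst" ] && continue      # already converted on an earlier run
        "${OPJ:-opj_decompress}" -i "$src" -o "$dst" || return 1
    done
}
```

Typical use would be `convert_jp2_dir ~/scans/oregon` (a hypothetical path), followed by `optipng *.png` in that directory if you want the optional optimization from step 8.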

Thank you @Peteforsyth:. I appreciate the time and effort you put into explaining this. It's a lot more work than what I've been doing, but I'm going to study and use this method because quality is what this is all about.

I'd like your assistance on another matter not related to the Oregon History project. Because this page should be limited to Oregon History, do you mind if I place it on your personal talk page? Ernst76 (talk) 00:39, 12 February 2020 (UTC)

@Ernst76: I'm glad it's helpful. And yes, it is fairly involved, but the results are worth it IMO. I've found that doing all the images for a work at once makes the process more efficient, so it somewhat offsets the time-consuming aspect. That's why I was saving these for last, as occasionally a new map would pop up when one of us proofreads a page. And, of course, you are always welcome on my user talk page! -Pete (talk) 20:46, 12 February 2020 (UTC)

HWS and HWE templates no longer needed


Hi @Ernst76: Regarding this edit, since Sept. 2018, these templates are no longer needed (at least in straightforward cases). If one page ends with "Wel-" and the next page begins with "comed", when transcluded, it will render properly as "Welcomed". I only learned about this recently. I don't think there's any problem with adding the template, but you needn't go to the trouble for these straightforward cases. (I do think it's still needed, though, for hyphenated words in footnotes that span pages, and other odd cases.) -Pete (talk) 20:36, 11 February 2020 (UTC)

@Hilohello: Are you aware of this? (re: this edit) -Pete (talk) 18:13, 4 March 2020 (UTC)
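
To get a feel for where the automatic joining described above would apply, a sketch like the following could flag locally saved page texts whose last non-blank line ends in a hyphen. It is only an illustration: the file layout is hypothetical, and it cannot tell a word broken across pages from a genuine trailing hyphen or dash, so each hit still needs a look.

```shell
# Succeed (exit status 0) when the last non-blank line of a file ends in "-".
# $1: path to a locally saved page text (hypothetical layout)
ends_with_hyphen() {
    awk 'NF { last = $0 }
         END { if (last ~ /-[ \t\r]*$/) exit 0; exit 1 }' "$1"
}
```

A loop such as `for f in pages/*.txt; do ends_with_hyphen "$f" && echo "$f"; done` would then list the candidate pages.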

Size of chapter headings (etc.)


@Shenme: Regarding this edit summary, I have no strong feeling on this. I think you have a good eye for detail...if you'd like to set out some style guidelines here on this talk page, I'd be happy to follow them. It would be ideal to have consistency throughout the work, so I'm open to your guidance. -Pete (talk) 00:23, 14 February 2020 (UTC)

Picnic at the falls?


@Ernst76, @Shenme: It looks to me like we might be on pace to get this volume fully proofread, or even validated, some time this spring. I'd like to propose, if you're game, that we have a little picnic on the bluff above Willamette Falls to celebrate when we get it fully proofread. Of course, I don't know if you're actually in Oregon, but it seems like a good guess! How does that sound? -Pete (talk) 00:40, 18 February 2020 (UTC)

I was trying for at least 4 pages a day, with a side effort at poking at some of the common scannos, e.g. "lie" -> "he" (skip first 4 matches)
Last couple of days I've been diverted with a pet project demonstrating the residual errors (elsewhere) even after 2 stages of review. ([2], [3]) I may have exhausted myself there...
I can come back to the 4 pages a day for a while. I'd also like to play with {{hi}} and friends in the authorities quoted section (xix - xxxix). A fair number of errors were _because_ entries would get smashed together or accidentally separated. For example, see the entry for "Stockton" at History_of_Oregon_(Bancroft)/Volume_1 - where does the previous entry end? Need those hanging indents! Shenme (talk) 07:20, 18 February 2020 (UTC)

@Peteforsyth: For me it would be a 1600 mile trip one-way. I live in the midwest. But thank you for offering.

Looks like you've got some interesting projects going, Shenme. Too bad a picnic is not in the cards, but it's heartening to see we're approaching the halfway mark. -Pete (talk) 17:10, 20 February 2020 (UTC)