User talk:Tarmstro99/Archives/2011
Please do not post any new comments on this page.
This is a discussion archive first created in , although the comments contained were likely posted before and after this date. See current discussion or the archives index.
Urgent help for Statutes at Large
I am a teacher, and in very dire need of having my class work on Statutes at Large Vol 18, part 2. George Orwell said to wait on proofing it because it's not right, but can you PLEASE get it fixed ASAP? - Tannertsf (talk) 01:02, 5 May 2011 (UTC)
- Not quite. Just minutes later I said I was mistaken and that you could carry on with Volume 18 (any part) as you wished. It was Volumes 65 thru 90 I asked you to stay away from until the pages per volume were verified, then created. Volume 18 does not fall between Volume 65 and Volume 90. — George Orwell III (talk) 01:23, 5 May 2011 (UTC)
Ok...sorry. I misunderstood. Thanks for the update. - Tannertsf (talk) 01:25, 5 May 2011 (UTC)
- The Index pages for Volume 18 do look as if they could use a little TLC. Feel free to commence proofreading any of the pages in the volume; I’ll tidy up a bit as time allows. Thanks to you and your class for any help! Tarmstro99 01:46, 5 May 2011 (UTC)
seeing where the bot is at
How can I see where your bot is (which index page)? - Tannertsf (talk) 03:39, 7 May 2011 (UTC)
- CHECK if red not-proofread pages exist in the Index: pagelists. Basically, if the page has been created or is in any other proofread status, it's fair game for further editing. George Orwell III (talk) 03:55, 7 May 2011 (UTC)
Hello! It was suggested by George Orwell III to come to you with questions about United States Statutes at Large. I have completed proofreading Chapters 53, 54, & 55 (the latter two being super-short) of V. 40/65th Congress, and would like to create Mainspace pages for them, but I don't know how you would prefer to title the pages; e.g.,
United States Statutes at Large/Volume 40/65th Congress/1st Session/Chapter 53
or
United States Statutes at Large/Volume 40/Sixty-Fifth Congress/1st Session/Chapter 53
or something else altogether...?
And with the following format? (See application)
{{header
 | title      = [[United States Statutes at Large/Volume 40/65th Congress/1st Session/Chapter 53|United States Statutes at Large]], Volume 40
 | author     = United States Congress
 | translator =
 | section    = [[United States Statutes at Large/Volume 40/65th Congress/1st Session/Chapter 53|Public Acts of the Sixty-Fifth Congress]], 1st Session, Chapter 53
 | previous   = [[United States Statutes at Large/Volume 40/65th Congress/1st Session/Chapter 52|Chapter LII]]
 | next       = [[United States Statutes at Large/Volume 40/65th Congress/1st Session/Chapter 54|Chapter LIV]]
 | notes      =
}}
{{USStatCols start}}
<pages index="United States Statutes at Large Volume 40 Part 1.djvu" from=295 to=306 fromsection="chap53" tosection="chap53"/>
{{USStatCols end}}
<!--{{rule}}{{smallrefs}}-->
Or some other way? Thank you! Londonjackbooks (talk) 05:22, 20 May 2011 (UTC)
- Awesome work, and thanks for your help! Your application of the format looks just right. When posting in the main namespace, please use the form United States Statutes at Large/Volume 40/65th Congress/1st Session/Chapter 53. The {{USStatHeader}} template is set up to look for ordinal numbers (“65th,” not “Sixty-Fifth”) when creating its links to the Congress, Session, and Chapter of legislation in the Statutes at Large. Let me know if you have any further questions, and thanks again for contributing to the project! I’ll see about updating the Vol. 40 index page to make it a little more descriptive; I’ve been getting lots of practice with that lately. Tarmstro99 16:59, 20 May 2011 (UTC)
- It's a win-win... Mr. Orwell thought "proper context" might be in order with pointing to sources from a book I am transcribing, and it gives me some practice with different formatting to boot... My brain is turning to mush with legal terms now, however, but I might be able to blame that on being sick...?
USStatChapHead template
- I also had another question that I briefly asked Mr. Orwell about... He may have mentioned it... Much ado about "nothing", but... With regard to the {{USStatChapHead}} template: as of right now, it renders sidenotes in the following manner... (ignore *edit mode* formatting I used for illustration):
[H. R. 4961.].
[Public, No. 41.]
- ...using the following:
{{#ifeq:{{{side}}}|left|{{USStatSidenote|L|{{bottom border|{{{date}}}.}}{{#if:{{{datenote}}}|<br />{{{datenote|}}}| }}}}|{{USStatSidenote|R|{{bottom border|{{{date}}}.}}{{#if:{{{datenote}}}|<br />{{{datenote|}}}| }}}}}}
{{#ifeq:{{{align}}}|centered|<center>}}{{sc|Chap}}{{#ifeq:{{{chapnum}}}|1|{{sc|ter}}|.}} {{{roman}}}.—''{{{title}}}''{{#if:{{{footnote|}}}|<includeonly>{{#tag:ref|{{{footnote}}}}}</includeonly>|}}{{#ifeq:{{{align}}}|centered|</center>}}
- Is there a way to reformat the template to (1) get rid of the period outside the brackets
[H. R. 4961.].
(2) center the bottom note [Public, No. 41.]
and (3) get rid of the line space between the line ( {{rule}} ) and the bottom note so that it more closely resembles the original? It's nit-picky, I know, but I am merely curious... Thanks! Londonjackbooks (talk) 20:47, 20 May 2011 (UTC)
- What I know about template programming could fill a thimble; the hack-jobs that produced {{USStatChapHead}} and the other templates I cobbled together for the Statutes at Large project consisted almost entirely of copying other people’s similar works that seemed to be getting the job done. I see that User:George Orwell III has been doing some experimenting with the Statutes at Large template families lately; has that work addressed your concerns at all? Tarmstro99 01:10, 24 May 2011 (UTC)
- Like I know what I'm doing or something? <chuckle>
- I will take a look at the header one next, but I need to know if the changes to the sidenotes one are acceptable, or at least show no difference when compared with the previous coding, which called 6 or 7 other templates in the process? -- George Orwell III (talk) 01:16, 24 May 2011 (UTC)
- From a layman's point of view (from one who knows nothing about template programming), the sidenotes look good to me (in "Dynamic layout" as well)... Print previews well, but for minor overlapping due to (I assume) a larger width setting for body text in print mode as opposed to WS viewing... For what it's worth, and thanks! —Londonjackbooks (talk) 02:52, 24 May 2011 (UTC)
- Thanks - I had hoped as much would be the case since my last. The thing in print preview is just not as obvious, but unfortunately it is still present in the other layouts as well. First of all, these dynamic layouts "grab" up to 3em (mostly on the left side) of margin space to hold the links back to the individual djvu pages. This causes the basic formula for USSaL page presentation after transclusion to also be off by at least 3ems, without even beginning to add in all the other template, layout and font factors also at work or in play. I'll keep the print preview in mind from now on along with the rest of the corresponding layouts, but I don't think there will be much that can be done to ensure those displays work in every instance at the same time as all the rest. -- George Orwell III (talk) 03:13, 24 May 2011 (UTC)
- ...I didn't bother with bringing up PDF-preview...Figured I shouldn't even "go there" ;) ...While realizing/agreeing that the text should be rendered as closely as possible to the original, those sidenotes sure are a pain!...Moreso for you than me, though, granted... Londonjackbooks (talk) 04:00, 24 May 2011 (UTC)
┌───────────────────────┘
Alright, I've started work on a more "robust" Chapter Header (see {{USStatChapHead/sandbox}} ). It might not display correctly for some in the main namespace without the USStatCols start & end templates in place (basically same as using the current 'Layout 2'), but I'm on-again & off-again working on a USSaL Dynamic Layout at the same time to hopefully replace the need for any of those fixes and/or templates currently in place one day soon. Anyway, here are some examples of the test template already applied. . .
Let me know how it looks or if you have any suggestions, etc. I've taken the opposite approach from the existing template and started with something found in more recent volumes so I can work "backwards in time" through the various layouts used in earlier and earlier volumes. It never made sense to me why a modern baseline was not set up first, just "undoing" all the additions and/or changes in format as needed on the way back down to volume one (newer volume = more complex, older volume = more simple), but no use kicking the cat now after it drank the anti-freeze, right? ;) — George Orwell III (talk) 11:00, 27 May 2011 (UTC)
Deletion of Sherlock Holmes short story
Hi! Was just poking around here briefly, and discovered that The_Adventure_of_the_Three_Garridebs had been deleted for copyright violation. According to Wikipedia's article on Conan Doyle, and the newspaper citation that backs it, the author of this story died in 1930; so seventy years after that makes 2000, which should mean that the work is out of copyright. Is this correct?
(Please respond on my talk page, as I won't come looking here for a response.)
Rosuav (talk) 07:11, 3 June 2011 (UTC)
- answered on user's talkpage -- George Orwell III (talk) 08:14, 3 June 2011 (UTC)
USSaL v.65 to v.94
fyi... Volumes 65 to 94 (1951 to 1980) are now available on FDsys - GPO Access — George Orwell III (talk) 05:44, 22 April 2011 (UTC)
- What a great find! I’ll get straight to work djvu-izing and making rudimentary index pages. Many thanks! Tarmstro99 15:58, 22 April 2011 (UTC)
- No big deal. I was being told over and over again to be patient; they're coming, blah... blah... blah. I guess with the wind-down of the old GPO-Access sites nearly complete, they finally got around to putting these up. No notice has been given to the public-at-large as far as I can tell and why they are hosted in the azzcrack of the site instead of along with the existing 4 or 5 volumes, I cannot say. More info when I get it. — George Orwell III (talk) 16:19, 22 April 2011 (UTC)
- There is undoubtedly some opportunity here for sites like WS to teach the professional library community a thing or two about ease of access, having all the content in a single series available in one place, etc. At any rate, for now I am happy simply to have the content. The sooner we can get away from this GPO Access foolishness about only allowing PDF downloads of one Public Law at a time, the better off we will be. On the flip side, nice to see the PDFs bear a certification of authenticity from the GPO! Must figure out whether it's possible to preserve that when doing the djvu conversion. It will be nice when we can have downloadable PDFs of the Statutes at Large built from text, not just scanned images which must then be OCR'ed and proofed! :-) Tarmstro99 16:29, 22 April 2011 (UTC)
- I don't think the Certification will remain once the associated doc is no longer a PDF. We can probably keep the textual part in the djvu's metadata, but there is little reason to, since the embedded online verification part becomes disabled when tampered with (conversion in this case). Volumes 117 to 120 were "born digital" and have a text layer in place - no need to run OCR in those instances - but it is unlikely that anything prior to 1997 or 1998 will ever be re-done that way. It's not as bad as it was even a year ago. If you extract the text layer with something like DJVULibre and run it through a simple spell check before re-insertion & upload, the proofreading becomes more about formatting than an editing task on top of adjusting layout, format, etc. It's more time intensive to do initially, but the payoff means less proofreading needed throughout. — George Orwell III (talk) 18:11, 22 April 2011 (UTC)
Please, for the love of Christ, can you hold off on running that bot that creates the pages for a week or two? I'd like to ensure all the pages are present and in contiguous order before locking in the Page: namespace, rather than having to move dozens and dozens of pages afterwards when duplicates or drop-outs are found. — George Orwell III (talk) 00:26, 3 May 2011 (UTC)
- OK, I’ll wait before uploading any additional pages. The content uploaded thus far has been extracted from the GPO’s original PDFs (not re-OCR’ed), so whatever quality controls they put in place, if any, still apply here. I’ll continue creating the index pages if you have no objection to that. Should it be necessary to move or delete pages after the fact, that’s also a task that can be assigned to User:TarmstroBot. Tarmstro99 00:36, 3 May 2011 (UTC)
- Again, please don't misunderstand - by all means continue uploading the new volumes and creating the Indexes for each - just don't run the bot creating each page (making hundreds of red not-proofread status pages) afterwards. This way I can double check the pagelist and trim (or replace bad pages, i.e. Page:United_States_Statutes_at_Large_Volume_65.djvu/30 ) before you run that bot. I am trying to avoid the pitfalls present with the existing volumes is all, & will work closely with you to ensure that doesn't happen this time around. For example, volume 70 has no OCR-induced text layer; what would happen if you created those pages with the Bot? Wouldn't that mean 1600 blank pages? — George Orwell III (talk) 00:46, 3 May 2011 (UTC)
- The bot isn’t drawing text from the (non-existent, as you have observed) DjVu text layer, but rather from the text included in the GPO’s source PDFs. To explain, the workflow I have been following (with some purely clerical steps, such as file renaming, omitted) looks like this:
{| class="wikitable"
! Step !! Input !! Tool !! Output
|-
| 1 || source PDFs downloaded from FDsys || pdfimages || grayscale images of scanned pages, 1 per page of original source file, in PPM format
|-
| 2 || output from Step 1 || pamthreshold || black-and-white images of scanned pages, 1 per page of original source file, in PBM format
|-
| 3 || output from Step 2 || cjb2 || black-and-white images of scanned pages, 1 per page of original source file, in DjVu format
|-
| 4 || output from Step 3 || djvm || single bitonal DjVu file containing all pages of source
|-
| 5 || source PDFs downloaded from FDsys || pdftotext || extracted text of source PDFs, 1 UTF-8 encoded text file per page of original source file
|-
| 6 || output from Step 5 || shell script || single concatenated file containing all extracted text in format suitable for parsing by pagefromfile.py
|}
- Apparent garbage pages: the pamthreshold tool tries its best to find black pixels in a grayscale image. When the grayscale image is very light (for example, a photocopy of a blank page where text appears only on the opposite face), it will keep adjusting the “contrast” level until what you are seeing is the facing text “bleeding through.” That’s why Page:United States Statutes at Large Volume 65.djvu/30 is just a mirror of Page:United States Statutes at Large Volume 65.djvu/29, Page:United States Statutes at Large Volume 65.djvu/814 is just a mirror of Page:United States Statutes at Large Volume 65.djvu/813, and so forth. It’s just the way the pamthreshold tool converted the source grayscale images to black-and-white.
- Extracted text: because pdftotext takes its text directly from the source PDF file, our text is exactly equivalent to FDsys’s; no better or worse. If you want to do further post-processing (spell checks, cleaning up common OCR errors, etc.) before uploading to WS, the place to do it is probably after Step 6. If you want to see an example of a Step 6 output file (ready for subsequent processing or upload by the bot), here is a sample: Volume 69. Tarmstro99 14:39, 3 May 2011 (UTC)
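For reference, the Step 6 output format (the file pagefromfile.py ingests) can be approximated like this. The {{-start-}}/{{-stop-}} delimiters and the bolded-title convention are pywikibot's pagefromfile defaults; the Page: title pattern is an assumption based on the project's naming, not the bot's confirmed scheme.

```python
# Sketch of a Step 6 concatenator: join per-page text into one file in the
# {{-start-}} / {{-stop-}} format pywikibot's pagefromfile.py expects.
# The Page: title pattern is an assumption, not the bot's confirmed scheme.
import io

def concatenate_pages(page_texts, volume=69):
    """page_texts: iterable of (djvu_page_number, page_text) pairs."""
    out = io.StringIO()
    for num, text in page_texts:
        title = f"Page:United States Statutes at Large Volume {volume}.djvu/{num}"
        out.write("{{-start-}}\n")
        out.write(f"'''{title}'''\n")   # bolded title marks the target page
        out.write(text.rstrip() + "\n")
        out.write("{{-stop-}}\n")
    return out.getvalue()
```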
- OK, thanks - I better understand all this & will forget the reverse-grayscale rich pages too. No need to change anything except to slow it down, because I think you need to add the additional step of editing the Header/Footer fields on the Index page as the bot goes through each section, to insert some sort of header/footer at creation. Please take a closer look at the pages you created today and you'll see that you've added a non-included header with page# left -- page title -- page# right for each one automatically. If you had paused the bot and edited the header field on the Index page, you could have further tweaked the non-included header to mirror the actual header text found in each section of the volume instead of the defaults I set up yesterday. (That's what I meant below when I wrote up to p.1314.)
- Update: You can keep going with that BOT now in volume 68 part 1 up to djvu page 1314 where we'll need to adjust the automatic header insertion text on the Index page. — George Orwell III (talk) 03:40, 3 May 2011 (UTC)
- For works such as this one that do not have a consistent running header throughout the document, I am wondering whether the effort involved in setting up a header parameter on the index page is really justified. Having to change the headers repeatedly interferes with uploading of the text without really easing the proofreading process, since the auto-generated headers will still need to be changed when the pages are eventually raised from red to yellow status. It’s nice to have running headers that match the corresponding page image, IMHO, but not at the cost of the underlying textual content of the works. Just my 2 cents, of course. Tarmstro99 20:57, 3 May 2011 (UTC)
- Question -- what's the story with Volume 68A: the Internal Revenue Code of 1954 ? — George Orwell III (talk) 04:45, 3 May 2011 (UTC)
- Doesn’t seem to have been included in the GPO’s scans; a substantial oversight (if it was an oversight). There are a few other similar bound volumes containing especially lengthy legislation, e.g., Volume 70A. These may possibly be available from other sources (Google Books, perhaps?); if so, we should migrate them over in the interest of having a complete collection. I can also check to see whether they are available in our microfiche collection and can scan them if so. Tarmstro99 21:01, 3 May 2011 (UTC)
- Ummm.... look again [1]. That's why I asked. — George Orwell III (talk) 21:17, 3 May 2011 (UTC)
- Oh, I see now. All right, I will add that to my to-dos. Tarmstro99 21:23, 3 May 2011 (UTC)
- No worries. That's partly why I'm here - to make you look omnipotent, all-knowing, etc. :) — George Orwell III (talk) 22:08, 3 May 2011 (UTC)
- Surely a hopeless task! :-) Anyway, here you are; please let me know if it is OK to upload the text extracted from the original PDF. Tarmstro99 00:44, 4 May 2011 (UTC)
- Sorry for the delay - 68A, 70 and 71 all check out. Let it roll on those 3 three for now. — George Orwell III (talk) 01:43, 4 May 2011 (UTC)
┌───────────────────────┘
You can add volume 70A to that run. I'll try to verify 3 volumes (or parts thereof) every day but no guarantee I won't miss a day or two at some point. I will continue reporting what volumes are ready for BOT page creation here and if you decide to go past what's been verified and setup because real-life got in the way - so be it I guess. — George Orwell III (talk) 20:43, 4 May 2011 (UTC)
- FYI - I am having major browser issues after some push update I had to accept. I can barely post plain text in the main namespace and get nothing but script errors in Index and Page namespaces in edit mode. I don't know if I can "fix it" over this weekend (Mother's Day), so if you're not willing to wait, or would rather edit the pagelists somewhat to at least get the correct {{{pagenum}}} into the headers/footers yourself - go ahead & run the bot I guess. George Orwell III (talk) 03:49, 7 May 2011 (UTC)
Ok, I'm back (sort of -- still not 100%) and so far today Volume 72, Parts 1 and 2 are ready for your BOT run. Quick question though - wondering if there is some way to add a default prefix letter to the pagelist numbers for scanned page numbering such as the kind found in volume 72 part 2, all sections.
Not a big deal but it would be nice if the index page at least reflected the actual numbering without having to assign each page a specific value manually George Orwell III (talk) 22:16, 10 May 2011 (UTC)
- All the files from FDSys are now online at Commons, and I'll create the few remaining Index pages shortly. Thanks for your help updating the scan list template. Unfortunately, I don't know of any way to tell <pagelist> to number pages A1, A2, A3 … An, so ultimately I suspect that will mean a fair quantity of manual overrides. (It should not be hard to write a short script to generate the necessary "pagenum = value" pairs for the ranges at issue and dump them into a text file, so it's not doomed to be an exercise in pure drudgery.) When all the necessary Index pages are online, I'll start reformatting them working backwards from Vol. 94 and perhaps we'll meet somewhere in the middle. Tarmstro99 01:05, 11 May 2011 (UTC)
- Working from opposite ends until we meet seems fine, as long as the BOT is run the same way after the pagelist is broken out first (even if it's with just a dummy header creating the correct page number). But before you do that, you might consider nailing the other 5 volumes at FDsys first. I already toyed with one or two of those in the early days when I knew less than the little I know now, so I might need to dump those bits & pieces both here and at Commons. George Orwell III (talk) 01:21, 11 May 2011 (UTC)
- Am I correct that volumes 117 et seq. are not available for download as complete volumes? I see many individual files representing particular pieces of the volume, and I can probably devise a script to grab all of them that are available, but putting them together into a complete volume will be somewhat labor intensive. Tarmstro99 12:39, 11 May 2011 (UTC)
- No, you are incorrect. Always try to reach the top-tier 'Content Detail' page. FDsys is a pain for some to navigate because of its layout. I've become somewhat familiar with digging in and drilling down for their gold is all. George Orwell III (talk) 14:21, 11 May 2011 (UTC)
- Aha! OK, let's see what can be done with those remaining volumes, then. Tarmstro99 14:50, 11 May 2011 (UTC)
- Update. OK, volumes 117–121 are now online. I had to upload each of them as just a single massive file (3 to 4 thousand pages each) because the FDsys PDFs did not include separate cover page images for the split parts. Probably not important. These ones do contain the embedded text extracted from the source PDFs and the image quality should be very high (600dpi versus 150dpi for volumes 65–94). Tarmstro99 00:46, 12 May 2011 (UTC)
┌───────────────────────┘
That's good to hear. Vol 73 & 74 have been aligned. George Orwell III (talk) 02:11, 12 May 2011 (UTC)
Volumes 75, 76 and 76A have been aligned & ready for BOT run page creation. -- George Orwell III (talk) 02:19, 15 May 2011 (UTC)
Update: FYI, I’ve worked my way down through Volume 80; just a few more to go. I’ll also make a point to fix the pagination of the indexes (A1, A2, A3, etc.) in the volumes you have already completed. Tarmstro99 18:09, 26 May 2011 (UTC)
USSaL v.95 to v.116
UPDATE...
Volumes 95 to 116 (1981 to 2002) are now available on FDsys - GPO Access as well !!! — George Orwell III (talk) 22:16, 14 June 2011 (UTC)
- Great news! I’ll set to. Tarmstro99 20:38, 15 June 2011 (UTC)
Text layers
Ummm... 150dpi seems too low for any effective OCR'ing to take place (at least as far as the free online services go) in creating a workable text layer. Same thing happens with the latest vol. 96's uploads.
- Unless you have something else up your sleeve on how to get around this? -- George Orwell III (talk) 16:20, 17 June 2011 (UTC)
- Agreed, I wish the GPO had posted 300+ DPI scans, but this is the source material we have to work with, regrettably. The source scans posted at FDsys are 150 DPI grayscale PDFs, which are much too big to host on Commons, so I am converting them to 150 DPI black-and-white DJVUs. Luckily for us, however, we don't need to OCR the DJVUs to produce a text layer (which would, as you surmise, probably yield pretty poor results), because there is a text layer embedded in the source PDFs from FDsys that I can extract and upload. (I’ll have my bot begin uploading Vol. 95 later today). That text layer gets lost in the PDF-to-DJVU conversion the way I do it (a problem I lack the technical know-how to fix; I’m just not certain how DJVU stores text layers), but the text can be extracted from the source PDFs and uploaded here to give us a relatively firm starting point for proofreading. Tarmstro99 16:47, 17 June 2011 (UTC)
- I personally don't get into that whole shtick about OCR'd vs. embedded text layers - nobody seems to consistently practice a good standard either way. The loss during conversion is not all that uncommon. I don't know much about what is going on underneath in certain PDFs, but the DJVU standard is a bit overkill for what we (currently) do on Wikisource. Basically it's nothing more than coordinate-mapping using parameters such as page, region, column, paragraph, line and word, in an X-min, Y-min, X-max, Y-max format. So if a typical .DJVU page in a bundled index is 3886 x 5450 at 600 dpi, your text layer, no matter how it got there, would look something like the following to start with:
# -------------------------
select "front0001.djvu"
set-txt
(page 0 0 3886 5450
  (column 1208 3829 2560 4753
    (region 1208 3829 2560 4753
      (para 1208 3829 2560 4753
        (line 1600 4601 2156 4753
          (word 1600 4601 2156 4753 "CODE"))
        (line 1208 4217 2556 4373
          (word 1208 4217 1476 4369 "OF")
          (word 1652 4221 2556 4373 "FEDERAL"))
        (line 1208 3829 2560 3989
          (word 1208 3829 2560 3989 "REGULATIONS")))))
  (column 1336 2072 2427 2316
    (region 1336 2072 2427 2316
      (para 1336 2072 2427 2316
        (line 1336 2240 2427 2316
          (word 1336 2240 1545 2313 "TITLE")
          (word 1599 2240 1762 2315 "3--")
          (word 1758 2241 1921 2314 "THE")
          (word 1976 2242 2427 2316 "PRESIDENT"))
        (line 1366 2072 2408 2166
          (word 1366 2088 1839 2163 "1936-1938")
          (word 1897 2072 2408 2166 "Compilation"))))))
- As you may have realized from the above, most of this data is useless on Wikisource, because all the Page: namespace currently does is dump the text with typewriter-style line returns, ignoring the paragraph, column and remaining values altogether -- leaving us to (re)format a text layer that is already somewhat formatted for us. On Wikisource, exactly the same output and effect as the bloated layer above can be achieved with only:
# -------------------------
select "front0001.djvu"
set-txt
(page 0 0 3886 5450
  (line 1600 4601 2156 4753 "CODE")
  (line 1208 4217 2556 4373 "OF FEDERAL")
  (line 1208 3829 2560 3989 "REGULATIONS")
  (line 1336 2240 2427 2316 "TITLE 3-- THE PRESIDENT")
  (line 1366 2072 2408 2166 "1936-1938 Compilation"))
- To summarize: as long as the Page: namespace relies on "dynamic" layouts for main-namespace transclusion and display, rather than on the padding, margin, column, etc. parameters assigned in any good .DJVU's text layer per the .DJVU file standard, you can map PDF-embedded text anywhere within the DJVU page's width (x-axis; 0, 3886) and height (y-axis; 0, 5450) coordinates and get the same results as if the original scans were perfect and the OCR routine used was just as flawless. -- George Orwell III (talk) 20:29, 17 June 2011 (UTC)
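The "minimal" layer sketched above — line records only, with the column/region/para/word nesting dropped — can be generated mechanically. A sketch under the assumptions stated in the discussion (coordinates as xmin ymin xmax ymax, one record per printed line); the function name and its inputs are illustrative:

```python
# Build the stripped-down set-txt body described above: one (line ...) record
# per printed line inside a single (page ...) wrapper. Escaping is minimal
# (backslashes and double quotes only).

def minimal_set_txt(page_w, page_h, lines):
    """lines: iterable of ((xmin, ymin, xmax, ymax), text) pairs."""
    body = [f"(page 0 0 {page_w} {page_h}"]
    for (x0, y0, x1, y1), text in lines:
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        body.append(f'  (line {x0} {y0} {x1} {y1} "{escaped}")')
    return "\n".join(body) + ")"
```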
- I just tried your Volume 97 .djvu with a FDsys-downloaded PDF local conversion @ 150dpi and the text layer fit perfectly (though if you could refrain from naming the first indirect file "statZZ-0000.djvu" in the bundle, it would make life 1000x easier). The all-zero or "null" position is usually where shared annotations/dictionaries go if present --> & then the file number can match the page number more often than not at the same time. If you can't start your program at "statZZ-0001.djvu", then please put a blank page before the cover page (which is nearly always page 1). -- George Orwell III (talk) 06:13, 18 June 2011 (UTC)
- I would like to hear more about how you managed to get the text layer embedded in Vol. 97! I’ll do it myself with the future volumes in the unlikely event that it’s within my meager technical capabilities.
- Concerning the page numbering issue, that, at least, is an easy fix. Take a look at how I did it with (for example) File:United States Statutes at Large Volume 100 Part 1.djvu. Page 1 is 0001 rather than 0000. Tarmstro99 00:47, 20 June 2011 (UTC)
- More specifics on this over the coming week, but basically I used command lines from PDF-to-DJVU to create a garbage-looking djvu that matched your good-looking djvu's dimensions & etc. but had a good text layer underneath. Then I used DJVULibre to extract and apply it to the djvu that you uploaded to Commons. Once I swapped out those reverse-image "blank" pages from the original, I uploaded the finished product and it seems to have worked throughout the Index. Using the GUI interface is faster, but you can't tweak the settings to match yours unless you use the command lines and the .exe file, in case that was not clear to you. -- George Orwell III (talk) 07:38, 20 June 2011 (UTC)
- p.s. - In light of this layer "swapping", you may not want to run the BOT creation right after you upload a blank DJVU. We just might be making more work for ourselves down the road than we need to. -- George Orwell III (talk) 07:47, 20 June 2011 (UTC)
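The layer-swapping workflow described above can be sketched with DjVuLibre's djvused: dump the hidden text layer of the throw-away pdf2djvu output as replayable set-txt commands, then load it into the better-quality scan. The file names here are placeholders (not the actual uploads), and the sketch is guarded so it degrades gracefully where the tool or files are absent:

```shell
# Extract the text layer from the throw-away conversion and re-apply it
# to the better-quality DjVu, per the workflow described above.
if command -v djvused >/dev/null 2>&1 && [ -f throwaway.djvu ] && [ -f good.djvu ]; then
  djvused throwaway.djvu -e 'output-txt' > layer.dsed  # layer as set-txt commands
  djvused good.djvu -f layer.dsed -s                   # replay them and save in place
else
  echo 'skipping: djvused or the input files are not available'
fi
```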
┌───────────────────┘
I do have pdf2djvu installed on my Linux machine, but have not played around with it. I am comfortable working with command-line apps rather than the GUI, so that prospect doesn’t frighten me. It does appear that pdf2djvu accepts quite a few options from the command line; would it be possible for you to outline the particular combination you have found to yield the best results?
C:\Program Files\PdfToDjvuGUI>pdf2djvu --output=097.djvu --pageid-template=stat97-{dpage:04*}.djvu --dpi=150 --monochrome --no-metadata --no-hyperlinks --words --pages=1-1702 --jobs=2 097.pdf
- Sorry for the DOS but no Linux here. Just remember that it's not so much the quality of the images that is of concern but matching the dimensions of the better process. Your process seems to produce far better results than most software or services that I've seen, so stick with whatever you're doing. The point here is to create a throw-away djvu with PDF-embedded text applied so that it may be extracted and reapplied to your version. If you can find a combination of settings that will let you reproduce your quality -- go for it !! -- George Orwell III (talk) 20:57, 20 June 2011 (UTC)
At the moment, all volumes through 102 are online at Commons, but they do not have text layers (except for your redone Volume 97, which my bot is uploading now). Where do you think my time would be best spent at this point? (1) figuring out how to use pdf2djvu and using that to convert the remaining volumes (103–116)? (2) using my existing toolchain to convert the remaining volumes and letting you add a text layer later? (3) creating Index pages for the volumes that have already been uploaded? Just trying to figure out how to minimize duplication of our collective efforts. Tarmstro99 13:38, 20 June 2011 (UTC)
- (2). I can promise to try and keep up, but I haven't had as much free time of late. Either way, just keep doing what you've been doing; add an index page if you have the time, but don't run the automated BOT page-creation after you've uploaded a "blank" djvu, is all. I just downloaded Vol. 116 (half a gig) so I'll find out if PDF2DJVU can handle something really large, I guess -- George Orwell III (talk) 20:57, 20 June 2011 (UTC)
- I used your technique to add a text layer to Vol. 103 before uploading; see what you think. Tarmstro99 19:06, 21 June 2011 (UTC)
- At first glance, it looks good. The pages match up and the content itself seems to match the PDF's too. Still, maybe if we put our heads together we can make this process even better.
- First point.--As long as you have a .txt file of the extracted text layer, why not pause and correct some of the glaring mistakes before re-insertion? If you skim through the file and compare it to the images, you'd notice that nearly all the sidenote citations, and some main-body ones, using the abbreviation for United States Code (USC or U.S.C.) are wrong. They look something like 42 use 5117a instead of 42 USC 5117a (i.e. use when it should be usc or USC). Considering the number of times use appears incorrectly compared to the instances where use is proper and correct, it makes sense to bulk-replace this word in order to help increase potential search-engine hits for the possible readers out there while cutting down on the eventual editing needed during proofreading for us on en.WS. There are many other super-repetitive terms found throughout the text layer. For instance, Section (aX2Xi) is really Section (a)(2)(i) (i.e. a subsection's closing and opening parentheses are frequently mistaken for a capital "X"). It makes sense to bulk-replace all capital "X"'s with ")(" because the likelihood that there are proper words using a capital X (such as Chapter XII, for example) is low compared to the number of times a capital "X" appears incorrectly.
- A search-and-replace script (whether before or after uploading) can fix most, but perhaps not all, of those problems. I would prefer to avoid introducing new errors into the text to the extent possible. Here are some scripts I ran before uploading the text of Volume 98:
sed -i "s/(\([a-z]\)X/(\1)(/g"
sed -i "s/(\([0-9]\)X/(\1)(/g"
- Those should have caught the great bulk of instances where X was mistakenly substituted for )( during the OCR.
sed -i "s/\([0-9]\) use/\1 USC/g"
- That should catch the erroneous references to the U.S. Code. Tarmstro99 19:31, 25 June 2011 (UTC)
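A quick illustration of the three substitutions above on one made-up sample line (the sample text is invented for the demo, not taken from an actual volume):

```shell
# Run the OCR-fix substitutions from above over a sample line.
sample='Section (aX2Xi) of 42 use 5117a'
printf '%s\n' "$sample" \
  | sed 's/(\([a-z]\)X/(\1)(/g' \
  | sed 's/(\([0-9]\)X/(\1)(/g' \
  | sed 's/\([0-9]\) use/\1 USC/g'
# prints: Section (a)(2)(i) of 42 USC 5117a
```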
- Update: I ran those same scripts before re-embedding the text layer in the DjVus of Vol. 104, which are now available at Commons. The process requires one tweak to the parameters to be passed to pdf2djvu as you provided above: replace --words with --lines. That way, each line of text in the source document corresponds to one (and only one) line of text in the output, which makes autocorrections feasible using sed (which ordinarily searches only one line at a time for a given pattern). Tarmstro99 20:27, 26 June 2011 (UTC)
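Putting that tweak together with the earlier invocation, the Vol. 104 run would look something like the following. The file names are guesses for illustration and the only substantive change from the Vol. 97 command is --lines in place of --words; the sketch is guarded so it degrades gracefully where the tool or input is absent:

```shell
# Same pdf2djvu settings as the earlier example, but with --lines so each
# printed line maps to exactly one text-layer line (sed-friendly).
if command -v pdf2djvu >/dev/null 2>&1 && [ -f 104.pdf ]; then
  pdf2djvu --output=104.djvu --pageid-template='stat104-{dpage:04*}.djvu' \
    --dpi=150 --monochrome --no-metadata --no-hyperlinks \
    --lines --jobs=2 104.pdf
else
  echo 'skipping: pdf2djvu or 104.pdf not available'
fi
```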
┌───────────────────┘
missed this until now - yeah, --lines is what I would have eventually gone to once the stand-alone glaring word mistakes were fixed, but I'm not advanced enough to be able to script anything like you've done. I really haven't gone beyond basic search and replace, to be truthful about it. I see you're still pluggin' away at getting the volumes up, but at some point we need to address the USStat templates as they currently stand. There are far too many variations on a theme from one era to the other, and something needs to be done in regards to standardizing styles and their corresponding templates & usage. -- George Orwell III (talk) 02:06, 2 July 2011 (UTC)
- Just a short status update, without getting into the larger issues about templates and formatting for now. The DjVu files for volumes 104-116 have been corrected before uploading to Commons using the typo-catching sed scripts above, so the text for those volumes should be free from the types of errors you identified above. After the last few volumes have been uploaded, I will go back and have the bot apply the same corrections to the previously uploaded Volumes 65-113. That will at least give us a common baseline of corrected text against which any further automated edits can be applied. Finally, I'll try to extract all the text from Volumes 65-116 into a single concordance file which can be searched for misspelled words (and sorted according to the frequency with which they occur) to guide further auto-corrections of the text. (It doesn't seem like it's necessary to apply those corrections to Volumes 117+ since those were "born digital"). Tarmstro99 13:44, 5 July 2011 (UTC)
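The word-frequency part of that concordance idea can be sketched with a classic shell pipeline: split on non-letters, lowercase, count, and sort by frequency. This is only an illustration of the approach (not necessarily the exact method to be used), shown on a tiny sample string; in practice the input would be the extracted text of whole volumes:

```shell
# Build a frequency-sorted word list from some extracted text.
sample='An Act to amend an Act to provide'
printf '%s\n' "$sample" \
  | tr -cs '[:alpha:]' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn
```

Misspellings then surface near the bottom of the list (low counts), while systematic OCR errors like "use" for "USC" show up with suspiciously high counts.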
- Second point.--The one difference between the post-upload bot insertion of PDF text used in Vol. 95 and the pre-upload manual PDF text manipulation a la Vol. 103 is that your way produced better CR & LR (carriage returns at the end of a paragraph & line returns, much like <BR>) than my way does once uploaded and the pages are created. I don't know if there is a scriptable way to achieve the same effect without making a supreme mess of the text layer in general. I know \n is one way for a djvu file to recognize the end of a paragraph, but I don't think the Page: namespace will allow it to be used in its text-dump-like processing. Thoughts? The DjvuLibre doc files have some interesting points on UTF-8 vs ASCII text layers regarding extraction &/or insertion, but I don't know what exactly the difference would be if we used one parameter over the other (the default, I guess?). -- George Orwell III (talk) 21:30, 21 June 2011 (UTC)
Bot text layer — favour to ask
Gday Tim. I would like to ask a favour of you and your bot. Would you be so kind as to get it to apply the text layers for the pages in Index:Woman's who's who of America, 1914-15.djvu. It is a work where I just need to dip in and get the occasional biography for authors, and having text layers applied just makes it readily available, and not a lucky dip for the page. Thanks if you can do it. — billinghurst sDrewth 13:42, 26 June 2011 (UTC)
- sure, happy to oblige. Tarmstro99 13:50, 26 June 2011 (UTC)
urgent request
as a steward, I need an admin ASAP. please contact me privately. Matanya (talk) 15:26, 11 July 2011 (UTC)
- I'm not a steward, sorry. Tarmstro99 15:38, 11 July 2011 (UTC)
- I am, but user Zhaladshar had contacted me. thanks anyway. Matanya (talk) 15:45, 11 July 2011 (UTC)
Issues list
Rather than cluttering up our User: Talk-pages with one discovery or another concerning USSaL scans, I've started an Issue section under the main Talk-page linked above. Two issues listed so far. -- George Orwell III (talk) 03:06, 14 July 2011 (UTC)
- Good idea and much appreciated. I’ll see about fixing those and getting corrected versions of the files uploaded to Commons. Tarmstro99 19:10, 14 July 2011 (UTC)
Uploading open-access papers systematically
Hi there, I saw that you had signed up at Wikisource:WikiProject Academic Papers, so I thought I'd ask your opinion on scaling up the import of open-access materials, as discussed at Wikisource:Scriptorium#Scaling_up_the_import_of_open-access_sources. Thanks and cheers, -- Daniel Mietchen - WiR/OS (talk) 12:42, 10 August 2011 (UTC)
Margins
Hey. hope all is well...
I see you've come across an issue with margins or something and tried to edit the start & end templates. Well, I tried to bring the issue up earlier, but all those new GPO volumes kind of side-tracked everything.
The "problem" is that up until now we've been trying to "squeeze" our desire to have our well-known and widely used default Statutes at Large layout work under the current dynamic-layout setup via some templates that basically wrap stuff, forcing it to behave, rather than creating our own project-specific layout. This wasn't all that easy to begin with, but it just so happens that a key problem - setting a transcluded USSaL page to display using a specific layout without forcing any change to the visiting user's settings, etc., etc. - seems to have been solved (see proposal HERE).
If that goes through, and there is no reason it shouldn't, we can make the default view Layout 2 for the time being, which at least keeps the sidenotes in the view screen, whereas the global default, Layout 1, does not.
At that point "we" can focus on doing away with all that extra template and div wrapping stuff for good, keep the benefits of having a fixed & full header no matter what dynamic view is used (the way page headers should be rendering now per my last start-template edits), and once and for all tweak the margins, padding, positions, fonts, etc., to work across all volumes and all common browsers at the same time.
I've set up a test layout based on Layout 2 and some stuff I gleaned from the French and German sites, which use dynamic layouts by default (see my monobook file), and played around some with good results. There is a test layout too, but that was only useful if you changed the colors to display as backgrounds instead of just borders. -- George Orwell III (talk) 08:58, 2 September 2011 (UTC)
PDF2DJVU
Just a heads up - three updates to this program since we last spoke about it (check the changelog). Two new features worth mentioning: 1.) the monochrome switch was sort of broken all this time, but I haven't gotten around to testing the fixed version yet; and 2.) djvus can now be created from multiple PDF sources... though no English documentation on just how to accomplish this has been provided yet, and I don't speak German/Polish either. -- George Orwell III (talk) 19:34, 8 September 2011 (UTC)
...and I got my hands on Adobe Acrobat Professional ver. 10. I can now rip any layers or watermarks from existing PDFs. Most non-Adobe PDF creators botch this, so layers typically need to be removed a page at a time rather than with a click or two through the built-in app. -- George Orwell III (talk) 19:39, 8 September 2011 (UTC)
USSaL v.45 to v.64
I understand you are the lead on the US Statutes at Large. What are your plans for the missing volumes 45 to 64? From what I can tell, the Constitution Society has versions of the volumes on their website for download, but I am not sure if they are complete, which is doubtful. Even if they were, there is still the problem of size and DJVU conversion, which I know nothing about. (I do use them as my primary source for proofreading, though.)
Is there any chance of using those copies in the interim, or must bona fide complete copies be uploaded from the start? Or is it another problem? These volumes cover 1929-1950; some pretty pivotal years in the legal history of the United States. (And the last volumes.) Of course, the work you have already done is superb, and I have already started proofreading some of the most important/famous/infamous Acts. Please let me know. Int21h (talk) 07:04, 10 September 2011 (UTC)
- I do want to add the remaining volumes so our set will be complete. At the moment, I am scanning Volume 45, Part 2 from microform, but I do not have a sense of when it will be complete.
- I am sort of hoping that the Government Printing Office will, in time, come to the rescue here; they have posted scans of all volumes from v65 on, and apparently they are ultimately planning to put the remaining volumes online as well. Their resources and time certainly exceed my own, plus they would certainly do a better job of providing scans that are both complete and indisputably authentic.
- I’m aware of the Constitution Society’s repository, which obviously has some distinctive strengths and weaknesses. If you locate copies of Volumes 46–64 on other sites that appear to be reasonably complete, please let me know; I can handle the DjVu conversion, text extraction, and so on. Tarmstro99 17:12, 12 September 2011 (UTC)
- You are scanning them all on your own? From scratch? Would not taking what the Constitution Society has already done and just completing the missing pages be more efficient? Int21h (talk) 05:56, 13 September 2011 (UTC)
- I should also note that the OCR in the Constitution Society (CS) scans is exemplary and usually better than what is in here. I usually start my proofreading by replacing the text of a page with the OCR text of the Constitution Society version. I am not sure by what method the text was extracted or gathered, but checking against the CS text (and against a dictionary) may be something to think about. Also, are you embedding the OCR into the DJVU? How much extra space towards the 100MB limit is this adding? Int21h (talk) 06:01, 13 September 2011 (UTC)
- The Constitution Society’s archive is generally of good quality, but incomplete. What they call “Volume 45” is actually only Volume 45, Part 1 of the Statutes at Large; their site omits Volume 45, Part 2. (I believe I understand why they did this; Part 2 of Volume 45 contains things like private laws, Presidential proclamations, and the like, which the Constitution Society might view as less “important” than the public laws in Part 1.) Similarly, their scanned copy of Volume 45, Part 1 skipped a number of pages that actually appear in the source document; those pages had to be separately scanned and added to our copy. I have no interest in duplicating efforts, and I did use the Constitution Society’s incomplete PDF of Volume 45, Part 1 as the basis for our own DjVu, with the additional necessary pages inserted. For Volume 45, Part 2, however, I haven’t located a copy elsewhere that will save me the effort of scanning. (I generally do copy over the embedded source text from the predecessor scans, regardless of where they originated, when creating our own scanned versions; text compresses very well and the text layers account for only a very small fraction of the final DjVu file size.) Tarmstro99 09:59, 13 September 2011 (UTC)
- Ah, I understand. It does not seem likely there is going to be a more complete digital version similar to the Constitution Society's quality anywhere; I was surprised this was even available. Wikisource seems to be the most complete digital collection available on the Internet for free. And as for the GPO, they only have a multi-billion dollar annual budget to spend on something as unimportant and insignificant as the (current) laws of the land, so don't count on them completing it within the decade, or ever. ;) In any event keep up the good work. Int21h (talk) 22:15, 15 September 2011 (UTC)
BTW how do you obtain and how do you scan them? It would be awesome if I could start on the California Statutes (Chaptered Bills). The California Codes (codified law) are not even printed from what I understand, and as such there is no official version. I understand that the California citizen is supposed to travel on a pilgrimage to Sacramento periodically to collect the statutes, or rely on informal sources to know the law. Int21h (talk) 22:15, 15 September 2011 (UTC)
- The law library at my university has a complete collection of the Statutes at Large on microform, with an attached scanner. I have been scanning Volume 45, Part 2 at the scanner’s default resolution of 800 dpi, although that tends to produce fairly massive files that will certainly need to be scaled down to a lower resolution to fit within the 100 MB upload limit on Commons. The quality of the microforms in our collection varies widely; some microform producers did an excellent job producing clear, legible scans, but others did not. The worst I have seen are scanned pages where the contrast setting was not adjusted and text on the opposite side of the sheet “bled through” into the microform image of the page as it was scanned. (This page is a pretty extreme illustration of the problem, but it’s far from the only one.) I have been trying to clean up the images a bit (mostly using this excellent tool to help with the eventual OCR’ing of the text), but there is ultimately no substitute for a page-by-page review of the content to make sure that I’m not simply uploading garbage. It’ll get done in weeks, not months, but I certainly wish the process was faster. There are plenty of copy-left/open access/creative commons types in law schools and universities on the west coast; I gotta think it would not be too great of an organizational challenge to kick-start an effort to digitize California law and put it online here! :-) Tarmstro99 13:42, 17 September 2011 (UTC)
USSaL Vol 117
Hi, I was dealing with some pages in Special:LonelyPages when I came across the tag on Index:United States Statutes at Large Volume 117 Front Matter.djvu to migrate the text to Index:United States Statutes at Large Volume 117.djvu. I've moved the two pages that had been validated/proofread. However, I'm not sure what should happen to that Index now. Is it intended that it be deleted? I'm really asking so that I know what to do in the future when I meet this situation again. Cheers, Beeswaxcandle (talk) 08:05, 9 October 2011 (UTC)
- Thanks for the reminder; this is a bit of cleanup that I have neglected to take care of. It has previously happened that someone (not infrequently, me) uploaded only a portion of a SaL volume that then needed to be moved and deleted when the complete volume later came online. I’ll look over the other “Front Matter” files in the next few days, move any pages that have been proofread at least once, delete the rest, and then initiate the necessary deletion requests at Commons since the “Front Matter” files are no longer needed. Tarmstro99 19:14, 9 October 2011 (UTC)