User talk:Tarmstro99

From Wikisource
(Redirected from User talk:TarmstroBot)


Sidenotes revisited


Hi again, hope this finds you & yours well...

I saw you attempted some changes to experiment with recently. With the same goal in mind, I took the liberty of adding to your User:Tarmstro99/common.js file a test layout and a small script that forces layout 2 instead of layout 1 for pages using the USStatCols templates. I hope you find them instructive if not useful - feel free to delete them if not.

The whole "point", in my opinion, is not to "fix" layout 1 or to force a so-so layout 2 for visiting users but to have a custom layout invoked only for USSaL pages when applicable - thereby reducing, in the process, the number of templates & the amount of tweaking currently needed just to get the same meh.. rendering that was the default almost 3 years ago. -- George Orwell III (talk) 08:41, 18 March 2012 (UTC)

Appreciate the help. I am trying to come up with a way to standardize the presentation of the materials to users who have not taken the trouble to edit their own JavaScript files, on the theory that users who arrive here from en.wp should find the Statutes at Large content to be legible. I’ll undoubtedly be conducting some further experimentation along these lines and will be very grateful for the benefit of your stronger grasp of scripting. Tarmstro99 13:58, 26 March 2012 (UTC)
No problem. The ultimate point is the same - develop a layout that "works" for whatever browsers are in use by "us" to the point where we can present it to the community as a permanent default (i.e. nobody is in love with the current Layout 3 and I believe this one will easily replace it once "finished"). The scripting itself is primarily another avenue to force what amounts to common HTML and/or CSS parameters and values. I hope you can glean as much from the file I created.

Once a default is secured, it's just as easy to force this new layout as a default for USSaL pages (just as Layout 2 is now being forced in the interim). Most everyone uses JavaScript even without specializing or customizing anything and that will be the case here in the end. Please note - I have a better understanding of how this scripting seems to work since I've been a real pain in the ass about getting to this point in its development, but by no means is my understanding perfect or complete (in other words - we could use any help we can get from other folks!) -- George Orwell III (talk) 14:20, 26 March 2012 (UTC)

US Legislative Data Workshop

Hi, I noticed your work at Template:Public acts of Congress. Would you be interested in teaming with the Cato Institute effort started at w:Wikipedia:Meetup/DC/Legislative Data Workshop and their open government XML data going forward?
Hi, I noticed your work at United States Statutes at Large duplicates a little the lists at Portal:Acts of the United States Congresses; maybe these should be combined. Slowking4 (talk) 18:51, 15 March 2013 (UTC)

USSaL v124 is up


Hi, hope this message finds you & yours well

Just an FYI... vol. 124 was released at some point at the end of last month and I took the liberty of fixing the 9,000 page mess (yup 9K - because some Pub. Laws were technically from the 1st session, GPO just restarted the pages for each assigned Pub. Law numbered gap).

I got it down to the proper 4830 pages but am having some issue uploading to Commons so I pulled the text layer in order to get something in place (kind of relevant - contains Obamacare, Dodd-Frank, etc.).

Index:United_States_Statutes_at_Large_Volume_124.djvu

Anyway, by the time you read this a new version should be up & with almost all of the text-layer in it "done" - of course there are Annex(es) to proclamations that never got OCR'd and some tables need rotating to make sense of the embedded text but I think it's still one of the cleanest works I've ever managed to cobble together.

@Tarmstro99: Drop me a line before you run that Bot thingy. Prost. -- George Orwell III (talk) 22:59, 9 June 2013 (UTC) George Orwell III (talk) 20:22, 9 May 2014 (UTC)

I have looked over the 4,830-page DjVu file; it looks exceptionally good! My DjVu conversions sometimes seem to add extra whitespace around the text (to fill up a letter-size page), which both enlarges the file and makes the text harder to see during editing, so I’m glad that your version of the file doesn’t suffer from that problem. Do you have any remaining concerns before I turn the bot loose on the text? Tarmstro99 14:51, 12 May 2014 (UTC)
That's it - all I wanted was a second look by someone who knows the subject matter & now that you've done that; let 'er rip.

fyi - I cropped the margins while it was still a PDF using Acrobat Pro and that's how I managed to get just the content and little else in the way of "wasted space". -- George Orwell III (talk) 21:16, 13 May 2014 (UTC)

Done. Tarmstro99 13:28, 15 May 2014 (UTC)

The GPO, Congressional Data Coalition, and Statutes at Large


The Congressional Data Coalition, a major lobbying group pushing for government transparency, recently posted on their website about their effort to OCR and digitize the Statutes at Large. Since we (you, really) have been doing this for a number of years, I think it would be beneficial if you could confer with them about the current state of affairs on Wikimedia Commons and Wikisource. I would hate for them to pretty much duplicate everything that's already been done.

The main website (I guess) is http://legisworks.org/sal/

What's the status of the missing volumes, 45-64? Int21h (talk) 04:51, 22 April 2014 (UTC)

OK, now I'm thoroughly confused. User:Joecarmel (Special:Contributions/Joecarmel) is one of the contributors to that project, and he obviously knows about Wikisource... I'm not sure if they know you're the source for much of it, though. Int21h (talk) 05:48, 22 April 2014 (UTC)
Naturally, I’m quite pleased to see the work that the congressionaldata.org people have been doing! Let a thousand flowers bloom, etc. I'll look forward to seeing the output they are generating for the portions of the volumes that we are still missing here; I gather that their work to date has mostly involved matching up the tables of contents, etc. from their sources to our page scans, which is terrific work.
I’m flattered to be credited as the source of the project based on my Early United States Statutes site, but as I acknowledge on that site, I have done very little original work in connection with this project. Most of the volumes I have been working with were scanned by the Library of Congress’ American Memory Project or by the GPO. My contributions have really been simply in collating their many thousands of individual page scans into volumes that can be accessed as a whole. I also set up the corresponding index pages for many (although hardly all) of the volumes that we are hosting here. For newer volumes, I’ve been beginning with the same collection that the congressionaldata.org people are using (namely, the Constitution Society’s). However, I have also been doing some work that the Constitution Society has not, namely, adding the indexes and tabular material that appear at the front of each printed volume (but which are often missing from their scans), and scanning the volumes that contain private laws, treaties, executive proclamations and the like (which I think are not of interest to the Constitution Society, which has focused on the portions of each volume that reproduce public laws).
Regrettably, my institution recently discarded a significant portion of its microform holdings in the interest of shelving space, including many of the volumes of the Statutes at Large that I had not yet scanned. I am hopeful of obtaining scans of the missing volumes via interlibrary loan, but it does add a little delay to the project.
At the moment, the most recent volume from the “missing” sets that is available online here is Volume 47, Part 1, which was constructed using the incomplete scans that the Constitution Society refers to as “Volume 47,” to which I appended my own scans of the pages omitted from their file. I am currently working on Volume 47, Part 2, the content of which unfortunately seems not to be available from the Constitution Society or any other source of which I am aware. My scanning of this volume from the microfiches available to me is proceeding, although more slowly than I would wish. Tarmstro99 01:24, 9 May 2014 (UTC)

TarmstroBot Issues


I was looking at the page Page:United States Statutes at Large Volume 18 Part 2c.djvu/176. I noticed that the small capital letters seemed to confuse the bot. Just thought I’d mention it. There are some other strange characters, but such is to be expected. Ushakaron (talk) 22:50, 10 June 2014 (UTC)

USSaL v125 is up


... but not ready for BOT page creation just yet.

While the volume suffers from the usual "missing OCR" for a couple of proclamation annex tables, low-quality images & the like, the GPO's PDF embedded text-layer was surprisingly clean this time. I managed to strip the hidden GPO timestamp/metadata string from the page "footers" and cropped the white-space margins evenly for both left- and right-facing page scans with little trouble -- but introduced lord-knows-how-many errant "extra spaces" to the text-layer in the process. I fixed the majority of those manually before I uploaded the current source file though.

My gut tells me I can figure out how to make the text-layer even 'more perfect' given some time & tinkering so I ask you to hold off on running your BOT until I touch base with you.

See the new Index here...

Index:United_States_Statutes_at_Large_Volume_125.djvu

... and if you're interested or find it instructive, the processed PDF is here...

File:United_States_Statutes_at_Large_Volume_125.pdf


Prost. -- George Orwell III (talk) 01:16, 27 July 2014 (UTC)

New Proposal Notification - Replacement of common main-space header template


Announcing the listing of a new formal proposal recently added to the Scriptorium community-discussion page, Proposals section, titled:

Switch header template foundation from table-based to division-based

The proposal entails the replacement of the current Header template familiar to most with a structurally redesigned new Header template. Replacement is a needed first step in a series of steps needed to properly address the long-standing deficiencies behind several issues as well as enhance our mobile device presence.

There should be no significant operational or visual differences between the existing and proposed Header templates under normal usage (i.e. Desktop view). The change is entirely structural -- moving away from the existing HTML all Table make-up to an all Div[ision] based one.

Please examine the testcases where the current template is compared to the proposed replacement. Don't forget to also check Mobile Mode from the testcases page -- which is where the differences between current header template & proposed header template will be hard to miss.

For those who are concerned over the possible impact replacement might have on specific works, you can test the replacement on your own by entering edit mode, substituting the header tag {{header with {{header/sandbox and then previewing the work with the change in place. Saving the page with the change in place should not be needed but if you opt to save the page instead of just previewing it, please remember to revert the change soon after you're done inspecting the results.

Your questions or comments are welcomed. At the same time I personally urge participants to support this proposed change. -- George Orwell III (talk) 02:04, 13 January 2015 (UTC)

Index:Crowdsourcing and Open Access.djvu


Now seemingly proofread..

And on a side issue, Wikisource also has (although not proofread) a set of English Statutes from Magna Carta up to the first few years of George the Third. The version Wikisource has is ultimately from a copy in the John Adams Library. Rather embarrassingly, I couldn't find anything more recent as a complete set. ShakespeareFan00 (talk) 17:09, 12 September 2015 (UTC)

Great news on both counts! Tarmstro99 16:40, 25 October 2015 (UTC)

United States Statutes at Large/Volume 1/5th Congress/2nd Session/Chapters 72, 73, 74


Hello.

I hope you will not mind my presumption. I noticed your change remark on United States Statutes at Large/Volume 1/5th Congress/2nd Session/Chapter 72 to the effect "transclude using {{Page}} instead of <pages>, because <pages> introduced an extra newline between pp. 595 & 596, in the middle of a single paragraph of the original source".

Now I should emphasise I am looking at this as an outsider with no prior background as to the agreed "look" of this project; so please take any suggestions with a suitable dosage of whatever is your choice of condiment!

Having said that, would it make sense to simplify (in Chapter 72 referred to above):

{{USStatCols start}}
{{Page|United States Statutes at Large Volume 1.djvu/716|section=chap72|num=594}}
{{Page|United States Statutes at Large Volume 1.djvu/717|section=chap72|num=595}}
{{Page|United States Statutes at Large Volume 1.djvu/718|section=chap72|num=596}}
<!--here's what didn't work:-->
<!--<pages index="United States Statutes at Large Volume 1.djvu" from=716 to=717 fromsection="chap72" tosection="chap72" /><pages index="United States Statutes at Large Volume 1.djvu" from=718 to=718 onlysection="chap72" />-->

{{USStatCols end}}

by replacing this fragment with:

{{USStatCols start}}
<pages index="United States Statutes at Large Volume 1.djvu" from=716 to=718 onlysection="chap72" />

{{USStatCols end}}

which I believe does the job equally well?
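The consolidation suggested above can be sketched in Python: given the scan page numbers of consecutive {{Page}} calls that share one section name, build the single <pages> tag. This is only an illustration under stated assumptions. The helper name `pages_tag` is hypothetical and is not part of any Wikisource tool, and the output format simply mirrors the tag quoted in this thread.

```python
def pages_tag(index: str, pages: list, section: str) -> str:
    """Build one <pages> tag covering a contiguous run of scan pages.

    Hypothetical helper for illustration only; it formats text and does
    not communicate with the wiki.
    """
    if not pages:
        raise ValueError("at least one page is required")
    ordered = sorted(pages)
    # A single from/to range can only describe consecutive pages.
    if ordered != list(range(ordered[0], ordered[-1] + 1)):
        raise ValueError("pages must be consecutive")
    return (
        f'<pages index="{index}" from={ordered[0]} to={ordered[-1]} '
        f'onlysection="{section}" />'
    )


# The three {{Page}} calls for Chapter 72 collapse into one tag.
print(pages_tag("United States Statutes at Large Volume 1.djvu",
                [716, 717, 718], "chap72"))
```

Refusing non-consecutive pages is a deliberate choice here: a gap in the run would silently pull in pages that the separate {{Page}} calls never referenced.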

Which leads to another pair of issues: do you know why Chapters 73 and 74 both utilise {{sidenotes begin}} and {{sidenotes end}} pairs whereas Chapter 72 and all prior chapters I have so far examined utilise {{USStatCols start}} and {{USStatCols end}} pairings in corresponding situations? Is there a good reason for Chapters 73 and 74 changing the apparent pattern?

Thank you for your patience. I hope these questions are not too naïve. AuFCL (talk) 21:32, 24 May 2016 (UTC)

Great suggestion; it worked like a charm!
On your second issue, the reason is historical; those chapters were added earlier in time when the convention was to use the two {{sidenotes}} templates. I have been (slowly, oh so slowly) working my way through the enactments of the Fifth Congress and have been cleaning things up along the way to conform to current best practices. I’ve only just reached Chapters 73 & 74, but will clean them up before continuing on. Tarmstro99 13:10, 25 May 2016 (UTC)
I am delighted my suggestions worked out to your satisfaction, and thank you for the explanation of the history. I just happened to note the change in style and wondered why. (It is hard to suppress the expectation that later chapters are constructed using lessons learned from earlier ones; I completely omitted to check edit dates!) AuFCL (talk) 23:31, 25 May 2016 (UTC)

Query as to TarmstroBot's capabilities


As you know, I am working on USStat 33. Vast reams of the private acts are highly formulaic with a very small vocabulary (if I may call it that) of sentences, and I'm wondering if TarmstroBot could help.

Both chapter heading and preamble are common to probably hundreds of private acts dealing with pensions. The heading is either "CHAP. ###.—An Act Granting a pension to" or "CHAP. ###.—An Act Granting an increase of pension to" followed by the name of the recipient - with the names usually surprisingly well OCR'ed. The preamble should read (the OCR often blends the sidenotes into this part of the text): "Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, that the Secretary of the Interior be, and he is hereby, authorized and directed to place on the pension roll, subject to the provisions and limitations of the pension laws, the name of ". There then follows a highly variable sentence giving name, historical military service, widow of etc., followed by another formulaic closing sentence. For grants, this will be: "and pay <him/her> a pension at the rate of <sum> dollars per month.", while for increases it will be "and pay <him/her> a pension at the rate of <sum> dollars per month in lieu of that <he/she> is now receiving." There is then "Approved, <full date>." on the next line, marking the end of each chapter.

Given that we know that these tracts of boilerplate are reused again and again, could a Bot with rules of engagement drafted by someone cleverer than me be employed to paste in the boilerplate (intelligently) time after time, complete with formatting and section tags? It would obviously risk overwriting any sidenote data which OCR has blended into the body text, but the date, name and grant/increase elements of this can be copied and pasted into the sidenotes from elsewhere in the same chapter. I have yet to find many instances where the [Private, no.] or [H.R. ####] has been correctly OCRed, so would we be losing anything valuable?

Just a thought - I hope my reasoning is clear!

Kind regards

CharlesSpencer (talk) 11:55, 2 July 2017 (UTC)
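The formulaic parts described above lend themselves to pattern matching. Below is a minimal Python sketch, assuming reasonably clean OCR, that pulls the chapter number, grant/increase type, recipient name, pronoun, monthly sum and approval date out of one act. The function name `parse_pension_act` and the sample text are invented for illustration; nothing here is TarmstroBot's actual code, and badly garbled text will simply fail to match.

```python
import re
from typing import Optional

# Heading: "CHAP. 123.—An Act Granting a pension to NAME" or
# "... an increase of pension to NAME". A hyphen or en dash is also
# accepted where the em dash should be, since OCR output varies.
HEADING = re.compile(
    r"CHAP\.\s*(?P<chapter>\d+)\.\s*[—–-]+\s*An Act Granting "
    r"(?P<kind>a pension|an increase of pension) to (?P<name>[^\n]+)",
    re.IGNORECASE,
)
# Closing sentence: pronoun and monthly sum.
RATE = re.compile(
    r"and pay (?P<pronoun>him|her) a pension at the rate of "
    r"(?P<sum>[\w\s-]+?) dollars per month",
    re.IGNORECASE,
)
# Final line: "Approved, <full date>."
APPROVED = re.compile(r"Approved,\s*(?P<date>[^\n]+?)\.\s*$", re.MULTILINE)


def parse_pension_act(text: str) -> Optional[dict]:
    """Return the variable fields of one pension act, or None if any
    of the three formulaic anchors cannot be found."""
    heading = HEADING.search(text)
    rate = RATE.search(text)
    approved = APPROVED.search(text)
    if not (heading and rate and approved):
        return None
    return {
        "chapter": int(heading["chapter"]),
        "kind": "increase" if "increase" in heading["kind"].lower() else "grant",
        "name": heading["name"].rstrip(" ."),
        "pronoun": rate["pronoun"].lower(),
        "sum": rate["sum"].strip(),
        "approved": approved["date"].strip(),
    }


# Invented sample text, not taken from the volume.
sample = (
    "CHAP. 1234.—An Act Granting an increase of pension to Jane Doe.\n"
    "the name of Jane Doe, widow of John Doe, and pay her a pension at "
    "the rate of twelve dollars per month in lieu of that she is now "
    "receiving.\n"
    "Approved, March 3, 1905.\n"
)
print(parse_pension_act(sample))
```

Returning `None` rather than a partial result means a bot built on this would skip any act it cannot fully recognise instead of overwriting it with guesses.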

Sorry for my delay in replying! This is the sort of problem potentially suited to the work of a bot, I suppose; although it would take some doing (possibly within but possibly beyond my capabilities) to write a regex that matched the “to-be-replaced” text in all its possibly garbled permutations. I also wonder whether the solution might not be to whip up a couple of templates that could be called by editors, with variables representing (for example) whether to print “him” or “her”, the amount of the pension, and so forth. When editing a page, an editor could simply replace the entire block of OCR-ed text with a single call to the relevant template, which would then reproduce the appropriate text on the page. Tarmstro99 18:21, 22 August 2017 (UTC)
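The template idea can be illustrated outside the wiki with a small Python function that rebuilds the boilerplate from a few variables. This is a sketch only: the function name and its parameters are hypothetical and are not the interface of any existing Wikisource template, and the wording is taken from the formulaic sentences quoted earlier in this thread. The exact punctuation between the recipient clause and the closing sentence is an assumption, so the caller supplies it as part of the clause.

```python
PREAMBLE = (
    "Be it enacted by the Senate and House of Representatives of the "
    "United States of America in Congress assembled, that the Secretary "
    "of the Interior be, and he is hereby, authorized and directed to "
    "place on the pension roll, subject to the provisions and limitations "
    "of the pension laws, the name of "
)


def render_pension_act(recipient_clause: str, pronoun: str, amount: str,
                       increase: bool = False) -> str:
    """Reproduce the formulaic text of one pension act.

    recipient_clause is the variable sentence (name, service, "widow of"
    and so on), including any trailing punctuation it needs.
    """
    if pronoun not in ("him", "her"):
        raise ValueError("pronoun must be 'him' or 'her'")
    closing = (f" and pay {pronoun} a pension at the rate of "
               f"{amount} dollars per month")
    if increase:
        subject = "he" if pronoun == "him" else "she"
        closing += f" in lieu of that {subject} is now receiving"
    return PREAMBLE + recipient_clause + closing + "."


# Invented example values.
print(render_pension_act("Jane Doe, widow of John Doe,", "her", "twelve",
                         increase=True))
```

Keeping the pronoun and the grant/increase switch as separate variables matches the two points of variation the thread identifies in the closing sentence.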
No apology required - my question was plum in the middle of holiday season, and not at all urgent.
I wrote {{USStatPension}} back in April 2010 and I (and a very few other brave souls) have used it a bit. It is pretty laborious, though, and, with 3,467 Private Acts in the 58th Congress, it would be ideal work for someone in a minimum security institution without possibility of parole! With Mpaa's help (and that of his/her bot) I have managed to complete the entire List (i.e. contents) of private acts of the 58th Congress (still some validation outstanding, hint, hint...), and am now working on adding to the same Excel spreadsheet the necessary data points to allow Mpaa's bot to go again with c.3,400 instances of {{USStatPension}} (a small percentage of the private acts aren't pension-related).
I did all of the post-processing for the List off-line in Word and Excel on a text file of the entire document. I am looking for ways to do the same again for the body text, but it has much more noise than the List, what with side notes randomly blended in, paragraphs breaking across pages and so incorporating header and footer text etc., which together make it much more manual. For some reason (maybe the developers of Tesseract had a pacifist bent) the word "Regiment" alone has what feels like hundreds of erroneous OCR mis-spellings, "Volunteers" and "Cavalry" ditto. It took me a day to review 408 Acts manually in Excel, and that's without bothering with the side notes. As a non-statistician, non-programmer, I think I just need to track down the right person/tool to help me. I don't know if this highly statistically competent chap [1] may be the right answer? I have emailed him and will report back if he replies. The formulation of the pension private acts is, I seem to recall, standard across many years/congresses, so if we can achieve automation of the post-processing challenge, we can potentially do a lot of good work very quickly.
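For the many OCR misspellings of a small, known vocabulary, approximate string matching is one possible approach. The Python sketch below uses the standard library's `difflib.get_close_matches` to snap near-misses onto "Regiment", "Volunteers" and "Cavalry". The word list, the 0.75 cutoff and the function names are assumptions chosen for illustration; a real run would need a larger vocabulary and a manual review of what gets changed.

```python
import difflib
import re

# Words known to be frequently mangled by the OCR (from this thread).
VOCABULARY = ["Regiment", "Volunteers", "Cavalry"]


def correct_word(word: str, cutoff: float = 0.75) -> str:
    """Return the closest vocabulary word, or the word unchanged when
    nothing is similar enough."""
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text: str, cutoff: float = 0.75) -> str:
    """Apply correct_word to every run of ASCII letters in the text,
    leaving digits, punctuation and spacing untouched."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: correct_word(m.group(0), cutoff), text)


# Invented example of garbled OCR output.
print(correct_text("Third Regimcnt Ohio Voluntecrs Cavalrv"))
```

A higher cutoff makes fewer corrections but also fewer false ones; because the matching is case-sensitive, a lowercase near-match such as "regiment" would be rewritten to the capitalised vocabulary form.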
Phew, hope that's still clear. CharlesSpencer (talk) 07:40, 29 August 2017 (UTC)
Would it be beneficial to have a higher-quality OCR of the underlying page scans to use as a starting point for editing? Given the amount of work you have already invested in the project, the answer may very well be “no”; I’d hate to undo any progress that has already been made. I have access to the full version of Adobe Acrobat at work, however, and I think the quality of its OCR output is generally quite good. That would not eliminate the need to clean up the formatting (sidenotes, italics, etc.) of the resulting output, but it would likely cut down on the number of alternative misspellings of “Regiment” and other common words that appear in the unedited text. Is this possibility of interest? Tarmstro99 16:04, 29 August 2017 (UTC)
High quality OCR output would be a godsend - because I'm working at the text level, I've got nothing to lose. When Mpaabot updated the List, it happily wiped out everything that was there before. The less noise there is in the output, the greater the ability to use regex in Word etc. with confidence. Thanks. CharlesSpencer (talk) 17:44, 29 August 2017 (UTC)
A quick update: Tarmstro, I have had great success with the PDF of the List of Public Acts from the Constitution Society, the Excel output of which I have just sent to Mpaa in case s/he has time to give it to their bot to process. You were absolutely right that higher quality OCR would help. Unfortunately, since the Constitution Society is only concerned with public law, the second volume isn't available on their website - which is a shame, since they have a far higher quality scan than the one from archive.org at Wikisource, which is dark to the point that it almost looks like the original was printed with too much ink. Weirdly, the OCR of the body of the Public Acts from the Constitution Society is pretty accurate on a word by word basis, but when you copy and paste big chunks of it, the lines often get jumbled up, so we're back to looking for a high quality original to apply high quality OCR to. CharlesSpencer (talk)
There is this scan of Volume 33, Part 2 on Google, but I do not know whether the quality is greater than what we currently have. Tarmstro99 16:33, 27 September 2017 (UTC)

Field v. Google, Inc.


You added Field v. Google, Inc. in 2007. I went back and cleaned it up tonight. I used the Google Scholar copy so I could crib the F. Supp. pagination, then proofed a few bits against the slip op. Cheers. Apt-ark (talk) 06:05, 8 April 2019 (UTC)

Community Insights Survey


RMaung (WMF) 14:31, 9 September 2019 (UTC)

Reminder: Community Insights Survey


RMaung (WMF) 19:12, 20 September 2019 (UTC)

Reminder: Community Insights Survey


RMaung (WMF) 17:02, 4 October 2019 (UTC)

Let's connect


As it happens, I am also an IP guy (or was, as I have since moved primarily into another area). We should connect. Cheers! BD2412 T 01:56, 3 November 2020 (UTC)

@BD2412: Happy to hear about your practice! Drop me a note via the “Email this user” link any time you wish. Best wishes. Tarmstro99 13:49, 5 November 2020 (UTC)

Sherlock Holmes public domain books


As per your comment on Talk:The Casebook of Sherlock Holmes the following stories are currently in the public domain.

Could you undelete the previous version of these stories, so that the community does not need to start again? Techie3 (talk) 10:30, 19 October 2021 (UTC)