Wikisource talk:WikiProject 1911 Encyclopædia Britannica/Archive 3
Please do not post any new comments on this page.
This is a discussion archive first created in , although the comments contained were likely posted before and after this date. See current discussion.
Using http://www.theodora.com/encyclopedia/ as a source of mostly-proofed text
Sort of following on from the discussion above — "Copy and pasting text from the searchlight version" — I’ve found the text at www.theodora.com/encyclopedia/ to be a good source. Right-clicking and choosing "Page Source" on an article page to get the markup code shows it’s formatted already with bold and italics; I use Notepad++ to do a global replace on <b> for ''' etc. — you could record a macro if desired.
I’ve found theodora superior to the OCR text in the wikisource version. I used it to help edit https://en.wikisource.org/wiki/Page:EB1911_-_Volume_28.djvu/978 . Before using the theodora version, I did a rough proof of the auto-generated text of that page and saved it (still as not-proofed). When I pasted in the theodora version (and copied back section headers etc.), I found about 22 corrections, e.g. “Pythageras” to “Pythagoras”, “Zaieucus” to “Zaleucus”.
(edit: I’m referring to volumes which don’t exist in Gutenberg (http://www.gutenberg.org/ebooks/search/?query=Britannica&go=Go) which is the best source of proofed text, but covers only articles Andros–Magnetism.) DivermanAU (talk) 05:51, 27 April 2016 (UTC)
- Thanks for the information -- PBS (talk) 07:24, 30 April 2016 (UTC)
Index:EB1911 - Volume 08.djvu pages needing transcluding
I am doing checks of the proofread works, and EB1911 vol. 8 is one of those on-site. Looking at https://tools.wmflabs.org/checker/?db=enwikisource_p&title=Index:EB1911_-_Volume_08.djvu shows multiple pages that have not been transcluded, and I was wondering whether is interested in undertaking the task. — billinghurst sDrewth 11:56, 3 May 2016 (UTC)
- Hi billinghurst, I agree that transcluding pages is good — but would we be better off by focussing on articles that have not been created, or pages that have not been proofed yet? Personally, I think it's better to spend time creating an article that didn't exist before than to convert an existing article (which may be 99–100 % accurate) to a transcluded version.
- All articles from "A" to "Céspedes y Meneses, Gonzalo de" have been created so far (at time of writing) see Vol 5:10 for the start of articles that need to be created. — DivermanAU (talk) 07:29, 6 May 2016 (UTC)
Titles in name of the article
History of the article 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von :
- 22:11, 24 January 2009 Bob Burkhardt (talk | contribs)
- 22:15, 24 January 2009 Bob Burkhardt (talk | contribs) (1911 Encyclopædia Britannica/von Moltke, Helmuth Carl Bernhard, Count moved to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von: von should be at the end rather than the beginning)
- 14:39, 19 March 2010 Dan Polansky (talk | contribs) (1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von moved to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard: Keep the boldfaced text as the headword.)
- 16:48, 30 May 2011 Suslindisambiguator
- 15:21, 24 December 2012 MpaaBot (Bot Request: Volume information for EB1911)
- 23:53, 4 November 2014 Library Guy (Library Guy moved page 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von over redirect: now include title in article label)
The start of this article is:
- MOLTKE, HELMUTH CARL BERNHARD, Count von (1800-1891), Prussian field marshal...
@user:Library Guy you chose to move this article from just the bold part of the name to include the part in small caps. While the longer name is useful if disambiguation is needed, why do you think it a good idea to move pages so that the page's name does not reflect the bold title used in the original? -- PBS (talk) 12:49, 25 March 2017 (UTC)
Yes. And I discussed this with people also, some years ago, and I also put it in the style guide. The problem is, some of the entries just don't read well if the small-caps portion is omitted, and here, while I think an English-speaking person might be perfectly comfortable calling him Helmuth Carl Bernhard Moltke, I think the von was an essential part of his name, and the EB1911 editors would have agreed. Sometimes the name associated with the title is the same as a person's surname and in the boldface the name is repeated twice and it reads very oddly when you put the last name last without including the portion in small-caps. See 1 in the Style Manual. Notice it doesn't say all small-caps. Sometimes the small-caps are just alternative names etc. It can be a judgement call sometimes, even with just the boldface. For example, Reproductive System, I decided to omit part of the boldface, since it looked like a production error. Other similar articles I think put "in Anatomy" in small caps or body font. This is the only article where I have seen this happen. All the links to it omit "in Anatomy". Bob Burkhardt (talk) 15:03, 25 March 2017 (UTC)
Automated conversion script - Gutenberg to Wikisource format
To anyone who has used or would like to use the already-proofed text version of the EB1911 at Gutenberg.org (http://www.gutenberg.org/ebooks/search/?sort_order=title&go=Go&query=britannica), which has nearly all the articles from "Andros" to "Mecklenburg": I've been working on an automated script to convert the HTML to Wikisource format, e.g. <i>italic text</i> to ''italic text''. The more complex "style=" statements, author initials etc. are also handled. Features:
- HTML markup to Wikisource markup
- Table style conversions
- addition of extra italics marker if the line has an uneven number of italics markers (otherwise italics don't render properly)
- Nearly all author initials are converted e.g. <div class="author">(A. J. E.)</div> converts to {{EB1911 footer initials|Arthur John Evans|A. J. E.}}
- article links added
- {{smaller|A.D}} and {{smaller|B.C}} for A.D. and B.C., renders as A.D and B.C — copied text keeps capitals
- subscript and superscript conversion
- "sidenote" to "EB1911 Shoulder Heading" conversion
- Section tags automatically added (watch for cases where an existing section tag like "s1" is in use)
- use of <div align=center> ... </div> for centered text - this allows an equals sign in text where a Template:center won't render
- convert <div class="condensed"> to "EB1911 fine print"
- Greek text has the "{{Polytonic|" template added (where Gutenberg has "span class="grk" title") - but single Greek characters still need the template added manually.
- Hyphens "-" are converted to ndashes "–" where there are four numbers before the hyphen, so 1850-1851 will be converted to 1850–1851 (I added code to handle all occurrences on a line on 3 March 2017)
Known limitations:
- Footnotes are not handled, these have to be done manually.
- Cases where a use of a template like "smallcaps" etc. spans multiple lines will have to be corrected manually (you can end up with a </span> on the line following)
- Diagrams have to be added manually (but the caption formatting is handled)
It's not 100% perfect, but I've been using it recently and it's been a great help. I have to add italics manually to variables in most maths articles, as the Gutenberg version doesn't usually include these. As always, check with "Show changes" to see what has been changed before saving. Also be aware that although the Gutenberg version is usually very accurate, it has a few errors too (apart from the lack of italics for math variables and the use of hyphens, not ndashes, for year ranges).
The source code (written in AutoIt see https://www.autoitscript.com/site/) and executable (currently Guten2WikiV2.15.exe) as well as some converted text slices are here: https://www.dropbox.com/sh/dssahqtjtqleml9/AACOO55819IOefkYIXafYyZva?dl=0 . Run it from a Windows command-line with the source text as a parameter and it will convert the text into another file with "Wiki-" as a prefix. Or launch the executable and be prompted for the source text name. Use "View page source" when viewing the Gutenberg version and save that as the source text. Feel free to use and I'll try and answer questions you may have. Some likely interested users: billinghurst, PBS, Bellerophon5685, DavidBrooks, Library Guy, Suslindisambiguator, Slowking4, DutchTreat. DivermanAU (talk) 08:46, 14 February 2017 (UTC)
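A few of the conversions listed above can be sketched in Python. This is only an illustration of the rules as described in the post, not the AutoIt script itself: the function name `convert_line`, the `AUTHOR_NAMES` lookup table (with the single entry taken from the example above) and the requirement that a digit follows the hyphen in a year range are all assumptions made for the sketch.

```python
import re

# Hypothetical lookup table; only this one entry comes from the discussion above.
AUTHOR_NAMES = {"A. J. E.": "Arthur John Evans"}


def convert_line(line: str) -> str:
    """Convert one line of Gutenberg HTML to approximate Wikisource markup."""
    # Bold and italic tags become wiki quote markup.
    line = re.sub(r"</?b>", "'''", line)
    line = re.sub(r"</?i>", "''", line)

    # Year ranges: a hyphen preceded by four digits becomes an en dash.
    # Requiring a digit after the hyphen is an extra safety assumption.
    line = re.sub(r"(?<=\d{4})-(?=\d)", "\u2013", line)

    # Author initials become the footer template when the initials are known;
    # unknown initials are left untouched for manual handling.
    def initials(match: re.Match) -> str:
        key = match.group(1)
        name = AUTHOR_NAMES.get(key)
        if name is None:
            return match.group(0)
        return "{{EB1911 footer initials|%s|%s}}" % (name, key)

    return re.sub(r'<div class="author">\((.+?)\)</div>', initials, line)
```

Because `re.sub` replaces every match, all year ranges on a line are handled in one pass. Footnotes, multi-line templates and diagrams are deliberately left out, matching the known limitations listed above.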
- This sounds great. It looks like a fun project. I don't have time to look at it this afternoon, but two questions occur. Links to other articles can be "See FOOBAR", in small caps, or "FooBar (q.v.)" using regular mixed case, which is enforced using the "nosc" parameter. Can those be distinguished?
- Also, whenever I've seen them, an "uneven number of italics markers" usually means the italics are opened on one line and closed on the next. When editing in Page space manually, I usually just replace the newline by a space and let auto-wordwrap do the window fitting. I prefer that to closing the italics at the end of one line and reopening them at the start of the next (if nothing else, it makes the space that's inserted by the paginator into a non-italic space). Or are you referring to another phenomenon? DavidBrooks (talk) 21:52, 15 February 2017 (UTC)
- (ETA) Sorry, I remembered one more thing. In Page space, I think we're using proper curly apostrophes and quote marks (’ and “”); at least, I have. The raw scans are inconsistent on rendering quotes as straight (ASCII) quotes sometimes, and two tick marks at others. Also, EB1911 has a wide gap after opening and before closing quotes, which I always close up. How does Gutenberg work with those? DavidBrooks (talk) 21:58, 15 February 2017 (UTC)
- Hi -good questions! :) Luckily, the Gutenberg version is quite consistent and for article links, I follow these rules: if the Gutenberg source has <span class="sc"><a href="#artlinks"> I replace that string with "{{1911link|" (and substitute the terminator) which produces a link in smallcaps; if the source has <a href="#artlinks">, I use the "{{11link|" template which produces a link without small-caps. {edit - It does depend on whether the Gutenberg version has the tags for article links — on some later pages, there is a (q.v.) but no link tag.}
- The italics-handling code adds an italics tag to the end of the line if there is a non-matching number of italics start and italics end tags on the line. If there isn't a match, I add an italics end tag to the end of the line and add an italics start tag to the start of the next line. This was the easiest way to automate the fixing up of italics, and it leaves the line-break intact. I had previously used macros in Notepad++ to partly automate the conversion but italics would often get out-of-sync on a page and I had to manually fix those (like you describe above).
- The Gutenberg version uses curly quotes and apostrophes (without spaces between the word and the quote mark) as HTML character entities (e.g. &ldquo; &rdquo; &rsquo; etc.). I just convert these to the literal equivalents “ ” ’ . Thanks for your interest. DivermanAU (talk) 06:59, 16 February 2017 (UTC)
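The italics handling and the quote conversion described in this reply could look roughly like the following in Python. It is a sketch under stated assumptions: it works on the HTML `<i>` tags before they are converted to wiki markup, the function names `balance_italics` and `decode_entities` are made up for illustration, and the standard-library `html.unescape` stands in for whatever replacement table the actual script uses.

```python
import html


def balance_italics(lines):
    """Close italics left open at the end of a line and reopen them on the next."""
    balanced = []
    carry_open = False  # True when the previous line ended inside italics
    for line in lines:
        if carry_open:
            line = "<i>" + line
        # Italics are still open if this line has more start tags than end tags.
        carry_open = line.count("<i>") > line.count("</i>")
        if carry_open:
            line = line + "</i>"
        balanced.append(line)
    return balanced


def decode_entities(text: str) -> str:
    """Replace HTML character entities with their literal characters."""
    return html.unescape(text)
```

The line breaks stay where they were, which is the property the reply says it wanted to keep. Note that `html.unescape` decodes every entity it recognises, including `&amp;` and `&lt;`, so a real conversion might prefer an explicit list of quote entities.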
- that’s excellent - i note you have gotten to volume 6, this should speed the work to volume 17.
- deleting the soft carriage returns is an improvement Gutenberg does not do; curly quotes and apostrophes are a problem, we have find and replace in VE to convert to straight.
- i’m creating articles now, but will try to pitch in in a few months. Slowking4 ‽ SvG's revenge 16:10, 1 March 2017 (UTC)
- I noticed a while back that there’s a gap in the Gutenberg version for articles in-between “Conduction, Electric” and “Constantine” (vol 6. pages 890 onwards), so I'm working on those currently to try and finish vol. 6. (http://www.theodora.com/encyclopedia/ is useful in these cases — though it’s not as accurate as Gutenberg).
- Regarding quotes, the consensus — see "Typographical changes in Page space" section above with comments by @PBS:, @DavidBrooks: and myself — seemed to be to maintain the curly quotes and apostrophes. The OCR scan usually has curly quotes & apostrophes, Gutenberg has them and they are more faithful to the original printed book.
- The soft carriage returns can be eliminated by the “clean up” script, Billinghurst added that functionality to my JavaScript page User:DivermanAU/common.js. But do some users prefer to maintain the soft line breaks as it makes it easier to proof? (The line breaks still match the print book when editing). Currently the script reads one line at a time and writes out each line terminated with <carriage return><line-feed> characters. This could be changed to replace the terminating <carriage return><line-feed> with a <space> character, but you'd need to check it didn't have adverse effects (e.g. in tables etc.). DivermanAU (talk) 23:36, 1 March 2017 (UTC) (update — I think it's best to maintain the soft line breaks, that way it makes it easier to compare changes that have been made by using the "Show changes" button. If needed, the linebreaks can be removed by the TemplateScript "clean up", or with a replace (\r\n with <space>) in NotePad++ if just one section needs the linebreaks removed.) DivermanAU (talk) 22:09, 2 March 2017 (UTC)
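For anyone who does want the soft line breaks removed outside the editor, the replace described above can be sketched in Python. The function name `join_soft_breaks` is hypothetical, and keeping blank lines as paragraph separators is an assumption added here; the reply itself only mentions replacing `\r\n` with a space and warns that tables may need checking.

```python
def join_soft_breaks(text: str) -> str:
    """Join the lines of each paragraph with spaces, keeping blank-line breaks."""
    # Normalise Windows line endings first so both styles are handled.
    paragraphs = text.replace("\r\n", "\n").split("\n\n")
    return "\n\n".join(" ".join(p.split("\n")) for p in paragraphs)
```

This sketch makes no attempt to protect table markup, so it is best applied to a single prose section at a time.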
- I completely agree that the soft line breaks make proof-reading much easier. Unfortunately editors in the past have often removed them (using the editor's word-wrap) for various reasons of their own. I only fuss about the close-and-reopen-italics problem because it introduces a small semantic error: I assume the newline is replaced in HTML with a non-italic space. That's a distinction that's probably (literally) invisible though, and when or if I go back to WS I'll stop worrying about that. It's probably more significant to worry whether a final period or comma is part of the citation (and should be italic) or not. DavidBrooks (talk) 00:01, 2 March 2017 (UTC)
←I support the use of the curly double quotes, as we discussed before. If in the future we decide to change, it is an easy bot job; going the other way is likely to be troublesome. I strongly support keeping the line lengths the same as in the original, as it makes proofreading much easier. -- PBS (talk) 18:42, 25 September 2017 (UTC)
Problems with Gutenberg encoding
I find problems with character encodings in the Gutenberg material. Most recently ṃ when it should have been ṁ, and ṅ when it should have been ṇ (v. 13, p. 501). I think I've only found problems with dotted characters so far. Bob Burkhardt (talk) 18:04, 3 August 2017 (UTC)
- Gutenberg is not 100% accurate, I find the occasional issue (I found a superscript number instead of subscript recently for example) but overall the quality is very good. Remember you can use my conversion script to convert Gutenberg text (it automatically does the ndashes for year ranges and adds EB1911 footer initials etc.) It always pays to check it yourself though afterwards. DivermanAU (talk) 00:20, 18 September 2017 (UTC)
Two archiving bots available
FYI. The bots user:Wikisource-bot and user:SpBot are both available to automatically archive if you need them to do so. — billinghurst sDrewth 15:18, 25 March 2017 (UTC)
- Thanks, I guessed there were but did not know their names. I have installed the config file for user:Wikisource-bot (chosen because the documentation was better). I hope it works :-) --PBS (talk) 16:00, 25 March 2017 (UTC)
Many links in an article
Please see Wikisource talk:WikiProject 1911 Encyclopædia Britannica/Style Manual#Many link in an article -- PBS (talk) 21:28, 1 October 2017 (UTC)
Page numbers now missing when viewing transcluded articles?
Anybody also have missing page numbers when viewing transcluded articles? e.g. Crookes, Sir William. The page numbers used to appear on the left of the article text. DivermanAU (talk) 19:52, 7 October 2017 (UTC)
- Many others are experiencing this as well. [1] Londonjackbooks (talk) 20:41, 7 October 2017 (UTC)
Image names
Please see: Wikisource talk:WikiProject 1911 Encyclopædia Britannica/Style Manual#Image names for a question on what to name images in this project. -- PBS (talk) 19:14, 13 September 2018 (UTC)
Footer split over two pages
I recently came across a similar problem with a hyphenated word at the end and start of two pages and found two templates already written to handle the issue. I have documented the solution see:
- Wikisource:WikiProject 1911 Encyclopædia Britannica/Transclusion#Hyphenated word at end of page
The following pages have a construction I had not seen before in EB1911: a footnote carried over from one page to another.
I tried the suggestion in:
- Wikisource:WikiProject 1911 Encyclopædia Britannica/Transclusion#Other page-crossing constructs
But it did not work.
So taking my lead from the templates used to solve the "Hyphenated word at end of page" problem, I have handled it by
- placing the text of the second half of the footnote into the footer section of Page 415
- then copying the text onto Page 414 and appending it to the footnote. I have wrapped the text up in a template so that it only displays in main space 1911 Encyclopædia Britannica/Tunisia.
At the moment the template is in my user area User:PBS/template all it contains is:
{{#ifeq: {{NAMESPACEE}} ||{{{1}}}<!-- -->}}
Does anyone have an alternative way of dealing with footnotes split over two pages? If not, I would appreciate some suggestions for a name for this template. I will then move it into template space and document the footer and template combination in:
-- PBS (talk) 14:22, 18 October 2018 (UTC)
- You'll want to use the "ref follow=" construction. I think it's documented at Help:References. You can find an example at Page:History of Oregon volume 1.djvu/394 and Page:History of Oregon volume 1.djvu/395 and rendered at History of Oregon (Bancroft)/Volume 1/Chapter 13. -Pete (talk) 16:01, 18 October 2018 (UTC)
- Help:Footnotes and endnotes — billinghurst sDrewth 00:05, 19 October 2018 (UTC)
- Thanks for that information. I will try it out. -- PBS (talk) 12:16, 23 October 2018 (UTC)
@user:billinghurst. I tried out your idea and it works for articles where the transcluded pages include both halves of a footnote that is split over two pages, for example 1911 Encyclopædia Britannica/Saint Marys. However, if the transcluded pages do not include the second half of the footnote, as is the case with 1911 Encyclopædia Britannica/Tunisia, then the "ref follow=" construction does not work. Do you have a suggestion for how we ought to handle such cases?
-- PBS (talk) 09:14, 4 November 2018 (UTC)
Macron and diereses combined
On the page 1911 Encyclopædia Britannica/Osman (sultans) there is a combined macron and diaeresis diacritic above a small "a", like this: "ǟ". The template {{small-caps}} makes such a letter with diacritics into a small cap with diacritics.
In the selection box for characters at the top of a [[:PAGE:]] to be transcluded there are options for Macrons and Diereses, and there is also an "other diacritics" option, but none of them includes an "ǟ". I found the HTML character using the following website:
- AG, Compart. "Find all Unicode characters from Hieroglyphs to Dingbats – Unicode Compart". www.compart.com. Retrieved 23 October 2018.
-- PBS (talk) 12:16, 23 October 2018 (UTC)
Keeping footnotes in their original place
Following on from the conversation above.
The construct of "ref name=tagname" and "ref follow=tagname" could also be useful on pages that are not split, because it can be used to keep the footnote text in the same place on the page as it is in the original. This will help in checking for OCR errors, as there would be no need to move the footnote text up the page to the place where it is displayed. Simply alter the original body of the text to include <ref name="note 1"/> like this,[1] and then at the bottom of the column (or wherever the footnote appears in the original text) use <ref follow="tagname"> like this: <ref follow="note 1">This is a footnote</ref>
Notes:
- ↑ This is a footnote
If there is a consensus I will include this as an option in the EB1911 MOS. -- PBS (talk) 09:28, 4 November 2018 (UTC)
- Thanks for the information about footnotes, I've used name/follow quite a few times now. Useful when copying from Gutenberg version to make comparison easier. Just have to be careful that the "follow" part doesn't reside past the section break (e.g. after ## s2 ## ) if the footnote is in the "s1" section, otherwise the ref doesn't work when transcluded. DivermanAU (talk) 22:45, 29 June 2019 (UTC)
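The section-break caveat in the last reply can be checked mechanically. The Python sketch below scans the wikitext of a single Page for "ref follow" tags whose matching "ref name" tag sits in a different labelled section on the same page. The function names and regular expressions are assumptions made for illustration; the sketch only recognises the double-quoted attribute form and the `## s1 ##` style of section marker mentioned above.

```python
import re

SECTION_MARK = re.compile(r"##\s*(.+?)\s*##")
REF_NAME = re.compile(r'<ref\s+name="([^"]+)"')
REF_FOLLOW = re.compile(r'<ref\s+follow="([^"]+)"')


def split_sections(page_text):
    """Return (label, text) pairs; text before the first marker gets label None."""
    # Splitting on a pattern with one group yields [pre, label1, text1, label2, ...].
    parts = SECTION_MARK.split(page_text)
    sections = [(None, parts[0])]
    for i in range(1, len(parts), 2):
        sections.append((parts[i], parts[i + 1]))
    return sections


def misplaced_follows(page_text):
    """List footnote names whose follow tag is in a different section than the name tag."""
    where_named = {}
    follows = []
    for label, text in split_sections(page_text):
        for name in REF_NAME.findall(text):
            where_named.setdefault(name, label)
        for name in REF_FOLLOW.findall(text):
            follows.append((name, label))
    # Names not defined on this page (footnotes continued from an earlier page) are skipped.
    return [n for n, label in follows if n in where_named and where_named[n] != label]
```

An empty result means every follow tag on the page shares a section with its name tag; it says nothing about footnotes continued from a previous page, which still need to be checked by viewing the transcluded article.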
Half a dozen left
At the moment there are half a dozen or so Wikipedia pages left that access EB1911 articles that have yet to be placed on Wikisource (over 20,000 done):
Wikipedia article | EB1911 article name | location | notes |
---|---|---|---|
w:Psalms of Solomon | "Solomon, Psalms of", 25, pp. 365–366. | Vol 25:5 index, first OCR page | |
w:Temperature measurement | "Thermometry", 26, pp. 821–836. | Vol 26:9 index, first OCR page | Long article with lots of mathematical formulas and embedded images |
w:Terpene | "Terpenes", 26, pp. 647–652. | Vol 26:8 index, first OCR page | Long article with many chemical structure diagrams; fixed |
w:Water supply | "Water Supply", 28, pp. 387–409 | Vol 28:7 index, first OCR page | Long article lots of images, formulae, a table and a plate |
w:Weighing scale | "Weighing Machines", 28, pp. 468–477 | Vol 28:8 index, first OCR page | long article lots of images; fixed? |
w:William Dent Priestman | "Oil Engine", 20, pp. 35–43 | Vol 20:1 index, first OCR page | long article lots of images (djvu source has a yellow background) |
w:Worsted | "Wool, Worsted and Woollen Manufactures", 28, pp. 805–816 | Vol 28:13 index, first OCR page | long article some images, shoulder headings and tables, and a plate |
If anyone is thinking of proofreading some pages, please consider doing one of these. They will be time-consuming as they are not straightforward (apart from "Solomon, Psalms of", which needs someone who can proofread the Hebrew text). The others are largish and involve displaying images and a few tables. -- PBS (talk) 12:04, 30 July 2020 (UTC)
- I fixed the Psalms of Solomon article. I've learned how to add Hebrew diacritics (thanks to Wikipedia) and updated Wikisource with help from the high-resolution scan at https://archive.org/stream/encyclopaediabri25chisrich#page/366/mode/1up. For the other articles, the version at theodora.com may be of help for tables etc. e.g. https://theodora.com/encyclopedia/o/oil_engine.html — DivermanAU (talk) 20:35, 25 September 2020 (UTC)
- Working on Terpenes. I have 9 of the diagrams uploaded so far. DavidBrooks (talk) 20:47, 8 October 2020 (UTC) ... Terpenes done, although I'm going to review some of the typography decisions. DavidBrooks (talk) 21:57, 15 October 2020 (UTC)
Template:EB1911 Page Heading seems broken just now
Anybody know why the "EB1911 Page Heading" template seems to all-of-a-sudden display incorrectly for pages with four parameters? e.g. Page:EB1911 - Volume 04.djvu/1010. The template itself doesn't appear to have been edited recently. @Billinghurst: @DavidBrooks:. DivermanAU (talk) 13:11, 25 May 2021 (UTC)
- @DivermanAU: Define broken. There was a tab in the use, which is not null. [2]. Does it look better for you now? — billinghurst sDrewth 13:51, 25 May 2021 (UTC)
- It does look better now. The issue was that the page number appeared in the center (instead of at the side), with the two article names together at the far right (instead of being centered). It was happening on multiple pages, including those I had edited a short time beforehand which looked OK at the time, but then had an issue when I looked later. Pages that had the alignment issue before now look OK (without anybody editing the page) so I don't know what was happening. Anyway, thanks for the reply, I hope it will be OK from now on. DivermanAU (talk) 21:07, 25 May 2021 (UTC)
- I didn't see anything like you are mentioning. I am guessing that it is related to the mods that are going on in Template:RunningHeader by user:Inductiveload. Inductiveload, it may be worth doing some of those changes through the template's sandbox. — billinghurst sDrewth 00:20, 26 May 2021 (UTC)
- I actually did use a sandbox (just not that template's own sandbox) but I missed something really dumb because I am dumb. Sorry! On the up side, hopefully the central text will actually be in the centre of the pages more reliably now. Inductiveload—talk/contribs 13:18, 26 May 2021 (UTC)
- That answers my confusion. I realize now there were two things going on: the temporarily broken RH template, and the missing empty parameter in the example that billinghurst fixed. Yes, I used positional parameters in that template because I'm lazy and it was the first template I wrote from scratch, and I realize it's easy to overlook the empty ones. DavidBrooks (talk) 19:54, 26 May 2021 (UTC)