User:Inductiveload/Requests/Batch uploads

From Wikisource
Requests

I can upload batches of files from the IA or HathiTrust. However, I will require the metadata to do so. I will not do uploads if you don't give me the data (unless I really, really want to anyway).

I can also create files from batches of images. In this case, you will need to provide details of where I can get the images from. I can help you with batch downloading images if you need. If you already have the images, probably the easiest way to share them with me is to upload to the Internet Archive as an "image ZIP" following these instructions.

Data file format

I will need a spreadsheet (XLSX, CSV or ODS) with the following columns. The column names are important; don't change them.

| Column | Required? | Purpose | Example |
|---|---|---|---|
| title | Required | The title of the work. For a batch, this is often the same for every row. | The Atlantic |
| subtitle | Optional | The work's subtitle. Give it if there is one. | A magazine of Literature, Science, Art and Politics |
| author | Optional | Author(s), slash-separated | "Oscar Wilde" or "Q30875" |
| editor | Optional | Editor(s), slash-separated | |
| illustrator | Optional | Illustrator(s), slash-separated | |
| translator | Optional | Translator(s), slash-separated | |
| year | Required | The publication year | 1868 |
| volume | Optional | The volume number | 22 |
| subpage | Optional | The volume subpage at Wikisource (if it's not just "Volume XX"). Not required if the work doesn't have a subpage (e.g. a simple single-volume book), or if it does and it's "Volume XX" (in that case, it is inferred from the presence of volume). | |
| vol_detail | Optional | Detail string for the volume, used in the book template and the index page | July–December 1868 |
| vol_disp | Optional | The volume display string for the Commons book template. Will not be used in a page title. If not given, it defaults to "Volume XX" followed by the vol_detail, if any, in brackets. | Volume 22 (July–December 1868) |
| filename | Optional | The target filename (no extension). If not given, a default will be attempted with a format like Brave New World - Huxley - 1932 or The Atlantic Monthly - Volume 22. | The Atlantic Monthly - Volume 22 |
| id | Required | The external source ID. For "url" sources, the URL. Blank if you provide a file archive to me somehow; required otherwise. | atlantic22bostuoft |
| source | Required | The source: either "ia", "ht" or "url" | ia |
| file | Optional | If uploading files from some file archive you give me (rather than directly from the IA, a URL, etc.), the filename in that collection | File 1.pdf |
| oclc | Optional | The OCLC number | 297234877 |
| lccn | Optional | The LCCN | |
| city | Optional | The city of publication | Boston |
| publisher | Optional | The publisher | Fields, Osgood, & Co. |
| printer | Optional | The printer | |
| license | Required | The license (so it can be inserted into {{pd-scan}}) | PD-US-expired |
| pagelist | Optional | Manual pagelist tag. If you don't provide this, one will be generated from the IA or HT metadata, if possible. This is usually incomplete, but it's generally a good start. | |
| img_pg | Optional | The image page (used as the title page). Usually the source provides this information via the page list metadata. | |
| language | Required | The work's language code | en |
| commonscats | Required | Categories for the work at Commons, slash-separated | The Atlantic Monthly, 1868 |
| vollist | Optional (required for multi-vol works) | The volume list template (or wikitext) | {{Atlantic Monthly volumes}} |
| only_pages | Optional | If only some pages should be included from the source, one or more numbers or ranges, comma-separated. | 1-100,103,105-199 |
| rm_pages | Optional | If some pages should be excluded, one or more numbers or ranges, comma-separated. Note: applies after the included pages. | 1,5-8,1234 |
| to_ws | Optional | Set to y if the file should be uploaded to Wikisource rather than Commons. | |
| ws_lang | Optional | The target Wikisource: where the index pages will be made, and where the files will be uploaded if to_ws is set. Default is en. Use mul for Multilingual Wikisource. | |
| access | Optional | Set to us if the work is not accessible outside the US (usually for Hathi). | |
| no_commons_until | Optional | The date after which the file can be moved to Commons (used for the until parameter of {{Do not move to commons}}). Mandatory if to_ws is set. | 2035 |
| no_commons_reason | Optional | The reason the file shouldn't be moved to Commons (used for the why parameter of {{Do not move to commons}}). Mandatory if to_ws is set. | Multi-author work published in UK |
| user | Optional | Requesting user name. If given, it will be used in the index page creation summary (which will both ping that user and make it clear who found that file). | Inductiveload |

  • All data, like printer, that is available should be provided. It's a lot easier to put it in now than patch it in later.
  • You can add as many other columns as you like for your own purposes, such as building up strings. They will be ignored.
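
Before sending a spreadsheet, it can help to check each row against the "Required?" column of the table above. The following is only a rough sketch of such a check, not the actual upload tooling: the function name `check_row` and the rule that `id` may be left blank when `file` is filled in are assumptions drawn from the table's wording.

```python
import csv
import io

# Columns marked "Required" in the table above.
REQUIRED = ["title", "year", "id", "source", "license", "language", "commonscats"]
ALLOWED_SOURCES = {"ia", "ht", "url"}


def check_row(row):
    """Return a list of problems found in one spreadsheet row (a dict)."""
    problems = []
    for col in REQUIRED:
        # Assumption: "id" may be blank when the files come from an
        # archive you supply, identified by the "file" column.
        if col == "id" and (row.get("file") or "").strip():
            continue
        if not (row.get(col) or "").strip():
            problems.append(f"missing required column: {col}")
    source = (row.get("source") or "").strip()
    if source and source not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {source}")
    return problems


# Example row built from the "Example" column of the table.
example = {
    "title": "The Atlantic",
    "year": "1868",
    "volume": "22",
    "id": "atlantic22bostuoft",
    "source": "ia",
    "license": "PD-US-expired",
    "language": "en",
    "commonscats": "The Atlantic Monthly, 1868",
}

# Round-trip through CSV to show that the check works on a real file too.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(example))
writer.writeheader()
writer.writerow(example)
buf.seek(0)
print([check_row(r) for r in csv.DictReader(buf)])  # prints [[]]
```

Extra columns are simply ignored by this check, which matches the note above that you can add your own working columns.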

There are some examples here: https://drive.google.com/drive/folders/1fW5ozskDJiyVoQycUoGEB7d-L_Uh6N7b
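
To make the only_pages and rm_pages columns concrete, here is a minimal sketch of how such specs could be expanded. It assumes that pages are numbered from 1 and that rm_pages refers to the original page numbers of the source (not to positions after only_pages has been applied). The table does not settle that point, so treat it as a guess. The names `parse_ranges` and `select_pages` are hypothetical.

```python
def parse_ranges(spec):
    """Expand a spec such as "1-100,103" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


def select_pages(total_pages, only_pages="", rm_pages=""):
    """Apply only_pages first, then rm_pages, to pages 1..total_pages."""
    if only_pages.strip():
        keep = parse_ranges(only_pages)
    else:
        keep = list(range(1, total_pages + 1))
    removed = set(parse_ranges(rm_pages))
    return [p for p in keep if p not in removed and 1 <= p <= total_pages]


print(select_pages(10, only_pages="1-5,8", rm_pages="2,4-5"))  # prints [1, 3, 8]
```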

Authors, etc

If you provide strings like Oscar Wilde, they will be used as-is. If you provide a Wikidata ID like Q30875, it will be used in the creator template at Commons, and the linked Wikisource author page (in this case, Author:Oscar Wilde) will be used for the index page.

Separate multiple authors with slashes, e.g. Oscar Wilde/Albert Einstein.
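
As an illustration of the convention above, a cell could be split and classified roughly as follows. This is a sketch only: the function name `parse_people` and the exact pattern used to recognise a Wikidata ID (a "Q" followed by digits) are assumptions, and no lookup of the linked author page is attempted.

```python
import re

# Assumption: a Wikidata item ID is "Q" followed by a positive integer.
WIKIDATA_ID = re.compile(r"Q[1-9][0-9]*")


def parse_people(cell):
    """Split a slash-separated cell into (kind, value) pairs.

    kind is "wikidata" for IDs such as Q30875, otherwise "text".
    """
    people = []
    for name in cell.split("/"):
        name = name.strip()
        if not name:
            continue
        kind = "wikidata" if WIKIDATA_ID.fullmatch(name) else "text"
        people.append((kind, name))
    return people


print(parse_people("Oscar Wilde/Q30875"))
# prints [('text', 'Oscar Wilde'), ('wikidata', 'Q30875')]
```

The same splitting would apply to the editor, illustrator and translator columns, which are also slash-separated.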

Sources

I can download from the following sources using the relevant ID of the work at that source:

  • ia: The Internet Archive
  • ht: HathiTrust

I can also use direct URLs to any other online resource at a publicly-accessible location. In this case, set source to url and put the URL in id.

I may be able to add other sources if generally useful - just ask. This can include something like a Dropbox or other (decent: no dodgy hosts, please) web drive, as long as the images are in a unique folder per work and are in order.

I can upload files locally to Wikisources if needed, if they are not suitable for Commons for copyright reasons.

If the file is not a US work (e.g. a non-US author), you must not specify PD-US as the copyright if the file is going to go to Commons. You should specify a suitable template. Usually, this is PD-old-auto-expired: in that case you must also give deathyear to show why the work is PD in the country of origin.

If the file is coming to Wikisource (usually because it is still in copyright in the country of origin, but not in the US), you should set to_ws to yes, set ws_lang if the target is not en, and you must provide no_commons_until and no_commons_reason.
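
The dependency between to_ws and the two no_commons_* columns can be checked mechanically. The sketch below is an assumption-laden illustration, not the real tooling: the name `check_local_upload` is hypothetical, and it accepts both "y" and "yes" as a set to_ws value because both spellings appear on this page.

```python
def check_local_upload(row):
    """Return problems for a row that asks for a local Wikisource upload."""
    problems = []
    # Assumption: "y" and "yes" (any case) both count as "set".
    to_ws = (row.get("to_ws") or "").strip().lower() in {"y", "yes"}
    if to_ws:
        for col in ("no_commons_until", "no_commons_reason"):
            if not (row.get(col) or "").strip():
                problems.append(f"{col} is required when to_ws is set")
    return problems


row = {
    "to_ws": "y",
    "no_commons_until": "2035",
    "no_commons_reason": "Multi-author work published in UK",
}
print(check_local_upload(row))  # prints []
```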

Spreadsheet automation

Note, you can often use the volume number to build the other cells with spreadsheet equations. For example, if the volume number is col G and the title is col C, then the filename for row 2 might be =C2 & " - Volume " & G2.

Likewise, you can increment numbers. If row 2's volume is 1, then you can make row 3's 2 using =G2 + 1.

You can zero-pad numbers with, e.g., TEXT(G2, "00").

In this way, you can save a lot of tedious typing. However, do make sure that the data stays accurate. Very often things like publisher, printer or even the date ranges of volumes can change halfway through a series.
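
If you would rather generate the rows with a script than with formulae, the same ideas carry over. The following sketch mirrors the formulae above and the default vol_disp rule from the column table; the function names `volume_rows` and `default_vol_disp` are hypothetical, and only a few columns are filled in.

```python
def default_vol_disp(volume, vol_detail=""):
    """Build "Volume XX", followed by the vol_detail in brackets if given."""
    text = f"Volume {volume}"
    if vol_detail:
        text += f" ({vol_detail})"
    return text


def volume_rows(title, first_volume, count, pad=0):
    """Create one partial row per volume, incrementing the volume number.

    pad is the minimum number of digits, like TEXT(G2, "00") for pad=2.
    """
    rows = []
    for volume in range(first_volume, first_volume + count):
        number = str(volume).zfill(pad)
        rows.append({
            "title": title,
            "volume": str(volume),
            "filename": f"{title} - Volume {number}",
        })
    return rows


for row in volume_rows("The Atlantic Monthly", 22, 2):
    print(row["filename"])
# prints:
# The Atlantic Monthly - Volume 22
# The Atlantic Monthly - Volume 23
```

As with formulae, generated values such as publisher or date ranges still need checking by hand against each volume.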

If you use formulae, I'd prefer to receive an XLSX file rather than a CSV file, since I can adjust the formulae if needed.

Authority control

The OCLC number is optional, but highly recommended: the OCLC ID is a very good way to link the files and indexes with structured data, as it is (or should be) a unique key.

Sending the file

You can send me the file by creating a task on my Workboard at Phabricator and attaching your spreadsheet, or commenting on my talk page and providing a link to some other file host (e.g. Google Drive, Dropbox, etc).

If you use formulae in your spreadsheet, I'd rather have the original spreadsheet (XLSX/ODS) than an exported CSV file, because if I need to make changes to anything, it's easier if the formulae still work.

Known issues

  • Pagelists are generated from the source's upstream data. The quality of this ranges from near-perfect to complete junk. It will be your responsibility to deal with this. All indexes are created with "to be checked" status for this reason.
    • If you provide a pagelist field, the index will be set to "to be proofread" instead.

Your tasks

You have some work to do even once the batch upload is complete:

  • If the works are part of a series, any index volume list templates (e.g. {{American Printer volumes}}) in the vollist column should be created also
  • All the Commons categories you specify should exist and be categorised
  • Finishing the pagelists on the index pages (the upload will include an automatically-generated pagelist from the IA or HathiTrust metadata, but this is usually incomplete)
  • Adding {{small scan link}} templates to Author and Portal pages as appropriate
  • Generally tidying up if there are other rough edges.

By making a batch upload request, you agree to undertake these tasks.