User:BrandeisBot/Documentation
BrandeisBot uses the lochner tool to scrape legal cases from Justia. These cases are then converted to wikitext using the brandeis tool, and uploaded to Wikisource using a pywikipediabot.
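Taken together, the three tools form a simple pipeline. A minimal sketch of the overall flow, with made-up helper names (scrape_case, convert_case, upload_pages) standing in for the actual tools:

```python
# Hypothetical orchestration of the three stages; in reality lochner,
# brandeis, and pywikipediabot are separate programs.
from pathlib import Path

def run_pipeline(case_urls, html_dir=Path("html"), wiki_dir=Path("wikitext")):
    html_dir.mkdir(exist_ok=True)
    wiki_dir.mkdir(exist_ok=True)

    for url in case_urls:                    # stage 1: lochner scrapes Justia
        scrape_case(url, html_dir)

    for html_file in sorted(html_dir.glob("*.html")):
        convert_case(html_file, wiki_dir)    # stage 2: brandeis validates and converts

    upload_pages(wiki_dir)                   # stage 3: pywikipediabot uploads the pages
```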
Process
- lochner scrapes a selection of cases (in HTML format) from Justia.
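lochner's source isn't reproduced here, but the scraping stage can be pictured as fetching each case page and saving the raw HTML. A sketch only, assuming the requests library; the URL handling and filenames are invented, and the real tool may work quite differently:

```python
# Hypothetical scraper; Justia URL structure and output naming are assumptions.
from pathlib import Path
import requests

def scrape_case(url: str, out_dir: Path) -> Path:
    """Fetch one case page and save the raw HTML for brandeis to process."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Name the file after the last two URL components, e.g. ".../198/45/" -> "198-45.html"
    name = "-".join(url.rstrip("/").split("/")[-2:]) + ".html"
    out_path = out_dir / name
    out_path.write_text(resp.text, encoding="utf-8")
    return out_path
```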
Validation
- brandeis performs some basic validation to ensure the file is correctly formatted. If it is not, the file is skipped entirely.
- Pulls some metadata out of the file (case name, case number, date, etc.), which it then uses to check whether the case already exists on Wikisource
- Checks the file (using the title and case number) against the list of cases at the appropriate volume page of U.S. Reports
- If the case already exists on Wikisource, if a duplicate in the list cannot be resolved, or if the case is not in the list, prompts the user to decide whether to continue
- If the user chooses to continue, or if the case appears in the list as a redlink, the process continues on to the next section (a sketch of this check follows the list)
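A minimal sketch of this check, assuming the volume listing has already been parsed into a mapping from case number to link status; the function names and data shapes are illustrative, not brandeis's actual API:

```python
# Hypothetical validation step; data shapes and names are assumptions.
def should_continue(meta: dict, volume_listing: dict) -> bool:
    """Decide whether to proceed with one case.

    meta           -- e.g. {"title": "Lochner v. New York", "number": "198 U.S. 45"}
    volume_listing -- case number -> "exists" or "redlink", taken from the
                      appropriate volume page of U.S. Reports
    """
    status = volume_listing.get(meta["number"])
    if status == "redlink":
        return True  # listed but not yet created: safe to proceed automatically
    # Already on Wikisource, an unresolvable duplicate, or missing from the
    # list: fall back to prompting the operator.
    answer = input(f"{meta['title']} ({meta['number']}): continue anyway? [y/N] ")
    return answer.strip().lower() == "y"
```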
Parsing
- Strips extraneous HTML from the file (e.g., the page header)
- Converts the file into a token stream using PLY (see the lexer sketch after this list)
- Parses these tokens into basic wikitext
- A secondary parsing pass then performs more specific operations:
- The case is split into sections (Syllabus, Opinion of the Court, etc.)
- Footnotes are pulled out of the main text and placed into <ref></ref> tags
- Page numbers are replaced with the {{page break}} template. The line break is maintained so the specific position of the page break is not lost
- {{header}}, {{CaseCaption}}, {{USSCcase}}, and {{USSCcase2}} templates are added
- Extraneous spaces are removed
- Talk pages (containing {{textinfo}} templates) are added
- A redirect page is created from the case number
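The PLY tokenizing step mentioned above can be pictured with a toy lexer. The token set below is invented for illustration (brandeis's real grammar is certainly richer), and the page-number pattern is an assumption about how Justia's HTML marks page boundaries:

```python
# Toy PLY lexer over case HTML; token names and patterns are illustrative only.
import ply.lex as lex

tokens = ("PAGE_NUMBER", "TAG_OPEN", "TAG_CLOSE", "TEXT")

def t_PAGE_NUMBER(t):
    r"Page\s+\d+\s+U\.\s*S\.\s+\d+"
    return t

def t_TAG_OPEN(t):
    r"<[a-zA-Z][^>]*>"
    return t

def t_TAG_CLOSE(t):
    r"</[a-zA-Z][^>]*>"
    return t

def t_TEXT(t):
    r"[^<]+"
    return t

def t_error(t):
    t.lexer.skip(1)  # drop characters no rule matches

lexer = lex.lex()
lexer.input("<p>Page 198 U. S. 45</p>")
for tok in lexer:
    print(tok.type, tok.value)
```

A parser built on a stream like this would then emit wikitext, e.g. turning each PAGE_NUMBER token into a {{page break}} template on its own line.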
Throughout this process, logfiles are created. The "reports" logfile contains warnings about the text (missing footnotes, header templates that brandeis could not add, etc.) so that manual cleanup is easier. The "summary" logfile lists all of the pages that will be created when pywikipediabot runs through the text file.
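A minimal sketch of one way to maintain two such logfiles with Python's standard logging module; the filenames and message formats here are assumptions, not brandeis's actual output:

```python
# Hypothetical two-logfile setup; names and formats are illustrative.
import logging

def make_logger(name: str, path: str) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

reports = make_logger("reports", "reports.log")  # warnings needing manual review
summary = make_logger("summary", "summary.log")  # pages the upload run will create

reports.warning("Lochner v. New York: footnote 3 referenced but not found")
summary.info("Will create: Lochner v. New York")
```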