User:GemmaBot
- This user was a bot account run by SnowyCinema, no longer used.
Gemma is a very real AI-based technology that is currently being developed to aid QuickTranscribe. Her goal is to make the QuickTranscribe process even faster and more accurate, tailored to specific types of works. The types currently up for consideration are novels, silent films, and (maybe) sound films.
GPT's multimodal capabilities let it transcribe book pages with truly remarkable accuracy. The only issue is that it is flaky: sometimes it is that accurate, sometimes it says "Sorry, I can't help with that", and sometimes it gives you plain OCR that you could have just retrieved without it. There's no direct way to stop it from ever falling back on OCR.
While this does use AI for transcription purposes, AI should be used as little as possible because GPT tokens cost money.
Novels
- Blank pages would be identified as such by the AI transcription mechanism.
Front matter
Front matter pages are a pain to have to deal with, and they're mostly similar for common publishers in most novels. To automate this would be pretty sweet.
Some specifics could be established after transcription, without AI:
- The cover will just be known when found; it doesn't need AI to determine this.
- Based on the page numbers you give for where roman numbering starts and where non-roman numbering starts, it will page-number the entire front matter for you.
- Half titles: just determine whether the page probably is one, and then put /half/
- Frontispiece and other images: see #Images.
- Is the title on one line and a known publisher's name on another line? Does it read "{Title}" + "By" + "{accepted author name}"? If so, it's a title page.
- If it's a title page, which publisher's mark does it use, if any? Identify where it is, and identify which of that publisher's marks it's most similar to. You can use reverse-image algorithms to identify these.
- Develop a pretty safe method of determining which formatting is probably used, based on other sample data.
- "Copyright, 1..., by" / "All Rights Reserved" / similar lines without mention of the title would be good indicators of a copyright page
- Develop a pretty safe method of determining which formatting is probably used, based on other sample data.
- Dedications
- Based on the line spacing, make an educated guess?
- Table of contents
- /toc//XI - based on what was transcribed from the TOC
- //toc/
- Illustrations
- If a page is detected to be an illustrations page, just put its header with /illus/
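The page-numbering idea in the list above can be sketched without any AI. This is only a rough illustration, not QuickTranscribe's actual code: the function names, the 1-based scan positions, the "-" label for unnumbered pages, and the assumption that roman numbering begins at "i" and arabic numbering at "1" are all made up for the example.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    if n <= 0:
        raise ValueError("roman numerals need a positive integer")
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def label_front_matter(roman_start: int, arabic_start: int,
                       first_roman: int = 1) -> dict:
    """Map 1-based scan positions to page labels, up to the first arabic page.

    Hypothetical helper: scans before roman_start get "-" (unnumbered),
    scans from roman_start up to arabic_start get roman numerals, and the
    scan at arabic_start is labelled "1".
    """
    if not 1 <= roman_start <= arabic_start:
        raise ValueError("expected 1 <= roman_start <= arabic_start")
    labels = {}
    for scan in range(1, arabic_start + 1):
        if scan < roman_start:
            labels[scan] = "-"
        elif scan < arabic_start:
            labels[scan] = to_roman(first_roman + scan - roman_start)
        else:
            labels[scan] = "1"
    return labels
```

For a book whose roman numbering starts on scan 3 and whose first arabic page is scan 7, `label_front_matter(3, 7)` labels scans 1 and 2 as unnumbered, scans 3 to 6 as i to iv, and scan 7 as 1. The `first_roman` parameter covers books whose first numbered front-matter page is, say, iii rather than i.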
Content pages
- It'll detect if the title was in there and replace this with /ti/.
- It'll detect "CHAPTER VI" etc. and give this the /ch/ treatment.
- It'll turn "CHAPTER VI\nTHE TITLE OF THIS CHAPTER" into "/ch//THE TITLE OF THIS CHAPTER" automatically.
- It'll detect and take out the preamble that GPT often puts above "---".
- At the end of the scrape: detect whether a page is a "Sorry I can't help you" error message, or likely to be one. The AI will go back over those pages with you, and you tell it which ones to correct or not correct, until the messages are gone.
- At the end of the scrape: detect whether it's likely that GPT used OCR, review all those pages, and see which ones to regenerate.
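Most of the clean-up steps above are plain text processing, so a rough sketch may help. Everything here is an assumption for illustration: the function names, the rule that a chapter title is an all-caps line directly below "CHAPTER VI", the exact-line match for the work's title, and the list of refusal phrases. None of it is taken from QuickTranscribe itself.

```python
import re

# "CHAPTER VI" or "CHAPTER VI." alone on a line (roman numerals only).
CHAPTER_RE = re.compile(r"^CHAPTER [IVXLCDM]+\.?$")

# Hypothetical phrases that suggest GPT refused instead of transcribing.
REFUSAL_HINTS = ("sorry, i can't", "sorry, i cannot", "i can't help", "i'm unable to")


def strip_preamble(text: str) -> str:
    """Drop everything up to and including the first line that is just '---'."""
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if line.strip() == "---":
            return "\n".join(lines[i + 1:]).lstrip("\n")
    return text  # no separator found, leave the page alone


def mark_headings(text: str, work_title: str) -> str:
    """Replace the work's title with /ti/ and chapter headings with /ch/."""
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line and line.upper() == work_title.upper():
            out.append("/ti/")
        elif CHAPTER_RE.match(line):
            nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
            # Assumption: an all-caps line right below is the chapter title.
            if nxt and nxt == nxt.upper() and any(c.isalpha() for c in nxt):
                out.append("/ch//" + nxt)
                i += 1  # the title line has been consumed
            else:
                out.append("/ch/")
        else:
            out.append(lines[i])
        i += 1
    return "\n".join(out)


def looks_like_refusal(text: str) -> bool:
    """Flag pages whose opening text resembles a refusal message."""
    head = text.strip().lower()[:200].replace("\u2019", "'")
    return any(hint in head for hint in REFUSAL_HINTS)
```

Running `strip_preamble` and then `mark_headings` on a page that begins with a GPT preamble, a "---" line, and "CHAPTER VI" followed by "THE TITLE OF THIS CHAPTER" gives a page starting with "/ch//THE TITLE OF THIS CHAPTER". Pages flagged by `looks_like_refusal` would still go to a human for review at the end of the scrape, since dialogue in a novel could trip the same phrases. A page that legitimately contains a "---" line would also need a stricter rule than the one sketched here.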
Advertisement pages
- We really need some good models for ad pages to pump out, especially those from Grosset & Dunlap, since they appear to be redone in a lot of their novels (maybe templates are called for?). Perhaps the AI can generate these, too.
I, the copyright holder of this work, hereby release it into the public domain. This applies worldwide.
In case this is not legally possible:
I grant anyone the right to use this work for any purpose, without any conditions, unless such conditions are required by law.