User:Alien333/poemise+
User:Alien333/poemise+.js redoes what User:Alien333/poemise.js does, but much better, and with less human intervention. However, it makes numerous assumptions about a work, among which:
- the body contains nothing other than one poem (with maybe a title)
- there are no big non-text things
- lines are approximately 1.7em tall
- the lines are spaced enough vertically to be distinct when the image is scaled to be 150px wide
- titles are separated from the body by something at least as large as a stanza break
- title pages have a blank at the top (not counting headers) wider than 22%, and only these pages have such breaks
- only at the end of poems are there breaks at the bottom wider than 22%
(It can still give okayish results even if one or two of these are false, depending on which.)
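To make the two 22% assumptions concrete, here is a minimal sketch of how such a check could look. It is not taken from poemise+.js: the function names are made up, and it assumes that "22%" means 22% of the page height once headers are excluded, measured on a per-row "does this row contain text" array.

```javascript
// Hypothetical helper, not the script's actual code.
// rowHasText: one boolean per pixel row of the page image (headers excluded),
// true when the row contains at least one text pixel.
function blankFractions(rowHasText) {
  const n = rowHasText.length;
  if (n === 0) return { top: 0, bottom: 0 };
  let top = 0;
  while (top < n && !rowHasText[top]) top++;
  let bottom = 0;
  // Stop before re-counting rows already counted as top blank.
  while (bottom < n - top && !rowHasText[n - 1 - bottom]) bottom++;
  return { top: top / n, bottom: bottom / n };
}

// The 22% figure from the list above (assumed to be a share of page height).
const SIGNIFICANT_BREAK = 0.22;

function classifyPage(rowHasText) {
  const { top, bottom } = blankFractions(rowHasText);
  return {
    startsPoem: top > SIGNIFICANT_BREAK,  // blank at the top: title page
    endsPoem: bottom > SIGNIFICANT_BREAK, // blank at the bottom: end of poem
  };
}
```

For instance, a 100-row page whose first 30 rows are empty would be treated as a title page, since 30% is above the threshold.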
Use
On your end, if you think it might work, what you should do is:
- correct the OCR
- leave the lines wrapped as they are in the image (if a line longer than three characters should be unwrapped, but doesn't start with a lowercase letter, add a ^ at the beginning, which will be removed)
- if there is a header, add it in the header field. Do the same for the footer
- press ctrl-meta-l if you've got User:Alien333/cuts.js, else put window.poemiseplus() in the console. Yeah, not very user-friendly yet.
This has a tendency to find titles too often; if it does, add a $ somewhere in the text and it will assume it isn't a title.
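A rough sketch of how the ^ and $ markers could be read, purely as illustration. It is not the script's code, and it rests on assumptions: that a line starting with a lowercase letter is treated as the continuation of the previous line, that ^ forces the same treatment and is then dropped, and that a $ anywhere vetoes title detection. The three-character condition is not modelled, and the names unwrapLines and mayBeTitle are made up.

```javascript
// Hypothetical sketch of the marker conventions; not poemise+.js itself.
function unwrapLines(lines) {
  const out = [];
  for (const raw of lines) {
    const forced = raw.startsWith("^");
    const line = forced ? raw.slice(1) : raw; // the ^ marker is removed
    // Assumption: a lowercase first letter means "wrapped continuation".
    const continues = forced || /^\p{Ll}/u.test(line);
    if (continues && out.length > 0) {
      out[out.length - 1] += " " + line;
    } else {
      out.push(line);
    }
  }
  return out;
}

// Assumption: a "$" anywhere in the candidate text means "not a title".
function mayBeTitle(text) {
  return !text.includes("$");
}
```

With this reading, a wrapped continuation that begins with a capital letter, such as a proper noun, is exactly the case where the ^ is needed.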
How this works
The key point is image manipulation: classifying pixels as text pixels or background pixels based on their lightness.
After that, it's just a terrible lot of experimenting, applied statistics, and knowledge of typographical conventions.
The code is extensively documented; look at it for details.
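As a minimal sketch of the lightness idea (not the actual implementation): it assumes the threshold is the mean lightness multiplied by the 0.9 factor mentioned in the TODO section, and that lines are then found as runs of pixel rows containing text. The function names and the flat-array image layout are illustrative choices.

```javascript
// Hypothetical sketch; names and data layout are assumptions.
// lightness: flat, row-major array of 0..255 values.
function classifyPixels(lightness, factor = 0.9) {
  let sum = 0;
  for (const v of lightness) sum += v;
  // Mean lightness sits inside the background spike; scaling it down
  // moves the cut-off just below that spike.
  const threshold = factor * (sum / lightness.length);
  return { threshold, isText: Array.from(lightness, (v) => v < threshold) };
}

// Group consecutive rows that contain text into [top, bottom] bands.
function findLineBands(isText, width) {
  const height = isText.length / width;
  const bands = [];
  let start = -1;
  for (let y = 0; y < height; y++) {
    let rowHasText = false;
    for (let x = 0; x < width; x++) {
      if (isText[y * width + x]) { rowHasText = true; break; }
    }
    if (rowHasText && start < 0) start = y;
    if (!rowHasText && start >= 0) { bands.push([start, y - 1]); start = -1; }
  }
  if (start >= 0) bands.push([start, height - 1]);
  return bands;
}
```

In a browser, the lightness array could be derived from canvas pixel data; here it is kept as a plain array so the sketch runs anywhere.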
Why this exists
The marvelous thing with this is that it greatly reduces human intervention. A good example (taken from Page:Passion Flowers (Watson).djvu/82):
from
Header (noinclude):
{{ph|Passion Flowers.}}
Page body (to be transcluded):
A Confession.
Say, what doth it profit, my soul, my soul,
That I weep and cry as I longing wait?
Alas! the most worthless of earthly things
Is repentance, my soul, when it comes too late.
I loved him? Yes, I will swear it now,
With a madness never confessed nor told;
I loved him, and yet for a triumph small,
His heart I broke—his honor I sold.
Could I draw near to his distant place,
Where he might know each passionate tear,
And the anguished cry of my tortured soul,
I would rend the heavens, but he should hear.
"Oh! Love," I would cry; "Forgive, forgive!"
If he answered, then I could bear my fate;
But, ah! the most hopeless of earthly things
Is repentance, my soul, when it comes too late.
it can make near-instantly
Header (noinclude):
{{ph|Passion Flowers.}}
Page body (to be transcluded):
{{tpp|A Confession.|
Say, what doth it profit, my soul, my soul,
:That I weep and cry as I longing wait?
Alas! the most worthless of earthly things
:Is repentance, my soul, when it comes too late.
I loved him? Yes, I will swear it now,
:With a madness never confessed nor told;
I loved him, and yet for a triumph small,
:His heart I broke—his honor I sold.
Could I draw near to his distant place,
:Where he might know each passionate tear,
And the anguished cry of my tortured soul,
:I would rend the heavens, but he should hear.
"Oh! Love," I would cry; "Forgive, forgive!"
:If he answered, then I could bear my fate;
But, ah! the most hopeless of earthly things
:Is repentance, my soul, when it comes too late.
}}
which gives the correct output of
Passion Flowers.
A Confession.
Say, what doth it profit, my soul, my soul,
  That I weep and cry as I longing wait?
Alas! the most worthless of earthly things
  Is repentance, my soul, when it comes too late.

I loved him? Yes, I will swear it now,
  With a madness never confessed nor told;
I loved him, and yet for a triumph small,
  His heart I broke—his honor I sold.

Could I draw near to his distant place,
  Where he might know each passionate tear,
And the anguished cry of my tortured soul,
  I would rend the heavens, but he should hear.

"Oh! Love," I would cry; "Forgive, forgive!"
  If he answered, then I could bear my fate;
But, ah! the most hopeless of earthly things
  Is repentance, my soul, when it comes too late.
Proofreading, at its core, is correcting OCR and occasionally adding a few formatting templates. Poetry proofreading has been, for too long, an overly technical and time-consuming endeavor. This is my furthest advance in reducing poetry proofreading to no more work than novel proofreading.
TODO
- [not sure] Reform the line demerging tests. The current way is unreliable when we reach 30 l/p. An idea, which sadly will require some factorisation and repetition:
  - try without demerging
  - if you've got more l than s, capitulate
  - if you've got more s than l, split the largest of l, rinse and repeat
- The Crusade Against Specific, Experimental Values (CASEV). Should pull these from a few user-made pages.
  - Experimental values that might differ from work to work, and should be eliminated, are:
    - 1.7 ems per line.
      - It's not easy data to acquire from finished pages, though. Maybe make ppe through other stuff? Need to be imaginative here.
    - the 22% for a break to be considered significant. Depending on the work's typography, this can be a pest.
      - Take inspiration from what poemise currently does, i.e. take an example of a "normal break" and use 1.8 (EVA) times that
    - the 5 for edge noise.
  - Those that are specific but that should work in most cases, or that don't have a better replacement:
    - the 0.9 in m. This is from applied statistics, which is a fancy way of saying "in most cases this gives better results".
      - I know why this works, for once. I've made the graph, and basically you've got a low scatter of dark pixels, and then a ginormous spike in the 200-220 range. The average would then be right in the middle of the background spike. For some arcane reason, 0.9 precisely brings it just to the left of that.
    - eroding 6 times.
      - gives better results than lower (risk of missing short lines) or greater (risk of counting spots & co)
    - one-pixel rules.
      - may break with extremely small fsize, but in general, with these ~200px-tall images, this should never catch text.
    - +1 in mbr.
      - The issue here is that we're kind of assuming that there'll be more non-stanza breaks than stanza breaks, and so moving the average one pixel up won't hurt anyone and will better distinguish the two. For poems with one-line stanzas, it will mess up. But when you think about it in general, if all lines have the same spacing, no matter what happens, it's going to assume that either everything is a stanza break or nothing is.
  - The width 150 doesn't count.
    - If we take less, then lines can be one pixel high; if we take more, it takes too long to process.
  - Should think of dot size.
    - Currently a dot is a patch of black fitting in a 3*3 square with a 3-px margin of white around it. Larger? Smaller? IDK. Possibly a CASEV issue.
- On partial page sections: after is mostly fine (just ignore everything past the sign); before is a tiny bit more complicated (essentially the same, but bottom-up).
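The demerging idea from the first TODO item could be sketched roughly as follows. This is only a guess at the intent, under loud assumptions: "l" is taken to be the line bands detected in the image, "s" the number of lines in the source text, and "split the largest" is read as cutting the tallest band in half. None of the names below come from the script.

```javascript
// Hypothetical sketch of the proposed demerging loop; not existing code.
// bands: array of [top, bottom] pixel rows for each detected line ("l", assumed).
// sourceLineCount: number of lines in the transcribed text ("s", assumed).
// Returns the adjusted bands, or null when giving up.
function demerge(bands, sourceLineCount) {
  const result = bands.map((b) => b.slice()); // step 1: try without demerging
  const height = ([top, bottom]) => bottom - top + 1;
  while (result.length !== sourceLineCount) {
    // More detected lines than source lines (or nothing to split): capitulate.
    if (result.length > sourceLineCount || result.length === 0) return null;
    // Fewer detected lines than source lines: split the tallest band.
    let tallest = 0;
    for (let i = 1; i < result.length; i++) {
      if (height(result[i]) > height(result[tallest])) tallest = i;
    }
    const [top, bottom] = result[tallest];
    if (bottom <= top) return null; // a one-pixel band cannot be split
    const mid = Math.floor((top + bottom) / 2); // assumption: split in half
    result.splice(tallest, 1, [top, mid], [mid + 1, bottom]);
  }
  return result;
}
```

Splitting in half is the crudest possible choice; looking for the lightest row inside the band would probably be a better cut point, at the cost of needing the pixel data here too.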