User:Inductiveload/Tesseract
Image pre-processing
Removing small specks can have a major effect on the OCR quality:
- phab:F34558171 → pl’fint the pen,
- phab:F34558172 → point the pen,
An image pre-processing step that removes these specks could substantially improve OCR performance.
Removing islands
Simple function to remove unconnected islands under a certain area with OpenCV. Expects a white-on-black binary image:
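import cv2 as cv
import numpy as np

def remove_islands(img, min_area):
    # Label the 8-connected white components; label 0 is the background
    nlabels, labels, stats, centroids = cv.connectedComponentsWithStats(
        img, None, None, None, 8, cv.CV_32S)
    # Areas of the non-background components (index i is label i + 1)
    areas = stats[1:, cv.CC_STAT_AREA]
    print(areas)  # debug output
    result = np.zeros(labels.shape, np.uint8)
    for i in range(0, nlabels - 1):
        if areas[i] >= min_area:  # keep
            result[labels == i + 1] = 255
    return result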
The island size needs to be carefully chosen to avoid deleting things like colons and dots of i's.
By inverting the image, you can also delete small white specks in letters, though these do not seem to be as lethal to the OCR as black specks.
Fonts
18th-century text is often printed using either w:Caslon (the original) or something very like it. It usually has more ligatures than modern fonts.
A derivative of Adobe Caslon Pro may be possible.
Notable changes:
- Much tighter kerning after a long-s (ſ) in the regular font (italic already kerned well)
- Bar on t reduced in length (modern fonts have made the bar more obvious; with the shorter bar of the old type, t's are easily mistaken for i's or l's)
- Less prominent serifs on r
- More space before :;!?
- Heavier top serifs on u, to try to avoid it being mistaken for o so often
- Variants (in PUA at U+E100):
  - Higher bar on 'e' (option ss01, since this is not always true); otherwise e → c errors
  - i, j with a missing dot (option ss02)
  - t with truncated bar (option ss03)
To try:
- Variant chars:
  - Add glyphs representing more damaged forms to the font to prevent overfitting of the model (the model becomes too "fixated" on the perfect form of the 't'). Probably put them as an ssXX font feature.
    - E.g. t with a truncated top is mistaken for i, r or c
    - e with a light centre line → c
    - i with a heavy dot → r
Generate the ground-truth data
Construct "clean" text for the fonts, variants, styles, etc. that you want:
model: eng_oldcaslon_longs
text:
  dir: corpus/eng_longs
fonts:
  - face: Old Caslon
    sizes:
      - 25
    variants:
      regular: {}
      italic:
        italic: true
      smallcaps:
        smallcaps: true
        ratio: 0.1
    features:
      - features:
          - ss01
        rate: 0.05
      - features:
          - ss02
        rate: 0.005
      - features:
          - ss03
        rate: 0.005
process:
  - noise: 0.2
    erode: 3
  - noise: 0.3
    erode: 2
include_clean: false
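How generate.py interprets the rate values is not shown here. As a rough illustration only, assuming each feature entry is switched on independently per rendered line with probability rate, the selection could look like this (FEATURE_RULES and pick_features are made-up names, not part of the real script):

```python
import random

# Hypothetical mirror of the "features" entries in the config above.
FEATURE_RULES = [
    {"features": ["ss01"], "rate": 0.05},
    {"features": ["ss02"], "rate": 0.005},
    {"features": ["ss03"], "rate": 0.005},
]


def pick_features(rules, rng):
    """Return the font features to enable for one rendered line.

    Assumption: each rule fires independently with probability `rate`.
    """
    chosen = []
    for rule in rules:
        if rng.random() < rule["rate"]:
            chosen.extend(rule["features"])
    return chosen


rng = random.Random(0)  # fixed seed so a run is repeatable
counts = {"ss01": 0, "ss02": 0, "ss03": 0}
for _ in range(10_000):
    for feature in pick_features(FEATURE_RULES, rng):
        counts[feature] += 1
print(counts)  # roughly 5% of lines get ss01, about 0.5% each get ss02/ss03
```

Keeping the rates low means the variant glyphs show up often enough for the model to see them, without displacing the regular forms.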
- Generate the images and output to the tesstrain data directory
./generate.py -c configs/eng_oldcaslon_longs.yml -o ~/src/tesstrain/data -m eng_oldcaslon_longs
Once you have ground truth data
- Set your model name in the shell (match the model name used above)
export MODEL_NAME=eng_oldcaslon_longs
- Train the model
  - This will take a long time (hours, if you set a high MAX_ITERATIONS); go and proofread something
  - 20000 iterations seems to work OK; after that, overfitting seems more likely than improvement (0.2% error seems to be around the lower limit for now)
make training MODEL_NAME=$MODEL_NAME START_MODEL=eng TESSDATA=~/src/tessdata_best
First it will read the training files and set up lstm and box files, and you will see thousands of lines like this:
Tesseract Open Source OCR Engine v5.0.0-alpha-20210401-158-ge1761 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/eng_oldprint-ground-truth/agrippa-occult.00420.png" -t "data/eng_oldprint-ground-truth/agrippa-occult.00420.gt.txt" > "data/eng_oldprint-ground-truth/agrippa-occult.00420.box"
+ tesseract data/eng_oldprint-ground-truth/agrippa-occult.00420.png data/eng_oldprint-ground-truth/agrippa-occult.00420 --psm 13 lstm.train
Then, it will start generating training output, and you will see the errors start to decrease.
At iteration 2132/30400/30400, Mean rms=0.148000%, delta=0.023000%, char train=0.071000%, word train=0.109000%, skip ratio=0.000000%, New worst char error = 0.071000 wrote checkpoint.
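To watch the error over a long run, progress lines like the one above can be picked out of the training output. A minimal sketch, assuming the lines keep exactly this layout (the regular expression is fitted to the sample line only, and parse_progress is a made-up name):

```python
import re

LOG_LINE = (
    "At iteration 2132/30400/30400, Mean rms=0.148000%, delta=0.023000%, "
    "char train=0.071000%, word train=0.109000%, skip ratio=0.000000%, "
    "New worst char error = 0.071000 wrote checkpoint."
)

# Captures the three iteration counters and the two training error rates.
PATTERN = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+),.*?"
    r"char train=([\d.]+)%, word train=([\d.]+)%"
)


def parse_progress(line):
    """Return (first_counter, char_error_percent, word_error_percent) or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(4)), float(match.group(5))


print(parse_progress(LOG_LINE))  # (2132, 0.071, 0.109)
```

Lines that do not match (such as the box-generation output earlier) return None and can simply be skipped.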
At this point you can take any recent checkpoint file (one is generated every time the result gets 2% "better") for testing:
- Create and copy the most recent .traineddata for use
make traineddata CHECKPOINT_FILES="$(ls -t data/$MODEL_NAME/checkpoints/*.checkpoint | head -1)" MODEL_NAME=$MODEL_NAME TESSDATA=~/src/tessdata_best
cp $(ls -t data/$MODEL_NAME/tessdata_best/*.traineddata | head -1) ~/.local/share/tessdata/$MODEL_NAME.traineddata
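The "ls -t ... | head -1" idiom above just picks the most recently modified file. If this step is ever scripted from Python instead of the shell, an equivalent could be sketched like this (newest_file is a made-up helper name; the directory layout is the one used in the commands above):

```python
from pathlib import Path


def newest_file(directory, pattern):
    """Return the most recently modified file matching pattern, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    # Same ordering as `ls -t`: sort by modification time, newest first.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example call, assuming the tesstrain layout used above:
# newest_file("data/eng_oldcaslon_longs/checkpoints", "*.checkpoint")
```

Unlike the shell version, this returns None rather than an empty string when no checkpoint exists yet, which is easier to check for.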
- Use it!
tesseract --tessdata-dir ~/.local/share/tessdata -l $MODEL_NAME image.jpg -
- When it's done, the .traineddata is ready
cp data/$MODEL_NAME.traineddata ~/.local/share/tessdata/$MODEL_NAME.traineddata
- Continue training from that point (may need to increase MAX_ITERATIONS).
  - Beware that too much training on too little source data leads to overfitting: while the model may get better at the GT images, it gets less able to handle real-life images that are not quite the same.
make training MODEL_NAME=$MODEL_NAME START_MODEL=$MODEL_NAME TESSDATA=data MAX_ITERATIONS=50000
Generate evaluation text
This can also be used to generate training data (but you will need a lot of it).
- Generate a HOCR file of the image - using the model in question (hopefully!) gets you pretty close
tesseract /tmp/theimage.jpg /tmp/hocr --tessdata-dir ~/.local/share/tessdata -l eng_oldcaslon_longs hocr
- Extract the HOCR file to image/text pairs (/tmp is where the image is).
hocr-extract-images -b /tmp /tmp/hocr.hocr theimage-%03d.png
- Correct the text lines as needed.
- This is a pain and really needs some kind of a tool to help
- Copy/emplace the evaluation ground truths
  - Remember the text files need to end in .gt.txt, not just .txt
cp -r your_images ~/src/tesstrain/data/eval_$MODEL_NAME
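The renaming to .gt.txt is easy to forget, so it may be worth doing with a small script before the copy. A sketch, assuming the extracted pairs are named like theimage-001.png and theimage-001.txt in one directory (fix_gt_names is a made-up helper):

```python
from pathlib import Path


def fix_gt_names(directory, dry_run=True):
    """Rename foo.txt to foo.gt.txt wherever a matching foo.png exists.

    Returns the list of (old_path, new_path) pairs. With dry_run=True
    nothing is renamed, so the plan can be checked first.
    """
    renames = []
    for txt in sorted(Path(directory).glob("*.txt")):
        if txt.name.endswith(".gt.txt"):
            continue  # already named correctly
        if not txt.with_suffix(".png").exists():
            continue  # no matching image; leave it alone
        target = txt.with_name(txt.stem + ".gt.txt")
        if target.exists():
            continue  # do not overwrite an existing ground truth
        renames.append((txt, target))
        if not dry_run:
            txt.rename(target)
    return renames
```

Running it once with the default dry_run=True and reading the returned pairs is a cheap way to catch stray text files before anything is renamed.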
- Generate .lstmf files
  - This also generates an all-lstmf file, which lists all the .lstmf files.
make lists MODEL_NAME=eval_$MODEL_NAME
- Evaluate the model against those files
lstmeval --model "data/${MODEL_NAME}.traineddata" --eval_listfile "data/eval_${MODEL_NAME}/all-lstmf"
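For a quick sanity check on a single line outside lstmeval, a character error rate can be computed by hand from the OCR output and the corrected text. This is a plain Levenshtein-distance sketch and not necessarily how lstmeval arrives at its own figures; the function names are made up:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character of a
                current[j - 1] + 1,      # insert a character of b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def char_error_rate(ocr_text, truth):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, truth) / len(truth)


# One wrong letter in a six-letter word:
print(char_error_rate("ſocver", "ſoever"))
```

Python strings are Unicode, so the long-s counts as a single character and needs no special handling.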
Progress so far
- Long-s usually recognised
- Some confusion between italic h and b
- Occasional mistaken t → c/r
58 Terræ-F1lius. n" x1, founded upon this politick ſuppoſition, that when they had got a new Frmcng houſe, they could ne- ver want new books; but by what means ſocver it was bu't, my lord Clarendon has the honour, and we, his happy poſlcrity, the invaluable beneſic of it, I ſhould think it an undertaking well worthy the laborious Mr. Hearne, to give the world an ac- count, from year to year, of the many incompa- rable tomes, which iſſue from that illuſtrious preſs. This, I apprehend, would do great honour to the univerſity, and to its leamed authors, ſince the cata- logue would not be crouded with any of thoſe he- retical, pernicious, and free-thinking tracts, which are the noiſom ſpawn of other modern preſſes: we ſhould ſind there no ill.meaning Eſſays upon human Underſtandmg, no Oceana's, no Hypotheſes of Liber- ty, no deſcants upon Original Contracls, nor en- quiries into the Stare of Nature, no Appeals to the Laity and common Senſe in matters of religion, no vindications of Conſcience and privare Judgment, no defences of Reſiſtance in any poſſible caſes, no apologies for the Revolution, and the preſent Go- vernment, &c. to ſully the Academical Types, and reproach the ſclemn Imprimatur of the univerſity ——New, accurate Editions of primitive Fathers, and antient Chronicles, or modern ſermons, and long ſyſturas of Logick, Metaphyſicks, and School-divinity are the ſolid productions of this auguſt Typographa- um————Such are the effects, and ſuch the advan- tages of reſtraining the lrcence of the preſs! How would letters flouriſh? how would arts revive? bow would religiou lift up her awful front? and how wculd the church rejoyce, if ſuch a whole- ſome check were put upon the preſs throughout the world ? l But Printixg is not the only, not the principal uſe, tar which theſe ſupendous ſtone-walls weie erected 3
Links
- GT4HistOCR: ground truth of Fraktur https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR