Mining from manga
When reading manga, you will sometimes need to quickly OCR a portion of the screen to look up new words and add sentences to Anki. To do so, you're going to use an optical character recognition program and a few helper tools.
Setting up OCR
Install the following dependencies:
$ sudo pacman -S --needed sxiv maim tesseract xclip imagemagick unzip
- sxiv is an excellent image viewer. For this setup you can replace it with any image viewer, but sxiv is what I use.
- tesseract is the OCR engine. It is considered fairly accurate, and many people like it.
- maim is a screenshot utility that can capture a selected part of the screen.
- xclip is a tool for copying text to clipboard.
- imagemagick is a command-line image editor. It's going to come in handy for editing the screenshots before Tesseract analyzes them.
- unzip is a tool for extracting zip archives.
Download maimocr and save it as ~/.local/bin/maimocr. maimocr is a script we are going to use to recognize Japanese text.
Make the file executable:
$ chmod +x ~/.local/bin/maimocr
The directory ~/.local/bin should be in your PATH.
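If you are not sure whether ~/.local/bin is already in your PATH, a small check like the one below can tell you. This is a minimal sketch: the function name path_contains is made up for illustration, and the export only affects the current shell.

```shell
# Check whether a directory is listed in a PATH-style string.
# Usage: path_contains DIR [PATH_STRING]   (PATH_STRING defaults to $PATH)
# path_contains is a hypothetical helper name, not part of any tool above.
path_contains() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Add ~/.local/bin for the current shell only if it is missing.
if ! path_contains "$HOME/.local/bin"; then
    export PATH="$HOME/.local/bin:$PATH"
fi
```

To make the change permanent, put the export line in your shell's startup file, for example ~/.profile.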
Bind this script to any key in your DE, WM, sxhkd, xbindkeys, etc. Here's an example for i3wm:
bindsym $mod+o exec --no-startup-id maimocr
Now you can quickly call maimocr anywhere by pressing the keyboard shortcut.
Usage
Tesseract doesn't work without trained data files. These files tell Tesseract how to read and recognize text from images. When you first run maimocr, it should download Japanese data files automatically. Run it from a terminal the first time and check the output to see whether the download succeeded.
When you run it the second time, maimocr will ask you to select an area with Japanese text and try to OCR it. The resulting text will be saved to the system clipboard.
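To give an idea of how the tools listed earlier fit together, here is a rough sketch of a select, preprocess, OCR, copy pipeline. It is not the actual maimocr script: the function names, the 200% resize, the grayscale step, the jpn language code and the whitespace cleanup are all assumptions made for illustration.

```shell
# Hypothetical sketch of a maimocr-style pipeline; NOT the real script.
# Assumes jpn.traineddata is present in the directory below.
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/.local/share/tessdata}"

# OCR output for Japanese can contain stray spaces and line breaks between
# characters; delete ASCII whitespace so the text pastes cleanly.
clean_ocr_text() {
    tr -d ' \t\r\n'
}

# Select a region, enlarge and desaturate it, OCR it, copy the text.
# On ImageMagick 6 the command is called `convert` instead of `magick`.
ocr_selection() {
    maim -s \
        | magick - -colorspace Gray -resize 200% png:- \
        | tesseract stdin stdout -l jpn --tessdata-dir "$TESSDATA_DIR" 2>/dev/null \
        | clean_ocr_text \
        | xclip -selection clipboard
}
```

Calling ocr_selection starts the region selection. Manga text is usually vertical, so swapping jpn for jpn_vert may give better results if that data file is present in the tessdata folder.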
Use it in combination with Yomichan Search to quickly look up Japanese words in real time. To open Yomichan Search, open your web browser and press Alt+Insert. Yomichan should already be installed.
Video demonstration.
Expanding data set
By default, maimocr automatically downloads tessdata.zip (mirror) with Tesseract data files, then saves the files to ~/.local/share/tessdata.
To use additional data files with maimocr, copy any new *.traineddata files to ~/.local/share/tessdata.
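Copying the files can be scripted. The helper below is only a sketch: install_traineddata is a made-up name, and the source directory is whatever folder you downloaded the files to.

```shell
# Copy every *.traineddata file from a source directory into the tessdata
# directory and print how many files were copied.
# install_traineddata is a hypothetical helper, not part of maimocr.
# Usage: install_traineddata SRC_DIR [DEST_DIR]
install_traineddata() {
    src=$1
    dest=${2:-$HOME/.local/share/tessdata}
    mkdir -p "$dest" || return 1
    found=0
    for f in "$src"/*.traineddata; do
        [ -e "$f" ] || continue   # the glob matched nothing
        cp -- "$f" "$dest"/ && found=$((found + 1))
    done
    echo "$found"
}
```

For example, install_traineddata ~/Downloads copies the files into the default location. Afterwards, tesseract --tessdata-dir ~/.local/share/tessdata --list-langs should list the newly added languages.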
Capture2Text files
These instructions are no longer necessary. The files are included by default.
Download Capture2Text. We won't need the program itself because it's garbage, but the trained data files are going to be useful. Extract the contents of the tessdata folder to ~/.local/share/tessdata:
$ unzip -j Capture2Text_v*_64bit.zip 'Capture2Text/tessdata/*' -d ~/.local/share/tessdata
Alternatively, download just the Capture2Text Japanese files from here.
Contents of the ZIP archive.
Troubleshooting
If you notice that the script fails to OCR certain images, try to zoom in or find a scan with a better resolution. Tesseract works poorly at low resolutions.
Nonstandard fonts often fail to OCR properly. I don't have a definitive answer for this case at the moment. Try searching for more *.traineddata files online and adding them to the tessdata folder.
Adding screenshots
If you want to add a screenshot from a manga to your Anki card, maim can do that too. maimpick is a script that uses maim to screenshot parts of the screen and copy them to the clipboard. Install it to the same location as maimocr, make it executable and bind it to a key. In addition to maim, maimpick requires dmenu and xdotool to work.
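Before binding the script to a key, it may be worth checking that the required programs are installed. The sketch below uses two made-up function names: missing_deps prints the commands it cannot find, and screenshot_to_clipboard shows roughly what the core of such a script might look like. It is not the real maimpick, and the use of xclip for the clipboard step is an assumption.

```shell
# Print the name of each command that is not available on PATH.
# missing_deps is a hypothetical helper name.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Hypothetical stand-in for the core of a maimpick-style script:
# select a region and put the resulting PNG on the clipboard.
screenshot_to_clipboard() {
    maim -s | xclip -selection clipboard -t image/png
}
```

Running missing_deps maim xclip dmenu xdotool prints nothing when everything is installed.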
Note: ames is another program that can add screenshots to Anki.
Other software
- kanjitomo. It's quite bloated and forces you to use a Japanese to English dictionary instead of a Japanese to Japanese one.
- manga-ocr. Can be used to OCR Japanese text instead of Tesseract. Unfortunately, I haven't been able to install it and can't comment on it.
Tags: guide