Mining from manga

April 01, 2021 — Tatsumoto

When reading manga, you sometimes need to quickly OCR a portion of the screen to look up new words and add sentences to Anki. To do so, you're going to use an optical character recognition program and a few helper tools.


Setting up OCR

Install the following dependencies:

$ sudo pacman -S --needed sxiv maim tesseract xclip imagemagick unzip
  • sxiv is an excellent image viewer. For this setup you can replace it with any image viewer, but sxiv is what I use.
  • tesseract is the OCR engine. It is widely used and reasonably accurate on clean, high-resolution text.
  • maim is a screenshot utility that can capture a selected part of the screen.
  • xclip is a tool for copying text to the clipboard.
  • imagemagick is a command-line image editor. It's going to come in handy for editing the screenshots before Tesseract analyzes them.
  • unzip is a tool for extracting zip archives.

Download maimocr and save it as ~/.local/bin/maimocr. maimocr is a script we are going to use to recognize Japanese text.

Make the file executable:

$ chmod +x ~/.local/bin/maimocr

Make sure the directory ~/.local/bin is in your PATH. Otherwise the shell won't find maimocr by name.
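If you're not sure whether ~/.local/bin is in your PATH, a small check like the one below can tell you. This is only a sketch: the function name in_path is made up for illustration, and the export line affects the current session only.

```shell
# Return success if the given directory is listed in PATH.
# Pure shell, no external tools. "in_path" is a hypothetical helper name.
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if in_path "$HOME/.local/bin"; then
    echo "$HOME/.local/bin is already in PATH"
else
    # Applies to the current shell only. To make it permanent, put the same
    # export line into ~/.profile or your shell's startup file.
    export PATH="$HOME/.local/bin:$PATH"
    echo "added $HOME/.local/bin to PATH for this session"
fi
```

After this runs, typing maimocr in the same shell should find the script.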

Bind the script to any key using your DE, WM, sxhkd, xbindkeys, etc. Here's an example for i3wm:

bindsym $mod+o exec --no-startup-id maimocr

Now you can quickly call maimocr anywhere by pressing the keyboard shortcut.

Usage

Tesseract doesn't work without trained data files. These files tell Tesseract how to read and recognize text from images. When you first run maimocr, it should download Japanese data files automatically. Check the terminal output to see if the process succeeds.

When you run it the second time, maimocr will ask you to select an area with Japanese text and try to OCR it. The resulting text will be saved to the system clipboard. Use it in combination with Yomichan Search to quickly look up Japanese words in real time.
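To give an idea of what happens under the hood, here is a rough sketch of a select, OCR, copy pipeline built from the tools installed above. It is not the actual maimocr script. The function names, the jpn language code, and the whitespace cleanup step are my assumptions, and mktemp --suffix assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch of a select -> OCR -> clipboard pipeline. NOT the real maimocr.

# Assumed location of the data files (the directory maimocr saves them to).
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/.local/share/tessdata}"

# Tesseract often puts spaces and line breaks between Japanese characters.
# Strip them so the text can be pasted into a dictionary search as-is.
clean_ocr_text() {
    tr -d ' \n'
}

# Let the user select a region, OCR it, and put the text on the clipboard.
ocr_region() {
    img=$(mktemp --suffix=.png) || return 1
    maim --select --hidecursor "$img" &&
        tesseract "$img" stdout -l jpn --tessdata-dir "$TESSDATA_DIR" 2>/dev/null |
        clean_ocr_text |
        xclip -selection clipboard
    status=$?
    rm -f "$img"
    return "$status"
}

# Run the pipeline only when asked to, so the file can also be sourced.
if [ "${1:-}" = "run" ]; then
    ocr_region
fi
```

Saved as a file, it would be called with the argument run. For daily use, stick to maimocr itself.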

To open Yomichan Search, open your web browser and press Alt+Insert. Yomichan should already be installed.

Video demonstration.

Expanding data set

By default, maimocr automatically downloads tessdata.zip (mirror) with Tesseract data files, then saves the files to ~/.local/share/tessdata.

To use additional data files with maimocr, copy any new *.traineddata files to ~/.local/share/tessdata.
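Copying the files by hand works fine, but if you do it often, a helper along these lines may save some typing. The function name install_traineddata and the ~/Downloads example path are hypothetical.

```shell
# Target directory, as described above. Override by setting TESSDATA_DIR.
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/.local/share/tessdata}"

# Copy every *.traineddata file from the given directory into TESSDATA_DIR.
# "install_traineddata" is a hypothetical helper, not part of maimocr.
install_traineddata() {
    src_dir=$1
    mkdir -p "$TESSDATA_DIR" || return 1
    for f in "$src_dir"/*.traineddata; do
        [ -e "$f" ] || continue   # the glob matched nothing
        cp -- "$f" "$TESSDATA_DIR/" && echo "installed $(basename "$f")"
    done
}

# Example (hypothetical path):
#   install_traineddata ~/Downloads
# Afterwards, list the languages Tesseract can see in that directory:
#   tesseract --tessdata-dir "$TESSDATA_DIR" --list-langs
```

The --list-langs check is a quick way to confirm that Tesseract actually picks up the new files.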

Capture2Text files

These instructions are no longer necessary. The files are included by default.

Download Capture2Text. We won't need the program itself because it's garbage, but its trained data files are going to be useful. Extract the contents of the tessdata folder to ~/.local/share/tessdata:

$ unzip -j Capture2Text_v*_64bit.zip 'Capture2Text/tessdata/*' -d ~/.local/share/tessdata

Alternatively, download just the Capture2Text Japanese files from here.

[Image: contents of the Capture2Text ZIP archive.]

Troubleshooting

If you notice that the script fails to OCR certain images, try zooming in or finding a scan with a higher resolution. Tesseract works poorly at low resolutions.
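If zooming in isn't an option, you can experiment with enlarging a saved image using ImageMagick before passing it to Tesseract. The sketch below is just that, an experiment. The 300% factor, the grayscale step, the jpn language code, and the function names are my assumptions, and maimocr may already do some preprocessing of its own.

```shell
# Assumed location of the data files.
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/.local/share/tessdata}"

# Derive the name of the enlarged copy, e.g. page.jpg -> page_upscaled.png
upscaled_name() {
    printf '%s_upscaled.png\n' "${1%.*}"
}

# Enlarge a saved image, convert it to grayscale, then OCR the result.
# "ocr_upscaled" is a hypothetical helper for experimenting.
ocr_upscaled() {
    src=$1
    dst=$(upscaled_name "$src")
    # ImageMagick 7 command. On ImageMagick 6 the binary is called "convert".
    magick "$src" -resize 300% -colorspace Gray "$dst" &&
        tesseract "$dst" stdout -l jpn --tessdata-dir "$TESSDATA_DIR"
}

# Example (hypothetical file):
#   ocr_upscaled ~/Pictures/panel.png
```

Try a few different resize factors and compare the output. What works depends on the scan.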

Nonstandard fonts often fail to OCR properly. I don't have a definitive fix for this at the moment. Try searching for more *.traineddata files online and adding them to the tessdata folder.

Adding screenshots

If you want to add a screenshot from a manga to your Anki card, maim can do that too. maimpick is a script that uses maim to screenshot parts of the screen and copy them to the clipboard. Install it to the same location as maimocr, make it executable and bind it to a key.

In addition to maim, maimpick requires dmenu and xdotool to work.
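Before binding maimpick to a key, it may be worth checking that everything it needs is installed. The sketch below does that and also shows the basic idea of copying a selected region to the clipboard as an image. It is not maimpick itself, and both function names are made up for illustration.

```shell
# Print the names of the given commands that are not installed.
# "missing_cmds" is a hypothetical helper.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Select a region with the mouse and copy it to the clipboard as a PNG.
# A simplified stand-in for what a screenshot script can do with maim.
screenshot_to_clipboard() {
    maim --select --hidecursor | xclip -selection clipboard -t image/png
}

missing=$(missing_cmds maim xclip dmenu xdotool)
if [ -n "$missing" ]; then
    echo "Install these first:"
    printf '%s\n' "$missing"
fi
```

Once screenshot_to_clipboard has run, the image should be available to paste into the Anki card editor.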

Note: ames is another program that can add screenshots to Anki.

Other software

  • kanjitomo. It's quite bloated and forces you to use a Japanese-to-English dictionary instead of a Japanese-to-Japanese one.
  • manga-ocr. Can be used to OCR Japanese text instead of Tesseract. Unfortunately, I haven't been able to install it and can't comment on it.
