Scanbot SDK for Android provides a simple and convenient API (`OpticalCharacterRecognizer`) to run Optical Character Recognition (OCR) on images.
As a result you get:
- a searchable PDF document with the recognized text layer (a.k.a. sandwiched PDF document);
- recognized text as plain text;
- bounding boxes of all recognized paragraphs, lines and words;
- text results and confidence values for each bounding box.
The Scanbot OCR feature is based on the Tesseract OCR engine with some modifications and enhancements.
The OCR engine supports a wide variety of languages. For each desired language a corresponding OCR training data file (`.traineddata`) must be provided. Furthermore, the special data file `osd.traineddata` is required; it is used for orientation and script detection.
The Scanbot SDK package contains no language data files so as to keep the SDK small in size. You have to download and include the desired language files in your app.
A perfect document for OCR is flat, straight, in the highest possible resolution and does not contain large shadows, folds, or any other objects that could distract the recognizer. Our UI and algorithms do their best to help you meet these requirements. But as in photography, you can never fully get the image information back that was lost during the shot.
You can use multiple languages for OCR. But since the recognition of characters and words is a very complicated process, increasing the number of languages lowers the overall precision. With more languages, there are more results that the detected word could match. We suggest using as few languages as possible. Make sure that the language you are trying to detect is supported by the SDK and added to the project.
Put the document on a flat surface. Take the photo from straight above, parallel to the document, so that only minimal perspective correction needs to be applied. The document should fill as much of the camera frame as possible while still showing all of the text that needs to be recognized. This results in more pixels for each character that needs to be detected and hence, more detail. Skewed pages decrease the recognition quality.
More ambient light is always better. The camera takes the shot at a lower ISO value, which results in less grainy photos. You should make sure that there are no visible shadows. If you have large shadows, it is better to take the shot at an angle instead. We also do not recommend using the flash: at this short distance it creates a bright spot at the center of the document, which decreases the recognition quality.
The document needs to be properly focused so that the characters are sharp and clear. The auto-focus of the camera works well if you meet the minimum required distance for the lens to be able to focus. This usually starts at 5-10cm.
The OCR trained data is optimized for common serif and sans-serif font types. Decorative or script fonts drastically decrease the quality of the recognition.
The OCR feature is provided in Scanbot SDK Package II. You have to add the corresponding dependency for Package II, `io.scanbot:sdk-package-2` or higher, in your `build.gradle` file.
Get the latest `$scanbotSdkVersion` from [[Release History]].
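Assuming a standard Gradle setup, the dependency declaration might look like the following sketch in your app module's `build.gradle`. The artifact coordinate is taken from above; replace the version placeholder with the latest `$scanbotSdkVersion`:

```groovy
dependencies {
    // Scanbot SDK Package II (includes the OCR feature)
    implementation "io.scanbot:sdk-package-2:$scanbotSdkVersion"
}
```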
You can find a list of all supported OCR languages and download links on the Tesseract project's data files page.
Please choose and download the proper version of the language data files:
- For Scanbot SDK version 1.50.0 or newer: LSTM Data Files for Version 4.00.
- For older Scanbot SDK versions (<= 1.41.0): Data Files for Version 3.04/3.05.
Download the files and place them in the assets sub-folder `assets/ocr_blobs/` of your app:

```
assets/ocr_blobs/osd.traineddata    // required special data file
assets/ocr_blobs/eng.traineddata    // English language file
assets/ocr_blobs/deu.traineddata    // German language file
```
In order to initialize the Scanbot SDK with the provided OCR data files, you have to call `prepareOCRLanguagesBlobs(true)` on initialization of the SDK:

```kotlin
import io.scanbot.sdk.ScanbotSDKInitializer

ScanbotSDKInitializer()
    .prepareOCRLanguagesBlobs(true)
    ...
    .initialize(this)
```
Then get an instance of the `OpticalCharacterRecognizer`:

```kotlin
import io.scanbot.sdk.ScanbotSDK
import io.scanbot.sdk.ocr.OpticalCharacterRecognizer

val ocrRecognizer = ScanbotSDK(this).createOcrRecognizer()
```
To achieve better OCR quality you can enable additional image binarization via `OcrSettings`:

```kotlin
ScanbotSDKInitializer()
    .useOcrSettings(OcrSettings.Builder().binarizeImage(true).build())
    ...
    .initialize(this)
```
You can run OCR on arbitrary image files (JPG or PNG) provided as file URIs:
```kotlin
import io.scanbot.sdk.process.PDFPageSize
import io.scanbot.sdk.entity.Language
import io.scanbot.sdk.ocr.process.OcrResult

val imageFileUris: List<Uri> = ... // ["file:///some/path/file1.jpg", "file:///some/path/file2.jpg", ...]
val languages = mutableSetOf<Language>()
languages.add(Language.ENG)

var result: OcrResult
// with PDF as result:
result = ocrRecognizer.recognizeTextWithPdfFromUris(imageFileUris, false, PDFPageSize.FIXED_A4, languages)
// without PDF:
result = ocrRecognizer.recognizeTextFromUris(imageFileUris, false, languages)
```
Since `OpticalCharacterRecognizer#recognizeTextWithPdfFromUris()` does not compress the input images under the hood, the resulting PDF file might become large. Make sure to compress the images before passing them to this method.
On RTU UI Pages

If you are using our RTU UI Components, you can use the corresponding methods to pass a list of RTU UI `Page` objects:
```kotlin
import io.scanbot.sdk.persistence.Page
import io.scanbot.sdk.process.PDFPageSize
import io.scanbot.sdk.entity.Language

val pages: List<Page> = ... // e.g. snap some pages via RTU UI DocumentScannerActivity
val languages = mutableSetOf<Language>()
languages.add(Language.DEU)

var result: OcrResult
// with PDF as result:
result = ocrRecognizer.recognizeTextWithPdfFromPages(pages, PDFPageSize.FIXED_A4, languages)
// without PDF:
result = ocrRecognizer.recognizeTextFromPages(pages, languages)
```
Please note: the `OpticalCharacterRecognizer` uses the document image (cropped image) of a `Page` object. Thus, make sure all `Page` objects contain document images.
When running OCR with PDF generation, the result object contains the searchable PDF document with the recognized text layer (a.k.a. sandwiched PDF document):
```kotlin
val pdfFile: File = result.sandwichedPdfDocumentFile
```
In all cases the OCR result also contains the recognized plain text as well as the bounding boxes and text results of recognized paragraphs, lines and words:
```kotlin
val text: String = result.recognizedText // recognized plain text

// bounding boxes and text results of recognized paragraphs, lines and words:
val paragraphs: List<OcrResultBlock> = result.paragraphs
val lines: List<OcrResultBlock> = result.lines
val words: List<OcrResultBlock> = result.words
```
See the API reference of the `OcrResult` class for more details.
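As described above, each recognized block carries text, a bounding box, and a confidence value; a common post-processing step is to discard low-confidence words. The following is a minimal, self-contained Kotlin sketch of that idea. Note that `Box` and `Block` are stand-in data classes for illustration only, not the real SDK types; the actual field names on `OcrResultBlock` may differ.

```kotlin
// Stand-in types (hypothetical): the real OcrResultBlock may use
// different field names for text, confidence, and bounding box.
data class Box(val left: Int, val top: Int, val right: Int, val bottom: Int)
data class Block(val text: String, val confidence: Double, val box: Box)

// Keep only words the engine is reasonably sure about.
fun filterConfident(words: List<Block>, threshold: Double): List<Block> =
    words.filter { it.confidence >= threshold }

fun main() {
    val words = listOf(
        Block("Invoice", 0.97, Box(10, 10, 120, 40)),
        Block("t0ta1", 0.42, Box(10, 50, 80, 80)), // noisy detection
        Block("Total", 0.91, Box(10, 90, 80, 120))
    )
    val confident = filterConfident(words, 0.8)
    println(confident.map { it.text }) // [Invoice, Total]
}
```

The appropriate threshold depends on your document quality and use case; values around 0.8 are a common starting point.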