In 2020, the World Health Organization (WHO) estimated that at least 2.2 billion people worldwide live with some form of blindness or vision impairment [1]. Blindness and vision impairment are not, however, the only conditions that may limit access to text. Other print disabilities exist which are rooted in learning disabilities with genetic and/or neurobiological causes. These tend to alter the brain's cognitive processes and, as a result, interfere with learning basic skills such as reading and writing [2]. Irrespective of the cause, limited access to text has a long-lasting effect on an individual's life, restricting access to educational and work opportunities as well as to public services [3]. With proper support and appropriate tools, people with visual impairments and print disabilities may achieve success in school, at work and in the community.
Inclusivity and representation are essential for society to function well. Indeed, inclusivity and representation are a key national strategy, so much so that the first benchmark of the inclusion framework proposed by Malta's Ministry of Education and Employment [4] states that:
All learners have access to opportunities for participation in educational systems and structures
Technology to support access to text
Text-to-speech is the process by which written text is read out, often through the use of artificial intelligence. The technology has improved substantially over recent years, moving away from the monotonic, robotic-sounding voices associated with early text-to-speech systems towards more natural-sounding voices which follow the inflection of the sentences being read. Text-to-speech, however, typically requires that the text to be read is already available in a digital format as a string of characters and words. While this is readily available for digital-born documents, such as the text on this very page, printed documents need additional processing to extract this information. This is because, when we obtain digital versions of printed documents, for example by scanning them or by taking pictures of them, the result is an image that contains nothing more than grey-level values.
So, to prepare such documents for use with text-to-speech, the images first need to be passed on to Optical Character Recognition (OCR) algorithms. The role of these algorithms is to convert any text-containing document into an editable, digital format suitable for further processing. This means that a machine running the recognition will recognise the text, mimicking the combination of the human eye and mind [5].
As with text-to-speech algorithms, OCR methods have improved significantly over the years. By borrowing knowledge from linguistics, OCR methods can predict characters when the image clarity is poor, producing text with high accuracy. The performance of OCR methods, however, depends on the quality of the input images, and may require prior processing to clean the image and enhance the contrast between the text pixels and the page background. Such processing may involve image binarization to present the text as black pixels over a white background, segmentation to distinguish text regions from other information in the document, and normalisation to correct for page distortions such as skew, among others.
Focusing on children's books
Stories feature prominently throughout our lifetime, and more so in our early years. The characters that children meet in books help them learn about everyday tasks as well as complex ideas such as showing compassion for others, the idea of time, sharing with others and the difference between good and evil, and allow children to come to terms with difficult situations such as death and bereavement. Stories build up the child's resilience, demonstrating that no matter what the problem may be, there is a solution to be found. Stories also enhance the child's imagination and creativity. In short, stories are essential in the child's growth and development. Providing accessibility to all story books then becomes a matter of paramount importance, giving all children equal chances to delve into the wondrous adventures stored within a book's pages.
Fairy tales do not give the child his first idea of bogey. What fairy tales give the child is his first clear idea of the possible defeat of bogey. The baby has known the dragon intimately ever since he had an imagination. What the fairy tale provides for him is a St. George to kill the dragon.
G.K. Chesterton, Tremendous Trifles
From a computer vision perspective, story books provide an interesting and challenging problem. Unlike books and documents for adults, children's books contain many colourful, highly stylised illustrations, where the text can use non-standard fonts and can be overlaid on the graphics. All of these facets, while making the book exciting, pose difficulties at the most fundamental image pre-processing step for OCR, namely the figure-background separation, or binarization. Addressing these challenges will not only improve the accessibility of children's books, but will also increase the robustness of OCR and text-to-speech techniques, benefiting the processing of all kinds of documents.
The challenge
The quality of the binarization result affects the whole OCR pipeline, since without proper extraction of the text, the classification and recognition of the text characters will necessarily perform poorly. If the imaging conditions are controlled, binarization can be tackled fairly easily with off-the-shelf solutions such as those proposed by Otsu [6], Sauvola [7] or Gatos [8]. Images of documents are, however, increasingly being obtained with digital cameras, notably those built into mobile phones. While these cameras keep improving in image quality and resolution, in comparison to dedicated document scanners they introduce greater variability in image quality due to variable illumination. One also needs to take into account that document images may suffer from paper degradation, which introduces faint characters, bleed-through and stains [9].
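To illustrate what an off-the-shelf global method does, the following is a minimal, pure-Python sketch of Otsu's thresholding [6]: it picks the grey-level that best separates the histogram into two classes by maximising the between-class variance. The toy "page" at the end is invented for demonstration; practical implementations (for example, in OpenCV) operate on full images and are far more efficient.

```python
def otsu_threshold(pixels):
    """Return the grey-level that maximises between-class variance (Otsu, 1979)."""
    # Build a 256-bin histogram of the 8-bit greyscale values.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg, weight_bg = 0.0, 0
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]          # pixels at or below the candidate threshold
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg  # pixels above the candidate threshold
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance for this candidate threshold.
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t

    return best_t

# Toy "page": dark text pixels (~40) over a light, uniform background (~200).
page = [200] * 90 + [40] * 10
t = otsu_threshold(page)
binary = [0 if p <= t else 255 for p in page]  # text black, background white
```

On such a clean, bimodal histogram a single global threshold works well; the difficulties described below arise precisely when the histogram is not this well behaved.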
Classical binarization algorithms typically compute a threshold over the pixel intensities in a neighbourhood of each pixel to determine an ideal two-way separation of image foreground and background. Other statistical models, such as mixtures of Gaussians and conditional random fields, have also been used to model the image background and thus separate the text characters from the page background [10]. More recent approaches, however, treat binarization as a classification problem and employ machine learning methods to perform the classification. To this end, shallow models such as self-organising maps [11] or support vector machines [12], as well as deep neural networks [13, 14], have been proposed with varying degrees of success.
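The neighbourhood-based thresholding mentioned above can be sketched with Sauvola's rule [7], which sets a per-pixel threshold T = m(1 + k(s/R - 1)) from the mean m and standard deviation s of a local window. The window size and the k and R values below are common defaults rather than project settings, and the nested-loop implementation is written for clarity, not speed.

```python
import math

def sauvola_binarize(img, window=3, k=0.2, R=128):
    """Binarize a 2D greyscale image (list of rows) with Sauvola's local
    threshold: T = m * (1 + k * (s / R - 1)), computed per pixel."""
    h, w = len(img), len(img[0])
    half = window // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gather the local neighbourhood, clipped at the image borders.
            vals = [img[yy][xx]
                    for yy in range(max(0, y - half), min(h, y + half + 1))
                    for xx in range(max(0, x - half), min(w, x + half + 1))]
            m = sum(vals) / len(vals)
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            T = m * (1 + k * (s / R - 1))
            out[y][x] = 0 if img[y][x] <= T else 255
    return out
```

Because the threshold adapts to each neighbourhood, such rules cope better with uneven illumination than a single global cut-off, which is why they remain popular baselines for camera-captured documents.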
One commonality among these methods is that the degradation of the page, while at times difficult to separate from the foreground, is still clearly a property of the background. In children's books, however, particularly when the text is placed on the graphics, the page image can be divided into three layers rather than two: the page paper layer; a layer of graphics or illustrations placed on top of the paper layer; and a layer of printed text placed on top of the previous two. In instances where the text overlaps the graphics or illustrations, as shown in Figure 1, separating the image into just two foreground and background classes is likely to cause misclassification errors. For this reason, we believe that rather than a classical binary separation, it would be better to perform a three-way separation of the image, dividing it into text regions, illustrations and the general page background. This will increase the likelihood of the OCR performing well.
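To make the three-way idea concrete, the toy sketch below clusters pixel intensities into three groups with a simple one-dimensional k-means. This is purely illustrative: it ignores spatial layout, colour and texture, so it is a stand-in for the concept, not a description of the project's actual method, and the toy intensity values are invented.

```python
def three_way_kmeans(pixels, iters=20):
    """Cluster greyscale intensities into three classes, a toy stand-in for
    the text / illustration / page-background separation discussed above."""
    # Initialise the three centres at the darkest, middle and brightest values.
    lo, hi = min(pixels), max(pixels)
    centres = [float(lo), (lo + hi) / 2.0, float(hi)]
    for _ in range(iters):
        # Assign every pixel to its nearest centre.
        groups = [[], [], []]
        for p in pixels:
            idx = min(range(3), key=lambda i: abs(p - centres[i]))
            groups[idx].append(p)
        # Recompute each centre as the mean of its assigned pixels.
        for i in range(3):
            if groups[i]:
                centres[i] = sum(groups[i]) / len(groups[i])
    labels = [min(range(3), key=lambda i: abs(p - centres[i])) for p in pixels]
    return labels, centres

# Toy page: dark text (~30), mid-grey illustration (~120), light paper (~230).
page = [30] * 20 + [120] * 30 + [230] * 50
labels, centres = three_way_kmeans(page)
```

A real three-way separation would need far richer cues than raw intensity, precisely because illustrations and overlaid text can share similar grey levels; that is the research challenge the project is addressing.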
Figure 1: A photograph of a page from a children's book, binarized by six different methods. One may observe that these methods either do not completely remove the image background or, when they do, degrade the text.
Tackling the challenges
The research challenges mentioned above are being investigated by Luke Abela, Erika Spiteri-Bailey, Andre Tabone, Stefania Cristina and Alexandra Bonnici as part of the Doc2Speech research project funded by the MCST under the Research Excellence Programme.
As part of our work, we are collecting samples of illustrations that can be used to create a dataset for training algorithms to localise text overlaid on illustrations. Anyone interested in contributing to this work can contact Alexandra Bonnici (alexandra.bonnici@um.edu.mt).
To follow our progress, follow the Doc2Speech Facebook page.
References
[1] World Health Organisation (2021). Blindness and Vision Impairment.
[2] Learning Disabilities Association of America. Types of Learning Disabilities.
[3] World Health Organisation (2019). World Report on Vision.
[4] Ghirxi, J., Borg, D., Camenzuli, J., Pace, M., and Schembri Muscat, A. (2019). A National Inclusive Education Framework. Ministry of Education and Employment.
[5] Patel, C., Patel, A., and Patel, D. (2012). Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study. International Journal of Computer Applications, 55(10):50–56.
[6] Otsu, N. (1979). A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66.
[7] Sauvola, J., and Pietikäinen, M. (2000). Adaptive Document Image Binarization. Pattern Recognition, vol. 33, pp. 225–236.
[8] Gatos, B., Pratikakis, I., and Perantonis, S. J. (2008). Improved Document Image Binarization by Using a Combination of Multiple Binarization Techniques and Adapted Edge Information. In 19th International Conference on Pattern Recognition, pp. 1–4.
[9] Mustafa, W., Mydin, M., and Kader, M. (2018). Binarization of Document Images: A Comprehensive Review. Journal of Physics: Conference Series.
[10] Tensmeyer, C., and Martinez, T. (2020). Historical Document Image Binarization: A Review. SN Computer Science, 1, 173.
[11] Hamza, H., Smigiel, E., and Belaïd, A. (2005). Neural Based Binarization Techniques. In International Conference on Document Analysis and Recognition (ICDAR), IEEE.
[12] Kasmin, F., Abdullah, A., and Prabuwono, A. S. (2017). Ensemble of Steerable Local Neighbourhood Grey-level Information for Binarization. Pattern Recognition Letters, 98:8–15.
[13] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324.
[14] Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. In Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.