Data sets of printed Tamil characters and printed documents


The Tamil digitization project was started with two aims: to develop software that converts printed Tamil books into digital form, and to publish a collection of valuable books in digital form through the Internet.

Click here to view the project website

Under this project, printed manuscripts of various categories were scanned as images. From these scanned images, a data set of printed Tamil character images and documents was created.

This data set is used to develop OCR software for Tamil. In addition, it is made available to research communities so that they can test their own work on improving Tamil OCR. To this end, the following data sets of printed Tamil characters and printed documents were constructed:

  1. Data set of printed Tamil characters – UJTDchar
  2. Scanned Tamil documents of four diverse types – UJTDdocC
  3. Scanned desktop-published documents in 20 different font faces – UJTDdocF


Click the above links to download the data sets.