The Central Institute of Indian Languages previously coordinated the development of corpora of more than 45 million words in the Scheduled Languages, under the Technology Development for Indian Languages (TDIL) scheme of the Ministry of Communication and Information Technology.

These corpora were created following sampling methodologies and are therefore balanced. They are available in Indian Standard Code for Information Interchange or Indian Script Code for Information Interchange (ISCII) format. The Institute also intends to enlarge these corpora to about twenty million words in each language.

The Institute, in collaboration with Lancaster University, has converted these corpora into Unicode format. The corpora in this format, together with the Lancaster University corpora, are also available to users at:

On its own, the Institute is now developing corpora in languages recently included in the Eighth Schedule, such as Bodo, Dogri, Maithili and Santali. All the prose texts available in these languages are being keyed in, since it is not possible to obtain texts from all the sampling domains.

The corpora in Indian languages developed in this way are maintained by the Institute and distributed free of cost to scholars for academic purposes.

However, in the near future the Institute will also be able to provide different kinds of resources in Indian languages against payment or subscription. This activity, as currently planned, will form part of the proposal: Linguistic Data Consortium in Indian Languages (LDCIL).

Register here to receive corpus




Site hosted on March 11, 2005 by Ms. Bela Banerjee, Joint Secretary (L), Department of Secondary and Higher Education, Ministry of Human Resource Development, New Delhi