The Central Institute of Indian Languages previously coordinated the development of corpora of more than 45 million words in the Scheduled Languages, under the Technology Development for Indian Languages (TDIL) scheme of the Ministry of Communication and Information Technology.
These corpora were created following sampling methodologies and are therefore balanced. They are available in ISCII format (Indian Standard Code for Information Interchange, also expanded as Indian Script Code for Information Interchange). The Institute also intends to expand these corpora to twenty million words in each language.
The Institute, in collaboration with Lancaster University, has converted these corpora into Unicode format. The corpora in this format, together with the Lancaster University corpora, are also available to users at: http://www.emille.lancs.ac.uk/home.htm
On its own, the Institute is now developing corpora in languages recently included in the Eighth Schedule, such as Bodo, Dogri, Maithili and Santali. All the prose texts available in these languages are being keyed in, since it is not possible to obtain texts from all the sampling domains.
The corpora in Indian languages thus developed are maintained by the Institute and distributed free of cost to scholars for academic purposes.
However, in the near future the Institute will also be able to provide different kinds of resources in Indian languages against payment or subscription. This activity, as currently planned, will form part of the proposal for the Linguistic Data Consortium in Indian Languages (LDCIL).
Site hosted on March 11, 2005 by Ms. Bela Banerjee, Joint Secretary (L), Department of Secondary and Higher Education, Ministry of Human Resource Development, New Delhi