Introduction
A Model
Indian Languages and LDC-IL
Current Status of Indian Languages Data
Proposed Activities
Participating Institutions
Location
Funding - Grants
Annual Membership Fee
Governance of the LDC-IL
About this Proposal
Budget
Annexure 1:Speech Recognition and Synthesis
Annexure 2:Character Recognition
Annexure 3:Natural Language Processing
Annexure 4:Corpora Creation in Indian Languages
Annexure 5:Notes on Items
Annexure 6:International Workshop on LDC-IL for Language Technology R & D August 17-18, 2003
Annexure 7:Minutes of the Meeting on LDC in the Chamber of ES on 24-2-2004
Annexure 8:Minutes of the Meeting to Consider the Project Proposal on LDC-IL held on 26.7.2004 at 3.00 PM in the Chamber of ES

A Proposal

Linguistic Data Consortium for Indian Languages (LDC-IL)


 

1. Introduction: The concept of creating a Linguistic Data Consortium for Indian languages takes advantage of the giant strides in Information Technology (IT) that India has made, and follows the directives of the meeting of the Hindi Committee presided over by the Prime Minister, where it was agreed that the Government would take the necessary steps to enhance machine-readable language data in Hindi and other Indian languages on a large scale.

Language data is the key ingredient for research and development in the area of language technology. The issues surrounding the collection, processing and annotation of the large quantities of linguistic data involved make it necessary to draw on a number of disciplines such as linguistics, statistics and engineering. The data thus collected will be of high quality and will conform to defined standards. It is important that India creates a data consortium for sharing resources and avoiding duplication of effort, so that the entire research community benefits. This consortium will not only create and manage large Indian language databases; it will also provide a forum for researchers in India and other countries working on Indian languages to publish, and to build products based on such databases, in ways that would not otherwise be possible.


2. A Model: An ideal model of such a Consortium is best exemplified by the success of the Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania, USA. It is an open consortium of universities, companies and government research laboratories that creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. This initiative, known as 'LDC', was established in 1992 with an initial US government grant to provide a new mechanism for large-scale development and widespread sharing of resources for research in linguistic technologies. It now includes more than 100 companies, universities, and government agencies as its active users and members. The core consortium operations of the LDC are now fully self-supporting after a period of ten years. The activities include maintaining the data archives, producing and distributing CD-ROMs, and arranging networked data distribution, among other things. This has provided a great impetus to R&D in the field of language technology for English and other European languages. The LDC is still hosted by the University of Pennsylvania, but it has grown from a government-funded project into an independent initiative. It is proposed to adopt a similar approach in the Indian context.


3. Indian Languages and LDC-IL: In this context the Central Institute of Indian Languages, Mysore and other like-minded institutions working on Indian Languages technology like Indian Institute of Science, Bangalore, Indian Institute of Technology, Bombay, Indian Institute of Technology, Madras, and the International Institute of Information Technology, Hyderabad, etc., propose to set up a Linguistic Data Consortium for Indian Languages (LDC-IL) that will help the researchers and developers worldwide in the field of corpus linguistics and language technology related to Indian Languages. It will be known as 'LDC for Indian Languages' (LDC-IL), and these institutions will be known as the Lead Institutions in this initiative.


4. Current status of Indian Languages Data: During the 1990s, collection of linguistic data to form Indian language corpora began under the Technology Development in Indian Languages (TDIL) project of the Ministry of Communication and Information Technology of the Government of India. This initiative resulted in the creation of 45-million-word corpora in 15 Indian languages, which are housed in CIIL and distributed through the Institute free of cost for non-commercial and research purposes, as per the understanding with the Ministry of CIT, Government of India. This has helped other applied linguistic research, such as in Language Teaching, Discourse Analysis, and Language Acquisition. It also formed the basis for technology development for pen and OCR-based applications, voice-based services, and information retrieval mechanisms in Indian languages.

In the proposed initiative under the LDC-IL, all institutions, agencies and industries interested in Indian language technology will be requested to contribute their collections to initiate the activities of the LDC-IL. However, owing to the complexity of human languages, any development in such areas requires a truly large amount of data in the form of speech, text, dictionaries etc., made available as a shared resource to those who need it. This enlargement of databases and enhancement of high-quality corpora will be the responsibility of the lead institutions.

Thus, the proposed project will bring all those interested in data-based R&D in Indian language technology, and in sharing resources and tools, into a consortium in the interest of the entire Language Technology community in India.


5. To sum up the proposed activities, LDC-IL will focus on:

• Becoming a repository of linguistic resources in all Indian languages in the form of text, speech and lexical corpora.
• Facilitating creation of such databases by different organizations.
• Setting standards for data collection and storage of corpora for different research and development activities.
• Supporting development and sharing of tools for data collection and management.
• Facilitating training through workshops, seminars etc. in technical as well as process related issues.
• Creating and maintaining the LDC-IL website that would be the primary gateway for accessing LDC-IL resources.
• Designing or providing help in creation of appropriate language technology for mass use.
• Providing the necessary linkages between academic institutions, individual researchers and the masses.

Major areas of Linguistic Resource Development:
a. Speech Recognition and Synthesis
b. Character Recognition
c. Corpora Creation in Indian Languages
d. Several by-products like lexicon, thesauri etc.

The details of coverage of the major areas of activity under LDC-IL are given in Annexures 1 to 4.


6. Participating Institutions : All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. Besides the lead institutions already mentioned, other possibly interested institutes and organizations may include: Indian universities with major departments of Linguistics and Computer Science; ISI Calcutta; TIFR Mumbai; HP Labs India; IBM; C-DOT; C-DAC; Tata InfoTech; all other IITs; KHS; NCPUL; Rashtriya Sanskrit Sansthan; TDIL, MIT; and Departments of Linguistics of various universities in India, beginning with CALTS of the University of Hyderabad. This list is open-ended and suggestive, not exclusive.


7. Location : The services under the proposed LDC-IL will be hosted and managed by the Central Institute of Indian Languages, Mysore. This would leverage the expertise available in the institution and its prior experience in creating databases. The host institute will require additional space, human resources and other facilities to run the full-time activity of developing and maintaining the LDC-IL. It is important to mention that all personnel for the LDC-IL will be engaged on a contractual basis, selected through the appropriate procedure following the norms fixed by the Government of India for reservation etc. However, since the LDC-IL is an R&D activity as a whole, it is natural that a number of national institutions will be involved. They will be invited to contribute to the areas listed under Annexures 1 through 4, where specific research areas are identified and elaborated. All these activities, and the funding required to carry on such R&D work, will be coordinated by the CIIL. However, depending on the experience of the lead institutions, the relevant sub-committees under the Project Advisory Committee (PAC) will help in arriving at appropriate decisions.


8. Funding - Grants : For the core funding, the Department of Secondary and Higher Education will create an appropriate scheme under which a substantial one-time grant during the Xth Plan, and possibly another set of grants during the next Plan, will be released to the host institution, i.e. CIIL. As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions, based on an intent of understanding to be signed between the Heads of such institutions and CIIL. It is also understood that this release of funds and its management will be in a project mode through CIIL's PL account. CIIL may be allowed to set up a separate account for LDC-IL activities only. All receipts from, and payments relating to, potential users of the services and the linguistic data will pass through this account, creating a corpus fund and generating revenues that would move the scheme towards self-sustenance within a ten-year time frame. Needless to say, annual accounts of the utilization of funds and of revenue generation will be submitted by CIIL to the MHRD, along with a progress report every quarter. Of the revenue generated, 40% will go towards creation of the corpus fund and the remaining 60% will be ploughed back into implementing the tasks.

TDIL-MCIT promised full support to this activity at its Advisory Committee meeting held at the Indian Institute of Science, Bangalore on 16th April 2004, in the presence of its FA, and all linguistic data collected or generated through TDIL projects would also be submitted to the LDC-IL. Therefore, through a mutual arrangement between the two Ministries, a copy of the project report as well as all periodic reports will also be sent to the TDIL-MCIT by CIIL.


9. Annual Membership fee :

(a) Open to academic institutes, Research Organizations, and Corporate sector from all over the world.

(b) Members will be encouraged to contribute databases and will be offered a monetary share of the revenues earned by the databases they contribute.

(c) The databases will be available for R&D purposes to all members and non-members on payment of the appropriate fee, with a license for use only. The organization will be asked to sign a License Agreement to the effect that the databases will only be used for research and development and may not be distributed by it to any other institute or individual either free or for a fee. However, the IP and the copyright of any product developed as a result of such an R&D activity shall lie with the organization that has created the product.

Differential rates of annual fee will be charged as given below:

India:

1. Individual Researchers: Rs.2000/- per annum
2. Educational Institutions: Rs.20,000/- per annum
3. Software and related industry : Rs.2,00,000/- per annum

Other countries

1. Individual Researchers: $ 2000/- per annum
2. Educational Institutions: $ 20,000/- per annum
3. Software and related industry : $ 50,000/- per annum


10. Governance of the LDC-IL :

The LDC-IL will have a Project Advisory Committee consisting of members from each of the participating lead Institutions.

The permanent members will include Directors or nominees of the lead institutions already identified. The PAC may be expanded later in the event of the other major Institutions joining in this activity and some representatives from private sector IT initiatives may also be included in the PAC at a later date. The total number of PAC members shall not ordinarily exceed 21, including the ex-officio members.

1. It is to be understood that even if institutions from abroad join this Consortium the administration/governance of it will remain with Indian members only.

2. Two officials of the Language Bureau nominated by the Ministry of HRD will be members of the PAC. At least one of them will be from the finance division.

3. One official representative from the Ministry of Information and Communication Technology, Government of India.

4. One Expert in IPR matters normally drawn from Institutions like National Law School University, Bangalore etc.

5. The Director, Central Institute of Indian Languages will be the Head of the LDC-IL. He will be assisted by a Project Director nominated/appointed for the purpose.

6. The Project Director will be the ex-officio convener of the advisory committee.


11. About this Proposal : This proposal on LDC-IL has evolved through participation of scholars from different Institutions in India and abroad. Summary of the same is as follows:

(a) On August 13, 2003, Prof. Udaya Narayana Singh, Director, Central Institute of Indian Languages, made a presentation on the concept of such a consortium and the need for the Ministry of HRD to create it.

(b) On August 17 and 18, 2003 an International Workshop on Linguistic Data Consortium for Indian Languages was held at the Institute. It was inaugurated by Smt. Kumud Bansal, Additional Secretary; Smt. Bela Banerjee, Joint Secretary (L), from the Ministry also participated. The workshop was conducted in collaboration with IIIT Hyderabad and HPLabs, India, and was attended by participants from LDC, University of Pennsylvania, and from most of the institutes involved in computational linguistics in India.

(c) On August 19, 2003 a follow-up meeting was held at the Indian Institute of Science, Bangalore, in which a select group of experts from institutions involved in language technology applications, including HPLabs and MAIT, participated. A group of experts from IITB, IITM, IISc and IIIT Hyderabad was formed, with the Director, CIIL as Coordinator, to formulate a formal proposal to be submitted to the Ministry.

(d) Through teleconferences and email correspondence, a proposal was formulated and sent to the Ministry vide letter No.F.2-1/2003/LDC-IL dated November 18, 2003.

(e) On December 19, 2003 a meeting of representatives of lead Institutes was held at the Institute to discuss the proposal submitted to the Ministry.

(f) With additional inputs received from them, the present proposal was formulated and presented on February 24, 2004 in the MHRD.

(g) The proposal was further modified to include suggestions made after the presentation by the Director, CIIL before the Secretary, Additional Secretary, Joint Secretary (L) and Director (Finance) in the Department of Education, Ministry of HRD on February 24, 2004.

(h) On April 16, 2004, the present version of the proposal was presented by the Director, CIIL before the Advisory Committee of the TDIL-MCIT (Technology Development in Indian Languages, Ministry of Communication & Information Technology). The Financial Adviser of the MCIT was also present during the presentation, and the suggestions of the members have been included.

(i) Preparation of EFC should start after submission of the final version of the proposal.

(j) Within six months the project should start functioning.


12. Budget

Broad Outline : Rs. 221.60 lakhs per year. Total: Rs. 1772.8 lakhs (variable, please see note below) for the next eight years covering the present and next plan period.

Break up of figures for one year will be as follows :

Sl.No.  Head                                                                    Amount (Rs.)
1.      Human Resources                                                         78,00,000
(a)     Project Director (1) Rs. 30,000 (variable) 30,000 x 12 m                4,20,000
(b)     Scientist A (3) 29,000 x 3 persons x 12 man-months                      10,80,000
(c)     Scientist B (4) 21,000 x 8 x 12 m                                       21,12,000
(d)     Scientist C (5) 14,000 x 6 x 12 m                                       11,52,000
(e)     Scientist D (8) 11,000 x 8 x 12 m                                       12,48,000
(f)     Project technicians (Rs.5,000 x 20 x 12 m)                              14,40,000
(g)     Maint Personnel - Accounts (Rs.11,000 x 1 12m)                          1,56,000
(h)     Maint Personnel - Sales & Promo (Rs.7,000 x 1 x 12m)                    96,000
(i)     Maint Personnel - General (Rs.7,000 x 1 x 12m)                          96,000
        Tasks                                                                   56,60,000
2.      Tasks at various Participating Institutions (as in Annex)               56,60,000
        Events                                                                  50,00,000
3.      Academic Meetings in diff. Instt x 2                                    2,00,000
4.      LDC-IL PAC meetings at CIIL x 2                                         2,00,000
5.      Seminars & Events in diff Instt - 7 every year in participating Instt   15,00,000
(a)     Seminars (National) in diff Instt x 2                                   4,00,000
(b)     Seminars (Regional) in diff Instt x 4                                   6,00,000
(c)     Seminars (Int'l) rotating in diff participating Instt x 1 per year      5,00,000
6.      (Prod) Workshops for production (6)                                     6,00,000
7.      Training Programmes x 4 per year                                        2,00,000
8.      Travel & Incidentals                                                    8,00,000
        Equipments & Maintenance                                                27,00,000
9.      Hardware                                                                20,00,000
10.     Software/Tools                                                          4,00,000
11.     Equipment maintenance                                                   Variable (From OE-Non-Pl)
12.     Maintenance of LDC-IL                                                   3,00,000
        IPR                                                                     10,00,000
13.     IPR/Copyright payments (variable)                                       5,00,000
14.     Publications, incl E-pub (10 a year)                                    5,00,000
        TOTAL                                                                   Rs. 2,21,60,000



NOTE:

1. This is a broad indication of proposed expenditure under different heads. The Director CIIL, on the advice of the Project Advisory Committee of the LDC-IL, may be authorized to re-appropriate funds among the heads indicated above, without exceeding the overall sanctioned budget.
2. The first three years of the project period are incubation years, wherein no heavy flow of income is anticipated from the project.
3. From the third year onwards, flow of income is expected. It is estimated that in the initial period the annual revenue may be 8% to 10% of the annual investment projected in the budget proposal given above. If this income is placed in a corpus fund to be used for LDC-IL, there will be a turnover of Rs.17.73 lakhs to Rs.22.16 lakhs per year in the initial period. From the sixth year of the project, the revenue is expected to be around 25% to 30% of the amount invested, i.e. Rs.55.4 lakhs to Rs.66.48 lakhs annually. Once these are contributed to the corpus, it is likely that at the end of eight years the corpus fund will stand between Rs.201.66 lakhs and Rs.243.76 lakhs, plus interest. It is proposed that beyond eight years the Ministry may consider giving grants only for events (Rs.50 lakhs), tasks of software development (Rs.64.76 lakhs), and maintenance of equipment (Rs.15.24 lakhs), i.e. Rs.130 lakhs a year. The services of the personnel and the IPR costs will be paid from the interest on the corpus fund (Rs.14.63 lakhs) plus further income of Rs.66.48 lakhs, i.e. Rs.81.11 lakhs generated annually. Added to the maintenance cost of Rs.130 lakhs, the total comes to about Rs.211.11 lakhs.
4. The project is anticipated to be fully self-sufficient and self-financing by the end of the Eleventh Plan period, or by the end of eight years from its commencement.
5. In case people in service in the Government or Autonomous Institutions in a substantive capacity are selected, their service and salary will be protected.
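
The arithmetic in note 3 can be retraced with a short script. This is only an illustrative check under stated assumptions: revenue is credited to the corpus fund for two years at the initial rate and three years at the later rate (the combination that reproduces the quoted totals), the upper later-period figure of Rs.66.48 lakhs is taken as 30% of the annual budget, and the 6% interest rate is inferred from the Rs.14.63 lakhs figure rather than stated in the proposal. All names are hypothetical.

```python
ANNUAL_BUDGET = 221.60          # Rs. lakhs per year, from the broad outline
PROJECT_YEARS = 8
ASSUMED_INTEREST_RATE = 0.06    # assumption: inferred, not stated in the proposal


def pct_of_budget(pct: float) -> float:
    """Revenue in Rs. lakhs for a given percentage of the annual budget."""
    return ANNUAL_BUDGET * pct / 100


def corpus_after_project(initial_pct: float, later_pct: float,
                         initial_years: int = 2, later_years: int = 3) -> float:
    """Corpus fund (Rs. lakhs) if all revenue is deposited, ignoring interest."""
    return (pct_of_budget(initial_pct) * initial_years
            + pct_of_budget(later_pct) * later_years)


total_outlay = ANNUAL_BUDGET * PROJECT_YEARS
corpus_low = corpus_after_project(8, 25)
corpus_high = corpus_after_project(10, 30)

# Position after the eighth year, using the upper estimates.
interest = corpus_high * ASSUMED_INTEREST_RATE
own_income = interest + pct_of_budget(30)
continuing_grant = 50 + 64.76 + 15.24   # events + software tasks + equipment

print(f"Total outlay over {PROJECT_YEARS} years: {total_outlay:.2f} lakhs")
print(f"Corpus fund after 8 years: {corpus_low:.2f} to {corpus_high:.2f} lakhs")
print(f"Own income per year: {own_income:.2f} lakhs")
print(f"Grant plus own income: {continuing_grant + own_income:.2f} lakhs")
```

Changing `initial_years`, `later_years` or the assumed interest rate shows how sensitive the self-sufficiency estimate is to those choices.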


Annexure-1: Speech Recognition and Synthesis

1. Introduction :

The primary objective of the data collection effort is to build speech recognition and synthesis systems for Indian languages. Although such ASR and TTS systems are available around the world for a number of mainstream languages, commercially viable speech systems for Indian languages are not available.

Voice User Interfaces for IT applications and services have become more and more prevalent for languages like English, and are valued for their ease of access, especially in telephony-based applications. In a country like India, where the majority of the population is not comfortable using English and given the relatively lower rates of literacy, local language speech interfaces can provide access to IT applications and services, through internet and/or telephones, to the masses. If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and Internet to access a wide range of services and information on health, agriculture, travel, etc. However, for this to become a reality, a computer has to be able to accept speech input in the user's language and provide speech output. Also, in multilingual India, if speech technology is coupled with translation systems between the various Indian languages, services and information can be provided across languages more easily.

Although speech technology has been the focus of research in India for a number of years and the technology itself has matured for real-world applications, the main obstacle in customizing this technology for various Indian languages is the lack of appropriate annotated speech databases in these languages. The focus here is (i) to collect data that can be used for building speech enabled systems in Indian languages and (ii) to develop tools that facilitate collection of high quality speech data.

2. Background - Speech Recognition :

Automatic speech recognition is the task of converting a speech signal into its orthographic representation. There are two different categories of speech recognition systems:

• Isolated word recognition and connected word systems, as in command and control applications.
• Continuous speech recognition systems. In continuous speech recognition there are two different categories: read speech and spontaneous speech.

3. Background - Speech Synthesis :

The task of speech synthesis is to convert written text (orthographic representation) to speech. The vocabulary should not be restricted for speech synthesis, and synthesized speech must be close to natural speech. To enable unrestricted speech synthesis, the sentence is normally converted to a sequence of basic units. Appropriate rules of synthesis are then employed to produce natural-sounding speech.

The focus is primarily on building (a) vocabulary independent speech to speech translation (for a pair of Indian languages) and (b) vocabulary dependent isolated word recognition in the Indian languages.

4. Long Term Goal:

The grand vision of this project is to collect data to provide speech-to-speech translation from each and every language to each and every other language spoken in India (including Indian English). Such a system would include unlimited vocabulary speech synthesis and recognition systems for every Indian language coupled with machine translation systems between those languages. The block diagram given below describes the basic architecture of such a system.

[Block diagram: Speech input in Language A -> Recognized Text in Language A]
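
The architecture described above chains three components: recognition, translation and synthesis. A minimal sketch of that composition follows; the component interfaces, the toy stand-ins and all names are hypothetical illustrations and not part of the proposal. It also counts the directed language pairs implied by the long-term goal, assuming the 22 Scheduled languages plus Indian English.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical component interfaces: audio is modelled as raw bytes.
Recognizer = Callable[[bytes], str]    # speech in language A -> text in A
Translator = Callable[[str], str]      # text in A -> text in B
Synthesizer = Callable[[str], bytes]   # text in B -> speech in B


@dataclass
class SpeechToSpeechPipeline:
    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def run(self, speech_in: bytes) -> bytes:
        text_a = self.recognize(speech_in)   # recognized text in language A
        text_b = self.translate(text_a)      # translated text in language B
        return self.synthesize(text_b)       # speech output in language B


def directed_pairs(n_languages: int) -> int:
    """Number of (source, target) pairs when every language maps to every other."""
    return n_languages * (n_languages - 1)


# Toy stand-ins so the sketch runs end to end; real components would be
# trained ASR, MT and TTS systems.
toy = SpeechToSpeechPipeline(
    recognize=lambda audio: audio.decode("utf-8"),
    translate=lambda text: {"namaste": "namaskaram"}.get(text, text),
    synthesize=lambda text: text.encode("utf-8"),
)

print(toy.run(b"namaste"))
print(directed_pairs(22 + 1))   # 22 Scheduled languages plus Indian English
```

Keeping the three stages behind separate interfaces means that a recognizer or synthesizer built for one language can be reused across every pair that involves that language.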


5. Short Term Goal:

To create databases for building (a) a bi-directional speech-to-speech translation system for read speech for a pair of Indian languages, namely Hindi-Telugu, and (b) a speech recognition system for Indian English. Further, it is desired to collect large-vocabulary isolated-word data for the 22 Scheduled Indian languages.


[Block diagram, continued: Text in Language A -> Translated Text in Language B -> Speech Output in Language B]

6. Methodology for Short Term Effort:

Methodologies for data collection and development of tools required for the short-term and long-term goals are given below:

Data collection Effort for Automatic Speech Recognition (ASR)

The data collection effort will involve collection of read and spontaneous speech.

Data required:
Read speech corpora for two Indian languages and Indian English.
Channels:
1. Close talking microphone, on a desktop or laptop.
2. Telephone, both landline and mobile.
Annotation:
The data will be annotated at phoneme, syllable, word and sentence levels.

Data Collection for Isolated Speech Recognition

Channels:
1. Close talking microphone, on a desktop or laptop
2. Telephone, both landline and mobile
Demography:
10,000 words from 300 speakers (150 male, 150 female)

Data Collection for Text to Speech Synthesis

Data Required:
Data will be collected in the form of read-out phonetically balanced text which will ensure coverage of all speech sounds of the language concerned in different prosodic and phonological contexts. The phonetically balanced text will be extracted from a huge text corpus.
Channels:
Speech Synthesis requires high quality recording in an anechoic chamber using high quality microphones and recording equipment.
Demography:
6 speakers: 3 males and 3 females per language.
Annotation:
Data to be annotated at phone, phoneme, syllable, word, and phrase level.
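
Extracting a phonetically balanced text from a large corpus, as described above, is often approached as a greedy coverage problem: repeatedly pick the sentence that adds the most not-yet-covered units. The sketch below is only an illustration of that idea and not the proposal's method. As a loudly stated assumption, it uses character bigrams within words as a stand-in for phonetic units; a real system would need a grapheme-to-phoneme step for the language concerned. All names are hypothetical.

```python
def units(sentence: str) -> set[str]:
    """Stand-in for phonetic units: character bigrams inside each word (assumption)."""
    found: set[str] = set()
    for word in sentence.lower().split():
        found.update(word[i:i + 2] for i in range(len(word) - 1))
    return found


def select_balanced(corpus: list[str]) -> list[str]:
    """Greedily choose sentences until every unit seen in the corpus is covered."""
    target = set().union(*(units(s) for s in corpus))
    covered: set[str] = set()
    chosen: list[str] = []
    remaining = list(corpus)
    while covered != target and remaining:
        # Pick the sentence contributing the most uncovered units.
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen


corpus = ["ab cd", "ab", "cd ef", "ef"]
prompts = select_balanced(corpus)
print(prompts)
```

The greedy choice does not guarantee the smallest possible prompt set, but it keeps the recording script short while still covering every unit present in the source corpus.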

Tools Required for Data Collection and Annotation for ASR and TTS:
Standardization of tools for data collection and annotation is required. A number of different tools, such as EMULAB and PRAAT, are available for annotation of speech data. Convergence on representation, annotation and storage formats is required. The project will also focus on providing converters across different formats.

Tools for Speech Recognition:
The data will be annotated at phoneme, syllable, word and sentence levels. Tools need to be developed for semiautomatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases. One could adopt LDC's recording of data in the NIST format. This format is comprehensive in that it contains ALL the information about the recording environment, speaker information, sampling rate, number of channels, number of bits/sample, etc.
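
To make the idea of multi-level annotation plus recording metadata concrete, here is a minimal sketch of one possible in-memory record. It is a hypothetical schema for illustration only: the field names are assumptions and do not reproduce the NIST header format or any LDC convention.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float
    label: str


@dataclass
class Recording:
    speaker_id: str
    sample_rate_hz: int
    channels: int
    bits_per_sample: int
    # One tier per annotation level, e.g. "phoneme", "syllable", "word", "sentence".
    tiers: dict[str, list[Segment]] = field(default_factory=dict)

    def tier_is_consistent(self, name: str) -> bool:
        """True if the tier's segments are well-formed, ordered and non-overlapping."""
        previous_end = 0.0
        for seg in self.tiers.get(name, []):
            if seg.start < previous_end or seg.end <= seg.start:
                return False
            previous_end = seg.end
        return True


rec = Recording(
    speaker_id="spk001",
    sample_rate_hz=16000,
    channels=1,
    bits_per_sample=16,
    tiers={
        "word": [Segment(0.00, 0.42, "namaste")],
        "syllable": [Segment(0.00, 0.12, "na"), Segment(0.12, 0.27, "mas"),
                     Segment(0.27, 0.42, "te")],
    },
)
print(rec.tier_is_consistent("word"), rec.tier_is_consistent("syllable"))
```

A check of this kind is one example of what a semiautomatic annotation tool could run before a recording is accepted into the database.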

Tools For Speech Synthesis:
Other than tools for annotating speech databases, text annotation is also required for speech synthesis:
Text Annotation Tools Required:
1. PoS taggers, phrase boundary markers and intonation markers.
2. Identification and standardization of feature vectors for speech synthesis: vocalic/non-vocalic, pause break, distance from a pause break, duration of break, characteristics of the intonation across the entire sound unit.

7. Applications:

• Speech to Speech translation for a pair of Indian languages, namely, Hindi and Telugu.
• Command and control applications.
• Multimodal interfaces to the computer in Indian languages.
• E-mail readers over the telephone.
• Readers for the visually disadvantaged.
• Speech enabled Office Suite.

The effort for both Speech Recognition and Speech Synthesis will be repeated across all 22 Scheduled languages. For Speech Recognition, spontaneous speech data will be collected along with read speech. For speech synthesis, data will be collected from professional speakers, with very good voice quality. Additional speech data will be collected to come out with models for prosody (intonation, duration, etc.) to improve the naturalness of synthesized speech. A database (lexicon) of proper names (of Indian origin) will be created, with the equivalent phonetic representation for each of the names.


Annexure-2: Character Recognition

1. Introduction:

Character Recognition refers to the conversion of printed or handwritten characters to a machine-interpretable form, or in other terms, the "reading" of text. The term has been used to address three very distinct language technologies with different applications.

"Online" handwriting recognition or Online HWR refers to the interpretation of handwriting captured dynamically using a handheld or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and also for imparting of handwriting skills using computers.

"Offline" handwriting recognition or Offline HWR refers to the interpretation of handwriting captured statically as an image. It can be used for the interpretation of handwriting already recorded on paper, ranging from filled-in forms to handwritten manuscripts.

Optical character recognition or OCR refers to the interpretation of printed text captured as an image. It can be used for conversion of printed or typewritten material such as books and documents into electronic form.

These different areas of language technology require different algorithms and linguistic resources. However, for convenience, they have been combined under the "character recognition" umbrella. They are all hard research problems because of the variety of writing styles and fonts encountered. Of these, OCR has seen some research in a few Indian scripts because of support from the TDIL program. However, the technology is not yet mature and there is only one commercial offering. Also, there are no common linguistic resources that can be used by the community. Online and Offline HWR have seen very little research overall in the context of Indian scripts, and no linguistic resources exist for them.


2. Objectives:

Long-term objectives

(i) Development of standards, tools and linguistic resources (datasets) for the fields of Online HWR, Offline HWR and OCR.
(ii) Promotion of development of these technologies.
(iii) Promotion of development of important and challenging applications of these technologies in the context of Indic languages and scripts.

This will be achieved in a variety of ways:

• Standards development will primarily be via a mixture of email discussions and face-to-face meetings of working group members organized under the aegis of LDC-IL.

• Tool development will be given as projects to technology institutions with the necessary inclination, skills and resources.

• Linguistic data collection, annotation and validation will be given as projects to linguistics/computational linguistics departments of institutes and universities with the necessary inclination, skills and resources. However, for each linguistic resource developed, validation will be performed by a different institution from the one doing the collection and annotation. Use of the linguistic resources for technology development will be promoted by arranging periodic competitions (for example, for recognition of online handwritten words in specific scripts) and by objective evaluation of performance.

3. Implementation Phases:

Specifically, linguistic data collection will be done in two phases.

Phase I (year 1-3)

• Development of standards

Standards are key to the creation of shared linguistic resources. The LDC-IL will adopt established processes for proposing and advancing standards, working with international standards bodies wherever applicable. Standards will be proposed for datasets of online handwriting, offline handwriting and documents, and printed characters.

• Development of tools for data collection

The availability of good tools will allow researchers to start collecting data in different Indian scripts, and contribute data to LDC-IL. They are a must in order to extend support to all Indian scripts quickly. The design and development of tools for data collection and dataset creation in all three target technologies will be done.

• Promotion of technology development for specific tasks in selected scripts

The LDC-IL will promote the development and implementation of technology for Online HWR, Offline HWR and OCR in the context of specific tasks and selected scripts.

The tasks could be
(i) to interpret a line of handwriting captured using a handheld computer
(ii) to interpret a form that has been filled in and scanned
(iii) to interpret a page from a book
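
The proposal calls for objective evaluation of recognition performance but does not name a metric. For tasks such as those listed above, one commonly used measure is the character error rate: the edit distance between the recognizer output and the reference transcription, divided by the reference length. The sketch below is an assumption-laden illustration with hypothetical function names; note that it counts Unicode code points, which do not always correspond to visual characters in Indic scripts.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if r == h else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(character_error_rate("abcd", "abxd"))
```

Scoring every submission against the same independently validated reference set is what would make competition results comparable across institutions.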

Though all major Indian languages are objects of research, the Devanagari, Tamil and Telugu scripts will be addressed to begin with. These offer considerable variety in terms of visual complexity (and hence the challenge for recognition). Other scripts will be taken up in due course of time.

• Development of linguistic resources in selected scripts

The working group will drive the creation of significant linguistic resources for the tasks and scripts outlined above.

Some examples of linguistic resources are:
• Online handwritten word samples from at least 500 writers in each script
• Samples of handwritten characters extracted from forms representing at least 500 writers, and at least 500 samples of each handwritten character in each script
• Synthetic data covering all printed characters and at least 1000 pages in each script

Phase II (year 4-8)

• Refinement of standards:

Since standardization requires consensus among creators and users of linguistic resources, it is expected that the process of standardization would continue as an activity beyond the first three years.

• Refinement of tools:

The tools created in the first phase will be continuously refined during this second phase, as more and more researchers start to use them and provide feedback and suggestions for improvement.

• Extension of technology tasks and linguistic resources to remaining scripts:

The technologies developed for the initial set of scripts will be adapted for other scripts during the second phase. As in the first phase, technology development in these scripts will be supported by the creation of linguistic resources, subject to budget constraints and interest from researchers working on those scripts.

• Promotion of significant applications:

A major activity during the second phase will be the promotion of significant applications with high potential impact on society. These will typically involve solving challenging problems, multiple years of concerted effort, and close interaction between participating institutions and other researchers in India and abroad.

It is envisaged that these applications will be developed for selected languages and scripts such as Hindi, and the same will be extended to other languages and scripts with participation from researchers from all over India in due course of time.

4. Applications:

The list is meant to be indicative rather than exhaustive.

Handwriting Interface to Computers

Indian scripts are complex and not suitable for keyboard-based entry. Replacing the keyboard with a simpler and more natural interface based on handwriting would make computers much more accessible to the common man and to educators in particular.

Imagine that the keyboard is replaced with a special writing pad for handwriting input. As one writes, the writing is converted using HWR technology into words and entered into the target application. The solution would also need to support numerals, punctuation, and editing gestures, and functionally replace the keyboard.

Language technologies used: Online HWR for Indian Languages

Handwriting Tutor

The solution described above can also be adapted to provide computer-based instruction in handwriting to improve writing skills of school children, improve literacy as part of adult education programs, or allow literate adults to learn new scripts.

Language technologies used: Online HWR for Indian Languages

Multilingual Digital Libraries for Education

A wealth of literature and other education material in Indian languages is trapped in books, which require storage and are subject to physical decay. Online books, on the other hand, have no such problems, and may be made available over the Internet to students everywhere, in their schools, homes or hostels.

The proposed solution will use a complete OCR pipeline for converting scanned images of book pages into electronic form, which will then be used to create a multilingual digital library. The library can then be searched in the local language, using either spoken (using Speech Recognition) or written (using Online HWR) queries. Results can be viewed on screen, and also read out using Text-to-Speech conversion. In addition, an annotation system will allow students to make private annotations on the book. This solution can be used by individual libraries or to create district, state or national level online educational resources.

Language technologies used: OCR, Online HWR, Speech Recognition, Text-to-Speech for Indian Languages

Automatic Forms Processing/Educational Testing

With millions of application forms filled in every year in Indian languages, especially in the education sector, a solution for automatically reading handwritten entries from scanned images of forms is clearly very valuable. As a result of a growing school-going population, manual evaluation of answer papers has become very difficult. Offline HWR technology opens the possibility of automatically reading and evaluating responses (at least for fill-in-the-blanks style questions, where there are one or a few correct answers).

The proposed solution is a complete forms-processing system that can be used to read handwriting from a scanned image of a paper form. The interpreted results can be stored into a database (for applications) or compared with correct responses (for educational testing).

Language technologies used: Offline HWR for Indian Languages


Annexure - 3:

Natural Language Processing


1. Introduction:

Some of the important language data resources required in Indian languages for various NLP applications are given below:

2. Electronic dictionaries:

Electronic dictionaries are a primary requisite for developing any software in NLP.

ED 1. Monolingual/bilingual dictionaries
25,000 words per year (per language)

ED 2. Transfer Lexicon and Grammar (TransLexGram) (per language)

The Transfer Lexicon and Grammar above involves developing a language resource which would contain:
o English Headwords
o Their grammatical category
o Their various senses in Hindi
o Corresponding sense in the other Indian language
o An example sentence in English for each sense of a word
o Corresponding translation in the concerned Indian language
o In case of verbs, parallel verb-frames from English to Indian language.

As is obvious from the above, TransLexGram will be a rich lexicon which will not only contain the word level information but also the crucial information of verb-argument structure and the vibhaktis which various languages use with specific senses of a verb.

The resource, once created, will be a parallel resource not only between English and Indian languages but also across all Indian languages.

If the bilingual TransLexGram resources are created as aligned resources, several advantages will accrue. It will also reduce the work to be done for each individual resource.
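
The contents listed for a TransLexGram entry can be pictured as a small record structure. The following is only a minimal sketch in Python: the class names, field names and the completeness check are assumptions made for illustration and are not part of any agreed LDC-IL format, and placeholder strings stand in for real lexical data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    """One sense of an English headword (hypothetical layout)."""
    hindi_sense: str                 # sense of the headword in Hindi
    target_sense: str                # corresponding sense in the other Indian language
    english_example: str             # example sentence in English for this sense
    target_translation: str          # translation in the concerned Indian language
    english_verb_frame: Optional[str] = None   # verbs only
    target_verb_frame: Optional[str] = None    # verbs only, parallel to the English frame


@dataclass
class TransLexGramEntry:
    headword: str                    # English headword
    category: str                    # grammatical category, e.g. "noun" or "verb"
    senses: List[Sense] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the missing pieces; an empty list means the entry looks complete."""
        found = []
        if not self.senses:
            found.append("no senses recorded")
        for number, sense in enumerate(self.senses, start=1):
            if self.category == "verb" and not (
                sense.english_verb_frame and sense.target_verb_frame
            ):
                found.append(f"sense {number}: parallel verb frames missing")
        return found


# Placeholder strings only; no real lexical data is implied.
entry = TransLexGramEntry(
    headword="give",
    category="verb",
    senses=[Sense("<Hindi sense>", "<target sense>",
                  "<English example>", "<target translation>")],
)
print(entry.problems())
```

A check of this kind could be run by the validating institution before an entry is accepted, since a verb entry without its parallel verb frames would lose the verb-argument information that distinguishes TransLexGram from an ordinary bilingual dictionary.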


Annexure - 4:

Corpora Creation in Indian Languages

1. Introduction:

The Central Institute of Indian Languages has corpora of around 3.5 million words in many major Indian languages. These will be enlarged to 25 million words in each language. Also, the existing corpora are raw and will be cleaned for use. Apart from the 22 major Indian languages, there are hundreds of minor and tribal languages that deserve researchers' attention for their analysis and interpretation. Creating corpora in these languages will help in comparing and contrasting the structure and functioning of Indian languages. So corpora of at least 100 minor languages will be collected, to the tune of around 3 to 5 million words in each language, depending upon the availability of text for the purpose.

2. Domain Specific Corpora:

Apart from the creation of these basic text corpora, an attempt will be made to create domain-specific corpora in the following areas:

a. Newspaper corpora
b. Child language corpus
c. Pathological speech/language data
d. Speech error Data
e. Historical/inscriptional databases of Indian languages, which are among the most important resources, not only as living documents of Indian history but also for the historical linguistics of Indian languages.
f. Comparative, descriptive and reference grammars, which need to be treated as corpora or databases.
g. Morphological analyzers and morphological generators.


3. POS tagged corpora:

Part-of-speech (or POS) tagged corpora are collections of texts in which the part-of-speech category of each word is marked.

POS tagged corpora will be developed in a bootstrapping manner. As a first step, manual tagging will be done on some amount of text. A POS tagger which uses learning techniques will be trained on the tagged data. After the training, the tool will automatically tag another set of the raw corpus. The automatically tagged corpus will then be manually validated and used as additional training data for enhancing the performance of the tool. This process will be repeated till the accuracy of the tool reaches a satisfactory level. With this approach, the initial man-hours per 10,000 words will be higher. Thereafter, the tagging process will speed up.
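
The bootstrapping cycle described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tagger LDC-IL would use: the "learning technique" here is a deliberately simple most-frequent-tag baseline, the function names are hypothetical, and manual validation is simulated by supplying the correct tags for each batch.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn, for every word, the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="NOUN"):
    """Tag a sentence; unknown words get the default tag (an assumption)."""
    return [(word, model.get(word, default)) for word in words]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the validated tag."""
    pairs = [(p, g) for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs)]
    return sum(p == g for p, g in pairs) / len(pairs) if pairs else 0.0


def bootstrap(seed, validated_batches, target=0.95):
    """Repeat: train, auto-tag a new batch, validate it, add it to the training data."""
    training = list(seed)
    history = []
    for gold_batch in validated_batches:
        model = train(training)
        auto = [tag(model, [word for word, _ in s]) for s in gold_batch]
        history.append(accuracy(auto, gold_batch))
        training.extend(gold_batch)      # validated output becomes training data
        if history[-1] >= target:        # stop once accuracy is satisfactory
            break
    return train(training), history


seed = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")]]
batches = [
    [[("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]],
    [[("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")]],
]
model, history = bootstrap(seed, batches)
print(history)
```

In this toy run the accuracy on each new batch rises as validated data accumulates, which is the effect the bootstrapping approach relies on: the manual effort per batch shrinks from full tagging to correcting a decreasing number of errors.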

4. Chunked corpora:

The chunked corpora will also be prepared in a manner similar to the POS tagged corpora. Here too the initial training set will be a completely manual effort. Thereafter, it will be a man-machine effort. That is why the target in the first year is lower, and doubles in the successive years. Chunked corpora are a useful resource for various applications.

5. Semantically tagged corpora:

The real challenge in any NLP and text information processing application is the task of disambiguating senses. In spite of long years of R & D in this area, fully automatic word sense disambiguation (WSD) with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is the "semantically tagged corpora".

In semantically tagged corpora, words in the text documents will be marked with their correct senses. For example, in
"Can a can can soup"
apart from POS tagging, it is also necessary to tag the text as
"Can a can <included-in-set:container> can <included-in-set:hold-action> soup"
which is an example of semantic tagging.

The question that arises is "What should be the set of such tags and where should these come from?". A natural answer to this is obtained when we look at the "WordNet". WordNet is a semantic structure where "relational semantics" is exploited to encode the senses of words. The basic machinery for sense representation is the accumulation of synonyms into 'synsets' and also enumerating the semantic relations like 'hypernyms', 'meronyms' etc. For example, the 'included-in-set' tag above is the hypernymy (superordinate) relation which disambiguates the sense.
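
The tagging of the example sentence can be reproduced with a toy lookup. The sketch below assumes a hypothetical two-entry sense inventory keyed by word and POS tag; a real system would draw its tags from WordNet synsets and would need more context than the POS tag alone to choose a sense.

```python
# Hypothetical miniature sense inventory: (word, POS tag) -> hypernym label.
# In practice these labels would come from a WordNet, not a hand-written table.
HYPERNYMS = {
    ("can", "NOUN"): "container",
    ("can", "VERB"): "hold-action",
}


def semantic_tag(pos_tagged_sentence):
    """Append an <included-in-set:...> tag to every word found in the inventory."""
    tagged = []
    for word, pos in pos_tagged_sentence:
        hypernym = HYPERNYMS.get((word.lower(), pos))
        if hypernym:
            tagged.append(f"{word} <included-in-set:{hypernym}>")
        else:
            tagged.append(word)
    return " ".join(tagged)


sentence = [("Can", "MODAL"), ("a", "DET"), ("can", "NOUN"),
            ("can", "VERB"), ("soup", "NOUN")]
print(semantic_tag(sentence))
# prints: Can a can <included-in-set:container> can <included-in-set:hold-action> soup
```

The sketch also shows why POS tagging comes first: the three occurrences of "can" are told apart here only because they already carry different POS tags.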

Following are the steps towards creating semantically tagged corpora:

1. Develop, refine and make widely available Indian language WordNets. (IITB is developing Hindi and Marathi WordNets; AU-KBC and Tanjavur University are working on Tamil WordNets. Similarly, other language WordNets are being created at other places.)

2. Link the WordNets into the "Indo-WordNet"- a massive semantic structure of Indian language WordNets.
3. Link the Indo WordNet to English and Euro-WordNets.

4. Create large amounts of sense tagged corpora manually for the purpose of training a 'sense tagging machine'. The tags are the INDO-WORDNET SYNSET NUMBERS.

5. Devise algorithms for the training task. Hidden Markov Models, entropy maximization, etc. are the possible candidates.

6. For the purpose of semi-automatic semantic tagging, invest in user-friendly and intelligent user interfaces.

Semantically tagged corpora are a valuable resource, which will be constructed using the Indian language WordNets and then employing machine learning algorithms (as in the case of the POS taggers discussed above).

6. Syntactic tree bank:

Preparation of this resource requires a higher level of linguistic expertise and more human effort. To prepare this corpus, experts will manually tag the data for syntactic parsing. A tool can then automatically extract various tree structures for the tree bank. Since it requires more manual effort and also a higher degree of linguistic expertise, building this resource will be a relatively slower process. The initial take-off time will also be longer in this case.

A crucial point related to this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis and the methodology to be followed. This calls for discussions amongst scholars from varying fields such as linguistics and computer science, which will be held through workshops and meetings. First, some Sanskrit scholars, linguists and computer scientists will review the existing tagging scheme developed for Indian languages by IIIT, Hyderabad and define standards for all Indian languages (extendable to any language). On this basis, some experiments will be carried out on selected Indian languages to test the applicability and quality of the defined standards. After this testing, the actual tagging task will start.

7. Parallel aligned corpora:

A text available in multiple languages through translation constitutes parallel corpora. The National Book Trust and Sahitya Akademi are among the official agencies that develop parallel texts in different languages through translation. Such institutions have given permission to the Central Institute of Indian Languages to use their works for creating electronic versions as parallel corpora. Magazines and newspaper houses that bring out translated versions of their output are another source of texts for parallel corpora. First, wherever necessary, the texts have to be keyed in, and then computer programmes have to be written for creating

[I] Aligned texts; [II] Aligned sentences; and [III] Aligned chunks.
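
As an illustration of the kind of programme needed for level [II], the sketch below aligns two lists of sentences by comparing their character lengths with dynamic programming. It is a simplified, length-only method in the spirit of classical length-based sentence alignment, not a full statistical model; the function name, the `skip_cost` value and the restriction to one-to-one matches and skips are all assumptions made for the example.

```python
def align(source, target, skip_cost=10.0):
    """Align sentences by length; returns (source_index, target_index) pairs,
    with None marking a sentence that has no counterpart."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def relax(i, j, new_cost, previous):
        # Keep the cheapest way found so far of reaching cell (i, j).
        if new_cost < cost[i][j]:
            cost[i][j] = new_cost
            back[i][j] = previous

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            if i < n and j < m:   # pair the two sentences; cost is the length gap
                gap = abs(len(source[i]) - len(target[j]))
                relax(i + 1, j + 1, cost[i][j] + gap, (i, j))
            if i < n:             # leave a source sentence unmatched
                relax(i + 1, j, cost[i][j] + skip_cost, (i, j))
            if j < m:             # leave a target sentence unmatched
                relax(i, j + 1, cost[i][j] + skip_cost, (i, j))

    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):       # walk back along the cheapest path
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        elif pi == i - 1:
            pairs.append((i - 1, None))
        else:
            pairs.append((None, j - 1))
        i, j = pi, pj
    return pairs[::-1]


# Synthetic strings stand in for real sentences; only their lengths matter here.
print(align(["a" * 4, "b" * 30, "c" * 2], ["A" * 5, "C" * 2]))
# prints: [(0, 0), (1, None), (2, 1)]
```

Length alone is a weak signal across scripts with different character inventories, so a practical tool would likely combine it with dictionary evidence (for instance from the bilingual resources described in Annexure 3) and would still need the manual alignment interface listed under Tools.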


8. Tools:

1. Tools for Transfer Lexicon Grammar (including creation of an interface for building Transfer Lexicon Grammar).
2. Spellchecker and corrector tools.
3. Tools for POS tagging (trainable tagging tool with an interface for editing POS tagged corpora).
4. Tools for chunking (rule-based language-independent chunkers).
5. Interface for chunking (building an interface for editing and validating the chunked corpora).
6. Tools for syntactic tree bank, including an interface for developing the syntactic tree bank.
7. Tools for semantic tagging, whose basic resources are the Indian language WordNets: a browser that has two windows, one showing the text and the other showing the senses (i.e., synsets) from the WordNet, after which a manual selection of the sense can be done.
8. (Semi) automatic tagger based on statistical NLP (the preliminary version of which is ready at IITB).
9. Tools for text alignment, including a text alignment tool, a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora.


Annexure-5:

Notes on Items


Item 1: Server & Proxy Server (Specs)

Example: IBM Mainframe Servers; zSeries 990
Price: $63,000 (approx)

Physical Configuration
Weight (unpacked) - 2007 kg
Footprint - 2.49 Sq. meters
Service Clearance - 5.45 Sq. meters
Input Power - 21.39 kVA
Heat Output - 72.73 KBTU/hr
Air Flow - CFM 3250, m3/m
Height - 194.1 cm (76.4 inches)
Hardware Model - D32 upwards
Coupling Links - Max # Links 64
Channels - 512/120/48/16 (ESCON/FICON Express/OSA-Express/HiperSockets)
Cryptographic - PCI Crypto Accelerator - up to 12 optional (up to 6 cards)
Processor Memory - 256 GB

Software - z/OS 1.2 and subsequent releases
z/VM 3.1, z/VM 4.2 and subsequent releases
Red Hat, SuSE, Turbolinux
OS/390 2.10
VSE/ESA 2.5 and subsequent releases
TPF 4.1 (ESA mode only)


Item 2: Large Serial Disk System or equivalent

Large storage systems like IBM 7133 Serial Disk System advanced models D40 and T40 provide highly available storage for UNIX, Windows NT, and Novell NetWare servers. By implementing a powerful industry-standard serial technology, the 7133 advanced models D40 and T40 provide outstanding performance, availability, and attachability.

Product highlights

• Should provide outstanding disk storage performance with advanced SSA bandwidth of 160 MB/sec
• High availability with redundant data paths, redundant cooling units, and additional (at least two) power supplies
• Should be enterprise-strength storage for distributed systems
• High availability to safeguard data access
• Ultra-high-speed 15,000 rpm hard disk drives, available in higher capacities (at least 36.4 GB or 72.8 GB)
• Simplified storage management
• Shared storage for all major types of servers plus scalability for fast-growing environments
• Facilitates remote mirroring (up to 10 km connection distances) with the Advanced SSA Optical Extender

Item 3: High-end scanners

Possible models:

Creo (www.creo.com)
EverSmart Pro II; 8000-pixel trilinear CCD; 11.8x17; 11.8x17; 3175x8200; $29,950
EverSmart Select; 8000-pixel trilinear CCD; 11.8x17; 11.8x17; 5600x11,400; $34,950
EverSmart Supreme; 8000-pixel trilinear CCD; 11.8x17; 11.8x17; 5600x14,000; $44,950

Fujifilm
Fujifilm Superlinear CCD; 18.5x13.8; 18.5x13.8; 13,500; $25,000

Screen USA
Cezanne Elite; 8000-pixel trilinear CCD; 13x20.9; 13x20.9; 5300; $34,000

Item 4: High Speed LAN printers

Standard

Item 5: Satellite PCs attached to servers or equivalent

Like Compaq Alphaserver DS20 System

Specifications: $18,000 each

Processor/cache - Up to two 500-MHz Alpha 21264; each with 64-KB I-cache, 64-KB D-cache on chip & 4 MB of ECC onboard cache
Memory - 128 MB ECC SDIMM memory, expandable up to 4 GB
System architecture - Dual 256-bit wide memory data paths & cross-bar switch technology providing 5.2 GB/s (peak) memory bandwidth; dual 64-bit PCI buses providing 532 MB/s I/O throughput
Performance - 11,616 tpmC @ $50.58/tpmC/18.01.99; 6,065 Specweb96 (2 CPU), 4,092 (1 CPU); 27.7 Specint95, 58.7 Specfp95 (1 CPU); 76.1 Specfp SMP (2 CPU)
Internal expansion - 6 slots: 5 PCI, 1 PCI/ISA
Storage controllers - Dual-channel Fast SCSI-2, FW SCSI-2, FWD SCSI-2, Ultra SCSI RAID, CI, DSSI
Network controllers - 10/100 Ethernet, FDDI, Token Ring, asynchronous communications
Drive bays (removable) - Seven hot-pluggable StorageWorks drive bays for 4-GB, 9-GB or 18-GB disks (maximum 128-GB). Three removable media bays: one 3.5" bay for diskette drive; one 5.25" for CD-ROM; one for tape or hard disk
Power supply - Optional 675 watt redundant power supply
Interfaces - Two serial, one parallel, keyboard, mouse
High availability - Server management software, optional redundant power supply, auto reboot, thermal management software, remote system management, RAID, hot-pluggable drives, memory failover, ECC memory, ECC cache, SMP CPU failover, error logging, optional uninterruptible power supply (UPS) & UPS Power Management Software
Operating systems - Tru64 UNIX 4.0E, OpenVMS 7.1-2, MS Windows NT 4.0 SP3, LINUX


Annexure-6:

International Workshop On Linguistic Data Consortium In Indian Languages
For Language Technology R & D

Central Institute of Indian Languages

in collaboration with
HP Labs India & IIIT, Hyderabad

LIST OF PARTICIPANTS

01. Prof. Rajeev Sangal (sangal@iiit.net)
02. Prof. Yegnanarayana (yegna@cs.iitm.ernet.in)
03. Prof. Mark Liberman (myl@unagi.cis.upenn.edu)
04. Dr. Hemant Darbari (darbari@cdac.ernet.in)
05. Dr.G. Uma Maheshwara Rao (guraosh@uohyd.ernet.in)
06. Dr. SriGanesh Madhvanath (srig@india.hp.com)
07. Dr.A.G. Ramakrishnan (agram@india.hp.com)
08. Dr. KSR Anjaneyulu (anji@india.hp.com)
09. Dr.Roger Tucker (roger_cf_tucker@yahoo.co.uk)
10. Sri Joel Pinto (joelp@india.hp.com)
11. Dr. Dipti Mishra Sharma (dipti@iiit.net)
12. Dr. Pushpak Bhattacharya (pb@cse.iitb.ac.in)
13. Dr. Hema Murthy (hema@lantana.tenet.res.in)
14. Ms. Kalika Bali (kalika@india.hp.com)
15. Dr.Ksenia Shalonova (kacniya-spb@hotmail.com)
16. Dr. Nandini Chatterjee Singh (nandini@nbrc.ac.in)
17. Sri Sai Jay Ram (sai@cdotb.ernet.in)
18. Dr.V. Ramasubramaniam (vram@ece.iisc.ernet.in)
19. Dr. Nitendra Rajput (rnitendra@in.ibm.com)
20. Sri GVD Prasad Rao (gdattu@india.hp.com)
21. Sri Deepu Vijayasenan (deepuv@india.hp.com)
22. Sri B.Ajay Sivaram (joyshiv@india.hp.com)
23. Sri C.S.Ramalingam (ramali@ee.iitm.ernet.in)
24. Ms. Aparna Subramanian (aparnasubramanian@indiatimes.com)
25. Dr. Ravi Shankar (ravi_ling@yahoo.com)
26. Dr.Ananthapadmanabha (anantha@blr.vsnl.net.in)
27. Sri Partha Pratim Talukdar (partha@india.hp.com)
28. Dr.K. Narayanamurthy (knmuh@yahoo.com)
29. Dr.R.N.V. Sitaram (sirn@india.hp.com)
30. Dr.C. Chandrasekhar (chandra@cs.iitm.ernet.in)
31. Dr. V Jawahar (jawahar@iiit.net)
32. Dr. Anvita Abbi (anvita@jnuniv.ernet.in)
33. Dr.S.Ramani (ramani@india.hp.com)
34. Prof. Etienne Barnard
35. Ms. Marelie Davel

PARTICIPANTS FROM MHRD

01. Ms. Kumud Bansal, Additional Secretary (kumudb@sb.nic.in)
02. Ms. Bela Banerjee, Jt. Secretary (L) (bela.edu@sb.nic.in)


PARTICIPANTS FROM CIIL

1. Prof. Udaya Narayana Singh (udaya@ciil.stpmy.soft.net)
2. Prof. J.C. Sharma (sharma@ciil.stpmy.soft.net)
3. Dr. Rajesh Sachdeva (rajesh@ciil.stpmy.soft.net)
4. Dr.K. Ramasamy (ramaswamy@ciil.stpmy.soft.net)
5. Dr.K.S.Rajyashree (rajya@ciil.stpmy.soft.net)
6. Dr.A.K. Basu (basu@ciil.stpmy.soft.net)
7. Dr.Pon Subbiah (subbiah@ciil.stpmy.soft.net)
8. Dr.Sam Mohan Lal (mohan@ciil.stpmy.soft.net)
9. Ms.Rekha Sharma (rekha@ciil.stpmy.soft.net)
10. Dr.B. Mallikarjun (mallikarjun@ciil.stpmy.soft.net)
11. Dr.V. Saratchandra Nair (nair@ciil.stpmy.soft.net)
12. Dr.I.S. Borkar (borkar@ciil.stpmy.soft.net)
13. Dr.P.P. Giridhar (giridhar@ciil.stpmy.soft.net)
14. Sri N.H. Itagi (nhitagi@ciil.stpmy.soft.net)
15. Dr.B.A. Sharada (sharada@ciil.stpmy.soft.net)
16. Sri G.Vijayasarathi (gvsarathi@ciil.stpmy.soft.net)


Annexure - 7:

Minutes of the Meeting on Linguistic Data Consortium held on 24.2.2004
at 11.30 AM in the Chamber of ES

A meeting on the proposal of Central Institute of Indian Languages, Mysore to set up a Linguistic Data Consortium for Indian Languages was held in the chamber of the Education Secretary on 24.2.2004 at 11.30 AM. The meeting was chaired by the Education Secretary and attended by the following members:

1) Additional Secretary, Dept. of Sec. & Higher Education, Ministry of HRD
2) Joint Secretary (Languages)
3) Director (Finance)
4) Director, Central Institute of Indian Languages, Mysore
5) Academic Secretary, Central Institute of Indian Languages, Mysore.

2. At the outset JS(L) welcomed the chairman and the members and asked Director, CIIL, Mysore to give a presentation on the proposal to set up the Linguistic Data Consortium. Director, CIIL gave a presentation explaining the meaning and background of Linguistic Data Consortium and need to set up such a Linguistic Data Consortium for Indian Languages (LDC-IL) on the model of Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania, USA in 1992. Director, CIIL explained that CIIL, Mysore proposes to set up a Linguistic Data Consortium in collaboration with institutions working on Indian Language Technology like Indian Institute of Science, Bangalore, Indian Institute of Technology, Madras, Indian Institute of Technology, Bombay and International Institute of Information Technology, Hyderabad, etc.

3. Director, CIIL stated that CIIL has corpora of 45 million words in 15 Indian languages and explained that he has made a revised proposal after consulting the institutions mentioned in para 2 above through teleconferencing, e-mail, chats and other exchanges.

4. He informed the members that LDC-IL has mass application tools and products, which include SMS in Hindi and application of the technology in call centres, etc. The creation of the consortium will necessitate a vast survey to collect high-quality data useful for the purpose. It was emphasized that resources must be shared between the Government and non-Government institutions/individuals in this regard. The major areas of the Linguistic Data Consortium are:

1) Speech Recognition and Synthesis
2) Character Recognition
3) Corpora Creation, in Indian Languages
4) Several byproducts like Lexicon, Thesauri etc.

5. The LDC-IL will be located in CIIL, Mysore. CIIL's prior experience in creating databases and its expertise in linguistics make it an ideal location to host and manage the LDC-IL. CIIL will require human resources and other facilities to run and develop the project. Director, CIIL stated that all engagement of manpower will be on a contract basis.

6. The budget of LDC-IL, which covers human resources, tasks, events, equipment, maintenance and Intellectual Property Rights (IPR), will be Rs.220.10 lakhs per year. Rs.1772.8 lakhs will be required for the next eight years, covering the present and the next plan period. Director, CIIL stated that they should be granted money in their Personal Ledger Account so that money generated out of the project can also be credited to this account and used as required for the project with the approval of the Project Advisory Committee. JS(L) stated that the project is to become self-sufficient in eight years' time.

7. AS stated that the requirement of funds of CIIL should go down progressively as the project is to become self sufficient in eight years.

8. Director (Finance) remarked that the proposal looks promising but that it needs to be specifically indicated how the project will become self-sufficient in managing LDC-IL over the eight-year period. CIIL may indicate the approximate inflow of funds during the project period.

9. Education Secretary stated that the names of the institutions, which are willing to collaborate in the project with CIIL, Mysore should be clearly indicated. He also desired that differential rates of utilization for various categories of users such as private institutions, University Departments and individual users should be clearly indicated and got approved. Education Secretary stressed upon the need to spell out adequate safeguards vis-a-vis Intellectual Property Rights of LDC-IL so that its contents/resources may not be hijacked by foreign collaborators.

10. JS(L) suggested that an IPR lawyer or expert may be engaged by CIIL while drafting the LDC-IL agreement. Director, CIIL stated that they can have one IPR specialist in the Governing body. AS observed that LDC-IL application will have big scope in the business and industry.

11. Director (Finance) enquired about the manpower requirement for the proposal. At this Director, CIIL explained that all the people may not be located in CIIL but in other institutions (collaborating with CIIL) also. He stated that a Project Advisory Committee will do the selections to recruit the staff on contract basis. It was suggested by JS(L) that a representative of Ministry of IT may be included in the PAC.

12. After detailed discussions it was decided that (a) CIIL will submit a detailed proposal after incorporating the suggestions given in this meeting. (b) Based on the proposal, EFC should be prepared for the project.

The meeting ended with a vote of thanks to the chair from JS(L).


Annexure - 8:

Minutes of the Meeting to Consider the Project Proposal on
Linguistic Data Consortium for Indian Languages
held on 26.7.2004 at 3.00 PM in the Chamber of ES

A meeting to consider the Project Proposal of Central Institute of Indian Languages, Mysore to set up a Linguistic Data Consortium for Indian Languages (abbreviated as LDC-IL) was held in the chamber of the Education Secretary on 26.7.2004 at 3.00 PM. The meeting was chaired by Shri B.S.Baswan, Education Secretary and attended by the following members:

1) Shri Sudeep Banerjee, Additional Secretary, Dept. of Sec. & Higher Education, Ministry of HRD
2) Smt. Bela Banerjee, Joint Secretary (Languages)
3) Shri V.K. Pipersenia, Financial Advisor, Ministry of HRD
4) Shri Madhukar Sinha, Director (Languages)
5) Prof. Udaya Narayana Singh, Director, Central Institute of Indian Languages, Mysore
6) Dr. B. Mallikarjun, Academic Secretary, Central Institute of Indian Languages, Mysore.

The Director, Central Hindi Directorate, Delhi, was also present.

2. At the outset JS(L) welcomed the chairman and the members and asked Director, CIIL, Mysore to give a presentation on the revised proposal to set up the Linguistic Data Consortium. Director, CIIL gave a presentation explaining the need for setting up of Linguistic Data Consortium for Indian languages and the background of this proposal. He suggested that the proposed Linguistic Data Consortium for Indian Languages (LDC-IL) may be set up on the model of Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania, USA in 1992, with initial and core funding from the Ministry of Human Resource Development, Government of India. He further pointed out that although the LDC at University of Pennsylvania had huge databases of some non-western languages like Chinese, Japanese, Korean and Arabic, there are, at present, no resources of comparable standards for Indian languages, whereas there are already demands for such data from different sectors, including software and telecom sectors. Director, CIIL explained that CIIL, Mysore proposes to set up the LDC-IL in project mode where it will work closely with some other major technical institutions working on Indian Language Technology such as Indian Institute of Science, Bangalore, Indian Institute of Technology, Madras, Indian Institute of Technology, Bombay and International Institute of Information Technology, Hyderabad, etc., - besides some software giants from the private sector who might be interested in this activity, such as HP Labs, IBM, Infosys, Wipro, etc. He also presented the communication from Shri R.K.Arora, Senior Director, Ministry of Communication & Information Technology (looking after the TDIL or Technology Development in Indian Languages programme) assuring both active participation and full support to this activity if undertaken by the Institute.

3. Director, CIIL stated that since CIIL already houses and distributes corpora of 45 million words in 15 Indian languages, and because it has already collaborated with the University of Lancaster to combine these with the Emille resources of another 45 million words in 5 Indian languages (as spoken in the UK), releasing the Emille-CIIL Corpora in Unicode format since January 2004, it is logical that this next major step be taken now to expand this activity in several directions to realize the goal of LDC-IL.

4. He informed the members that LDC-IL will have mass application once databases, tools and products are generated as a result of this activity. This would include voice interface products in Hindi and other Indian languages such as telephone queries, mobile services, dictation software and speech-driven systems, and application of the technology for automated recognition systems such as OCRs, handwriting recognition, etc. However, he pointed out that LDC-IL will need to undertake a systematic survey to collect high-quality linguistic data of different kinds, from different domains and from different segments of the population, for it to be useful to software application tools. The major areas of focus of LDC-IL are:

1) Speech Recognition and Synthesis
2) Character Recognition
3) Natural Language Processing (NLP)
4) Corpora Creation in Indian Languages, including Parallel Corpora, Spoken Corpus, etc.
5) Several by-products like Word finders, Lexicon, Thesauri, Spell-checkers, Grammar-checkers, Auto-summarization, Tree-banking Tools, Skeletal and Shallow Parsers, Statistical Probabilities Models, Idioms Dictionaries and Chunkers, etc.

5. The LDC-IL will be located in CIIL, Mysore. CIIL's prior experience in creating databases and its expertise in linguistics make it an ideal location to host and manage the LDC-IL. CIIL will require additional human resources and other facilities to run and develop the project. It may require a few core staff at the professorial/senior level or maintenance personnel. However, Director, CIIL pointed out that most of the manpower, i.e., the remaining 40-odd manpower requirement, will be met contractually. If and when the core operations of LDC-IL become self-supporting, all such contractual staff will continue to be engaged only out of the resources generated.

6. The budget of LDC-IL will be Rs.221.60 lakhs per year, i.e., Rs.1772.8 lakhs over a period of eight years. This will cover contractual human resources, project grants or expenses towards tasks for participating national institutions and agencies, expenses on training programmes, workshops, seminars and other events, specialized software and equipment to strengthen the already existing infrastructure, equipment maintenance expenses, and payment of royalties for data for which others hold the Intellectual Property Rights (IPR).

7. Director, CIIL stated that the institute should be granted money in its Personal Ledger Account and/or be allowed to maintain a special bank account for transactions of LDC-IL as well as to house its Corpus Funds, so that money generated out of the project could also be credited to this account and used as required for the project with the approval of the Project Advisory Committee.

8. ES suggested that the enterprise should not be confined to government-run institutions only, and that the management of the project has to be evolved in such a manner that a balance between Public and Private players in the field is maintained.

9. AS pointed out that the proposal is worthy of support even if no private sector group is ready to invest in this activity. He commented that even in the USA, private groups will think about investing in such an activity only if whoever is running the LDC-IL demonstrates to them the range and richness of the databases and tools as well as services. Director (L) intervened to suggest that the project is worthy of support even as an intellectual exercise, without considering when exactly it would achieve self-sufficiency.

10. JS(L) commented that only after the project makes significant progress would the software and telecom industries be interested in joining the LDC-IL group, either as corporate members or on its board. However, the requirement of funds flow to the CIIL for this project would go down progressively, as the project is to become self-sufficient in eight years.

11. The Financial Advisor remarked that the proposal looks promising and that, considering its importance and demand, operationalizing the entire project should be taken up immediately in the project mode, without going into the creation of a Society for this purpose. However, the Project Management Structure needs to be considered carefully to include Government, Academic Institutions and Private Enterprises in managing the affairs of the LDC-IL.

12. After detailed discussions it was agreed that (a) the LDC-IL proposal of the CIIL is approved in principle; (b) an SFC-note for the establishment of LDC-IL will be prepared immediately for further processing, and towards that end the Language Bureau will provide advice to the CIIL to prepare the document at an early date; (c) a two-page note on LDC-IL will be prepared for inclusion in the folder to be distributed at the time of the CABE meeting on August 10-11, 2004; and (d) wide publicity may be given to this proposed activity of the MHRD to attract potential investors/members of LDC-IL.

The meeting ended with a vote of thanks to the chair from JS(L).


(MADHUKAR SINHA)
Director (L)