2. A Model: An ideal model of such a consortium is
best exemplified by the success of the Linguistic Data Consortium
(LDC), hosted by the University of Pennsylvania, USA.
It is an open consortium of universities, companies
and government research laboratories that creates, collects
and distributes speech and text databases, lexicons,
and other resources for research and development purposes.
This initiative, known as the 'LDC', was established in 1992
with an initial US government grant to provide a new
mechanism for large-scale development and widespread
sharing of resources for research in linguistic technologies.
It now includes more than 100 companies, universities,
and government agencies as its active users and members.
The core consortium operations of the LDC are now fully
self-supporting after a period of ten years. The activities
include maintaining the data archives, producing and
distributing CD-ROMs, and arranging networked data distribution,
among other things. This has provided a great impetus
to R&D in the field of language technology for English
and other European languages. The LDC is still hosted by the
University of Pennsylvania but has grown
from a government-funded project into an independent
initiative. It is proposed to adopt a similar approach
in the Indian context.
3. Indian Languages and LDC-IL: In this context,
the Central Institute of Indian Languages, Mysore and
other like-minded institutions working on Indian language
technology, such as the Indian Institute of Science, Bangalore,
the Indian Institute of Technology, Bombay, the Indian Institute
of Technology, Madras, and the International Institute
of Information Technology, Hyderabad, propose
to set up a Linguistic Data Consortium for Indian Languages
(LDC-IL) that will help researchers and developers
worldwide in the field of corpus linguistics and language
technology related to Indian languages. It will be known
as 'LDC for Indian Languages' (LDC-IL), and these institutions
will be known as the Lead Institutions in this initiative.
4. Current status of Indian Languages Data: During
the 1990s, collection of linguistic data began in order to build
Indian-language corpora under the Technology Development
in Indian Languages (TDIL) project of the Ministry of
Communication and Information Technology of the Government
of India. This initiative resulted in the creation of
45-million-word corpora in 15 Indian languages, which
are housed at CIIL and distributed through this Institute
free of cost for non-commercial and research purposes,
as per the understanding with the Ministry of CIT, Government
of India. This has helped other applied linguistic
research, such as in Language Teaching, Discourse
Analysis, and Language Acquisition. It also formed
the basis for technology development for providing pen
and OCR-based applications, voice-based services, and
information retrieval mechanisms in Indian languages.
In the proposed initiative under the LDC-IL, all institutions,
agencies and industries interested in Indian
language technology will be requested
to contribute their collections to initiate the activities
of the LDC-IL. However, due to the complexity of human
languages, any development in such areas requires a truly
large amount of data in the form of speech, text, dictionaries
etc. as a resource, and this data must be available to those
who need it as a shared resource. This enlargement of databases
and enhancement of high-quality corpora will be the
responsibility of the lead institutions.
Thus, the proposed project will bring all those interested in
data-based R&D in Indian language technology and in
sharing resources and tools into a consortium,
in the interest of the entire Language Technology community
in India.
5. To sum up the proposed activities, LDC-IL will focus
on:
• Becoming a repository of linguistic resources in all
Indian languages in the form of text, speech and lexical
corpora.
• Facilitating creation of such databases by different
organizations.
• Setting standards for data collection and storage
of corpora for different research and development activities.
• Supporting development and sharing of tools for
data collection and management.
• Facilitating training through workshops, seminars
etc. in technical as well as process-related issues.
• Creating and maintaining the LDC-IL website that
would be the primary gateway for accessing LDC-IL resources.
• Designing or providing help in creation of appropriate
language technology for mass use.
• Providing the necessary linkages between academic
institutions, individual researchers and the masses.
Major areas of Linguistic Resource Development :
a. Speech Recognition and Synthesis
b. Character Recognition
c. Corpora Creation in Indian Languages
d. Several by-products such as lexicons, thesauri etc.
The details of coverage of the major areas of activity under
LDC-IL are given in Annexures 1 to 4.
6. Participating Institutions : All academic institutes,
research organizations and corporate R&D groups
from India and abroad working on Indian languages will
be encouraged to participate in LDC-IL. Besides the
lead institutions already mentioned, other possibly
interested institutes and organizations may include:
Indian universities with major departments
of Linguistics and Computer Science; ISI Calcutta; TIFR
Mumbai; HP Labs India; IBM; C-DOT; C-DAC; Tata InfoTech;
all other IITs; KHS; NCPUL; Rashtriya Sanskrit Sansthan;
TDIL, MIT; and Departments of Linguistics of various universities
in India, beginning with CALTS of the University of Hyderabad.
This list is open-ended and suggestive, not exclusive.
7. Location : The services under the proposed
LDC-IL will be hosted and managed by the Central Institute
of Indian Languages, Mysore. This would leverage expertise
available in the institution and their prior experience
in creating databases. The host institute will require
additional space, human resources and other facilities
to run the full-time activity of the development and
maintenance of the LDC-IL. It is important to mention
that all engagements of personnel for the LDC-IL
will be on a contractual basis, with selection through the appropriate
procedure following the norms fixed by the Government
of India for reservation etc. However, since the LDC-IL
is an R&D activity as a whole, it is natural that a number
of national institutions will be involved. They will
be invited to contribute to the areas listed under Annexures
1 through 4, where specific research areas are identified
and elaborated. All these activities and the funding
required to carry on such R&D work will be coordinated
by the CIIL. However, depending on the experience of
the lead institutions, the relevant sub-committees under
the Project Advisory Committee (PAC) will help in arriving
at appropriate decisions.
8. Funding - Grants : For the core funding, the Department
of Secondary and Higher Education will create an appropriate
scheme under which a substantial one-time grant during the Xth
Plan, and possibly another set of grants during the
next Plan, will be released to the host institution, i.e. CIIL.
As the nodal agency, CIIL will further distribute
the relevant funding for specific sub-components of
the scheme to other academic institutions based on an
intent of understanding to be signed between the Heads of
such institutions and CIIL. It is also understood that
this release of funds and its management will be in
a project mode through CIIL's PL account. CIIL may be
allowed to set up a separate account for LDC-IL
activities only. All receipts and payments relating to
potential users of the services and the
linguistic data will pass through this account, creating a corpus
fund and generating revenues that would move the scheme towards
self-sustenance within a ten-year time frame. Annual accounts
of the utilization of funds and of revenue generation
will be submitted by CIIL to the MHRD, along with a progress
report every quarter. Out of the revenue generated, 40% will
go towards creation of the corpus fund and the remaining 60% will
be ploughed back into implementing the tasks.
Since TDIL-MCIT has promised full support to this activity
in its Advisory Committee meeting held at the Indian Institute
of Science, Bangalore on 16th April 2004, in the presence
of its FA, and since all linguistic data collected or
generated through TDIL projects would also be submitted
to the LDC-IL through a mutual arrangement between
the two Ministries, a copy of the project report as well
as all periodic reports will also be sent to the TDIL-MCIT
by CIIL.
9. Annual Membership fee :
(a) Open to academic institutes, research organizations,
and the corporate sector from all over the world.
(b) Members will be encouraged to contribute databases
and will be offered a monetary share of the revenues earned
by the databases they contribute.
(c) The databases will be available for R&D purposes
to all members and non-members on payment of the appropriate
fee, with a license for use only. The organization will
be asked to sign a License Agreement to the effect that
the databases will only be used for research and development
and may not be distributed by it to any other institute
or individual either free or for a fee. However, the
IP and the copyright of any product developed as a result
of such an R&D activity shall lie with the organization
that has created the product.
A differential rate of annual fee will be charged, as given below:
India:
1. Individual Researchers: Rs.2000/- per annum
2. Educational Institutions: Rs.20,000/- per annum
3. Software and related industry: Rs.2,00,000/- per annum
Other countries:
1. Individual Researchers: $ 2000/- per annum
2. Educational Institutions: $ 20,000/- per annum
3. Software and related industry: $ 50,000/- per annum
10. Governance of the LDC-IL :
The LDC-IL will have a Project Advisory Committee (PAC) consisting
of members from each of the participating lead institutions.
The permanent members will include Directors or nominees
of the lead institutions already identified. The PAC
may be expanded later in the event of other major
institutions joining this activity, and some representatives
from private sector IT initiatives may also be included
in the PAC at a later date. The total number of PAC
members shall not ordinarily exceed 21, including the
ex-officio members.
1. It is to be understood that even if institutions from
abroad join this Consortium, its administration/governance
will remain with Indian members only.
2. Two officials of the Language Bureau nominated by the
Ministry of HRD will be members of the PAC. At least
one of them will be from the finance division.
3. One official representative from the Ministry of Information
and Communication Technology, Government of India.
4. One expert in IPR matters, normally drawn from institutions
such as the National Law School University, Bangalore.
5. The Director, Central Institute of Indian Languages
will be the Head of the LDC-IL. He will be assisted
by a Project Director nominated/appointed for the purpose.
6. The Project Director will be the ex-officio convener
of the advisory committee.
11. About this Proposal : This proposal on LDC-IL has
evolved through the participation of scholars from different
institutions in India and abroad. A summary of the process
is as follows:
(a) On August 13, 2003, Prof. Udaya Narayana Singh, Director,
Central Institute of Indian Languages, made a presentation
about the concept of such a consortium and the need for
its creation by the Ministry of HRD.
(b) On August 17 and 18, 2003, an International Workshop
on Linguistic Data Consortium for Indian Languages was
held at the Institute. It was inaugurated by Smt.
Kumud Bansal, Additional Secretary, and Smt. Bela Banerjee,
Joint Secretary (L) from the Ministry, also participated.
The workshop was conducted in collaboration with IIIT Hyderabad
and HPLabs, India, and was attended by participants from the LDC,
University of Pennsylvania and most of the institutes
involved in computational linguistics in India.
(c) On August 19, 2003, a follow-up meeting was held
at the Indian Institute of Science, Bangalore, in which
a select group of experts from institutions involved
in language technology applications, including HPLabs and
MAIT, participated and formed a group of experts from
IITB, IITM, IISc and IIIT Hyderabad, with the Director, CIIL
as Coordinator, to formulate a formal proposal to be
submitted to the Ministry.
(d) Through teleconferences and email correspondence,
a proposal was formulated and sent to the Ministry vide
letter No.F.2-1/2003/LDC-IL dated November 18, 2003.
(e) On December 19, 2003, a meeting of representatives
of the lead institutes was held at the Institute to discuss
the proposal submitted to the Ministry.
(f) With additional inputs received from them, the present
proposal was formulated and presented on February 24,
2004 at the MHRD.
(g) The proposal was further modified to include suggestions made
after the presentation by the Director, CIIL before the Secretary,
Additional Secretary, Joint Secretary (L) and Director (Finance)
in the Department of Education, Ministry of HRD on February
24, 2004.
(h) On April 16, 2004, the present
version of the proposal was presented by the Director,
CIIL before the Advisory Committee of the TDIL-MCIT
(Technology Development in Indian Languages, Ministry
of Communication & Information Technology). The
Financial Adviser of the MCIT was also present during
the presentation, and the suggestions of the members
have been included.
(i) Preparation of the EFC should start after submission
of the final version of the proposal.
(j) Within six months the project should start functioning.
12. Budget
Broad Outline : Rs. 221.60 lakhs per year. Total: Rs. 1772.8
lakhs (variable, please see note below) for the next
eight years, covering the present and next plan period.
The break-up of figures for one year will be as follows :
Sl.No. | Head | Amount
1. | Human Resources | 78,00,000
(a) | Project Director (1) Rs. 30,000 (variable) 30,000 x 12 m | 4,20,000
(b) | Scientist A (3) 29,000 x 3 persons x 12 man-months | 10,80,000
(c) | Scientist B (4) 21,000 x 8 x 12 m | 21,12,000
(d) | Scientist C (5) 14,000 x 6 x 12 m | 11,52,000
(e) | Scientist D (8) 11,000 x 8 x 12 m | 12,48,000
(f) | Project technicians (Rs.5,000 x 20 x 12 m) | 14,40,000
(g) | Maint Personnel - Accounts (Rs.11,000 x 1 x 12 m) | 1,56,000
(h) | Maint Personnel - Sales & Promo (Rs.7,000 x 1 x 12 m) | 96,000
(i) | Maint Personnel - General (Rs.7,000 x 1 x 12 m) | 96,000
 | Tasks | 56,60,000
2. | Tasks at various Participating Institutions (as in Annex) | 56,60,000
 | Events | 50,00,000
3. | Academic Meetings in diff. Instt x 2 | 2,00,000
4. | LDC-IL PAC meetings at CIIL x 2 | 2,00,000
5. | Seminars & Events in diff Instt - 7 every year in participating Instt | 15,00,000
(a) | Seminars (National) in diff Instt x 2 | 4,00,000
(b) | Seminars (Regional) in diff Instt x 4 | 6,00,000
(c) | Seminars (Int'l) rotating in diff participating Instt x 1 per year | 5,00,000
6. | (Prod) Workshops for production (6) | 6,00,000
7. | Training Programmes x 4 per year | 2,00,000
8. | Travel & Incidentals | 8,00,000
 | Equipments & Maintenance | 27,00,000
9. | Hardware | 20,00,000
10. | Software/Tools | 4,00,000
11. | Equipment maintenance | Variable (From OE-Non-Pl)
12. | Maintenance of LDC-IL | 3,00,000
 | IPR | 10,00,000
13. | IPR/Copyright payments (variable) | 5,00,000
14. | Publications, incl E-pub (10 a year) | 5,00,000
 | TOTAL | Rs. 2,21,60,000
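The head-wise subtotals can be cross-checked against the stated annual and eight-year totals with a short sketch. The figures are copied from the break-up and the broad outline; the variable names are illustrative only and are not part of the proposal.

```python
# Head-wise annual subtotals from the break-up, in rupees.
ANNUAL_HEADS = {
    "Human Resources": 78_00_000,
    "Tasks": 56_60_000,
    "Events": 50_00_000,
    "Equipments & Maintenance": 27_00_000,
    "IPR": 10_00_000,
}

LAKH = 100_000
PROJECT_YEARS = 8  # present and next plan period, as stated in the outline

annual_total = sum(ANNUAL_HEADS.values())
annual_lakhs = annual_total / LAKH
project_lakhs = annual_lakhs * PROJECT_YEARS

print(f"Annual total: Rs. {annual_total:,} ({annual_lakhs:.2f} lakhs)")
print(f"Total over {PROJECT_YEARS} years: Rs. {project_lakhs:.1f} lakhs")
```

The sketch sums only the five head subtotals, since those are the figures that add up to the stated total of Rs. 2,21,60,000.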
NOTE:
1. This is a broad indication of proposed expenditure under
different heads. The Director, CIIL, on the advice of
the Project Advisory Committee of the LDC-IL, may be
authorized to re-appropriate funds among
the heads indicated above, without exceeding the overall
sanctioned budget.
2. The first three years of the project period are incubation
years, in which no heavy flow of income is anticipated
from the project.
3. From the third year onwards, a flow of income is expected.
It is estimated that in the initial
period, the annual revenue may be 8% to 10% of the annual
investment as projected in the budget proposal given
above. If this income is placed in a corpus fund to
be used for LDC-IL, then in the initial period there will
be a turnover of Rs.17.73 lakhs to Rs.22.16 lakhs per
year. From the sixth year of the project, the revenue
is expected to be around 25% to 35% of the amount invested,
i.e. Rs.55.4 lakhs to Rs.66.48 lakhs annually. Once
these are contributed to create a corpus, it is likely
that at the end of eight years the corpus fund will
stand at between Rs.201.66 lakhs and Rs.243.76
lakhs, plus any interest. It is proposed that
beyond eight years, the Ministry may consider giving
grants only for events (Rs.50 lakhs), tasks of software
development (Rs.64.76 lakhs), and maintenance of equipment
(Rs.15.24 lakhs), i.e. Rs.130 lakhs a year. The services
of the personnel and the IPR costs will be paid from
the interest on the corpus fund (Rs.14.63 lakhs) plus
further income of Rs.66.48 lakhs, i.e. Rs.81.11 lakhs
generated annually. Added to the Rs.130 lakhs of grants
mentioned above, the total comes to about Rs.211.11 lakhs.
4. The project is anticipated to be cent per cent
self-sufficient/self-financing by the end of the Eleventh Plan period,
or by the end of eight years from its commencement.
5. In case people in service in the Government or
Autonomous Institutions in a substantive capacity are
selected, their service and salary will be protected.
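The figures in note 3 can be reproduced with a short worked calculation. This is a reconstruction under stated assumptions, not a statement of how the figures were actually derived: the corpus totals match if the lower-revenue period is taken as two years and the higher-revenue period as three years, and the interest figure matches a rate of 6% on the upper corpus estimate. Neither assumption is stated in the note. Rs.66.48 lakhs is 30% of Rs.221.60 lakhs, so the sketch uses 0.30 as the upper later-period rate.

```python
ANNUAL_INVESTMENT = 221.60      # lakhs per year, from the budget outline
ASSUMED_INITIAL_YEARS = 2       # assumption: years with 8% to 10% revenue
ASSUMED_LATER_YEARS = 3         # assumption: years with the higher revenue
ASSUMED_INTEREST_RATE = 0.06    # assumption: reproduces the Rs.14.63 lakhs figure


def revenue(rate):
    """Annual revenue in lakhs for a given fraction of the annual investment."""
    return ANNUAL_INVESTMENT * rate


def corpus_after_eight_years(initial_rate, later_rate):
    """Corpus fund (lakhs, before interest) if all revenue is deposited."""
    return (ASSUMED_INITIAL_YEARS * revenue(initial_rate)
            + ASSUMED_LATER_YEARS * revenue(later_rate))


corpus_low = corpus_after_eight_years(0.08, 0.25)
corpus_high = corpus_after_eight_years(0.10, 0.30)

# Position beyond eight years, using the upper estimates from the note.
grants = 50 + 64.76 + 15.24                  # events + software tasks + equipment
interest = corpus_high * ASSUMED_INTEREST_RATE
own_income = interest + revenue(0.30)
total = grants + own_income

print(f"Corpus fund: {corpus_low:.2f} to {corpus_high:.2f} lakhs")
print(f"Interest: {interest:.2f}, own income: {own_income:.2f}, total: {total:.2f}")
```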
Annexure-1: Speech Recognition and Synthesis
1. Introduction :
The objective of the data collection effort is primarily to
build speech recognition and synthesis systems for Indian
languages. Although automatic speech recognition (ASR) and
text-to-speech (TTS) systems are
available around the world for a number of mainstream
languages, commercially viable speech systems for Indian
languages are not available.
Voice User Interfaces for IT applications and services have
become more and more prevalent for languages like English,
and are valued for their ease of access, especially
in telephony-based applications. In a country like India,
where the majority of the population is not comfortable
using English, and given the relatively lower rates of
literacy, local language speech interfaces can provide
access to IT applications and services, through the Internet
and/or telephones, to the masses. If such technology
is available in Indian languages, people in various
semi-urban and rural parts of India will be able to
use telephones and Internet to access a wide range of
services and information on health, agriculture, travel,
etc. However, for this to become a reality, a computer
has to be able to accept speech input in the user's
language and provide speech output. Also, in multilingual
India, if speech technology is coupled with translation
systems between the various Indian languages, services
and information can be provided across languages more
easily.
Although speech technology has been the focus of research in
India for a number of years and the technology itself
has matured for real-world applications, the main obstacle
in customizing this technology for various Indian languages
is the lack of appropriate annotated speech databases
in these languages. The focus here is (i) to collect
data that can be used for building speech enabled systems
in Indian languages and (ii) to develop tools that facilitate
collection of high quality speech data.
2. Background - Speech Recognition :
Automatic speech recognition is the task of
converting a speech signal into its orthographic representation.
There are two different categories of speech recognition
systems:
• Isolated word recognition and connected word systems, as in command
and control applications.
• Continuous speech recognition systems. Continuous
speech recognition itself covers two categories:
read speech and spontaneous speech.
3. Background - Speech Synthesis :
The task of speech synthesis is to convert written text (orthographic
representation) to speech. The vocabulary should not
be restricted for speech synthesis, and synthesized speech
must be close to natural speech. To enable unrestricted
speech synthesis, the sentence is normally converted
to a sequence of basic units. Then appropriate rules
of synthesis are employed to produce natural-sounding
speech.
The focus is primarily on building (a) vocabulary-independent
speech-to-speech translation (for a pair of Indian languages)
and (b) vocabulary-dependent isolated word recognition
in the Indian languages.
4. Long Term Goal:
The grand vision of this project is to collect data to provide
speech-to-speech translation from each and every language
to each and every other language spoken in India (including
Indian English). Such a system would include unlimited
vocabulary speech synthesis and recognition systems
for every Indian language, coupled with machine translation
systems between those languages. The block diagram given
below describes the basic architecture of such a system.
[Block diagram: Speech input in Language A -> Recognized Text in Language A -> Translated Text in Language B -> Speech Output in Language B]
5. Short Term Goal:
To create databases for building (a) a bi-directional speech-to-speech
translation system of read speech for a pair
of Indian languages, namely Hindi-Telugu, and (b) a speech
recognition system for Indian English. Further, it is
desired to collect large vocabulary isolated word data for
the 22 Scheduled Indian languages.
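The three-stage architecture in the block diagram (recognition, translation, synthesis) can be sketched as a simple composition of functions. This is only an illustration of the data flow: the class name, the stage functions and the one-word lexicon are hypothetical placeholders, and the "audio" here is just encoded text standing in for a real speech signal.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToSpeechPipeline:
    """Chains the three stages of the block diagram."""
    recognize: Callable[[bytes], str]    # speech in language A -> text in language A
    translate: Callable[[str], str]      # text in language A -> text in language B
    synthesize: Callable[[str], bytes]   # text in language B -> speech in language B

    def run(self, speech_a: bytes) -> bytes:
        text_a = self.recognize(speech_a)
        text_b = self.translate(text_a)
        return self.synthesize(text_b)


# Toy stand-ins for the real components (assumptions for illustration only).
TOY_LEXICON = {"namaste": "namaskaram"}


def toy_recognize(audio: bytes) -> str:
    # Pretend the audio bytes already carry the transcript.
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


def toy_synthesize(text: str) -> bytes:
    # Pretend the encoded text is the synthesized waveform.
    return text.encode("utf-8")


hindi_to_telugu = SpeechToSpeechPipeline(toy_recognize, toy_translate, toy_synthesize)
output = hindi_to_telugu.run(b"namaste")
print(output)
```

A bi-directional system would pair this with a second pipeline built from the recognizer, translator and synthesizer for the opposite direction.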
6. Methodology for Short Term Effort:
Methodologies for data collection and development of tools required
for the short-term and long-term goals are given below:
Data Collection Effort for Automatic Speech Recognition (ASR)
The data collection effort will involve collection of read
and spontaneous speech.
Data required:
Read speech corpora for two Indian languages and Indian
English.
Channels:
1. Close talking microphone, on a desktop or laptop.
2. Telephone, both landline and mobile.
Annotation:
The data will be annotated at phoneme, syllable, word
and sentence levels.
Data Collection for Isolated Speech Recognition
Channels:
1. Close talking microphone, on a desktop or laptop.
2. Telephone, both landline and mobile.
Demography:
10,000 words from 300 speakers (150 male, 150 female).
Data Collection for Text to Speech Synthesis
Data required:
Data will be collected in the form of read-out, phonetically
balanced text, which will ensure coverage of all speech
sounds of the language concerned in different prosodic
and phonological contexts. The phonetically balanced
text will be extracted from a large text corpus.
Channels:
Speech synthesis requires high quality recording in
an anechoic chamber using high quality microphones and
recording equipment.
Demography:
6 speakers: 3 male and 3 female per language.
Annotation:
Data will be annotated at phone, phoneme, syllable, word,
and phrase levels.
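One common way to extract phonetically balanced text from a large corpus is greedy selection: repeatedly pick the sentence that adds the most sound units not yet covered. The sketch below illustrates the idea and is not necessarily the method the project will adopt. As a loudly stated simplification, it uses pairs of adjacent characters as a stand-in for phone sequences; a real system would first convert the text to phones.

```python
def units(sentence):
    """Return the set of adjacent-symbol pairs in a sentence.

    Character pairs are a stand-in for diphones (an assumption);
    spaces are replaced by '_' so that word boundaries count as context.
    """
    symbols = sentence.lower().replace(" ", "_")
    return {symbols[i:i + 2] for i in range(len(symbols) - 1)}


def select_balanced(corpus, max_sentences):
    """Greedily pick sentences that add the most not-yet-covered units."""
    covered = set()
    chosen = []
    remaining = list(corpus)
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break  # no remaining sentence adds a new unit
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


toy_corpus = ["aba", "abab", "cd", "abcd"]
chosen, covered = select_balanced(toy_corpus, max_sentences=3)
print(chosen, sorted(covered))
```

The greedy choice keeps the recording script short, which matters when every selected sentence has to be read by each speaker in an anechoic chamber.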
Tools Required for Data Collection and Annotation for ASR
and TTS:
Standardization of tools for data collection and annotation
is required. A number of different tools, such as EMULAB and PRAAT,
are available for annotation of speech data.
Convergence on representation, annotation and storage
format is required. However, the project will also focus
on providing converters across different formats.
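A converter across formats is easiest to build around one shared in-memory representation. The sketch below shows one possible representation, time-aligned labelled intervals grouped into tiers, with two output forms. The class, the tier names and both output layouts are hypothetical illustrations; they are not the native file formats of EMULAB or PRAAT.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Interval:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds
    label: str


def to_json(tiers):
    """Serialize {tier name: [Interval, ...]} to a JSON string."""
    payload = {name: [asdict(iv) for iv in ivs] for name, ivs in tiers.items()}
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)


def from_json(text):
    """Rebuild the tier dictionary from the JSON form."""
    data = json.loads(text)
    return {name: [Interval(**iv) for iv in ivs] for name, ivs in data.items()}


def to_label_lines(tiers):
    """Flatten tiers to tab-separated lines: tier, start, end, label."""
    lines = []
    for name, ivs in tiers.items():
        for iv in ivs:
            lines.append(f"{name}\t{iv.start:.3f}\t{iv.end:.3f}\t{iv.label}")
    return lines


tiers = {
    "word": [Interval(0.0, 0.42, "namaste")],
    "syllable": [
        Interval(0.0, 0.15, "na"),
        Interval(0.15, 0.30, "ma"),
        Interval(0.30, 0.42, "ste"),
    ],
}
round_trip = from_json(to_json(tiers))
label_lines = to_label_lines(tiers)
print(label_lines[0])
```

With a shared structure of this kind, each supported format needs only one reader and one writer, rather than a separate converter for every pair of formats.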
Tools for Speech Recognition:
The data will be annotated at phoneme, syllable, word
and sentence levels. Tools need to be developed for
semi-automatic annotation of speech data. These tools
will also be useful for annotating speech synthesis
databases. One could adopt LDC's practice of recording data in
the NIST format. This format is comprehensive in that
it contains all the information about the recording
environment, speaker information, sampling rate, number
of channels, number of bits/sample, etc.
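As an illustration of how such metadata travels with the audio, a NIST SPHERE file starts with a plain-text header: the line `NIST_1A`, a line giving the header size in bytes, then one `name -type value` line per field, closed by `end_head`. The sketch below parses a header built in memory. The function names and the sample field values are made up for the example; only the general header layout is taken from the format.

```python
def parse_sphere_header(blob: bytes) -> dict:
    """Parse the text header at the start of a NIST SPHERE byte string."""
    first, size_line, _rest = blob.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(size_line.strip())
    text = blob[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, field_type, value = line.split(" ", 2)
        if field_type == "-i":      # integer field
            fields[name] = int(value)
        elif field_type == "-r":    # real-valued field
            fields[name] = float(value)
        else:                       # -sN: string field of N bytes
            fields[name] = value
    return fields


def build_sample_header() -> bytes:
    """Build a 1024-byte example header (values are illustrative only)."""
    body = (
        "NIST_1A\n"
        "   1024\n"
        "speaker_id -s4 spk1\n"
        "sample_rate -i 16000\n"
        "channel_count -i 1\n"
        "sample_n_bytes -i 2\n"
        "sample_count -i 32000\n"
        "end_head\n"
    )
    return body.encode("ascii").ljust(1024, b" ")


fields = parse_sphere_header(build_sample_header())
duration_seconds = fields["sample_count"] / fields["sample_rate"]
print(fields, duration_seconds)
```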
Tools for Speech Synthesis:
Other than tools for annotating speech databases, text
annotation is also required for speech synthesis.
Text Annotation Tools Required:
1. PoS taggers, phrase boundary markers and intonation
markers.
2. Identification and standardization of feature vectors
for speech synthesis, e.g. vocalic/non-vocalic, pause break,
distance from a pause break, duration of break, and characteristics
of the intonation across the entire sound unit.
7. Applications:
• Speech-to-speech translation for a pair of Indian languages,
namely Hindi and Telugu.
• Command and control applications.
• Multimodal interfaces to the computer in Indian languages.
• E-mail readers over the telephone.
• Readers for the visually disadvantaged.
• Speech-enabled Office Suite.
The effort for both Speech Recognition and Speech Synthesis
will be repeated across all 22 Scheduled languages.
For Speech Recognition, spontaneous speech data will
be collected along with read speech. For Speech Synthesis,
data will be collected from professional speakers with
very good voice quality. Additional speech data will
be collected to build models of prosody (intonation,
duration, etc.) to improve the naturalness of synthesized
speech. A database (lexicon) of proper names (of Indian
origin) will be created, with the equivalent phonetic
representation for each of the names.
Annexure - 2: Character Recognition
1. Introduction:
Character Recognition refers to the conversion of printed or handwritten
characters to a machine-interpretable form, or in other
terms, the "reading" of text. The term has
been used to address three very distinct language technologies
with different applications.
"Online" handwriting recognition, or Online HWR, refers
to the interpretation of handwriting captured dynamically
using a handheld or tablet device. It allows the creation
of more natural handwriting-based alternatives to keyboards
for data entry in Indian scripts, and also the teaching
of handwriting skills using computers.
"Offline" handwriting recognition, or Offline HWR, refers
to the interpretation of handwriting captured statically
as an image. It can be used for the interpretation of
handwriting already recorded on paper, ranging from
filled-in forms to handwritten manuscripts.
Optical character recognition, or OCR, refers to the interpretation
of printed text captured as an image. It can be used
for conversion of printed or typewritten material such
as books and documents into electronic form.
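The difference between dynamic and static capture can be made concrete with a small sketch. It assumes, purely for illustration, that an online sample is a list of strokes, each a time-ordered list of (x, y, t) pen positions on an integer grid, and that an offline sample is a binary image. Rasterizing the strokes discards stroke order and timing, which is the information available to Online HWR but not to Offline HWR or OCR.

```python
def rasterize(strokes, width, height):
    """Turn time-ordered pen strokes into a binary image (list of rows)."""
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for x, y, _t in stroke:
            if 0 <= x < width and 0 <= y < height:
                image[y][x] = 1
    return image


# A "+" shape written as two strokes; t is a timestamp in milliseconds.
horizontal = [(0, 1, 0), (1, 1, 10), (2, 1, 20)]
vertical = [(1, 0, 100), (1, 1, 110), (1, 2, 120)]

online_sample = [horizontal, vertical]
offline_image = rasterize(online_sample, width=3, height=3)

# Writing the strokes in the opposite order yields the same image.
same_image = rasterize([vertical, horizontal], width=3, height=3)

for row in offline_image:
    print(row)
```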
These different areas of language technology require different
algorithms and linguistic resources. However, for convenience,
they have been combined under the "character recognition"
umbrella. They are all hard research problems because
of the variety of writing styles and fonts encountered.
Of these, OCR has seen some research in a few Indian
scripts because of support from the TDIL program. However,
the technology is not yet mature and there is only one
commercial offering. Also, there are no common linguistic
resources that can be used by the community. The other
areas, Online and Offline HWR, have seen very little
research overall in the context of Indian scripts, and
no linguistic resources exist.
2. Objectives:
Long-term objectives
(i) Development of standards, tools and linguistic resources
(datasets) for the fields of Online HWR, Offline HWR
and OCR.
(ii) Promotion of development of these technologies.
(iii) Promotion of development of important and challenging
applications of these technologies in the context of
Indic languages and scripts.
This will be achieved in a variety of ways:
• Standards development will primarily be via a
mixture of email discussions and face-to-face meetings
of working group members organized under the aegis of
LDC-IL.
• Tool development will be given as projects to
technology institutions with the necessary inclination,
skills and resources.
• Linguistic data collection, annotation and validation
will be given as projects to linguistics/computational
linguistics departments of institutes and universities
with the necessary inclination, skills and resources.
However, for each linguistic resource developed, validation
will be performed by a different institution from the
one doing the collection and annotation. Use of the
linguistic resources for technology development will
be promoted by arranging periodic competitions (for
example, for recognition of online handwritten words
in specific scripts) and by objective evaluation of
performance.
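Objective evaluation in such competitions is often reported as a character error rate: the edit distance between the recognized text and the reference, divided by the reference length. The sketch below shows that measure as one possible choice; the proposal does not specify a metric. It compares Unicode code points, which is a simplification: for Indian scripts an evaluation might instead compare larger units such as aksharas.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """Total edit distance over total reference length for (ref, hyp) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _hyp in pairs)
    return errors / total if total else 0.0


results = [("abcd", "abcd"), ("abcd", "abxd")]
cer = character_error_rate(results)
print(f"CER = {cer:.3f}")
```

Because the score depends only on the reference transcriptions and the submitted output, the same held-out dataset can be used to compare systems from different institutions.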
3. Implementation Phases:
Specifically, linguistic data collection will be done in two phases.
Phase I (year 1-3)
• Development of standards
Standards are key to the creation of shared linguistic resources.
The LDC-IL will adopt established processes for proposing
and advancing standards, working with international
standards bodies wherever applicable. Standards will
be proposed for datasets of online handwriting, offline
handwriting and documents, and printed characters.
• Development of tools for data collection
The availability of good tools will allow researchers to
start collecting data in different Indian scripts and
to contribute data to LDC-IL. Such tools are essential in order
to extend support to all Indian scripts quickly. Tools
for data collection and dataset creation will be designed
and developed for all three target technologies.
• Promotion of technology development for specific tasks
in selected scripts
The LDC-IL will promote the development and implementation
of technology for Online HWR, Offline HWR and OCR in
the context of specific tasks and selected scripts.
The tasks could be
(i) to interpret a line of handwriting captured using
a handheld computer
(ii) to interpret a form that has been filled in and
scanned
(iii) to interpret a page from a book
Though all major Indian languages are objects of research,
to begin with Devanagari, Tamil and Telugu will be
addressed. These offer considerable variety in terms of visual
complexity (and hence in the challenge for recognition).
Other scripts will be taken up in due course of time.
• Development of linguistic resources in selected
scripts
The working group will drive the creation of significant
linguistic resources for the tasks and scripts outlined
above.
Some examples of linguistic resources are:
• Online handwritten word samples from at least 500 writers
in each script
• Samples of handwritten characters extracted from forms
representing at least 500 writers, and at least 500
samples of each handwritten character in each script
• Synthetic data covering all printed characters and at
least 1000 pages in each script
Phase II (year 4-8)
• Refinement of standards:
Since standardization requires consensus among creators and
users of linguistic resources, it is expected that the
process of standardization will continue as an activity
beyond the first three years.
• Refinement of tools:
The tools created in the first phase will be continuously
refined during this second phase, as more and more researchers
start to use them and provide feedback and suggestions
for improvement.
• Extension of technology tasks and linguistic resources
to remaining scripts:
The technologies developed for the initial set of scripts
will be adapted for other scripts during the second
phase. As in the first phase, technology development
in these other scripts will be supported by the creation
of linguistic resources,
subject to budget constraints and interest from researchers
working on those scripts.
• Promotion of significant applications:
A major activity during the second phase will be the promotion
of significant applications with high potential impact
on society. These will typically involve solving
challenging problems, multiple years of concerted effort,
and close interaction between participating institutions
and other researchers in India and abroad.
It is envisaged that these applications will be developed
for selected languages and scripts such as Hindi, and
will be extended to other languages and scripts
with participation from researchers from all over India
in due course of time.
4. Applications:
The list below is meant to be indicative rather than exhaustive.
Handwriting Interface to Computers
Indian scripts are complex and not well suited to keyboard-based
entry. Replacing the keyboard with a simpler and more
natural interface based on handwriting would make computers
much more accessible to the common man and to educators
in particular.
Imagine that the keyboard is replaced with a special writing
pad for handwriting input. As one writes, the writing
is converted using HWR technology into words and entered
into the target application. The solution would also
need to support numerals, punctuation, and editing gestures,
and functionally replace the keyboard.
Language technologies used: Online HWR for Indian Languages
Handwriting Tutor
The solution described above can also be adapted to provide
computer-based instruction in handwriting to improve
writing skills of school children, improve literacy
as part of adult education programs, or allow literate
adults to learn new scripts.
Language technologies used: Online HWR for Indian Languages
Multilingual Digital Libraries for Education
A wealth of literature and other educational material in
Indian languages is trapped in books, which require
storage and are subject to physical decay. Online books,
on the other hand, have no such problems, and may be
made available over the Internet to students everywhere,
in their schools, homes or hostels.
The proposed solution will use a complete OCR pipeline
for converting scanned images of book pages into electronic
form, which will then be used to create a multilingual
digital library. The library can then be searched using
the local language, using either spoken (using Speech
Recognition) or written (using Online HWR) queries.
Results can be viewed on screen, and also read out using
Text-to-Speech conversion. In addition, an annotation
system will allow students to make private annotations
on the book. This solution can be used by individual
libraries or to create district, state or national level
online educational resources.
Language technologies used: OCR, Online HWR, Speech Recognition,
Text-to-Speech for Indian Languages
Automatic Forms Processing/Educational Testing
With millions of application forms filled in every year in
Indian languages, especially in the education sector,
a solution for automatically reading handwritten entries
from scanned images of forms is clearly very valuable.
With a growing school-going population, manual
evaluation of answer papers has also become very difficult.
Offline HWR technology offers the possibility
of automatically reading and evaluating responses (at
least for fill-in-the-blanks questions, where
there are one or a few correct answers).
The proposed solution is a complete forms-processing system
that can be used to read handwriting from a scanned
image of a paper form. The interpreted results can be
stored in a database (for applications) or compared
with correct responses (for educational testing).
Language technologies used: Offline HWR for Indian Languages
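The comparison step for educational testing can be sketched as follows. This is a minimal illustration, assuming that an Offline HWR engine (not shown) has already produced one text string per question; the function names, question ids and sample answers are hypothetical and not part of the proposal.

```python
import unicodedata


def normalize(text):
    """Canonicalize Unicode (NFC), collapse whitespace, ignore case."""
    return " ".join(unicodedata.normalize("NFC", text).split()).casefold()


def evaluate(recognized, answer_key):
    """Mark each question True/False by comparing the recognized
    text with the accepted answers for that question."""
    results = {}
    for question, accepted in answer_key.items():
        got = normalize(recognized.get(question, ""))
        results[question] = got in {normalize(a) for a in accepted}
    return results


# Hypothetical data: recognizer output for two fill-in-the-blank
# questions, plus an answer key with one or a few correct answers.
answer_key = {"q1": ["Mysore", "Mysuru"], "q2": ["1992"]}
recognized = {"q1": "  mysore ", "q2": "1993"}
marks = evaluate(recognized, answer_key)
```

Unicode normalization is included because the same Indian-script text can be encoded in more than one way; exact string matching without it may reject correct answers.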
Annexure - 3:
Natural Language Processing
1. Introduction:
Some of the important language data resources required in
Indian languages for various NLP applications are given
below:
2. Electronic dictionaries:
Electronic dictionaries are a primary requisite for developing
any software in NLP.
ED 1. Monolingual/bilingual dictionaries:
25,000 words per year (per language)
ED 2. Transfer Lexicon and Grammar (TransLexGram) (per language)
The Transfer Lexicon and Grammar above involves developing a language
resource which would contain:
o English headwords
o Their grammatical category
o Their various senses in Hindi
o The corresponding sense in the other Indian language
o An example sentence in English for each sense of a
word
o The corresponding translation in the concerned Indian
language
o In the case of verbs, parallel verb-frames from English
to the Indian language.
As is clear from the above, TransLexGram will be a rich
lexicon which will contain not only word-level information
but also the crucial information of verb-argument structure
and the vibhaktis which various languages use with specific
senses of a verb.
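The fields listed above can be pictured as a record structure. The following is a minimal sketch only; the class names, field names and the angle-bracket placeholder values are assumptions made for illustration, not a format defined by the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    """One sense of an English headword with its counterparts."""
    hindi_sense: str               # the sense as rendered in Hindi
    target_sense: str              # sense in the other Indian language
    english_example: str           # English example sentence
    target_translation: str        # its translation in that language
    english_verb_frame: Optional[str] = None  # verbs only
    target_verb_frame: Optional[str] = None   # verbs only, with vibhaktis


@dataclass
class TransLexGramEntry:
    headword: str
    category: str
    senses: List[Sense] = field(default_factory=list)

    def is_complete(self) -> bool:
        """Verb entries need parallel verb frames for every sense."""
        if not self.senses:
            return False
        if self.category == "verb":
            return all(s.english_verb_frame and s.target_verb_frame
                       for s in self.senses)
        return True


# Placeholder values stand in for real lexical data.
entry = TransLexGramEntry(
    headword="give",
    category="verb",
    senses=[Sense(
        hindi_sense="<Hindi sense>",
        target_sense="<target-language sense>",
        english_example="She gave him a book.",
        target_translation="<target-language translation>",
        english_verb_frame="<English verb frame>",
        target_verb_frame="<target verb frame with vibhaktis>",
    )],
)
```

Keeping the English headword and example as the shared key is what would let entries for different Indian languages be lined up with one another.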
The resource, once created, will be a parallel resource not
only between English and Indian languages but also across
all Indian languages.
If the bilingual TransLexGram resources are created as aligned resources,
several advantages will accrue, and the work to be done
for each individual resource will also be reduced.
Annexure - 4:
Corpora Creation in Indian Languages
1. Introduction:
The Central Institute of Indian Languages has corpora
of around 3.5 million words in many major Indian languages.
These will be enlarged to 25 million
words in each language. Also, the existing corpora are
raw corpora and will be cleaned for use. Apart from the
22 major Indian languages there are hundreds of minor
and tribal languages that deserve attention from
researchers for their analysis and interpretation. Creation
of corpora in these languages will help in comparing
and contrasting the structure and functioning of Indian
languages. Corpora for at least 100 minor languages
will therefore be collected, to the tune of around 3 to 5 million
words in each language, depending upon the availability of
text for the purpose.
2. Domain Specific Corpora:
Apart from this basic text corpora creation, an attempt will
be made to create domain-specific corpora in the following
areas:
a. Newspaper corpora
b. Child language corpus
c. Pathological speech/language data
d. Speech error data
e. Historical/inscriptional databases of Indian languages,
which are among the most important resources, not only
as living documents of Indian history but also for tracing
the historical linguistics of Indian languages.
f. Comparative, descriptive and reference grammars, which
need to be considered as a corpus of databases.
g. Morphological analyzers and morphological generators.
3. POS tagged corpora:
Part-of-speech (or POS) tagged corpora are collections of texts in
which the part-of-speech category of each word is marked.
POS tagged corpora will be developed in a bootstrapping
manner. As a first step, manual tagging will be done
on some amount of text. A POS tagger which uses learning
techniques will be used to learn from the tagged data.
After the training, the tool will automatically tag
another set of the raw corpus. The automatically tagged
corpus will then be manually validated and
used as additional training data for enhancing the performance
of the tool. This process will be repeated till the
accuracy of the tool reaches a satisfactory level. With
this approach, the initial man-hours per 10,000 words
will be higher. Thereafter, the tagging process will speed
up.
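The bootstrapping cycle described above can be sketched as a loop. This is a toy illustration under stated assumptions: the learner is a most-frequent-tag lookup standing in for a real trainable tagger, manual validation is simulated by a hypothetical gold lexicon, and the tag names, word lists and 95% target are invented for the example.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag per word (a deliberately simple learner)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def tag(model, words, default="NOUN"):
    """Tag words automatically; unknown words get a default tag."""
    return [(w, model.get(w, default)) for w in words]


def accuracy(model, gold_sentences):
    total = correct = 0
    for sentence in gold_sentences:
        predicted = tag(model, [w for w, _ in sentence])
        for (_, p), (_, g) in zip(predicted, sentence):
            total += 1
            correct += (p == g)
    return correct / total if total else 0.0


def bootstrap(seed, raw_batches, validate, heldout, target=0.95):
    """Train, auto-tag a batch, validate it manually, retrain, repeat."""
    training = list(seed)
    model = train(training)
    history = [accuracy(model, heldout)]
    for batch in raw_batches:
        if history[-1] >= target:      # satisfactory level reached
            break
        auto_tagged = [tag(model, words) for words in batch]
        training.extend(validate(s) for s in auto_tagged)
        model = train(training)
        history.append(accuracy(model, heldout))
    return model, history


# Hypothetical toy data; GOLD simulates the human validator.
GOLD = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
        "runs": "VERB", "sleeps": "VERB", "fast": "ADV"}
seed = [[("the", "DET"), ("dog", "NOUN")]]
raw_batches = [[["the", "dog", "runs"]], [["a", "cat", "sleeps", "fast"]]]
heldout = [[("the", "DET"), ("cat", "NOUN"), ("runs", "VERB"), ("fast", "ADV")]]
model, history = bootstrap(
    seed, raw_batches, lambda s: [(w, GOLD[w]) for w, _ in s], heldout)
```

Here `history` records held-out accuracy after each round, which is the quantity one would watch to decide when to stop manual validation.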
4. Chunked corpora:
The chunked corpora will also be prepared in a manner similar
to the POS tagged corpora. Here also the initial training set
will be a completely manual effort. Thereafter, it will
be a man-machine effort. That is why the target for
the first year is lower, and is doubled in the successive
years. Chunked corpora are a useful resource for various
applications.
5. Semantically tagged corpora:
The real challenge in any NLP and text information processing
application is the task of disambiguating senses. In
spite of long years of R & D in this area, fully
automatic word sense disambiguation (WSD) with 100% accuracy
has remained an elusive
goal. One of the reasons for this shortcoming is understood
to be the lack of appropriate and adequate lexical resources
and tools. One such resource is the "semantically
tagged corpora".
In semantically tagged corpora, words in the text documents
will be marked with their correct senses. For example,
in
"Can a can can soup"
apart from POS tagging, it is also necessary to tag
the text as
"Can a can <included-in-set: container> can
<included-in-set: hold-action> soup"
which is an example of semantic tagging.
The question that arises is "What should be the set
of such tags and where should these come from?".
A natural answer to this is obtained when we look at
the "WordNet". WordNet is a semantic structure
where "relational semantics" is exploited
to encode the senses of words. The basic machinery for
sense representation is the accumulation of synonyms
into 'synsets' and the enumeration of semantic relations
like 'hypernyms', 'meronyms' etc. For example, the 'included-in-set'
tag above is the hypernym
(superordinate) relation which disambiguates the sense.
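As an illustration of how synsets and their hypernyms could supply the tags, here is a minimal sketch. The two-entry table, the synset identifiers and the function names are hypothetical stand-ins for a real WordNet, and the sense choice is supplied by hand, as in manual tagging.

```python
# Toy stand-in for a WordNet: each synset groups synonymous lemmas
# and points to its hypernym (superordinate) concept.
TOY_WORDNET = {
    "can.n.01": {"lemmas": ["can", "tin"], "hypernym": "container"},
    "can.v.01": {"lemmas": ["can", "preserve"], "hypernym": "hold-action"},
}


def candidate_synsets(word):
    """List the synsets an annotator could choose from for a word."""
    w = word.lower()
    return sorted(s for s, d in TOY_WORDNET.items() if w in d["lemmas"])


def render(tagged_tokens):
    """Emit text in which each sense-tagged word carries its hypernym."""
    parts = []
    for word, synset_id in tagged_tokens:
        if synset_id is None:
            parts.append(word)  # word left untagged
        else:
            hypernym = TOY_WORDNET[synset_id]["hypernym"]
            parts.append(f"{word} <included-in-set: {hypernym}>")
    return " ".join(parts)


# Manual sense selection for the example sentence.
tagged = [("Can", None), ("a", None), ("can", "can.n.01"),
          ("can", "can.v.01"), ("soup", None)]
output = render(tagged)
```

With real data the second element of each pair would be a synset number from the relevant WordNet rather than these invented identifiers.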
The following are the steps towards creating semantically tagged corpora:
1. Develop, refine and make widely available Indian language
WordNets. (IITB is developing Hindi and Marathi WordNets;
AU-KBC and Tanjavur University are working on Tamil
WordNets. Similarly, other language WordNets are being
created at other places.)
2. Link the WordNets into the "Indo-WordNet",
a massive semantic structure of Indian language WordNets.
3. Link the Indo-WordNet to English and Euro-WordNets.
4. Create large amounts of sense tagged corpora manually
for the purpose of training a 'sense tagging machine'.
The tags are the Indo-WordNet synset numbers.
5. Devise algorithms for the training task. Hidden Markov
Models, entropy maximization, etc. are possible candidates.
6. For the purpose of semi-automatic semantic tagging,
invest in user-friendly and intelligent user interfaces.
The semantically tagged corpora are a valuable resource which
will be constructed using the Indian language WordNets
and then employing machine learning algorithms (as in
the case of POS taggers discussed above).
6. Syntactic tree bank:
Preparation of this resource requires a higher level of linguistic
expertise and needs more human effort. For preparing
this corpus, experts will manually tag the data for
syntactic parsing. A tool can then automatically extract
various tree structures for the tree bank. Since it
requires more manual effort and also a higher degree
of linguistic expertise, building this resource will
be a relatively slower process. The initial take-off
time will also be longer in this case.
A crucial point related to this task is to arrive at
a consensus regarding the tags, the degree of fineness in
analysis and the methodology to be followed. This calls
for discussions amongst scholars from varying
fields such as linguistics and computer science. It
will be achieved through the conduct of workshops and meetings.
First, some Sanskrit scholars, linguists and computer
scientists will review the existing tagging scheme developed
for Indian languages by IIIT, Hyderabad and define standards
for all Indian languages (extendable to any language).
On this basis some experiments will be carried out on
selected Indian languages to test the applicability
and quality of the defined standards. After this testing,
the actual tagging task will start.
7. Parallel aligned corpora:
A text available in multiple languages through translation
constitutes a parallel corpus. The National Book Trust and the
Sahitya Akademi are some of the official agencies that
develop parallel texts in different languages through
translation. Such institutions have given permission
to the Central Institute of Indian Languages to use
their works for the creation of electronic versions
as parallel corpora. Magazines and newspaper
houses that bring out translated versions of their output
are another source of texts for parallel corpora.
First, wherever necessary, the texts have to be keyed in,
and then computer programmes have to be written for
creating
[I] Aligned texts; [II] Aligned sentences; and [III]
Aligned chunks.
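As an illustration of what a programme for step [II] might do, the sketch below aligns sentences by comparing character lengths with dynamic programming. It is a simplified length-based heuristic chosen for the example, not a method specified in the proposal; the penalty values, the function name and the sample sentences (upper-case English standing in for a translation) are all assumptions.

```python
def align_sentences(src, tgt, skip_penalty=10, merge_penalty=3):
    """Return (source indices, target indices) pairs that minimise
    the total length mismatch between aligned sentence groups."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Allowed groupings: 1-1, 1-0, 0-1, 2-1 and 1-2 sentences.
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                ls = sum(len(s) for s in src[i:ni])
                lt = sum(len(t) for t in tgt[j:nj])
                if di == 0 or dj == 0:
                    step = skip_penalty + ls + lt   # unmatched sentence
                else:
                    step = abs(ls - lt)
                    if di + dj > 2:
                        step += merge_penalty       # 2-1 or 1-2 grouping
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end of both texts.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


# Hypothetical example: the "translation" merges two source sentences.
source = ["The library is open.", "Come in.",
          "Books are on the second floor."]
target = ["THE LIBRARY IS OPEN.",
          "COME IN. BOOKS ARE ON THE SECOND FLOOR."]
alignment = align_sentences(source, target)
```

Raw character counts are a crude signal across different scripts, so a practical tool would likely calibrate the length comparison per language pair and add lexical cues.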
8. Tools:
1. Tools for Transfer Lexicon Grammar (including creation
of an interface for building Transfer Lexicon Grammar).
2. Spellchecker and corrector tools.
3. Tools for POS tagging (trainable tagging tool with
an interface for editing POS tagged corpora).
4. Tools for chunking (rule-based language-independent
chunkers).
5. Interface for chunking (building an interface for
editing and validating the chunked corpora).
6. Tools for syntactic tree bank, including an interface for
developing the syntactic tree bank.
7. Tools for semantic tagging, whose basic resources are
the Indian language WordNets: a browser with two windows,
in one of which the senses (i.e., synsets)
from the WordNet appear, after which
a manual selection of the sense can be done.
8. (Semi-)automatic tagger based on statistical NLP
(a preliminary version of which is ready at IITB).
9. Tools for text alignment, including a text alignment
tool, a sentence alignment tool and a chunk alignment tool,
as well as an interface for aligning corpora.
Annexure-5:
Notes on Items
Item 1: Server & Proxy Server (Specs):
Example: IBM Mainframe Servers; zSeries 990
Price: $63,000 (approx)
Physical Configuration
Weight (unpacked) - 2007 kg
Footprint - 2.49 Sq. meters
Service Clearance - 5.45 Sq. meters
Input Power - 21.39 kVA
Heat Output - 72.73 KBTU/hr
Air Flow - CFM 3250, m3/m
Height - 194.1 cm (76.4 inches)
Hardware Model - D32 upwards
Coupling Links - Max # Links 64
Channels - 512/120/48/16 (ESCON/FICON Express/OSA-Express/HiperSockets)
Cryptographic - PCI Crypto Accelerator - up to
12 optional (up to 6 cards)
Processor Memory - 256 GB
Software - z/OS 1.2 and subsequent releases
z/VM 3.1, z/VM 4.2 and subsequent releases
Red Hat, SuSE, Turbolinux
OS/390 2.10
VSE/ESA 2.5 and subsequent releases
TPF 4.1 (ESA mode only)
Red Hat, SuSE, Turbolinux
Item 2: Large Serial Disk System or equivalent
Large storage systems like the IBM 7133 Serial Disk System advanced
models D40 and T40 provide highly available storage
for UNIX, Windows NT, and Novell NetWare servers. By
implementing a powerful industry-standard serial technology,
the 7133 advanced models D40 and T40 provide outstanding
performance, availability, and attachability.
Product highlights
o Should provide outstanding disk storage performance
with advanced SSA bandwidth of 160 MB/sec
o High availability with redundant data paths,
redundant cooling units, and additional (at least two)
power supplies
o Should be an enterprise-strength storage for
distributed systems
o High availability to safeguard data access
o Ultra high speed 15,000 rpm hard disk drives,
now available in higher capacities (capacities of at least 36.4
GB or 72.8 GB)
o Simplified storage management
o Shared storage for all major types of servers
plus scalability for fast-growing environments
o Facilitates remote mirroring (up to 10 km connection
distances) with the Advanced SSA Optical Extender
Item 3: High-end scanners
Possible models:
Creo (www.creo.com)
EverSmart Pro II: 8000-pixel trilinear CCD; 11.8x17; 11.8x17; 3175x8200; $29,950
EverSmart Select: 8000-pixel trilinear CCD; 11.8x17; 11.8x17; 5600x11,400; $34,950
EverSmart Supreme: 8000-pixel trilinear CCD; 11.8x17; 11.8x17; 5600x14,000; $44,950
Fujifilm
Fujifilm: Superlinear CCD; 18.5x13.8; 18.5x13.8; 13,500; $25,000
Screen USA
Cezanne Elite: 8000-pixel trilinear CCD; 13x20.9; 13x20.9; 5300; $34,000
Item 4: High Speed LAN printers
Standard
Item 5: Satellite PCs attached to servers or equivalent
Like Compaq Alphaserver DS20 System
Specifications:
$18,000 each
Processor/cache
Up to two 500-MHz Alpha 21264; each with 64-KB I-cache,
64-KB D-cache on
chip & 4 MB of ECC onboard cache
Memory
128 MB ECC SDIMM memory, expandable up to 4 GB
System architecture
Dual 256-bit wide memory data paths & cross-bar
switch technology
providing 5.2 GB/s (peak) memory bandwidth; dual 64-bit
PCI buses
providing 532 MB/s I/O throughput
Performance
11,616 tpmC @ $50.58/tpmC/18.01.99
6,065 Specweb96 (2 CPU), 4,092 (1 CPU)
27.7 Specint95, 58.7 Specfp95 (1 CPU); 76.1 Specfp SMP
(2 CPU)
Internal expansion
6 slots: 5 PCI, 1 PCI/ISA
Storage controllers
Dual-channel Fast SCSI-2, FW SCSI-2, FWD SCSI-2, Ultra
SCSI RAID, CI, DSSI
Network controllers
10/100 Ethernet, FDDI, Token Ring, asynchronous communications
Drive bays (removable)
Seven hot-pluggable StorageWorks drive bays for 4-GB,
9-GB or 18-GB disks
(maximum 128-GB). Three removable media bays: one 3.5"
bay for diskette drive;
one 5.25" for CD-ROM; one for tape or hard disk
Power supply
Optional 675 watt redundant power supply
Interfaces
Two serial, one parallel, keyboard, mouse
High availability
Server management software, optional redundant power
supply, auto reboot, thermal
management software, remote system management, RAID,
hot-pluggable drives,
memory failover, ECC memory, ECC cache, SMP CPU failover,
error logging,
optional uninterruptible power supply (UPS) & UPS
Power Management Software
Operating systems
Tru64 UNIX 4.0E, OpenVMS 7.1-2, MS Windows NT 4.0 SP3,
LINUX
Annexure-6:
International Workshop On Linguistic Data Consortium In Indian Languages
For Language Technology R & D
Central Institute of Indian Languages
in collaboration with
HP Labs India & IIIT, Hyderabad
LIST OF PARTICIPANTS
01. Prof. Rajeev Sangal (sangal@iiit.net)
02. Prof. Yegnanarayana (yegna@cs.iitm.ernet.in)
03. Prof. Mark Liberman (myl@unagi.cis.upenn.edu)
04. Dr. Hemant Darbari (darbari@cdac.ernet.in)
05. Dr.G. Uma Maheshwara Rao (guraosh@uohyd.ernet.in)
06. Dr. SriGanesh Madhvanath (srig@india.hp.com)
07. Dr.A.G. Ramakrishnan (agram@india.hp.com)
08. Dr. KSR Anjaneyulu (anji@india.hp.com)
09. Dr.Roger Tucker (roger_cf_tucker@yahoo.co.uk)
10. Sri Joel Pinto (joelp@india.hp.com)
11. Dr. Dipti Mishra Sharma (dipti@iiit.net)
12. Dr. Pushpak Bhattacharya (pb@cse.iitb.ac.in)
13. Dr. Hema Murthy (hema@lantana.tenet.res.in)
14. Ms. Kalika Bali (kalika@india.hp.com)
15. Dr.Ksenia Shalonova (kacniya-spb@hotmail.com)
16. Dr. Nandini Chatterjee Singh (nandini@nbrc.ac.in)
17. Sri Sai Jay Ram (sai@cdotb.ernet.in)
18. Dr.V. Ramasubramaniam (vram@ece.iisc.ernet.in)
19. Dr. Nitendra Rajput (rnitendra@in.ibm.com)
20. Sri GVD Prasad Rao (gdattu@india.hp.com)
21. Sri Deepu Vijayasenan (deepuv@india.hp.com)
22. Sri B.Ajay Sivaram (joyshiv@india.hp.com)
23. Sri C.S.Ramalingam (ramali@ee.iitm.ernet.in)
24. Ms. Aparna Subramanian (aparnasubramanian@indiatimes.com)
25. Dr. Ravi Shankar (ravi_ling@yahoo.com)
26. Dr.Ananthapadmanabha (anantha@blr.vsnl.net.in)
27. Sri Partha Pratim Talukdar (partha@india.hp.com)
28. Dr.K. Narayanamurthy (knmuh@yahoo.com)
29. Dr.R.N.V. Sitaram (sirn@india.hp.com)
30. Dr.C. Chandrasekhar (chandra@cs.iitm.ernet.in)
31. Dr. V Jawahar (jawahar@iiit.net)
32. Dr. Anvita Abbi (anvita@jnuniv.ernet.in)
33. Dr.S.Ramani (ramani@india.hp.com)
34. Prof. Etienne Barnard
35. Ms. Marelie Davel
PARTICIPANTS FROM MHRD
01. Ms. Kumud Bansal, Additional Secretary (kumudb@sb.nic.in)
02. Ms. Bela Banerjee, Jt. Secretary (L) (bela.edu@sb.nic.in)
PARTICIPANTS FROM CIIL
1. Prof. Udaya Narayana Singh (udaya@ciil.stpmy.soft.net)
2. Prof. J.C. Sharma (sharma@ciil.stpmy.soft.net)
3. Dr. Rajesh Sachdeva (rajesh@ciil.stpmy.soft.net)
4. Dr.K. Ramasamy (ramaswamy@ciil.stpmy.soft.net)
5. Dr.K.S.Rajyashree (rajya@ciil.stpmy.soft.net)
6. Dr.A.K. Basu (basu@ciil.stpmy.soft.net)
7. Dr.Pon Subbiah (subbiah@ciil.stpmy.soft.net)
8. Dr.Sam Mohan Lal (mohan@ciil.stpmy.soft.net)
9. Ms.Rekha Sharma (rekha@ciil.stpmy.soft.net)
10. Dr.B. Mallikarjun (mallikarjun@ciil.stpmy.soft.net)
11. Dr.V. Saratchandra Nair (nair@ciil.stpmy.soft.net)
12. Dr.I.S. Borkar (borkar@ciil.stpmy.soft.net)
13. Dr.P.P. Giridhar (giridhar@ciil.stpmy.soft.net)
14. Sri N.H. Itagi (nhitagi@ciil.stpmy.soft.net)
15. Dr.B.A. Sharada (sharada@ciil.stpmy.soft.net)
16. Sri G.Vijayasarathi (gvsarathi@ciil.stpmy.soft.net)
Annexure - 7:
Minutes of the Meeting on Linguistic Data Consortium held on
24.2.2004 at 11.30 AM in the Chamber of ES
A meeting on the proposal of Central Institute of Indian
Languages, Mysore to set up a Linguistic Data Consortium
for Indian Languages was held in the chamber of the
Education Secretary on 24.2.2004 at 11.30 AM.
The meeting was chaired by the Education Secretary and
attended by the following members:
1) Additional Secretary, Dept. of Sec. & Higher Education,
Ministry of HRD
2) Joint Secretary (Languages)
3) Director (Finance)
4) Director, Central Institute of Indian Languages,
Mysore
5) Academic Secretary, Central Institute of Indian Languages,
Mysore.
2. At the outset JS(L) welcomed the chairman and the members
and asked Director, CIIL, Mysore to give a presentation
on the proposal to set up the Linguistic Data Consortium.
Director, CIIL gave a presentation explaining the meaning
and background of a Linguistic Data Consortium and the need
to set up such a consortium for Indian
Languages (LDC-IL) on the model of the Linguistic Data Consortium
(LDC), established in 1992 and hosted by the University of
Pennsylvania, USA. Director, CIIL explained that CIIL, Mysore
proposes to set up the consortium in collaboration
with institutions working on Indian Language Technology
like Indian Institute of Science, Bangalore, Indian
Institute of Technology, Madras, Indian Institute of
Technology, Bombay and International Institute of Information
Technology, Hyderabad, etc.
3. Director, CIIL stated that CIIL has corpora of 45 million
words in 15 Indian Languages and explained that he has made
a revised proposal after consulting the institutions
mentioned in para 1 above through teleconferencing,
e-mail, chats and other exchanges.
4. He informed the members that LDC-IL will have mass-application
tools and products, which include SMS in Hindi and
application of the technology in call centres, etc. The
creation of the consortium will necessitate
a vast survey to collect high-quality data useful
for the purpose. It was emphasized that resources must
be shared between the Government and the Non-Government
institutions/individuals in this regard. The major areas
of the Linguistic Data Consortium are:
1) Speech Recognition and Synthesis
2) Character Recognition
3) Corpora Creation in Indian Languages
4) Several by-products like Lexicon, Thesauri etc.
5. The LDC-IL will be located in CIIL, Mysore. CIIL's prior
experience in creating databases and its expertise
in Linguistics make it an ideal location to host and
manage the LDC-IL. CIIL will require Human Resources
and other facilities to run and develop the project.
Director, CIIL stated that all engagement of manpower
will be on a contract basis.
6. The budget of LDC-IL, which covers Human Resources, tasks,
events, equipment, maintenance and Intellectual Property
Rights (IPR), will be Rs.220.10 lakhs per year. Rs.1772.8
lakhs will be required for the next eight years covering
the present and the next plan period. Director, CIIL
stated that the institute should be granted money in its Personal
Ledger Account so that money generated out of the project
can also be credited to this account and used as required
for the project with the approval of the Project Advisory
Committee. JS(L) stated that the project is to become self-sufficient
in eight years' time.
7. AS stated that the requirement of funds of CIIL should
go down progressively as the project is to become
self-sufficient in eight years.
8. Director (Finance) remarked that the proposal looks
promising but it needs to be specifically indicated
how the project will become self-sufficient in managing
LDC-IL over the 8-year period. CIIL may indicate the approximate
inflow of funds during the project period.
9. Education Secretary stated that the names of the institutions
which are willing to collaborate in the project with
CIIL, Mysore should be clearly indicated. He also desired
that differential rates of utilization for various categories
of users such as private institutions, University Departments
and individual users should be clearly indicated and
approved. Education Secretary stressed the
need to spell out adequate safeguards vis-a-vis Intellectual
Property Rights of LDC-IL so that its contents/resources
may not be hijacked by foreign collaborators.
10. JS(L) suggested that an IPR lawyer or expert may be
engaged by CIIL while drafting the LDC-IL agreement.
Director, CIIL stated that they can have one IPR specialist
in the Governing body. AS observed that LDC-IL applications
will have big scope in business and industry.
11. Director (Finance) enquired about the manpower requirement
for the proposal. At this, Director, CIIL explained that
not all the people need be located in CIIL; some may be in other
institutions (collaborating with CIIL) also. He stated
that a Project Advisory Committee (PAC) will do the selections
to recruit the staff on a contract basis. It was suggested
by JS(L) that a representative of the Ministry of IT may
be included in the PAC.
12. After detailed discussions it was decided that (a) CIIL
will submit a detailed proposal after incorporating the
suggestions given in this meeting. (b) Based on the
proposal, EFC should be prepared for the project.
The meeting ended with a vote of thanks to the chair
from JS(L).
Annexure- 8:
Minutes of the Meeting to Consider
the Project Proposal on
Linguistic Data Consortium for Indian Languages
held on 26.7.2004 at 3.00 PM in the Chamber of ES
A meeting to consider the Project Proposal of Central
Institute of Indian Languages, Mysore to set up a Linguistic
Data Consortium for Indian Languages (abbreviated as
LDC-IL) was held in the chamber of the Education Secretary
on 26.7.2004 at 3.00 PM. The meeting was chaired by
Shri B.S.Baswan, Education Secretary and attended by
the following members:
1) Shri Sudeep Banerjee, Additional Secretary, Dept. of
Sec. & Higher Education, Ministry of HRD
2) Smt. Bela Banerjee, Joint Secretary (Languages)
3) Shri V.K. Pipersenia, Financial Advisor, Ministry
of HRD
4) Shri Madhukar Sinha, Director (Languages)
5) Prof. Udaya Narayana Singh, Director, Central Institute
of Indian Languages, Mysore
6) Dr. B. Mallikarjun, Academic Secretary, Central Institute
of Indian Languages, Mysore.
The Director, Central Hindi Directorate, Delhi, was
also present.
2. At the outset JS(L) welcomed the chairman and the members
and asked Director, CIIL, Mysore to give a presentation
on the revised proposal to set up the Linguistic Data
Consortium. Director, CIIL gave a presentation explaining
the need for setting up a Linguistic Data Consortium
for Indian languages and the background of this proposal.
He suggested that the proposed Linguistic Data Consortium
for Indian Languages (LDC-IL) may be set up on the model
of the Linguistic Data Consortium (LDC), established in 1992
and hosted by the University
of Pennsylvania, USA, with initial and core
funding from the Ministry of Human Resource Development,
Government of India. He further pointed out that although
the LDC at the University of Pennsylvania had huge databases
of some non-western languages like Chinese, Japanese,
Korean and Arabic, there are, at present, no resources
of comparable standard for Indian languages, whereas
there are already demands for such data from different
sectors, including the software and telecom sectors. Director,
CIIL explained that CIIL, Mysore proposes to set up
the LDC-IL in project mode, where it will work closely
with some other major technical institutions working
on Indian Language Technology such as Indian Institute
of Science, Bangalore, Indian Institute of Technology,
Madras, Indian Institute of Technology, Bombay and International
Institute of Information Technology, Hyderabad, etc.,
besides some software giants from the private sector
who might be interested in this activity, such as HP
Labs, IBM, Infosys, Wipro, etc. He also presented the
communication from Shri R.K.Arora, Senior Director,
Ministry of Communication & Information Technology
(looking after the TDIL or Technology Development in
Indian Languages programme) assuring both active participation
and full support to this activity if undertaken by the
Institute.
3. Director, CIIL stated that since CIIL already houses
and distributes corpora of 45 million words in 15 Indian Languages,
and because it has already collaborated with the University
of Lancaster to combine these with the Emille resources of another
45 million words in 5 Indian languages (as spoken in
the UK) and release the Emille-CIIL Corpora in Unicode
format, available since January 2004, it is logical that this next
major step be taken now to expand this activity in several
directions to realize the goal of LDC-IL.
4. He informed the members that LDC-IL will have mass application
once databases, tools and products are generated as
a result of this activity. This would include Voice
Interface products in Hindi and other Indian languages
such as telephone queries, mobile services, dictation
software and speech-driven systems, and application of
the technology for automated recognition systems such
as OCRs, handwriting recognition, etc. However, he pointed
out that LDC-IL will need to undertake a systematic
survey to collect high-quality linguistic data of
different kinds from different domains and from different
segments of the population for it to be useful to software
application tools. The major areas of focus of LDC-IL
are:
1) Speech Recognition and Synthesis
2) Character Recognition
3) Natural Language Processing (NLP)
4) Corpora Creation in Indian Languages, including Parallel
Corpora, Spoken Corpus, etc.
5) Several by-products like Word finders, Lexicon, Thesauri,
Spell-checkers, Grammar-checkers, Auto-summarization,
Tree-banking Tools, Skeletal and Shallow Parsers, Statistical
Probabilities Models, Idioms Dictionaries and Chunkers,
etc.
5. The LDC-IL will be located in CIIL, Mysore. CIIL's prior
experience in creating databases and its expertise
in Linguistics make it an ideal location to host and
manage the LDC-IL. CIIL will require additional Human
Resources and other facilities to run and develop the
project. It may require a few Core Staff at the Professorial/Senior
level or maintenance personnel. However, Director, CIIL
pointed out that most of the manpower requirement, i.e., the remaining
40-odd positions, will be met contractually.
If and when the core operations of LDC-IL become self-supporting,
all such contractual staff will continue to be engaged
only out of the resources generated.
6. The budget of LDC-IL will be Rs.221.60 lakhs per year,
i.e., Rs.1772.8 lakhs over a period of eight years. This will
cover Contractual Human Resources, Project grants or
Expenses towards tasks for participating national institutions
and agencies, Expenses on Training programmes, Workshops,
Seminars and other events, Specialized software and
equipment to strengthen already existing infrastructure,
Equipment maintenance expenses, and Payment of Royalties
for data for which others have the Intellectual Property
Rights (IPR).
7. Director, CIIL stated that the institute should be granted
money in its Personal Ledger Account and/or be allowed
to maintain a special bank account for transactions
of LDC-IL as well as to house its Corpus Funds, so that
money generated out of the project could also be credited
to this account and used as required for the project
with the approval of the Project Advisory Committee.
8. ES suggested that the enterprise should not be confined
to government-run institutions only and that the management
of the project has to be evolved in such a manner that
the balance between Public and Private players in the
field is maintained.
9. AS pointed out that the proposal is worthy of support
even if no private sector group is ready to invest in
this activity. He commented that even in the USA,
private groups will think about investing in such an activity
only if the range and richness
of the databases and tools as well as services are demonstrated
to them by whoever is running the consortium. Director (L)
intervened to suggest
that the project is worthy of support even as an intellectual
exercise, without considering when exactly it would
achieve self-sufficiency.
10. JS(L) commented that only after the project makes significant
progress would the software and telecom industries
be interested in joining the LDC-IL group, either as
corporate members or on its board. However, the requirement
of funds flow to the CIIL for this project would go
down progressively as the project is to become
self-sufficient in eight years.
11. The Financial Advisor remarked that the proposal looks
promising and, considering its importance and demand,
operationalizing the entire project should be taken
up immediately in project mode without going into the
creation of a Society for this purpose. However, the
Project Management Structure needs to be considered
carefully to include Government, Academic Institutions
and Private Enterprises in managing the affairs of the
LDC-IL.
12. After detailed discussions it was agreed that (a) the
LDC-IL proposal of the CIIL is approved in principle.
(b) An SFC-note for the establishment
of LDC-IL will be prepared immediately for further processing.
Towards that end, the Language Bureau will provide advice
to the CIIL to prepare the document at an early date.
(c) A two-page note on LDC-IL will be prepared for inclusion
in the folder to be distributed at the time of the CABE
meeting on August 10-11, 2004. (d) Wide publicity
may be given to this proposed activity of the MHRD to
attract potential investors/members of LDC-IL.
The meeting ended with a vote of thanks to the chair
from JS(L).
(MADHUKAR SINHA)
Director (L)