Researching Sri Lankan English through a corpus
by Prof Ryhana Raheem and Dr Dushyanthi Mendis
Corpus Linguistics is a field of Linguistics which analyses natural
language data, in particular 'real' contemporary written and spoken
data, by means of a 'corpus', which is a collection of electronic or
digital texts.

The objective of corpus linguistics is to discover patterns of
language use, i.e., association patterns between particular words and
phrases.
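As a rough illustration of what an association pattern looks like in practice, the sketch below counts which words occur next to a chosen word. It is not the method used by the ICE-SL team; the function names `tokenize` and `collocates` and the sample sentences are made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(text, node, window=2):
    """Count the words occurring within `window` tokens of `node`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            # Words just before and just after the node word.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample text, standing in for a real corpus file.
sample = "She put on the light. He put off the light. They put on a show."
print(collocates(sample, "put", window=1).most_common(3))
```

On a real corpus the same count would be run over millions of words, and raw counts would normally be weighed against how common each word is overall.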
The collected data is stored on a computer as text files, and
special software tools have been designed to help researchers analyse
the wealth of information that has been collected and collated from a
wide variety of fields.
Corpus linguists usually start by searching for a particular word or
phrase in a corpus, and then do a more detailed analysis to find out
exactly how the word has been used in naturally occurring language
situations.
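The search step described above is usually presented as a concordance, in which every occurrence of the word is listed with a little of the text on either side. The following is a minimal sketch of that idea, assuming the corpus is available as a plain string; the function name `concordance` and the sample text are hypothetical and chosen only for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        # Up to `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, standing in for a real corpus file.
sample = ("The corpus was compiled in the 1990s. A corpus of this size "
          "takes years to build, and every corpus needs careful sampling.")
for line in concordance(sample, "corpus", width=20):
    print(line)
```

Lining up the keyword in a single column makes it easier to see at a glance which words tend to come before and after it.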
This analysis is of immense value to teachers and learners, in fact
to all users, because it gives us information on the ways in which the
language we use is changing and evolving, and what new words and phrases
are being formed. Patterns of contemporary use discovered by corpus
linguists are often incorporated into books on grammar, dictionaries and
other tools that aid teaching and learning.
The corpus-based study of English began in the 1960s. A project to
investigate the features of Sri Lankan English in this manner was begun
in the late 1990s.
Initially supported by the British Council, a multi-university team
of academics and researchers, along with Dr Chris Tribble (who was
associated with the British Council at that time), commenced work on
collecting data on Sri Lankan English.
At the moment, the Sri Lanka corpus team is chaired by Professor
Ryhana Raheem, currently Director of the Post Graduate Institute of
English at the Open University of Sri Lanka. It includes Dr Dushyanthi
Mendis, Senior Lecturer, Department of English, University of Colombo;
Dr Hemamala Ratwatte, formerly Head of the Department of Language
Studies, Open University of Sri Lanka; Professor Manique Gunasekera of
the University of Kelaniya; and other researchers from these
universities.
When completed, the corpus of Sri Lankan English will become a part
of a larger corpus of International English known as the ICE (the
International Corpus of English) and will be available to students,
teachers and researchers for research and study purposes.
The Sri Lankan corpus will be known as ICE-SL, or the International
Corpus of English - Sri Lanka. Work on this component is almost half
complete, and is currently being directed by Prof. Dr. Joybrato
Mukherjee, Chair of English Linguistics at Justus Liebig University in
Giessen, Germany, who has obtained funding to complete the project.
Prof Mukherjee is scheduled to visit Sri Lanka in February to review
the work done so far on ICE-SL, collect the remaining texts to complete
the corpus, and set parameters for data collection and analysis for the
future.
Professor Mukherjee is also scheduled to deliver lectures on corpus
linguistics perspectives on Indian and Sri Lankan English for interested
audiences at the University of Colombo (February 12th) and the Open
University (February 13th).
Among the best-known examples of research corpora are the Brown
Corpus of America, the British National Corpus (BNC) and the London-Lund
Corpus of Britain, the Kolhapur Corpus of India, ACE of Australia and
the Wellington Corpus compiled for New Zealand English.
Work on Sri Lankan English done by the ICE-SL team is therefore a
timely effort to align the research on English in this country with
international trends and norms, and to provide a concrete base for
prescriptive (how the language should be used) and descriptive (how the
language is actually used) practices in Sri Lanka.
****
More on corpus linguistics
A landmark in modern corpus linguistics was the publication by Henry
Kucera and Nelson Francis of Computational Analysis of Present-Day
American English in 1967, a work based on the analysis of the Brown
Corpus, a carefully compiled selection of current American English,
totalling about a million words drawn from a wide variety of sources.
Kucera and Francis subjected it to a variety of computational
analyses, from which they compiled a rich and variegated opus, combining
elements of linguistics, language teaching, psychology, statistics, and
sociology.
An earlier key publication was Randolph Quirk's 'Towards a description
of English Usage' (1960, Transactions of the Philological Society,
40-61), in which he introduced the Survey of English Usage.
Shortly after the 1967 study appeared, Boston publisher Houghton
Mifflin approached Kucera to supply a million-word, three-line citation
base for its new American Heritage Dictionary, the first dictionary to
be compiled using
corpus linguistics. The AHD took the innovative step of combining
prescriptive elements (how language should be used) with descriptive
information (how it actually is used).
Other publishers followed suit. The British publisher Collins'
COBUILD dictionaries, designed for users learning English as a foreign
language, were compiled using the Bank of English.
The Brown Corpus has also spawned a number of similarly structured
corpora: the LOB Corpus (1960s British English), Kolhapur (Indian
English), Wellington (New Zealand English), ACE (Australian English),
the Frown Corpus (early 1990s American English), and the FLOB Corpus
(1990s British English).
Other corpora represent many languages, varieties and modes, and
include the International Corpus of English and the British National
Corpus, a 100-million-word collection of a range of spoken and written
texts, created in the 1990s by a consortium of publishers, universities
(Oxford and Lancaster) and the British Library.