The 7th TELRI Seminar "Information in Corpora"
Dubrovnik, Croatia, 26-29 September 2002
Accepted papers
- BARNBROOK, Geoff "Definition
parser and dictionary translation"
- BENKO, Vladimir "Web
and/as Corpora: linguistic data on internet"
- ČERMÁK, František,
Aleš KLÉGR "Modality in Czech
and English: Possibility particles and the
conditional mood in a parallel corpus"
- CIVIT,
Montserrat, Montserrat AREVALO, Maria Antonia
MARTI "MICE: A module
for NERC in Spanish"
- ERJAVEC, Tomaž "An
experiment in automatic bi-lingual lexicon
construction from a parallel corpus"
- GABOR, Kata "Making
correspondences between morphosyntactic and
semantic patterns"
- GROBELNIK, Marko,
Dunja MLADENIĆ "Efficient
visualization of large text corpora"
- JAKOPIN, Primož "Extraction
of lemmas from a web index wordlist"
- KOEVA, Svetla "The
structure of hyperonym relations"
- KRSTEV, Cvetana,
Duško VITAS "Multilingual
concordances using INTEX"
- LEVĀNE, Kristīne,
Normunds GRŪZĪTIS "Automatic text
mark-up facilities building Latvian literature
corpus"
- MAHLBERG,
Michaela "The
textlinguistic dimension of corpus linguistics"
- MARCINKEVIČIENĖ,
Rūta, Vidas DAUDARAVIČIUS "Detection of the
boundaries of collocation"
- MARTINEZ, Soledad
GARCIA, Anna FAGAN "Academic conflict
in research articles: a cross-disciplinary study
of chemistry and tourism articles"
- NENADIĆ, Goran,
Irena SPASIĆ, Sophia ANANIADOU "What
can be learnt and acquired from non-disambiguated
corpora"
- PAJZS, Julia "The
corpus based comparison of the meaning of the
word loyal in English and lojális
in Hungarian"
- SINCLAIR, John "The
mystery of meaning"
- SLAVCHEVA, Milena
"Defining
meaningful patterns in the group of the predicate"
- ŠOJAT, Krešimir,
Sanja FULGOSI, Božo BEKAVAC "Identification
of terminological varieties in legal texts
translations"
- TADIĆ, Marko, Božo
BEKAVAC, Ivana SIMEON "Marking terms in
the process of translation of the Acquis
Communautaire into Croatian"
- TAMBURINI, Fabio "Quantitative
Analysis of word distributional behaviour.
Italian adverbs of manner: a case study"
- TUFIS, Dan "Interlingual
alignment of parallel semantic lexicons by means
of automatically extracted translation
equivalents"
- UZAR, Rafal "Corpora
and translation quality assessment"
- VEBER, Marek,
Karel PALA "CED - Program of
corpora editing"
- VIČIČ, Jernej,
Tomaž ERJAVEC "Corpus driven
machine translation"
- WENZHONG, Li "An
analysis of the key words and the words in
association in the college learner English corpus"
- WILLIAMS,
Geoffrey "Out of control:
defining specialised and non-specialised words in
a corpus-driven special language dictionary"
- WYNNE, Martin,
Rowan WILSON "Oxford Text
Archive: archive, library or corpus?"
- ZHIWEI, Feng "Translation
divergence in machine translation"
Geoff Barnbrook
Definition parser and dictionary translation
During the early 1990s software was developed at the
University of Birmingham by Geoff Barnbrook and John
Sinclair to parse English definition sentences of the
kind used in the Cobuild dictionaries. The development
and operation of this parser and the local grammar of
definition associated with it are described in Defining
Language (Barnbrook, forthcoming). One of the potential
applications identified for the parser after its
development was its use in the translation of dictionary
definitions into other languages and the production of
bilingual bridge dictionaries based on the Cobuild range.
This paper explores the specific areas where the parser
could be used in this process and invites suggestions for
collaborative projects making use of the parser. It will
draw on the actual and potential use of the grammar in
the production of current and proposed dictionaries.
Vladimir Benko
Web and/as Corpora: Linguistic Data on Internet
Recently, there has been a growing tendency to
consider the web as a large multilingual corpus, out
of which dynamic corpora can be selected for individual
languages or language sets, both monolingual and
parallel, by means of Language Technology tools. To
analyze those dynamic corpora, the output of conventional
search engines can be used, either in its "pure"
form or with minimal post-processing that does not
involve downloading the web pages returned
by the search. To be able to use a search
engine in this way, several conditions must be fulfilled:
(1) The search engine must correctly process texts
in all representations (character sets) used for the
respective language on the web. While this is easy to
meet for English, which usually does not use any accented
characters, for most western "Latin-1"
languages it requires coping with at least the ISO-8859-1 and
UTF-8 codings. For the "Latin-2" languages,
such as Slovak, Czech or Hungarian, at least three code
sets are to be used (Win-1250, ISO-8859-2 and UTF-8); and
for "Cyrillic" languages like Russian no fewer
than four sets must be considered (Win-1251, ISO-8859-5, KOI8-R
and UTF-8). The local MAC and DOS codings can usually be
ignored without great loss of language data, though, in
general, they might also be considered. The "correct
interpretation" of the respective language set
typically involves the ability to distinguish between
alphabetic and non-alphabetic characters, to provide correct
upper/lower case conversions, to accept all
valid alphabetic characters in search expressions and to
display the search results in a legible way. The easiest
and most generally accepted solution is that introduced
by AllTheWeb and recently adopted also by Google, which
convert everything into UTF-8 before creating the index
and use UTF-8 as the basic output coding when displaying the
results.
(2) A reasonably selective language filter for the
respective language must exist to provide for elimination
of unwanted web pages, either as part of the index
building strategy (e.g. Yandex for Russian and other
Cyrillic Languages), or via the user interface (Google,
AllTheWeb, AltaVista).
(3) A reasonable number of results must be shown (typically
several hundreds), to provide for looking also for medium-frequency
words and expressions in the respective language. (Google,
AllTheWeb, Yandex)
(4) The search results page must display at least one-line
context of the expression searched for. (Google, Yandex).
(5) For languages with a rich morphology, a
morphological analyzer/generator should be part of the
user interface to enable searching a word or expression
in all its respective forms (Yandex for Russian). This
can be partially supplemented by regular expression
search ability (AltaVista).
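Condition (1) above, converting every page to UTF-8 before indexing, can be sketched as follows. This is a minimal illustration, not the procedure used by any of the search engines named: the mapping from the charset labels in the text to Python codec names, the function names and the sample words are assumptions made for the example, and a real system would also have to detect undeclared or wrongly declared charsets.

```python
# Charset labels from the text mapped to Python codec names (illustrative mapping).
CODECS = {
    "Win-1250": "cp1250",
    "ISO-8859-2": "iso8859_2",
    "Win-1251": "cp1251",
    "ISO-8859-5": "iso8859_5",
    "KOI8-R": "koi8_r",
    "UTF-8": "utf-8",
}


def normalise_to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Decode a page from its declared charset and re-encode it as UTF-8."""
    return raw.decode(CODECS[declared_charset]).encode("utf-8")


def index_key(raw: bytes, declared_charset: str) -> str:
    """Lower-cased index key, computed only after normalisation to UTF-8."""
    return normalise_to_utf8(raw, declared_charset).decode("utf-8").lower()


# The same Slovak word served in three encodings yields a single index key.
slovak_keys = {
    index_key("Ľúbiť".encode(CODECS[name]), name)
    for name in ("Win-1250", "ISO-8859-2", "UTF-8")
}

# The same Russian word served in four encodings yields a single index key.
russian_keys = {
    index_key("Слово".encode(CODECS[name]), name)
    for name in ("Win-1251", "ISO-8859-5", "KOI8-R", "UTF-8")
}
```

Because case conversion happens after normalisation, upper/lower case handling only has to be correct for one representation instead of once per code set.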
In any case, the user of the web as a corpus must be aware of
the specific nature of this resource when compared to a
"normal" corpus: the data is very noisy and
unstable (web pages appear and disappear in an
unpredictable way), there are typically no word-list and
lexical-statistics operations available, and there is no
control over the representativeness and register of the
language data. On the other hand, this need not be a
great problem when, e.g., low-frequency lexical
phenomena are sought while compiling a dictionary of a
language with no large-scale corpus available, such as
Slovak.
The article will present examples of "lexical
evidence mining" of rare words for Slovak, as well as estimates
of the size of the web subcorpora, and the "recall"
and "precision" values for some TELRI languages.
F. Čermák, A. Klégr
Modality in Czech and English: Possibility Particles
and the Conditional Mood in a Parallel Corpus
Summary: The paper examines two kinds of modality
exponents and their interlingual relationships, using an
aligned parallel minicorpus of two contemporary Czech
originals (drama, novel) and their English translations.
It focuses on the four most frequent Czech adverbial
particles of possibility / approximation, snad, možná,
asi, nejspíše, and the Czech conditional mood marker by
in the texts and their equivalents. It contrasts the
findings with the equivalents in the latest and largest
Czech-English dictionary. The results confirm that in
either case the lexicographic description is insufficient
both in the range of equivalents offered and their
respective representativeness.
Montserrat Arevalo, Montserrat
Civit, and Maria Antonia Marti
MICE: a module for NERC in Spanish
Named entity recognition and classification (NERC)
is a core problem to solve in IR and IE technologies. In
this paper we present MICE, a system for NERC based on
syntactic as well as semantic information. MICE is a
module in a pipe-line process for corpus processing and
annotation developed by the CLiC-TALP groups (University
of Barcelona and Universitat Politècnica de Catalunya).
This module runs after the morphosyntactic tagger,
once proper names (strong NEs) have been identified, and
before the syntactic chunker.
We have defined two types of NE: strong and weak NEs.
Strong NEs include only proper names written in capital
letters; weak NEs are defined in terms of syntactic and
semantic characteristics. We distinguish between simple
and complex weak NEs. Complex NEs include coordination of
one or more constituents and some kinds of subordinate
complements, such as relative clauses.
Weak NEs have specific syntactic patterns and all of
them include at least one trigger word. Trigger words
carry the semantic as well as morphosyntactic (POS)
information of the whole NE. Semantic information is
expressed in terms of a set of types, compatible with MUC
classification, where each type is associated with a set
of trigger words. The MICE module processes the corpus
identifying the nominal phrases containing a trigger word
and assigning to the whole nominal phrase its type and
POS. This process is carried out by means of a
context-sensitive chunk grammar. Up to now we have only dealt
with strong NEs and simple weak NEs.
Future work will focus on developing the module that
will deal with complex weak NEs. To do so, we need to
develop a treebank with syntactic as well as semantic
information.
Tomaž Erjavec
An Experiment in Automatic Bi-Lingual Lexicon
Construction from a Parallel Corpus
The IJS-ELAN corpus (Erjavec 2002) contains 1 million
words of annotated parallel Slovene-English texts. The
corpus is sentence aligned and both languages are word-tagged
with context disambiguated morphosyntactic descriptions
and lemmas.
In the talk we discuss an experiment in automatic bi-lingual
lexicon extraction from this corpus. Extracting such
lexica is one of the prime uses of parallel corpora, as
manual construction is an extremely time consuming
process, yet the resource is invaluable for
lexicographers, terminologists, translators as well as
machine translation systems.
For the experiment we used two statistics based
programs for the automatic extraction of bi-lingual
lexicons from parallel corpora: the Twente software (Hiemstra,
1998), and the PWA system (Tiedemann, 1998). We compare
the two programs in terms of availability, ease of use
and the type and quality of the results.
We experimented with several different choices of
input to the programs, using varying amounts of
linguistic information. We compared the extractions using
the word-forms from the corpus to those where lemmas were
used: lemmatisation normalises the input and abstracts away
from the rich inflections of Slovene.
Following the lead of Tufis and Barbu (2001) we also
restricted the translation lexicon to lexical items of
the same part-of-speech, i.e. we make the assumption that
a noun is always translated as a noun, a verb as a verb,
etc. This again reduces the search space for the
algorithms and could thus lead to superior results.
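The effect of the same-part-of-speech restriction can be illustrated with a toy extractor. This is only a sketch under stated assumptions: it is not the Twente or PWA algorithm, it scores candidates with a plain Dice coefficient over sentence-level co-occurrence, and the function name, the (lemma, POS) token format and the three aligned sentence pairs are invented for the example.

```python
from collections import Counter
from itertools import product


def extract_lexicon(aligned_pairs, same_pos_only=True):
    """Return, for each source item, the best-scoring target item and its Dice score.

    aligned_pairs: list of (source_sentence, target_sentence), where each
    sentence is a list of (lemma, pos) tuples.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        src_items, tgt_items = set(src_sent), set(tgt_sent)
        src_freq.update(src_items)
        tgt_freq.update(tgt_items)
        for s, t in product(src_items, tgt_items):
            # The POS filter shrinks the candidate space before any scoring.
            if same_pos_only and s[1] != t[1]:
                continue
            pair_freq[(s, t)] += 1

    best = {}
    for (s, t), joint in pair_freq.items():
        dice = 2 * joint / (src_freq[s] + tgt_freq[t])
        if s not in best or dice > best[s][1]:
            best[s] = (t, dice)
    return best


# Hypothetical lemmatised, POS-tagged, sentence-aligned toy corpus.
toy_corpus = [
    ([("pes", "N"), ("lajati", "V")], [("dog", "N"), ("bark", "V")]),
    ([("pes", "N"), ("spati", "V")], [("dog", "N"), ("sleep", "V")]),
    ([("mačka", "N"), ("spati", "V")], [("cat", "N"), ("sleep", "V")]),
]
lexicon = extract_lexicon(toy_corpus)
```

With the filter switched on, a noun can never be paired with a verb, which is also why translations that change word class are missed.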
Finally, we experimented with taking the whole corpus
as input, and opposed this to processing corpus
components separately. The reasoning here is that it is
likely that different components will contain distinct
senses of polysemous words, which will be translated into
different target words. For such words there would
therefore be no benefit in amalgamating different texts,
while the final precision might in fact be lower.
Preliminary results show that the precision of the
extracted translation lexicon is much improved by
utilising lemmas with an identical part-of-speech in the
source and target languages; this argues in favour of
linguistic pre-processing of the corpus. However, the
recall of the system tends to be lower, as it misses out
on conversion translations. In the conclusion we discuss
this and other findings, as well as current results on
extracting translation equivalents of collocations.
REFERENCES
Erjavec, T. The IJS-ELAN Slovene-English Parallel Corpus. International Journal of Corpus Linguistics, 7/1, in print, 2002. http://nl.ijs.si/elan/
Hiemstra, D. Multilingual Domain Modeling in Twenty-One: Automatic Creation of a Bi-Directional Translation Lexicon from a Parallel Corpus. In Proceedings of Computational Linguistics in the Netherlands, Nijmegen, pp. 41-57, 1998. http://wwwhome.cs.utwente.nl/~irgroup/align/
Tiedemann, J. Extraction of translation equivalents from parallel corpora. In Proceedings of the 11th Nordic Conference on Computational Linguistics, Center for Sprogteknologi, Copenhagen, 1998. http://numerus.ling.uu.se/~corpora/plug/pwa/
Tufis, D., Barbu, A.M. Automatic Construction of Translation Lexicons. In Kluew, V., D'Attellis, C., Mastorakis, N. (eds.) Advances in Automation, Multimedia and Modern Computer Science. WSES Press, pp. 156-172, 2001.
Kata Gabor
Making correspondences between morphosyntactic and
semantic patterns
The project my paper describes aims at extracting
information from a 1,500,000-word corpus of Hungarian
business news by matching semantic patterns to
the input sentence to yield an XML-tagged output with a
detailed description of its semantic structure. The
corpus is composed of short business news that contain
one or two sentences. Semantic tagging implies the
identification of the so-called 'main event' of the
sentence, which is most frequently represented by the
predicate of the main clause, and the 'participants' of
the event, which take the form of complements of the
predicate. A semantic pattern consists of the event-type
and a set of the corresponding participants and their
role in the event. The input text is first subjected to a
morphological analysis and a shallow syntactic parsing
that defines sentence and clause boundaries and labels
word sequences as VPs, NPs, APs etc. The task of finding
the main event and labeling argument phrases as
participants and circumstances is performed by an
intermediate rule-based module. Rules can refer to
morphosyntactic information and they output semantic tags.
The main difficulties arising while transforming
morphosyntactic information into semantic information are
reference and coreference relations, homonymy and
different syntactic behaviour of words and phrases
belonging to the same semantic pattern. As a solution to
these problems, an extensive lexical database of
syntactic patterns describing verbal and nominal argument
structures is integrated into the syntactic parsing
module. The database contains 11,000 argument structures
of the 3,500 most frequent Hungarian verbs and 11,500
argument structures for 9,000 nouns from general
vocabulary and business terminology. Lexical entries of
argument structure patterns contain the arguments'
detailed morphosyntactic and semantic description. These
patterns are associated with one or more meanings of the
lemma. However, the problem of homonymy is reduced to a
minimum since ambiguity is far less frequent between
argument structure patterns than between lemmas. The
syntactic analysis module looks up and labels
complements of the predicate and those of noun phrases,
and transmits all information to the intermediate module.
Matching event-type patterns is then completed among
predicates and their arguments. Other phrases that don't
match any complement in the lexical entry are considered
as optional adjuncts that typically don't represent
participants of events but circumstances. Argument
structure patterns, besides their usefulness in
disambiguation, offer the advantage of facilitating the
task of making correspondences between syntactic and
semantic arguments by restricting the number of phrases
to be investigated. They also make it easier to write
generalizing rules over syntactic arguments since these
are provided with a special link pointing to the governing
word by the syntax analysis module.
Marko Grobelnik and Dunja
Mladenić
Efficient Visualization of Large Text Corpora
Visualization is one of the important ways to
deal with large amounts of textual data. Text visualization
techniques are most frequently applied
when one needs to understand or
explain the structure and nature of a large quantity of
typically unlabeled and poorly structured textual data in
the form of documents.
The usual approach when dealing with text for
visualization is first to transform the text data into
some form of high-dimensional data and in the second step
to carry out some kind of dimensionality reduction down
to two or three dimensions, which allows the data to be
visualized graphically. There are several (but not too many)
approaches and techniques offering different insights
into the text data, such as showing the similarity structure of
documents in the corpora (e.g. WebSOM, ThemeScape),
showing a time line or topic development through time in
the corpora (e.g. ThemeRiver), showing frequent words and
phrases and the relationships between them (Pajek), etc.
One of the most important issues when dealing with
visualization techniques is scalability of the approach
to enable processing of very large amounts of the data.
In this paper, our contributions are two procedures for
text visualization working in linear time and space
complexity.
The first procedure is a combination of the K-Means
clustering procedure and a technique for aesthetic graph
drawing. The idea is first to build a certain number of
document clusters (with the K-Means procedure), which are in
the second step transformed into a graph structure
where more similar clusters are connected and bound more
tightly. The third step performs a kind of
multidimensional scaling by drawing the graph
aesthetically. Each node in the graph represents
a set of similar documents, described by the most
relevant and distinguishing keywords denoting the topic
of the documents.
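The first two steps of this procedure (clustering, then linking similar clusters) can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: it uses raw word counts, cosine similarity, fixed seed documents and an arbitrary similarity threshold, all of which are assumptions, and it omits the graph-drawing step entirely.

```python
import math
from collections import Counter


def vectorise(doc):
    """Bag-of-words vector of a document (raw counts, an assumed weighting)."""
    return Counter(doc.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans(vectors, seeds, iterations=10):
    """K-Means with cosine similarity; each pass visits every document once."""
    centroids = [vectors[i] for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)  # summed counts; cosine ignores the scale
                centroids[k] = total
    return assignment, centroids


def cluster_graph(centroids, min_similarity=0.1):
    """Connect clusters whose centroids are similar; edge weight = similarity."""
    edges = {}
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            sim = cosine(centroids[i], centroids[j])
            if sim >= min_similarity:
                edges[(i, j)] = sim
    return edges


def keywords(centroid, n=3):
    """Most frequent words of a cluster, used as its node label."""
    return [w for w, _ in centroid.most_common(n)]


docs = [
    "stock market shares",
    "market shares trading",
    "football match goal",
    "goal match referee",
]
vectors = [vectorise(d) for d in docs]
assignment, centroids = kmeans(vectors, seeds=(0, 2))
edges = cluster_graph(centroids)
```

For a fixed number of clusters and iterations, the clustering pass grows linearly with the number of documents, which is the property the abstract emphasises.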
The second procedure performs hierarchical K-Means
clustering, producing a hierarchy of document
clusters. In the next step the hierarchy is drawn into
a two-dimensional area split according to the
hierarchy splits. As in the first approach, each
cluster (group of documents) in the hierarchy is
represented by the set of its most relevant keywords.
Both approaches will be demonstrated on a number of
examples, visualizing e.g. Reuters text corpora (over 800k
documents) and various web sites.
Primož Jakopin
Extraction of lemmas from a web index wordlist
The paper deals with a large list of words and their
frequencies, as obtained from the main Slovenian web
index NAJDI.SI (http://www.najdi.si). The list (March
2002) contains 7,591,414 units with a total frequency of
578,745,747, obtained from 1,447,602 web pages where 33
languages could be identified (Slovenian 920,215 pages,
English 493,894, German 12,730, Croatian 4,892, Serbian 2,625,
Italian 2,530, French 2,063, Russian 1,851, Spanish 1,084,
Hungarian 848, Romanian 606, Polish 582, Danish 580,
Finnish 547, Czech 499, Portuguese 471, Japanese 383,
Latin 305, Dutch 248, Slovak 181, Swedish 161, Bosnian
147, Norwegian 82, Bulgarian 20, Albanian 18, Korean 17,
Ukrainian 10, Icelandic 4, Arabic 3, Macedonian 3, Chinese
1, Greek 1 and Thai 1). As expected, the wordlist is very
varied (hapax legomena amount to 49.3%), but it is
nevertheless probably the most complete source of
neologisms in Slovenian.
As it was not possible to use the context of the words
(the entire index was not available) or to check the list
manually (because of the size of the list and the fact that stemming
is used during the internet search), an algorithm with an
associated software utility has been devised which
separates Slovenian words in the list from other units (nonwords),
checks for noise and assigns lemmas to wordforms.
The algorithm is based on inflection rules for
Slovenian nouns, verbs and adjectives from the Dictionary
of Standard Slovenian (SSKJ), on word frequencies from the
80-million-word text corpus Nova beseda, whose
wordlist has been manually inspected and corrected, and
on word-tag frequencies from a 1-million-word subcorpus
which has been POS tagged.
In the paper preliminary results, obtained by the
application of the algorithm, are presented. An
illustration is shown in Table 1 with lemmas on besed- (word-
in English) from the NAJDI.SI wordlist.
*beseda 125829 | besedilnooblikovalen 2 | *besediti 5 | besedoljub 4
*besedar 2 | besedilnooblikoven 6 | besedivka 1 | besedoljubiteljski 1
besedaren 1 | besedilnoorganizacijski 2 | *besedje 332 | *besedolomen 1
besedati 2 | besedilnoskladenjski 2 | besedko 1 | besedoslovec 1
besedca 3 | besedilnost 35 | *besednica 4 | besedosloven 24
*beseden 8081 | besedilnotipski 9 | besedničar 1 | *besedoslovje 106
besedenje 3 | besedilnovrsten 6 | *besednik 33 | besedospreminjevalen 2
besedeslovje 1 | *besedilo 151994 | besednjačenja 3 | besedotvorec 3
*besedica 1248 | besedilodajalec 4 | *besednjak 1872 | *besedotvoren 426
*besedičenje 114 | besedilopisec 7 | besednjakov 32 | *besedotvorje 382
*besedičiti 34 | besedilopisen 2 | besednjaški 1 | besedotvorno 93
besedijana 6 | besedilopisje 1 | besednomotoričen 2 | besedotvornopomenski 3
besedijski 2 | besedilopiska 1 | besednooblikovalski 2 | besedotvornozgodovinski 2
besedika 1 | besediloslovec 2 | besednopomenski 8 | besedovadba 4
besedilce 37 | besedilosloven 94 | besednoreden 20 | *besedovalec 4
*besedilen 2054 | *besediloslovje 207 | besednoskladenjski 2 | besedovalen 5
besediljenje 8 | besedilotvorec 4 | *besednost 4 | *besedovanje 77
besedilnik 16 | besedilotvoren 26 | besednotvoren 2 | *besedovati 23
besedilnoanalitičen 1 | besedin 5 | besednoumetniški 3 | besedovec 4
besedilnoanalitski 1 | *besedišče 1093 | besednoumetnosten 10 | besedovrsten 1
besedilnogradivski 2 | besediščen 19 | *besednovrsten 80 | besedozvezen 2
besedilnolingvističen 1 | besediše 13 | besednozvezen 36 | *besedun 2
Table 1: 88 lemmas of wordforms on besed- from the
NAJDI.SI index (* = also in SSKJ)
There were 501 such wordforms, of which 221 were
noise; from the 280 remaining wordforms the 88 lemmas in
the table have been obtained. 25 lemmas, marked with an
asterisk, can also be found in the Dictionary of Standard
Slovenian; the other 63 are new words.
Svetla Koeva
The structure of hyperonym relations
Hyperonymy and hyponymy are inverse, asymmetric and
transitive relations, which correspond to the notion of
class-inclusion: if W1 is a kind of W2, then W2 is a
hyperonym of W1 and W1 is a hyponym of W2. The relation
implies that the hyperonym may substitute for the hyponym in
a context but not the other way round.
All phonetic strings expressing the same semantic
meaning are defined as a word W. The semantic meaning is
considered as a set of semantic components with no
specification of their number and features.
A word can have multiple co-hyponyms. Co-hyponyms
have to inherit the same set of semantic components from
their immediate hyperonym.
In WordNet, multiple co-hyperonyms have occasionally
been encoded. In the English database approximately 0.5
percent of words receive two hyperonyms (never more); in
our assessment only the operation of conjunction between them is
considered. It could be useful to encode multiple
hyperonymy relations more comprehensively.
In every hyperonymy/hyponymy relation n hyperonyms
could appear, where n ≥ 1. If n = 1, the set of semantic
components of the hyperonym is a proper subset of the
semantic components of the hyponym.
If n = 2, there are two options: union or
intersection of the hyperonyms. In the first option, the union of the semantic
components of two hyperonyms W1 and W2 is inherited by
their immediate hyponym W0, which means that W0 also inherits
the union of the higher hyperonyms W11 and W22, etc.
In the second option, the hyperonyms W1 and W2 have an
intersection which is equal to their common immediate
hyperonym W. The hyponym W0 inherits the semantic
components from either W1 or W2 (disjunction is applied)
and thus from the higher W. Consequently we accept nodes
in the structure that are not lexicalized but constructed
via implication.
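The cases n = 1 and n = 2 can be made concrete by modelling semantic components as sets. The component names below are invented purely for illustration and the helper function is an assumption of the sketch; only the set relations themselves come from the description above.

```python
def is_hyperonym_of(hyperonym, hyponym):
    """n = 1: the hyperonym's components are a proper subset of the hyponym's."""
    return hyperonym < hyponym


# Hypothetical component sets; W is the common immediate hyperonym of W1 and W2.
W = frozenset({"entity", "artifact"})
W1 = W | {"container"}
W2 = W | {"vehicle"}

# First option (union): W0 inherits the components of both W1 and W2.
W0_union = W1 | W2 | {"carries_liquid"}

# Second option (intersection/disjunction): W0 inherits from W1 or from W2,
# and in either case from their common hyperonym W.
W0_via_W1 = W1 | {"carries_liquid"}
W0_via_W2 = W2 | {"carries_liquid"}

union_case_holds = is_hyperonym_of(W1, W0_union) and is_hyperonym_of(W2, W0_union)
intersection_is_W = (W1 & W2) == W
disjunction_case_holds = (
    is_hyperonym_of(W, W0_via_W1)
    and is_hyperonym_of(W, W0_via_W2)
    and not is_hyperonym_of(W2, W0_via_W1)
)
```

In the second option W need not be a lexicalized word; it exists in the structure only as the intersection of W1 and W2, which is the constructed node the text refers to.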
If n ≥ 3, there are different combinations of hyperonyms in
terms of union and intersection. A single
hyperonym in a node can appear only once in the
structure. If a hyperonym shares the node with other
hyperonyms, it can appear more than once at different
levels.
Such an approach should avoid
some artificial hierarchy between words, while the
correspondence with the WordNet structure would remain. The
hierarchical structure should therefore be tested against
a corpus or by some task, to verify its quality.
Cvetana Krstev and Duško Vitas
Multilingual concordances using INTEX
Intex (Silberztein, 1993) is a flexible environment
for the development of linguistic resources and tools.
The user of Intex can develop his own applications using
functions incorporated in the system, available lexical
resources, such as a system of electronic dictionaries
for simple and compound words, and a graphical interface
for the construction of finite transducers.
The paper describes procedures involved in the
development of aligned concordances for a text (source)
and its translation (target). They are produced on the basis
of texts that were already processed with Intex
independently of one another (Krstev, 1994; Vitas 2002).
The main aim of this process is to identify lexical
elements that are translated 'literally' from the source
language to the target language, using concordances of
both texts. The role of these elements in newspaper texts
was described in depth in (Krstev, 2001). Once these
elements are identified, transducers can be constructed
that generate texts encoded with XML-like tags, as well
as auxiliary files containing pointers to such tags (Gross,
1997).
The paper describes an application that is being
developed for texts tagged in the abovementioned way,
which attaches to generated concordances of a text the
'corresponding context' in target language. The
'corresponding context' is defined as a segment between
XML-like tags with comparable attributes. This method
will be illustrated with aligned texts of Plato's
Republic, Voltaire's Candide, Flaubert's Bouvard et Pécuchet,
Verne's Le tour du monde en quatre-vingts jours, and a
sample of texts from the monthly "Le Monde
diplomatique".
Silberztein, M. D. (2001): INTEX, (http://www.bestweb.net/~intex/downloads/Manuel.pdf)
Krstev, C. and Vitas, D. (1994): Concordances of Aligned
Texts (in Serbian), XXXVIII konferencija za ETRAN, Niš,
pp. 229-230.
Vitas, D. and Krstev, C. (2002): Intex and aligned texts,
5th INTEX Workshop, Marseille, May 2-3, 2002.
Krstev, C. and Vitas, D. (2001): Lexically Driven
Alignment Based on Local Grammars, The 6th TELRI Seminar
"Multilingual Corpus Research", Bansko,
Bulgaria, 9 - 11 November 2001.
Gross, Maurice (1997). The Construction of Local Grammars.
In Roche, Emmanuel; Schabes, Yves (eds.) Finite State
Language Processing, Cambridge, Mass. : The MIT Press.
Kristīne LEVĀNE & Normunds
GRŪZĪTIS
Automatic Text Mark-up Facilities Building Latvian
Literature Corpus
The Latvian Corpus at the Artificial Intelligence
Laboratory of IMSC covers ca. 30 million running words; ca. 3.5
million running words are in the Latvian literature corpus,
which is the part of the corpus freely accessible on the web.
This part is not copyright protected, and the corpus of
the classics is of interest both to academic users and to
others. At the moment only simple navigation
facilities are available on the web, so the main task of
this project is to make the literature
corpus easier to use. The experience gained serves as a basis for the other
software tools of the Latvian Corpus, which are under
development.
Since February 2002 the software tools for the Latvian
literature corpus have been under development.
The concept, requirements and desired tasks have
been settled. Until now, there was no common text structure
standard and the content of the corpus was tagged in HTML.
First, a structural concept and standards based on XML
technologies were created. Second, software tools and
methods were developed for automatically transforming the
present corpus to the newly built tagging system.
The presentation will deal with solutions and issues
concerning this process.
DTD grammars are created for each Latvian literature
genre (poetry, drama and prose). The first DTD was made for
poetry, because this genre is the most complicated.
Different collections of poetry were examined, with the aim
of combining all their features in one grammar. In
order to detect problems of automation, a tagging tool was
developed. For the drama DTD, the grammar by J. Bosak (http://www.ibiblio.org/bosak)
for Shakespeare's plays was used, which is a widely used
example of drama structuring. The grammar for prose is
comparatively simpler. The current results of the literature
corpus transformation are available at www.ailab.lv/users/normundsg.
The next stage of the project is to create the whole
corpus system and to develop software tools (navigation,
concordance, statistics, and search) for end users. A web
interface will be provided, making it possible to
reach a wider audience and supporting the further
development of the literature corpus.
Michaela Mahlberg
The textlinguistic dimension of corpus linguistics
One of the major achievements of corpus linguistics is
that it stressed the necessity of revising the widely
accepted ideas of lexis and grammar. The established
separation of lexis and grammar is just an illusion that
is destroyed as soon as natural language is looked at.
New corpus linguistic models have led to a completely new
way of describing (the English) language (e.g. Sinclair
1999a, Hunston & Francis 1999). But the potential of
corpora can take us a step further. There is a dimension
to corpus linguistics which has not received enough
attention so far: the 'textlinguistic dimension'. If we
look at a text as a communicative unit, the meanings of
words in a given text can comprise more than what is
normally listed in dictionaries. Functions such as giving
emphasis or expressing attitudes and feelings can be part
of the meaning of words in text. Corpus data suggests
that there are groups of words which tend to share
certain textlinguistic functions that contribute to the
meanings of these words. General nouns like thing, way,
man, or move form one of these groups. Among the
functions that characterise these nouns we find the
'support function'. A general noun fulfils the support
function if it occurs in a construction where it does not
contribute much meaning by itself, but helps to represent
information according to the communicative needs of the
speaker/writer and hearer/reader. The support by a
general noun can create various effects. For instance, a
general noun can help to structure a sentence according
to the information principle, as in: The man who played
that part was Norman Lumsden, and [...] (BNC). Here, the
clause begins with the general noun man, whose
postmodifier refers back to given information which is
then supplemented by new information towards the end so
that the information load increases gradually. In other
cases, the support by a general noun can be interpreted
as an economic or effective way of packing information.
The general noun way can, for example, introduce both
finite and non-finite postmodifying structures into
clauses (way in which/of/to ...) and thereby contribute
to "the flexibility and extendibility of the syntax"
(Sinclair 1999b: 169), as in: The way in which specialist
health services for the elderly are provided nationally
varies considerably (BNC). The concept of the support
function results from the interpretation of corpus data
from the BNC and the Bank of English.
References
Hunston, Susan & Gill Francis (1999): Pattern grammar:
a corpus-driven approach to the lexical grammar of
English, Amsterdam: Benjamins.
Sinclair, John (1999a): "The Lexical Item", In
Contrastive Lexical Semantics, E. Weigand (ed.),
Amsterdam/Philadelphia: Benjamins, 1-24.
Sinclair, John (1999b): "A Way with Common Words",
In Out of Corpora. Studies in Honour of Stig Johansson,
Hilde Hasselgård & Signe Oksefjell (eds.), Amsterdam:
Rodopi, 157-179.
Rūta Marcinkevičienė
and Vidas Daudaravičius
Detection of the boundaries of collocation
There are methods to detect and extract collocations
from a text, such as mutual information, which helps to
identify two words occurring in conjunction. However,
this method does not work for longer collocations. In the
case of multiword units it is hard to detect the exact
boundaries of a collocation, even if it has clear-cut
boundaries and is not fuzzy at the edges.
A method for detecting the boundaries of
collocations in large corpora is suggested. A
collocation is assumed to consist of a
sequence of co-occurring words. It is detected on the basis
of high-frequency word pairs that form the collocation
itself or, in the case of longer collocations, part of it.
Then the boundaries of a particular collocation are
detected by measuring the variety of possible contextual
partners to the left and to the right of the collocation.
Low variety signals the continuation of the same
collocation and high variety of contextual partners is
the sign of the boundary of a collocation.
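The boundary test described above could be sketched roughly as follows. This is only an illustrative reading of the abstract, not the authors' implementation: the function names, the variety measure (distinct neighbours divided by occurrences) and the threshold max_variety are all assumptions.

```python
from collections import Counter

def neighbour_variety(tokens, phrase, side="right"):
    """Return (variety, neighbour counts) for the words adjacent to `phrase`.

    Variety is an assumed measure: distinct neighbours / occurrences.
    """
    n = len(phrase)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            j = i + n if side == "right" else i - 1
            if 0 <= j < len(tokens):
                neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    return (len(neighbours) / total if total else 1.0), neighbours

def extend_collocation(tokens, seed, max_variety=0.5):
    """Grow a frequent word pair until neighbour variety signals a boundary."""
    phrase = list(seed)
    for side in ("right", "left"):
        while True:
            variety, neighbours = neighbour_variety(tokens, phrase, side)
            if not neighbours or variety > max_variety:
                break  # high variety: boundary of the collocation reached
            best = neighbours.most_common(1)[0][0]
            phrase = phrase + [best] if side == "right" else [best] + phrase
    return phrase

# Invented toy corpus, for illustration only.
tokens = ("he kicked the bucket yesterday . she kicked the bucket today . "
          "they kicked the bucket again .").split()
print(extend_collocation(tokens, ["kicked", "the"]))
```

On a real corpus the threshold would need tuning, and a minimum frequency for the seed pair would probably be required before the variety measure becomes meaningful.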
Soledad Garcia Martinez and
Anna Fagan
Academic Conflict in Research Articles: A Cross-Disciplinary
Study of Chemistry and Tourism Articles
The aim of this communication is to present the
results of a Research Project carried out in the Faculty
of Modern Languages at the University of La Laguna (Tenerife)
on the way criticism is presented in Research Articles to
the scientific community.
In today's competitive academic world, the pressure to
publish is continually increasing, and, in order to
justify publication of their research articles (RA),
writers must create a research space which permits them
to present their new claims to the other members of the
academic community. This mainly implies the indication of
a knowledge gap and/or the criticism of any weak point in
the previously published work by other researchers or the
academic community itself. The latter phenomenon has been
termed academic conflict, a critical speech act whose
rhetorical expression ranges from blunt criticism to the
use of subtle hedging devices, aimed at an individual or
the community in general.
The study of citation practices across the disciplines
carried out by the members of our Project has revealed a
dichotomy between the so-called "hard" and
"soft" sciences; thus there may also be
significant interdisciplinary differences in the
rhetorical strategies used to express AC and in the
frequency of the critical speech act itself. In this
study we discuss the development of the taxonomy we have
created to describe the rhetorical choices writers use
when making the critical speech act, and the application
of this taxonomy to 50 RAs from two distinct disciplines:
Tourism, representing the soft disciplines, and
Chemistry, the hard disciplines.
The application of this taxonomy, which categorises AC
according to directness, writer mediation, and the target
of the criticism, has yielded some surprising results.
These findings may indicate, inter alia, that a more
delicate taxonomy should be applied to the study of AC.
Goran Nenadić, Irena Spasić,
and Sophia Ananiadou
What Can Be Learnt and Acquired from Non-disambiguated
Corpora: A Case Study in Serbian
Every NLP system needs to incorporate a certain amount
of relevant linguistic knowledge acquired from theory and/or
corpora. One of the main challenges is the efficient
customisation of such systems to a new task or domain by
automatic learning and acquisition of specific
constraints [5, 9]. In this paper we discuss possible
approaches to learning various lexical and grammatical
features from non-disambiguated corpora in a
morphologically rich language such as Serbian. Unlike
reliable tagging tools for such languages [6, 8],
electronic texts are widely available, and therefore, we
concentrate on learning from initially tagged [7] but
non-disambiguated text.
We present three case studies based on the computation
of minimal representation (i.e. intersection) of features
from non-disambiguated corpora. Each case concentrates on
learning a different type of linguistic information.
In the first case, we have used a genetic algorithm
approach [4] to learn cases required by a specific
preposition. We computed the minimal set of cases for
each preposition so that every corresponding (non-disambiguated)
NP from the learning corpus keeps at least one case from
the set (Figure 1). The results coincide with the
corresponding (theoretical) grammars, thus proving that
this feature can be learnt from corpora. Further, the
learning method is unsupervised, as no prior knowledge
has to be provided.
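The computation in this first case amounts to finding a smallest set of cases that "hits" every ambiguous NP reading. A minimal sketch follows, with two caveats: it swaps the authors' genetic algorithm [4] for a plain exhaustive search (feasible because there are only seven cases), and the function name, case labels and example readings are invented for illustration.

```python
from itertools import combinations

# Assumed labels for the seven cases; not taken from the paper.
CASES = ("nom", "gen", "dat", "acc", "voc", "ins", "loc")

def minimal_case_set(np_readings, cases=CASES):
    """Smallest set of cases sharing at least one case with every NP reading.

    `np_readings` holds one set of possible cases per non-disambiguated NP
    found after a given preposition in the learning corpus.
    """
    if not np_readings:
        return set()
    for size in range(1, len(cases) + 1):
        for candidate in combinations(cases, size):
            chosen = set(candidate)
            if all(chosen & reading for reading in np_readings):
                return chosen
    return set()  # some NP had no reading at all

# Invented toy readings for one hypothetical preposition.
readings = [{"gen", "ins"}, {"dat", "ins", "loc"}, {"ins"}]
print(minimal_case_set(readings))
```

Exhaustive search returns only one of possibly several minimal sets; a genetic algorithm, as used by the authors, may behave differently on noisy data.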
In the second case, we have used a general NP
structure and obligatory agreements between NP
constituents [2], to learn structures for specific named
entities (namely names of companies, educational and
governmental institutions). The initial set of entities
was identified by using specific designators (e.g.
'preduzece' (Eng. company)) as anchors [3]. Then, we
computed a minimal set of lexical and morpho-syntactic
features that were inherent for every NP from the set,
producing lexicalised local grammars [1] that describe
the structure of specific types of named entities.
Finally, we used particles (e.g. 'kao' (Eng. like)) as
anchors to learn frozen, multiword adverbial expressions
(e.g. 'kao grom iz vedra neba' (Eng. surprisingly)).
Simple expressions like 'kao NP' (e.g. 'kao konj' (Eng.
hardly)) were not considered. The remaining expressions
are "minimised" by conflating some grammatical
features (e.g. pronouns in 'kao da su PRON:dative sve
ladje potonule' (Eng. disappointedly)).
As these studies show, some basic grammatical
constraints (like cases) and specific lexical preferences
(like lexicalised NP structures and multiword adverbials)
can be learnt automatically even in a morphologically
rich language. However, although the precision of grammar-related
constraints is promising, the broader coverage of lexical
learning is still a challenge.
References
[1] Gross, M. (1997): The Construction of Local Grammars,
in: Roche, E. & Y. Schabes (eds.): Finite State
Language Processing. Cambridge, MA, The MIT Press, pp.
329-352.
[2] Nenadic, G., D. Vitas (1998): Using Local Grammars
for Agreement Modeling in Highly Inflective Languages, in
Sojka, P. et al. (Eds): Text, Speech, Dialogue,
Proceedings of TSD'98, Masaryk University, Brno, the
Czech Republic, pp. 97-102
[3] Nenadic, G., I. Spasic (2000): Recognition and
Acquisition of Compound Names from Corpora, in: NLP-2000,
Lecture Notes in Artificial Intelligence 1835, Springer
Verlag, Berlin.
[4] Nenadic, G., I. Spasic, S. Ananiadou (2002):
Reducing Lexical Ambiguity in Serbo-Croatian by Using
Genetic Algorithms, in Proceedings of Fourth European
Conference on Formal Description of Slavic Languages,
FDSL-4, Germany, 2001
[5] Riloff, E. (1996): Automatically Generating
Extraction Patterns from Untagged Text, in Proceedings of
the Thirteenth National Conference on Artificial
Intelligence (AAAI-96), pp. 1044-1049
[6] Tadic, M. (2002): Building the Croatian National
Corpus, in Proceedings of LREC-3, 3rd International
Conference on Language, Resources and Evaluation, Las
Palmas, Spain, 2002
[7] Vitas, D (1993): Mathematical Model of Serbo-Croatian
Morphology (Nominal Inflection), PhD thesis, Faculty of
Mathematics, University of Belgrade (in Serbo-Croatian)
[8] Vitas, D., C. Krstev, G. Pavlovic-Lažetic, G.
Nenadic (1998): Recent Results in Serbian Computational
Lexicography, in Monograph on 125th anniversary of the
Faculty of Mathematics. University of Belgrade, pp. 111-128
[9] Yangarber, R., R. Grishman (2000): Machine Learning
of Extraction Patterns from Un-annotated Corpora, in
Proceedings of the 14th European Conference on Artificial
Intelligence: ECAI-2000 Workshop on Machine Learning for
Information Extraction, Berlin, Germany
Júlia Pajzs
The corpus based comparison of the meaning of the
word loyal in English and lojális in
Hungarian
Several years ago – in a special communicative
situation – I suddenly realised that there is a
difference in the meaning of the word loyal in English
and the corresponding word lojális in Hungarian. While
its core meaning in English – according to the OALD
2002 – is “remaining faithful to sb/sth and supporting
them or it: a loyal friend/supporter She has always
remained loyal to her political principles”, in
Hungarian the meaning is surprisingly different, despite many
similarities: “Valamely politikai rendszerhez,
ill. államhoz hű <személy> ill. ilyenre jellemző
lojális állampolgár, nyilatkozat. | Vkihez, (kül.
feletteséhez) v. valamilyen közösséghez ragaszkodó,
és hozzá méltányos. lojális vkihez, vki iránt. |
Becsületes, jóhiszemű. Ez nem volt lojális eljárás.”
(Concise Dictionary of Hungarian, 1972) ’A person who
is faithful to sth or sb, especially to a political
system or state, loyal citizen, loyal statement,
| To be fair or faithful to sb (especially to one’s
boss) or to a collective. | Honest, unsuspecting. This
was not a loyal procedure.’ The above quotations
make it clear that although there is a strong
semantic similarity between the Hungarian and the English
adjective, there is a difference between the most frequent
usages of these words. While in English the meaning is
unambiguously positive, in Hungarian it is still not
considered necessarily positive to be loyal either to the
current state/government or to your boss, or to a common
aim. In the final version of my paper I would like to
show several corpus examples from both languages, using
the available BNC, COBUILD corpora and two Hungarian
corpora: The Historical Corpus of Hungarian, which
contains 23 million running words from 1772-1992, and the
Hungarian National Text Archive: a 150 million running
word synchronic corpus. The latter comparison will give me
a chance to investigate if the usage of this word has
changed at all in the past years, since the transition.
The corpus-based analysis of the most frequent
collocates of these words will certainly help a lot in
identifying the underlying similarities and differences
in the usage of these words. This examination can serve as
a model for similar studies on the parallel investigation
of languages and cultures.
John Sinclair
The Mystery of Meaning
The apparent paradox that I would like to explore is
that
(a) language interpreted as a formal system cannot
account for the creation of meaning, and
(b) language cannot acquire meaning from outside itself,
but must create it by the systematic ordering of items.
Assuming that these statements can be supported, we
must look for the ways in which language creates meaning
within itself, but not through its organisation as a
formal system.
The clues are to be found in the evidence provided by
corpora. Relatively independent items form meaningful
units by coselection; frequent collocation adds meaning
through "contagion". Individuals compare
meanings through averral, from which truth value is
derived. Meanings are related to each other inside the
language system through paraphrase, which is the non-formal
process that allows language to retain aspects of a
formal system without submitting to the full rigours of
it.
Milena Slavcheva
Defining Meaningful Patterns in the Group of the
Predicate
In most cases, when information is extracted from
large corpora, the units that are searched for belong to
the category of nominals: proper names, common nouns and
noun groups are distinguished and interpreted
linguistically. In my work, the target of exploration and
formal description is language constructs in Bulgarian
that are identifiable as verb complexes. They form the
first layer of meaningful patterns within the group of
the predicate.
The modeling of the sentential structure is performed
in the setting of the BulTreeBank project [Simov et al.
2002a], where relations are defined and an interface is due
to be established between the formal representation
necessary for large-coverage computing techniques like
chunk parsing and the sophisticated HPSG-conformant [Pollard
and Sag 1994] linguistic descriptions attached to the
sentences in the treebank. The segmentation of the verb
complex into reliable patterns is based on the philosophy
of easy-first parsing outlined by Abney in [Abney 1991]
and [Abney 1996]. The parsing technique uses reliable
patterns consisting of categories and regular expressions
that enter finite-state automata operating in a
so-called cascade, that is, a sequence of levels of phrase
recognition. The regular grammar cascade for the verb
complex consists of two successive levels of phrase
recognition: on the basis of smaller segments defined on the
first level, bigger segments are defined on the second
level by the application of corresponding groups of
pattern-matching rules. The regular grammar engine used
is part of the software environment provided by the CLARK
system [Simov et al. 2001].
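The idea of a two-level cascade can be illustrated with a small sketch. It is not the regular grammar engine of the system cited above: it uses Python's re module over strings of tags, and the tag names (PART, CL, AUX, V), the chunk labels and the rules are hypothetical, chosen only to show how second-level rules build on first-level segments.

```python
import re

def apply_level(nodes, rules):
    """One cascade level: wrap each tag sequence matching a rule in a chunk.

    `nodes` is a list of (tag, payload) pairs; each pattern is a regular
    expression over the space-terminated sequence of tags.
    """
    for label, pattern in rules:
        tags = "".join(f"{tag} " for tag, _ in nodes)
        out, pos = [], 0
        for m in re.finditer(pattern, tags):
            # Convert character offsets back to node indices.
            start = tags[:m.start()].count(" ")
            end = tags[:m.end()].count(" ")
            out.extend(nodes[pos:start])
            out.append((label, nodes[start:end]))
            pos = end
        out.extend(nodes[pos:])
        nodes = out
    return nodes

def cascade(nodes, levels):
    """Apply the levels in order; later levels see earlier chunks as tags."""
    for rules in levels:
        nodes = apply_level(nodes, rules)
    return nodes

# Hypothetical rules: level 1 groups clitics and auxiliaries,
# level 2 combines them with the verb into one verb complex.
LEVEL1 = [("CLITICS", r"(?<!\S)(?:CL )+"), ("AUXGROUP", r"(?<!\S)(?:AUX )+")]
LEVEL2 = [("VERB_COMPLEX", r"(?<!\S)(?:PART )?(?:CLITICS |AUXGROUP )*V ")]

# Illustrative tagged input (transliterated), tags assigned by hand.
sentence = [("PART", "shte"), ("CL", "mu"), ("CL", "ya"), ("V", "dam")]
print(cascade(sentence, [LEVEL1, LEVEL2]))
```

Because each level only sees the labels produced by the previous one, the second-level rule stays short even when the first-level segments vary in length.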
The sentence elements that immediately surround the
verb and form the first layer of meaningful patterns fall
into two main groups: 1) elements that are generally
considered pronominal clitics; 2) auxiliary verb forms
and function words. The interdependence between the
very rich tense and mood paradigm of Bulgarian verbs
and the idiosyncrasies of clitic behaviour leads to
the existence of verb complexes with different numbers and
types of elements, which occur in different combinations and
generate a variety of semantic connotations.
The patterns in the verb complex are defined in such a
way as to be compatible with a semantic model of the
Bulgarian temporal system developed by Gerdzhikov [Gerdzhikov
1999] where four types of tenses are distinguished: 1)
non-relative, non-perfect; 2) relative, non-perfect; 3)
non-relative, perfect; 4) relative, perfect.
In this way the segmentation of the verb complex at
the level of chunk parsing is interfaced with the feature
structure descriptions of the sentences in the treebank
of Bulgarian where syntactic and semantic information is
incorporated.
References
[Abney 1991] Abney, S. Parsing By Chunks. In: R. Berwick,
S. Abney and C. Tenny (eds.) Principle-Based Parsing,
Kluwer Academic Publishers.
[Abney 1996] Abney, S. Partial parsing via finite-state
cascades. In: J. Carroll (ed.) Proceedings of the ESSLLI'96
Robust Parsing Workshop.
[Gerdzhikov 1999] The Temporal Orientations that Build
the Meanings of the Bulgarian Verb Tenses. (in Bulgarian).
In: Balgarski ezik i literatura, Number 2-3, 1999.
[Pollard and Sag 1994] Pollard, C., I. Sag. Head-Driven
Phrase Structure Grammar. CSLI Publications, 1994.
[Simov et al. 2001] Simov, K., Z. Peev, M. Kouylekov, A.
Simov, M. Dimitrov, A. Kiryakov. ClaRK - an XML-Based
System for Corpora Development. In: Proceedings of Corpus
Linguistics 2001 Conference, pp. 558-560.
[Simov et al. 2002a] Simov, K., P. Osenova, M. Slavcheva,
S. Kolhovska, E. Balabanova, D. Doikov, K. Ivanova, A.
Simov, M. Kouylekov. Building a Linguistically
Interpreted Corpus of Bulgarian: the BulTreeBank. In:
Proceedings of LREC 2002, Canary Islands, Spain.
Krešimir Šojat, Sanja Fulgosi
and Božo Bekavac
Identification of terminological varieties in legal
texts translations
The paper investigates lexical and grammatical
varieties of terminological translation equivalents.
Parallel corpora provide a useful resource for
identifying terms in the source language and for checking
consistency of translations of terms in the target
language texts where no variation in translation equivalents
(TEs) is permitted. An XML-based search tool is applied to a
sentence-aligned parallel corpus comprising several
original EU legal documents and their Croatian
translations. The input consists of EUROVOC terminology (a glossary
of terms in English and their translations into Croatian
which should be used with 100% consistency). The tool
compares the translation equivalents set
in advance by EUROVOC with the actual varieties of
translation equivalents found in the Croatian translated
texts. The tool works in a sequence of comparison steps.
Firstly, input English sentences are compared with the
English side of the EUROVOC glossary. After locating
terms in original English sentences, the next step is a
further comparison between corresponding Croatian
sentences translated by human translators and a matching
pair of terms from the glossary. Sentences where terms in
translations do not match with already established and
expected term translations from the glossary are marked
and left for manual examination. Differences at the
lexical and grammatical level resulting from inconsistent
use of terminology by Croatian translators will be
presented, and the typology and frequency of those
varieties will be discussed. We assume this kind of
corpus evidence will be a practical guide for translators
to produce terminologically consistent translations where
such a requirement is an absolute necessity, as in
translations of legal texts.
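The sequence of comparison steps described in this abstract might be sketched as below. This is a simplified illustration rather than the authors' XML-based tool: sentences are plain strings instead of XML elements, matching is by exact string, and the function name, glossary entries and sentence pairs are invented (they are not taken from EUROVOC).

```python
import re

def check_term_consistency(aligned_pairs, glossary):
    """Flag sentence pairs whose translation lacks the expected term.

    Step 1: look for glossary terms in the source sentence.
    Step 2: check that the expected equivalent occurs in the translation.
    Returns (sentence index, source term, expected equivalent) triples.
    """
    flagged = []
    for index, (source, target) in enumerate(aligned_pairs):
        for term, expected in glossary.items():
            in_source = re.search(r"\b" + re.escape(term) + r"\b",
                                  source, re.IGNORECASE)
            if in_source and expected.lower() not in target.lower():
                flagged.append((index, term, expected))
    return flagged

# Invented toy glossary and sentence pairs, for illustration only.
glossary = {"customs duty": "carina"}
pairs = [
    ("The customs duty is abolished.", "Carina se ukida."),
    ("Customs duty applies.", "Pristojba se primjenjuje."),
]
print(check_term_consistency(pairs, glossary))
```

Exact string matching would also flag inflected forms of the expected Croatian term, which is one reason why flagged sentences are left for manual examination and why grammatical varieties need to be distinguished from lexical ones.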
Marko Tadić, Božo Bekavac and
Ivana Simeon
Marking terms in the process of translation of the
Acquis Communautaire into Croatian
As a candidate for joining the EU, Croatia faces a
challenging task: translating the Acquis Communautaire,
an extensive body of legislation comprising approximately
150 000 pages. To speed up this process and to increase
the consistency of translation, we developed a tool to
suit the needs of translators. The input consists of the
original documents, converted into the XML format, which
is the standard accepted by the corpus linguistic
community today. The EUROVOC glossary (the official EU
legislative terminology lexicon translated to Croatian,
Bratanić (2000, 2001)) is also converted to XML and
stored as a separate document. The tool searches the
source English document, identifies the English terms
existing in EUROVOC, marks these terms in the original
document and offers the established Croatian translation
equivalents. The processing is based on traversing XML
documents with extensive usage of XML Document Object
Model, which provides a range of possibilities for
different output formats. The standard output is an HTML
document, being one of the most used and widespread
formats today and easily readable on any platform, with
terms marked and their Croatian TEs available at the
user's request. The trial processing was carried out on a
sample document, namely the Stabilization and Association
Agreement between EU and Croatia. The authors argue that
this tool provides a method for significantly increasing
the consistency of translations (the approximate number of
translators engaged by the Ministry of European
Integrations of the Republic of Croatia exceeds 100) and
reducing the time human translators need to fulfil
the task. This tool will also simplify the second phase
of the translation process - the revision of the
translated documents. Furthermore, documents with terms
marked can also be used in any other type of
terminological research.
Fabio Tamburini
Quantitative Analysis of Word Distributional
Behaviour. Italian Adverbs of Manner: a case study
This paper intends to present the main lines of work
in progress based on the exploration of large corpora as
a source of quantitative information about language. The
focus is on some problems relating to the morpho-syntactic
annotation of corpora and on some statistical techniques,
showing their effectiveness on a specific case study.
The work is mainly based on CORIS/CODIS, a corpus of
contemporary written Italian developed at CILTA -
University of Bologna. It is a synchronic 100-million-word
corpus and is being lemmatised and annotated with part-of-speech
(POS) tags. Usually the set of tags is pre-established by
the linguist, who uses his/her competence to identify the
different word classes. The very first experiments we
made revealed that the traditional part-of-speech
distinctions used in Italian are often inadequate to
represent the syntactic features of words in context,
especially for complex classes, such as adverbs,
pronouns, prepositions and conjunctions.
In the literature there is wide acceptance of the
distinction, mainly based on the concept of open and
closed sets of words, between lexical words (content words)
and grammatical words (or function words). Thus, it is
possible to postulate four main categories of words,
three belonging to the set of lexical words (nouns,
verbs, qualitative adjectives) and one large class that
collects all the grammatical words (and also adverbs of
manner). Using this distinction, a subpart of CORIS has
been automatically tagged and statistical techniques have
been applied for retrieving context information for some
target words, obtaining a distributional fingerprint for
every word considered in this study. The approach is
based on the hypothesis that two syntactically and
semantically different words will usually appear in
different contexts and will have different fingerprints.
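A distributional fingerprint of this kind could be approximated as in the following sketch. The abstract does not specify the statistical techniques used, so everything here is an assumption: the fingerprint is taken to be a count of (relative position, tag) pairs in a small window, the comparison is cosine similarity, and the tag labels and toy data are invented.

```python
import math
from collections import Counter

def fingerprint(tagged, target, window=2):
    """Count (offset, tag) pairs around every occurrence of `target`.

    `tagged` is a list of (word, tag) pairs, e.g. with the four coarse
    categories N, V, A (lexical) and G (grammatical).
    """
    fp = Counter()
    for i, (word, _) in enumerate(tagged):
        if word != target:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tagged):
                fp[(offset, tagged[j][1])] += 1
    return fp

def cosine(a, b):
    """Cosine similarity between two fingerprints (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented toy sequence; the words and tags are placeholders.
tagged = [("x", "G"), ("piano", "G"), ("y", "V"),
          ("z", "N"), ("forte", "G"), ("w", "A")]
print(cosine(fingerprint(tagged, "piano", 1), fingerprint(tagged, "forte", 1)))
```

Pairwise similarities computed this way could then feed a clustering procedure; the choice of window size and of similarity measure would need to be validated against the corpus.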
Some adverbs of manner have been chosen and different
clustering techniques have been applied to the
corresponding fingerprints. The main tools used were
Hierarchical Clustering and Self-Organising Maps. The
clusters derived by applying such techniques suggested a
clear syntactic behaviour of the adverbs considered. It
emerged, as stated in various bibliographic references,
that Italian adverbs of manner tend to modify sentences
or to modify verbs and adjectives. These two syntactic
schemas act as extreme poles of a continuum in adverbs-of-manner
behaviour. Some adverbs prefer to modify mainly
sentences, but sometimes also verbs or adjectives. Other
adverbs prefer to modify mainly verbs or adjectives, and
in rare cases also sentences. Moreover, the adverbs of
manner that prefer to modify sentences cluster very well
with a class of words that, in previous work, has been
defined as soft connectives (Tamburini et al. 2002).
The global behaviour of each adverb of manner can thus
be represented as a preference for modifying
other linguistic objects that can be expressed by
probability values. This corresponds to what is required
by stochastic part-of-speech taggers. The tagging
procedure will assign to each adverb of manner both
categories, disambiguating them using the derived
probabilities. This appears to be a suitable way of
managing this kind of linguistic phenomenon.
Dan Tufiş
Interlingual alignment of parallel semantic lexicons
by means of automatically extracted translation
equivalents
Multilingual alignment of semantic lexicons (lexical
ontologies) usually relies on some kind of language-independent
conceptualization of their semantic content. In
EuroWordnet and its follow-up BALKANET, such a
conceptualization is called InterLingual Index (ILI). Two
meanings in two different language-specific semantic
lexicons which are mapped onto the same conceptual
representation are taken to be semantically equivalent or,
to put it otherwise, linguistic realizations of the same
concept. The usual procedure assumes that monolingual
ontologies are independently mapped, according to a
commonly agreed protocol, on the interlingual index.
Usually, this lexical projection is carried out by humans
and its accuracy is affected by their lexicographic
experience, subjectivity, and tiredness. However, the
most important element that affects the projection
consistency is the difference in granularity between a
given lexical ontology and the interlingual index.
EuroWordnet and BALKANET adopted as the Interlingual
Index a set of unstructured concepts corresponding to the
meanings explicitly recorded in WordNet 1.5, plus a few
concepts lexicalized in other languages. In this case, as
noted by several researchers, the sense distinctions in
the Interlingual Index are too fine-grained to
expect an accurate and consistent mapping of multiple
language-specific ontologies. Recently, a lot of interest
has arisen around the idea of so-called "soft concept
clustering" of the Interlingual Index. The idea is
that instead of defining the crosslingual semantic
equivalence based on lexical projection over the same ILI
record, one should consider lexical projection over the
same cluster of ILI records. This weaker definition of
crosslingual semantic equivalence is more realistic and
easier to meet and operationalize for computer
applications.
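The difference between the two definitions of crosslingual equivalence can be made concrete with a short sketch. The record identifiers, the cluster contents and the function names are hypothetical; only the contrast between "same ILI record" and "same cluster of ILI records" comes from the text above.

```python
def strictly_equivalent(ili_a, ili_b):
    """Strict definition: both senses project onto the same ILI record."""
    return ili_a == ili_b

def softly_equivalent(ili_a, ili_b, clusters):
    """Weaker definition: the two ILI records are identical or share a cluster."""
    return ili_a == ili_b or any(ili_a in c and ili_b in c for c in clusters)

# Hypothetical soft clusters of ILI record identifiers.
clusters = [{"ILI-0001", "ILI-0002"}, {"ILI-0103", "ILI-0104", "ILI-0105"}]

print(strictly_equivalent("ILI-0001", "ILI-0002"))          # same record?
print(softly_equivalent("ILI-0001", "ILI-0002", clusters))  # same cluster?
```

Under the weaker definition, two language-specific senses mapped to neighbouring but distinct records are no longer counted as a mismatch, which is what makes the criterion easier to meet.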
We propose a method and its implementation both for
checking the consistency of the monolingual mappings over the
Interlingual Index and for pinpointing the concepts in
the Interlingual Index that should be "soft-clustered".
The methodology builds on our most recent results in
sense clustering using automatic extraction of
translation equivalents, and on the recording of human
failures in consistent mapping of language-specific
senses onto Interlingual Index records.
The parallel corpus we used in our experiments is the
"1984" corpus, based on Orwell's novel, developed in
the MULTEXT-EAST project and further cleaned up in the TELRI
and CONCEDE projects.
The paper will show how these resources are used in
checking the consistency of the mapping over the
Interlingual Index of several lexical ontologies as built
in the BALKANET project and how this checking could
provide hints for ILI soft clustering.
Rafal S. Uzar
Corpora and Translation Quality Assessment
The paper describes a new corpus project under
development at the University of Łódź. The Department
of English prides itself on the quality of its students
of translation; however, there is always room for
improvement. With this aim in mind the author has begun
work on a corpus research project which will give the
departmental translator trainers a different view both of
their work and their students' work.
Within practical applications of language corpora and
second language learning, corpora can be loosely divided
into three groups:
a) monolingual
b) bilingual (parallel or comparable)
c) learner
For the purposes of translation training each of
these corpora has its advantages and disadvantages.
The project described in this paper will utilize all
three kinds of corpus in an attempt to gain a different
perspective on translation and the process of translation.
The PELCRA project was set up in 1997 to produce extensive
corpus resources at both a local and national level. The
project consists of a variety of corpora:
1. A Polish monolingual corpus
2. An English learner corpus
Translator trainees are free to make use of both types
of corpora found in PELCRA as a whole, using the learner corpus as
a guide to avoiding learner errors or erroneous learner
tendencies and the Polish national corpus as a reference
point. The students also have access to
the BNC and in this way have at hand two monolingual
reference corpora for both of the languages they are
working in.
Students translate from the foreign language to the
mother tongue, which is generally considered the norm, and
are encouraged to attempt translation from the mother
tongue into the foreign language (i.e. Polish into
English). It is with the latter that the learner corpus
comes into its own, becoming a useful tool and guide for
the translator.
Extensive work by the PELCRA team (e.g. Leńko-Szymańska,
2000; Lewandowska-Tomaszczyk, McEnery, Leńko-Szymańska,
2000) has given our students valuable clues to dangerous
areas in the production of FL texts.
Our trainees have access to a wide range of
translation. However, the need for a more specialized
learner corpus, one created with translation in mind,
seemed apparent. The paper describes the production of a
learner translation corpus which allows the analysis of
errors and patterns specific to translation and student
translation.
Marek Veber and Karel Pala
CED - Program for Corpora Editing
The paper is concerned with the editing of corpora, and
tagged corpora in particular. It introduces a specialised
corpus editor (the program CED) and a library for working with
corpora (libkorplib.a).
The CED system as a whole offers the following functions
and properties:
- journaling of changes in corpora,
- editing of corpus texts,
- working with a list of localities in the course
of making aggregate corrections,
- co-operating with a corpus manager.
The library libkorplib.a provides an effective
interface regardless of the physical data storage, thus
it is possible to access data in various formats (text
files, SQL databases etc.).
The tool can be used for editing any corpora, making
quite complicated corrections in them, modifying tagged
corpora after adjustments caused by changes in the
respective tagsets. It can also closely cooperate with
other external programs like morphological analyzers or
morphological databases (or other dynamic resources) from
which the appropriate options for the desired changes can
be selected. The tool was recently (2001-2002)
tested in the NLP Laboratory at FI MU during the development of
the grammatically tagged corpus DESAM: we used it
for correcting both tagging errors and errors such as
split words or misprints, and also for tasks
involving the marking of sentence boundaries and other aggregate
changes.
One of the main purposes of the CED system is to
considerably speed up the development of tagged corpora
with the number of mistakes reduced to a reasonable
minimum. These expectations were fulfilled in the
course of building the corpus DESAM, and the same is now
being experienced with its larger version DESAM2.
Jernej Vičič and Tomaž Erjavec
Corpus Driven Machine Translation
The paper presents an experiment in automatic
translation from Slovenian to English based on
SMT, Statistical Machine Translation. EGYPT is the result
of a summer workshop at Johns Hopkins University, and is
currently the most widely used toolbox for processing
bilingual parallel corpora for translation system
production. The IJS-ELAN corpus contains 1 million words
of annotated parallel and sentence-aligned Slovene-English
texts, with both languages word-tagged with
context-disambiguated morphosyntactic descriptions and lemmas.
The corpus is encoded in XML, according to the TEI
Guidelines P4.
A Slovene to English translation system was produced
using the EGYPT toolbox and the IJS-ELAN corpus. We
discuss the motives for source/target language selection,
i.e. why we chose to train the system for Slovenian to
English translation rather than vice-versa.
We performed a basic evaluation of this system. The
initial model was then extended using corpus annotations,
in particular the context-disambiguated lemmas, which
abstract away from the rich inflections of Slovene. In
this way the main disadvantage of our model, namely that
it is derived from a corpus of relatively modest size, is,
at least to some extent, overcome.
The new translations were evaluated and results
compared with the translations of the initial system. The
translations were evaluated using two methods:
- WER, word error rate, is a variant of edit distance.
Translations acquired using our system are compared
with reference translations. The corpus is divided into training
and test pairs; the test pairs are used as reference
translations and compared with the new translations. All
insertions, transpositions and deletions are counted and
normalised.
- SSER, subjective sentence error rate. Translations
acquired using our system are marked by experts and
distributed into five classes ranging from "perfect
translation" to "perfect nonsense".
The results are presented and discussed.
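For readers unfamiliar with WER, a minimal sketch of one common formulation is given below. It is an assumption that this matches the authors' set-up: the sketch uses word-level Levenshtein distance (insertions, deletions and substitutions, each at cost 1) normalised by the length of the reference, whereas the exact edit operations and normalisation used in the paper are not specified here. The function name and example sentences are invented.

```python
def word_error_rate(hypothesis, reference):
    """Word-level edit distance between two sentences, divided by the
    number of words in the reference translation."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] = distance between the empty reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word missing
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    if not ref:
        # Convention assumed here: any output against an empty reference is 1.0
        return float(bool(hyp))
    return prev[-1] / len(ref)

# Invented example: one word of the reference is missing in the output.
print(word_error_rate("the cat sat on mat", "the cat sat on the mat"))
```

Averaging this value over all test pairs gives a corpus-level score that can be compared between the initial and the extended system.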
Li Wenzhong
An Analysis of the Key Words and the Words in
Association in the College Learner English Corpus
This research is based on the College Learner English
Corpus which was completed at Shanghai Jiao Tong
University in 1999. The major objectives are to examine 1)
whether the distribution of key words is closely related
to the subject matter of the essays; 2) how the words are
associated with each other in the essays of the same
topic; and 3) how the words are inter-related in the
essays across different topics. Moreover, the relation
between the words in association and their collocational
links has also been investigated. A general survey of the
core word associations demonstrates that the core words
frequently co-occur within lexical sets. In the learners' use
of lexical words there are sets of core words that
are highly productive. These core words are often less
marked and serve as superordinates in the learners' mental
matching of the semantic fields of the two languages. The
core words are often used in association with each other
and it is possible that the words in association can also
be used as collocational links. Moreover, much of the
learners' use of lexical words is topic-dependent, and
vocabulary is highly concentrated within the texts
of the same topic relative to the whole corpus. The
core words of one subject matter are often inter-related
semantically. The various relationships between the
lexical sets for one topic may often contribute to the
very theme of the text. The findings have shown that the
learners tend to use words in close relation to the
organization of meaning in their mental lexicon. The
choice of a word is largely determined by how their
schematic knowledge about the world is activated and by
how the other words in association are chosen in their
vocabulary network. Success in vocabulary use therefore
depends to a great extent on whether the learners
can successfully perceive complex lexical relations
such as topical relations, association, and collocational
links and represent them accurately in their language
production. Such observations have implications for
EFL teaching, in that lexical words may be better
taught when word association and topical relations
are considered. New words may be more accessible
and easier to retain if they are presented on
the basis of the subject matter with which they are
connected.
Geoffrey Williams
Out of control: defining specialised and non-specialised
words in a corpus-driven special language dictionary
For the advanced learner of English there is no
shortage of excellent dictionaries to choose from. It is
accepted that lexicographers will cover general needs of
the learner and that more specific usage is the
prerogative of the terminologist. However, for the non-native
writer of research papers, terminology may not be the
major problem; it is the words that go around the terms
that prove difficult. Several problems arise here. One is
that this population rarely buys dictionaries, and when
it does, the users do not wish to wade through complex
entries for which the examples do not conform to domain
and genre-specific usage. Another problem arises from the
nature of terminology. To reach agreement on ideal usage,
a "term" is generally defined outside of
context; however, when looking at terms within a corpus
we may find that research writing does not respect the
standard definition as authors strive to impose their
view as to the phenomenon under study. This means that
corpus examples may not be to the taste of everyone in a
controversial and developing field. Then, in developing a
corpus-driven approach to specialised language another
problem arises, that of grammatical norms. Insofar as
specialised corpora are inevitably composed of both
native and non-native productions, some of the grammatical
usage may reflect not the particular syntax of some so-called
sublanguage, but simply bad English. In reference corpora
minor variations are lost in the mass of data, but this
is not necessarily so in the much smaller special
language corpora.
The Parasitic Plant Dictionary project is an attempt
to build a data-driven pedagogical dictionary in a
specialised field. This means looking at both specialised
and non-specialised items and displaying their usage in
context. Two lexical units will be discussed here: "control",
as verb and noun in general and scientific usage, and
"haustorium", a domain-specific term in
parasitic plant biology. In looking at "control"
we shall see the difference between the complex entry
required in a pedagogical dictionary and that adopted
here to show the usage of this word in specialised
contexts. For "haustorium" we shall see the
difficulties in extracting an entry from corpus data that
will show usage, whilst not upsetting the terminological
requirements of the leaders in the field.
Martin Wynne and Rowan Wilson
Oxford Text Archive: archive, library or corpus?
The Oxford Text Archive is a large repository of
electronic texts and text corpora. At present the archive
works in much the same way that it has since its
inception. The user consults the catalogue, selects a
text or a number of texts and then completes the relevant
procedure in order to download the text or texts to their
computer. The main development in terms of resource
delivery in the past 25 years is that many of the
resources can now be downloaded directly from the
website, rather than being sent by post on magnetic media
or downloaded by ftp. The user is then left to their own
devices in order to find software to analyse the texts
and try to extract information from them.
In order to make the archive more useful and usable
for linguistics researchers, a system for querying texts
and corpora online is being developed at
the Oxford Text Archive. It is further proposed that the
user will be able to construct a corpus of texts from the
archive for downloading or querying online. It will be
possible to select texts for the corpus on the basis of
any of the resource metadata categories and by simply
picking and choosing from a list of texts.
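The proposed selection mechanism might be pictured roughly as follows. This is only an illustrative sketch, not the OTA's system: the record fields (text_id, title, language, genre), the placeholder catalogue entries and the function build_virtual_corpus are all hypothetical, and combining metadata matches with hand-picked texts as a union is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    # Hypothetical metadata categories; a real catalogue will differ.
    text_id: str
    title: str
    language: str
    genre: str


# Placeholder catalogue entries for illustration only.
CATALOGUE = [
    TextRecord("t1", "Placeholder text A", "en", "fiction"),
    TextRecord("t2", "Placeholder text B", "en", "drama"),
    TextRecord("t3", "Placeholder text C", "fr", "fiction"),
]


def build_virtual_corpus(catalogue, picked_ids=(), **criteria):
    """Return records that match every metadata criterion, together
    with any records picked by hand from the list (assumed union)."""
    picked = set(picked_ids)
    selected = []
    for record in catalogue:
        matches = bool(criteria) and all(
            getattr(record, field) == value
            for field, value in criteria.items()
        )
        if matches or record.text_id in picked:
            selected.append(record)
    return selected
```

For instance, build_virtual_corpus(CATALOGUE, picked_ids=["t3"], language="en") would, under these assumptions, return the two English placeholder texts plus the hand-picked one, in catalogue order.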
Online concordancing is not new. Many sites and corpus
projects offer this facility. Furthermore, the ability to
select the texts on the fly and thus construct a virtual
corpus is not new. This paper reviews some existing
resources and services in this area.
The specific challenge of providing a service of this
type using the holdings of the Oxford Text Archive is
that there are more than 2400 texts in the archive and
they have been collected and documented over a period of
more than twenty-five years, and as such reflect a
multitude of different practices in the encoding of the
texts, in the construction of collections of texts, and
in the documentation of the resources. The size and
diversity of the archive make it a potentially extremely
rich linguistic resource.
It is, however, a precondition for the type of
functionality which is proposed here that the textual
data and the metadata be interoperable. The OTA's
response to this challenge is examined in this paper.
There is also an examination of the extent to which the
framework which is being developed can be generalised.
Feng Zhiwei
Translation Divergence in Machine Translation
The selection of translation equivalence in MT (Machine
Translation) depends on the differentiation of
translation divergence between the Source Language (SL)
and Target Language (TL). In this paper, the different
types of translation divergence in MT are discussed. They
are the translation divergence in lexical selection, in
tense, in thematic relation, in head-switch, in
structure, in category, and in conflation. The
syntactic, semantic and contextual ambiguities
related to translation divergence are also discussed.
The author suggests using feature vectors to represent
co-occurrence clusters, and a co-occurrence-cluster-based
approach to the selection of translation
equivalence is described in detail.