Pre-conference Workshops

 

NLP in CALL

No Longer Pertinent or New Light Penetrates

3rd Eurocall workshop organised by the SIG in Language Processing

(1) Aim of the Workshop
(2) Schedule
(3) Papers
    (a) Teaching grammar with a treebank and a parser
    (b) Syntactic and ‘semantic’ error detection
    (c) Ontology Enrichment with Conceptual Structures for Cross-Linguistic Disambiguation
    (d) Learner-corpora and NLP
    (e) What have you done for me lately? The fickle alignment of NLP and CALL
(4) Costs to Participants

Aim of the Workshop

The workshop title is in part borrowed from a presentation John Sinclair gave at a one-day conference on NLP in CALL in Manchester in 1998, which was jointly organised by Eurocall and the Centre for Computational Linguistics at UMIST (Manchester). Four years later, it seems appropriate to endeavour once again to assess the validity of parser-based and corpus-based approaches in CALL research and development.

During this workshop, participants will be introduced to examples of Natural Language Processing (NLP) approaches in CALL and will have the chance to familiarise themselves with the application of NLP techniques in CALL. The way in which such technology can be integrated in computer-assisted language learning will be discussed.

The Special Interest Group in Language Processing is Eurocall's newest SIG. The group organised successful pre-conference workshops for EUROCALL2000 in Dundee and EUROCALL2001 in Nijmegen. This year's (third) workshop places emphasis on two areas of language processing that are highly relevant to CALL: morpho-syntactic parsing for error diagnosis and the use of corpora for language learning and teaching. It brings together presenters from Finland, Germany, Sweden and Switzerland.

Schedule
09:00 - 09:30 Registration / Coffee
09:30 - 09:45 Opening
Trude Heift, Workshop Chair
Mathias Schulze, Chair of SIGLP
09:45 - 10:30
Teaching grammar with a treebank and a parser
Anju Saxena, Lars Borin
10:30 - 11:15
Syntactic and ‘semantic’ error detection
Sébastien L’haire
11:15 - 12:00
Ontology Enrichment with Conceptual Structures for Cross-Linguistic Disambiguation
Steve Legrand
12:00 - 13:30 Lunch
13:30 - 14:15
Learner-corpora and NLP
Veit Reuer
14:15 - 15:00
What have you done for me lately? The fickle alignment of NLP and CALL
Lars Borin
15:00 - 16:00 Round Table Discussion with the panel of contributors
16:00 - 16:30 Coffee Break
Abstracts
Teaching grammar with a treebank and a parser
Anju Saxena (1); Lars Borin (1, 2)
(1) Department of Linguistics, Uppsala University, Sweden
(2) Computational Linguistics, Department of Linguistics, Stockholm University, Sweden

It is generally acknowledged that the goal of teaching grammar in Linguistics should not primarily be that students memorize definitions of concepts and grammatical constructions, but rather that they understand and learn to recognize different structural patterns. This can hardly be achieved without giving students practical training in the skill of grammatical analysis. Research has shown that hands-on problem-solving is more stimulating and thought-provoking than having information and results handed down to students during lectures. With this in mind, we formulated a project for realizing a new format for teaching courses in grammar in Linguistics, Computational Linguistics, and lesser-taught languages, where practical training and corpus-based exercises will form an integral part of the students’ learning process. The work described here forms part of the project IT-based Collaborative Learning in Grammar, a collaboration between the universities in Uppsala and Stockholm, funded by the Swedish Agency for Distance Education (DISTUM) for the three years 2002–2004. Anju Saxena is the principal investigator for the project; see also <http://www.ling.uu.se/anjusaxena/distum.html>. The proposed web-based training material has a modular architecture, composed of four types of modules:
    • ‘Encyclopedia’ module, describing grammatical concepts and constructions. 
    • ‘Text corpus’ module, containing (a) POS-tagged and syntactically annotated (‘treebanks’) corpora of Swedish, and (b) an annotated corpus of a foreign language. For (a), we will use the SUC and Talbanken Swedish corpora, and for (b), a corpus of Kinnauri (a Tibeto-Burman language spoken in India) narratives available on the web <http://www.ling.uu.se/anjusaxena/corpus.html>.
    • ‘Interactive exercise’ module. Our aim here will be to provide students with a set of exercises, with basic tools for computer-mediated student cooperation in virtual workgroups (a ‘spreadsheet’ for problem-solving; optional ‘step-by-step questions’ for the grammatical topic covered; grammar rule writing exercises, to be discussed below), with hyperlinks to the ‘encyclopedia’, to the ‘resources’ (see below) and to the annotated foreign language corpus (hyperlinked to a dictionary; see Saxena 2000).
    • ‘Resource’ modules will provide a pool of resources for further reading and relevant links to other sites.
One type of exercise in module 3 will use a treebank (Talbanken; Teleman 1974) together with a grammar writer’s workbench, in a refinement of an idea presented by Borin & Dahllöf (1999). We propose to use grammar rules written by students (using an existing parser frontend) as search expressions in the treebank. Given an NP rule formulated by a student, we could automatically tell how many treebank POS sequences matching the rule actually make up NPs, how many are not NPs, and how many NPs in the treebank are not described by the rule. There are all kinds of conceivable interesting elaborations of this basic scheme, which could be seen as a more linguistically sophisticated parallel to the use of (unannotated) text corpora and concordancing software in so-called data-driven language learning (Flowerdew 1996). 
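The three counts mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a student rule is a flat sequence of POS tags and that each treebank sentence is stored as a tag list plus a set of gold NP spans. The tag names, the data layout and the function names are hypothetical and are not taken from Talbanken or from the project's parser frontend.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def find_matches(tags: Sequence[str], rule: Sequence[str]) -> List[Span]:
    """Return every contiguous stretch of `tags` that equals `rule`."""
    n = len(rule)
    return [(i, i + n) for i in range(len(tags) - n + 1)
            if list(tags[i:i + n]) == list(rule)]


def evaluate_rule(treebank: List[dict], rule: Sequence[str]) -> Dict[str, int]:
    """Count rule matches that are NPs, matches that are not, and NPs left undescribed."""
    if not rule:
        raise ValueError("the rule must contain at least one POS tag")
    matched_np = matched_other = total_nps = 0
    for sentence in treebank:
        gold_nps = sentence["nps"]
        total_nps += len(gold_nps)
        for span in find_matches(sentence["tags"], rule):
            if span in gold_nps:
                matched_np += 1      # the matched sequence is an NP in the treebank
            else:
                matched_other += 1   # the sequence matches the rule but is not an NP
    return {"np": matched_np,
            "not_np": matched_other,
            "missed_np": total_nps - matched_np}


# Hypothetical two-sentence treebank with invented tags.
TOY_TREEBANK = [
    {"tags": ["DT", "JJ", "NN", "VB", "DT", "NN"], "nps": {(0, 3), (4, 6)}},
    {"tags": ["PN", "VB", "DT", "NN", "NN"], "nps": {(0, 1), (2, 5)}},
]

if __name__ == "__main__":
    print(evaluate_rule(TOY_TREEBANK, ["DT", "NN"]))
```

A realistic rule language would allow optional and repeated elements, so the exact-sequence matcher here stands in for whatever the parser frontend actually provides.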
At the moment, we are locating and evaluating NLP resources, mainly on the web. The evaluation is to be mainly pedagogical, i.e. we will ask ourselves whether a particular resource is suitable for the pedagogical framework that we have adopted for teaching grammar. However, usability, as the term is used in Human–Computer Interaction research, will also be an important evaluation criterion, as will the estimated effort needed to adapt the resource to our needs. As the corpora are already in place, we are now evaluating tools for the manipulation and visualization of corpus data, parsing systems, and grammar writer’s workbenches, which raises a number of compatibility and standardization issues that need to be resolved.

References
Borin, Lars & Mats Dahllöf 1999. A corpus-based grammar tutor for Education in Language and Speech Technology. EACL'99, Computer and Internet Supported Education in Language and Speech Technology. Proceedings of a Workshop Sponsored by ELSNET and The Association for Computational Linguistics. University of Bergen, Norway. 36–43.
Flowerdew, John 1996. Concordancing in language learning. The power of CALL, ed. by Martha Pennington. Houston, Texas: Athelstan.

Saxena, Anju 2000. Corpora of lesser-known languages on the internet: A pedagogical tool for the teaching of syntax. Paper presented at the workshop on IT inom språkundervisningen. Uppsala University. <http://www.ling.uu.se/anjusaxena/symposium0303.html>.

Teleman, Ulf 1974. Manual för grammatisk beskrivning av talad och skriven svenska. Lund: Liber.


What have you done for me lately? The fickle alignment of NLP and CALL
Lars Borin, Computational Linguistics, Department of Linguistics, Stockholm University, and Department of Linguistics, Uppsala University, Sweden

Natural Language Processing (NLP; or Computational Linguistics, CL; or Language Engineering, LE; or Language Technology, LT)—which deals precisely with the use of (natural) language by computers—ought to be eagerly brought to bear on the task of developing Computer-Assisted Language Learning (CALL) applications by CALL practitioners. 

Similarly, NLP researchers ought to be interested in (human) first and second language learning, and in developing NLP systems in support of language development and learning. 

Unfortunately, neither is actually the case. In the recent broad Survey of the state of the art in human language technology (Cole et al. 1996), there is not a single word about (human) language learning. Similarly, CALL contributions at the biennial international conference on computational linguistics (COLING) have been next to nonexistent (the few examples include Borissova 1988; Zock 1996; Schneider and McCoy 1998; Burstein and Marcu 2000). Much of the work on using NLP in CALL has been pursued under the heading of Artificial Intelligence (AI; a field which overlaps minimally with mainstream NLP; see Swartz and Yazdani 1992; Holland et al. 1995), particularly in the area of Intelligent Tutoring Systems (see Frasson et al. 1998; Goettl et al. 1998).

Chapelle (1997, 2001) is not optimistic about the contributions of AI/NLP to CALL, although at least in her 2001 book, the NLP work that she reviews (under the headings “Artificial intelligence” and “Computational linguistics”; 2001: 32–36) is in most cases more than a decade old, in a field which has seen very rapid development in the last ten years.

On a more positive note, there have been some international workshops on NLP and CALL, sometimes in connection with CL conferences (e.g. Jager et al. 1998; Olsen 1999; Schulze et al. 1999; Efthimiou 2000), although these, too, seem to depend on fortuitous circumstances, rather than a conviction that CALL is an important NLP application; thus, the Language resources and tools for educational applications workshop held at LREC 2000 (Efthimiou 2000) will not be repeated at the upcoming LREC 2002.

Some factors that may have been instrumental in shaping the attitudes of the two communities (NLP and CALL) toward each other are:

    • Different backgrounds. NLP researchers often come from a Computer Science background, or from General Linguistics, while CALL researchers tend to have their basic training in Languages or Applied Linguistics. Sparck Jones (1996) remarks: “It has also to be recognized that the arrogance so characteristic of those connected with IT – the self-defined rulers of the modern world – is not merely irritating in itself, it is thoroughly offensive when joined to ignorance not only of language, but of relevant linguists’ work” (1996: 13), and: “On the practical side, it is impossible not to conclude that many linguists are techno- and logico-phobes.” (1996: 13f).
    • Different attitudes to technology. This is probably connected to the preceding factor. There is a difference between using existing technology, and trying to develop or shape the technology according to some need (see, e.g., Amiri 2000), or metaphorically, the difference between hunter-gatherer and food-production economies (see Diamond 1998).
    • Language-learning ideology. The emphasis in the SLA community is on communicative language learning. This is often interpreted as excluding e.g. form-based drill, which is where the greatest potential would be for present state-of-the-art NLP. However, at Uppsala University, there are at present regular courses in more than 40 languages, many of which have extremely complicated and exotic grammatical systems compared to the so-called ‘Modern Languages’, at the same time as the number of student contact hours is limited to 3–5 hours per week. Herein lies a great, mostly untapped, potential for NLP in CALL, as also in the support for threatened languages (Allwood and Borin 2001).
    I will discuss these and other factors in more detail in the paper, and also try to speculate about how to change this state of affairs.

    References
    Allwood, Jens and Lars Borin 2001. Datorer och språkteknologi som hjälpmedel i bevarandet av romani – Computers and language technology as an aid in the preservation of Romani. Plenary presentation at the symposium Romani as a language of education: possibilities and restrictions today, Göteborg University, 19–20 January 2001.
    Amiri, Faramarz 2000. IT-literacy for language teachers: Should it include computer programming? System 28: 77–84.
    Borissova, Elena 1988. Two-component teaching system that understands and corrects mistakes. COLING Budapest. Proceedings of the 12th International Conference on Computational Linguistics. Vol I. Budapest: John von Neumann Society for Computing Sciences. 68–70.

    Burstein, Jill and Daniel Marcu 2000. Benefits of modularity in an automated essay scoring system. Proceedings of the COLING–2000 workshop on using toolsets and architectures to build NLP systems. Centre Universitaire, Luxembourg, 5 August 2000.

    Chapelle, Carol 1997. CALL in the year 2000: Still in search of research paradigms? Language Learning & Technology 1(1): 19–43. Available on the WWW via <http://llt.msu.edu>.

    Chapelle, Carol 2001. Computer applications in second language acquisition. Cambridge: Cambridge University Press.

    Cole, Ron, Joseph Mariani, Hans Uszkoreit, Annie Zaenen and Victor Zue (eds.) 1996. Survey of the state of the art in human language technology. Cambridge: Cambridge University Press. Available on the WWW as <http://cslu.cse.ogi.edu/HLTsurvey/>.

    Diamond, Jared 1998. Guns, germs and steel. A short history of everybody for the last 13,000 years. London: Vintage.

    Efthimiou, Eleni (ed.) 2000. LREC 2000. Second international conference on language resources and evaluation. Workshop proceedings: Language resources and tools for educational applications. Athens: ILSP.

    Frasson, Claude, Gilles Gautier and Alan Lesgold (eds.) 1998. Intelligent tutoring systems. Third International Conference, ITS '96. Montréal, Canada, June 12–14, 1996. Proceedings. Lecture notes in computer science 1086. Berlin: Springer.

    Goettl, Barry P., Henry M. Halff, Carol L. Redfield and Valerie J. Shute (eds.) 1998. Intelligent tutoring systems. 4th International Conference, ITS '98. San Antonio, Texas, USA, August 16–19, 1998. Proceedings. Lecture notes in computer science 1452. Berlin: Springer.

    Holland, V. Melissa, Jonathan D. Kaplan and Michelle R. Sams (eds.) 1995. Intelligent language tutors: theory shaping technology. Mahwah, New Jersey: Lawrence Erlbaum Associates.

    Jager, Sake, John A. Nerbonne and A.J. van Essen (eds.) 1998. Language teaching and language technology. Lisse: Swets & Zeitlinger.

    Olsen, Mari Broman (ed.) 1999. Computer mediated language assessment and evaluation in natural language processing. A joint ACL–IALL symposium. Retrieved from the WWW in July 1999: <http://umiacs.umd.edu/~molsen/acl-iall/accepted.html>.

    Schneider, David and Kathleen F. McCoy 1998. Recognizing syntactic errors in the writing of second language learners. COLING-ACL '98. Proceedings of the Conference, Vol. II. Montréal: Université de Montréal. 1198–1204.

    Schulze, Mathias, Marie-Josée Hamel and June Thompson (eds.) 1999. ReCALL: Language processing in CALL. Proceedings of a one-day conference “Natural Language Processing in Computer-Assisted Language Learning”, a special ReCALL publication. Hull: The CTI Centre for Modern Languages, University of Hull, UK.

    Sparck Jones, Karen 1996. How much has information technology contributed to linguistics? Presentation at the British Academy Symposium on Information Technology and Scholarly Disciplines, 18–19 October 1996. The page references in the text are to the electronic version available via <http://xxx.lanl.gov/cmp-lg/9702011/>.

    Swartz, Merryanna L. and Masoud Yazdani (eds.) 1992. Intelligent tutoring systems for foreign language learning. Berlin: Springer-Verlag.

    Zock, Michael 1996. Computational linguistics and its use in real world: the case of computer assisted-language [sic] learning. COLING–96. The 16th international conference on computational linguistics. Proceedings, vol. 2. Copenhagen, Denmark: Center for Sprogteknologi. 1002–1004.


    Syntactic and ‘semantic’ error detection
    Sébastien L’haire, Department of Linguistics, Faculty of Arts, University of Geneva, Switzerland

    In this talk, we present research conducted at the University of Geneva in the framework of the European research project FreeText, which aims to develop advanced hypermedia CALL software featuring NLP tools for the intelligent treatment of authentic documents and (relatively) free production exercises. The system targets intermediate to advanced learners of French.

    We use various NLP tools to provide the learners with intelligent feedback: a sentence structure viewer; a diagnosis tool; a speech synthesizer, which can pronounce either the software’s or the learners’ sentences; and a sentence reformulation tool.

    The presentation will focus on the techniques used for error detection, the main goal of which is to give learners appropriate feedback on production exercises. The learner’s answer is compared, if applicable, with a model answer stored in the exercise database. If the sentences do not match, we start a three-step procedure involving spell checking, syntactic checking and ‘semantic’ checking. The procedure can be interrupted at any step as soon as an error is detected.
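The control flow of this three-step procedure can be sketched as follows. This is only an illustration under stated assumptions: the toy lexicon, the three checker functions and the name `diagnose` are hypothetical placeholders, not FreeText components, and the syntactic and ‘semantic’ steps are stubs standing in for the parser-based tools described in the abstract.

```python
from typing import Callable, List, Optional, Tuple

# Toy lexicon (assumption); a real system would use a full French lexicon.
LEXICON = {"je", "les", "ai", "vu", "vus", "vues"}


def spell_check(answer: str, model: str) -> Optional[str]:
    """Flag words that are not in the toy lexicon."""
    unknown = [w for w in answer.lower().split() if w not in LEXICON]
    return "unknown word(s): " + ", ".join(unknown) if unknown else None


def syntax_check(answer: str, model: str) -> Optional[str]:
    """Placeholder: a parser-based syntactic diagnosis would go here."""
    return None


def semantic_check(answer: str, model: str) -> Optional[str]:
    """Placeholder: report the words that differ from the model answer."""
    diff = [a for a, m in zip(answer.split(), model.split()) if a != m]
    return "differs from the expected answer at: " + ", ".join(diff) if diff else None


STEPS: List[Tuple[str, Callable[[str, str], Optional[str]]]] = [
    ("spelling", spell_check),
    ("syntax", syntax_check),
    ("semantic", semantic_check),
]


def diagnose(answer: str, model: str) -> Optional[Tuple[str, str]]:
    """Return None if the answer matches the model, else the first error found."""
    if answer.strip().lower() == model.strip().lower():
        return None
    for name, check in STEPS:
        message = check(answer, model)
        if message is not None:
            return name, message  # stop at the first step that detects an error
    return "undiagnosed", "the answer differs from the model, but no error was found"


if __name__ == "__main__":
    print(diagnose("je les ai vuz", "je les ai vues"))
    print(diagnose("je les ai vus", "je les ai vues"))
```

Keeping the steps in an ordered list makes the early exit explicit: later, more expensive checks only run when the earlier ones find nothing.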

    We will not detail the well-known techniques of spell checking. The syntactic error detection uses three different techniques. The main one is constraint relaxation. If the parser can only give a partial analysis, we try to relax some constraints to obtain a complete analysis, for instance agreement rules or verb and adjective complementation. The second technique is phonetic reinterpretation: we try to build a correct sentence by looking for homophones in the lexicon at the boundaries of the partial analysis chunks. The third technique is called chunk reinterpretation, where ad hoc rules are applied. The results of the three techniques are combined in order to give intelligent feedback.
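The idea behind constraint relaxation can be shown with a small sketch. It assumes a drastically simplified setting in which a phrase is a list of words carrying gender and number features and the only constraints are two agreement checks. The feature names, the constraint table and the function names are hypothetical, and in an actual parser the constraints would be part of the grammar rather than a separate table.

```python
from itertools import combinations
from typing import Callable, Dict, List

Word = Dict[str, str]

# Hypothetical constraint set: each constraint inspects the whole phrase.
CONSTRAINTS: Dict[str, Callable[[List[Word]], bool]] = {
    "gender agreement": lambda ws: len({w["gender"] for w in ws}) == 1,
    "number agreement": lambda ws: len({w["number"] for w in ws}) == 1,
}


def analyse(words: List[Word], active: List[str]) -> bool:
    """A 'complete analysis' succeeds if every active constraint holds."""
    return all(CONSTRAINTS[name](words) for name in active)


def diagnose_by_relaxation(words: List[Word]) -> List[str]:
    """Return the smallest set of constraints that must be relaxed to get an analysis.

    An empty list means the phrase is accepted with all constraints active.
    """
    names = list(CONSTRAINTS)
    for k in range(len(names) + 1):
        for relaxed in combinations(names, k):
            active = [n for n in names if n not in relaxed]
            if analyse(words, active):
                return list(relaxed)
    return names  # not reached: relaxing every constraint always succeeds


LE = {"form": "le", "gender": "masc", "number": "sg"}
LA = {"form": "la", "gender": "fem", "number": "sg"}
VOITURE = {"form": "voiture", "gender": "fem", "number": "sg"}
VOITURES = {"form": "voitures", "gender": "fem", "number": "pl"}

if __name__ == "__main__":
    print(diagnose_by_relaxation([LE, VOITURE]))
    print(diagnose_by_relaxation([LA, VOITURES]))
```

The relaxed constraints double as the diagnosis: whichever constraint had to be switched off names the error to report to the learner.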

    A sentence can be syntactically correct but nevertheless wrong in the context of the exercise. For the question “as-tu vu les voitures rouges?” (did you see the red cars?), if the instructions are to answer with a pronoun, the expected answer is “je les ai vues” (I’ve seen-AGR them). The learner could type “je les ai vus”, which is a correct sentence but a wrong answer, since “voiture” is a feminine noun. The pronoun “les” is both masculine and feminine, and the past participle “vu” must agree with the pre-posed object complement. Therefore “les” is feminine here and the past participle must be “vues” (feminine plural). Our semantic checker is able to detect such mismatches and works as follows: a semantic representation of both sentences is extracted, using “pseudo-semantic structures” which combine both lexical and abstract information. Then we compare the learner’s answer with the model answer. These structures remain the same regardless of the construction used (active, passive, focus etc.); only some abstract features change. As a result, transformation exercises are easy to construct: the teacher needs to enter only one sentence in the database, whereas systems using pattern matching require formulas to be entered and all possibilities to be listed.
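The comparison step can be sketched as follows, assuming that the two structures have already been extracted by the parser. The nested-dictionary layout, the feature names and the function `compare` are illustrative assumptions; the abstract does not specify the actual format of the pseudo-semantic structures.

```python
from typing import Any, Dict, List, Tuple

Structure = Dict[str, Any]

# Hand-written stand-ins for what the parser would extract (assumption).
MODEL: Structure = {          # "je les ai vues"
    "predicate": "voir",
    "agent": {"lemma": "je", "number": "sg"},
    "theme": {"lemma": "PRONOUN", "gender": "fem", "number": "pl"},
}
LEARNER: Structure = {        # "je les ai vus"
    "predicate": "voir",
    "agent": {"lemma": "je", "number": "sg"},
    "theme": {"lemma": "PRONOUN", "gender": "masc", "number": "pl"},
}


def compare(learner: Structure, model: Structure, path: str = "") -> List[Tuple[str, Any, Any]]:
    """List every feature whose value differs, as (path, learner value, model value)."""
    diffs: List[Tuple[str, Any, Any]] = []
    for key in sorted(set(learner) | set(model)):
        here = f"{path}.{key}" if path else key
        l_val, m_val = learner.get(key), model.get(key)
        if isinstance(l_val, dict) and isinstance(m_val, dict):
            diffs += compare(l_val, m_val, here)   # descend into sub-structures
        elif l_val != m_val:
            diffs.append((here, l_val, m_val))
    return diffs


if __name__ == "__main__":
    print(compare(LEARNER, MODEL))
```

Because the comparison works on features rather than on surface strings, a single stored model answer can be checked against differently worded learner answers.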

    Ontology Enrichment with Conceptual Structures for Cross-Linguistic Disambiguation
    Steve Legrand, University of Jyväskylä, Finland

    Conceptual structures [1] can be understood as those structures of mind that have developed in living organisms during their evolution in interaction with changing environmental conditions. These structures are reflected in the semantics and are partially captured in the syntax of a natural language. However, natural language is by no means the only expression of those conceptual structures: all the other senses such as hearing, vision etc. employ the same structures. The value of these structures lies in their universality: languages may vary, but as all human beings presumably have a similar evolutionary development behind them, those conceptual structures should vary very little from region to region and between individuals. This gives hope that some universal semantic structures encoded in syntax may, in fact, be found in all languages and could be employed productively in many natural language processing tasks such as language learning and translation.

    Ontology enrichment differs from the lexico-syntactic approach to annotation in certain respects. It does not exclude the use of real-world ontologies but is designed to work in tandem with them in a framework to be created for the purpose in the current study. The framework will use PIA (Platform for Information Applications) to annotate text with functional tags based on conceptual semantics that can then be used by information agents in various transactions. These XML-compatible tags contain instructions created with the help of a Turing-complete scripting language. Although this study concentrates on semantic components, the motivation behind it is to allow the addition of real-world knowledge to semantic disambiguation with a minimum of effort. The framework will also allow component-based collaborative development of tagsets.

    Cross-linguistic disambiguation uses tags incorporating lexical semantic components to disambiguate text and so boost the disambiguation accuracy of current parsers (grammatical, stochastic, rule-based, syntactic and their combinations). As the conceptual structures used are universal, they can be used as an interlingua between several languages [2]. The syntax of each language uses only a small subset of the available conceptual semantic structures, and languages differ in this respect. These differences in the use of conceptual semantic structures often form a stumbling block in language learning. For example, the Finnish sentence:

    Minulla on nälkä (‘I have hunger’), translated into English: I am hungry

    may not be self-evident for a novice language learner. 

    However, although the sentences have a different syntactic structure on the surface, they can be represented independently on the conceptual level. These syntactic/conceptual differences between languages can be marked automatically with the help of the tagset. The language learner or teacher can be alerted (by highlighting the relevant parts, for example) to the structures to concentrate on. The differences in syntax could also be made more explicit at that stage. This would provide positive (learning through surprise) rather than negative (learning through failure) reinforcement for the learner and would, therefore, make learning more effective.

    [1] Jackendoff, R.: Semantic Structures, The MIT Press, Cambridge, Mass., 1990

    [2] Dorr, B.J., The Use of Lexical Semantics in Interlingual Machine Translation. Journal of Machine Translation, 7:3, pp. 135-193, 1992.


    Learner-corpora and NLP
    Veit Reuer, University of Osnabrück, Germany

    There have been various approaches to grammar checking in natural language processing, i.e. the recognition and correction of errors, both in the field of language learning and in word processing. Approaches which do not anticipate errors are usually not very efficient. Therefore some steps are taken to improve the performance of such systems. In the classical approach to PS-rule parsing, Mellish89 uses very complicated heuristics to determine possible next moves by the parser. Menzel98, in their approach, weigh the constraints used in the dependency grammar.

    Other approaches use anticipation and encode the position and type of possible errors somewhere in the grammar or the lexicon. Only very few developers have actually used learner-corpora to determine the outcome of the parsing-modifications. An exception is McCoy98.

    To enhance the efficiency of non-anticipating parsers and possibly to supply material for anticipation-based systems, we have analyzed learner corpora for the most frequently occurring error types. Since our interest lies in German as a second language, a corpus from Heringer97 has been used, which contains 7107 German sentences marked with error codes and corrections. 403 error classes were used. A second annotated corpus collected at the Universitat de Barcelona by Oliver Strunk is currently being investigated. Note that annotating a corpus with error flags has its own difficulties, which will not be discussed here.

    In the following we present some conclusions that were reached.

    1. Since learners of German as a second language with medium to good knowledge produced the corpus, orthographic errors appear rarely (approx. 4 per cent; note that it is difficult to distinguish between "morphological" errors resulting in incorrect spelling and "true" orthographic errors). This probably depends also on the notational system used in the L1, which was not registered in the first corpus. The second corpus was produced by Spanish-speaking learners of German. If the L1 also uses Latin letters, very few orthographic errors should be expected.
    2. The most prominent specific error in syntax is a number mismatch between subject noun and verb, which could be expected from the way German grammar works (3.73 per cent). But the second most frequent errors are comma errors (3.71 per cent; 4.1 per cent for all punctuation errors). One can argue that punctuation errors play only a minor role in language learning and teaching. Nevertheless, parsers which analyze learner language should be able to inform the learner if a punctuation error has occurred. To our knowledge, none of the systems developed so far have concentrated on or even considered this aspect.
    3. Looking at errors from a more general perspective, it becomes clear that the number of linearization errors (omission 16 per cent and permutation 14 per cent) exceeds the number of verb-subject agreement errors (4 per cent) and the number of insertions (only 5.5 per cent).
    Parsers should therefore be expected to concentrate on omission and permutation instead of treating every linearization-error the same as e.g. in Mellish89. Unfortunately government-errors were not marked as such in the corpus, but the number for e.g. a verb governing a certain case is around 8 per cent.
    Although some publications about error recognition have mentioned the high number of syntactic errors in general as opposed to semantic or orthographic errors, there have not, to our knowledge, been any analyses of the frequencies of syntactic errors in the context of NLP and CALL. In our opinion these analyses could well be used for the improvement of syntactic parsers used in CALL.
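A frequency profile of this kind can be computed from an error-annotated corpus in a few lines. The sketch below assumes that each sentence record carries a list of error codes and that percentages are taken over all error annotations. Both the record layout and the code names are hypothetical and do not reflect the annotation scheme of the corpora discussed above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def error_profile(corpus: List[Dict[str, list]]) -> List[Tuple[str, int, float]]:
    """Rank error codes by frequency: (code, count, share of all errors in per cent)."""
    counts = Counter(code for sentence in corpus for code in sentence["errors"])
    total = sum(counts.values())
    return [(code, n, round(100 * n / total, 1)) for code, n in counts.most_common()]


# Hypothetical mini-corpus; real data would come from the annotated learner corpus.
TOY_CORPUS = [
    {"id": 1, "errors": ["omission", "agreement"]},
    {"id": 2, "errors": ["permutation"]},
    {"id": 3, "errors": ["omission"]},
    {"id": 4, "errors": []},
]

if __name__ == "__main__":
    for code, count, share in error_profile(TOY_CORPUS):
        print(f"{code:12s} {count:3d} {share:5.1f} per cent")
```

A ranking like this could then guide which relaxations or error rules a parser tries first.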

    References
    Heringer, Hans Jürgen (1995): Aus Fehlern lernen, Universität Augsburg, CD-ROM for Win9x/NT
    Mellish, Chris S. (1989): Some Chart-based Techniques for Parsing Ill-Formed Input, in Proc. 27th ACL-Conference, 102-109

    Menzel, Wolfgang and Schröder, Ingo (1998): Constraint-based Diagnosis for Intelligent Language Tutoring Systems, Universität Hamburg, Fachbereich Informatik Report Nr. FBI-HH-B-208-98

    Schneider, David and McCoy, Kathleen F. (1998): Recognizing Syntactic Errors in the Writing of Second Language Learners, Proc. 17th Int. COLING-Conference on Computational Linguistics, Montreal
     
     


    Costs to Participants

    Participant fee

    EUR 80.00