The ATILF (Computer Processing and Analysis of the French Language) is recruiting Master 2 interns in natural language processing in 2024, on two topics:

Topic 1: Semantic parsing of lexicographic definitions – Application to Trésor de la Langue Française informatisé

The Trésor de la Langue Française informatisé (TLFi) is the digitized version of the Trésor de la Langue Française (TLF), a sixteen-volume dictionary of the French language of the 19th and 20th centuries, published between 1971 and 1994. The main goal of this internship is to develop and evaluate parsing models for annotating the semantic structure of the TLFi definitions.

Topic 2: Automatic identification of linguistic properties of Multiword Expressions

The term “multiword expression” (MWE) refers to a combination of multiple lexical items that displays irregular composition, possibly on different linguistic levels (morphology, syntax, semantics, ...). The goal of this internship is to develop and evaluate methods to jointly identify MWE occurrences in texts and predict the linguistic criteria explaining their MWE-hood.

Location: ATILF, Nancy, France, https://www.atilf.fr

Duration: 5 months

Requirements: Candidates should be Master 2 students in natural language processing, computational linguistics, computer science or applied mathematics (or equivalent).

Application: Candidates should send their application (cover letter, CV and Master's grades) to Mathieu Constant (Mathieu.Constant@univ-lorraine.fr) no later than January 8, 2024. Applications will be reviewed on a rolling basis.

1) Semantic parsing of lexicographic definitions – Application to Trésor de la Langue Française informatisé

Supervisors: Mathieu Constant (ATILF, CNRS/Univ. Lorraine, Mathieu.Constant@univ-lorraine.fr) and Alain Polguère (ATILF, CNRS/Univ. Lorraine, Alain.Polguere@univ-lorraine.fr)

The internship will be in collaboration with Lucie Barque (LLF, Univ. Sorbonne Paris-Nord) and Alexis Nasr (LIS, Aix-Marseille Université).

Motivation and Context

The Trésor de la Langue Française informatisé (TLFi) is the digitized version of the Trésor de la Langue Française (TLF), a sixteen-volume dictionary of the French language of the 19th and 20th centuries, published between 1971 and 1994. The TLFi has been freely accessible on the web since 2002: https://www.atilf.fr/ressources/tlfi/. Its entries are structured into several fields (forming its so-called microstructure): for instance, definitions, synonyms and antonyms, literary examples in which the headwords appear, technical domain indicators, as well as semantic, etymological, historical, grammatical and stylistic features.

The ATILF-funded Definiens project (2009–2010) targeted the conversion of TLFi’s definitions into structured expressions (using XML tags). At that time, there was no widely accessible lexical database for the French language offering an explicit internal structuring of the definition of each individual headword. Definiens set out to fill this gap by indicating, for each definition in the TLFi (around 270,000): 1) its structuring into definitional components; 2) the role played by each component in characterizing the meaning of the headword.

The Definiens project was never completed due to the start of the major RELIEF lexicographic project. It ended with approximately 5% of the TLF’s definitions manually annotated.

Goals and objectives

The main goal of this internship is to develop and evaluate parsing models for annotating the semantic structure of the TLFi definitions, based on the set of already annotated definitions from the Definiens project. An example of such an annotation is given below for the lexical unit BROUETTE B.1 ‘wheelbarrow’, where CC stands for Composante Centrale ‘central component’ and CP stands for Composante Périphérique ‘peripheral component’:

<PARAPH><CC>Véhicule</CC>

<CP>à une roue et à deux brancards</CP>

<CP>servant au transport des matériaux</CP></PARAPH>

(Translation: <CC>vehicle</CC> <CP>with one wheel and two holding poles</CP> <CP>used to carry materials</CP>)
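As an illustration of how such a segmentation can be read programmatically, here is a minimal sketch in Python. It assumes the annotation is available as a flat string using only the CC and CP tags shown above; the function name `parse_components` is hypothetical and is not part of the Definiens tooling. Any surrounding wrapper tag is simply ignored by the pattern.

```python
import re

# One component: an opening <CC> or <CP>, its text, and the matching close tag.
COMPONENT_RE = re.compile(r"<(CC|CP)>(.*?)</\1>", re.DOTALL)


def parse_components(annotated: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs in order of appearance (hypothetical helper)."""
    return [
        (match.group(1), match.group(2).strip())
        for match in COMPONENT_RE.finditer(annotated)
    ]


example = (
    "<CC>Véhicule</CC> "
    "<CP>à une roue et à deux brancards</CP> "
    "<CP>servant au transport des matériaux</CP>"
)
components = parse_components(example)
for tag, text in components:
    print(f"{tag}\t{text}")
```

A regular expression is enough here because the components in the example are not nested; nested components, if they occur in the data, would call for a real XML parser.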

In a second stage, the feasibility of enriching this basic segmentation with tags indicating the semantic role of each CP should be explored. For instance, the enriched version of the above definition should be:

<PARAPH><CC=véhicule>Véhicule</CC>

<CP=parties caractéristiques>à une roue et à deux brancards</CP>

<CP=fonction>servant au transport des matériaux</CP></PARAPH>

(véhicule ‘vehicle’, parties caractéristiques ‘characteristic parts’, fonction ‘function’)
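The enriched notation can be read in the same way. The sketch below assumes the `TAG=role` syntax shown in the example above, with the role being optional; the `Component` type and the `parse_enriched` function are hypothetical names introduced for illustration only.

```python
import re
from typing import NamedTuple, Optional


class Component(NamedTuple):
    tag: str              # "CC" or "CP"
    role: Optional[str]   # e.g. "fonction"; None when no role is given
    text: str


# Opening tag with an optional "=role" part, then the text, then the close tag.
ENRICHED_RE = re.compile(r"<(CC|CP)(?:=([^>]+))?>(.*?)</\1>", re.DOTALL)


def parse_enriched(annotated: str) -> list[Component]:
    """Return the components of an enriched definition (hypothetical helper)."""
    return [
        Component(m.group(1), m.group(2), m.group(3).strip())
        for m in ENRICHED_RE.finditer(annotated)
    ]


enriched = parse_enriched(
    "<CC=véhicule>Véhicule</CC> "
    "<CP=parties caractéristiques>à une roue et à deux brancards</CP> "
    "<CP=fonction>servant au transport des matériaux</CP>"
)
for component in enriched:
    print(component.tag, component.role, component.text, sep="\t")
```

Because the role is optional in the pattern, the same function also accepts definitions that carry only the basic CC/CP segmentation.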

The work may be divided into several objectives:

reading the scientific literature on the topic and related areas
preparing the data
developing simple baselines by training off-the-shelf (syntactic) parsers on the provided data
adapting models to the specific nature of the task and the data
evaluating the models quantitatively and qualitatively
producing semantic annotations for all TLFi definitions
proposing and testing approaches to semantic enrichment, in order to assess its feasibility
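The data preparation and quantitative evaluation steps above could, for instance, be sketched as follows. This is only one possible setup and not the method prescribed by the internship: it assumes that the segmentation is recast as token-level BIO labels, that tokens are obtained by whitespace splitting, and that evaluation uses exact-match span F1. All function names are hypothetical.

```python
def to_bio(components):
    """Turn (tag, text) components into parallel token and BIO label lists."""
    tokens, labels = [], []
    for tag, text in components:
        for i, token in enumerate(text.split()):  # naive whitespace tokenization
            tokens.append(token)
            labels.append(("B-" if i == 0 else "I-") + tag)
    return tokens, labels


def spans(labels):
    """Extract (tag, start, end) spans from BIO labels; end is exclusive."""
    result, start, tag = [], None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and label[2:] == tag
        if not continues:
            if tag is not None:
                result.append((tag, start, i))
            start, tag = (None, None) if label == "O" else (i, label[2:])
    return result


def span_f1(gold_labels, pred_labels):
    """Exact-match precision, recall and F1 over labeled spans."""
    gold, pred = set(spans(gold_labels)), set(spans(pred_labels))
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


components = [
    ("CC", "Véhicule"),
    ("CP", "à une roue et à deux brancards"),
    ("CP", "servant au transport des matériaux"),
]
tokens, gold = to_bio(components)
# A made-up prediction that wrongly merges the two peripheral components.
predicted = ["B-CC", "B-CP"] + ["I-CP"] * 11
precision, recall, f1 = span_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact-match scoring is strict: in the made-up prediction, merging the two peripheral components costs both of them, so a more lenient token-level score may be worth reporting alongside it.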

A reading knowledge of French is a plus.

2) Automatic identification of linguistic properties of Multiword Expressions

Supervisors: Mathieu Constant (ATILF, Univ. Lorraine, France, Mathieu.Constant@univ-lorraine.fr) and Patrick Watrin (CENTAL, Univ. of Louvain, Belgium, patrick.watrin@uclouvain.be)

Motivation and context

The term “multiword expression” (MWE) refers to a combination of multiple lexical items that displays irregular composition, possibly on different linguistic levels (morphology, syntax, semantics, ...). MWEs cover a wide variety of phenomena such as idioms (run around in circles ‘to keep doing or talking about the same thing without achieving anything’ [Collins dictionary]), support verb constructions (take a walk), nominal compounds (dry run ‘rehearsal’) and complex function units (in spite of). They have been the subject of extensive research in the NLP community over several decades, in particular since the seminal paper of [SBB+02].

The identification of MWEs in texts requires operational criteria characterizing the idiosyncrasy of MWEs, notably to build MWE-annotated corpora or lexical resources. Such linguistic criteria are often based on formal tests [CCR+21], including for instance semantic ones such as *a dry run is a run (in the sense of rehearsal) vs. a dry shirt is a shirt.

Automatic identification of MWEs in texts is nowadays mostly based on sequence classifiers trained on MWE-annotated datasets, on top of pretrained large language models, cf. [RSG+20]. Existing approaches do not provide any cues indicating, in terms of linguistic criteria, why a given occurrence has been identified as an MWE.

Goals and Objectives

The goal of this internship is to develop and evaluate methods to jointly identify MWE occurrences in texts and predict the linguistic criteria explaining their MWE-hood. For this purpose, the work will be based on the French corpus Sequoia [CS12] that has been annotated for MWEs, including the linguistic properties used to manually identify them [CCR+21].

The internship has the following objectives:

read the scientific literature on MWEs and their identification in texts
develop a baseline model that simply identifies MWEs in texts
adapt the model to also characterize the linguistic properties of the identified MWEs
evaluate the results quantitatively and qualitatively
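One simple way to frame the joint task is to fold the criteria into the sequence labels themselves. The sketch below is an assumption made for illustration, not the approach required by the internship: the label format, the `MweAnnotation` class and the criterion name `non_compositional_semantics` are all hypothetical and do not reproduce the actual Sequoia annotation scheme.

```python
from dataclasses import dataclass, field


@dataclass
class MweAnnotation:
    """One MWE occurrence: its token positions and the criteria it satisfies."""
    token_indices: list[int]
    criteria: set[str] = field(default_factory=set)  # hypothetical criterion names


def encode_joint_labels(n_tokens: int, annotations: list[MweAnnotation]) -> list[str]:
    """Encode MWE membership and criteria as one label per token."""
    labels = ["O"] * n_tokens
    for annotation in annotations:
        # Sort criteria so that the same set always yields the same label.
        suffix = "+".join(sorted(annotation.criteria)) or "none"
        for rank, index in enumerate(sorted(annotation.token_indices)):
            prefix = "B" if rank == 0 else "I"
            labels[index] = f"{prefix}-MWE|{suffix}"
    return labels


tokens = ["They", "did", "a", "dry", "run", "yesterday"]
annotations = [MweAnnotation([3, 4], {"non_compositional_semantics"})]
labels = encode_joint_labels(len(tokens), annotations)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Folding criteria into the labels keeps a standard sequence classifier usable as is, at the cost of a label set that grows with every combination of criteria; predicting the criteria with a separate multi-label output is an alternative worth comparing. Tokens lying inside a discontinuous MWE without belonging to it are left as "O" in this naive encoding.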

A good knowledge of French is a plus. The duration of the internship is 5 months. The internship will take place at the ATILF, Nancy.

References

[CCR+21] Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Bruno Guillaume, Yannick Parmentier, and Silvio Cordeiro. A French corpus annotated for multiword expressions and named entities. Journal of Language Modelling, 8(2):415–479, Feb. 2021.

[CS12] Marie Candito and Djamé Seddah. Effectively long-distance dependencies in French: annotation and parsing evaluation. In TLT 11 – The 11th International Workshop on Treebanks and Linguistic Theories, Lisbon, Portugal, November 2012.

[RSG+20] Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Renata Ramisch, Sara Stymne, Abigail Walsh, and Hongzhi Xu. Edition 1.2 of the PARSEME shared task on semi-supervised identification of verbal multiword expressions. In Stella Markantonatou, John McCrae, Jelena Mitrović, Carole Tiberius, Carlos Ramisch, Ashwini Vaidya, Petya Osenova, and Agata Savary, editors, Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 107–118, online, December 2020. Association for Computational Linguistics.

[SBB+02] Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. Multiword expressions: A pain in the neck for NLP. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 1–15, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.

Mathieu Constant

ATILF - UMR 7118 - CNRS / UL
