Machine learning, textual data, and research in the humanities and social sciences

Study day

Dates

From Monday 25 November 2019
to Tuesday 26 November 2019

Times

Monday 25 November: 13:30 to 17:30
Tuesday 26 November: 8:30 to 12:30

Venue(s)

Site Descartes - Bâtiment Buisson - Conference room D8.001

Organizer(s)

Language(s) of the talks

French

In recent years we have witnessed an explosion in the uses of machine learning, driven in particular by the spectacular progress made possible by deep learning. This explosion has been accompanied by a twofold shift of these techniques from academia to industry: not only has the commercial sector adopted them to build services for its customers, it has also taken the lead in machine learning research. Through these two study days, we wish to examine and stimulate a shift in the opposite direction: since textual data have received particular attention in recent developments in deep learning, what are the existing or possible uses of these learning techniques for the humanities and social sciences?

PROGRAMME

Monday 25 November

13:30: Welcome

14:00-15:00: Thibault Clérice, École nationale des chartes

Deep learning and the humanities: between score and application

The development of deep learning tools is always evaluated in the same way. Far from seeking to reform this method, we propose to reaffirm its importance by revisiting its core principles of reproducibility. However, through a series of practical cases, we will reflect on the importance of the scores and data used in deep learning, particularly in the humanities. From the relationship between datasets to corpus preparation, each of these steps can greatly influence the results of a final analysis.

15:00-16:00: Julien Velcin, Laboratoire Eric - Université Lyon

Analyzing informational landscape with AI

Analyzing the complexity of our informational landscape is almost intractable without advanced machine learning and NLP techniques. Several powerful algorithms have been designed to "digest" this huge amount of information by building efficient categorization schemes. However, we must ensure that these schemes are consistent if they are to be accepted by the end users of our system. This is especially the case for researchers in the social sciences and humanities. In this talk I will present one such collaboration with researchers in media studies who aim to study the "mediascape" that grounds most of our understanding of the outer world today. This work was developed in the project "Journaliste à l'ère du Numérique", in close collaboration with J.C. Soulages (centre M. Weber, Univ. Lyon 2).

16:00-16:30: Break

16:30-17:30: Mathieu Valette, Équipe de Recherche Textes, Informatique, Multilinguisme - INALCO

Title to be announced

Tuesday 26 November

8:30-9:00: Welcome

9:00-10:00: Eric de la Clergerie, Almanach - Inria

Title to be announced

10:00-11:00: Taylor Arnold, University of Richmond, Collegium de Lyon

Studying Visual Style in American Television

In this presentation, I show how face detection and recognition algorithms, applied to frames extracted from a corpus of moving images, are able to capture many formal elements present in moving images. Locating and identifying faces makes it possible to algorithmically extract time-coded labels that directly correspond to concepts and taxonomies established within film theory. Knowing the size of detected faces, for example, provides a direct link to the concept of shot framing. The blocking of a scene can similarly be deduced from the relative positions of identified characters within a specific cut. Once produced on a large scale, extracted formal elements can be aggregated to explore visual style across a collection of materials. It is then possible to understand how visual style is used within the internal construction of narrative and as a way to engage broadly with external cultural forces. The talk will focus on the application of our techniques for extracting formal elements to a corpus of moving images containing every broadcast episode of two Network Era sitcoms.
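As a rough illustration of the link between face size and shot framing described above, the following Python sketch labels a frame from the bounding boxes of its detected faces and orders identified characters from left to right as a crude proxy for blocking. It is not the speaker's method: the `FaceBox` type, the function names, and the numeric cut-offs in `SHOT_THRESHOLDS` are all assumptions made for the example, and the boxes are assumed to come from some upstream face detector.

```python
from dataclasses import dataclass


@dataclass
class FaceBox:
    """Bounding box of one detected face, in pixels (hypothetical type)."""
    top: int
    bottom: int
    left: int
    right: int


# Assumed cut-offs on (face height / frame height); illustrative only,
# not values taken from the talk or from film theory.
SHOT_THRESHOLDS = [
    (0.40, "close-up"),
    (0.20, "medium close-up"),
    (0.10, "medium shot"),
    (0.00, "long shot"),
]


def shot_framing(faces, frame_height):
    """Label a frame by the relative height of its largest detected face."""
    if not faces:
        return "no face"
    largest = max(face.bottom - face.top for face in faces)
    ratio = largest / frame_height
    for threshold, label in SHOT_THRESHOLDS:
        if ratio >= threshold:
            return label
    return "long shot"


def blocking(named_faces):
    """Order identified characters left to right by horizontal face centre."""
    def centre(item):
        _, face = item
        return (face.left + face.right) / 2
    return [name for name, _ in sorted(named_faces.items(), key=centre)]
```

Applied to every sampled frame, labels of this kind could then be aggregated per episode or per series, which is the sort of large-scale comparison the abstract describes.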

11:00-11:30: Break

11:30-12:30: Angelo Mario Del Grosso, Institute of Computational Linguistics “A. Zampolli” ILC-CNR, Pisa

Multilingual Word-by-word alignment. Methodology and some preliminary outcomes towards the construction of multilingual Lexicon within the “Traduzione del Talmud Babilonese” project.

Textual scholars have long exploited multilingual resources in their daily work to better understand the primary sources they study. Bitexts are parallel texts that prove useful in a number of cross-linguistic and comparative processing tasks. This talk will show the workflow adopted in the research conducted on the Italian translation of the Babylonian Talmud. More specifically, I will illustrate the ongoing work towards the construction of a multilingual Hebrew/Aramaic/Italian terminological resource by means of stochastic generative approaches to word-by-word text alignment.

The related literature discusses many techniques for this task. The alignment tool I developed is grounded in generative models (i.e., IBM and HMM models), a family of unsupervised machine learning algorithms, to calculate the probability of linking two words in a multilingual term pair.
From a technical standpoint, the adopted models are based on an alignment function and on an unsupervised training procedure that estimates the unknown probability distributions. Other machine learning approaches to word alignment exist, including discriminative techniques, which are based on a target function and on a supervised learning process that exploits a labeled training data set.
The implemented models have been widely adopted in the literary domain, as they can handle interpretative bitexts effectively, also modeling deletion, insertion, and transposition phenomena, without requiring an existing labeled data set.
The workflow I will present encompasses four distinct phases: 1) The encoding of the parallel text, which has been carried out according to the latest TEI recommendations. In particular, the linking-target approach described in Module 16 of the guidelines was used. 2) The semi-automatic extraction of the Italian terms, which has been carried out by means of linguistic analysis technologies available at the Institute of Computational Linguistics (ILC-CNR). These tools include a stochastic component for terminology extraction. 3) The addition of Hebrew/Aramaic terms to the extracted Italian ones via word-by-word alignment, to automatically process the three main ancient languages appearing in the Talmud, namely Mishnaic Hebrew, Biblical Hebrew and Babylonian Aramaic. 4) Finally, the revision of the results through a purpose-built web application. This final step is devoted to building a ground truth and/or a gold training set, allowing us to perform a complete validation of the alignment outcomes.
So far, 219,000 tokens have been analyzed, extracted from four tractates of the Babylonian Talmud that have been translated to date.
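To make the idea of an unsupervised generative alignment model more concrete, here is a minimal Python sketch of expectation-maximization training in the style of IBM Model 1, the simplest member of the IBM family mentioned in the abstract. It is not the speaker's tool: the function names are hypothetical, the NULL source word of the full model is omitted for brevity, and the German/English sentence pairs are invented toy data, not material from the Talmud project.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate t(target_word | source_word) by EM, IBM Model 1 style.

    bitext: list of (source_tokens, target_tokens) sentence pairs.
    Simplification: no NULL source word is added.
    """
    target_vocab = {tw for _, tgt in bitext for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    # t[(target_word, source_word)], initialised uniformly
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected link counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: share each target word among the source words
        # of its sentence pair, in proportion to the current t.
        for src, tgt in bitext:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: re-estimate t from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    return [(tw, max(src, key=lambda sw: t[(tw, sw)])) for tw in tgt]


# Invented toy bitext (source, target); no labeled alignments are given.
TOY_BITEXT = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

model = train_ibm_model1(TOY_BITEXT)
links = align(["das", "Haus"], ["the", "house"], model)
```

No alignment labels are supplied at any point: the translation probabilities are estimated from co-occurrence in the sentence pairs alone, which is why such models can be used without an existing labeled data set.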

Free of charge

marianne.reboul [at] ens-lyon.fr (Marianne Reboul)
jean-philippe.mague [at] ens-lyon.fr (Jean-Philippe Magué)
pierre.borgnat [at] ens-lyon.fr (Pierre Borgnat)

Keywords

Research

Disciplines
