On the Robustness of Text Vectorizers
Damien Garreau (MCF at J. A. Dieudonné laboratory, Université Côte d'Azur).
Abstract:
A fundamental issue in natural language processing pipelines is their robustness with respect to changes in the input. One critical step in this process is the embedding of documents, which transforms sequences of words or tokens into vector representations. In this talk, I will show how popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance. I will present quantitative bounds and demonstrate how the constants involved are affected by the length of the document. This is joint work with Rémi Catellier and Samuel Vaiter.
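To make the notion of Lipschitz robustness with respect to the Hamming distance concrete, here is a minimal toy sketch. It is not taken from the talk or the preprint: it uses raw term counts rather than TF-IDF, and the helper names (hamming, count_vector, l1) are hypothetical. It only checks the elementary fact that replacing k tokens in a document moves its count vector by at most 2k in L1 norm, since each replacement lowers one coordinate by one and raises another by one.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length token sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def count_vector(tokens, vocab):
    """Raw term-count embedding over a fixed vocabulary (an assumption, not TF-IDF)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def l1(u, v):
    """L1 distance between two vectors of the same length."""
    return sum(abs(x - y) for x, y in zip(u, v))


doc = "the cat sat on the mat".split()
perturbed = "the dog sat on the rug".split()  # two tokens replaced

vocab = sorted(set(doc) | set(perturbed))
k = hamming(doc, perturbed)
gap = l1(count_vector(doc, vocab), count_vector(perturbed, vocab))

# Each replaced token changes at most two coordinates by one each.
assert gap <= 2 * k
print(f"Hamming distance: {k}, L1 gap between count vectors: {gap}")
```

The quantitative bounds for concatenation, TF-IDF, and Paragraph Vector, and their dependence on document length, are the subject of the talk and the preprint; this sketch only illustrates the type of statement involved.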
Preprint: https://arxiv.org/abs/2303.07203
More information: https://sites.google.com/view/damien-garreau/
Talk in room M7 101 (Campus Monod, ENS de Lyon)