Unsupervised Semantic Expansion of Specialized Long Documents for Information Retrieval [Directed PhD thesis]
By Oussama Ayoub, my former PhD student co-advised with Christophe Rodrigues at ESILV. [theses.fr, manuscrit]
First Phd student for whom I was director (previous were co-advised). A great experience were I learned a lot!
Abstract: Faced with the incessant growth of textual data that needs processing, Information Retrieval (IR) systems are confronted with the urgent need to adopt effective mechanisms for efficiently selecting document sets that are best suited to specific queries. A predominant difficulty lies in the terminological divergence between the terms used in queries and those present in relevant documents. This semantic disparity, particularly pronounced for terms with similar meanings in large-scale documents from specialized domains, poses a significant challenge for IR systems.In addressing these challenges, many studies have been limited to query enrichment via supervised models, an approach that proves inadequate for industrial application and lacks flexibility. This thesis proposes LoGE an innovative alternative with an unsupervised search system based on advanced Deep Learning methods. This system uses a masked language model to extrapolate associated terms, thereby enriching the textual representation of documents. The Deep Learning models used, pre-trained on extensive textual corpora, incorporate general or domain-specific knowledge, thus optimizing the document representation.The analysis of the generated extensions revealed an imbalance between the signal (relevant terms added) and the noise (irrelevant terms). To address this issue, we developed SummVD, an innovative extractive automatic summarization approach, using singular value decomposition to synthesize the information contained in documents and identify the most pertinent phrases. This method has been adapted to filter the terms of the extensions based on the local context of each document, thereby maintaining the relevance of the information while minimizing noise.
Laisser un commentaire