Multi‑Document Summarization

Multi‑Document Summarization has become increasingly important due to the ever‑growing number of documents, especially on the Internet. Multi‑document summarization systems must produce summaries of several texts while keeping track of redundant, contradictory and complementary information. In this work, we aim in particular at detecting redundant and contradictory information using techniques such as text similarity, topic segmentation and networks of words (i.e. maps of content).
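As a minimal sketch of the text‑similarity idea, redundancy can be detected by comparing a candidate sentence against the sentences already selected for the summary. The bag‑of‑words cosine measure and the threshold value below are illustrative assumptions, not the project's actual design:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(sent_a, sent_b):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Illustrative threshold; in practice it would be tuned on a corpus.
REDUNDANCY_THRESHOLD = 0.8

def is_redundant(sentence, summary_sentences):
    """Flag a sentence as redundant if it is too similar to any
    sentence already selected for the summary."""
    return any(cosine_similarity(sentence, s) >= REDUNDANCY_THRESHOLD
               for s in summary_sentences)
```

Contradiction detection is harder and would build on topic segmentation and the word networks mentioned above rather than on surface overlap alone.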
Web Content Mining

Web Content Mining is increasingly important as one wants to filter relevant information out of the Web. This work aims at detecting similar websites through the analysis of their content, in particular the related terms automatically extracted from a query. Machine Learning, Data Warehousing and Natural Language Processing techniques will be used for this purpose.
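One simple way to compare sites by their extracted terms is set overlap. The Jaccard coefficient and the sample term sets below are assumptions for illustration; the project would define its own similarity measure over automatically extracted terms:

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient between two sets of extracted terms."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical terms extracted for a query on each site.
site_terms = {
    "site1.example": ["summarization", "nlp", "clustering", "corpus"],
    "site2.example": ["nlp", "corpus", "parsing", "summarization"],
    "site3.example": ["football", "league", "score"],
}

def most_similar(site, sites=site_terms):
    """Return the other site whose term set is closest to `site`'s."""
    return max((s for s in sites if s != site),
               key=lambda s: jaccard(sites[site], sites[s]))
```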
Multimedia Information Retrieval

Combining images and text is a recent trend in Artificial Intelligence. Here, we aim at using image features together with text features to index web images (i.e. to characterize the content of an image with words). The student will have to implement a multimedia search engine to evaluate our assumptions, i.e. using multiword units, the position of words relative to the image, the density of words, colours, textures, etc.
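To illustrate the "position of words relative to the image" feature, one could weight each word in a page by how close it occurs to the image in the token stream. The linear decay and window size below are assumptions of this sketch; the actual weighting scheme would be determined experimentally:

```python
def index_image(tokens, image_pos, window=10):
    """Index an image by the surrounding words, weighting each word by
    its distance to the image position in the token stream (closer
    words get higher weights; linear decay is an assumption here)."""
    weights = {}
    for i, tok in enumerate(tokens):
        d = abs(i - image_pos)
        if 0 < d <= window:
            weights[tok] = max(weights.get(tok, 0.0), 1.0 - d / (window + 1))
    return weights

# Usage: tokens of a page with an image placeholder at index 3.
page = "a photo of <img> a red sunset".split()
word_weights = index_image(page, image_pos=3)
```

Image features proper (colours, textures) would then be combined with these word weights in the search engine's index.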
Automatic Clustering of Multiword Units

This work aims at clustering multiword units extracted by SENTA and/or HELAS (software systems for the extraction of multiword units) according to their morpho‑syntactic structure and context, i.e. compound nouns, verbal locutions, etc. In a first phase, the PoBOC algorithm (a soft clustering algorithm) will be used. For that purpose, it will be necessary to define attributes and a specific similarity measure. Other clustering algorithms will then be used to evaluate PoBOC.
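Defining the similarity measure is part of the project, but a toy version helps fix ideas. The sketch below represents each multiword unit as (words, part‑of‑speech pattern) and averages lexical overlap with pattern agreement; the attribute choice and the equal weights are pure assumptions, not PoBOC itself:

```python
def mwu_similarity(unit_a, unit_b):
    """Toy similarity between two multiword units, each given as
    (list_of_words, pos_pattern). Averages Dice overlap of the word
    sets with exact agreement of the morpho-syntactic pattern."""
    (words_a, pos_a), (words_b, pos_b) = unit_a, unit_b
    wa, wb = set(words_a), set(words_b)
    dice = 2 * len(wa & wb) / (len(wa) + len(wb)) if (wa or wb) else 0.0
    pos_match = 1.0 if pos_a == pos_b else 0.0
    return 0.5 * dice + 0.5 * pos_match
```

A measure of this kind would feed PoBOC's soft cluster assignments, letting one unit belong to several morpho‑syntactic classes with different degrees.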
Efficient Computing of Non‑Contiguous Multiword Units

Computing statistics for non‑contiguous substrings can be overwhelming when gigabytes of text must be processed. For contiguous substrings, suffix arrays have been used to speed up the calculation of their frequencies. For non‑contiguous substrings, Gil & Dias (2003) have proposed an adaptation of the suffix‑array data structure. Meanwhile, Yamamoto (2000) proposed a new methodology to calculate statistics for contiguous substrings using classes of substrings. The aim of this work is to extend Yamamoto’s algorithm to non‑contiguous substrings.
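For background, the contiguous case that suffix arrays solve can be sketched in a few lines: once the suffixes are sorted, all occurrences of a substring form one contiguous block, found by binary search. This is a naive illustration (quadratic construction), not Yamamoto's class‑based method or the Gil & Dias adaptation:

```python
def build_suffix_array(text):
    """Sorted start positions of all suffixes (naive construction;
    real systems use O(n log n) or linear-time algorithms)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def frequency(text, sa, pattern):
    """Count occurrences of a contiguous substring: binary-search the
    sorted suffixes for the block of suffixes starting with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(sa)
    while lo < hi:                       # upper bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first
```

Non‑contiguous substrings break the "one contiguous block" property, which is exactly why the extensions above are needed.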