Università di Pisa
Sistema bibliotecario di ateneo

The anatomy of a Clustering Engine for Web Snippets

Ferragina, Paolo and Gullì, Antonio (2004) The anatomy of a Clustering Engine for Web Snippets. Technical Report del Dipartimento di Informatica. Università di Pisa, Pisa, IT.

Postscript (GZip) - Published Version
Available under License Creative Commons Attribution No Derivatives.


    Recently there has been a surge of commercial interest in novel IR tools, such as Vivisimo and Groxis, that support search-engine users in formulating and refining their queries. The basic idea is that the snippets returned by the search engine are grouped into clusters, which are then organized in a hierarchy whose nodes are labeled with meaningful sentences. Each sentence must capture the "theme" of the snippets contained in the cluster it labels. This way the user is provided with a small but intelligible picture of the query answers at various levels of detail. Despite this commercial interest, we found just four scientific papers on this topic. None of them achieved results comparable to those of Vivisimo, which actually represents the state of the art. In the present paper we address this problem in its full generality: labels of variable length for denoting the clusters, labels drawn from the Web snippets as non-contiguous sequences of terms, and clusters that possibly overlap and are organized within a hierarchy. We achieve these results by means of an algorithmic approach that exploits some innovative ideas, at least from the academic side!

    Item Type: Book
    Uncontrolled Keywords: Web Clustering, Web Snippets, Search Engines, Indexing, Algorithms, Experimentation
    Subjects: Area01 - Scienze matematiche e informatiche > INF/01 - Informatica
    Divisions: Dipartimenti (until 2012) > DIPARTIMENTO DI INFORMATICA
    Depositing User: dott.ssa Sandra Faita
    Date Deposited: 10 Dec 2014 09:49
    Last Modified: 10 Dec 2014 09:49
    URI: http://eprints.adm.unipi.it/id/eprint/2110
