Klaus Berberich,

Max-Planck-Institute für Informatics

Abstract:

Frequent sequence mining is a fundamental building block in data mining. While
the problem has been intensively studied, existing methods cannot handle
datasets consisting of billions of sequences. Datasets of that scale are common
in applications such as natural language processing, when computing n-gram
statistics over large-scale document collections, and business intelligence, when
analyzing sessions of millions of users.

In this talk, I will present two methods that we developed recently to mine
frequent sequences using MapReduce as a platform for distributed data
processing. Suffix-Sigma, as the first method, targets the special case of
contiguous sequences such as n-grams. It relies on sorting and aggregating
sequence suffixes, leveraging ideas from string processing. MG-FSM, as the
second method, identifies also non-contiguous frequent sequences. To this end,
it partitions and prepares the input in such way that frequent sequences can be
efficiently mined in isolation on each of the resulting partitions using any
existing method. Experiments on two large-scale document collections demonstrate that Suffix-Sigma and MG-FSM are substantially more efficient and scalable than alternative approaches. Furthermore, I will discuss extensions of Suffix-Sigma and MG-FSM, for instance, to report only closed or maximal sequences and thus drastically reduce their output.

(* Joint INESC-ID/LASIGE Seminar)

Klaus Berberich is a Senior Researcher at the Max Planck Institute for Informatics where he coordinates the research area Text + Time Search & Analytics. His research is rooted in Information Retrieval and touches the related areas of Data Management and Data Mining. Klaus has built a time machine — to search in web archives. More recently, he has worked on frequent sequence mining algorithms for modern platforms such as MapReduce. His ongoing research focuses on (i) novelty & diversity in web archive search; (ii) temporal linking of document collections; (iii) mining document collections for insights about the past, present, and future. Klaus holds a doctoral degree (2010, summa cum laude) and a diploma (2004) in Computer Science from Saarland University. He has served on numerous program committees in his research communities of interest (IR, DB, DM).

 

Date: 2014-Dec-04     Time: 15:30:00     Room: 020


For more information: