Frequent Sequence Mining in MapReduce*
Klaus Berberich,
Max Planck Institute for Informatics
Abstract:
Frequent sequence mining is a fundamental building block in data mining. While
the problem has been intensively studied, existing methods cannot handle
datasets consisting of billions of sequences. Datasets of that scale are common
in applications such as natural language processing, when computing n-gram
statistics over large-scale document collections, and business intelligence, when
analyzing sessions of millions of users.
In this talk, I will present two methods that we recently developed to mine
frequent sequences using MapReduce as a platform for distributed data
processing. The first method, Suffix-Sigma, targets the special case of
contiguous sequences such as n-grams. It relies on sorting and aggregating
sequence suffixes, leveraging ideas from string processing. The second method,
MG-FSM, also identifies non-contiguous frequent sequences. To this end, it
partitions and prepares the input in such a way that frequent sequences can be
mined efficiently and in isolation on each of the resulting partitions, using
any existing method. Experiments on two large-scale document collections
demonstrate that Suffix-Sigma and MG-FSM are substantially more efficient and
scalable than alternative approaches. Furthermore, I will discuss extensions of
Suffix-Sigma and MG-FSM, for instance, to report only closed or maximal
sequences and thus drastically reduce their output.
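To make the suffix-based idea concrete, the following is a minimal single-process sketch in Python, not the Suffix-Sigma implementation presented in the talk. It assumes that every suffix of a sequence is emitted, truncated to a maximum length, and that an n-gram's occurrence count is the total count of the suffixes it is a prefix of. The function name frequent_ngrams and the parameters sigma (maximum n-gram length) and tau (minimum count) are hypothetical names chosen for illustration.

```python
from collections import Counter


def frequent_ngrams(sequences, sigma, tau):
    """Count contiguous n-grams of length <= sigma and keep those with count >= tau.

    Illustrative sketch only: sigma and tau are assumed parameter names,
    and everything runs in memory rather than on MapReduce.
    """
    # "Map" step: emit every suffix, truncated to at most sigma items.
    suffix_counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            suffix_counts[tuple(seq[i:i + sigma])] += 1

    # "Reduce" step: each n-gram occurrence is the prefix of exactly one
    # emitted suffix, so summing suffix counts over prefixes gives n-gram counts.
    # Sorting only mimics the order a shuffle phase would deliver; the
    # dictionary-based aggregation below does not depend on it.
    ngram_counts = Counter()
    for suffix, count in sorted(suffix_counts.items()):
        for n in range(1, len(suffix) + 1):
            ngram_counts[suffix[:n]] += count

    return {gram: c for gram, c in ngram_counts.items() if c >= tau}


if __name__ == "__main__":
    docs = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
    print(frequent_ngrams(docs, sigma=2, tau=2))
```

Note that this sketch counts every occurrence of an n-gram across the collection; counting the number of sequences that contain it would require deduplicating suffix prefixes per sequence.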
(* Joint INESC-ID/LASIGE Seminar)
Klaus Berberich is a Senior Researcher at the Max Planck Institute for Informatics where he coordinates the research area Text + Time Search & Analytics. His research is rooted in Information Retrieval and touches the related areas of Data Management and Data Mining. Klaus has built a time machine — to search in web archives. More recently, he has worked on frequent sequence mining algorithms for modern platforms such as MapReduce. His ongoing research focuses on (i) novelty & diversity in web archive search; (ii) temporal linking of document collections; (iii) mining document collections for insights about the past, present, and future. Klaus holds a doctoral degree (2010, summa cum laude) and a diploma (2004) in Computer Science from Saarland University. He has served on numerous program committees in his research communities of interest (IR, DB, DM).
Date: 2014-Dec-04 Time: 15:30:00 Room: 020
For more information:
- mjs@inesc-id.pt
- 213100360