
Big Data Computing

Master's Degree in Computer Science

Academic Year 2018-2019, spring semester

Instructor: Irene Finocchi

Office Hours: by appointment.
Office: Città Universitaria, Department of Statistical Sciences, 4th floor, room 21.
E-mail: irene.finocchi AT uniroma1.it

Meeting times and location

Day	Time	Room
Monday	10:30-13:00	Aula 2 (Via del Castro Laurenziano 7)
Friday	8:00-10:30	Aula 2 (Via del Castro Laurenziano 7)

News
Course description
Lectures and readings
Grading
Google group

News

September exam grades: here. Final grades for students who passed both parts are here. I will proceed with grade registration on Infostud unless you let me know, by October 8, that you are not going to accept the grade and want to improve it.

July 1 final grades: final grades. I will proceed with grade registration on Infostud (for students who passed both parts) unless you let me know, by July 31, that you are not going to accept the grade and want to improve it.

September exam: September 4, 9:00, aula 34 Dipartimento di Statistica (fourth floor). Students who have already passed Part 1 can arrive at 10:30.

July 1 grades: here. The final grades for students who have passed both the first and the second part, considering also the paper hacking reports and presentations, will be available at the end of this week.

June 6 final grades: final grades. I will proceed with grade registration on Infostud (for students who passed both parts) unless you let me know, by Wednesday, July 3, that you are not going to accept the grade and want to improve it. (June 6 exam grades are available here.)

Midterm grades: here

Student presentations June 3: Sala riunioni 34 (Statistica), Department of Statistical Sciences, Città Universitaria (Main Campus at Piazzale Aldo Moro), building CU002 (http://www.dss.uniroma1.it/sites/default/files/files/mappaCUSapienza.pdf), 4th floor, Room 34.

Summer exams: June 6 at 9:00 in Aula P1 and July 1 at 9:00 in Aula P2 (Main Campus). Students who have already passed the first part can arrive at 10:30.

Course description

As data sets grow to terabyte and petabyte scales, traditional models and paradigms of sequential computation become obsolete. The course will focus on fundamental algorithmic and programming issues posed by big-data computing, tackling some major data mining problems on a variety of computational models used for managing massive information structures. We will study how algorithm design techniques and technological aspects of modern computing platforms interact and adapt to each other. The emphasis will be on:

MapReduce as a programming model for distributed data mining on large clusters of computers
Data streaming techniques for mining on-the-fly huge and rapidly changing streams of data
External memory algorithms for processing data stored on slow secondary memories
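
As a rough illustration of the first of these models, the sketch below simulates the three MapReduce phases (map, shuffle, reduce) for word counting in a single Python process. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example, and on a real cluster the framework would run the shuffle step across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For instance, `word_count(["big data", "big clusters"])` counts "big" twice and each other word once. The point of the model is that `map_phase` and `reduce_phase` have no shared state, so each can run on many machines in parallel.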

The lectures will follow an experimental and problem-driven approach. The goal for the class is to be broad and to touch upon a variety of techniques, introducing standard practices as well as cutting-edge research topics in this area.

Hands-on programming sessions will be held to guide students in the use of good programming practices and advanced programming frameworks, such as Hadoop. Students will learn the proper settings in which to use each paradigm, the advantages and disadvantages of each model, and how to design and analyze algorithms and write efficient code in different big data settings.

Learning outcomes:

Knowledge of big data processing frameworks (part of the Hadoop ecosystem)
Knowledge of advanced computational models, focusing on data streaming, MapReduce-style parallelism, external memory
Ability to write efficient code taking into account architectural features of modern computing platforms (including distributed systems)
Familiarity with data mining problems and techniques
Ability to study advanced research topics in big data systems and algorithmics for massive data
Performance analysis skills using back-of-the-envelope calculations, mathematical and experimental tools
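
To give a flavor of a back-of-the-envelope calculation, the sketch below estimates how long one sequential pass over 1 TB of data takes on a single disk and on 100 machines. The throughput of 100 MB/s per disk is an assumed round figure, not a measurement, and the helper `scan_time_seconds` is hypothetical; the model also assumes perfectly even load and no coordination overhead.

```python
def scan_time_seconds(data_bytes, throughput_bytes_per_s, machines=1):
    """Time to read the data once, assuming an even split across machines
    and no loss of per-machine throughput (both are idealizations)."""
    return data_bytes / (throughput_bytes_per_s * machines)


TB = 10**12
DISK_THROUGHPUT = 100 * 10**6  # assumed: 100 MB/s sequential read per disk

single = scan_time_seconds(1 * TB, DISK_THROUGHPUT)                 # one machine
cluster = scan_time_seconds(1 * TB, DISK_THROUGHPUT, machines=100)  # 100 machines

print(f"single machine: {single / 3600:.1f} hours")
print(f"100 machines:   {cluster:.0f} seconds")
```

Under these assumptions a single disk needs about 10,000 seconds (close to three hours) for one pass, while 100 machines need about 100 seconds.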

Lectures and readings

Readings, notes, slides, papers, code... are posted here after each lecture.

Schedule 2018

There are no required textbooks for this class: many lessons explore cutting-edge topics, and no single book covers all of them systematically.

Some resources that we will use along the way are:

J. Leskovec, A. Rajaraman, and J. Ullman, Mining of Massive Datasets. Available online.
T. White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (4th edition). O'Reilly Media.
C. Demetrescu and I. Finocchi. Algorithms for data streams. In Handbook of Applied Algorithms, John Wiley and Sons, 2008
K. Mehlhorn, P. Sanders. Algorithms and data structures: The basic toolbox, Springer, 2009. Book web site.

Grading

Written exam with both multiple-choice and open questions (if time permits, we will schedule two midterms)
Reading assignment: read a scientific paper on big data systems, write a report, and (possibly) present the paper in class. You can work in groups, typically of two people; in exceptional cases I may allow groups of three.