Big Data Computing

Master's Degree in Computer Science

Academic Year 2018-2019, spring semester

Instructor: Irene Finocchi

Office Hours: by appointment.
Office: Città Universitaria, Department of Statistical Sciences, 4th floor, room 21.
E-mail: irene.finocchi AT uniroma1.it

Meeting times and location

| Day | Time | Room |
| Monday | 10:30-13:00 | Aula 2 (Via del Castro Laurenziano 7) |
| Friday | 8:00-10:30 | Aula 2 (Via del Castro Laurenziano 7) |

News

  • September exam grades: here. Final grades for students who passed both parts are here. I will proceed with grade registration on Infostud unless you let me know, by October 8, that you are not going to accept the grade and want to improve it.

  • July 1 final grades: final grades. I will proceed with grade registration on Infostud (for students who passed both parts) unless you let me know, by July 31, that you are not going to accept the grade and want to improve it.

  • September exam: September 4, 9:00, aula 34 Dipartimento di Statistica (fourth floor). Students who have already passed Part 1 can arrive at 10:30.

  • July 1 grades: here. The final grades for students who have passed both the first and the second part, considering also the paper hacking reports and presentations, will be available at the end of this week.

  • June 6 final grades: final grades. I will proceed with grade registration on Infostud (for students who passed both parts) unless you let me know, by July 3 (next Wednesday), that you are not going to accept the grade and want to improve it. (June 6 exam grades are available here.)

  • Midterm grades: here

  • Summer exams: June 6 at 9:00 in Aula P1 and July 1 at 9:00 in Aula P2 (Main Campus). Students who have already passed the first part can arrive at 10:30.

Course description

As data sets grow to Terabyte and Petabyte scales, traditional models and paradigms of sequential computation become obsolete. The course will focus on fundamental algorithmic and programming issues posed by big-data computing, tackling some major data mining problems on a variety of computational models used for managing massive information structures. We will study how algorithm design techniques and technological aspects of modern computing platforms interact and adapt to each other. The emphasis will be on:

  • MapReduce as a programming model for distributed data mining on large clusters of computers
  • Data streaming techniques for mining huge and rapidly changing streams of data on the fly
  • External memory algorithms for processing data stored on slow secondary memories
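
To give a flavor of the first model in the list above, here is a minimal sketch of MapReduce-style word counting, simulated on a single machine in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical and chosen only for illustration. In a real framework the grouping step is performed by the system across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on an in-memory list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data computing", "big clusters", "data streams"])
print(counts)
```

The mapper and the reducer are independent of each other and touch only one record or one key at a time, which is what allows a framework to run many copies of them in parallel on different machines.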

The lectures will follow an experimental and problem-driven approach. The goal for the class is to be broad and to touch upon a variety of techniques, introducing standard practices as well as cutting-edge research topics in this area.

Hands-on programming sessions will guide students in the use of good programming practices and advanced programming frameworks, such as Hadoop. Students will learn the proper settings in which to use each paradigm, the advantages and disadvantages of each model, and how to design and analyze algorithms and write efficient code in different big data settings.

Learning outcomes:

  • Knowledge of big data processing frameworks (part of the Hadoop ecosystem)
  • Knowledge of advanced computational models, focusing on data streaming, MapReduce-style parallelism, external memory
  • Ability to write efficient code taking into account architectural features of modern computing platforms (including distributed systems)
  • Familiarity with data mining problems and techniques
  • Ability to study advanced research topics in big data systems and algorithmics for massive data
  • Performance analysis skills using back-of-the-envelope calculations, mathematical and experimental tools
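
As an illustration of the back-of-the-envelope calculations mentioned in the last item, the sketch below estimates how long it takes to read a data set once from disk. The throughput figure (100 MB/s of sequential read per machine) and the cluster size are assumptions made for the example, not measurements, and the helper `scan_time_seconds` is hypothetical.

```python
def scan_time_seconds(data_bytes, throughput_bytes_per_s, machines=1):
    """Time to read the data once, assuming an even split across machines."""
    return data_bytes / (throughput_bytes_per_s * machines)


TB = 10**12
MB = 10**6
DISK_THROUGHPUT = 100 * MB  # assumed sequential read speed per machine

single = scan_time_seconds(1 * TB, DISK_THROUGHPUT)
cluster = scan_time_seconds(1 * TB, DISK_THROUGHPUT, machines=100)

print(f"1 machine:    {single / 3600:.1f} hours")
print(f"100 machines: {cluster / 60:.1f} minutes")
```

Under these assumptions a single pass over one Terabyte takes hours on one machine and minutes on a hundred, which is the kind of rough estimate that motivates distributed processing before any code is written.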

Lectures and readings

Readings, notes, slides, papers, code... are posted here after each lecture.

Schedule 2018

There are no required textbooks for this class: many lessons explore cutting-edge topics and no single book covers all of them systematically.

Some resources that we will use along the way are:

  • J. Leskovec, A. Rajaraman, and J. Ullman, Mining of Massive Datasets. Available online.
  • T. White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (4th edition). O'Reilly Media.
  • C. Demetrescu and I. Finocchi. Algorithms for data streams. In Handbook of Applied Algorithms, John Wiley and Sons, 2008
  • K. Mehlhorn, P. Sanders. Algorithms and data structures: The basic toolbox, Springer, 2009. Book web site.

Grading

  • Written exam with both multiple-choice and open questions (if time permits, we will hold two midterms)

  • Reading assignment: read a scientific paper on big data systems, write a report, and (possibly) present the paper in class. You can work in groups, typically of two people; in exceptional cases I may allow groups of three.

Google group

The group will be used for technical discussions, homework assignments, and last-minute messages. Subscribing is strongly recommended!

Subscribe to Big Data Computing (Sapienza, Irene Finocchi)

Topic revision: r114 - 2019-10-05 - IreneFinocchi






 