Homework discussion and exam registration: June 28, 9:30, Aula Seminari (for students who sent me both homeworks and passed either both midterms or the June 6 final).
Please register for the second midterm using this link. You can participate regardless of the outcome of the first midterm. The program is divided into two parts: the first part covers the topics explained in the first half of the course, up to and including April 11; the second part covers all the remaining topics. Once one part is passed (either through a midterm or through a final exam), the score for that part remains valid for the entire academic year.
Second homework posted (see Homeworks section below). Deadline: June 9.
Final exams, summer term:
On May 24 we'll have a second midterm (covering the topics explained in the second half of the course, from April 12 to May 23, both included).
Final exam dates: June 6 (9:30, Aula Alfa) and June 30 (9:30, Aula Alfa).
You must register for the final exams on Infostud. This is required even if you passed both midterms, so that I can formally register your score on Infostud.
Final exams, starting in June, will be divided into two parts: you can take one part in one session and the other part in a different session. Once one part is passed, the score for that part remains valid for the entire academic year.
Virtual machine with a lightweight Ubuntu distribution (LXLE) containing the latest release of Hadoop: download LXLE here (file size: 4 GB). To use the virtual machine, you need VirtualBox (available on the Oracle website).
As data sets grow to Terabyte and Petabyte scales, traditional models and paradigms of sequential computation become obsolete. The course will focus on fundamental algorithmic and programming issues posed by big-data computing, tackling some major data mining problems on a variety of computational models used for managing massive information structures. We will study how algorithm design techniques and technological aspects of modern computing platforms interact and adapt to each other. The emphasis will be on:
MapReduce as a programming model for distributed data mining on large clusters of computers
Data streaming techniques for mining on-the-fly huge and rapidly changing streams of data
External memory algorithms for processing data stored on slow secondary memories
The lectures will follow an experimental and problem-driven approach. The goal for the class is to be broad and to touch upon a variety of techniques, introducing standard practices as well as cutting-edge research topics in this area.
Hands-on programming sessions will be held to guide students in the use of good programming practices and advanced programming frameworks, such as Hadoop. Students will learn the proper settings in which to use each paradigm, the advantages and disadvantages of each model, how to design and analyze algorithms, and how to write efficient code in different big data settings.
Knowledge of big data processing frameworks (part of the Hadoop ecosystem)
Knowledge of advanced computational models, focusing on data streaming, MapReduce-style parallelism, external memory
Ability to write efficient code taking into account architectural features of modern computing platforms (including distributed systems)
Familiarity with data mining problems and techniques
Ability to study advanced research topics in big data systems and algorithmics for massive data
Performance analysis skills using back-of-the-envelope calculations, mathematical and experimental tools
Lectures and readings
Readings, notes, slides, papers, code... are posted here after each lecture.
There are no required textbooks for this class: many lessons explore cutting-edge topics, and no single book covers all of them systematically.
Some resources that we will use along the way are:
Theory/programming homeworks assigned during the course. You can form groups of up to three people. Homework assignments are due at the prescribed time: exceptions will be considered only when arranged beforehand with me, or in properly documented emergencies. If you are late with a homework, I will subtract points from its evaluation. Late homeworks must nevertheless be handed in (before the written exam, at the latest) in order to pass the exam. However, if a homework is too late, it might not contribute to your final score.
(Participation in the MapReduce programming contest)
Participation in in-class discussions and reading sessions
Final written exam
The group will be used for technical discussions, homework assignments, and last-minute messages. You are strongly encouraged to subscribe!