Big Data Computing

Master's Degree in Computer Science

Academic Year 2016-2017, spring semester

Instructor: Irene Finocchi

Office Hours: by appointment.
Office: Via Salaria, room 345/A. Phone: 06-49918426.
E-mail: finocchi AT di.uniroma1.it

Meeting times and location

Day      Time         Room
Monday   10:30-13:00  Aula 2 (Via del Castro Laurenziano 7)
Friday   8:00-10:30   Aula 2 (Via del Castro Laurenziano 7)

News

  • Paper hacking: instructions and advice on writing the paper report are available here. Please hand in your report no later than May 5.

  • Paper assignment: the assignment of papers is available here, tab Paper-Assignment. Please hand in your report (more details soon) no later than May 5.

  • Paper hacking session: at this link you can find a collection of papers for the paper hacking session. In groups of two, skim through the papers (titles and abstracts) and bid for the ones you prefer: please enter at least 5 bids, with ties allowed if you have no strong preference about what to read. Bids can be entered here. Provide your name and enter numbers as in the example in row 2 (1 denotes the highest preference). Deadline: this Friday at midnight. If you do not bid by the deadline, I will likely assign you a random paper. More details on Friday during the lesson.

  • Please register for the first midterm using this link. Registration is mandatory.

Course description

As data sets grow to terabyte and petabyte scales, traditional models and paradigms of sequential computation become obsolete. The course will focus on fundamental algorithmic and programming issues posed by big-data computing, tackling some major data mining problems on a variety of computational models used for managing massive information structures. We will study how algorithm design techniques and the technological aspects of modern computing platforms interact and adapt to each other. The emphasis will be on:

  • MapReduce as a programming model for distributed data mining on large clusters of computers
  • Data streaming techniques for mining huge and rapidly changing streams of data on the fly
  • External memory algorithms for processing data stored on slow secondary memories
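
To give a flavor of the first model listed above, here is a minimal sketch of MapReduce-style word counting. It is only a single-process illustration in plain Python, not Hadoop code and not course material: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and the shuffle step that a real framework performs across a cluster is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (simulated in memory)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values sharing a key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run the three phases one after another on an in-memory collection."""
    emitted = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, word_count(["big data", "big clusters"]) counts "big" twice and the other two words once each. In a distributed setting the map and reduce functions would run on different machines, which is why each one only sees a single record or a single key at a time.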

The lectures will follow an experimental and problem-driven approach. The goal for the class is to be broad and to touch upon a variety of techniques, introducing standard practices as well as cutting-edge research topics in this area.

Hands-on programming sessions will be held to guide students in the use of good programming practices and advanced programming frameworks, such as Hadoop. Students will learn the proper settings in which to use each paradigm, the advantages and disadvantages of each model, and how to design and analyze algorithms and write efficient code in different big data settings.

Learning outcomes:

  • Knowledge of big data processing frameworks (part of the Hadoop ecosystem)
  • Knowledge of advanced computational models, focusing on data streaming, MapReduce-style parallelism, external memory
  • Ability to write efficient code taking into account architectural features of modern computing platforms (including distributed systems)
  • Familiarity with data mining problems and techniques
  • Ability to study advanced research topics in big data systems and algorithmics for massive data
  • Performance analysis skills using back-of-the-envelope calculations, mathematical and experimental tools

Lectures and readings

Readings, notes, slides, papers, code... are posted here after each lecture.

Schedule 2017

There are no required textbooks for this class: many lessons explore cutting-edge topics and no single book covers all of them systematically.

Some resources that we will use along the way are:

  • J. Leskovec, A. Rajaraman, and J. Ullman, Mining of Massive Datasets. Available online.
  • T. White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (4th edition). O'Reilly Media.
  • C. Demetrescu and I. Finocchi. Algorithms for data streams. In Handbook of Applied Algorithms, John Wiley and Sons, 2008.
  • K. Mehlhorn, P. Sanders. Algorithms and data structures: The basic toolbox, Springer, 2009. Book web site.

Homeworks

The homework for A.Y. 2017-2018 (including a small software project in MapReduce) will be posted here during the course.

Old homeworks:

Grading

  • Homework assigned during the course, including a small software project in MapReduce (you can work in groups of up to two people).
  • Reading assignment: read a scientific paper on big data systems, write a report, and (possibly) present the paper in class (you can work in groups of up to two people).
  • Final written exam with both multiple-choice and open questions (if time permits, the final exam may be split into two midterms).

Google group

The group will be used for technical discussions, homework assignments, and last-minute messages. Subscribing is strongly recommended!

Subscribe to Big Data Computing (Sapienza, Irene Finocchi)

Visit this group
Topic revision: r92 - 2018-04-30 - IreneFinocchi






 