---++!! *Big Data Computing*
---+++!! Master's Degree in Computer Science
---+++!! Academic Year 2018-2019, spring semester

<table width="100%" border=0 cellpadding=0>
<tr>
<td width="65%" valign="top">
<table width="100%" border=0 cellspacing=0>
<tr>
<td valign="top">
<font color=#AF0F0F size="+1"><b>Instructor: [[http://www.dsi.uniroma1.it/~finocchi][Irene Finocchi]]</b></font>
*Office Hours*: by appointment.<br />Office: Città Universitaria, Department of Statistical Sciences, 4th floor, room 21. <br />E-mail: irene.finocchi AT uniroma1.it
</td></tr>
<tr>
<td valign="top">
<div align="left"> </div>
<div align="left"><b><p> </p>Meeting times and location</b></div>
| *Day* | *Time* | *Room* |
| Monday | 10:30-13:00 | Aula 2 (Via del Castro Laurenziano 7) |
| Friday | 8:00-10:30 | Aula 2 (Via del Castro Laurenziano 7) |
</td></tr>
</table>
</td>
<td width="35%" valign="top" style="border-left: 1px solid #999999; ">
%TOC%
</td>
</tr>
</table>

---++ News

   * *September exam grades*: [[https://drive.google.com/file/d/1CQ5cbCZREjoNcrzGBKk2QG_Vqq57vbUx/view?usp=sharing][here]]. Final grades for students who passed both parts are [[https://drive.google.com/file/d/1Sj6TAhSCGJ_L-0W7tFWS2kfJCEah0GeC/view?usp=sharing][here]]. I will proceed with grade registration on Infostud unless you let me know, by October 8, that you are not going to accept the grade and want to improve it.
   * *July 1 final grades*: [[https://drive.google.com/file/d/10lmxMPBiCPCgapjfs7AJ7MKB8PNCEfk6/view?usp=sharing][final grades]]. I will proceed with grade registration on Infostud (for students who passed both parts) unless you let me know, by July 31, that you are not going to accept the grade and want to improve it.
   * *September exam*: *September 4*, *9:00*, *aula 34* Dipartimento di Statistica (fourth floor). Students who have already passed Part 1 can arrive at 10:30.
   * *July 1 grades*: [[https://drive.google.com/file/d/14BsOiTJ8I-LEcBscEw64nTH3EGWTcFb4/view?usp=sharing][here]].
     The final grades for students who have passed both the first and the second part, considering also the paper hacking reports and presentations, will be available at the end of this week.
   * *June 6 final grades*: [[https://drive.google.com/file/d/1n1N_EE20yWNUviPH8OAiOAKcmBQoc2er/view?usp=sharing][final grades]]. I will proceed with the grade registration on Infostud (for students who passed both parts) unless you let me know, by July 3 (next Wednesday), that you are not going to accept the grade and want to improve it. (June 6 exam grades are available [[https://drive.google.com/file/d/1veEV6GCdpJco6CC0Z3Uxte1JC4GW0O9v/view?usp=sharing][here]])
   * *Midterm grades*: [[https://drive.google.com/file/d/1Y1eub3gIJ43ZzF5y9fhe2zzgC-EK67UF/view?usp=sharing][here]]
   * *Student presentations June 3*: Sala riunioni 34 (Statistica), Department of Statistical Sciences, Città Universitaria (Main Campus at Piazzale Aldo Moro), building CU002 (http://www.dss.uniroma1.it/sites/default/files/files/mappaCUSapienza.pdf), 4th floor, Room 34.
   * *Summer exams*: *June 6* at 9:00 in *Aula P1* and *July 1* at 9:00 in *Aula P2* (Main Campus). Students who have already passed the first part can arrive at 10:30.
<!--
   * *January exam grades*: [[https://drive.google.com/file/d/13ltMAvZjTLtFHWdqDWEpOkrbjHCYo2WV/view?usp=sharing][here]]. Please, let me know by *Friday, February 1*, if you *do not accept* your final grade. Otherwise, I will register it on Infostud (after that, you should receive an automatic email notification).
   * *Final grades September session*: [[https://drive.google.com/file/d/1g6_p01eCvdRfMi29mXZIfuM9r7iwTaIq/view?usp=sharing][here]]. Please, let me know by *Thursday, October 11*, if you *do not accept* your final grade. Otherwise, I will register it on Infostud (after that, you should receive an automatic email notification).
   * *Grades September 6 exam*: [[https://drive.google.com/file/d/1jA1M3RBJuwwSjKuC6yjB8bKow_Siv5Xl/view?usp=sharing][here]].
     The final grades for students who have passed both the first and the second part, considering also the paper hacking reports and presentations, will be available at the beginning of next week.
   * *July 5 final grades*: [[https://drive.google.com/file/d/1sfGIfjnbF0ywELCqAG3AvKsuH3NBi9Xv/view?usp=sharing][here]]. Please, let me know by *Thursday, July 26*, if you *do not accept* your final grade. Otherwise, I will register it on Infostud (after that, you should receive an automatic email notification). Student 1826973 should contact me by email.
   * *Grades July 5 exam*: [[https://drive.google.com/file/d/18RVxWvAUG8vN9dlntyhH2KOpbT1EBChD/view?usp=sharing][here]]. The final grades for students who have passed both the first and the second part, considering also the paper hacking reports and presentations, will be available at the beginning of next week.
   * *July 5 exam*: *part 1* starts at *15:00*, *part 2* starts at *16:30*.
   * *June 6 final grades*: [[https://drive.google.com/file/d/1emsW_NypsIyTNb5QtxnrXJL0NjEvoszv/view?usp=sharing][here]]. Please, let me know by Wednesday, July 4, if you *do not accept* your final grade. Otherwise, I will register your grade on Infostud (after that, you should receive an automatic email notification).
   * *Grades June 6 exam*: [[https://drive.google.com/file/d/1OHwjp_T8fvGc_0hlqTF_ddL2oDek7Xgp/view?usp=sharing][here]]. The final grades for students who have passed both the first and the second part, considering also the paper hacking reports and presentations, will be available at the beginning of next week.
   * *Next exam*: due to an unexpected commitment, the next BDC exam is postponed from June 27 to *July 5*. Location: Aula 2. The first part will start at *15:00* (sharp) and the second part will start at 16:30. Late students will not be admitted. Please register on Infostud for the June 27 session.
   * *June 6 exam*: the exam will be divided into two parts.
     The first part will start at 9:00 (sharp) and the second part will start at 10:30. Late students will not be admitted.
   * *Midterm scores*: [[https://drive.google.com/file/d/1-VuP3EuVYOnXADZIMLiboCYAOw_q0IJ4/view?usp=sharing][here]]
   * *Paper hacking*: instructions and advice on paper report writing are available [[https://drive.google.com/file/d/1dJvRv7fyZNJr0no-IR72AXnDi33RFqU7/view?usp=sharing][here]]. Please hand in your report no later than May 5.
   * *Paper assignment*: the assignment of papers is available [[https://docs.google.com/spreadsheets/d/1-DgONoBBoc9lSDpORNGQORreyjJcy6ArLdl4EPZ3eYk/edit?usp=sharing][here, tab Paper-Assignment]]. Please hand in your report (more details soon) no later than May 5.
   * *Paper hacking session*: at [[https://drive.google.com/drive/folders/1AlHY6QvGCn1U0rMZeie-MyWqdFbbXr9q?usp=sharing][this link]] you can find a bunch of papers for the paper hacking session. In groups of 2, skim through the papers (titles and abstracts) and bid for your preferred ones: please insert at least 5 bids, possibly with ties if you have no real preferences on what to read. Bids can be inserted [[https://docs.google.com/spreadsheets/d/1-DgONoBBoc9lSDpORNGQORreyjJcy6ArLdl4EPZ3eYk/edit?usp=sharing][here]]. Provide your name and insert numbers as in the example at row 2 (1 denotes the highest preference). *Deadline: this Friday at midnight.* If you do not bid by the deadline, I will likely assign you a random paper. More details on Friday during the lesson.
   * Please *register for the first midterm* using *[[http://twiki.di.uniroma1.it/twiki/view/Prenotazioni/2018_04_23_BigDataComputing][this link]]*. Registration is *mandatory*.
   * <font color="#AF0F0F"> *Google group registration:* </font> Please use your institutional email or let me know that you are a student attending the BDC class.
   * <font color="#AF0F0F"> *Course starting date for A.Y. 2017-2018:* </font> *Last-minute update.* Due to extreme weather conditions, [[https://www.uniroma1.it/it/notizia/didattica-sospesa-il-26-febbraio-emergenza-maltempo][all teaching activities at Sapienza are cancelled]] tomorrow. Hence, the course will start on Friday *March 2, 8:30*, Aula 2 Via del Castro Laurenziano 7 (Aule L Ingegneria).
   * *Restricted November session*: the restricted exam session (for students who have failed to graduate within the prescribed time; repeating students; part-time students; working students) will be on *November 3*, *Aula Seminari*, *9:30*. Please register for the exam on Infostud if you are entitled to participate.
   * Students who have passed both parts, sent me the homework, and didn't hear back from me about final grade registration on Infostud: please send me an email with subject "BDC final grade registration" and your names. Thanks.
   * June 28 results: [[https://drive.google.com/file/d/0B1yYvm6QgJReb04tamtlOVRpbnM/view?usp=sharing][part 1]], [[https://drive.google.com/file/d/0B1yYvm6QgJReSzAzdlBQTDJ5MDg/view?usp=sharing][part 2]]
   * [[https://drive.google.com/file/d/0B1yYvm6QgJReM1lHVTRfTV81dGc/view?usp=sharing][First midterm grades]]
<table>
<tr>
<td rowspan="3"> <img src="%ATTACHURLPATH%/midterm.jpg" alt="midterm.jpg" width='136' height='133' /></td>
<td>
   * <font color=#AF0F0F><b>Midterm:</b></font> Tuesday *April 4*, Aula Alfa, <b>10:00 - 14:00</b>
</td>
</tr>
</table>
   * <span class="WYSIWYG_COLOR" style="color: crimson;"> *Homework discussion and exam registration* </span>: *June 28, 9:30, Aula Seminari* (for students who sent me both homeworks and passed both midterms or June 6 final).
   * Please *register for the second midterm* using *[[Prenotazioni.2016_05_24_BigDataComputingMidterm2][this link]]*. You can participate independently of the outcome of the first midterm.
     The program is indeed divided into two parts: the first part covers topics explained in the first half of the course, until April 11 (inclusive); the second part covers all the remaining topics. Once one part is passed (either through a midterm or through a final exam), the score for that part remains valid for the entire academic year.
   * *Final exams, summer term*:
      * On *May 24* we'll have a second midterm (covering topics explained in the second half of the course, from April 12, inclusive, to May 23).
      * Final exam dates: *June 6* (9:30, Aula Alfa) and *June 30* (9:30, Aula Alfa).
      * You must *register* for the final exams on *Infostud*. This is required even if you passed both midterms, so that I can formally register your score on Infostud.
      * Final exams, starting in June, will be divided into two parts: you can do one part in a session and another one in a different session. Once one part is passed, the score for that part remains valid for the entire academic year.
   * Virtual machine with a lightweight Ubuntu distribution (LXLE) containing the latest release of Hadoop: *[[https://drive.google.com/file/d/0B1yYvm6QgJReUlgyc0pBQk1OOGc/view?usp=sharing][download LXLE here]]* (file size: 4GB). In order to use the virtual machine, you need !VirtualBox (available on the Oracle website). <p> </p>
   * [[https://docs.google.com/spreadsheets/d/106mnEY8tZ6eiCH2mCgJ8z95JOwo7djCF7khdcEg6x94/edit?usp=sharing][Current contest results here]]
   * *Installing Hadoop*: we'll shortly introduce !MapReduce and its open source implementation Hadoop. Hadoop installation can pose some issues, depending on your platform. It would therefore be very useful if you could try to install Hadoop 2.2.0 right now, so that you can bring your laptop to class with a working Hadoop version when requested (you can work in pairs). Together with [[http://www.dsi.uniroma1.it/~fusco][Dr. Emanuele Fusco]], we have prepared a short installation guide (there are many tutorials on the Web, but not all of them are correct/updated): [[%ATTACHURL%/installHadoop.pdf][here it is]]. The tutorial should be self-contained. If something goes wrong, just let us know by 1) posting questions to the Google group (if they are of general interest), or 2) sending us an email to arrange a meeting during the office hours. It is best to try the installation soon so that you can get support!
-->

---++ Course description

As data sets grow to Terabyte and Petabyte scales, traditional models and paradigms of sequential computation become obsolete. The course will focus on fundamental algorithmic and programming issues posed by big-data computing, tackling some major data mining problems on a variety of computational models used for managing massive information structures. We will study how algorithm design techniques and technological aspects of modern computing platforms interact and adapt to each other. The emphasis will be on:
   * !MapReduce as a programming model for distributed data mining on large clusters of computers
   * Data streaming techniques for mining huge and rapidly changing streams of data on the fly
   * External memory algorithms for processing data stored on slow secondary memories
<p> </p>
The lectures will follow an experimental and problem-driven approach. The goal for the class is to be broad and to touch upon a variety of techniques, introducing standard practices as well as cutting-edge research topics in this area. Hands-on programming sessions will be held to guide the students on the use of good programming practices and advanced programming frameworks, such as Hadoop. Students will learn the proper settings in which to use each paradigm, the advantages and disadvantages of each model, and how to design and analyze algorithms and write efficient code in different big data settings.
<font color="#AF0F0F"> *Learning outcomes:* </font>
   * Knowledge of big data processing frameworks (part of the Hadoop ecosystem)
   * Knowledge of advanced computational models, focusing on data streaming, !MapReduce-style parallelism, external memory
   * Ability to write efficient code taking into account architectural features of modern computing platforms (including distributed systems)
   * Familiarity with data mining problems and techniques
   * Ability to study advanced research topics in big data systems and algorithmics for massive data
   * Performance analysis skills using back-of-the-envelope calculations, mathematical and experimental tools
<p> </p>

---++ Lectures and readings

Readings, notes, slides, papers, code... are posted *[[BDC/Schedule][here]]* after each lecture. *[[BDC/ScheduleOld][Schedule 2018]]*

There are no required textbooks for this class: many lessons explore cutting-edge topics and there is no unique book covering all of them systematically. Some resources that we will use along the way are:
   * J. Leskovec, A. Rajaraman, and J. Ullman, _[[http://www.mmds.org/][Mining of Massive Datasets]]_. Available online.
   * T. White. _Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale_ (4th edition). O'Reilly Media.
   * C. Demetrescu and I. Finocchi. _[[%ATTACHURL%/SurveyStreaming08-DemetrescuFinocchi.pdf][Algorithms for data streams]]_. In Handbook of Applied Algorithms, John Wiley and Sons, 2008
   * K. Mehlhorn, P. Sanders. _Algorithms and data structures: The basic toolbox_, Springer, 2009. [[http://www.mpi-inf.mpg.de/~mehlhorn/Toolbox.html][Book web site]].
<!-- * R. Bryant, D. O'Hallaron: _Computer Systems: A Programmer's Perspective_, Prentice Hall, 2003. -->
<!-- * J. Bentley. _Programming pearls_, 2/ed, Addison-Wesley, 2000. [[http://netlib.bell-labs.com/cm/cs/pearls/][Book web site]].-->
<!--
-- - ++ Homeworks
The homework for A.Y. 2017-2018 (including a small software project in !MapReduce) will be posted here during the course.
Old homeworks:
   * [[%ATTACHURL%/BigData-hw1-aa2016-2017.pdf][Homework A.Y. 2016-2017]]
   * [[%ATTACHURL%/BigData-hw1-aa2015-2016.pdf][Homework 1 A.Y. 2015-2016]]
   * [[https://drive.google.com/file/d/0B1yYvm6QgJReRlZzVVlXa2xSLXc/view?usp=sharing][Homework 2 A.Y. 2015-2016]]
-->
<!--
   * [[%ATTACHURL%/BigData-hw1-aa2014-2015.pdf][Homework BDC 2015]]
   * [[%ATTACHURL%/BigData-hw3-aa2013-2014.pdf][Homework 3 BDC 2014]]
   * [[%ATTACHURL%/BigData-hw2-aa2013-2014.pdf][Homework 2 BDC 2014]]
   * [[%ATTACHURL%/BigData-hw1-aa2013-2014.pdf][Homework 1 BDC 2014]]
-->
<!--
   * [[%ATTACHURL%/hw2-aa2011-2012.pdf][Mini-homework on data streams 2012]]
   * [[%ATTACHURLPATH%/hw1-aa2011-2012.pdf][Homework 1 2012]]
-->

---++ Grading

<!--
   * <font color="#AF0F0F"> *Homework* </font> assigned during the course, including a small software project in !MapReduce (you can work in *groups*, typically composed of two people. In exceptional cases I could allow groups of 3.) <br />
-->
   * <font color="#AF0F0F"> *Written exam* </font> with both multiple choice and open questions (if time permits, we will schedule two midterms)<p> </p>
   * <font color="#AF0F0F"> *Reading assignment:* </font> read a scientific paper on big data systems, write a report, and (possibly) present the paper in class (you can work in *groups*, typically composed of two people. In exceptional cases I could allow groups of 3.)
<!-- <br />Reading sessions will be organized conference-style: each paper should be read by all students, who should prepare questions for the speakers. -->

---++ Google group

The group will be used for technical discussions, homework assignments, and last-minute messages. Subscribing is strongly recommended!

| *Subscribe to Big Data Computing (Sapienza, Irene Finocchi)* |
| Email: <input type=text name=email> <input type=submit name="sub" value="Subscribe"> |
| [[http://groups.google.com/group/big-data-computing-sapienza-finocchi?hl=en][Visit this group]] |
Topic revision: r114 - 2019-10-05 - IreneFinocchi