---++ Lectures and Readings (Big Data Computing, 2019)

<font color="#AF0F0F"><b>Tuesday, February 26</b></font> - Course introduction and *overview*. The *four V's* of big data. Volume: *how big is "big"?* What can we do with big data? How can we program *big data systems*? Velocity: what can we compute *if we cannot even store the input*?
   * Nice reading: [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1][the sexiest job of the 21st century]]
   * [[https://drive.google.com/file/d/17AQRZ7V0PADkDdBeU9f-tdrci5fQ8foU/view?usp=sharing][Slides with course introduction]]

<font color="#AF0F0F"><b>Friday, March 1</b></font> - Processing data on disk: the *external memory model* (Aggarwal & Vitter): parameters M and B. Fundamental external memory bounds: *scan* and *sort*. *Data permutation*: I/O-analysis of the naive linear-time algorithm, scan&sort algorithm.
   * [[https://drive.google.com/file/d/1S8wNrURhNmbDGXfB8uEGTfgdlUWghlhv/view?usp=sharing][External memory: introductory slides]]
   * Notes on [[https://drive.google.com/file/d/1FXK97OMuyF9sTj7HqpQdf1O_3MAYAZ16/view?usp=sharing][scan&sort permutation algorithm]] (only page 1)

<font color="#AF0F0F"><b>Tuesday, March 5</b></font> - *Sorting data on disk*: 2-way mergesort, I/O analysis. A key inefficiency of 2-way mergesort. *k-way mergesort*, with full analysis. Hints for an efficient implementation. The 1TB sorting problem: can we do more than one pass? A back of the envelope calculation.
   * Notes on [[https://drive.google.com/file/d/1-eh-B-MtSdjA1Ye4zdV-LTDUWUbDz5fB/view?usp=sharing][binary mergesort I/O analysis]]
   * Notes on [[https://drive.google.com/file/d/1hRqnX1R8YFTz8pSy0dj9OGd-LyiBZWqY/view?usp=sharing][k-way mergesort]]
   * K. Mehlhorn, P. Sanders. _Algorithms and data structures: The basic toolbox_, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). [[http://www.mpi-inf.mpg.de/~mehlhorn/Toolbox.html][On-line version]].
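
As a rough aid for the back-of-the-envelope calculation mentioned in the March 5 lecture, the sketch below counts the merge passes of external mergesort in the external memory model. It assumes that run formation produces ceil(N/M) sorted runs and that a k-way merge uses fan-in k = M/B - 1 (one block is reserved for output). The function name =merge_passes= and the sample memory and block sizes are illustrative assumptions, not values taken from the course notes.

```python
def merge_passes(n_bytes, m_bytes, b_bytes, fan_in=None):
    """Count the merge passes needed to sort n_bytes with external mergesort.

    Assumptions (illustrative): run formation yields ceil(N/M) sorted runs;
    unless fan_in is given, k = M/B - 1 runs are merged at once, keeping
    one block of memory for the output buffer.
    """
    runs = -(-n_bytes // m_bytes)      # ceil(N / M) initial sorted runs
    k = fan_in if fan_in is not None else m_bytes // b_bytes - 1
    if k < 2:
        raise ValueError("need a fan-in of at least 2")
    passes = 0
    while runs > 1:
        runs = -(-runs // k)           # each pass merges groups of k runs
        passes += 1
    return passes


TB, GB, MB = 2**40, 2**30, 2**20

# Hypothetical machine: 8 GB of internal memory, 1 MB disk blocks.
print(merge_passes(TB, 8 * GB, MB))            # k-way merge: prints 1
print(merge_passes(TB, 8 * GB, MB, fan_in=2))  # 2-way merge: prints 7
```

Under these assumed sizes, a single k-way merge pass suffices for 1 TB, while 2-way merging needs seven passes over the data. Changing the hypothetical values of M and B shows how the pass count depends on the fan-in.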

<font color="#AF0F0F"><b>Friday, March 8</b></font> - Public transport strike. No class.

<font color="#AF0F0F"><b>Tuesday, March 12</b></font> - <b>MapReduce</b>: stable storage, *distributed file system*, *programming model* (data as key-value pairs, map/reduce functions, wordcount example).
   * Dean & Ghemawat, [[http://research.google.com/archive/mapreduce-osdi04.pdf][MapReduce: Simplified Data Processing on Large Clusters]], OSDI 2004 paper
   * !MapReduce programming model: [[https://drive.google.com/file/d/19CxGTo-M0yWMQ6CFEMel3EPawNFSpYcq/view?usp=sharing][slides]]

<font color="#AF0F0F"><b>Friday, March 15</b></font> - !MapReduce *runtime system*: distributed execution overview, master, workers, key partitioning, local storage vs. distributed file system, task coordination, failures.
   * !MapReduce runtime system: [[https://drive.google.com/file/d/1IBEHupTQH7BuQks5zCaIKpmzAeb3Acoc/view?usp=sharing][slides]]

<font color="#AF0F0F"><b>Tuesday, March 19</b></font> - <font color=green><i><b>Hands-on class</b></i></font>. Introduction to *Hadoop*, *configuration*, *context*, *mapper and reducer classes*, basic *writable* types (Text, !IntWritable, !LongWritable). Implementing <b>WordCount</b>. Running !WordCount: main *HDFS commands*.
   * Hadoop: [[https://drive.google.com/file/d/1IVqYgR3K00uJFAbcZbNr2tC3NocdTRUw/view?usp=sharing][slides]]
   * Wordcount [[https://drive.google.com/file/d/1EjusdJuJUCfhG0Hj8gd_6Tx3_oKLd2Ez/view?usp=sharing][implementation, data sets, how-to-run notes]]
   * If you want to avoid installing Hadoop on your platform (or want to do that later), you can use the [[https://drive.google.com/file/d/0B1yYvm6QgJReUlgyc0pBQk1OOGc/view?usp=sharing][LXLE virtual machine]] (file size: 4GB). This requires !VirtualBox, which is available on the Oracle website.

<font color="#AF0F0F"><b>Friday, March 22</b></font> - *Good coding practices*: examples of simple !MapReduce algorithms (matrix transpose, sum of n values). Towards a *theoretical cost model*.
   * Computational model, examples: [[https://drive.google.com/file/d/1S7llSYNNogZLkjMKi_MxW_0yHA59c3Qf/view?usp=sharing][slides]]
   * [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf][A model of computation for MapReduce]], by H. Karloff, S. Suri & S. Vassilvitskii, Proceedings of the ACM SIAM Symposium on Discrete Algorithms, SODA 2010 (Sections 1, 2, 3, only definitions covered in class).

<font color="#AF0F0F"><b>Tuesday, March 26</b></font> - Computing a *minimum spanning tree* (MST) in !MapReduce: high-level description of the algorithm. *MST algorithm*: *correctness* and *space analysis*. Local space of the round-1 reduce functions, i.e., expected value of _|E<sub>i,j</sub>|_. Space analysis of round 2: size of subgraph _H_. When is local space sublinear? Deriving the optimal value for _k_, sublinearity on c-dense graphs.
   * MST: [[https://drive.google.com/file/d/1YWdSjlMIEEB44dOjfFTsqu3NMv6vicxl/view?usp=sharing][algorithm description and correctness]]
   * MST: [[https://drive.google.com/file/d/1lkPeLUf915ZTdTBMf_L_P3BBYvEMLLnH/view?usp=sharing][notes on space analysis - part 1]]
   * MST: [[https://drive.google.com/file/d/1lkjPkHZt5IjfKu66Bc_ffSqBVOZ_NwYc/view?usp=sharing][notes on space analysis - part 2]]
   * [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf][A model of computation for MapReduce]], by H. Karloff, S. Suri & S. Vassilvitskii, Proceedings of the ACM SIAM Symposium on Discrete Algorithms, 2010 (Section 5, only definitions and proofs covered in class).

<font color="#AF0F0F"><b>Friday, March 29</b></font> - Mining networks: examples and properties of *real-world networks*, *Erdős-Rényi* model, locality and *clustering coefficient*, *small world* properties (Milgram's experiment, six degrees of separation), degree distribution and *power laws*, scale-free networks. Generative network models: *Watts & Strogatz* model, *preferential attachment* model by Barabási & Albert.
   * Slides on [[https://drive.google.com/file/d/1lnuYMBcjuR2EZazgKTQ2QXNbgR0mUpYm/view?usp=sharing][real-world networks properties]]

<font color="#AF0F0F"><b>Tuesday, April 2</b></font> - *Triangle counting*: naive approach, !NodeIterator, <b>!NodeIterator++</b>, !MapReduce implementations.
   * Slides on [[https://drive.google.com/file/d/1JSSt44ue6EM6Hteje3xxFz5mbL4GXu3W/view?usp=sharing][triangle counting]]
   * [[https://theory.stanford.edu/~sergei/papers/www11-triangles.pdf][Paper]] on triangle counting

<font color="#AF0F0F"><b>Friday, April 5</b></font> - Spatial and temporal *locality*. Case study: *matrix multiplication*. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication.
   * C. Demetrescu and I. Finocchi: notes on [[%ATTACHURL%/matmul.pdf][blocked matrix multiplication: iterative algorithm]].

<font color="#AF0F0F"><b>April 9 and 12</b></font> - classes suspended

<font color="#AF0F0F"><b>Tuesday, April 16</b></font> - Exercises
   * [[https://drive.google.com/file/d/11V-EJy51hJKiYlY7H-JR1ycaTuY5Izhs/view?usp=sharing][June 8, 2017]]
   * [[https://drive.google.com/file/d/10h0NOjgopi2D8gk0VKwyepXyz86mhQD0/view?usp=sharing][June 28, 2017]]
   * [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][first midterm 2017]]
   * [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][Midterm solutions]].

<font color="#AF0F0F"><b>Tuesday, April 30</b></font> - *Document similarity*: set similarity, *shingling*, *Jaccard similarity*, *minHashing*. Analysis of minHashing with one random permutation. Detailed !minHashing example. *Implementation of minHashing*: (1) how to do a single pass over the document matrix; (2) how to get rid of permutations.
   * [[https://drive.google.com/file/d/1tMO5IM-86mNurM5M73AFY0h9kiYnOfli/view?usp=sharing][Slides used in class]].
   * [[http://www.mmds.org/][Mining of Massive Datasets]] book, sections 3.1, 3.2, 3.3

<font color="#AF0F0F"><b>Friday, May 3</b></font> - Midterm

<font color="#AF0F0F"><b>Tuesday, May 7</b></font> - *Locality-sensitive hashing*: partitioning the signature matrix into bands, definition of candidate pair, intuition behind LSH, two examples, analysis.
   * Slides: see previous lesson
   * [[http://www.mmds.org/][Mining of Massive Datasets]] book, section 3.4

<font color="#AF0F0F"><b>Friday, May 10</b></font> - Algorithms for *data streams*. Motivations, applications, and theoretical model. Streaming puzzles: 1) missing number; 2) Curious George goes fishing.
   * [[https://drive.google.com/file/d/18CeNRU5FfZxzWDL2QOsnm1BAhXtwYRBG/view?usp=sharing][Class slides]]
   * S. Muthukrishnan: survey on [[https://www.cs.rutgers.edu/~muthu/stream-1-1.ps][data streams: algorithms and applications]]
   * C. Demetrescu and I. Finocchi: survey on [[http://twiki.di.uniroma1.it/pub/BDC/WebHome/SurveyStreaming08-DemetrescuFinocchi.pdf][algorithms for data streams]], Handbook of Applied Algorithms: Solving Scientific, Engineering and Practical Problems, 2008.

<font color="#AF0F0F"><b>Tuesday, May 14</b></font> - Another streaming puzzle: *pointer and chaser* (how to find a duplicate item). Sampling uniformly from infinite streams: *reservoir sampling* (algorithm and analysis).
   * See previous class for pointer and chaser slides
   * [[https://drive.google.com/file/d/1wJ6-GN_yREH6B7UGUdG2nCKBFpcy9_iN/view?usp=sharing][Reservoir sampling slides]]

<font color="#AF0F0F"><b>Friday, May 17</b></font> - *Triangle counting in* (adversarial) *data streams*: a sampling-based approach. Space *lower bound* based on a communication complexity reduction.
   * [[https://drive.google.com/file/d/1TRE1n39yUkoYN9JUVoV5QklMHYfdbeea/view?usp=sharing][Triangle counting slides]]: upper and lower bounds on triangle counting in data streams

<font color="#AF0F0F"><b>Tuesday, May 21</b></font> - Frequent items (aka *heavy hitters*).
*Problem definition*: frequency threshold, approximating the solution (parameter epsilon) and probabilistic guarantees (parameter delta). *Sticky sampling*: algorithm, correctness and expected space.
   * Heavy hitters: [[https://drive.google.com/file/d/1evqBuTNfV8b_Cwxw3Gy5RhZwiMTirmQ3/view?usp=sharing][Class slides]]

<font color="#AF0F0F"><b>Friday, May 24</b></font> - Paper hacking session.

<font color="#AF0F0F"><b>Friday, May 31</b></font> - Paper hacking session.

<font color="#AF0F0F"><b>Monday, June 3</b></font> - Paper hacking session.

<!--
   * [[http://www.americanscientist.org/issues/pub/the-britney-spears-problem/1][The Britney Spears problem: tracking who's hot and who's not presents an algorithmic challenge]], by Brian Hayes. A popular paper on stream algorithmics appeared in volume 96 of the American Scientist, 2008.
   * Original VLDB 2002 paper describing Sticky Sampling: [[%ATTACHURL%/02vldb-freq.pdf][Approximate Frequency Counts over Streaming Data]], by G. Manku and R. Motwani.
-->

<!--
<font color="#AF0F0F"><b>Friday, March 16</b></font> - Spatial and temporal *locality*. Case study: *matrix multiplication*. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication. I/O-optimal recursive implementation.
   * C. Demetrescu and I. Finocchi: notes on [[%ATTACHURL%/matmul.pdf][blocked matrix multiplication: iterative algorithm]].
   * [[http://twiki.di.uniroma1.it/pub/BDC/Schedule/ViS94.sorting_io.pdf][Divide-and-conquer algorithm]] by Vitter & Shriver (only Section 7, matrix multiplication).
* You can now answer Question 1 of [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][last year's midterm]] <font color="#AF0F0F"><b>Friday, April 20</b></font> - Exercises: exam [[https://drive.google.com/file/d/11V-EJy51hJKiYlY7H-JR1ycaTuY5Izhs/view?usp=sharing][June 8, 2017]], exam [[https://drive.google.com/file/d/10h0NOjgopi2D8gk0VKwyepXyz86mhQD0/view?usp=sharing][June 28, 2017]], [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][first midterm 2017]]. [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][Midterm solutions]]. <font color="#AF0F0F"><b>Monday, May 21</b></font> - Paper hacking session. Discussion of the following papers: * Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Apache Spark, 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012) * The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Google Cloud Dataflow, 41st International Conference of Very Large Databases, VLDB 2015) * !BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark (IEEE/ACM 38th IEEE International Conference on Software Engineering, ICSE 2016) * Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics (13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016) --> <!-- <font color="#AF0F0F"><b>Tuesday, May 23</b></font> - Discussion of [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][midterm solutions]] <font color="#AF0F0F"><b>Thursday, May 11</b></font> - *Sketches*. Counting distinct items: *probabilistic counting*. * [[https://drive.google.com/file/d/0B1yYvm6QgJReUnNHLTNmS1lLYzA/view?usp=sharing][Probabilistic counting slides]] (except for slides 9 - 13). 
<font color="#AF0F0F"><b>Thursday, May 4</b></font> - Frequent items (aka *heavy hitters*). *Problem definition*: frequency threshold, approximating the solution (parameter epsilon) and probabilistic guarantees (parameter delta). *Sticky sampling*: algorithm, correctness and expected space. * Heavy hitters: [[https://drive.google.com/file/d/0B1yYvm6QgJReYzUzT1I3ZktoTk0/view?usp=sharing][problem definition and algorithm]], [[https://drive.google.com/file/d/0B1yYvm6QgJReNVJESVB6Wk5lT0U/view?usp=sharing][analysis]] * [[http://www.americanscientist.org/issues/pub/the-britney-spears-problem/1][The Britney Spears problem: tracking who's hot and who's not presents an algorithmic challenge]], by Brian Hayes. A popular paper on stream algorithmics appeared in volume 96 of the American Scientist, 2008. * Original VLDB 2002 paper describing Sticky Sampling: [[%ATTACHURL%/02vldb-freq.pdf][Approximate Frequency Counts over Streaming Data]], by G. Manku and R. Motwani. * On [[http://news.stanford.edu/news/2009/june10/rajeev_motwani-061009.html][Rajeev Motwani]], co-designer of sticky sampling. Stanford Report, 2009. <font color="#AF0F0F"><b>Tuesday, April 25</b></font> - Liberation day: classes suspended <font color="#AF0F0F"><b>From April 13 to April 18</b></font> - Easter holidays: classes suspended <font color="#AF0F0F"><b>Tuesday, April 11</b></font> - No class (we give back two hours to Prof. Massini) ============================================================================================================ <font color="#AF0F0F"><b>Tuesday, April 26</b></font> - <font color=green><i><b>Hands-on class</b></i></font>: stable storage and Elastic !MapReduce on *AWS* (Amazon Web Services). 
* [[https://drive.google.com/file/d/0B1yYvm6QgJReVXF5a3ZPbUp2SjQ/view?usp=sharing][Slides on AWS]] (by Emilio Coppa) * Slides on minHashing implementation: see [[https://drive.google.com/file/d/0B1yYvm6QgJReeUZpTWdtbi1vN0k/view?usp=sharing][April 19 slides]], last part <font color="#AF0F0F"><b>Monday, April 11</b></font> (longer class starting at 10:15) - <font color=green><i><b>Hands-on class</b></i></font> on network mining: computing the *degree distribution in !MapReduce* + output *post-processing* through Unix commands + *visualization* in Excel. *New Hadoop features*: * how to *split input files* (class !TextInputFormat vs. class !KeyValueTextInputFormat) * how to let Hadoop *parse common arguments automatically* (Tool and !ToolRunner) * how to *pass arguments* to mappers and reducers. * Degree calculator: [[https://drive.google.com/file/d/0B1yYvm6QgJReT09teXhFMXVQTXM/view?usp=sharing][Hadoop code]] <font color="#AF0F0F"><b>Tuesday, February 28</b></font> - <font color=green><i><b>Hands-on class</b></i></font>. Processing data on disk: tiny *experiments* on sequential vs random accesses, blocking, spatial locality issues. * [[https://drive.google.com/file/d/0B1yYvm6QgJReWkVGalc2anR3emM/view?usp=sharing][C code used in class]] and experiments vademecum <font color="#AF0F0F"><b>Tuesday, May 19</b></font> - Graduation day: class suspended --> -- [[http://www.dsi.uniroma1.it/~finocchi][Irene Finocchi]] - May 2019
Topic revision: r106 - 2019-06-01
-
IreneFinocchi
Copyright © 2008-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.