---++Lectures and Readings (Big Data Computing, 2019)

<font color="#AF0F0F"><b>Tuesday, February 26</b></font> - Course introduction and *overview*. The *four V's* of big data. Volume: *how big is "big"?* What can we do with big data? How can we program *big data systems*? Velocity: what can we compute *if we cannot even store the input*?
   * Suggested reading: [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1][Data Scientist: The Sexiest Job of the 21st Century]] (Harvard Business Review, 2012)
   * [[https://drive.google.com/file/d/17AQRZ7V0PADkDdBeU9f-tdrci5fQ8foU/view?usp=sharing][Slides with course introduction]]

<font color="#AF0F0F"><b>Friday, March 1</b></font> - Processing data on disk: the *external memory model* (Aggarwal & Vitter): parameters M and B. Fundamental external memory bounds: *scan* and *sort*. *Data permutation*: I/O analysis of the naive linear-time algorithm and of the scan&sort algorithm.
   * [[https://drive.google.com/file/d/1S8wNrURhNmbDGXfB8uEGTfgdlUWghlhv/view?usp=sharing][External memory: introductory slides]]
   * Notes on [[https://drive.google.com/file/d/1FXK97OMuyF9sTj7HqpQdf1O_3MAYAZ16/view?usp=sharing][scan&sort permutation algorithm]] (only page 1)
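
The scan and sort bounds can be evaluated numerically with a small sketch like the one below. It assumes the usual textbook formulas (scan(N) = ceil(N/B) and sort(N) = (N/B) log<sub>M/B</sub>(N/B), constants dropped); the function names are illustrative and do not come from the course material.

```python
def ceil_div(x, y):
    """Integer ceiling of x / y."""
    return -(-x // y)


def ceil_log(x, base):
    """Smallest integer p >= 0 such that base**p >= x (integer arithmetic only)."""
    p, power = 0, 1
    while power < x:
        power *= base
        p += 1
    return p


def scan_ios(n, b):
    """I/Os needed to read n contiguous items with block size b."""
    return ceil_div(n, b)


def sort_ios(n, m, b):
    """Approximate sorting bound: (n/b) * log_{m/b}(n/b), constant factors dropped."""
    if m < 2 * b:
        raise ValueError("internal memory must hold at least two blocks")
    blocks = ceil_div(n, b)
    return blocks * max(1, ceil_log(blocks, m // b))


def permute_ios(n, m, b):
    """Permutation: the better of the naive algorithm (about one I/O per item)
    and the scan&sort approach."""
    return min(n, sort_ios(n, m, b))
```

For example, with N = 2^30 items, M = 2^20 and B = 2^10, the sketch gives 2^20 I/Os for a scan and 2^21 for sorting, far fewer than the 2^30 I/Os the naive permutation may need.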

<font color="#AF0F0F"><b>Tuesday, March 5</b></font> - *Sorting data on disk*: 2-way mergesort, I/O analysis. A key inefficiency of 2-way mergesort. *k-way mergesort*, with full analysis. Hints for an efficient implementation. The 1TB sorting problem: can we do more than one pass? A back-of-the-envelope calculation.
   * Notes on [[https://drive.google.com/file/d/1-eh-B-MtSdjA1Ye4zdV-LTDUWUbDz5fB/view?usp=sharing][binary mergesort I/O analysis]]
   * Notes on [[https://drive.google.com/file/d/1hRqnX1R8YFTz8pSy0dj9OGd-LyiBZWqY/view?usp=sharing][k-way mergesort]]
   * K. Mehlhorn, P. Sanders. _Algorithms and data structures: The basic toolbox_, 2009,  Chapter 5 (Sorting and selection), Section 7 (without sample sort). [[http://www.mpi-inf.mpg.de/~mehlhorn/Toolbox.html][On-line version]]. 
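
As a rough illustration of k-way mergesort and of the pass count behind the 1TB calculation, here is an in-memory sketch. The memory and block sizes used in the usage note are assumptions chosen for the example, not the figures used in class, and the fan-in k = M/B - 1 (one block reserved for output) is one common convention.

```python
import heapq


def kway_mergesort(items, m, k):
    """Form sorted runs of at most m items, then merge k runs at a time.

    Returns (sorted_list, number_of_merge_passes)."""
    runs = [sorted(items[i:i + m]) for i in range(0, len(items), m)]
    passes = 0
    while len(runs) > 1:
        # One pass over the data: every group of k runs becomes a single run.
        runs = [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]
        passes += 1
    return (runs[0] if runs else []), passes


def merge_passes(n, m, b):
    """Merge passes needed for n items, memory m and block size b,
    assuming fan-in k = m/b - 1."""
    runs = -(-n // m)          # number of initial sorted runs
    k = m // b - 1
    if k < 2:
        raise ValueError("memory must hold at least three blocks")
    passes = 0
    while runs > 1:
        runs = -(-runs // k)
        passes += 1
    return passes
```

With 1 TiB of data and 1 MiB blocks, =merge_passes(2**40, 2**33, 2**20)= (8 GiB of memory) yields one merge pass, while =merge_passes(2**40, 2**30, 2**20)= (1 GiB) yields two, because 1024 runs just exceed the fan-in of 1023.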

<font color="#AF0F0F"><b>Friday, March 8</b></font> - Public transport strike. No class.

<font color="#AF0F0F"><b>Tuesday, March 12</b></font> - <b>MapReduce</b>: stable storage, *distributed file system*, *programming model* (data as key-value pairs, map/reduce functions, wordcount example).
   * Dean & Ghemawat, [[http://research.google.com/archive/mapreduce-osdi04.pdf][MapReduce: Simplified Data Processing on Large Clusters]], OSDI 2004 paper 
   * !MapReduce programming model: [[https://drive.google.com/file/d/19CxGTo-M0yWMQ6CFEMel3EPawNFSpYcq/view?usp=sharing][slides]]
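
The programming model can be mimicked on a single machine with a few lines of Python. This is only a toy simulation of the map, shuffle and reduce steps for the wordcount example; =run_mapreduce= is a hypothetical helper, not part of any real framework.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in the input line."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts received for the same word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Apply the mapper to every (key, value) record, group the intermediate
    pairs by key (the shuffle), then apply the reducer to each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output
```

Calling =run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)= returns =[("a", 2), ("b", 2), ("c", 1)]=.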

<font color="#AF0F0F"><b>Friday, March 15</b></font> - !MapReduce *runtime system*: distributed execution overview, master, workers, key partitioning, local storage vs. distributed file system, task coordination, failures.  
   * !MapReduce runtime system: [[https://drive.google.com/file/d/1IBEHupTQH7BuQks5zCaIKpmzAeb3Acoc/view?usp=sharing][slides]]
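
Key partitioning can be sketched as follows, assuming the default scheme described in the Dean & Ghemawat paper (hash(key) mod R, with R reduce tasks). CRC32 is used here only because it is stable across Python runs; the choice of hash function is an assumption of this example.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Reduce task responsible for `key`: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_by_reducer(pairs, num_reducers):
    """Group intermediate (key, value) pairs by destination reduce task."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return dict(buckets)
```

All pairs sharing a key land in the same bucket, which is what lets a single reduce task see every value for that key.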

<font color="#AF0F0F"><b>Tuesday, March 19</b></font> - <font color=green><i><b>Hands-on class</b></i></font>. Introduction to *Hadoop*, *configuration*, *context*, *mapper and reducer classes*, basic *writable* types (Text, !IntWritable, !LongWritable). Implementing <b>WordCount</b>.  Running !WordCount: main *HDFS commands*. 
   * Hadoop: [[https://drive.google.com/file/d/1IVqYgR3K00uJFAbcZbNr2tC3NocdTRUw/view?usp=sharing][slides]]
   * Wordcount [[https://drive.google.com/file/d/1EjusdJuJUCfhG0Hj8gd_6Tx3_oKLd2Ez/view?usp=sharing][implementation, data sets, how-to-run notes]]
   * If you want to avoid installing Hadoop on your platform (or want to do that later), you can use the [[https://drive.google.com/file/d/0B1yYvm6QgJReUlgyc0pBQk1OOGc/view?usp=sharing][LXLE virtual machine]] (file size: 4GB). This requires !VirtualBox, which is available on the Oracle website. 

<font color="#AF0F0F"><b>Friday, March 22</b></font> - *Good coding practices*: examples of simple !MapReduce algorithms (matrix transpose, sum of n values). Towards a *theoretical cost model*.   
   * Computational model, examples: [[https://drive.google.com/file/d/1S7llSYNNogZLkjMKi_MxW_0yHA59c3Qf/view?usp=sharing][slides]]
   * [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf][A model of computation for MapReduce]], by H. Karloff, S. Suri & S. Vassilvitskii, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, SODA 2010 (Sections 1, 2, 3, only definitions covered in class).
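
One possible way to compute the sum of n values when each reducer can hold only a bounded number of items is to aggregate in rounds. The sketch below is an assumption about how such an algorithm might look, not necessarily the one on the slides; the parameter =s= (maximum items per reducer) is introduced here for illustration.

```python
def mr_sum(values, s):
    """Sum `values` in rounds: in each round every reducer receives at most s
    items and emits their partial sum. Returns (total, number_of_rounds)."""
    if s < 2:
        raise ValueError("each reducer must hold at least two values")
    current = list(values)
    rounds = 0
    while len(current) > 1:
        # One round: consecutive groups of s values go to the same reducer.
        current = [sum(current[i:i + s]) for i in range(0, len(current), s)]
        rounds += 1
    return (current[0] if current else 0), rounds
```

With 1000 values and =s = 10= the computation takes three rounds (1000, 100, 10, 1 values), which matches the ceil(log<sub>s</sub> n) round count one would expect from a cost model that bounds local space.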

<font color="#AF0F0F"><b>Tuesday, March 26</b></font> - Computing a *minimum spanning tree* (MST) in !MapReduce: high-level description of the algorithm. *MST algorithm*: *correctness* and *space analysis*. Local space of the round-1 reduce functions, i.e., expected value of _|E<sub>i,j</sub>|_. Space analysis of round 2: size of subgraph _H_. When is local space sublinear? Deriving the optimal value for _k_, sublinearity on c-dense graphs. 
   * MST: [[https://drive.google.com/file/d/1YWdSjlMIEEB44dOjfFTsqu3NMv6vicxl/view?usp=sharing][algorithm description and correctness]] 
   * MST: [[https://drive.google.com/file/d/1lkPeLUf915ZTdTBMf_L_P3BBYvEMLLnH/view?usp=sharing][notes on space analysis - part 1]] 
   * MST: [[https://drive.google.com/file/d/1lkjPkHZt5IjfKu66Bc_ffSqBVOZ_NwYc/view?usp=sharing][notes on space analysis - part 2]]
   * [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf][A model of computation for MapReduce]], by H. Karloff, S. Suri & S. Vassilvitskii, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, SODA 2010 (Section 5, only definitions and proofs covered in class).
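
A sequential sketch of the two-round MST idea is given below, under the assumption that the algorithm follows the scheme in Section 5 of the paper: vertices are randomly split into _k_ parts, round 1 computes a minimum spanning forest of each subgraph induced by a pair of parts, and round 2 computes the MST of the union _H_ of those forests. The names =kruskal= and =mr_mst= are illustrative.

```python
import random
from itertools import combinations


def kruskal(vertices, edges):
    """Minimum spanning forest (Kruskal with union-find). Edges are (w, u, v)."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def mr_mst(vertices, edges, k, seed=0):
    """Simulate the two rounds sequentially and return the MST edges."""
    rng = random.Random(seed)
    part = {v: rng.randrange(k) for v in vertices}
    h = set()
    # With a single part there are no pairs, so use the part itself.
    pairs = list(combinations(range(k), 2)) or [(0, 0)]
    for i, j in pairs:  # round 1: one reducer per pair (i, j)
        vij = [v for v in vertices if part[v] in (i, j)]
        eij = [(w, u, v) for w, u, v in edges
               if part[u] in (i, j) and part[v] in (i, j)]
        h.update(kruskal(vij, eij))
    return kruskal(vertices, h)  # round 2: a single reducer on H
```

An edge discarded in round 1 is a heaviest edge on some cycle of its subgraph, so dropping it cannot increase the weight of the final tree; this is the cycle-property argument behind the correctness proof.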

<font color="#AF0F0F"><b>Friday, March 29</b></font> - Mining networks: examples and properties of *real-world networks*, *Erd&#337;s-Rényi* model, locality and *clustering coefficient*, *small world* properties (Milgram's experiment, six degrees of separation), degree distribution and *power laws*, scale-free networks. Generative network models: *Watts & Strogatz* model, *preferential attachment* model by Barabási & Albert.
   * Slides on [[https://drive.google.com/file/d/1lnuYMBcjuR2EZazgKTQ2QXNbgR0mUpYm/view?usp=sharing][properties of real-world networks]]
 
<font color="#AF0F0F"><b>Tuesday, April 2</b></font> - *Triangle counting*: naive approach, !NodeIterator, <b>!NodeIterator++</b>, !MapReduce implementations.
   * Slides on [[https://drive.google.com/file/d/1JSSt44ue6EM6Hteje3xxFz5mbL4GXu3W/view?usp=sharing][triangle counting]]
   * [[https://theory.stanford.edu/~sergei/papers/www11-triangles.pdf][Paper]] on triangle counting
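
The two sequential algorithms can be sketched as follows, assuming a simple undirected graph given as a list of edges. In this sketch !NodeIterator sees every triangle once from each of its three vertices, while !NodeIterator++ orders vertices by (degree, id) and counts a triangle only at its lowest-ranked vertex.

```python
from itertools import combinations


def build_adj(edges):
    """Adjacency sets of an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_iterator(edges):
    """For every vertex, test all pairs of neighbours for adjacency."""
    adj = build_adj(edges)
    found = 0
    for v in adj:
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:
                found += 1
    return found // 3  # each triangle is found once per vertex


def node_iterator_pp(edges):
    """Only pair up neighbours that rank higher than the pivot vertex."""
    adj = build_adj(edges)

    def rank(x):
        return (len(adj[x]), x)

    found = 0
    for v in adj:
        higher = [u for u in adj[v] if rank(u) > rank(v)]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                found += 1
    return found  # each triangle is found exactly once
```

The ordering keeps high-degree vertices from enumerating a quadratic number of neighbour pairs, which is the main source of imbalance in the naive approach.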
 
<font color="#AF0F0F"><b>Friday, April 5</b></font> - Spatial and temporal *locality*. Case study: *matrix multiplication*. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication.
   * C. Demetrescu and I. Finocchi: notes on [[%ATTACHURL%/matmul.pdf][blocked matrix multiplication: iterative algorithm]].
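
A blocked iterative multiplication of two n x n matrices might look like the sketch below. The block size =bs= is a tuning parameter assumed here; in the external memory setting it would be chosen so that three bs x bs blocks fit in fast memory.

```python
def blocked_matmul(a, b, n, bs):
    """Multiply two n x n matrices (lists of lists), one bs x bs block at a time."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                # Multiply block A[ii.., kk..] by block B[kk.., jj..]
                # and accumulate the result into block C[ii.., jj..].
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The arithmetic is the same as in the standard triple loop; only the order of the operations changes, so that each loaded block is reused many times before it is evicted.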

<font color="#AF0F0F"><b>April 9 and 12</b></font> - classes suspended
 
<font color="#AF0F0F"><b>Tuesday, April 16</b></font> - Exercises
   * [[https://drive.google.com/file/d/11V-EJy51hJKiYlY7H-JR1ycaTuY5Izhs/view?usp=sharing][June 8, 2017]]
   * [[https://drive.google.com/file/d/10h0NOjgopi2D8gk0VKwyepXyz86mhQD0/view?usp=sharing][June 28, 2017]]
   * [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][first midterm 2017]]
   * [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][Midterm solutions]]. 


<font color="#AF0F0F"><b>Tuesday, April 30</b></font> - *Document similarity*: set similarity, *shingling*, *Jaccard similarity*, *minHashing*. Analysis of minHashing with one random permutation. Detailed minHashing example. *Implementation of minHashing*: (1) how to do a single pass over the document matrix; (2) how to get rid of permutations. 
   * [[https://drive.google.com/file/d/1tMO5IM-86mNurM5M73AFY0h9kiYnOfli/view?usp=sharing][Slides used in class]].
   * [[http://www.mmds.org/][Mining of Massive Datasets]] book, sections 3.1, 3.2, 3.3
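
The pipeline from shingles to minHash signatures can be sketched as below. Random permutations are replaced by hash functions of the form (a*x + b) mod p, as suggested in the book; the specific prime, the use of CRC32 to turn shingles into integers, and all function names are assumptions of this example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, larger than any CRC32 value


def shingles(text, k=3):
    """Set of all character k-shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hashes(n, seed=0):
    """n random (a, b) pairs defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def signature(shingle_set, hashes):
    """minHash signature: for each hash function, the minimum hash value
    over the set, computed in a single pass over the shingles."""
    sig = [PRIME] * len(hashes)
    for sh in shingle_set:
        x = zlib.crc32(sh.encode("utf-8"))
        for i, (a, b) in enumerate(hashes):
            h = (a * x + b) % PRIME
            if h < sig[i]:
                sig[i] = h
    return sig


def estimate(sig1, sig2):
    """Fraction of signature positions on which the two documents agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

The fraction returned by =estimate= approximates the Jaccard similarity of the two shingle sets, with accuracy improving as the number of hash functions grows.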

<font color="#AF0F0F"><b>Friday, May 3</b></font> - Midterm

<font color="#AF0F0F"><b>Tuesday, May 7</b></font> - *Locality-sensitive hashing*: partitioning the signature matrix into bands, definition of candidate pair, intuition behind LSH, two examples, analysis.
   * Slides: see previous lesson
   * [[http://www.mmds.org/][Mining of Massive Datasets]] book, section 3.4
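
The banding technique and its analysis can be illustrated with the following sketch, where a signature is split into =b= bands of =r= rows each and two documents become a candidate pair if they agree on all rows of at least one band. The helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_probability(s, r, b):
    """Probability that two documents with similarity s become a candidate
    pair when using b bands of r rows: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b


def candidate_pairs(signatures, r, b):
    """signatures: dict mapping a document name to a signature of length r*b.
    Returns the set of pairs that share a bucket in at least one band."""
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(name)
        for names in buckets.values():
            pairs.update(combinations(sorted(names), 2))
    return pairs
```

With =r = 5= and =b = 20=, a pair with similarity 0.8 becomes a candidate with probability about 0.9996, while a pair with similarity 0.2 does so with probability about 0.006, which shows the S-curve behaviour behind LSH.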

<font color="#AF0F0F"><b>Friday, May 10</b></font> - Algorithms for *data streams*. Motivations, applications, and theoretical model. Streaming puzzles: 1) missing number; 2) Curious George goes fishing.
   * [[https://drive.google.com/file/d/18CeNRU5FfZxzWDL2QOsnm1BAhXtwYRBG/view?usp=sharing][Class slides]]
   * S. Muthukrishnan: survey on [[https://www.cs.rutgers.edu/~muthu/stream-1-1.ps][data streams: algorithms and applications]]
   * C. Demetrescu and I. Finocchi: survey on [[http://twiki.di.uniroma1.it/pub/BDC/WebHome/SurveyStreaming08-DemetrescuFinocchi.pdf][algorithms for data streams]], Handbook of Applied Algorithms: Solving Scientific, Engineering and Practical Problems, 2008.
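
For the missing number puzzle, a minimal sketch is shown below, assuming the common formulation in which the stream contains all integers from 1 to n except one, in arbitrary order. A single counter suffices, so the working space is O(log n) bits.

```python
def missing_number(stream, n):
    """Return the one integer in 1..n that does not appear in the stream."""
    remaining = n * (n + 1) // 2  # sum of 1..n
    for x in stream:
        remaining -= x            # one pass, one counter
    return remaining
```

For instance, =missing_number(iter([1, 2, 4, 5]), 5)= returns 3.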

<font color="#AF0F0F"><b>Tuesday, May 14</b></font> - Another streaming puzzle: *pointer and chaser* (how to find a duplicate item). Sampling uniformly from infinite streams: *reservoir sampling* (algorithm and analysis). 
   * See previous class for pointer and chaser slides
   * [[https://drive.google.com/file/d/1wJ6-GN_yREH6B7UGUdG2nCKBFpcy9_iN/view?usp=sharing][Reservoir sampling slides]]
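
Reservoir sampling can be sketched as follows for a sample of =k= items. This is the classical formulation in which the i-th item replaces a uniformly chosen slot of the reservoir with probability k/i; the optional =rng= parameter is added here only to make runs reproducible.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform sample of k items from a stream of unknown length, in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # the first k items fill the reservoir
        else:
            j = rng.randrange(i)         # uniform in 0..i-1
            if j < k:                    # happens with probability k/i
                reservoir[j] = item
    return reservoir
```

At any point in the stream, each item seen so far is in the reservoir with the same probability, so the algorithm never needs to know the stream length in advance.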

<font color="#AF0F0F"><b>Friday, May 17</b></font> - *Triangle counting in* (adversarial) *data streams*: a sampling-based approach. Space *lower bound* based on a communication complexity reduction. 
   * [[https://drive.google.com/file/d/1TRE1n39yUkoYN9JUVoV5QklMHYfdbeea/view?usp=sharing][Triangle counting slides]]: upper and lower bounds on triangle counting in data streams

<font color="#AF0F0F"><b>Tuesday, May 21</b></font> - Frequent items (aka *heavy hitters*). *Problem definition*: frequency threshold, approximating the solution (parameter epsilon) and probabilistic guarantees (parameter delta). *Sticky sampling*: algorithm, correctness and expected space.
   * Heavy hitters: [[https://drive.google.com/file/d/1evqBuTNfV8b_Cwxw3Gy5RhZwiMTirmQ3/view?usp=sharing][Class slides]]
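
The sketch below follows one common formulation of sticky sampling (Manku & Motwani), and is an assumption about the details rather than a transcription of the slides: with t = (1/epsilon) ln(1/(s*delta)), the first 2t items are sampled at rate 1, the next 2t at rate 2, the next 4t at rate 4, and so on; whenever the rate doubles, every counter is decremented once per unsuccessful fair coin toss.

```python
import math
import random


def sticky_sampling(stream, s, eps, delta, rng=None):
    """Return the items whose estimated frequency is at least (s - eps) * N."""
    rng = rng or random.Random()
    t = math.ceil(math.log(1.0 / (s * delta)) / eps)
    counts = {}
    rate = 1
    window_end = 2 * t  # last stream position handled at the current rate
    n = 0
    for item in stream:
        n += 1
        if n > window_end:
            rate *= 2
            window_end += rate * t
            # Adjust existing counters as if they had been sampled at the new rate.
            for key in list(counts):
                while counts[key] > 0 and rng.random() < 0.5:
                    counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
        if item in counts:
            counts[item] += 1            # tracked items are always counted
        elif rng.random() < 1.0 / rate:  # new items are sampled with prob. 1/rate
            counts[item] = 1
    return {x for x, f in counts.items() if f >= (s - eps) * n}
```

The number of counters depends on t and not on the stream length, which is the source of the expected space bound.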

<font color="#AF0F0F"><b>Friday, May 24</b></font> - Paper hacking session. 

<font color="#AF0F0F"><b>Friday, May 31</b></font> - Paper hacking session. 

<font color="#AF0F0F"><b>Monday, June 3</b></font> - Paper hacking session. 


<!--
* [[http://www.americanscientist.org/issues/pub/the-britney-spears-problem/1][The Britney Spears problem: tracking who's hot and who's not presents an algorithmic challenge]], by Brian Hayes. A popular paper on stream algorithmics appeared in volume 96 of the American Scientist, 2008.   
   * Original VLDB 2002 paper describing Sticky Sampling: [[%ATTACHURL%/02vldb-freq.pdf][Approximate Frequency Counts over Streaming Data]], by G. Manku and R. Motwani.
-->

<!--

<font color="#AF0F0F"><b>Friday, March 16</b></font> - Spatial and temporal *locality*. Case study: *matrix multiplication*. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication. I/O-optimal recursive implementation.
   * C. Demetrescu and I. Finocchi: notes on [[%ATTACHURL%/matmul.pdf][blocked matrix multiplication: iterative algorithm]].
   * [[http://twiki.di.uniroma1.it/pub/BDC/Schedule/ViS94.sorting_io.pdf][Divide-and-conquer algorithm]] by Vitter & Shriver (only Section 7, matrix multiplication).
   * You can now answer Question 1 of [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][last year's midterm]]

<font color="#AF0F0F"><b>Friday, April 20</b></font> - Exercises: exam [[https://drive.google.com/file/d/11V-EJy51hJKiYlY7H-JR1ycaTuY5Izhs/view?usp=sharing][June 8, 2017]], exam [[https://drive.google.com/file/d/10h0NOjgopi2D8gk0VKwyepXyz86mhQD0/view?usp=sharing][June 28, 2017]], [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][first midterm 2017]]. [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][Midterm solutions]].

<font color="#AF0F0F"><b>Monday, May 21</b></font> - Paper hacking session. Discussion of the following papers:
   * Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Apache Spark, 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012)
   * The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Google Cloud Dataflow,  41st International Conference of Very Large Databases, VLDB 2015)
   * !BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark (IEEE/ACM 38th IEEE International Conference on Software Engineering, ICSE 2016)
   * Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics (13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016)

-->

<!--

<font color="#AF0F0F"><b>Tuesday, May 23</b></font> - Discussion of [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][midterm solutions]]

<font color="#AF0F0F"><b>Thursday, May 11</b></font> - *Sketches*. Counting distinct items: *probabilistic counting*.
   * [[https://drive.google.com/file/d/0B1yYvm6QgJReUnNHLTNmS1lLYzA/view?usp=sharing][Probabilistic counting slides]] (except for slides 9 - 13).

<font color="#AF0F0F"><b>Thursday, May 4</b></font> - Frequent items (aka *heavy hitters*). *Problem definition*: frequency threshold, approximating the solution (parameter epsilon) and probabilistic guarantees (parameter delta). *Sticky sampling*: algorithm, correctness and expected space.
   * Heavy hitters: [[https://drive.google.com/file/d/0B1yYvm6QgJReYzUzT1I3ZktoTk0/view?usp=sharing][problem definition and algorithm]],
 [[https://drive.google.com/file/d/0B1yYvm6QgJReNVJESVB6Wk5lT0U/view?usp=sharing][analysis]]
   * [[http://www.americanscientist.org/issues/pub/the-britney-spears-problem/1][The Britney Spears problem: tracking who's hot and who's not presents an algorithmic challenge]], by Brian Hayes. A popular paper on stream algorithmics appeared in volume 96 of the American Scientist, 2008.   
   * Original VLDB 2002 paper describing Sticky Sampling: [[%ATTACHURL%/02vldb-freq.pdf][Approximate Frequency Counts over Streaming Data]], by G. Manku and R. Motwani.
   * On [[http://news.stanford.edu/news/2009/june10/rajeev_motwani-061009.html][Rajeev Motwani]], co-designer of sticky sampling. Stanford Report, 2009.

<font color="#AF0F0F"><b>Tuesday, April 25</b></font> - Liberation day: classes suspended

<font color="#AF0F0F"><b>From April 13 to April 18</b></font> - Easter holidays: classes suspended

<font color="#AF0F0F"><b>Tuesday, April 11</b></font> - No class (we give back two hours to Prof. Massini)


============================================================================================================

<font color="#AF0F0F"><b>Tuesday, April 26</b></font> - <font color=green><i><b>Hands-on class</b></i></font>: stable storage and Elastic !MapReduce on *AWS* (Amazon Web Services). 
   * [[https://drive.google.com/file/d/0B1yYvm6QgJReVXF5a3ZPbUp2SjQ/view?usp=sharing][Slides on AWS]] (by Emilio Coppa)
   * Slides on minHashing implementation: see [[https://drive.google.com/file/d/0B1yYvm6QgJReeUZpTWdtbi1vN0k/view?usp=sharing][April 19 slides]], last part

<font color="#AF0F0F"><b>Monday, April 11</b></font> (longer class starting at 10:15) - <font color=green><i><b>Hands-on class</b></i></font> on network mining: computing the *degree distribution in !MapReduce* + output *post-processing* through Unix commands + *visualization* in Excel. *New Hadoop features*: 
   * how to *split input files* (class !TextInputFormat vs. class !KeyValueTextInputFormat)
   * how to let Hadoop *parse common arguments automatically* (Tool and !ToolRunner)
   * how to *pass arguments* to mappers and reducers. 
   * Degree calculator: [[https://drive.google.com/file/d/0B1yYvm6QgJReT09teXhFMXVQTXM/view?usp=sharing][Hadoop code]]

<font color="#AF0F0F"><b>Tuesday, February 28</b></font> - <font color=green><i><b>Hands-on class</b></i></font>. Processing data on disk: tiny *experiments* on sequential vs random accesses, blocking, spatial locality issues.  
   * [[https://drive.google.com/file/d/0B1yYvm6QgJReWkVGalc2anR3emM/view?usp=sharing][C code used in class]] and experiments vademecum

<font color="#AF0F0F"><b>Tuesday, May 19</b></font> - Graduation day: class suspended

-->


-- [[http://www.dsi.uniroma1.it/~finocchi][Irene Finocchi]] - May 2019