**Tuesday, February 26** - Course introduction and **overview**. The **four V's** of big data. Volume: **how big is "big"?** What can we do with big data? How can we program **big data systems**? Velocity: what can we compute **if we cannot even store the input**?

**Friday, March 1** - Processing data on disk: the **external memory model** (Aggarwal & Vitter): parameters M and B. Fundamental external memory bounds: **scan** and **sort**. **Data permutation**: I/O-analysis of the naive linear-time algorithm, scan&sort algorithm.

- External memory: introductory slides
- Notes on scan&sort permutation algorithm (only page 1)

**Tuesday, March 5** - **Sorting data on disk**: 2-way mergesort, I/O analysis. A key inefficiency of 2-way mergesort. **k-way mergesort**, with full analysis. Hints for an efficient implementation. The 1TB sorting problem: can we do more than one pass? A back of the envelope calculation.

- Notes on binary mergesort I/O analysis
- Notes on k-way mergesort
- K. Mehlhorn, P. Sanders.
*Algorithms and data structures: The basic toolbox*, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). On-line version.

**Friday, March 8** - Public transport strike. No class.

**Tuesday, March 12** - **MapReduce**: stable storage, **distributed file system**, **programming model** (data as key-value pairs, map/reduce functions, wordcount example).

- Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004 paper
- MapReduce programming model: slides

**Friday, March 15** - MapReduce **runtime system**: distributed execution overview, master, workers, key partitioning, local storage vs. distributed file system, task coordination, failures.

- MapReduce runtime system: slides

**Tuesday, March 19** - * Hands-on class*. Introduction to

- Hadoop: slides
- Wordcount implementation, data sets, how-to-run notes
- If you want to avoid installing Hadoop on your platform (or want to do that later), you can use the LXLE virtual machine (file size: 4GB). This requires VirtualBox, which is available on the Oracle website.

**Friday, March 22** - **Good coding practices**: examples of simple MapReduce algorithms (matrix transpose, sum of n values). Towards a **theoretical cost model**.

- Computational model, examples: slides
- A model of computation for MapReduce, by H. Karloff, S. Suri & S. Vassilvitskii, appeared in Proceedings of the ACM SIAM Symposium on Discrete Algorithms, SODA 2010 (Sections 1, 2, 3, only definitions covered in class).

**Tuesday, March 26** - Computing a **minimum spanning tree** (MST) in MapReduce: high-level description of the algorithm. **MST algorithm**: **correctness** and **space analysis**. Local space of reduce 1 functions, i.e., expected value of *|E _{i,j}|*. Space analysis of round 2: size of subgraph

- MST: algorithm description and correctness
- MST: notes on space analysis - part 1
- MST: notes on space analysis - part 2
- A model of computation for MapReduce, by H. Karloff, S. Suri & S. Vassilvitskii, Proceedings of the ACM SIAM Symposium on Discrete Algorithms, 2010 (Section 5, only definitions and proofs covered in class).

**Friday, March 29** - Mining networks: examples and properties of **real-world networks**, **Erdős-Rényi** model, locality and **clustering coefficient**, **small world** properties (Milgram's experiment, six degrees of separation), degree distribution and **power laws**, scale free networks. Generative network models: **Watts & Strogatz** model, **preferential attachment** model by Barabasi & Albert.

- Slides on real-world networks properties

**Tuesday, April 2** - **Triangle counting**: naive approach, NodeIterator, **!NodeIterator++**, MapReduce implementations.

- Slides on triangle counting
- Paper on triangle counting

**Friday, April 5** - Spatial and temporal **locality**. Case study: **matrix multiplication**. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication.

- C. Demetrescu and I. Finocchi: notes on blocked matrix multiplication: iterative algorithm.

**April 9 and 12** - classes suspended

**Tuesday, April 16** - Exercises

**Tuesday, April 30** - **Document similarity**: set similarity, **shingling**, **Jaccard similarity**, **minHashing**. Analysis of minHashing with one random permutation. Detailed minHashing example. **Implementation of minHashing**: (1) how to do a single pass over the document matrix; (2) how to get rid of permutations.

- Slides used in class.
- Mining of Massive Datasets book, sections 3.1, 3.2, 3.3

**Friday, May 3** - Midterm

**Tuesday, May 7** - **Locality-sensitive hashing**: partitioning the signature matrix into bands, definition of candidate pair, intuition behind LSH, two examples, analysis.

- Slides: see previous lesson
- Mining of Massive Datasets book, section 3.4

**Friday, May 10** - Algorithms for **data streams**. Motivations, applications, and theoretical model. Streaming puzzles: 1) missing number; 2) curious George goes fishing.

- Class slides
- M. Muthukrishnan: survey on data streams: algorithms and applications
- C. Demetrescu and I. Finocchi: survey on algorithms for data streams, Handbook of Applied Algorithms: Solving Scientific, Engineering and Practical Problems, 2008.

**Tuesday, May 14** - Another streaming puzzle: **pointer and chaser** (how to find a duplicate item). Sampling uniformly from infinite streams: **reservoir sampling** (algorithm and analysis).

- See previous class for pointer and chaser slides
- Reservoir sampling slides

**Friday, May 17** - **Triangle counting in** (adversarial) **data streams**: a sampling-based approach. Space **lower bound** based on a communication complexity reduction.

- Triangle counting slides: upper and lower bounds on triangle counting in data streams

**Tuesday, May 21** - Frequent items (aka **heavy hitters**). **Problem definition**: frequency threshold, approximating the solution (parameter epsilon) and probabilistic guarantees (parameter delta). **Sticky sampling**: algorithm, correctness and expected space.

- Heavy hitters: Class slides

**Friday, May 24** - Paper hacking session.

**Friday, May 31** - Paper hacking session.

**Monday, June 3** - Paper hacking session.

-- Irene Finocchi - May 2019

This topic: BDC > WebHome > Schedule

Topic revision: r106 - 2019-06-01 - IreneFinocchi

Copyright © 2008-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding TWiki? Send feedback

Ideas, requests, problems regarding TWiki? Send feedback