Lectures and Readings (Big Data Computing, 2018)

Friday, March 2 - Course introduction and overview. The four V's of big data. Volume: how big is "big"? What can we do with big data? How can we program big data systems? Velocity: what can we compute if we cannot even store the input?

Slides with course introduction
Nice reading: the sexiest job of the 21st century

Friday, March 9 - Processing data on disk: the external memory model (Aggarwal & Vitter): parameters M and B. Fundamental external memory bounds: scan and sort. Data permutation: I/O-analysis of the naive linear-time algorithm, scan&sort algorithm. Sorting data on disk: 2-way mergesort, I/O analysis.

External memory: introductory slides
Notes on scan&sort permutation algorithm, binary mergesort I/O analysis
K. Mehlhorn, P. Sanders. Algorithms and data structures: The basic toolbox, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). On-line version.
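
For quick reference, the scan and sort bounds named in this lecture are usually stated as below. This is the standard textbook formulation rather than a transcription of the class notes; N (the input size) is notation assumed here, while M (internal memory size) and B (block size) are the model parameters listed above.

```latex
\mathrm{scan}(N) = \Theta\!\left(\frac{N}{B}\right),
\qquad
\mathrm{sort}(N) = \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
```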

Monday, March 12 - A key inefficiency of 2-way mergesort. k-way mergesort, with full analysis. Hints for an efficient implementation. The 1TB sorting problem: can we do more than one pass? A back of the envelope calculation.

Notes on k-way mergesort
K. Mehlhorn, P. Sanders. Algorithms and data structures: The basic toolbox, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). On-line version.
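
The back-of-the-envelope calculation for the 1TB sorting problem can be sketched as follows. This is a minimal sketch under stated assumptions, not the calculation from the class notes: it assumes run formation produces sorted runs of size M, that the merge fan-in is M/B - 1 (one block reserved for the output buffer), and the function name `merge_passes` and the sample memory and block sizes are hypothetical.

```python
def merge_passes(n_bytes, mem_bytes, block_bytes, fan_in=None):
    """Number of merge passes needed after the run-formation pass.

    Assumption: run formation yields ceil(N / M) sorted runs of size M.
    Assumption: default fan-in is M/B - 1 (one block kept for output).
    """
    runs = -(-n_bytes // mem_bytes)  # ceiling division
    if fan_in is None:
        fan_in = mem_bytes // block_bytes - 1
    if fan_in < 2:
        raise ValueError("need a fan-in of at least 2 to merge")
    passes = 0
    while runs > 1:
        runs = -(-runs // fan_in)  # each pass merges groups of fan_in runs
        passes += 1
    return passes


if __name__ == "__main__":
    one_tb = 2**40
    mem = 2**33    # hypothetical 8 GB of internal memory
    block = 2**20  # hypothetical 1 MB block size
    print("k-way merge passes:", merge_passes(one_tb, mem, block))
    print("2-way merge passes:", merge_passes(one_tb, mem, block, fan_in=2))
```

With these sample values the k-way variant needs a single merge pass, while 2-way merging needs seven, which illustrates the inefficiency discussed in class.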

Friday, March 16 - Spatial and temporal locality. Case study: matrix multiplication. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication. I/O-optimal recursive implementation.

C. Demetrescu and I. Finocchi: notes on blocked matrix multiplication: iterative algorithm.
Divide-and-conquer algorithm by Vitter & Shriver (only Section 7, matrix multiplication).
You can now answer Question 1 of last year's midterm
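
A blocked iterative matrix multiplication can be sketched as below. It is an illustrative in-memory version using plain Python lists, not the algorithm as written in the notes: the function name `blocked_matmul` and the block-size parameter `b` are assumptions, and in the external memory setting `b` would be chosen so that three b x b blocks fit in internal memory.

```python
def blocked_matmul(A, B, n, b):
    """Multiply two n x n matrices (lists of lists) using b x b blocks."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):            # block row of C
        for jj in range(0, n, b):        # block column of C
            for kk in range(0, n, b):    # block index along the inner dimension
                # Multiply block A[ii.., kk..] by block B[kk.., jj..]
                # and accumulate into block C[ii.., jj..].
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The three outer loops walk over blocks, so each block of A and B is reused for a whole block of C before being replaced, which is the source of the improved locality.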

Monday, March 19 - MapReduce: stable storage, distributed file system, programming model (data as key-value pairs, map/reduce functions, wordcount example).

Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004 paper
MapReduce programming model: slides
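
The wordcount example can be mimicked in a few lines of plain Python. This is a single-process simulation of the programming model only, with no distributed file system or cluster involved; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and do not correspond to any MapReduce or Hadoop API.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word of the document."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for the same word."""
    yield word, sum(counts)


def run_mapreduce(documents):
    """Simulate map, group-by-key (shuffle), and reduce in memory."""
    groups = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    print(run_mapreduce(["a b a", "b c"]))
```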

Friday, March 23 - Hands-on class. Introduction to Hadoop, configuration, context, mapper and reducer classes, basic writable types (Text, IntWritable, LongWritable). Implementing WordCount.

Wordcount implementation, data sets, how-to-run notes
Hadoop: slides
If you want to avoid installing Hadoop on your platform (or want to do that later), you can use the LXLE virtual machine (file size: 4GB). This requires VirtualBox, which is available on the Oracle website.

Monday, March 26 - Running WordCount: main HDFS commands. MapReduce runtime system: distributed execution overview, master, workers, key partitioning, local storage vs. distributed file system, task coordination, failures.

MapReduce runtime system: slides

Friday, April 6 - Good coding practices: examples of simple MapReduce algorithms (matrix transpose, sum of n values). Towards a theoretical cost model. Computing a minimum spanning tree (MST) in MapReduce: high-level description of the algorithm.

A model of computation for MapReduce, by H. Karloff, S. Suri & S. Vassilvitskii, in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, SODA 2010 (Sections 1, 2, 3, only definitions covered in class).
Computational model, examples, MST: slides

Monday, April 9 - MST algorithm: correctness and space analysis. Local space of reduce 1 functions, i.e., expected value of |E_i,j|. Space analysis of round 2: size of subgraph H. When is local space sublinear? Deriving the optimal value for k, sublinearity on c-dense graphs.

MST: notes on space analysis - part 1
MST: notes on space analysis - part 2
A model of computation for MapReduce, by H. Karloff, S. Suri & S. Vassilvitskii, in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, SODA 2010 (Section 5, only definitions and proofs covered in class).

Friday, April 13 - Mining networks: examples and properties of real-world networks, Erdős-Rényi model, locality and clustering coefficient, small-world properties (Milgram's experiment, six degrees of separation), degree distribution and power laws, scale-free networks. Generative network models: Watts & Strogatz model, preferential attachment model by Barabási & Albert.

Slides on real-world networks properties

Monday, April 16 - Triangle counting: naive approach, NodeIterator, NodeIterator++, MapReduce implementations.

Slides on triangle counting
Paper on triangle counting
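
The idea behind NodeIterator++ can be sketched sequentially as below. This is a single-machine illustration under stated assumptions, not the MapReduce implementation from the slides: vertices are assumed to be comparable values (for example integers), ties in degree are broken by vertex identifier, and the function name `count_triangles` is hypothetical.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of edges.

    Each triangle is counted exactly once, at its lowest-ranked vertex,
    where vertices are ranked by (degree, identifier).
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def rank(x):
        return (len(adj[x]), x)

    triangles = 0
    for v in adj:
        # Only look at neighbours that rank higher than the pivot v.
        higher = [u for u in adj[v] if rank(u) > rank(v)]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                triangles += 1
    return triangles
```

Pivoting only on the lowest-ranked vertex of each triangle avoids generating the quadratic number of neighbour pairs at high-degree vertices, which is what plain NodeIterator does.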

Friday, April 20 - Exercises: exam June 8, 2017, exam June 28, 2017, first midterm 2017. Midterm solutions.

Monday, April 23 - First midterm.

Friday, April 27 - Document similarity: set similarity, shingling, Jaccard similarity, minHashing. Analysis of minHashing with one random permutation. Detailed minHashing example. Implementation of minHashing: (1) how to do a single pass over the document matrix; (2) how to get rid of permutations.

Slides used in class.
Mining of Massive Datasets book, sections 3.1, 3.2, 3.3
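
Replacing random permutations with hash functions can be sketched as follows. This is a minimal sketch assuming documents are already represented as non-empty sets of non-negative integer shingle identifiers smaller than the chosen prime; the linear hash family, the constant `PRIME`, and all function names are assumptions made for illustration.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, assumed larger than any shingle id


def make_hash_functions(n, seed=0):
    """Draw n hash functions h(x) = (a*x + b) mod PRIME, stored as (a, b)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(items, hash_functions):
    """Signature = for each hash function, the minimum hash value over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    agree = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return agree / len(sig1)


def jaccard(s1, s2):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(s1 & s2) / len(s1 | s2)
```

Each hash function plays the role of one random permutation of the rows, so the fraction of agreeing positions estimates the Jaccard similarity of the two sets.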

Friday, May 4 - Locality-sensitive hashing: partitioning the signature matrix into bands, definition of candidate pair, intuition behind LSH, two examples, analysis.

Slides: see previous lesson
Mining of Massive Datasets book, section 3.4
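
The analysis of the banding technique can be explored numerically with the small sketch below. It assumes the usual setting in which the signature matrix is split into `bands` bands of `rows` rows each, and two documents become a candidate pair if their signatures agree on all rows of at least one band; the parameter names and the sample values are assumptions.

```python
def candidate_probability(s, rows, bands):
    """Probability that two documents with Jaccard similarity s
    become a candidate pair: 1 - (1 - s^rows)^bands."""
    return 1.0 - (1.0 - s ** rows) ** bands


if __name__ == "__main__":
    # Hypothetical setting: 100 minhash values split into 20 bands of 5 rows.
    for s in (0.2, 0.4, 0.6, 0.8):
        print(s, round(candidate_probability(s, rows=5, bands=20), 4))
```

Plotting this function against s gives the S-shaped curve that separates low-similarity pairs, which are rarely candidates, from high-similarity pairs, which almost always are.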

Monday, May 7 - Algorithms for data streams. Motivations, applications, and theoretical model. Streaming puzzles: 1) missing number; 2) curious George goes fishing.

Class slides
S. Muthukrishnan: survey on data streams: algorithms and applications
C. Demetrescu and I. Finocchi: survey on algorithms for data streams, Handbook of Applied Algorithms: Solving Scientific, Engineering and Practical Problems, 2008.
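
A one-pass solution to the missing number puzzle can be sketched as below. The sketch assumes the classic formulation, in which the stream contains n - 1 distinct integers from 1 to n in arbitrary order; the version discussed in class may differ, and the function name `missing_number` is hypothetical.

```python
def missing_number(stream, n):
    """Return the one value in 1..n that does not appear in the stream.

    Keeps a single counter: start from 1 + 2 + ... + n and subtract
    every item as it arrives, without storing the stream.
    """
    remaining = n * (n + 1) // 2
    for x in stream:
        remaining -= x
    return remaining


if __name__ == "__main__":
    print(missing_number(iter([1, 2, 4, 5]), 5))
```

Only one running value is kept regardless of the stream length, which is the point of the puzzle in the streaming model.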

Friday, May 11 - Another streaming puzzle: pointer and chaser (how to find a duplicate item). Sampling uniformly from infinite streams: reservoir sampling (algorithm and analysis).

See previous class for pointer and chaser slides
Reservoir sampling slides
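
Reservoir sampling can be sketched as follows. This is a minimal sketch of the standard one-pass scheme for keeping a uniform sample of k items from a stream of unknown length, not necessarily the exact variant analysed in class; the function name and the optional `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from the stream.

    If the stream has fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randrange(t)     # uniform integer in [0, t)
            if j < k:                # happens with probability k / t
                reservoir[j] = item  # evict a uniformly chosen slot
    return reservoir
```

The t-th item enters the reservoir with probability k/t, and the analysis shows by induction that after every step each item seen so far is in the sample with the same probability.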

Monday, May 14 - IT Meeting

Friday, May 18 - Triangle counting in (adversarial) data streams: a sampling-based approach. Space lower bound based on a communication complexity reduction.

Triangle counting slides: upper and lower bounds on triangle counting in data streams

Monday, May 21 - Paper hacking session. Discussion of the following papers:

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Apache Spark, 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012)
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Google Cloud Dataflow, 41st International Conference on Very Large Data Bases, VLDB 2015)
BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark (38th IEEE/ACM International Conference on Software Engineering, ICSE 2016)
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics (13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016)

Friday, May 25 -

Monday, May 28 -

Friday, June 1 -

-- Irene Finocchi - May 2018

Topic revision: r2 - 2019-02-26 - IreneFinocchi