Lectures and Readings (Big Data Computing, 2018)
Friday, March 2 - Course introduction and
overview. The
four V's of big data. Volume:
how big is "big"? What can we do with big data? How can we program
big data systems? Velocity: what can we compute
if we cannot even store the input?
Friday, March 9 - Processing data on disk: the
external memory model (Aggarwal & Vitter): parameters M and B. Fundamental external memory bounds:
scan and
sort.
Data permutation: I/O-analysis of the naive linear-time algorithm, scan&sort algorithm.
Sorting data on disk: 2-way mergesort, I/O analysis.
Monday, March 12 - A key inefficiency of 2-way mergesort.
k-way mergesort, with full analysis. Hints for an efficient implementation. The 1TB sorting problem: can we do more than one pass? A back of the envelope calculation.
- Notes on k-way mergesort
- K. Mehlhorn, P. Sanders. Algorithms and data structures: The basic toolbox, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). On-line version
.
Friday, March 16 - Spatial and temporal
locality. Case study:
matrix multiplication. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication. I/O-optimal recursive implementation.
Monday, March 19 -
MapReduce: stable storage,
distributed file system,
programming model (data as key-value pairs, map/reduce functions, wordcount example).
Friday, March 23 -
Hands-on class. Introduction to
Hadoop,
configuration,
context,
mapper and reducer classes, basic
writable types (Text, IntWritable, LongWritable). Implementing
WordCount.
Monday, March 26 - Running WordCount: main
HDFS commands. MapReduce
runtime system: distributed execution overview, master, workers, key partitioning, local storage vs. distributed file system, task coordination, failures.
- MapReduce runtime system: slides
Friday, April 6 -
Good coding practices: examples of simple MapReduce algorithms (matrix transpose, sum of n values). Towards a
theoretical cost model. Computing a
minimum spanning tree (MST) in MapReduce: high-level description of the algorithm.
- A model of computation for MapReduce
, by H. Karloff, S. Suri & S. Vassilvitskii, appeared in Proceedings of the ACM SIAM Symposium on Discrete Algorithms, SODA 2010 (Sections 1, 2, 3, only definitions covered in class).
- Computational model, examples, MST: slides
Monday, April 9 -
MST algorithm:
correctness and
space analysis. Local space of reduce 1 functions, i.e., expected value of
|Ei,j|. Space analysis of round 2: size of subgraph
H. When is local space sublinear? Deriving the optimal value for
k, sublinearity on c-dense graphs.
Friday, April 13 - Mining networks: examples and properties of
real-world networks,
Erdős-Rényi model, locality and
clustering coefficient,
small world properties (Milgram's experiment, six degrees of separation), degree distribution and
power laws, scale free networks. Generative network models:
Watts & Strogatz model,
preferential attachment model by Barabasi & Albert.
Monday, April 16 -
Triangle counting: naive approach, NodeIterator,
!NodeIterator++, MapReduce implementations.
Friday, April 20 - Exercises: exam
June 8, 2017
, exam
June 28, 2017
,
first midterm 2017
.
Midterm solutions
.
Monday, April 23 - First midterm.
Friday, April 27 -
Document similarity: set similarity,
shingling,
Jaccard similarity,
minHashing. Analysis of minHashing with one random permutation. Detailed minHashing example.
Implementation of minHashing: (1) how to do a single pass over the document matrix; (2) how to get rid of permutations.
Friday, May 4 -
Locality-sensitive hashing: partitioning the signature matrix into bands, definition of candidate pair, intuition behind LSH, two examples, analysis.
Monday, May 7 - Algorithms for
data streams. Motivations, applications, and theoretical model. Streaming puzzles: 1) missing number; 2) curious George goes fishing.
Friday, May 11 - Another streaming puzzle:
pointer and chaser (how to find a duplicate item). Sampling uniformly from infinite streams:
reservoir sampling (algorithm and analysis).
Monday, May 14 - IT Meeting
Friday, May 18 -
Triangle counting in (adversarial)
data streams: a sampling-based approach. Space
lower bound based on a communication complexity reduction.
Monday, May 21 - Paper hacking session. Discussion of the following papers:
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Apache Spark, 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012)
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Google Cloud Dataflow, 41st International Conference of Very Large Databases, VLDB 2015)
- BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark (IEEE/ACM 38th IEEE International Conference on Software Engineering, ICSE 2016)
- Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics (13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016)
Friday, May 25 -
Monday, May 28 -
Friday, June 1 -
--
Irene Finocchi
- May 2018