Lectures and Readings (Big Data Computing, 2017)

Thursday, May 25 - Second midterm

Tuesday, May 23 - Discussion of midterm solutions

Thursday, May 11 - Sketches. Counting distinct items: probabilistic counting.
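
A toy Python sketch of the probabilistic-counting idea (the hash family, the averaging scheme, and all names here are mine; the classic Flajolet–Martin estimator additionally applies a correction factor ≈ 0.77351): for several hash functions, track the maximum number of trailing zero bits seen; 2 raised to that maximum approximates the number of distinct items.

```python
import random

def trailing_zeros(x):
    """Number of trailing zero bits of a positive integer."""
    return (x & -x).bit_length() - 1

def fm_estimate(stream, num_hashes=64, seed=0):
    """Probabilistic counting sketch: a distinct item has >= z trailing
    zeros in its hash with probability 2^-z, so the maximum observed z
    is roughly log2(#distinct).  Averaging over several hash functions
    reduces the variance.  Integer items keep hash() deterministic."""
    rng = random.Random(seed)
    mask = (1 << 61) - 1
    params = [(rng.randrange(1, 1 << 61) | 1, rng.randrange(1 << 61))
              for _ in range(num_hashes)]
    maxz = [0] * num_hashes
    for item in stream:
        x = hash(item)
        for i, (a, b) in enumerate(params):
            h = (a * x + b) & mask        # toy linear hash mod 2^61
            if h:
                maxz[i] = max(maxz[i], trailing_zeros(h))
    return 2 ** (sum(maxz) / num_hashes)
```

Space is a handful of counters per hash function, independent of the stream length.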

Tuesday, May 9 - Triangle counting in (adversarial) data streams: a sampling-based approach. Space lower bound based on a communication complexity reduction.
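
A minimal Python sketch of one sampling-based estimator (in the spirit of Buriol et al.; names and the sequential trial loop are mine, and in a real streaming setting all trials would run in parallel within a single pass): reservoir-sample one edge (a, b), pick a third node w uniformly, and check whether both (a, w) and (b, w) appear later in the stream. That event has probability T / (m (n − 2)), so scaling by m (n − 2) gives an unbiased estimate of the triangle count T.

```python
import random

def stream_triangle_estimate(edges, n, trials=2000, seed=0):
    """Average of `trials` independent one-pass estimators over a stream
    of m edges on n known vertices.  Each estimator outputs
    beta * m * (n - 2), where beta = 1 iff the sampled edge turned out
    to be the first-appearing edge of the triangle {a, b, w}."""
    rng = random.Random(seed)
    edges = list(edges)
    m = len(edges)
    total = 0.0
    for _ in range(trials):
        a = b = w = None
        seen_aw = seen_bw = False
        for i, (u, v) in enumerate(edges):
            if rng.randrange(i + 1) == 0:       # reservoir step: keep edge i
                a, b = u, v
                w = rng.choice([x for x in range(n) if x != a and x != b])
                seen_aw = seen_bw = False
            else:                               # look for the closing edges
                if {u, v} == {a, w}:
                    seen_aw = True
                if {u, v} == {b, w}:
                    seen_bw = True
        if seen_aw and seen_bw:
            total += m * (n - 2)
    return total / trials
```

On a triangle-free graph the estimate is exactly zero; the price of one pass is high variance, which is why many parallel estimators are averaged.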

Thursday, May 4 - Frequent items (aka heavy hitters). Problem definition: frequency threshold, approximating the solution (parameter epsilon) and probabilistic guarantees (parameter delta). Sticky sampling: algorithm, correctness and expected space.
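
A compact Python sketch of sticky sampling (class and attribute names are mine; the algorithm is Manku & Motwani's): sample items into a counter table with a rate r that doubles as the stream grows, diminish counters probabilistically at each rate change, and report items whose counter reaches (s − eps)·N.

```python
import math
import random

class StickySampling:
    """Sticky sampling for frequent items: expected space
    O((1/eps) * log(1/(s*delta))), independent of the stream length."""
    def __init__(self, s, eps, delta, seed=0):
        self.s, self.eps = s, eps
        self.rng = random.Random(seed)
        self.t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
        self.r = 1                    # current sampling rate
        self.window_end = 2 * self.t  # stream position at which r doubles
        self.n = 0
        self.counts = {}

    def process(self, item):
        self.n += 1
        if self.n > self.window_end:          # rate change: r doubles
            self.r *= 2
            self.window_end += self.r * self.t
            for key in list(self.counts):     # diminish each counter by a
                while self.rng.random() < 0.5:  # geometric number of coin tosses
                    self.counts[key] -= 1
                    if self.counts[key] == 0:
                        del self.counts[key]
                        break
        if item in self.counts:               # tracked items always count
            self.counts[item] += 1
        elif self.rng.random() < 1 / self.r:  # untracked: sample at rate 1/r
            self.counts[item] = 1

    def frequent_items(self):
        """Report items whose counter reaches (s - eps) * N."""
        thresh = (self.s - self.eps) * self.n
        return {k for k, c in self.counts.items() if c >= thresh}
```

With probability at least 1 − delta, every item of true frequency above s·N is reported, and no item below (s − eps)·N can be.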

Tuesday, May 2 - Another streaming puzzle: pointer and chaser (how to find a duplicate item). Sampling uniformly from infinite streams: reservoir sampling (algorithm and analysis).
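
The reservoir-sampling algorithm fits in a few lines of Python (function name is mine): keep the first k items, then replace a random slot with item i with probability k/i, which leaves every prefix item in the sample with equal probability.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items from a stream of unknown length:
    fill the reservoir with the first k items, then accept item i
    (1-indexed) with probability k/i, evicting a uniform slot."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = rng.randrange(i)      # uniform in [0, i)
            if j < k:                 # happens with probability k/i
                reservoir[j] = item
    return reservoir
```

The analysis from class is visible in the code: after step i, each of the first i items survives with probability exactly k/i.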

Thursday, April 27 - Algorithms for data streams. Motivations, applications, and theoretical model. Streaming puzzles: 1) missing number; 2) curious George goes fishing.
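
The missing-number puzzle has a one-counter solution; a minimal Python sketch (the function name is mine):

```python
def missing_number(stream, n):
    """Exactly one number from 1..n is missing from the stream.
    A single running sum suffices: the answer is n(n+1)/2 minus
    the sum of the stream.  One pass, O(log n) bits of space."""
    return n * (n + 1) // 2 - sum(stream)
```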

Tuesday, April 25 - Liberation day: classes suspended

Thursday, April 20 - Locality-sensitive hashing: partitioning the signature matrix into bands, definition of candidate pair, intuition behind LSH, two examples, analysis.
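
The LSH analysis from class reduces to one formula; a small Python sketch of it (function name is mine): with b bands of r rows, a band matches with probability s^r, so two documents with similarity s become a candidate pair with probability 1 − (1 − s^r)^b, an S-curve with threshold roughly (1/b)^(1/r).

```python
def candidate_probability(s, b, r):
    """Probability that two documents with Jaccard similarity s become
    a candidate pair when the signature matrix is split into b bands
    of r rows each: at least one of the b bands must agree on all
    r rows, i.e. 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b
```

For example, with b = 20 and r = 5 the curve jumps from near 0 to near 1 around the threshold (1/20)^(1/5) ≈ 0.55.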

From April 13 to April 18 - Easter holidays: classes suspended

Tuesday, April 11 - No class (we give back two hours to Prof. Massini)

Thursday, April 6 - Document similarity: set similarity, shingling, Jaccard similarity, minHashing. Analysis of minHashing with one random permutation. Detailed minHashing example. Implementation of minHashing: (1) how to do a single pass over the document matrix; (2) how to get rid of permutations.
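
A small Python sketch of minHashing with hash functions in place of permutations (the hash family and all names are mine): each signature entry is the minimum hash value over the document's shingles, and the fraction of agreeing entries estimates the Jaccard similarity.

```python
import random

def make_hash_funcs(k, seed=0):
    """k random linear hash functions x -> (a*x + b) mod p, which
    simulate k random permutations of the shingle space."""
    p = (1 << 61) - 1
    rng = random.Random(seed)
    funcs = []
    for _ in range(k):
        a, b = rng.randrange(1, p), rng.randrange(p)
        funcs.append(lambda x, a=a, b=b: (a * hash(x) + b) % p)
    return funcs

def minhash_signature(doc, hash_funcs):
    """Signature of a document (a set of shingles): for each hash
    function, the minimum hash value over the document's shingles."""
    return [min(h(x) for x in doc) for h in hash_funcs]

def estimate_jaccard(sig1, sig2):
    """Pr[two minhash values agree] equals the Jaccard similarity,
    so the fraction of agreeing signature entries estimates it."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

This is the "get rid of permutations" implementation: k hash functions replace k row permutations, and one pass over each document suffices.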

Tuesday, April 4 - First midterm (longer class: we take two hours from Prof. Massini)

Tuesday, March 28 - Triangle counting: naive approach, NodeIterator, NodeIterator++, MapReduce implementations.
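
A minimal sequential Python sketch of NodeIterator (names are mine): for every node, test each pair of neighbors for an edge; every triangle is found once per vertex, so the raw count is divided by 3. NodeIterator++ improves on this by charging each triangle only to its lowest-degree vertex.

```python
def node_iterator(adj):
    """NodeIterator triangle counting on an undirected graph given as
    {node: set_of_neighbors}.  Each triangle is counted three times,
    once per vertex, hence the division by 3."""
    count = 0
    for v, neigh in adj.items():
        ns = sorted(neigh)
        for i in range(len(ns)):
            for j in range(i + 1, len(ns)):
                if ns[j] in adj[ns[i]]:     # do the two neighbors touch?
                    count += 1
    return count // 3
```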

Thursday, March 23 - Mining networks: examples and properties of real-world networks, Erdős–Rényi model, locality and clustering coefficient, small-world properties (Milgram's experiment, six degrees of separation), degree distribution and power laws, scale-free networks. Generative network models: Watts & Strogatz model, preferential attachment model by Barabási & Albert.
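
The clustering coefficient mentioned above is easy to make concrete; a short Python sketch (names are mine):

```python
def clustering_coefficient(adj, v):
    """Local clustering coefficient of node v in an undirected graph
    {node: set_of_neighbors}: the fraction of pairs of v's neighbors
    that are themselves connected by an edge."""
    neigh = list(adj[v])
    d = len(neigh)
    if d < 2:
        return 0.0
    links = sum(1 for i in range(d) for j in range(i + 1, d)
                if neigh[j] in adj[neigh[i]])
    return 2 * links / (d * (d - 1))
```

In a clique every node has coefficient 1; in a star the center has coefficient 0, which is the locality gap that the Watts & Strogatz model is designed to close relative to Erdős–Rényi graphs.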

Tuesday, March 21 - MST algorithm: space analysis. Local space of the round-1 reduce functions, i.e., the expected value of |E_{i,j}|. Space analysis of round 2: size of the subgraph H. When is the local space sublinear? Deriving the optimal value of k; sublinearity on c-dense graphs.

Thursday, March 16 - Examples of simple MapReduce algorithms: matrix transpose, sum of n values (two approaches with different performance). Theoretical cost model. Computing a minimum spanning tree (MST) in MapReduce: high-level description of the algorithm and correctness proof (why critical edges are not discarded during the first round).

  • A model of computation for MapReduce, by H. Karloff, S. Suri & S. Vassilvitskii, in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2010). Sections 1, 2, 3; only the definitions covered in class.
  • Computational model, examples, MST: slides
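
A sequential Python simulation of the two-round MST algorithm (all names, the partitioning, and the use of Kruskal as the local solver are my choices for illustration): vertices are hashed into k parts; round 1 keeps, for every pair of parts, only a spanning forest of the induced subgraph; round 2 computes the MST of the surviving edges. Correctness mirrors the class argument: an edge discarded in round 1 is the heaviest on a cycle of the induced subgraph, hence on a cycle of G, so it cannot be critical.

```python
import random
from itertools import combinations

def kruskal(nodes, edges):
    """Minimum spanning forest via Kruskal with union-find
    (path halving).  Edges are (weight, u, v) tuples."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def mapreduce_mst(nodes, edges, k=3, seed=0):
    """Two-round MST in the style of Karloff, Suri & Vassilvitskii.
    Round 1: for each pair of vertex parts, keep only the spanning
    forest of the induced subgraph (each such subgraph has ~2m/k^2
    edges in expectation).  Round 2: sequential MST of the union."""
    rng = random.Random(seed)
    edges = list(edges)
    part = {v: rng.randrange(k) for v in nodes}
    survivors = []
    for i, j in combinations(range(k), 2):
        group = {v for v in nodes if part[v] in (i, j)}
        sub = [(w, u, v) for (w, u, v) in edges if u in group and v in group]
        survivors.extend(kruskal(group, sub))
    return kruskal(nodes, set(survivors))
```

With distinct edge weights the MST is unique, so the two-round result coincides with a direct sequential MST.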

Tuesday, March 14 - MapReduce runtime system: distributed execution overview, master, workers, key partitioning, local storage vs. distributed file system, task coordination, failures, combiners.

  • MapReduce runtime system: slides

Thursday, March 9 - Hands-on class. Introduction to Hadoop, configuration, context, mapper and reducer classes, basic writable types (Text, IntWritable, LongWritable). Implementing and running our own WordCount example. Main HDFS commands.

Tuesday, March 7 - MapReduce: stable storage, distributed file system, programming model (data as key-value pairs, map/reduce functions, wordcount example).
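
The programming model can be simulated in-memory in a few lines; this Python sketch (function names are mine) mimics the map → shuffle/sort → reduce dataflow on the wordcount example:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    """Map: emit (word, 1) for every word in the input line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum the partial counts for one word."""
    yield (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny sequential simulation of the MapReduce dataflow:
    apply map to every (key, value) record, group the intermediate
    pairs by key (the shuffle), and apply reduce to each group."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    intermediate.sort(key=itemgetter(0))          # shuffle & sort
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return output
```

The real runtime distributes exactly these three stages over many machines; only the grouping-by-key contract is essential to the programmer.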

Thursday, March 2 - Spatial and temporal locality. Case study: matrix multiplication. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication. I/O-optimal recursive implementation.
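
A Python sketch of the blocked iterative algorithm (names and the block size are mine): multiplying tile by tile keeps a block of each matrix hot in cache, so each element is reused many times before eviction.

```python
def blocked_matmul(A, B, n, block=32):
    """Blocked (tiled) n x n matrix multiplication.  Working on
    block x block tiles that fit in fast memory improves temporal
    locality: in the external-memory model this cuts the I/O count
    from O(n^3 / B) down to O(n^3 / (B * sqrt(M)))."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # multiply one tile of A by one tile of B into C
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a = A[i][k]
                        row_b, row_c = B[k], C[i]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a * row_b[j]
    return C
```

The recursive version discussed in class achieves the same I/O bound without knowing the cache size.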

Tuesday, February 28 - Hands-on class. Processing data on disk: tiny experiments on sequential vs. random accesses, blocking, spatial locality issues. Sorting data on disk: bottom-up iterative implementation of 2-way mergesort; k-way mergesort, with full analysis. Hints for an efficient implementation based on priority queues. Can we do more than one pass? A back-of-the-envelope calculation.

  • C code used in class and experiments vademecum
  • K. Mehlhorn, P. Sanders. Algorithms and data structures: The basic toolbox, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). On-line version.
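
The priority-queue hint above can be made concrete; a Python sketch of the k-way merge step using the standard-library heap (names are mine):

```python
import heapq

def kway_merge(runs):
    """Merge k sorted runs with a priority queue: repeatedly extract
    the smallest head among the runs.  Each of the N items costs
    O(log k), and one k-way pass replaces log2(k) passes of 2-way
    merging, which is the key to the I/O-efficient external sort."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):              # advance within run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```

In the external-memory setting each run would be read block by block, with k = M/B input buffers resident in memory at once.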

Thursday, February 23 - Processing data on disk: the external memory model (Aggarwal & Vitter): parameters M and B. Fundamental external memory bounds: scan and sort. Data permutation: I/O-analysis of the naive linear-time algorithm, scan&sort algorithm. Sorting data on disk: 2-way mergesort, I/O analysis and a key inefficiency.
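
A back-of-the-envelope companion to the scan and sort bounds (function names are mine; M and B are the lecture's memory and block parameters):

```python
import math

def scan_ios(N, B):
    """scan(N) = ceil(N / B): reading N items costs one I/O per block."""
    return math.ceil(N / B)

def sort_passes(N, M, B):
    """Number of merge passes of external mergesort: run formation
    leaves ceil(N / M) sorted runs, and each pass merges M/B runs at
    a time, giving ceil(log_{M/B}(N / M)) passes, i.e. the
    O((N/B) * log_{M/B}(N/B)) sort bound."""
    runs = math.ceil(N / M)
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / (M // B))
        passes += 1
    return passes
```

For example, with N = 2^30 items, M = 2^20 and B = 2^10, run formation plus a single merge pass suffice, which is why "one or two passes over the data" is the realistic regime for disk-based sorting.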

Tuesday, February 21 - Course introduction and overview. Administrivia. The four V's of big data. Volume: how big is "big"? What can we do with big data? How can we program big data systems? Velocity: what can we compute if we cannot even store the input?

-- Irene Finocchi - May 2017

