---++Lectures and Readings (Big Data Computing, 2018) <font color="#AF0F0F"><b>Friday, March 2</b></font> - Course introduction and *overview*. The *four V's* of big data. Volume: *how big is "big"?* What can we do with big data? How can we program *big data systems*? Velocity: what can we compute *if we cannot even store the input*? * [[https://drive.google.com/file/d/1-QZUhXEkIY_oxnBu9I1dYeOZH9mwqxqA/view?usp=sharing][Slides with course introduction]] * Nice reading: [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1][the sexiest job of the 21st century]] <font color="#AF0F0F"><b>Friday, March 9</b></font> - Processing data on disk: the *external memory model* (Aggarwal & Vitter): parameters M and B. Fundamental external memory bounds: *scan* and *sort*. *Data permutation*: I/O-analysis of the naive linear-time algorithm, scan&sort algorithm. *Sorting data on disk*: 2-way mergesort, I/O analysis. * [[https://drive.google.com/file/d/1-xjImcN-PMUDll1-fLB1jxrSzXp2lgXH/view?usp=sharing][External memory: introductory slides]] * Notes on [[https://drive.google.com/file/d/1-eh-B-MtSdjA1Ye4zdV-LTDUWUbDz5fB/view?usp=sharing][scan&sort permutation algorithm, binary mergesort I/O analysis]] * K. Mehlhorn, P. Sanders. _Algorithms and data structures: The basic toolbox_, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). [[http://www.mpi-inf.mpg.de/~mehlhorn/Toolbox.html][On-line version]]. <font color="#AF0F0F"><b>Monday, March 12</b></font> - A key inefficiency of 2-way mergesort. *k-way mergesort*, with full analysis. Hints for an efficient implementation. The 1TB sorting problem: can we do more than one pass? A back of the envelope calculation. * Notes on [[https://drive.google.com/file/d/1hRqnX1R8YFTz8pSy0dj9OGd-LyiBZWqY/view?usp=sharing][k-way mergesort]] * K. Mehlhorn, P. Sanders. _Algorithms and data structures: The basic toolbox_, 2009, Chapter 5 (Sorting and selection), Section 7 (without sample sort). [[http://www.mpi-inf.mpg.de/~mehlhorn/Toolbox.html][On-line version]]. <font color="#AF0F0F"><b>Friday, March 16</b></font> - Spatial and temporal *locality*. Case study: *matrix multiplication*. Number of I/Os of the standard algorithm. Reuse distance. Data layout and blocking. Blocked iterative matrix multiplication. I/O-optimal recursive implementation. * C. Demetrescu and I. Finocchi: notes on [[%ATTACHURL%/matmul.pdf][blocked matrix multiplication: iterative algorithm]]. * [[http://twiki.di.uniroma1.it/pub/BDC/Schedule/ViS94.sorting_io.pdf][Divide-and-conquer algorithm]] by Vitter & Shriver (only Section 7, matrix multiplication). * You can now answer Question 1 of [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][last year's midterm]] <font color="#AF0F0F"><b>Monday, March 19</b></font> - <b>MapReduce</b>: stable storage, *distributed file system*, *programming model* (data as key-value pairs, map/reduce functions, wordcount example). * Dean & Ghemawat, [[http://research.google.com/archive/mapreduce-osdi04.pdf][MapReduce: Simplified Data Processing on Large Clusters]], OSDI 2004 paper * !MapReduce programming model: [[https://drive.google.com/file/d/1cypKKWfYF58LuadO8fDUMWendD8o5jKJ/view?usp=sharing][slides]] <font color="#AF0F0F"><b>Friday, March 23</b></font> - <font color=green><i><b>Hands-on class</b></i></font>. Introduction to *Hadoop*, *configuration*, *context*, *mapper and reducer classes*, basic *writable* types (Text, !IntWritable, !LongWritable). Implementing <b>WordCount</b>. * Wordcount [[https://drive.google.com/file/d/1qfFYjzwFhc2SUFX9mgNhrfailJP8WL8i/view?usp=sharing][implementation, data sets, how-to-run notes]] * Hadoop: [[https://drive.google.com/file/d/1PzUGdXo5ILzDV5oQ49Eij1P4Tna9dcl-/view?usp=sharing][slides]] * If you want to avoid installing Hadoop on your platform (or want to do that later), you can use the [[https://drive.google.com/file/d/0B1yYvm6QgJReUlgyc0pBQk1OOGc/view?usp=sharing][LXLE virtual machine]] (file size: 4GB). This requires !VirtualBox, which is available on the Oracle website. <font color="#AF0F0F"><b>Monday, March 26</b></font> - Running !WordCount: main *HDFS commands*. !MapReduce *runtime system*: distributed execution overview, master, workers, key partitioning, local storage vs. distributed file system, task coordination, failures. * !MapReduce runtime system: [[https://drive.google.com/file/d/11Hc6z-YCzoiPKoTJs9-_OGHg0F1yV8lj/view?usp=sharing][slides]] <font color="#AF0F0F"><b>Friday, April 6</b></font> - *Good coding practices*: examples of simple !MapReduce algorithms (matrix transpose, sum of n values). Towards a *theoretical cost model*. Computing a *minimum spanning tree* (MST) in !MapReduce: high-level description of the algorithm. * [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf][A model of computation for MapReduce]], by H. Karloff, S. Suri & S. Vassilvitskii, appeared in Proceedings of the ACM SIAM Symposium on Discrete Algorithms, SODA 2010 (Sections 1, 2, 3, only definitions covered in class). * Computational model, examples, MST: [[https://drive.google.com/file/d/1-8bFMvAiHQwZdrWYKKIepheG6LeSPc92/view?usp=sharing][slides]] <!-- and *correctness* proof (why critical edges are not discarded during the first round)--> <font color="#AF0F0F"><b>Monday, April 9</b></font> - *MST algorithm*: *correctness* and *space analysis*. Local space of reduce 1 functions, i.e., expected value of _|E<sub>i,j</sub>|_. Space analysis of round 2: size of subgraph _H_. When is local space sublinear? Deriving the optimal value for _k_, sublinearity on c-dense graphs. * MST: [[https://drive.google.com/file/d/1vxJBozOo4pYEbgNs5yl8QUoJlhR_Zkhs/view?usp=sharing][notes on space analysis - part 1]] * MST: [[https://drive.google.com/file/d/1KX9_XjLkQLMuSYAhLvHz77FwJIjSAhqR/view?usp=sharing][notes on space analysis - part 2]] * [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf][A model of computation for MapReduce]], by H. Karloff, S. Suri & S. Vassilvitskii, appeared in Proceedings of the ACM SIAM Symposium on Discrete Algorithms, 2010 (Section 5, only definitions and proofs covered in class). <font color="#AF0F0F"><b>Friday, April 13</b></font> - Mining networks: examples and properties of *real-world networks*, *Erdős-Rényi* model, locality and *clustering coefficient*, *small world* properties (Milgram's experiment, six degrees of separation), degree distribution and *power laws*, scale free networks. Generative network models: *Watts & Strogatz* model, *preferential attachment* model by Barabasi & Albert. * Slides on [[https://drive.google.com/file/d/1_gARPwx3axLFSgCEO1AE6GK3aK3KLRob/view?usp=sharing][real-world networks properties]] <font color="#AF0F0F"><b>Monday, April 16</b></font> - *Triangle counting*: naive approach, !NodeIterator, <b>!NodeIterator++</b>, !MapReduce implementations. * Slides on [[https://drive.google.com/file/d/1JSSt44ue6EM6Hteje3xxFz5mbL4GXu3W/view?usp=sharing][triangle counting]] * [[https://theory.stanford.edu/~sergei/papers/www11-triangles.pdf][Paper]] on triangle counting <font color="#AF0F0F"><b>Friday, April 20</b></font> - Exercises: exam [[https://drive.google.com/file/d/11V-EJy51hJKiYlY7H-JR1ycaTuY5Izhs/view?usp=sharing][June 8, 2017]], exam [[https://drive.google.com/file/d/10h0NOjgopi2D8gk0VKwyepXyz86mhQD0/view?usp=sharing][June 28, 2017]], [[https://drive.google.com/file/d/13Mph5TV5NuzvHJHmG4E-ha5oG0F0MCRu/view?usp=sharing][first midterm 2017]]. [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][Midterm solutions]]. <font color="#AF0F0F"><b>Monday, April 23</b></font> - First midterm. <font color="#AF0F0F"><b>Friday, April 27</b></font> - *Document similarity*: set similarity, *shingling*, *Jaccard similarity*, *minHashing*. Analysis of minHashing with one random permutation. Detailed !minHashing example. *Implementation of minHashing*: (1) how to do a single pass over the document matrix; (2) how to get rid of permutations. * [[https://drive.google.com/file/d/1tMO5IM-86mNurM5M73AFY0h9kiYnOfli/view?usp=sharing][Slides used in class]]. * [[http://www.mmds.org/][Mining of Massive Datasets]] book, sections 3.1, 3.2, 3.3 <font color="#AF0F0F"><b>Friday, May 4</b></font> - *Locality-sensitive hashing*: partitioning the signature matrix into bands, definition of candidate pair, intuition behind LSH, two examples, analysis. * Slides: see previous lesson * [[http://www.mmds.org/][Mining of Massive Datasets]] book, section 3.4 <font color="#AF0F0F"><b>Monday, May 7</b></font> - Algorithms for *data streams*. Motivations, applications, and theoretical model. Streaming puzzles: 1) missing number; 2) curious George goes fishing. * [[https://drive.google.com/file/d/1gQjvCctCVFd7M7OiT7dqJYN0FfeRZiVy/view?usp=sharing][Class slides]] * M. Muthukrishnan: survey on [[https://www.cs.rutgers.edu/~muthu/stream-1-1.ps][data streams: algorithms and applications]] * C. Demetrescu and I. Finocchi: survey on [[http://twiki.di.uniroma1.it/pub/BDC/WebHome/SurveyStreaming08-DemetrescuFinocchi.pdf][algorithms for data streams]], Handbook of Applied Algorithms: Solving Scientific, Engineering and Practical Problems, 2008. <font color="#AF0F0F"><b>Friday, May 11</b></font> - Another streaming puzzle: *pointer and chaser* (how to find a duplicate item). Sampling uniformly from infinite streams: *reservoir sampling* (algorithm and analysis). * See previous class for pointer and chaser slides * [[https://drive.google.com/file/d/1T9tFqzXTPwkjC6luycmQJK2ofUc5MXdJ/view?usp=sharing][Reservoir sampling slides]] <font color="#AF0F0F"><b>Monday, May 14</b></font> - IT Meeting <font color="#AF0F0F"><b>Friday, May 18</b></font> - *Triangle counting in* (adversarial) *data streams*: a sampling-based approach. Space *lower bound* based on a communication complexity reduction. * [[https://drive.google.com/file/d/1mzwXMKjosHsYORgGLsewd62QKGBTRsu1/view?usp=sharing][Triangle counting slides]]: upper and lower bounds on triangle counting in data streams <!-- * PODS 2006 (ACM Symposium on Principles of Database Systems) paper on triangle counting: [[%ATTACHURL%/p253-buriol.pdf][Triangle counting in data streams]], by Buriol, Frahling, Leonardi, Marchetti-Spaccamela, and Sohler. --> <font color="#AF0F0F"><b>Monday, May 21</b></font> - Paper hacking session. Discussion of the following papers: * Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Apache Spark, 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012) * The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Google Cloud Dataflow, 41st International Conference of Very Large Databases, VLDB 2015) * !BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark (IEEE/ACM 38th IEEE International Conference on Software Engineering, ICSE 2016) * Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics (13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016) <font color="#AF0F0F"><b>Friday, May 25</b></font> - <font color="#AF0F0F"><b>Monday, May 28</b></font> - <font color="#AF0F0F"><b>Friday, June 1</b></font> - <!-- <font color="#AF0F0F"><b>Tuesday, May 23</b></font> - Discussion of [[https://drive.google.com/file/d/0B1yYvm6QgJReTFdpZjFFb0daSWM/view?usp=sharing][midterm solutions]] <font color="#AF0F0F"><b>Thursday, May 11</b></font> - *Sketches*. Counting distinct items: *probabilistic counting*. * [[https://drive.google.com/file/d/0B1yYvm6QgJReUnNHLTNmS1lLYzA/view?usp=sharing][Probabilistic counting slides]] (except for slides 9 - 13). <font color="#AF0F0F"><b>Thursday, May 4</b></font> - Frequent items (aka *heavy hitters*). *Problem definition*: frequency threshold, approximating the solution (parameter epsilon) and probabilistic guarantees (parameter delta). *Sticky sampling*: algorithm, correctness and expected space. * Heavy hitters: [[https://drive.google.com/file/d/0B1yYvm6QgJReYzUzT1I3ZktoTk0/view?usp=sharing][problem definition and algorithm]], [[https://drive.google.com/file/d/0B1yYvm6QgJReNVJESVB6Wk5lT0U/view?usp=sharing][analysis]] * [[http://www.americanscientist.org/issues/pub/the-britney-spears-problem/1][The Britney Spears problem: tracking who's hot and who's not presents an algorithmic challenge]], by Brian Hayes. A popular paper on stream algorithmics appeared in volume 96 of the American Scientist, 2008. * Original VLDB 2002 paper describing Sticky Sampling: [[%ATTACHURL%/02vldb-freq.pdf][Approximate Frequency Counts over Streaming Data]], by G. Manku and R. Motwani. * On [[http://news.stanford.edu/news/2009/june10/rajeev_motwani-061009.html][Rajeev Motwani]], co-designer of sticky sampling. Stanford Report, 2009. <font color="#AF0F0F"><b>Tuesday, April 25</b></font> - Liberation day: classes suspended <font color="#AF0F0F"><b>From April 13 to April 18</b></font> - Easter holidays: classes suspended <font color="#AF0F0F"><b>Tuesday, April 11</b></font> - No class (we give back two hours to Prof. Massini) ============================================================================================================ <font color="#AF0F0F"><b>Tuesday, April 26</b></font> - <font color=green><i><b>Hands-on class</b></i></font>: stable storage and Elastic !MapReduce on *AWS* (Amazon Web Services). * [[https://drive.google.com/file/d/0B1yYvm6QgJReVXF5a3ZPbUp2SjQ/view?usp=sharing][Slides on AWS]] (by Emilio Coppa) * Slides on minHashing implementation: see [[https://drive.google.com/file/d/0B1yYvm6QgJReeUZpTWdtbi1vN0k/view?usp=sharing][April 19 slides]], last part <font color="#AF0F0F"><b>Monday, April 11</b></font> (longer class starting at 10:15) - <font color=green><i><b>Hands-on class</b></i></font> on network mining: computing the *degree distribution in !MapReduce* + output *post-processing* through Unix commands + *visualization* in Excel. *New Hadoop features*: * how to *split input files* (class !TextInputFormat vs. class !KeyValueTextInputFormat) * how to let Hadoop *parse common arguments automatically* (Tool and !ToolRunner) * how to *pass arguments* to mappers and reducers. * Degree calculator: [[https://drive.google.com/file/d/0B1yYvm6QgJReT09teXhFMXVQTXM/view?usp=sharing][Hadoop code]] <font color="#AF0F0F"><b>Tuesday, February 28</b></font> - <font color=green><i><b>Hands-on class</b></i></font>. Processing data on disk: tiny *experiments* on sequential vs random accesses, blocking, spatial locality issues. * [[https://drive.google.com/file/d/0B1yYvm6QgJReWkVGalc2anR3emM/view?usp=sharing][C code used in class]] and experiments vademecum <font color="#AF0F0F"><b>Tuesday, May 19</b></font> - Graduation day: class suspended --> -- [[http://www.dsi.uniroma1.it/~finocchi][Irene Finocchi]] - May 2018
This topic: BDC
>
WebHome
>
ScheduleOld
Topic revision: r2 - 2019-02-26 - IreneFinocchi
Copyright © 2008-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback