This list is not meant to be bibliographically rigorous, but just to help you find links to the papers; that's why sources are often incomplete or not to the final archival version. Better search DBLP if you want to cite propertly.
On the other hand, some of the papers are here for historical completeness, you probably have better sources than the original one if you want to know what they do.
Surveys on stream algorithmics:
- Survey by Liberty and Nelson: http://www.cs.yale.edu/homes/el327/papers/streaming_data_mining.pdf
- A very general bibliography by K. Tufte: http://web.cecs.pdx.edu/~tufte/410-510DS/readings.htm
- Lecture notes by A. Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS85-Fall09/Notes/lecnotes.pdf
- Survey by Lin and Zhang: http://www.cse.unsw.edu.au/~yingz/papers/apweb_2008.pdf
- Book by G. Cormode, M. Garofalakis, P. Haas, and C. Jermain:http://dimacs.rutgers.edu/~graham/pubs/html/CormodeGarofalakisHaasJermaine12.html
- Survey by G. Cormode: http://dimacs.rutgers.edu/~graham/pubs/papers/sk.pdf
- The original Morris77 paper: http://dl.acm.org/citation.cfm?id=359627, also available here: http://www.inf.ed.ac.uk/teaching/courses/exc/reading/morris.pdf
- An analysis of Morris' counter, by Flajolet: http://algo.inria.fr/flajolet/Publications/Flajolet85c.pdf
- The application of Morris' counters to counting n-grams, by Van Durme and Lall: http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallIJCAI09.pdf
- (math intensive) Two explanations of large deviation bounds that I like, by A. Sinclair http://www.cs.berkeley.edu/~sinclair/cs271/n13.pdf and by G. Lugosi http://www.econ.upf.edu/~lugosi/anu.pdf
- Noga Alon, Yossi Matias, Mario Szegedy: The space complexity of approximating frequency moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999). Conference version, Symp. on the Theory of Computing (STOC) 1996.
- Piotr Indyk, David P. Woodruff: Optimal approximations of the frequency moments of data streams. STOC 2005: 202-208. The paper concentrates on proving lower bounds, which is out of the scope of this course.
- Good general survey of distinct element counting up to 2008: Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi: Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. EDBT 2008: 618-629.
- Also general discussion on disctint element counting: http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
- Presentation including some sketches I didn't mention: http://www.cs.upc.edu/~conrado/research/talks/aofa2012.pdf
- K.Y. Whang, B. Vander-Zanden, H.M. Taylor, A Linear-time Probabilistic Counting Algorithm for Database Applications. ACM Trans. Database Syst., 15:2, 1990.
- Edith Cohen, Size-Estimation Framework with Applications to Transitive Closure and Reachability . FOCS 1994 and JCSS 1997.
- The flajolet-martin probabilistic counter. Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985). See also http://en.wikipedia.org/wiki/Flajolet-Martin_algorithm
- Durand, M.; Flajolet, P. (2003). "Loglog Counting of Large Cardinalities". Algorithms - ESA 2003. Lecture Notes in Computer Science 2832. p. 605.
- Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
- Flajolet's contributions explained beautifully by J. Lumbroso: http://www.stat.purdue.edu/~mdw/ChapterIntroductions/ApproxCountingLumbroso.pdf
- Kane, Nelson, Woodruff. An Optimal Algorithm for the Distinct Elements Problem. Principles of Database Systems (PODS) 2010
- http://en.wikipedia.org/wiki/HyperLogLog
- http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
- A live demo of hyperloglog at the web above: http://content.research.neustar.biz/blog/hll.html
- http://www.slideshare.net/sunnyujjawal/hyperloglog-in-practice-algorithmic-engineering-of-a-state-of-the-art-cardinality-estimation-algorithm
- http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
- Important optimizations that I'd like to try: http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
- Also here: http://research.google.com/pubs/pub40671.html
- J. Vitter. Random Sampling with a reservoir. ACM Trans. on Mathematical Software, 1985.
- Good survey of heavy hitter algorithms. Radu Berinde, Graham Cormode, Piotr Indyk, Martin J. Strauss. Space-optimal Heavy Hitters with Strong Error Bounds
- Also very good survey: Graham Cormode, Marios Hadjieleftheriou. Finding Frequent Items in Data Streams. Proc. VLDB Endowment, 2008
- Richard M. Karp, Scott Shenker, Christos H. Papadimitriou. A Simple Algorithm for Finding Frequent Elements in Streams and Bags. ACM Transactions on Database Systems (TODS), Volume 28, 2003.
- The Space-Saving sketch paper. Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. Intl. Conf. on Database Technology (ICDT) 2005.
- M. Charikar, K. Chen and M. Farach-Colton. "Finding Frequent Items in Data Streams." ICALP 2002 (conf. version) and Theoretical Computer Science 2004 (journal version)
- The CM-Sketch paper. Graham Cormode and S. Muthukrishnan: An improved data stream summary: The Count-min sketch and its applications. J. Algorithms 55: 29–38
- On Frugal Streaming, a neat sketch for estimating quantiles which I did not cover in the course: http://research.neustar.biz/2013/09/16/sketch-of-the-day-frugal-streaming/
- http://en.wikipedia.org/wiki/Count-min_sketch
- https://sites.google.com/site/countminsketch/
- https://tech.shareaholic.com/2012/12/03/the-count-min-sketch-how-to-count-over-large-keyspaces-when-about-right-is-good-enough/
- Discussions on sketch merging are a bit all over. This is sort of an overview.
- Christopher R. Palmer, Phillip B. Gibbons and Christos Faloutsos, ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs, KDD 2002.
- Paolo Boldi, Marco Rosa, and Sebastiano Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. WWW, 2011.
- Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, Sebastiano Vigna. Four Degrees of Separation. ACM Web Science 2012, 2012.
- A survey on streaming graph algorithms: http://people.cs.umass.edu/~mcgregor/papers/13-graphsurvey.pdf
No comments:
Post a Comment