Practical work

SOME SUGGESTIONS FOR PRACTICAL WORK

Suggestion 1. 

1.1 Implement the Morris counter. It should have two parameters: the base b > 1, and the number B of bits to be used for one counter in base b. In theory the counter can then count up to roughly n = b^(2^B), that is, B = log_2(log_b n).
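A minimal sketch of such a counter in Python (the class name, defaults, and representation are mine, not prescribed by the assignment):

    import random

    class MorrisCounter:
        """Stores only the exponent c, which fits in B bits; the count
        is estimated as (b**c - 1) / (b - 1), unbiased for b > 1."""

        def __init__(self, b=2.0, B=8):
            self.b = b               # base, must be > 1
            self.max_c = 2**B - 1    # largest exponent representable in B bits
            self.c = 0

        def increment(self):
            # raise the stored exponent with probability b^(-c)
            if self.c < self.max_c and random.random() < self.b ** (-self.c):
                self.c += 1

        def estimate(self):
            return (self.b ** self.c - 1) / (self.b - 1)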

1.2 Take B = 8, so that one byte is used per counter. Write a program that feeds your counter a large number of imaginary items and records the answer. Run it several times and compute the variance of the answers. If you run it enough times, draw a plot that shows how good it is at obtaining (epsilon, delta) approximations: for each delta (on the x axis), the smallest epsilon (on the y axis) such that the approximation is within epsilon with probability at least 1 - delta.
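One possible way to organize the experiment, assuming the MorrisCounter sketch above (the run counts and deltas are illustrative):

    import statistics

    def run_experiment(n=100_000, runs=200, b=2.0, B=8):
        estimates = []
        for _ in range(runs):
            mc = MorrisCounter(b=b, B=B)
            for _ in range(n):            # feed n imaginary items
                mc.increment()
            estimates.append(mc.estimate())
        print("variance:", statistics.variance(estimates))
        # empirical epsilon achieved with probability >= 1 - delta
        rel_errors = sorted(abs(e - n) / n for e in estimates)
        for delta in (0.5, 0.2, 0.1, 0.05):
            k = max(int((1 - delta) * runs) - 1, 0)
            print(f"delta={delta}: epsilon ~ {rel_errors[k]:.3f}")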

Of course, every value of b gives different results. Try a few.

Suggestion 2. 

1.1 Get implementations of Space-Saving and Count-Min sketch. See for example:
  • https://code.google.com/p/space-saving/
  • https://github.com/addthis/stream-lib
  • https://sites.google.com/site/countminsketch/code
  • http://s-yata.github.io/madoka/
  • https://github.com/alabid/countminsketch

1.2 Get access to a large text corpus. Try to clean it up minimally (keep only true words, formed by letters only; remove numbers, punctuation marks, URLs, and other junk; convert true words to lowercase).
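A minimal cleanup pass might look like this (the regexes are one crude choice among many; note, for instance, that this splits contractions such as "don't"):

    import re

    URL = re.compile(r"https?://\S+")
    WORD = re.compile(r"[a-z]+")

    def words(line):
        # drop URLs first, then keep purely alphabetic tokens, lowercased
        return WORD.findall(URL.sub(" ", line.lower()))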

1.3 Build a program that
1.3.1 Knows a manageable number N of "secret" words, say N = 1000, some very frequent, some less frequent.
1.3.2 Runs Space-Saving and Count-Min sketch on the corpus, to count frequencies of distinct words (a minimal Count-Min sketch is outlined right after this list).
1.3.3 In parallel, keeps the true counts of the "secret" words. They are "secret" because the sketches don't know them, so they can't treat them in any special way.
1.3.4 Now ask the sketches for their estimates of the frequencies of the N secret words. Compare them to the true frequencies that you have computed on the side. Give estimates of the errors made by the sketches, and compare them to the theoretical error bounds we gave in class.
1.3.5 Finally, compare the two sketches against each other.
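A minimal Count-Min sketch and comparison harness might look as follows; the width, depth, and salted use of Python's built-in hash are illustrative choices (the built-in hash is not the pairwise independent family assumed by the theoretical bounds):

    import random
    from collections import Counter

    class CountMinSketch:
        def __init__(self, width=2048, depth=5, seed=42):
            rnd = random.Random(seed)
            self.width = width
            self.tables = [[0] * width for _ in range(depth)]
            self.salts = [rnd.getrandbits(64) for _ in range(depth)]

        def _cells(self, item):
            for salt, table in zip(self.salts, self.tables):
                yield table, hash((salt, item)) % self.width

        def add(self, item):
            for table, j in self._cells(item):
                table[j] += 1

        def estimate(self, item):
            # never underestimates: take the minimum over the depth rows
            return min(table[j] for table, j in self._cells(item))

    def compare(stream, secret_words):
        """Run the sketch over the stream while keeping the true counts
        of the secret words on the side, then print both."""
        cms, true_counts = CountMinSketch(), Counter()
        for w in stream:
            cms.add(w)
            if w in secret_words:
                true_counts[w] += 1
        for w in secret_words:
            print(w, true_counts[w], cms.estimate(w))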

Recall that frequencies in natural language corpora follow power laws with exponents between 1 and 2 (values around 1.4-1.6 are typical), so the frequent words are quite frequent, but the tail of less frequent words is non-negligible.

If you can think of another stream with an interesting item distribution, by all means try it out instead of, or in addition to, the text corpus.
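For instance, a synthetic Zipf-distributed stream is easy to generate; here is one possible sketch using numpy (the exponent, vocabulary size, and other parameters are illustrative, and numpy's zipf requires an exponent strictly greater than 1):

    import numpy as np

    def zipf_stream(n_items=1_000_000, exponent=1.5, vocab=100_000, seed=0):
        rng = np.random.default_rng(seed)
        ranks = rng.zipf(exponent, size=n_items)  # Zipf-distributed item ids
        return ranks[ranks <= vocab]              # cap to a finite vocabulary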

Suggestion 3.

Download MOA and start it following the instructions at http://moa.cms.waikato.ac.nz/getting-started/

The MOA manual may be useful too.

We will use the Classification tab.

Let us first try the task "MeasureStreamSpeed". Have a look at the different stream generators:
  • ArffFileStream: reads from an existing ARFF file, processing instances in sequence
  • RandomTreeGenerator: generates a random decision tree, then uses it to label instances
  • HyperplaneGenerator: simulates a hyperplane that oscillates and is used as a linear classifier. You can change the number of attributes, the number of attributes that actually vary, the speed of change, and the probability (sigma) with which the direction of oscillation is reversed
  • RandomRBFGenerator: simulates a number of Gaussian clusters (centroids), each randomly assigned a class; instances are generated around randomly chosen centroids. There is a RandomRBFGeneratorDrift variant that lets us introduce drift: the centroids of the Gaussians drift slowly as time passes.

Note that there is also a task "WriteStreamToARFFFile": generate a stream and write it to an ARFF file.

Still within MeasureStreamSpeed, we can generate more complex streams with gradual or sudden changes by concatenating two or more streams. See the ConceptDriftStream stream generator: it lets us concatenate a base stream (to be chosen) and a drift stream (also to be chosen). You can choose the central position p at which the change occurs, and the window w over which the change occurs (i.e., the change starts at p-w and is complete at p+w). The two distributions are merged using a sigmoid; as an alternative to the window, you can use the parameter alpha, the slope of the sigmoid at its middle point.
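For illustration, the configuration string for such a concatenation might look roughly like this, with -s the base stream, -d the drift stream, -p the central position, and -w the window (the option letters follow MOA's conventions, but verify them in the configuration dialog of your version):

    ConceptDriftStream -s generators.RandomTreeGenerator
                       -d generators.HyperplaneGenerator
                       -p 5000000 -w 100000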

You can merge more than two streams recursively, by letting the drift stream itself be a ConceptDriftStream.

A variant, ConceptDriftRealStream, allows merging streams with different numbers of attributes. Note that not all classifiers support this.

In the Classification tab still, check out the following tasks:
  • LearnModel: learns a classifier model from a stream and exports it to a file (as in WEKA, it is a serialized Java object)
  • EvaluateModel: reads a classifier from a file and evaluates it on a stream
  • EvaluatePrequential: builds a classifier and evaluates it on the same stream; each stream element is first used for testing and then for training.
Let us stay with this one.

Now the bottom part of the screen comes alive: you can plot the evolution of accuracy, the kappa statistic, RAM-hours (memory used multiplied by how long it is held), time, and memory used. Memory is probably reported in MB.

Prepare a stream with drift. Let the first part be a RandomTreeGenerator (so it is an easy task for decision tree classifiers) and the second one a hyperplane (so it is easy for linear classifiers). Ask for a few million instances of it (for a total of 4-10 million), and a few tens of attributes in both streams (say 50, with say 25 of them drifting). It will now take a few minutes to run a classifier on this stream.
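Put together, the task configuration might look roughly as follows (all option letters and values are indicative; in particular, -a and -k for the hyperplane's attribute counts are my assumption, so check them in the MOA dialogs):

    EvaluatePrequential -i 5000000
      -s (ConceptDriftStream
            -s generators.RandomTreeGenerator
            -d (generators.HyperplaneGenerator -a 50 -k 25)
            -p 2500000 -w 100000)
      -l trees.HoeffdingTree

Swapping the -l argument then lets you run each of the classifiers discussed below on the same stream.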

Try the MajorityClass classifier. Its accuracy will be about 1/(number of classes), because classes are equiprobable in these generators.

Try the Perceptron algorithm (a linear classifier that adapts to change, but of course is only good for linear problems). It should have relatively low accuracy on the first part, then higher accuracy on the second.

Try a plain HoeffdingTree. This is the VFDT explained in class. It will do well in the first part, then worse in the second (and you will see its error oscillate as the hyperplane rotates).

Try a HoeffdingAdaptiveTree. It will still notice the change, but will do better on the second part, as it adapts more quickly to the change and to the hyperplane rotations. Note, on the other hand, that it is slower and uses more memory (in my runs, from approx. 30 MB to approx. 50 MB).

Suggestion 4. 

Following the same lines as in Suggestion 3, do something with MOA's clusterers.
 
