Sunday, June 21, 2015

Session 10 cancelled

Dear all,

Session 10 is cancelled, so the seminar is finished. I hope you have enjoyed it.

You will find in the Slides section two sets of slides by other people that I was planning to use.

Cheers,
Ricard

Thursday, June 4, 2015

Deadline for delivering practical work

Except for a couple of people whom I have contacted by email, let's say that I will accept your description of the practical work performed until July 20th.

Practical work = "lab" != exercises: practical work counts for the extra 1 credit.

For exercises, as stated during the course, the deadline is 2 weeks after they were proposed. Exercises give 2 credits if about 70% of them were handed in in reasonable shape. I'm open to negotiation.


Wednesday, May 27, 2015

No class on June 1st or 2nd

Just to clarify:

- no class on Monday June 1st, holiday
- no class on Tuesday June 2nd, even though it follows the Monday schedule at FIB

See you on June 8th.

Sunday, May 24, 2015

Lecture 7 and changes to Lecture 6 slides

The slides for Lecture 7 are available.

I have changed the last part of Lecture 6, on the ADWIN sketch, which we will also cover tomorrow.


Sunday, April 19, 2015

Another seminar on Data Streams - Toon Calders

By chance, another seminar on data streams will take place at UPC in the coming days, as part of the IT4BI:


Toon Calders (http://cs.ulb.ac.be/members/tcalders/doku.php) is one of the leading researchers in Data Mining in Europe. His seminar largely overlaps with Part I of this one, probably covering less material overall because it's shorter, but no doubt Toon's view will be extremely illuminating.

Contact Óscar Romero (oromero AT essi.upc.edu) if you are interested in attending.

Saturday, April 18, 2015

Slides for Lecture 2 (April 20th) are available

From now on, I will normally not create an entry just to announce that slides are ready, unless there is something else to say about that lecture. Just check the Slides page.

EDIT, April 19th, 13:35: slides 16-17 have been rewritten as slides 16-18.

Saturday, April 11, 2015

Slides for Lecture 1 (April 13th) are available

See the "Slides" page here.

EDIT, April 13th, 21:30:
  • Slides slightly revised after class.
  • The slides contain three exercises. As noted in Exercise 2, you have to do it yourself to understand the material, but you don't have to hand it in.
For all sessions: Exercises can be given to me via email or on paper, in class.

Tuesday, April 7, 2015

Welcome to the Spring 2015 (and second) edition of the MIRI Seminar on Data Streams

Topic overview:

Streaming is one of the central aspects of the "Big Data" slogan. At planetary scale, we are now generating more data than we can store, most of which will never be seen by human eyes. It often takes the form of sequences or streams of data items, arriving at high speed, potentially infinite, and evolving over time. The data stream paradigm contrasts with the usual batch, input-compute-output algorithmic paradigm in its sequential nature, and also in the strong computational requirements it imposes: one pass over the data, small memory, small computation time per data item, and the ability to give answers in real time and at any time.
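
As a small illustration of these requirements, here is a minimal Python sketch. It is not taken from the seminar slides: it uses Welford's well-known running mean and variance method as a stand-in example, and the names `RunningStats` and `sensor_readings`, as well as the sample values, are made up for this illustration. Each item is seen exactly once, memory use is constant, each update takes constant time, and an answer can be read off at any point in the stream.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass mean and variance (Welford's method) in constant memory."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Constant work per item; the item itself is never stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Population variance of the items seen so far; 0.0 if fewer than two.
        return self.m2 / self.count if self.count > 1 else 0.0


def sensor_readings():
    # Hypothetical stream: a generator, so each item is seen once and then gone.
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


stats = RunningStats()
for reading in sensor_readings():
    stats.update(reading)
    # An answer (stats.mean, stats.variance()) is available here at any time.

print(stats.count, stats.mean, stats.variance())
```

A batch algorithm would instead load all readings into a list and compute the statistics afterwards, which needs memory proportional to the stream length and cannot answer before the input ends.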

In this seminar we will: 1) Describe some of the scenarios where this paradigm is necessary (sensor networks, smart cities, social media, network monitoring, ...). 2) Describe algorithms for computing basic and not so basic queries over data streams. 3) Describe algorithms for mining knowledge from data streams (predictive models, clustering, pattern mining), as an extension of traditional data mining and machine learning. 4) Off class, experiment with software implementing some of these algorithms.



Logistics:
  • Instructor: Ricard Gavaldà
  • Mail: gavalda AT cs then the UPC domain. 
  • Meeting time: Mondays, 15:00 - 17:00
  • Meeting place: Room A6101, campus Nord UPC
  • Start date: April 13th
  • End date: June 22nd (estimated)
  • Credits (for MIRI students): 2 ECTS credits for solving the problems proposed in class; 1 extra ECTS credit for showing significant effort on the proposed practical work.
  • Materials: I will post all the materials here, either as blog entries or in the pages on the right-hand side.
  • Evaluation: in each session I will propose a few exercises; you should solve a reasonable number of these within the prescribed period, typically about 2 weeks. For the optional extra 1 credit, there will be a few deliverable lab assignments. I am open to discussing alternative evaluation methods.

Intended audience & requisites: 
  • Students enrolled in MIRI, particularly in the Advanced Computing and Data Mining & Business Intelligence specialities. 
  • But everybody is welcome. Feel free to tell other people to attend (but tell them to please notify me).
  • Some familiarity with probabilistic reasoning and algorithmics is assumed. Some familiarity with machine learning / data mining is helpful, but probably not essential. 
  • Some programming may be necessary for the optional practical work. I will try to give as much freedom as possible in the choice of programming language, but MOA is mostly Java.

Tentative Schedule:

Part 1: Data Stream Algorithmics
  • Lecture 1. The data stream model. Counting. Probability tools
  • Lecture 2. Frequency problems
  • Lecture 3. Finding frequent elements. The CM-sketch and applications
  • Lecture 4. Distributed sketching. Linear algebra. Dimensionality reduction
  • Lecture 5. Graph streams
Part 2: Learning and Mining in Data Streams
  • Lecture 6. Managing time change in data streams
  • Lecture 7. Data Stream Mining. Building decision trees
  • Lecture 8. Evaluation. More predictors. Clustering
  • Lecture 9. Frequent pattern mining in data streams 
  • Lecture 10. (Open to discussion) Distributed stream mining. Mining social media