
Computational and Statistical Methods of Data Reduction - Data Science PhD Course

Prof. Alessio Farcomeni (Univ. of Tor Vergata) - Dr. Chris Schwiegelshohn (DIAG - Sapienza)
Event date: Monday, 9 March 2020, 09:00 to Tuesday, 17 March 2020, 13:00
DIAG - Aula B203

1. Robust Statistics for Data Reduction (March 9th - 10th 2020, 09:00-13:00, DIAG, Aula B203)

Prof. Alessio Farcomeni (Tor Vergata)

2. Dimensionality Reduction in Clustering and Streaming (March 16th - 17th 2020, 09:00-13:00, DIAG, Aula B203)

Prof. Chris Schwiegelshohn (Sapienza)


1. Robust Statistics for Data Reduction
We will briefly introduce the main principles and ideas of robust statistics, focusing on trimming methods. The working example will be estimation of location and scatter in multidimensional problems, together with outlier identification. We will then discuss methods for robust clustering based on impartial trimming and snipping. Finally, a simple robust method for dimensionality reduction will be discussed. Illustrations will be based on the R software and some contributed extension packages.
Tentative schedule: Introduction to robust inference. Concepts of masking, swamping, breakdown point, Tukey-Huber contamination, and entry-wise contamination. Estimation of location and scatter based on the Minimum Covariance Determinant. The fastMCD algorithm. Outlier identification. Robust clustering: trimmed k-means, snipped k-means. The tclust and sclust algorithms. Selecting the trimming level and the number of clusters through the classification trimmed likelihood curves. Plug-in methods for dimension reduction. Brief overview of the most recent contributions and avenues for further work.
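As a rough illustration of the C-step idea behind the Minimum Covariance Determinant estimator in the schedule above, here is a simplified sketch in Python (the function name and all tuning defaults are illustrative; this is not the actual fastMCD algorithm, which adds partitioning and selective-iteration speedups):

```python
import numpy as np

def mcd_estimate(X, h=None, n_starts=20, n_csteps=10, seed=0):
    """Crude MCD sketch: random small subsets refined by C-steps,
    keeping the subset whose covariance has the smallest determinant."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = h if h is not None else (n + p + 1) // 2
    best = (np.inf, None)
    for _ in range(n_starts):
        idx = rng.choice(n, p + 1, replace=False)  # small initial subset
        for _ in range(n_csteps):
            mu = X[idx].mean(0)
            S_inv = np.linalg.pinv(np.cov(X[idx].T))
            # squared Mahalanobis distance of every point to the subset
            d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)
            idx = np.argsort(d2)[:h]               # C-step: keep h closest
        det = np.linalg.det(np.cov(X[idx].T))
        if det < best[0]:
            best = (det, idx)
    return X[best[1]].mean(0), np.cov(X[best[1]].T)
```

The robust location estimate is then used to flag outliers by their Mahalanobis distance; unlike the classical mean, it is not pulled toward a contaminating cluster.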
The course will be based on the book: Farcomeni, A. and Greco, L. (2015) Robust Methods for Data Reduction, Chapman & Hall/CRC Press
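To make the impartial-trimming idea concrete, here is a Lloyd-style trimmed k-means sketch in Python (the function name and the naive deterministic initialization are illustrative choices; this is not the tclust implementation):

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=100):
    """Trimmed k-means sketch: in each iteration the ceil(alpha*n) points
    farthest from their nearest center are set aside, and centers are
    recomputed from the remaining points only."""
    X = np.asarray(X, float)
    n = len(X)
    n_trim = int(np.ceil(alpha * n))
    centers = X[:k].copy()  # naive init for illustration
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, k)
        labels, nearest = d.argmin(1), d.min(1)
        keep = np.argsort(nearest)[: n - n_trim]  # trim the alpha-fraction
        new = np.array([X[keep][labels[keep] == j].mean(0)
                        if np.any(labels[keep] == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels, keep
```

Because gross outliers are always among the farthest points, they end up in the trimmed set rather than distorting the centers.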


2. Dimensionality Reduction in Clustering and Streaming

First day: The curse of dimensionality is a common occurrence when working with large data sets. In low dimensions (such as the Euclidean plane), we can visualize problems very well and often find interesting properties of a data set just by hand. In more than three dimensions, our ability to visualize a problem is already severely impaired, and our intuition from the Euclidean plane may lead us completely astray. Moreover, algorithms often scale poorly:

  1. Finding nearest neighbors in 2D can be done in nearly linear time. In high dimensions, it becomes very difficult to improve over quadratic (n^2) time.
  2. Geometric data structures and decompositions become hard to implement. Line sweeps, Voronoi diagrams, grids, and nets usually scale by at least a factor of 2^d, where d is the dimension; in some cases it is even worse.
  3. Many problems that are easy to solve in 2D, such as clustering, become computationally intractable in high dimensions. Often, exact solutions require running times that are exponential in the number of dimensions.

Unfortunately, high-dimensional data sets are not the exception but the norm in modern data analysis. As such, much of computational data analysis has been devoted to finding ways to reduce the dimension. In this course, we will study two popular methods, namely principal component analysis (PCA) and random projections. Principal component analysis originated in statistics, but is also known under various other names depending on the field (e.g. the eigenvector problem, low-rank approximation, etc.). We will illustrate the method, highlighting the problem it solves and the underlying assumptions of PCA. Next, we will see a powerful tool for dimension reduction known as the Johnson-Lindenstrauss lemma. It states that a point set A in arbitrarily high dimension can be transformed into a point set A' in dimension O(log |A|), while approximately preserving all pairwise distances (up to a factor 1 ± ε, with the target dimension growing as 1/ε^2). For both methods, we will see applications, including k-nearest neighbor classification and k-means.
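A minimal sketch of the random-projection side in Python: a Gaussian projection in the spirit of the Johnson-Lindenstrauss lemma (the function name and the constant 8 in the target dimension are illustrative choices, not the lemma's tight constants):

```python
import numpy as np

def jl_project(X, eps=0.25, seed=0):
    """Project n points in d dimensions down to m = O(log n / eps^2)
    dimensions with a scaled Gaussian random matrix, so that all
    pairwise distances are preserved up to a 1 +/- eps factor with
    high probability."""
    n, d = X.shape
    m = int(np.ceil(8 * np.log(n) / eps ** 2))   # illustrative constant
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))  # E||xR||^2 = ||x||^2
    return X @ R
```

Note that the target dimension depends only on the number of points n and on eps, not on the original dimension d, which is what makes the lemma useful for very high-dimensional data.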

Second day: Large data sets form a sister topic to dimension reduction. While the benefits of a small dimension are immediately understood, reducing the size of the data is a comparatively recent paradigm. There are many reasons for data compression. Aside from data storage and retrieval, we may want to minimize the amount of communication in distributed computing, enable online and streaming algorithms, or simply run an accurate (but expensive) algorithm on a smaller data set. A key concept in large-scale data analysis is the coreset: a succinct summary of a data set that behaves, for any candidate solution, like the original data set. The surprising success story of data compression is that for many problems we can construct coresets whose size is independent of the input size. For example, linear regression in d dimensions admits coresets of size O(d), and k-means admits coresets of size O(k) (up to accuracy factors), irrespective of the number of points in the original data set. In our course, we will describe the coreset paradigm formally. Moreover, we will give an overview of methods to construct coresets for various problems, including constructions based on random projections, on analyzing gradients, and on sampling. We will further highlight a number of applications.
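To make the "summary that behaves like the original data set" idea concrete, here is a naive uniform-sampling summary for the k-means cost in Python (illustrative only: genuine coreset constructions replace the uniform choice with sensitivity/importance sampling to obtain guarantees for every candidate solution, and the function names are made up for this sketch):

```python
import numpy as np

def kmeans_cost(X, centers, w=None):
    """(Weighted) sum of squared distances of points to their nearest center."""
    d2 = ((X[:, None] - centers[None]) ** 2).sum(2).min(1)
    return float(d2.sum() if w is None else d2 @ w)

def uniform_sample_summary(X, m, seed=0):
    """Uniformly sample m points, each carrying weight n/m, so that the
    weighted cost is an unbiased estimate of the full cost."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), m, replace=False)
    return X[idx], np.full(m, len(X) / m)
```

Evaluating a candidate set of centers on the weighted sample then approximates its cost on the full data, at a fraction of the work; the coreset framework formalizes when such summaries are accurate for all candidate solutions simultaneously.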

