Multivariate Analysis of Distributional Data

Maria Paula Brito
LIAAD – INESC TEC – UP

November 26, 2019

In classical Statistics and Multivariate Data Analysis, data is usually represented in a matrix where each row represents a statistical unit, or “individual”, for which a single value is recorded for each numerical or categorical variable (in columns). This representation model is, however, too restrictive when the data to be analysed exhibits variability. That is the case when the entities under analysis are not single elements, but groups formed on the basis of some common properties.
In that case, for each descriptive variable, the variability observed within each group should be taken into account, to avoid a substantial loss of relevant information.
To this end, new variable types have been introduced, whose realizations are not single real values or categories, but sets, intervals, or, more generally, distributions over a given domain.
Symbolic Data Analysis provides a framework for the representation and analysis of such data, taking into account their inherent variability.
In this seminar, we focus on the case of numerical data described by empirical distributions, known as histogram data. We introduce alternative representations of histogram observations, and present summary statistics and distance measures, outlining their main properties.
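As a rough illustration of what a histogram observation, one of its representations, a summary statistic and a distance might look like, consider the minimal sketch below. It is not taken from the seminar: it assumes each observation is stored as a pair (bin edges, bin probabilities) with strictly positive probabilities, that values are uniformly spread within each bin, and it uses the Mallows (L2 Wasserstein) distance between quantile functions as one possible choice of distance. The names quantile, hist_mean and mallows_distance are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from math import sqrt


def quantile(edges, probs, t):
    """Quantile function of a histogram at level t in [0, 1].

    Assumes strictly positive bin probabilities and a uniform
    distribution of values within each bin (illustrative assumptions).
    """
    cum = [0.0] + list(accumulate(probs))
    # Index of the bin whose cumulative-probability range contains t.
    i = min(bisect_right(cum, t) - 1, len(probs) - 1)
    # Linear interpolation inside bin i.
    return edges[i] + (t - cum[i]) / probs[i] * (edges[i + 1] - edges[i])


def hist_mean(edges, probs):
    """Mean of a histogram: probability-weighted average of bin midpoints."""
    return sum(p * (a + b) / 2 for p, a, b in zip(probs, edges, edges[1:]))


def mallows_distance(h1, h2, n_grid=1000):
    """Approximate L2 distance between the two quantile functions,
    using a midpoint rule on n_grid points of (0, 1)."""
    total = 0.0
    for k in range(n_grid):
        t = (k + 0.5) / n_grid
        total += (quantile(*h1, t) - quantile(*h2, t)) ** 2
    return sqrt(total / n_grid)


# Two toy observations: the second is the first shifted by one unit.
h1 = ([0.0, 1.0, 2.0], [0.5, 0.5])
h2 = ([1.0, 2.0, 3.0], [0.5, 0.5])
print(hist_mean(*h1), mallows_distance(h1, h2))
```

Working with quantile functions rather than densities is convenient in this sketch because two histograms with different bin partitions can be compared directly on the common domain [0, 1].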
We then introduce open problems to be investigated:

Outlier detection in histogram data is an issue to be addressed.
Kernel density estimation may be applied to the empirical distributions, leading to density-valued variables. Multivariate analysis of the resulting density-valued data is a new line of research in Symbolic Data Analysis.
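To make the idea of a density-valued observation concrete, the sketch below builds a Gaussian kernel density estimate from the raw values of one group. It is only an illustration under stated assumptions: the Gaussian kernel, the normal-reference rule of thumb for the bandwidth, and the function name gaussian_kde are choices made here, not ones given in the abstract.

```python
from math import exp, pi, sqrt
from statistics import stdev


def gaussian_kde(sample, bandwidth=None):
    """Return a density function estimated from `sample`.

    Uses a Gaussian kernel; if no bandwidth is given, falls back on the
    normal-reference rule of thumb 1.06 * s * n**(-1/5) (an assumption
    made for this sketch).
    """
    n = len(sample)
    if bandwidth is None:
        bandwidth = 1.06 * stdev(sample) * n ** (-1 / 5)
    norm = n * bandwidth * sqrt(2 * pi)

    def density(x):
        # Sum of Gaussian bumps centred on the observed values.
        return sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

    return density


# One group of raw values becomes one density-valued observation.
group_values = [1.0, 2.0, 3.0, 4.0, 5.0]
f = gaussian_kde(group_values)
print(f(3.0))
```

Each group then contributes a whole function, rather than a single number or a set of bins, as its value on the variable.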

In collaboration with Sónia Dias, ESTG-IPVC & LIAAD – INESC TEC