Invited speakers

 

Charles Bouveyron, Université Côte d'Azur

Bayesian sparsity for statistical learning in high dimensions

Statistical learning in high dimensions has recently benefited from significant improvements. However, there is still a need for learning methods that are both efficient and interpretable. In particular, in many applications it is of great interest to be able to identify the original variables that are relevant to the problem under study. In the context of PCA, we propose a Bayesian procedure called globally sparse probabilistic PCA (GSPPCA) that yields several sparse components sharing the same sparsity pattern, allowing the practitioner to identify the original variables that are relevant for describing the data. Using Roweis' probabilistic interpretation of PCA and a Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. To avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented, which allows a path of models to be found using a variational expectation-maximization algorithm. The exact marginal likelihood is then maximized over this path. The approach is illustrated on real and synthetic data sets. In particular, on unlabeled microarray data, GSPPCA infers much more relevant gene subsets than traditional sparse PCA algorithms.
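The notion of a *global* sparsity pattern, one set of relevant variables shared by all components, can be made concrete with a naive stand-in. The sketch below is not the GSPPCA algorithm (which relies on the exact marginal likelihood and a variational EM path); it simply scores each variable by its total squared loading over the leading PCA components on hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first 5 of 20 variables carry the rank-2 signal.
n, p, q, relevant = 100, 20, 2, 5
W = np.zeros((p, q))
W[:relevant] = rng.normal(size=(relevant, q))
X = rng.normal(size=(n, q)) @ W.T + 0.1 * rng.normal(size=(n, p))

# PCA loadings via SVD of the centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt[:q].T                    # p x q loading matrix

# One relevance score per variable, shared across all q components:
# the sum of squared loadings. This is the "global" sparsity pattern.
score = (loadings ** 2).sum(axis=1)
selected = np.argsort(score)[::-1][:relevant]
print(sorted(int(i) for i in selected))
```

A thresholded version of this score acts like a shared support for all components, which is what distinguishes globally sparse models from component-wise sparse PCA.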

 

*********************************

 

Why tensors are so useful in chemistry

The multi-way model PARAFAC has changed how complex fluorescence data are modeled. We will exemplify its use and show what is possible using advanced data analysis. We will also show what related methods such as PARAFAC2 can do, e.g. in chromatography.
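As a minimal illustration of what PARAFAC computes, a rank-R CP/PARAFAC model of a three-way array can be fitted with a bare-bones alternating least squares loop in NumPy. This is a sketch on a hypothetical noise-free tensor, not a production implementation; dedicated packages add non-negativity constraints, convergence checks and careful initialization:

```python
import numpy as np

rng = np.random.default_rng(1)

def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R) -> (J*K x R)."""
    return np.stack([np.kron(B[:, r], C[:, r]) for r in range(B.shape[1])], axis=1)

def parafac_als(X, rank, n_iter=300):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating least squares."""
    I, J, K = X.shape
    A = rng.normal(size=(I, rank))
    B = rng.normal(size=(J, rank))
    C = rng.normal(size=(K, rank))
    X1 = X.reshape(I, J * K)                      # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)   # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)   # mode-3 unfolding
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Noise-free rank-2 toy tensor, e.g. samples x emission x excitation.
A0, B0, C0 = (rng.normal(size=(d, 2)) for d in (10, 8, 6))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = parafac_als(X, rank=2)
err = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X)
```

For fluorescence excitation-emission data, the recovered factor columns are directly interpretable as per-sample concentrations and per-component emission and excitation profiles, which is why the trilinear model fits that domain so naturally.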

 

*********************************

 

Chemometric strategies for analyzing multivariate data coming from designed experiments: an overview

 

It has been well known since the early 1930s, when the concept was first systematized by Fisher, that in order to obtain the maximum information from chemical measurements, the experiments to be conducted have to be rationally designed. At the same time, a statistical toolbox was provided for analyzing the results of designed experiments in terms of which factors and interactions account for significant variability in the outcomes and to what extent: the proposed methods were collectively referred to as Analysis of Variance (ANOVA) and were soon extended to cope with multivariate responses (Multivariate Analysis of Variance, MANOVA).

Although MANOVA is a well-established method for the analysis of multivariate data coming from designed experiments, it requires the measured variables to be as uncorrelated as possible, and their number should be lower than that of the observations. Since these requirements are not met by most of the fingerprinting techniques nowadays used to characterize the systems under investigation, especially in fields like omics or food science, new techniques have been introduced in the literature to address the issue of extracting as much information as possible from the multivariate profiles collected in designed experiments, among them ANOVA-PCA, ANOVA-Simultaneous Component Analysis (ASCA), and ANOVA-Target Projection (ANOVA-TP).

In the present communication, the main characteristics of the different approaches will be presented, also by means of real-world examples, stressing the shared and distinctive features among the methods, especially as far as significance testing and interpretation are concerned.
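The core ASCA idea, an ANOVA partition of the data matrix followed by a simultaneous component analysis of each effect matrix, can be sketched in a few lines. This is a hypothetical toy example with a single fixed factor; real ASCA implementations handle multiple factors, interactions and permutation-based significance testing:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy design: one fixed factor with 3 levels, 10 replicates each, 50 variables.
levels = np.repeat([0, 1, 2], 10)
n, p = levels.size, 50
X = rng.normal(size=(3, p))[levels] + rng.normal(scale=0.5, size=(n, p))

# ANOVA step: split the centered data into a factor-effect matrix and residuals.
Xc = X - X.mean(axis=0)
effect = np.vstack([Xc[levels == g].mean(axis=0) for g in (0, 1, 2)])[levels]
resid = Xc - effect

# SCA step: PCA (via SVD) of the effect matrix alone; scores separate the levels.
U, s, Vt = np.linalg.svd(effect, full_matrices=False)
scores = U[:, :2] * s[:2]

# Share of the centered sum of squares attributable to the factor.
ssq_effect = (effect ** 2).sum() / (Xc ** 2).sum()
```

Because the PCA is applied to the effect matrix rather than to the raw data, the resulting loadings describe which variables respond to the factor, even when the number of variables far exceeds the number of observations, which is exactly the regime where MANOVA breaks down.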

 

*********************************

 

Some recent developments in Process Chemometrics

In the early days of chemometrics, the phrase "data overload" was frequently used to motivate the use of multivariate methods. Principal Component Analysis (PCA) and Partial Least Squares (PLS) were the workhorses of chemometrics for extracting useful information from large datasets. However, what was considered large in those days is nothing compared to the data tsunami often encountered today. For instance, the process industry is heavily instrumented, and large data streams are collected on a continuous basis. PCA and PLS are still useful tools, but they are often not sufficient to deal with the currently observed data overload. Newer multivariate techniques are available to deal with the growing size and complexity of data sets and to provide better insight into the underlying structures. The challenge in process chemometrics is to select the proper data arrangement and data analysis approach to match the process engineering questions at hand. Data analysis strategies will be explained using real-world examples.
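One classical use of PCA on process data streams is statistical process monitoring: a latent-variable model fitted on normal operating data flags deviating samples through Hotelling's T². The sketch below uses synthetic data and is only illustrative; industrial schemes add control limits, the Q/SPE residual statistic and contribution plots for fault diagnosis:

```python
import numpy as np

rng = np.random.default_rng(3)

# "Normal operation" training data: 200 samples of 8 process variables
# driven by 2 latent factors plus measurement noise.
n, p, k = 200, 8, 2
P_true = rng.normal(size=(k, p))
X = rng.normal(size=(n, k)) @ P_true + 0.2 * rng.normal(size=(n, p))

# Fit a k-component PCA model on autoscaled data.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:k].T                       # loadings
var = (s[:k] ** 2) / (n - 1)       # variance of each score

def t2(x):
    """Hotelling's T^2 of a new sample against the fitted PCA model."""
    t = ((x - mu) / sigma) @ P
    return float((t ** 2 / var).sum())

x_normal = rng.normal(size=k) @ P_true + 0.2 * rng.normal(size=p)
x_fault = x_normal + 5.0 * P_true[0]   # abnormal shift along a latent direction
print(t2(x_normal), t2(x_fault))
```

The same two-statistic logic (T² inside the model plane, Q off it) scales to the continuous data streams mentioned above, which is one reason PCA remains a workhorse even as data volumes grow.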

 

 
