- This event has passed.

# YES IX: "Scalable Statistics: on Accuracy and Computational Complexity"

## Mar 7, 2018 - Mar 9, 2018

#### Sponsors

#### Summary

The role of scalable inference methods is becoming more and more central in statistics and machine learning, due to an explosive increase in size and complexity of datasets. In such large-scale and complex scenarios that is, in a practical sense, a diminishing relevance of classic statistical methods that despite being endowed with sound statistical performance guarantees are not able to deal with the computational and memory constraints imposed by the existing hardware.

There are several avenues to deal with this problem. In some cases it is possible to relax the optimization problems that arise from the statistical procedure so that they become computationally appealing while still retaining good (sometimes nearly optimal) statistical performance. Another approach, which is fueled by the advent of cloud-computing, is to distribute the computation (and possibly data) among different machines. In its most simple form a dataset is split into smaller datasets that are processed in parallel on local servers - the results of those computations are then aggregated on a central machine. There are still no general principles that can be used to guide practitioners in splitting the data or later aggregating the results, and this is currently an active research area in the interface of statistics and computer science. Understanding the tradeoff between statistical accuracy and computational feasibility (in both distributed and non-distributed settings) is of paramount importance to address these issues.

This workshop aims to introduce this broad field of research to young researchers, including Ph.D. students, postdocs and junior early stage researchers, with a balanced focus on both theory and practice.

The workshop will take place at Eurandom, and will consist of tutorial courses given by three world experts in the field. There will be also contributed talks, as well as plenty of time for discussion.

#### Organizers

Rui Castro, TU Eindhoven

Paulo Serra, TU Eindhoven

Botond Szabó, Leiden University

#### Tutorial Speakers

Jianqing Fan, Princeton University

-Sparse Learning with Control of Statistical Errors and Computing Resources -Distributed PCA -A principle of Robustification for Big Data |

David Dunson, Duke University

Bayesian inference for big data |

Quentin Berthet, Cambridge University

Trade-offs in statistical learning |

#### Programme

Abstracts of Tutorial talks

**Jianqing Fan
**i) Sparse Learning with Control of Statistical Errors and Computing Resources

High-dimensional sparse learning and analysis of Big Data data pose significant challenges on computation and communication. Scalable statistical procedures need to take into account both statistical errors and computing resource constraints. This talk illustrate this idea by solving sparse learning via a computational framework named iterative local adaptive majorize-minimization (I-LAMM) to simultaneously control algorithmic complexity and statistical error when fitting high dimensional sparse models via a family of folded concave penalized quasi-likelihood. The algorithmic complexity and statistical errors are explicitly given and we show that the algorithm achieves the optimal statistical error rate under the weakest signal strength assumptions. |

ii) Distributed Estimation of Principal Eigenspaces

Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that contribute to the most variation of the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance for the resulting distributed estimator of the top $K$ eigenvectors. In particular, we show that for distributions with symmetric innovation, the distributed PCA is ``unbiased''. We derive the rate of convergence for distributed PCA estimators, which depends explicitly on the effective rank of covariance, eigen-gap, and the number of machines. We show that when the number of machines is not unreasonably large, the distributed PCA performs as well as the whole sample PCA, even without full access of whole data. The theoretical results are verified by an extensive simulation study. We also extend our analysis to the heterogeneous case where the population covariance matrices are different across local machines but share similar top eigen-structures. (Joint with Dong Wang, Kaizheng Wang, Ziwei Zhu) |

iii) A principle of Robustification for Big Data

Heavy-tailed distributions are ubiquitous in modern statistical analysis and machine learning problems. This talk gives a simple principle for robust high-dimensional statistical inference via an appropriate shrinkage on the data. This widens the scope of high-dimensional techniques, reducing the moment conditions from sub-exponential or sub-Gaussian distributions to merely bounded second moment. As an illustration of this principle, we focus on robust estimation of the low-rank matrix from the trace regression model. It encompasses four popular problems: sparse linear models, compressed sensing, matrix completion, and multi-task regression. Under only bounded $2+\delta$ moment condition, the proposed robust methodology yields an estimator that possesses the same statistical error rates as previous literature with sub-Gaussian errors. We also illustrate the idea for estimation of large covariance matrix. The benefits of shrinkage are also demonstrated by financial, economic, and simulated data. (Joint with Weichen Wang and Zhiwei Zhu) |

**David Dunson
**My talks will focus on recent methodology for scaling up Bayesian inference for huge data sets. The three talks will focus on the respective topics:

(i) Simple and efficient divide-and-conquer algorithms for huge sample sizes

(ii) Approximate MCMC algorithms for accelerating inference in high-dimensional problems

(iii) Modeling low-dimensional structure in high-dimensional data

The emphasis will be on presenting practical and general new computational and modeling frameworks that are concretely motivated by real world applications, while having theoretical guarantees.

**Quentin Berthet
**The field of statistics has been increasingly concerned with the potential limits inspired on statistical procedures. This is inspired by real-life constraints such as limits on time for computation, on reliability of observations, or communication between agents.

In this series of talks, I will explore the notion of constraints on learning procedures, and discuss the impact that they can have on statistical precision. I will show how these constraints can be shown to have a concrete cost on the statistical performance of these procedures, by describing several examples.

The first talk will be more tutorial and include some introductory material.

Based on joint work with Philippe Rigollet, Venkat Chandrasekaran, Jordan Ellenberg, Piyush Srivastava, and Nicolai Baldin.

#### Registration

There is no registration fee, but registration is compulsory for organizational purposes.

Please follow this link to register online.

#### Practical Information

Link to Information Page

On the information page, also more about our preferred hotel, for those who need accommodation and need to book this personally.