S T A R
Stochastics - Theoretical and Applied Research
STAR OUTREACH DAY
"NETWORKS AND BIG DATA"
December 12, 2014
Eurandom - Eindhoven
Frank van der Meulen (TU Delft)
Nelly Litvak (TU Twente)
Simon Tavaré (Cancer Research UK Cambridge Institute)
Professor Simon Tavaré is Director of the Cancer Research UK Cambridge Institute and a Professor in the Department of Applied Mathematics and Theoretical Physics at the University of Cambridge. His research focusses on cancer genomics (for example, next-gen sequencing), evolutionary approaches to cancer, and stochastic computation (for example, Approximate Bayesian Computation). He is an elected Fellow of the Academy of Medical Sciences and of the Royal Society, and has recently been named President-designate of the London Mathematical Society.
Sebastiano Vigna (Università degli Studi di Milano)
Professor Sebastiano Vigna's research interests lie in the interaction between theory and practice. He has worked on highly theoretical topics such as computability on the reals, distributed computability, self-stabilization, minimal perfect hashing, succinct data structures, query recommendation, algorithms for large graphs and theoretical/experimental analysis of spectral rankings such as PageRank, but he is also (co)author of several widely used software tools ranging from high-performance Java libraries to a model-driven software generator, a search engine, a crawler, a text editor and a graph compression framework. In 2011 he collaborated on the computation of the distance distribution of the whole Facebook graph, from which it was possible to conclude that on Facebook there are just 3.74 degrees of separation.
Ernst Wit (University of Groningen)
Ernst Wit is Professor of Statistics and Probability at the University of Groningen. He has a wide interest in the field and applications of statistics, ranging from high-dimensional inference, network analysis, statistical bioinformatics and systems biology to optimal design and algebraic approaches to statistics.
Tijs van den Broek (TNO and University of Twente)
Tijs van den Broek is a researcher at TNO and a PhD student at the University of Twente. His research focuses on the use of social media by activists to persuade firms to adopt sustainable policies and practices. He has recently received a prestigious Twitter data grant to analyze the diffusion and effectiveness of cancer early detection campaigns.
Tom Claassen (Radboud University Nijmegen)
Tom Claassen is a postdoctoral researcher at the Institute for Computing and Information Sciences at Radboud University Nijmegen. His research focusses on causal discovery methods and applications. Presently he works on the NWO project 'More Confidence in Causal Discovery', and the EU-FP7 projects 'Matrics' and 'Aggressotype' on antisocial conduct disorders in adolescents. He is a winner of the Willem R. van Zwet Award 2013 for the best PhD thesis in Statistics and Operations Research.
Sanchayan Sen (Eindhoven University of Technology)
Sanchayan Sen works on random graphs, random walks in random environments, percolation, concentration inequalities and random matrices. He received his PhD from New York University, and has recently joined TU/e as a postdoc under the NWO Gravitation program 'Networks'.
Registration is free, but for organizational reasons compulsory. Please fill in the online form: REGISTRATION
|09.45 - 10.10||Registration|
|10.10 - 10.15||Opening|
|10.15 - 11.15||Simon Tavaré||Statistical aspects of tumour heterogeneity|
|11.15 - 11.30||Break|
|11.30 - 12.00||Sanchayan Sen||The structure of critical random graphs|
|12.00 - 12.30||Tijs van den Broek||The Diffusion And Effectiveness of Cancer Awareness Campaigns on Twitter|
|12.30 - 13.30||Lunch|
|13.30 - 14.30||Sebastiano Vigna||In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond|
|14.30 - 15.00||Tom Claassen||Modern day causal discovery: challenges and new applications|
|15.00 - 15.30||Break|
|15.30 - 16.30||Ernst Wit||Big Networks and Data|
Tijs van den Broek
The Diffusion And Effectiveness of Cancer Awareness Campaigns on Twitter
Twitter is gaining popularity as an instrument to
raise awareness about cancer research, cancer screenings, and cancer diagnosis
in an earlier stage of the disease. Our research team received a Twitter data
grant to study the diffusion process and effectiveness of these cancer awareness
campaigns. To do so, we analyze archival Twitter data about 9 popular Twitter
campaigns (e.g. Movember and Pink Ribbon), covering 6 cancer types and over 20
countries. First, we map the diffusion process in detail: what key events and
actors accelerate or decelerate the spreading of these campaigns? Second, we
assess the effect of the campaigns on the frequency and sentiment of tweets
about a particular type of cancer. Last, we determine the campaigns’ offline
effects: when do online cancer awareness campaigns result in offline behavior,
e.g. more donations to cancer research or an increase in cancer screenings?
Academically, our research adds to the growing body of knowledge about the
diffusion and effectiveness of healthcare awareness campaigns. Practically, our
research provides insights to healthcare organizations, patients, NGOs and
(social) enterprises on how to optimally organize cancer awareness campaigns.
Modern day causal discovery: challenges and new applications
Discovering causal relations from data lies at the heart of most scientific research today. Researchers want to know, for example, if neonicotinoid pesticides affect species distributions, or which social, genetic, and cognitive traits increase the chance of behavioural disorders in school-age children. All such questions tend to involve analysis of available data in order to understand why and how things happen, and in particular to predict what will change if we intervene on (part of) the system, e.g. through treatments or policies. This is the domain of causal inference.
Current state-of-the-art methods are, in principle, able to infer such causal relationships from purely observational data. However, in practice, there is still a big gap between nice theoretical properties and routine application of these methods to the analysis of scientific experiments. For example, in the “big data” era, observational databases are abundant, and likely to hide a wealth of important causal information, but existing implementations are often ill-suited to handle the corresponding large, high-dimensional data sets. Also, in real-world data many of the standard assumptions, such as multivariate Gaussian or independent, identically distributed (i.i.d.) data, do not apply, and standard causal algorithms may yield unreliable results or fail to run altogether.
Our aim is to bridge the gap between state-of-the-art statistical causal discovery and its application to data from real-world studies. We discuss how existing methods can be adapted to provide more robust, informative, and realistic estimates on underlying causal mechanisms, and how to combine information from multiple data sets.
The talk will start with a gentle introduction to the basic principles behind mainstream causal paradigms, before focussing on two examples from the fields of ecology and medicine to illustrate the challenges that still need to be overcome and how we can tackle them. The aim is to give an impression of the purpose, power, and limitations of current approaches, but also to demonstrate the potential benefits that can be unlocked by applying principled statistical causal discovery algorithms in many areas of scientific research.
The structure of critical random graphs
Link to abstract: Sen
Statistical aspects of tumour heterogeneity
In this talk
I will provide the biological context for studies of cancer heterogeneity, and
then focus on two statistical aspects that arise in its study. The first, joint
work with Malvina Josephidou and Andy Lynch, involves the identification of
somatic mutations from DNA sequence data from multiple spatially or temporally
separated tumour samples from the same cancer patient. Instead of performing
multiple pairwise analyses of a single tumour sample and a matched normal, we
jointly consider all available samples under a Bayesian framework to increase
sensitivity of calling shared SNVs. By leveraging information from all available
samples, we are able to detect rare mutations from sequencing experiments. In
the second, joint work with Anestis Touloumis and John Marioni, I will describe
some recent results on analysing the structure of the mean matrix in
high-dimensional transposable data, as might be found from expression arrays
from multiple samples taken from the same patient.
To address this, we have developed a generic and computationally inexpensive nonparametric testing procedure to assess the hypothesis that, in each predefined subset of columns (rows), the column (row) mean vector remains constant. The method will be illustrated with a glioblastoma data set.
In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond
We approach the problem of computing geometric centralities, such as closeness and harmonic centrality, on very large graphs; traditionally this task requires an all-pairs shortest-path computation in the exact case, or a number of breadth-first traversals for approximate computations, but these techniques yield very weak statistical guarantees on highly disconnected graphs. We instead assume that the graph is accessed in a semi-streaming fashion, that is, that adjacency lists are scanned almost sequentially, and that a very small amount of memory (on the order of a dozen bytes) per node is available in core memory. We leverage newly discovered algorithms based on HyperLogLog counters, making it possible to approximate a number of geometric centralities at very high speed and with high accuracy. While the application of similar algorithms for the approximation of closeness was attempted in the MapReduce framework, our exploitation of HyperLogLog counters exponentially reduces the memory footprint, paving the way for in-core processing of networks with a hundred billion nodes using “just” 2TiB of RAM. Moreover, the computations we describe are inherently parallelizable, and scale linearly with the number of available cores.
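For readers unfamiliar with the two centralities named in this abstract, the following is a minimal sketch of the exact, breadth-first computation that the abstract describes as the traditional approach. It is not HyperBall and uses no HyperLogLog counters. The toy graph, the function names and the choice of an undirected graph are assumptions made purely for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return shortest-path distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def harmonic_centrality(adj, x):
    """Sum of 1/d(x, y) over reachable y != x; unreachable nodes contribute 0."""
    dist = bfs_distances(adj, x)
    return sum(1.0 / d for node, d in dist.items() if node != x)


def closeness_centrality(adj, x):
    """Reciprocal of the sum of distances to reachable nodes (0 if none)."""
    total = sum(bfs_distances(adj, x).values())
    return 1.0 / total if total > 0 else 0.0


# Hypothetical undirected toy graph: a path a - b - c plus an isolated node d.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": [],
}

for node in toy_graph:
    print(node, harmonic_centrality(toy_graph, node), closeness_centrality(toy_graph, node))
```

Each call runs one full traversal, so computing the value for every node this way costs one traversal per node, which is what becomes infeasible on very large graphs.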
Big Networks and Data
A graph is one possible way to describe complex relationships between many actors, such as RNA, proteins and metabolites. In many cases, genomic data comes from large monitoring systems with no prior screening. Because such indiscriminate data collection is combined with the structured nature of genomic interactions, the actual set of relationships tends to be sparse.
When data is obtained from noisy measurements of (some of) the nodes in the graph, graphical models present an appealing and insightful way to describe graph-based dependencies between the random variables. Although the precise estimation of the parameters in the graphical model may still be of interest, the main aim of inference is to recover the underlying structure of the graph. Graphical lasso and related methods opened up the field of sparse graphical model inference in high dimensions. In this talk we will introduce the topic and give an overview of the current state of the art.
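As a small illustration of why the graph structure, rather than the parameter values, is the object of interest, the sketch below uses the standard fact that in a Gaussian graphical model a zero entry in the inverse covariance (precision) matrix corresponds to a missing edge. It is not the graphical lasso: it inverts a known covariance matrix exactly instead of estimating a sparse precision matrix from noisy data. The three-variable covariance matrix and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a small square matrix exactly using Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Swap a row with a nonzero pivot into position (assumes invertibility).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edges_from_precision(precision):
    """List the pairs (i, j), i < j, with a nonzero precision entry."""
    n = len(precision)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if precision[i][j] != 0]


# Hypothetical covariance of three variables forming a chain 0 - 1 - 2.
# Every pair is correlated, so the covariance matrix itself is dense.
covariance = [
    [Fraction(1), Fraction(1, 2), Fraction(1, 4)],
    [Fraction(1, 2), Fraction(1), Fraction(1, 2)],
    [Fraction(1, 4), Fraction(1, 2), Fraction(1)],
]

precision = invert(covariance)
edges = edges_from_precision(precision)
print(edges)
```

Exact fractions are used so that the zero in the precision matrix is an exact zero rather than a small floating-point number; with estimated covariances from real data this is no longer the case, which is where penalized methods come in.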
Eurandom, Mathematics and Computer Science Dept, TU Eindhoven,
Den Dolech 2, 5612 AZ EINDHOVEN, The Netherlands
Eurandom is located on the campus of Eindhoven University of Technology, in the Metaforum building (4th floor). The university is a 10-minute walk from Eindhoven main railway station (take the north exit and walk towards the tall building on the right with the sign TU/e).
For those arriving by plane, there is a convenient direct train connection between Amsterdam Schiphol airport and Eindhoven. This trip takes about one and a half hours. For more detailed information, please consult the NS travel information pages or see the Eurandom web page on location.
Many low-cost carriers also fly to Eindhoven Airport. There is a bus connection (route number 401) from the airport to Eindhoven central railway station. For details on departure times, consult http://www.9292ov.nl
The University can be reached easily by car from the highways leading to Eindhoven (for details, see our route descriptions or consult our map with highway connections).
The meeting-room is equipped with a data projector, an overhead projector, a projection screen and a blackboard. Please note that speakers and participants making an oral presentation are kindly requested to bring their own laptop or their presentation on a memory stick.
Upon arrival, participants should register with the workshop officer, and collect their name badges. The workshop officer will be present for the duration of the conference, taking care of the administrative aspects and the day-to-day running of the conference: registration, issuing certificates and receipts, etc.
Should you need to cancel your participation, please contact Patty Koorn, the Workshop Officer.
Mrs. Patty Koorn, Workshop Officer, Eurandom/TU Eindhoven, email@example.com