In this talk, I will present my work on assessing the cyber and biological threats in different countries. The first part of the talk will focus on cyber-crime. I will examine how exposure to and hosting of attacks vary internationally and test hypotheses about factors behind such variation. I will use Symantec's Worldwide Intelligence Network Environment (WINE) data, which consist of attack reports from more than 10 million customer computers worldwide. In the second part of the talk, I will present a methodology to assess countries’ bioweapon and cyber-warfare capabilities. The methodology consists of a socio-cultural model that assesses countries’ motivations for these capabilities and indicators that assess countries’ latent abilities to acquire such capabilities. I will validate the methodology by comparing its outputs to historical data.
Over the past decade, our lab has developed a variety of new statistical and computational approaches for detection of emerging events, and other relevant patterns, in data. This talk will focus on our most recent work in scaling up these approaches to deal with the size and complexity of data needed to address important real-world problems at the societal scale. As two concrete examples, we consider detection of "novel" disease outbreaks (e.g., those with unexpected patterns of symptoms or affected subpopulations) using free-text Emergency Department visit records, and prediction of civil unrest using online social network data. To deal with the massive size and high dimensionality of real-world data, we propose the fast multidimensional subset scan, a novel approach for accurate and computationally efficient pattern detection. Subset scanning treats the pattern detection problem as a search over subsets of data records and attribute values, finding those subsets which maximize some score function. One key insight is that this search over subsets can be performed very efficiently, reducing run times from years to milliseconds, using the "linear-time subset scanning" property of many commonly used score functions such as likelihood ratio statistics. To deal with the complexity of real-world data, we present two new approaches, the semantic scan and the non-parametric heterogeneous graph scan, which incorporate free-text data and heterogeneous social network data into the subset scan framework. Finally, we demonstrate that these approaches achieve more accurate, precise, and computationally efficient detection and prediction of real-world events, as compared to the current state of the art. This work is in collaboration with many current and former members of Heinz College's Event and Pattern Detection Laboratory and is supported by funding from the National Science Foundation and MacArthur Foundation.
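The "linear-time subset scanning" idea in the abstract above can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: the score function is taken to be the expectation-based Poisson log-likelihood ratio, each record has an observed count and a positive expected baseline, and records are ranked by the ratio of count to baseline. The function names are hypothetical. The sketch evaluates only the n "top-k" subsets (after a sort) and checks the result against an exhaustive search over all 2^n subsets on a tiny example.

```python
import math
from itertools import combinations


def poisson_score(c, b):
    """Expectation-based Poisson log-likelihood ratio for total count c, baseline b."""
    if b <= 0 or c <= b:
        return 0.0
    return c * math.log(c / b) + b - c


def ltss_scan(counts, baselines):
    """Score only the top-k subsets when records are ranked by count/baseline.

    Assumes every baseline is positive. Evaluates n subsets instead of 2^n.
    """
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_subset = 0.0, []
    c_sum = b_sum = 0.0
    for k, i in enumerate(order, start=1):
        c_sum += counts[i]
        b_sum += baselines[i]
        score = poisson_score(c_sum, b_sum)
        if score > best_score:
            best_score, best_subset = score, sorted(order[:k])
    return best_score, best_subset


def brute_force_scan(counts, baselines):
    """Exhaustive search over all non-empty subsets, for comparison only."""
    n = len(counts)
    best_score, best_subset = 0.0, []
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            c = sum(counts[i] for i in subset)
            b = sum(baselines[i] for i in subset)
            score = poisson_score(c, b)
            if score > best_score:
                best_score, best_subset = score, list(subset)
    return best_score, best_subset


# Illustrative, made-up data: observed counts and expected baselines per record.
counts = [12, 3, 9, 5, 20, 4]
baselines = [5.0, 4.0, 8.0, 5.0, 6.0, 4.5]

fast_score, fast_subset = ltss_scan(counts, baselines)
slow_score, slow_subset = brute_force_scan(counts, baselines)
print(fast_subset, round(fast_score, 4), round(slow_score, 4))
```

The sort makes this sketch O(n log n) overall; "linear" here refers to the number of subsets scored. The multidimensional, semantic, and graph-based variants described in the abstract are not covered by this illustration.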
Inverse Propensity Score Weighting is one of the most popular estimation techniques for causal inference in the presence of selection bias. While logistic regression remains the standard choice for estimating propensity scores among applied researchers, there have been some recent attempts at using semi-parametric and non-parametric estimation techniques. We extend this literature by introducing a data-driven procedure to choose between different propensity score estimation models. In particular, we compare propensity score estimates obtained from a linear probit model as well as from semi-parametric classifiers like Naive Bayes, Random Forests, and Support Vector Machines. We evaluate three measures that can be used to choose between the different propensity score estimates: Covariate Imbalance, Calibration Error, and Classification Error Rate. We show via two sets of simulation studies why it is useful to choose from a variety of propensity score models. In particular, we find that propensity score estimates with Minimum Covariate Imbalance perform very well in terms of Mean Squared Error of Average Treatment Effect estimates across all simulations. We also find that the best classifier (the one with the lowest Classification Error Rate) is not necessarily the best choice for propensity score estimation. Finally, we apply our method to a large public health dataset from India to study the effectiveness of the Safe Motherhood Scheme, or Janani Suraksha Yojana, one of the world's largest conditional cash transfer programs, wherein women receive monetary compensation from the government for giving birth in a health care facility.
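As a rough illustration of choosing among propensity score estimates by covariate imbalance, the sketch below works through a tiny made-up dataset. Everything specific in it is an assumption rather than the speakers' procedure: the imbalance measure is taken to be the absolute weighted mean difference of a single covariate divided by its unweighted standard deviation, the treatment effect uses a normalized (Hajek-style) inverse propensity weighted difference in means, and the two candidate propensity vectors are hard-coded instead of being fitted by probit, Naive Bayes, or other models.

```python
from statistics import pstdev


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def ipw_weights(treatment, propensity):
    """Inverse propensity weights: 1/e for treated units, 1/(1-e) for controls."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e)
            for t, e in zip(treatment, propensity)]


def group(values, weights, treatment, arm):
    """Values and weights restricted to one treatment arm (1 or 0)."""
    vals = [v for v, t in zip(values, treatment) if t == arm]
    wts = [w for w, t in zip(weights, treatment) if t == arm]
    return vals, wts


def covariate_imbalance(x, treatment, propensity):
    """Assumed measure: |weighted mean difference| / unweighted std of x."""
    w = ipw_weights(treatment, propensity)
    diff = (weighted_mean(*group(x, w, treatment, 1))
            - weighted_mean(*group(x, w, treatment, 0)))
    return abs(diff) / pstdev(x)


def ipw_ate(y, treatment, propensity):
    """Normalized IPW estimate of the average treatment effect."""
    w = ipw_weights(treatment, propensity)
    return (weighted_mean(*group(y, w, treatment, 1))
            - weighted_mean(*group(y, w, treatment, 0)))


def select_propensity_model(x, treatment, candidates):
    """Pick the candidate whose weights leave the least covariate imbalance."""
    return min(candidates,
               key=lambda name: covariate_imbalance(x, treatment, candidates[name]))


# Made-up data: treatment is more likely when x = 1, and y = 2*t + 3*x,
# so the true treatment effect is 2 and x confounds the naive comparison.
x = [1, 1, 1, 1, 0, 0, 0, 0]
t = [1, 1, 1, 0, 1, 0, 0, 0]
y = [2 * ti + 3 * xi for ti, xi in zip(t, x)]

candidates = {
    "constant": [0.5] * len(x),                              # ignores x
    "stratified": [0.75 if xi == 1 else 0.25 for xi in x],   # matches the design
}

best = select_propensity_model(x, t, candidates)
ate_best = ipw_ate(y, t, candidates[best])
ate_constant = ipw_ate(y, t, candidates["constant"])
print(best, ate_best, ate_constant)
```

In this toy case the candidate that balances the covariate also recovers the effect used to generate the data, while the constant propensity reproduces the confounded difference in means. With several covariates, an aggregate such as the mean or maximum imbalance across covariates would be needed; which aggregate the speakers use is not stated in the abstract.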
Speaker: Hanna Wallach (Microsoft Research / U Mass)