Whether it's making music, movies, or encyclopedias, collaborative projects in online communities are becoming more common. Yet we know little about how these creative teams form, and what leads to their ultimate success. In this talk, I will discuss a study of an online songwriting community called February Album Writing Month (FAWM.ORG). By analyzing four years of longitudinal behavioral data using a novel path-based regression method, which performs random walks on the social network itself, we can both (1) accurately predict "collabs" that will form, and (2) gain insight into the factors that affect how they form, contributing to theory as well. Combined with a large-scale survey of community members, we find that communication, compatible but complementary interests, and slight differences in social status are major contributors to collab formation, and that an equitable division of labor is a key factor in a collab's success.
FYI: Burr Settles is a staff scientist and software engineer at Duolingo, an award-winning platform for free language education. Recently, he has been spearheading the company's "Test Center" initiative. He also runs FAWM.ORG, an annual online collaborative songwriting experiment. He was previously a postdoc in machine learning at Carnegie Mellon University, and earned a PhD in computer sciences from the University of Wisconsin-Madison. His book Active Learning — an introduction to learning algorithms that are adaptive, curious, or exploratory (if you will) — was published by Morgan & Claypool in 2012. He has also co-organized several workshops on the subject (e.g., at ICML and NAACL-HLT). Burr gets around by bike and among other things plays guitar in the pop band Delicious Pastries.
Inverse Propensity Score Weighting is one of the most popular estimation techniques for causal inference in the presence of selection bias. While logistic regression remains the standard choice for estimating propensity scores among applied researchers, there have been some recent attempts at using semi-parametric and non-parametric estimation techniques. We extend this literature by introducing a data-driven procedure to choose between different propensity score estimation models. In particular, we compare propensity score estimates obtained from a linear probit model as well as from semi-parametric classifiers like Naive Bayes, Random Forests and Support Vector Machines. We evaluate three measures that can be used to choose between the different propensity score estimates: Covariate Imbalance, Calibration Error and Classification Error Rate. We show via two sets of simulation studies why it is useful to choose from a variety of propensity score models. In particular, we find that propensity score estimates with Minimum Covariate Imbalance perform very well in terms of Mean Squared Error of Average Treatment Effect estimates across all simulations. We also find that the best classifier (with the lowest Classification Error Rate) is not necessarily the best choice for propensity score estimation. Finally, we apply our method to a large public health dataset from India to study the effectiveness of the Safe Motherhood Scheme or Janani Suraksha Yojana, one of the world's largest conditional cash transfer programs, wherein women receive monetary compensation from the government for giving birth in a health care facility.
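For readers unfamiliar with the technique, the following is a minimal sketch of inverse propensity score weighting and of a covariate imbalance check. It is not the speakers' procedure: the function names, the toy data, and the specific imbalance measure (absolute difference in weighted covariate means) are assumptions made for illustration, and the propensity scores are supplied by hand rather than estimated by a model.

```python
def ipw_ate(treated, outcome, propensity):
    """Normalized (Hajek) inverse-propensity-weighted estimate of the ATE."""
    # Treated units are weighted by 1/p, control units by 1/(1-p).
    w1 = [t / p for t, p in zip(treated, propensity)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, propensity)]
    mean1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean1 - mean0


def covariate_imbalance(treated, covariate, propensity):
    """Assumed measure: absolute gap in weighted covariate means between groups."""
    return abs(ipw_ate(treated, covariate, propensity))


# Hypothetical toy data: treatment is more likely when x = 1 (selection bias),
# and the outcome is y = 2*t + 3*x, so the true treatment effect is 2.
x = [0, 0, 0, 0, 1, 1, 1, 1]
t = [1, 0, 0, 0, 1, 1, 1, 0]
y = [2 * ti + 3 * xi for ti, xi in zip(t, x)]

true_scores = [0.25 if xi == 0 else 0.75 for xi in x]  # matches how t was assigned
flat_scores = [0.5] * len(x)                           # ignores the covariate

for name, scores in [("true", true_scores), ("flat", flat_scores)]:
    print(name, ipw_ate(t, y, scores), covariate_imbalance(t, x, scores))
```

In this toy example the propensity scores that balance the covariate also recover the true effect, while the flat scores leave the covariate imbalanced and reproduce the biased naive difference in means.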
Speaker: Hanna Wallach (Microsoft Research / U Mass)
Monday 1.26.2015 4.00pm
GHC 6115 (NEW ROOM! Was GHC 4405)
Speaker: Ghita Mezzour (ECE/CMU)
Topic: Assessing the Global Cyber and Biological Threat
In this talk, I will present my work on assessing the cyber and biological threats in different countries. The first part of the talk will focus on cyber-crime. I will examine how exposure to and hosting of attacks vary internationally and test hypotheses about factors behind such variation. I will use Symantec's Worldwide Intelligence Network Environment (WINE) data, which consist of attack reports from more than 10 million customer computers worldwide. In the second part of the talk, I will present a methodology to assess countries’ bioweapon and cyber-warfare capabilities. The methodology consists of a socio-cultural model that assesses countries’ motivations for these capabilities and indicators that assess countries’ latent abilities to acquire such capabilities. I validate my methodology by comparing the methodology’s outputs to historical data.
Monday 2.9.2015 4.00pm
Speaker: Daniel B. Neill (Heinz/CMU)
Topic: Event and Pattern Detection at the Societal Scale
Over the past decade, our lab has developed a variety of new statistical and computational approaches for detection of emerging events, and other relevant patterns, in data. This talk will focus on our most recent work in scaling up these approaches to deal with the size and complexity of data needed to address important real-world problems at the societal scale. As two concrete examples, we consider detection of "novel" disease outbreaks (e.g., those with unexpected patterns of symptoms or affected subpopulations) using free-text Emergency Department visit records, and prediction of civil unrest using online social network data. To deal with the massive size and high dimensionality of real-world data, we propose the fast multidimensional subset scan, a novel approach for accurate and computationally efficient pattern detection. Subset scanning treats the pattern detection problem as a search over subsets of data records and attribute values, finding those subsets which maximize some score function. One key insight is that this search over subsets can be performed very efficiently, reducing run times from years to milliseconds, using the "linear-time subset scanning" property of many commonly used score functions such as likelihood ratio statistics. To deal with the complexity of real-world data, we present two new approaches, the semantic scan and the non-parametric heterogeneous graph scan, which incorporate free-text data and heterogeneous social network data into the subset scan framework. Finally, we demonstrate that these approaches achieve more accurate, precise, and computationally efficient detection and prediction of real-world events, as compared to the current state of the art. This work is in collaboration with many current and former members of Heinz College's Event and Pattern Detection Laboratory and is supported by funding from the National Science Foundation and the MacArthur Foundation.
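As a rough illustration of the "linear-time subset scanning" idea mentioned in the abstract, here is a simplified, single-dimension sketch. It is not the fast multidimensional subset scan from the talk. It assumes the expectation-based Poisson likelihood ratio statistic as the score function and the ratio of observed count to expected baseline as the record priority; the function names and the toy counts are hypothetical.

```python
import math


def poisson_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio for an aggregated subset."""
    if count <= baseline:
        return 0.0  # no excess over expectation, so nothing to detect
    return count * math.log(count / baseline) + baseline - count


def ltss_best_subset(counts, baselines):
    """Find the highest-scoring subset by scoring only n prefixes, not 2^n subsets."""
    # Sort records by priority (observed / expected), highest first.
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_k = 0.0, 0
    c_sum = b_sum = 0.0
    for k, i in enumerate(order, start=1):
        c_sum += counts[i]
        b_sum += baselines[i]
        score = poisson_score(c_sum, b_sum)
        if score > best_score:
            best_score, best_k = score, k
    return best_score, sorted(order[:best_k])


# Hypothetical observed counts and expected baselines for five records.
counts = [12, 3, 9, 5, 20]
baselines = [5.0, 4.0, 8.0, 5.0, 10.0]

best_score, best_indices = ltss_best_subset(counts, baselines)
print(best_score, best_indices)
```

The sort costs O(n log n), but only n candidate subsets are scored; for score functions with this property, the best prefix of the priority ordering is also the best subset overall, which is what makes an exhaustive search unnecessary.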
Friday 3.20.2015 1.00pm [NOTE DAY/LOCATION!]
Speaker: Ben Wellington (I Quant NY)
Topic: Open Data Science: Leveraging Public Data to Explore Urban Life
In this talk, I will discuss how I apply the data science techniques I use at a quantitative investment firm called Two Sigma to a new domain: public New York City data. From parking ticket geography to restaurant inspection scores to subway and taxi pricing, I will look at the work that has formed the foundations of my data and policy blog, I Quant NY. I will discuss best practices for data science in the policy space, explore relationships to data journalism, and highlight the various data-driven interactions I've had with City agencies.
FYI: Ben Wellington is the creator of I Quant NY, a data science and policy blog that focuses on insights drawn from New York City's public data. Over the last year, the blog has been featured in dozens of publications including FiveThirtyEight, The New York Times and The Atlantic. Ben is a contributor to the New Yorker, and is a visiting assistant professor in the City & Regional Planning program at the Pratt Institute in Brooklyn. His day job involves working as a quantitative analyst at the investment management firm Two Sigma. He holds a Ph.D. in Natural Language Processing from New York University.