In chronicling the events in a set of individuals' lives, encyclopedic biographies -- from Plutarch's Parallel Lives to Wikipedia -- provide an extraordinary amount of information detailing how the lives of the historically famous unfold. The life events described in these texts have natural structure: events exhibit correlations with each other (e.g., those who divorce must have been married), can occur at roughly similar times in the lives of different individuals (marriage is more likely to occur earlier in one's life than later), and can be bound to historical moments as well ("fights in World War II" peaks in the early 1940s). Social scientists have long been interested in the structure of these events when investigating the role that individual agency and larger social forces play in shaping the course of an individual's life, but the data on which these studies draw have largely been restricted to categorical surveys and observational data. In this talk, I'll present a latent-variable model that exploits the correlations among event descriptions in biographies to learn the structure of abstract events, grounded in time, from text alone.
At the same time, subjects of biographies are not a random sample of the population, nor are their contents unbiased representations. Nearly all encyclopedias necessarily prefer the historically notable (if for no other reason than inherent biases in the preservation of historical records); many, like Wikipedia, also have disproportionately low coverage of women, minorities, and other demographic groups. Rather than learning a representation of some true state of the world, what we are learning is a representation of discourse -- how a community is implicitly deciding to depict a set of individuals. In using this method to learn event classes from Wikipedia, we detect a strong systematic bias in the presentation of male and female biographies, with biographies of women containing nearly three times as much emphasis on events of marriage and divorce as biographies of men.
The randomized experiment has been an important tool for inferring the causal impact of an intervention; the most common analysis conducted in this context is the average treatment effect (ATE). However, the recent heterogeneous treatment effects literature has shown the utility of estimating the marginal conditional average treatment effect (MCATE), a treatment effect for a subpopulation -- i.e., respondents who share a particular subset of covariates. This literature proposes the use of data mining methods to estimate the exponential number (in the number of covariates) of MCATEs that exist in the data. However, each proposed method makes its own set of (restrictive) assumptions about the intervention's effect, the underlying data-generating process, and which subpopulations (MCATEs) to explicitly estimate. Moreover, most of the literature provides little guarantee on the estimation error of the specific MCATE estimates. Therefore, we propose Treatment Effect Subset Scan (TESS), a new method for identifying which subpopulation in a randomized experiment is most affected by a treatment. We frame the affected-subpopulation identification challenge as a pattern detection problem in which we maximize a nonparametric scan statistic (a measure of distributional divergence) over all subpopulations, while being extremely parsimonious about which specific subpopulations' effects to estimate. In this way, we identify the subpopulation that experiences the largest distributional change as a result of the intervention, while making minimal assumptions about the intervention's effect or the underlying data-generating process. Finally, we validate the efficacy of the method by identifying heterogeneous treatment effects in simulations and in well-known program evaluation studies.
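As a toy sketch of the scanning idea (not the TESS algorithm itself), the following simulates a randomized experiment in which the treatment shifts outcomes only in one subgroup, then scans a set of candidate subpopulations for the largest treated-vs-control divergence. The covariate names, the simple mean-difference score (standing in for the nonparametric scan statistic), and the parameters are all illustrative assumptions:

```python
import random
from statistics import mean

random.seed(0)
rows = []
for _ in range(4000):
    x1, x2 = random.randint(0, 1), random.randint(0, 1)
    treat = random.randint(0, 1)
    # The treatment shifts outcomes only in the x1 == 1 subgroup.
    y = random.gauss(0.0, 1.0) + 1.5 * treat * (x1 == 1)
    rows.append((x1, x2, treat, y))

def divergence(select):
    """|mean(treated) - mean(control)| within the selected subpopulation
    (a crude stand-in for a nonparametric scan statistic)."""
    t = [y for a, b, tr, y in rows if select(a, b) and tr == 1]
    c = [y for a, b, tr, y in rows if select(a, b) and tr == 0]
    if not t or not c:
        return float("-inf")
    return abs(mean(t) - mean(c))

# Candidate subpopulations, each defined by fixing covariate values.
subpops = {
    "x1=0": lambda a, b: a == 0,
    "x1=1": lambda a, b: a == 1,
    "x2=0": lambda a, b: b == 0,
    "x2=1": lambda a, b: b == 1,
    "x1=1,x2=0": lambda a, b: a == 1 and b == 0,
    "x1=1,x2=1": lambda a, b: a == 1 and b == 1,
}
best = max(subpops, key=lambda k: divergence(subpops[k]))
print(best)  # a subpopulation containing x1=1
```

The scan correctly singles out a subpopulation defined by x1=1, the subgroup the simulated treatment actually affects; TESS replaces the exhaustive dictionary above with a far more parsimonious search.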
Many people with serious diseases use online support groups to exchange social support. For these groups to be effective, members must both seek support and provide it; for the groups to be sustained, some members must continue to participate. This talk presents three studies in a large, online breast cancer group examining how people get support and the impact of support on group satisfaction and commitment. We use machine learning techniques to automate content analysis of 1.5 million messages, measuring the extent to which messages contain informational and emotional support, questions, and self-disclosure, as well as the strength of ties between participants. These variables are used in longitudinal regression analyses and structural equation modeling to predict the type of support people receive, their satisfaction with the exchanges, and their commitment to the group. Although members asked explicit questions to get informational support, they used both positive and negative self-disclosure to elicit emotional support. Because providing emotional support has implications for the relationship between the provider and recipient, it became less valuable as a signal of caring when it had to be explicitly requested. Moreover, failing to receive support after explicitly requesting it has negative consequences for the seeker’s face. Receiving either informational or emotional support positively predicted participants’ satisfaction with support exchanges. Moreover, recipients were more satisfied if the support they received matched the support they sought, at least for informational support. In contrast, they were equally satisfied with emotional and informational support after seeking emotional support, presumably because any response was an indicator that others in the community cared about them. Receiving support also influenced members’ continued participation in the group, with emotional support increasing their commitment and informational support decreasing it.
Epidemic processes over networks are frequently used as a model to understand how information, viruses, rumors, opinions, or failures spread in networks. This type of process is challenging to study because the behavior of the system depends on both the topology of the network and the dynamics of the process. We developed the scaled SIS (susceptible-infected-susceptible) process, an epidemic process over arbitrary network topologies that accounts for both spontaneous and neighbor-to-neighbor infection as well as healing. The scaled SIS process is modeled as a continuous-time Markov process, and we derive its closed-form equilibrium distribution for any arbitrary network topology. The adjacency matrix that describes the underlying network is explicitly reflected in this distribution.
We use the equilibrium distribution to formulate the Most-Probable Configuration Problem, which solves for the network configuration (i.e., the states of all the agents) with the maximum equilibrium probability. The agents who are infected in the most-probable configuration are therefore more vulnerable to the epidemic than the agents who remain healthy. Even though this is a combinatorial optimization problem, we can solve it exactly in polynomial time for a range of infection/healing parameters. Lastly, we will show the connection between subgraphs in the network and the identity of these more vulnerable agents.
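To illustrate what the Most-Probable Configuration Problem computes, here is a minimal brute-force sketch. It assumes, for illustration only, a Gibbs-form equilibrium with weight lam^(#infected agents) * gamma^(#edges between infected agents) -- the exact closed form and parameterization are those derived in the talk -- and it enumerates all 2^n configurations on a made-up five-node graph (a 4-clique plus a pendant node), whereas the actual result solves the problem in polynomial time for a range of parameters:

```python
from itertools import product

# Toy graph: a 4-clique on nodes 0-3 plus a pendant node 4 attached to node 3.
A = [[0, 1, 1, 1, 0],
     [1, 0, 1, 1, 0],
     [1, 1, 0, 1, 0],
     [1, 1, 1, 0, 1],
     [0, 0, 0, 1, 0]]
n = len(A)

lam, gamma = 0.25, 3.0  # illustrative infection/healing parameter ratios

def weight(x):
    """Unnormalized equilibrium probability of configuration x under the
    assumed Gibbs form: lam^(#infected) * gamma^(#edges among infected)."""
    infected = sum(x)
    infected_edges = sum(
        x[i] * x[j] * A[i][j] for i in range(n) for j in range(i + 1, n)
    )
    return lam ** infected * gamma ** infected_edges

# Brute force over all 2^n configurations (feasible only for toy graphs).
best = max(product([0, 1], repeat=n), key=weight)
print(best)  # (1, 1, 1, 1, 0)
```

With these parameters the most-probable configuration infects exactly the densely connected 4-clique and leaves the pendant node healthy, illustrating the claimed connection between dense subgraphs and the more vulnerable agents.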
Stereotypes of and prejudices against those “different” from “us” are omnipresent, in ways both big and small, in our social lives. While some are simple and devoid of affect (e.g., computer scientists like math), most stereotypes and prejudices exist within a complex web of socio-cognitive factors and can provoke significant emotional responses (e.g., computer science is only for men). These socio-cognitive factors are so complex that we are not aware of many of our own prejudices, and we can be unable to fully explain why we hold those we are cognizant of.
In this talk I will discuss three research projects focused on better understanding how stereotypes and prejudices coexist and coevolve with socio-cultural structure. The first applies latent Dirichlet allocation to a large dataset of Foursquare check-ins to test a social theory of segregation. The second uses a new agent-based model to better understand the effect of stereotypes on social network structure under different assumptions about the distribution of culture in artificial societies. The final project uses semi-supervised learning and a theory of stereotype content to better understand perceptions of individuals and social groups in newspaper data on the Arab Spring. I conclude with a synthesis of the three projects and a discussion of current and future work.
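As a hypothetical sketch of the first project's pipeline, one could treat each user's sequence of check-in venue categories as a "document" and fit LDA (here via scikit-learn) to recover latent mixtures of activity patterns per user; the users, venue labels, and number of components below are invented for illustration:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: one "document" per user, listing the categories of
# the venues they checked in to (real Foursquare data would replace this).
user_checkins = [
    "coffee_shop office gym coffee_shop office",
    "office coffee_shop office gym",
    "nightclub bar bar nightclub late_night_food",
    "bar nightclub bar late_night_food",
    "park playground grocery park school",
    "grocery park school playground",
]

counts = CountVectorizer().fit_transform(user_checkins)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
user_topic = lda.fit_transform(counts)  # rows: users; cols: latent mixtures
print(user_topic.shape)  # (6, 3)
```

Comparing how these latent mixtures are distributed across users and neighborhoods is one way such a model could be brought to bear on a theory of segregation.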