Model‐Based Clustering of Semiparametric Temporal Exponential‐Family Random Graph Models

Journal Paper
Kevin H. Lee, Amal Agarwal, Anna Y. Zhang, Lingzhou Xue
Stat Journal
Publication year: 2022

Model-based clustering of time-evolving networks has emerged as an important research topic in statistical network analysis. Modeling time-varying network parameters is a fundamental research question. However, because functional network parameters are difficult to model, the current literature has made little progress in modeling time-varying network parameters effectively. In this work, we model network parameters as univariate nonparametric functions instead of constants. We effectively estimate those functional network parameters in temporal exponential-family random graph models using a kernel regression technique and a local likelihood approach. Furthermore, we propose a semiparametric finite mixture of temporal exponential-family random graph models by adopting finite mixture models, which allows simultaneous modeling and detection of groups in time-evolving networks. We also use a conditional likelihood to construct an effective model selection criterion and network cross-validation to choose an optimal bandwidth. The power of our method is demonstrated in simulation studies and real-world applications to dynamic international trade networks and dynamic arms trade networks.
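
As a rough illustration of the kernel-weighted local likelihood idea mentioned in this abstract, the sketch below estimates a time-varying edge parameter for the simplest possible case, an edge-only (Bernoulli) graph model. The edge-only model, the Gaussian kernel, and the names `gaussian_kernel` and `local_edge_parameter` are assumptions made for illustration; the paper's models use richer network statistics and this is not its estimator.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; normalization cancels in the estimate."""
    return math.exp(-0.5 * u * u)

def local_edge_parameter(edge_counts, n_pairs, t, bandwidth):
    """Kernel-weighted local MLE of the edge (log-odds) parameter at time t.

    edge_counts[s] is the number of edges observed at time s and n_pairs is
    the number of node pairs. Maximizing the kernel-weighted Bernoulli
    log-likelihood gives a weighted edge density, returned on the logit scale.
    """
    weights = [gaussian_kernel((s - t) / bandwidth) for s in range(len(edge_counts))]
    weighted_edges = sum(w * e for w, e in zip(weights, edge_counts))
    weighted_pairs = sum(weights) * n_pairs
    p = weighted_edges / weighted_pairs
    return math.log(p / (1.0 - p))

# Hypothetical edge counts for a 10-node network (45 node pairs) over 5 time points.
counts = [5, 10, 15, 20, 25]
theta_start = local_edge_parameter(counts, 45, t=0, bandwidth=1.0)
theta_end = local_edge_parameter(counts, 45, t=4, bandwidth=1.0)
```

A smaller bandwidth tracks changes more closely at the cost of noisier estimates, which is why the abstract pairs the estimator with a cross-validated bandwidth choice.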

Guided by Your Stars: Two-Mode Segmentation Using Large-Scale Online Product Rating Networks

Journal Paper
Qian Chen, Amal Agarwal (co-first author), Duncan Fong and Wayne DeSarbo.
Under revision in Marketing Science
Publication year: 2020

Abstract

Online customer ratings of products are abundant, and e-commerce companies can use the data to gain insights into customer preference and product goodness and to develop better recommendation and targeting strategies. We propose a novel network-based two-mode segmentation methodology that uses large-scale online product ratings to group customers who are likely to engage in similar rating behavior and to reveal underlying product goodness and customer preference toward numerous products. Specifically, to overcome some customer and product interaction biases and the massive missing data problem associated with an analysis of such online data, we (1) employ a machine learning method to infer the underlying product goodness, which takes into account customer heterogeneity in product evaluations, (2) segment products and customers jointly through a two-mode network to handle the massive missing data issue, given that each customer rates only a few products, and (3) extend existing network modeling to incorporate covariates and devise an efficient algorithm to perform simultaneous segmentation of customers and products for large datasets. With simulation studies, we show that our method provides more accurate segmentation results than several benchmark models. An empirical application using an online product rating database illustrates how our methodology overcomes the modeling and computational challenges introduced by the large dataset with massive missing observations to offer useful marketing insights.

Temporal Exponential-Family Random Graph Models with Time-Evolving Latent Block Structure for Dynamic Networks

Journal Paper
Amal Agarwal, Kevin Lee and Lingzhou Xue
Submitted to Journal of Business & Economic Statistics
Publication year: 2020

Abstract

Model-based clustering of dynamic networks has emerged as an important research topic in statistical network analysis. It is critical to effectively and efficiently model the time-evolving latent block structure of dynamic networks in practice. However, most existing methods focus on static or dynamically invariant block structure. We present a principled statistical clustering of large-scale dynamic networks through dynamic exponential-family random graph models with a hidden Markov structure. The hidden Markov structure is used to effectively infer the time-evolving block structure of dynamic networks. We prove the identification conditions for both the network parameters and the transition matrix in our proposed model-based clustering. We propose an effective model selection criterion based on the integrated classification likelihood to choose an appropriate number of clusters. We develop a scalable variational expectation-maximization algorithm to efficiently compute the approximate maximum likelihood estimate. The numerical performance of our proposed method is demonstrated in simulation studies and two real data applications to dynamic international trade networks and dynamic email networks of a large institute.

Model-Based Clustering of Nonparametric Weighted Networks with Application to Water Pollution Analysis

Journal Paper
Amal Agarwal and Lingzhou Xue
Technometrics, Volume 62, Issue 2, Pages 161-172
Publication year: 2020

Abstract

Water pollution is a major global environmental problem, and it poses a great environmental risk to public health and biological diversity. This work is motivated by assessing the potential environmental threat of coal mining through increased sulfate concentrations in river networks, which do not belong to any simple parametric distribution. However, existing network models mainly focus on binary or discrete networks and weighted networks with known parametric weight distributions. We propose a principled nonparametric weighted network model based on exponential-family random graph models and local likelihood estimation, and study its model-based clustering with application to large-scale water pollution network analysis. We do not require any parametric distribution assumption on network weights. The proposed method greatly extends the methodology and applicability of statistical network models. Furthermore, it is scalable to large and complex networks in large-scale environmental studies and geoscientific research. The power of our proposed methods is demonstrated in simulation studies and a real application to sulfate pollution network analysis in the Ohio watershed located in Pennsylvania, United States.

Assessing Contamination of Stream Networks Near Shale Gas Development Using a New Geospatial Tool

Amal Agarwal, Tao Wen, Alex Chen, Anna Yinqi Zhang, Xianzeng Niu, Xiang Zhan, Lingzhou Xue, Susan L. Brantley
Environmental Science & Technology, Volume 54, Issue 14, Pages 8632-8639
Publication year: 2020

Chemical spills in streams can impact ecosystem or human health. Typically, the public learns of spills from industry, media, or government reporting rather than monitoring data. For example, ~1300 spills (76 ≥400 gallons or ~1,500 liters) were reported from 2007 to 2014 by the regulator for natural gas wellpads in the Marcellus shale region of Pennsylvania (U.S.), a region of extensive drilling and hydraulic fracturing. Only one such incident of stream contamination in Pennsylvania has been documented with water quality data in peer-reviewed literature. This could indicate that spills (1) were small or contained on wellpads, (2) were diluted, biodegraded, or obscured by other contaminants, (3) were not detected because of sparse monitoring, or (4) were not detected because of the difficulties of inspecting data for complex stream networks. As a first step addressing the last problem, we developed a geospatial-analysis tool, GeoNet, that analyzes stream networks to detect statistically significant changes between background and potentially-impacted sites. GeoNet was used on data in the Water Quality Portal for the Pennsylvania Marcellus region. With the most stringent statistical tests, GeoNet detected 0.2 to 2% of the known contamination incidents (Na±Cl) in streams. With denser sensor networks, tools like GeoNet could allow real-time detection of polluting events.

Assessing changes in groundwater chemistry in landscapes with more than 100 years of oil and gas development

Journal Paper
Tao Wen, Amal Agarwal (co-first author), Lingzhou Xue, Alex Chen, Alison Herman, Zhenhui Li and Susan L. Brantley
Environmental Science: Processes & Impacts, Volume 21, Issue 2, Pages 384-396
Publication year: 2019

With recent improvements in high-volume hydraulic fracturing (HVHF, known to the public as fracking), vast new reservoirs of natural gas and oil are now being tapped. As HVHF has expanded into the populous northeastern USA, some residents have become concerned about impacts on water quality. Scientists have addressed this concern by investigating individual case studies or by statistically assessing the rate of problems. In general, however, the lack of access to new or historical water quality data hinders the latter assessments. We introduce a new statistical approach to assess water quality datasets – especially sets that differ in data volume and variance – and apply the technique to one region of intense shale gas development in northeastern Pennsylvania (PA) and one with fewer shale gas wells in northwestern PA. The new analysis for the intensely developed region corroborates an earlier analysis based on a different statistical test: in that area, changes in groundwater chemistry show no degradation despite that area’s dense development of shale gas. In contrast, in the region with fewer shale gas wells, we observe slight but statistically significant increases in concentrations in some solutes in groundwaters. One potential explanation for the slight changes in groundwater chemistry in that area (northwestern PA) is that it is the regional focus of the earliest commercial development of conventional oil and gas (O&G) in the USA. Alternate explanations include the use of brines from conventional O&G wells as well as other salt mixtures on roads in that area for dust abatement or de-icing, respectively.

Detecting the effects of coal mining, acid rain, and natural gas extraction in Appalachian basin streams in Pennsylvania (USA) through analysis of barium and sulfate concentrations

Journal Paper
Xianzeng Niu, Anna Wendt, Zhenhui Li, Amal Agarwal, Lingzhou Xue, and Susan L. Brantley
Environmental Geochemistry and Health Journal, Volume 40, Issue 2, Pages 865-885
Publication year: 2018

Abstract

To understand how extraction of different energy sources impacts water resources requires assessment of how water chemistry has changed in comparison with the background values of pristine streams. With such understanding, we can develop better water quality standards and ecological interpretations. However, determination of pristine background chemistry is difficult in areas with heavy human impact. To learn to do this, we compiled a master dataset of sulfate and barium concentrations ([SO4], [Ba]) in Pennsylvania (PA, USA) streams from publicly available sources. These elements were chosen because they can represent contamination related to oil/gas and coal, respectively. We applied changepoint analysis (i.e., likelihood ratio test) to identify pristine streams, which we defined as streams with a low variability in concentrations as measured over years. From these pristine streams, we estimated the baseline concentrations for major bedrock types in PA. Overall, we found that 48,471 data values are available for [SO4] from 1904 to 2014 and 3243 data values for [Ba] from 1963 to 2014. Statewide [SO4] baseline was estimated to be 15.8 ± 9.6 mg/L, but values range from 12.4 to 26.7 mg/L for different bedrock types. The statewide [Ba] baseline is 27.7 ± 10.6 µg/L and values range from 25.8 to 38.7 µg/L. Results show that most increases in [SO4] from the baseline occurred in areas with intensive coal mining activities, confirming previous studies. Sulfate inputs from acid rain were also documented. Slight increases in [Ba] since 2007 and higher [Ba] in areas with higher densities of gas wells when compared to other areas could document impacts from shale gas development, the prevalence of basin brines, or decreases in acid rain and its coupled effects on [Ba] related to barite solubility. The largest impacts on PA stream [Ba] and [SO4] are related to releases from coal mining or burning rather than oil and gas development.
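
The changepoint step mentioned in this abstract can be illustrated with a minimal likelihood ratio scan for a single shift in mean. This is only a sketch under stated assumptions: Gaussian data with a common variance, one changepoint, a hypothetical function name `changepoint_lr`, and made-up concentration series. The abstract does not specify the exact test the authors used.

```python
import math

def rss(segment):
    """Residual sum of squares around the segment mean."""
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)

def changepoint_lr(series, min_seg=3):
    """Scan every split point and return (best_split, likelihood_ratio_stat).

    For Gaussian data with a common variance, the log-likelihood ratio of
    'two means' versus 'one mean' at a split is n * log(RSS_one / RSS_two).
    Assumes no segment is perfectly constant (RSS_two must be positive).
    """
    n = len(series)
    rss_one = rss(series)
    best_split, best_stat = None, 0.0
    for split in range(min_seg, n - min_seg + 1):
        rss_two = rss(series[:split]) + rss(series[split:])
        stat = n * math.log(rss_one / rss_two)
        if stat > best_stat:
            best_split, best_stat = split, stat
    return best_split, best_stat

# Hypothetical sulfate series (mg/L): small repeating fluctuations around a level.
wiggle = [0.5 * ((i * 7) % 5 - 2) for i in range(20)]
stable = [15.0 + w for w in wiggle]
shifted = [(15.0 if i < 10 else 40.0) + w for i, w in enumerate(wiggle)]

stable_split, stable_stat = changepoint_lr(stable)
shifted_split, shifted_stat = changepoint_lr(shifted)
```

Under this sketch, a stream whose statistic stays small would be a candidate "pristine" stream, while a large statistic flags a shift at the reported split. Any threshold for "small" would have to be chosen separately.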

Discovery of Causal Time Intervals

Conference
Zhenhui Li, Guanjie Zheng, Amal Agarwal and Lingzhou Xue
SDM’17: the Seventeenth SIAM International Conference on Data Mining, Pages 804-812
Publication year: 2017

Abstract

Causality analysis, beyond “mere” correlations, has become increasingly important for scientific discoveries and policy decisions. Many of these real-world applications involve time series data. A key observation is that the causality between time series could vary significantly over time. For example, rain could cause severe traffic jams during rush hours but have little impact on traffic at midnight. However, previous studies mostly look at the whole time series when determining the causal relationship between them. Instead, we propose to detect the partial time intervals with causality. As it is time-consuming to enumerate all time intervals and test causality for each interval, we further propose an efficient algorithm that can avoid unnecessary computations based on the bounds of the F-test in the Granger causality test. We use both synthetic and real datasets to demonstrate the efficiency of our pruning techniques and to show that our method can effectively discover interesting causal intervals in time series data.
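
To make the interval-level test concrete, here is a minimal sketch of a lag-1 Granger causality F statistic evaluated on a chosen time window. It does not reproduce the paper's pruning bounds; the lag of 1, the synthetic data, and the helper names `ols_rss` and `granger_f` are assumptions made for illustration.

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, k):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[row][c] -= factor * aug[col][c]
    # Back substitution for the coefficients.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (aug[i][k] - tail) / aug[i][i]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(x, y, start, end):
    """F statistic for 'x Granger-causes y' (lag 1) on the window [start, end)."""
    times = range(start + 1, end)
    targets = [y[t] for t in times]
    restricted = [[1.0, y[t - 1]] for t in times]             # past of y only
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in times]  # plus past of x
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    n, k, q = len(targets), 3, 1  # q = number of added x lags
    return ((rss_r - rss_u) / q) / (rss_u / (n - k))

# Synthetic series: x drives y only in the second half.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [rng.gauss(0.0, 1.0)]
for t in range(1, 200):
    if t < 100:
        y.append(rng.gauss(0.0, 1.0))
    else:
        y.append(0.9 * x[t - 1] + rng.gauss(0.0, 0.1))

f_null = granger_f(x, y, 0, 100)
f_causal = granger_f(x, y, 100, 200)
```

Calling `granger_f` on every candidate window is the brute-force enumeration that the abstract describes as time-consuming, which is the cost the proposed pruning is meant to avoid.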

Asymmetric flows in the intercellular membrane during cytokinesis

Journal Paper
Vidya V. Menon, Soumya S S, Amal Agarwal, Sundar R. Naganathan, Mandar M. Inamdar and Anirban Sain
Biophysical Journal, Volume 113, Issue 12, Pages 2787-2795
Publication year: 2017

Abstract

Eukaryotic cells undergo shape changes during their division and growth. This involves flow of material both in the cell membrane and in the cytoskeletal layer beneath the membrane. Such flows result in redistribution of phospholipid at the cell surface and actomyosin in the cortex. Here we focus on the growth of the intercellular surface during cell division in a Caenorhabditis elegans embryo. The growth of this surface leads to the formation of a double-layer of separating membranes between the two daughter cells. The division plane typically has a circular periphery and the growth starts from the periphery as a membrane invagination, which grows radially inward like the shutter of a camera. The growth is typically not concentric, in the sense that the closing internal ring is located off-center. Cytoskeletal proteins anillin and septin have been found to be responsible for initiating and maintaining the asymmetry of ring closure but the role of possible asymmetry in the material flow into the growing membrane has not been investigated yet. Motivated by experimental evidence of such flow asymmetry, here we explore the patterns of internal ring closure in the growing membrane in response to asymmetric boundary fluxes. We highlight the importance of the flow asymmetry by showing that many of the asymmetric growth patterns observed experimentally can be reproduced by our model, which incorporates the viscous nature of the membrane and contractility of the associated cortex.

Bayesian inversion and sequential Monte Carlo sampling techniques applied to nearfield acoustic sensor arrays

Journal Paper
MR Bai, A Agarwal, CC Chen and YC Wang
Journal of the Acoustical Society of America, Volume 136, Issue 4, Page 2084
Publication year: 2014

Abstract

This paper demonstrates that inverse source reconstruction can be performed using a methodology of particle filters that relies primarily on the Bayesian approach of parameter estimation. The proposed approach is applied in the context of nearfield acoustic holography based on the equivalent source method (ESM). A state-space model is formulated in light of the ESM. The parameters to estimate are amplitudes and locations of the equivalent sources. The parameters constitute the state vector which follows a first-order Markov process with the transition matrix being the identity for every frequency-domain data frame. The implementation of recursive Bayesian filters involves a sequential Monte Carlo sampling procedure that treats the estimates as point masses with a discrete probability mass function (PMF) which evolves with iteration. It is evident from the results that the inclusion of the appropriate prior distribution is crucial in the parameter estimation.

Bayesian approach of nearfield acoustic reconstruction with particle filters

Journal Paper
MR Bai, A Agarwal, CC Chen and YC Wang
Journal of the Acoustical Society of America, Volume 133, Issue 6, Pages 4032-4043
Publication year: 2013

Abstract

This paper demonstrates that inverse source reconstruction can be performed using a methodology of particle filters that relies primarily on the Bayesian approach of parameter estimation. In particular, the proposed approach is applied in the context of nearfield acoustic holography based on the equivalent source method (ESM). A state-space model is formulated in light of the ESM. The parameters to estimate are amplitudes and locations of the equivalent sources. The parameters constitute the state vector which follows a first-order Markov process with the transition matrix being the identity for every frequency-domain data frame. Filtered estimates of the state vector obtained are assigned weights adaptively. The implementation of recursive Bayesian filters involves a sequential Monte Carlo sampling procedure that treats the estimates as point masses with a discrete probability mass function (PMF) which evolves with iteration. The weight update equation governs the evolution of this PMF and depends primarily on the likelihood function and the prior distribution. It is evident from the simulation results that the inclusion of the appropriate prior distribution is crucial in the parameter estimation.
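
The sequential Monte Carlo procedure described in this abstract can be sketched with a toy bootstrap particle filter that estimates a single static amplitude from noisy measurements. This is an illustrative scalar example, not the paper's acoustic model: the uniform prior range, the Gaussian likelihood, the small jitter added to the identity transition, and the name `particle_filter` are all assumptions.

```python
import math
import random

def particle_filter(observations, n_particles=500, obs_std=0.5,
                    jitter_std=0.02, seed=0):
    """Bootstrap particle filter for a scalar state with identity transition."""
    rng = random.Random(seed)
    # Prior: particles spread uniformly over an assumed plausible range.
    particles = [rng.uniform(0.0, 5.0) for _ in range(n_particles)]
    for z in observations:
        # Identity transition plus a small jitter to keep particles diverse.
        particles = [p + rng.gauss(0.0, jitter_std) for p in particles]
        # Weight update: Gaussian likelihood of the observation per particle.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Multinomial resampling turns the weighted point masses back
        # into an equally weighted particle set.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    # Posterior mean estimate from the final particle set.
    return sum(particles) / n_particles

# Hypothetical measurements of a source with true amplitude 2.0.
data_rng = random.Random(1)
observations = [2.0 + data_rng.gauss(0.0, 0.5) for _ in range(50)]
estimate = particle_filter(observations)
```

Narrowing or shifting the prior range changes where the initial particles fall, which is one simple way to see why the abstract stresses the choice of prior distribution.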