Projects

My goal here is to describe, in some detail, the research projects that have led or will lead to a publication. These include my main research towards the dissertation, collaborative projects, and my undergraduate research projects; I exclude projects with little or no impact. Each project has links to code, slides, applications, and other relevant material where these are available to share. Most of the projects without a ‘Manuscript’ link are at various stages of the peer review process. If you are interested in only a summary, please check out the section “Research.”

Statistics- and ML-Intensive Projects

  • Recommender system using two-mode segmentation in bipartite networks

    Stochastic block models with a proportional odds structure for massive bipartite networks with ordinal ratings

    This is joint work with Qian Chen (Ph.D. student in Business Marketing at Penn State), Dr. Duncan Fong, and Dr. Wayne DeSarbo. The idea is centered around developing a novel recommender system using two-mode segmentation of Amazon’s product-review network. The goal was to segment consumers based on their heterogeneous propensities to rate different products while simultaneously segmenting the products, yielding both a generic and a reviewer-specific recommender system.

    Online reviews have several advantages as a data source: they are free, unsolicited, and user-generated. I utilized this emerging data source for market segmentation by proposing a network model with two separable parts: one based on stochastic block models for the edge structure, and another based on a proportional odds model for the ordered ratings. In particular, I faced two main modeling challenges (a small sketch of the proportional odds piece follows the list):

    • Review databases are typically sparse, since each user rates only a few products out of the vast number of products the company carries. To address this missing-data problem, I enriched the data by treating the online product-rating database as a two-mode network (also called a “bipartite network”), taking advantage of not only the data values (i.e., the ratings) but also the network structure (i.e., who rates what).
    • Review databases are also fairly large, which makes inference computationally expensive. To scale up inference in these models, I designed a novel stochastic variational EM algorithm that could handle hundreds of thousands of nodes and tens of clusters within 3-4 days on a cluster with around 100 processors.
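
    To make the rating model concrete, here is a minimal sketch of how a proportional odds structure assigns probabilities to ordinal ratings for a given (reviewer block, product block) pair. The logistic cut-point parameterization, the function names, and the example values are illustrative assumptions, not the exact formulation used in the project.

        import numpy as np

        def rating_pmf(cutpoints, mu_gh):
            """Ordinal rating probabilities for a (reviewer block g, product block h) pair
            under a proportional odds model: P(Y <= k) = sigmoid(theta_k - mu_gh).
            `cutpoints` are ordered thresholds theta_1 < ... < theta_{K-1}; `mu_gh` is the
            block-pair location parameter (illustrative parameterization)."""
            cdf = 1.0 / (1.0 + np.exp(-(np.asarray(cutpoints) - mu_gh)))
            cdf = np.concatenate(([0.0], cdf, [1.0]))   # P(Y <= 0) = 0 and P(Y <= K) = 1
            return np.diff(cdf)                         # P(Y = k) for k = 1, ..., K

        # Example: 5-star ratings for a block pair that tends to rate highly (large mu_gh)
        print(rating_pmf(cutpoints=[-2.0, -1.0, 0.0, 1.0], mu_gh=1.5))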

  • Non-parametric Weighted Networks

    Model-Based Clustering of Large-Scale Non-parametric Weighted Networks

    This is joint work with my advisor, Dr. Lingzhou Xue. The idea is centered around developing a nonparametric network clustering model that estimates the block weight densities in weighted networks using kernel local likelihood.

    For a brief introduction to what networks are, please refer to the project “Large-scale Dynamic Networks.” As I mentioned there, weighted networks frequently occur in real-world applications. To give one more example, analyzing water pollution networks is a major challenge in geoscientific research, where it is imperative to detect polluter sources in a river network. Here the nodes are the sampling sites of a pollutant, edges exist between sampling sites connected by the river flow, and the weight on an edge can be thought of as the difference between the pollutant concentrations measured at the two sampling nodes.

    This kind of network structure cannot be analyzed using popular spatial models based on the “neighborhood” concept, in which data points that are far apart essentially behave independently of each other. That is not true for river networks: depending on the discharge (volume flowing per unit time) of the river and the pollutant flow, points that are far apart but connected through the network can be strongly correlated, while two nearby points that are not connected by river flow may show no correlation.

    The key challenges are:

    • Networks must be clustered as a whole with a unified and statistically principled model.
    • Assumptions of parametric distributions for the weights between cluster pairs are likely to fail in real-world applications, since weight distributions can be multimodal, heavy-tailed, leptokurtic, and full of outliers.
    • Estimation of the number of clusters must be based on a model-based selection criterion.
    • The algorithm must scale up to deal with environmental big data.

    I addressed these challenges by proposing a principled nonparametric weighted network model based on exponential-family random graph models and local likelihood estimation, and by studying its model-based clustering with an application to large-scale water pollution network analysis. This approach does not require any parametric distributional assumption on the network weights and substantially extends the methodology and applicability of statistical network models. Furthermore, it scales to the large and complex networks found in large-scale environmental studies and geoscientific research. I demonstrated the power of the proposed method in simulation studies and in a real application to sulfate pollution network analysis in the Ohio watershed located in Pennsylvania, United States. The display picture shows the clustered sulfate network from this application.
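
    To illustrate the nonparametric idea, here is a minimal sketch: once nodes carry cluster labels, the weight distribution for each block pair can be estimated with a kernel smoother instead of a parametric family. The Gaussian kernel, the hard labels, and the function names are simplifying assumptions rather than the exact local likelihood estimator developed in the project.

        import numpy as np

        def block_weight_density(weights_gh, grid, bandwidth=0.5):
            """Kernel density estimate of the edge-weight distribution for one block pair (g, h).
            `weights_gh`: observed edge weights between clusters g and h (hard labels assumed).
            Returns density values on `grid` using a Gaussian kernel (illustrative choice)."""
            w = np.asarray(weights_gh)[:, None]                  # shape (n_edges, 1)
            z = (np.asarray(grid)[None, :] - w) / bandwidth      # shape (n_edges, n_grid)
            kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
            return kernel.mean(axis=0) / bandwidth

        # Example: heavy-tailed weights that a single parametric family would fit poorly
        weights = np.random.standard_cauchy(500)
        grid = np.linspace(-10, 10, 200)
        density = block_weight_density(weights, grid)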

  • Time-evolving community detection in Large-scale Dynamic Networks

    Temporal ERGMs with time-evolving latent block structure for massive dynamic networks

    This is joint work with Dr. Kevin Lee (Assistant Professor at Western Michigan University) and my advisor Dr. Lingzhou Xue. The idea is centered around developing an integrated, statistically principled model-based clustering approach that can estimate the time-evolving cluster memberships of nodes in a dynamic network.

    In simple language, a network can be thought of as a collection of nodes, some of which are interconnected. For example, in a friendship network on Facebook, the nodes are people and the edges are friendship ties. The edges can be defined in several ways. One way is simply to indicate whether two people are friends or not; this is a binary network. Another way is to count the number of people liking a particular post, and then define a frequency as the number of likes per unit time. The edges are not binary in this example; we call this type of network a weighted network.

    The frequencies mentioned above change continuously with time. One important goal is to cluster the nodes into several groups based on some network statistic. For example, if we pick the density of edges as the statistic, the groups would distinguish between people who do not post much, people who post a lot but do not get any likes, and people who post a lot and get a lot of likes. These clusters may well “evolve” with time, and an individual may even switch clusters if one of their posts goes viral. There are several challenges:

    • Dynamically changing edges need to be accounted for in the model.
    • The underlying groups must be flexible, and nodes should be able to change their group memberships with time.
    • Estimation of the number of clusters should be achievable through some model-based selection criterion in such a novel framework.
    • The model should scale up for large-scale networks with more than 100K nodes.

    In this project, I addressed these challenges by proposing a principled statistical clustering of large-scale dynamic networks through dynamic exponential-family random graph models with a hidden Markov structure. The hidden Markov structure is used to infer the time-evolving block structure of dynamic networks effectively. We proved the identification conditions for both the network parameters and the transition matrix in the proposed model-based clustering. I proposed an effective model selection criterion based on the integrated classification likelihood to choose an appropriate number of clusters, and developed a scalable variational expectation-maximization algorithm to compute the approximate maximum likelihood estimate efficiently. Thorough simulation studies and real-data applications to dynamic international trade networks and the dynamic email network of a large institution demonstrate the power of the methodology.
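
    For intuition, here is a minimal sketch of the hidden Markov idea behind the evolving memberships: each node’s cluster label follows a Markov chain over time, so the probability of a label path factorizes into an initial distribution and transition probabilities. The hard labels and names below are illustrative; the actual method works with variational (soft) membership probabilities estimated inside the EM algorithm.

        import numpy as np

        def log_membership_path_prob(labels, init_probs, trans_mat):
            """Log-probability of one node's cluster-label path z_1, ..., z_T under a hidden
            Markov structure: pi[z_1] * prod_t A[z_{t-1}, z_t].
            `labels`: integer cluster labels over T time points (hard labels for illustration);
            `init_probs` (pi) and `trans_mat` (A) would be estimated by the variational EM."""
            logp = np.log(init_probs[labels[0]])
            for prev, curr in zip(labels[:-1], labels[1:]):
                logp += np.log(trans_mat[prev, curr])
            return logp

        # Example: two clusters with "sticky" transitions; a node that switches clusters once
        pi = np.array([0.5, 0.5])
        A = np.array([[0.9, 0.1],
                      [0.1, 0.9]])
        print(log_membership_path_prob([0, 0, 0, 1, 1], pi, A))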

  • Semiparametric Finite Mixture of Discrete TERGMs

    Mixture of Temporal Exponential-Family Random Graph Models with Varying Network Parameters

    This is joint work with Dr. Kevin Lee (Assistant Professor at Western Michigan University) and my advisor Dr. Lingzhou Xue. The idea is centered around developing a general language for describing time-evolving complex systems. Modeling time-varying network parameters is a fundamental research question; however, due to the difficulty of modeling functional network parameters, there has been little progress in the current literature on doing so effectively.

    In this work, we considered the circumstance in which the network parameters are univariate nonparametric functions of time instead of constants. Using a kernel regression technique and a local likelihood approach, we effectively estimate these functional network parameters in discrete-time exponential-family random graph models. Furthermore, by adopting mixture models, we extend our model to a semiparametric mixture of discrete-time exponential-family random graph models, which simultaneously allows both modeling and detecting groups in time-evolving networks.
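
    A minimal sketch of the kernel regression intuition: to estimate a time-varying parameter at time t0, observations from nearby time points are weighted by a kernel inside the local likelihood. The Bernoulli edge model below is a toy stand-in for the ERGM terms actually used, and the names and Gaussian kernel are illustrative assumptions; in the paper the bandwidth would be chosen by network cross-validation, as described next.

        import numpy as np

        def local_edge_probability(times, edge_indicators, t0, bandwidth):
            """Kernel-weighted local likelihood estimate of a time-varying edge probability at
            time t0 (a toy stand-in for a functional ERGM parameter). For a Bernoulli edge
            model, the local MLE is simply the kernel-weighted average of edge indicators."""
            t = np.asarray(times, dtype=float)
            y = np.asarray(edge_indicators, dtype=float)
            weights = np.exp(-0.5 * ((t - t0) / bandwidth) ** 2)   # Gaussian kernel weights
            return np.sum(weights * y) / np.sum(weights)

        # Example: edge density drifts upward over time; estimate it locally at t0 = 5
        times = np.repeat(np.arange(10), 50)
        edges = np.random.binomial(1, 0.2 + 0.05 * times)
        print(local_edge_probability(times, edges, t0=5.0, bandwidth=1.5))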

    Also, we use a conditional likelihood to construct a useful model selection criterion and network cross-validation to choose an optimal bandwidth. The power of our method is demonstrated in in-depth simulation studies and real-world applications to dynamic international trade networks and dynamic arms trade networks.

Collaborative Projects

  • Statistical Machine Learning of Environmental Big Data

    Detection of polluter sources based on multiple parametric/non-parametric tests.

    This is joint work with leading geoscientist Dr. Susan Brantley at Penn State and her postdoctoral scholar Tao Wen. I developed a novel automated statistical methodology to detect polluter sources in river networks. This was an engineering-intensive project: the data were quite dirty, and I spent a lot of time cleaning them. The idea is simple, but the implementation has been quite a challenge.

    First, I mapped all sampling locations and polluter source locations onto the river network. Then I built the software to take a query for a polluter source: threshold flow distances upstream and downstream, the time interval for sampling, and the polluter source’s “event” date. An event could be associated with some activity that caused the concentration of certain contaminants in the river streams to increase. Based on the query, the software collects all samples upstream and downstream that lie within the threshold flow distances and the specified time interval; the downstream samples must also be dated after the polluter’s “event” date. It then compares these two sets of samples using a standard two-sample t-test and the Wilcoxon rank-sum test, and corrects the p-values for multiple testing using the Benjamini-Hochberg procedure, as sketched below.
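
    Here is a minimal sketch of the statistical core of one query: compare the upstream and downstream samples with a two-sample t-test (Welch’s version here) and the Wilcoxon rank-sum test, then correct the p-values collected across all queried sources with Benjamini-Hochberg. The actual tool is written in R Shiny; this Python/scipy version and its names are illustrative only.

        from scipy import stats
        from statsmodels.stats.multitest import multipletests

        def compare_source(upstream, downstream):
            """Two-sample comparison of pollutant concentrations around one polluter source.
            Returns raw p-values from Welch's t-test and the Wilcoxon rank-sum test.
            (Illustrative Python version; the actual tool is implemented in R Shiny.)"""
            t_p = stats.ttest_ind(downstream, upstream, equal_var=False).pvalue
            w_p = stats.ranksums(downstream, upstream).pvalue
            return t_p, w_p

        # After running compare_source() for every queried polluter source, correct the
        # collected p-values for multiple testing with the Benjamini-Hochberg procedure:
        raw_pvalues = [0.003, 0.04, 0.20, 0.0005]          # hypothetical per-source p-values
        reject, adj_pvalues, _, _ = multipletests(raw_pvalues, alpha=0.05, method="fdr_bh")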

    I have implemented this approach as a software tool in R Shiny that allows researchers to analyze “big” water pollution data with over 20K sampling sites and identify the significant polluter sources based on the tests specified above.

    Here is a “free” version of the application, used on the PA surface water dataset to detect significant spills.

  • Constrained penalized regression models for HiC and Epigenetic big data

    Detection of significant Topologically Associating Domains and genetic states affecting interaction intensities of chromosomal segments.

    This is joint work with biostatistics researcher Dr. Yu Zhang and my advisor Dr. Lingzhou Xue. The chromosome data are collected using the state-of-the-art Hi-C method to study the three-dimensional architecture of genomes. I developed a novel stratified minibatch sub-sampling procedure to fit a constrained penalized LASSO model to these data. The data have huge potential in precision medicine, particularly if we could identify distinguishing chromatin structure between normal and cancer cells.

    The dataset is quite large: the intensity matrix is 25K × 25K, which had to be collapsed into a single data frame with 625 million rows and close to 700 fields to fit the constrained penalized LASSO model. My stratified minibatch sub-sampling procedure for these Hi-C data allows fitting such models 100 times faster than the conventional approaches (a small sketch follows). I used the University’s computing clusters to run this analysis, and built a Shiny application, HiCEpigen, to show the prediction results from the model.
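
    A minimal sketch of the stratified minibatch idea: draw each minibatch so that every stratum is represented, then take incremental L1-penalized updates. scikit-learn’s SGDRegressor stands in for the constrained penalized LASSO solver actually used, the strata definition is hypothetical, and the constraint handling is omitted.

        import numpy as np
        from sklearn.linear_model import SGDRegressor

        def stratified_minibatch(X, y, strata, batch_per_stratum, rng):
            """Draw a minibatch with an equal number of rows from each stratum so that rare
            strata are not swamped by common ones (illustrative sub-sampling scheme)."""
            idx = []
            for s in np.unique(strata):
                members = np.flatnonzero(strata == s)
                idx.append(rng.choice(members, size=batch_per_stratum, replace=False))
            idx = np.concatenate(idx)
            return X[idx], y[idx]

        # Illustrative data only; the real design matrix has ~625 million rows and ~700 fields
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100_000, 50))
        y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100_000)
        strata = rng.integers(0, 5, size=100_000)

        model = SGDRegressor(penalty="l1", alpha=1e-4)      # L1-penalized stand-in for the LASSO fit
        for _ in range(200):                                # incremental passes over minibatches
            Xb, yb = stratified_minibatch(X, y, strata, batch_per_stratum=200, rng=rng)
            model.partial_fit(Xb, yb)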

  • Two-sample tests to solve the generalized nonparametric Behrens-Fisher problem

    Developing two-sample testing procedures for small, unbalanced samples with unequal variances

    This is joint work with leading geoscientist Dr. Susan Brantley at Penn State, her postdoctoral scholar Tao Wen, and Allison Herman. I conducted a thorough simulation study to compare different two-sample parametric tests (including Welch’s t-test and its permutation and bootstrap versions) and non-parametric tests (including the Wilcoxon and Brunner-Munzel tests) based on empirical distributions over multiple datasets, and evaluated power and size under different scenarios of unequal variances and imbalanced sample sizes.
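
    For intuition, here is a minimal sketch of one cell of such a simulation grid: draw unbalanced samples with unequal variances under the null, apply Welch’s t-test, the WMW test, and the Brunner-Munzel test, and record empirical rejection rates (size). The distributions and settings are illustrative, not those used in the study.

        import numpy as np
        from scipy import stats

        def empirical_size(n1=10, n2=40, sd1=1.0, sd2=3.0, alpha=0.05, n_sim=2000, seed=0):
            """Empirical type-I error of three two-sample tests under unequal variances and
            unbalanced sample sizes (both samples share the same mean, so H0 is true)."""
            rng = np.random.default_rng(seed)
            rejections = {"welch": 0, "wmw": 0, "brunner_munzel": 0}
            for _ in range(n_sim):
                x = rng.normal(0.0, sd1, n1)
                y = rng.normal(0.0, sd2, n2)
                rejections["welch"] += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha
                rejections["wmw"] += stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha
                rejections["brunner_munzel"] += stats.brunnermunzel(x, y).pvalue < alpha
            return {test: count / n_sim for test, count in rejections.items()}

        print(empirical_size())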

    The classical parametric form of the Behrens-Fisher problem considers the hypothesis of equal means in the presence of potentially different variances. This problem of heteroscedasticity (unequal variances) is commonly encountered in geoscientific data, where concentrations of various analytes follow differently scaled distributions.

    The problem has been further extended to the non-parametric case, in which the distributions of the samples are unknown. In non-parametric tests like the WMW test, if the samples have unequal variances, rejection of the null hypothesis only means the distributions are not stochastically equal; it does not imply that one distribution is greater than the other with respect to some location parameter, e.g., the median (Chen, 2000). Moreover, in such cases the WMW test does not maintain its level (Pratt, 1964).

    Several papers in the literature give solutions to the non-parametric BF problem under different conditions. For example, Babu (2002) provided a partial solution for testing the equality of the medians of two continuous distributions having the same shape but possibly unequal variances. Brunner (2000) proposed a generalized version of the WMW rank test to solve the non-parametric BF problem and gave extensive simulations showing that the WMW test may be conservative or liberal depending on the ratio of the sample sizes and the variances of the underlying distribution functions.

    We compared the t-test, the WMW test, and the BM test (also known as the generalized Wilcoxon test) and presented our findings for the NW PA and Bradford datasets. This paper will appear soon in Environmental Science: Processes & Impacts.

  • Discovery of Partial Causal Time Intervals

    A Novel Pruning Algorithm to detect Granger-causal time intervals in two time series

    This is joint work with Dr. Zhenhui Li from the College of Information Sciences and Technology at Penn State and her Ph.D. student Guanjie. The primary motivation behind this work comes from analyzing observational time series. In real life, controlled experiments are hard to conduct, and often we end up merely collecting data from observational studies. Analyzing such time series to infer the causality of one predictor on a response becomes a challenge, especially when the causal relationship changes its behavior rapidly.

    We proposed a novel pruning algorithm based on bounds of the F-test that can detect partial time intervals with causality. Since it is time-consuming to enumerate all time intervals and test causality for each one, we further proposed an efficient algorithm that avoids unnecessary computations using bounds on the F-statistic in the Granger causality test.
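
    A minimal sketch of the per-interval test that the pruning algorithm accelerates: fit restricted and unrestricted lagged regressions on one candidate interval and compare them with an F-test. statsmodels’ grangercausalitytests stands in for that per-interval test; the bounds-based pruning itself, which is the contribution of the paper, is not reproduced here.

        import numpy as np
        import pandas as pd
        from statsmodels.tsa.stattools import grangercausalitytests

        def granger_pvalue_on_interval(x, y, start, end, maxlag=2):
            """F-test p-value for 'x Granger-causes y' restricted to one time interval.
            A brute-force scan would call this for every candidate interval; the pruning
            algorithm avoids most of these calls using bounds on the F-statistic."""
            segment = pd.DataFrame({"y": y[start:end], "x": x[start:end]})
            result = grangercausalitytests(segment[["y", "x"]], maxlag=maxlag, verbose=False)
            return result[maxlag][0]["ssr_ftest"][1]        # p-value of the F-test at maxlag

        # Example: x drives y only in the second half of the series
        rng = np.random.default_rng(1)
        x = rng.normal(size=300)
        y = rng.normal(size=300)
        y[150:] += 0.8 * x[149:299]                         # lag-1 influence in the later interval
        print(granger_pvalue_on_interval(x, y, 0, 150))     # no causality expected
        print(granger_pvalue_on_interval(x, y, 150, 300))   # causality expected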

Undergraduate Research Projects

  • Asymmetric flows in the intercellular membrane during cytokinesis

    Exploring the patterns of internal ring closure in the growing membrane in response to asymmetric boundary fluxes.

    This is joint work with biophysics researcher Dr. Anirban Sain at IIT Bombay, who was also my Master’s thesis advisor. The goal of the project was to study the fast relaxation of the septum after laser ablation at a point on the circumference of the contractile ring during cytokinesis. I was particularly interested in obtaining the time scale of retraction just after the ablation, the overall area change under fast relaxation, and the velocity profiles associated with the dynamical flow. I set up a theoretical model by writing down the stress balance and constitutive relations, and did image analysis using Particle Image Velocimetry (PIV) in MATLAB and ImageJ on the videos obtained from the experimental laser ablation of the septum. From the PIV analysis on the time series of snapshots, I obtained an overall velocity map and used it to understand the flows inside the septum. Further, I analyzed the velocity maps and the dynamical parameters in light of the theoretical model developed.
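
    For intuition, here is a minimal sketch of the core PIV step: estimate the displacement of an interrogation window between two consecutive frames from the peak of their cross-correlation. The original analysis used PIV tools in MATLAB and ImageJ; this numpy version and its names are illustrative only.

        import numpy as np
        from scipy.signal import fftconvolve

        def window_displacement(frame_a, frame_b):
            """Estimate the (dy, dx) displacement of an interrogation window between two frames
            from the peak of their cross-correlation (the core step of PIV; illustrative only)."""
            a = frame_a - frame_a.mean()
            b = frame_b - frame_b.mean()
            corr = fftconvolve(b, a[::-1, ::-1], mode="full")   # cross-correlation via FFT
            peak = np.unravel_index(np.argmax(corr), corr.shape)
            return peak[0] - (a.shape[0] - 1), peak[1] - (a.shape[1] - 1)

        # Example: a random texture shifted by (3, -2) pixels between frames
        rng = np.random.default_rng(2)
        frame_a = rng.random((64, 64))
        frame_b = np.roll(frame_a, shift=(3, -2), axis=(0, 1))
        print(window_displacement(frame_a, frame_b))            # approximately (3, -2)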

  • Bayesian approach of nearfield acoustic reconstruction with particle filters

    Bayesian inversion and sequential Monte Carlo sampling techniques applied to nearfield acoustic sensor arrays

    This is joint work with Dr. Ming-Sian R. Bai at National Tsing Hua University, Taiwan, and his graduate students Ching-Cheng Chen and Yen-Chih Wang. The goal of the project was to demonstrate that inverse source reconstruction can be performed using particle filters, which rely primarily on the Bayesian approach to parameter estimation. My proposed approach was novel at the time of publication, and I applied it in the context of nearfield acoustic holography based on the equivalent source method (ESM). I formulated a state-space model in light of the ESM, with the amplitudes and locations of the equivalent sources as the parameters to estimate. These parameters constitute the state vector, which follows a first-order Markov process with the transition matrix being the identity for every frequency-domain data frame. Filtered estimates of the state vector are assigned weights adaptively. The implementation of recursive Bayesian filters involves a sequential Monte Carlo sampling procedure that treats the estimates as point masses with a discrete probability mass function (PMF) that evolves with each iteration. The weight update equation governs the evolution of this PMF and depends primarily on the likelihood function and the prior distribution. It was evident from the simulation results that the inclusion of an appropriate prior distribution is crucial for the parameter estimation.
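
    A minimal sketch of the sequential Monte Carlo (sampling-importance-resampling) loop described above: propagate particles through an identity transition with noise, reweight them by the likelihood of the new data frame, and resample. The scalar state and the Gaussian likelihood are illustrative stand-ins for the ESM amplitudes and locations.

        import numpy as np

        def sir_step(particles, weights, observation, rng, proc_sd=0.05, obs_sd=0.2):
            """One sampling-importance-resampling step of a particle filter. The transition is
            the identity plus noise (first-order Markov, as in the state-space model above);
            the Gaussian observation likelihood is an illustrative stand-in."""
            particles = particles + rng.normal(0.0, proc_sd, size=particles.shape)   # predict
            likelihood = np.exp(-0.5 * ((observation - particles) / obs_sd) ** 2)    # weight update
            weights = weights * likelihood
            weights = weights / weights.sum()
            idx = rng.choice(len(particles), size=len(particles), p=weights)          # resample
            return particles[idx], np.full(len(particles), 1.0 / len(particles))

        # Example: track a constant source amplitude of 1.0 from noisy simulated frames
        rng = np.random.default_rng(3)
        particles = rng.normal(0.0, 1.0, 1000)              # draws from the prior
        weights = np.full(1000, 1.0 / 1000)
        for frame in rng.normal(1.0, 0.2, 30):              # simulated observations
            particles, weights = sir_step(particles, weights, frame, rng)
        print(particles.mean())                             # close to 1.0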