Publications
You can also find my articles on my Google Scholar profile.
In the Pipeline
- Asiaee, A., Oymak, S., Coombes, K. R., & Banerjee, A. High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient. ArXiv Preprint ArXiv:1806.04047.
@article{aocb18, title = {High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient}, author = {Asiaee, Amir and Oymak, Samet and Coombes, Kevin R and Banerjee, Arindam}, journal = {arXiv preprint arXiv:1806.04047} }
High dimensional structured data enriched model describes groups of observations by shared and per-group individual parameters, each with its own structure such as sparsity or group sparsity. In this paper, we consider the general form of data enrichment where data comes in a fixed but arbitrary number of groups G. Any convex function, e.g., norms, can characterize the structure of both shared and individual parameters. We propose an estimator for high dimensional data enriched model and provide conditions under which it consistently estimates both shared and individual parameters. We also delineate sample complexity of the estimator and present high probability non-asymptotic bound on estimation error of all parameters. Interestingly the sample complexity of our estimator translates to conditions on both per-group sample sizes and the total number of samples. We propose an iterative estimation algorithm with linear convergence rate and supplement our theoretical analysis with synthetic and real experimental results. Particularly, we show the predictive power of data enriched model along with its interpretable results in anticancer drug sensitivity analysis.
- Abrams, Z. B., Zucker, M., Min, W., Asiaee, A., Abruzzo, L. V., & Coombes, K. R. Thirty Biologically Interpretable Clusters of Transcription Factors Distinguish Cancer Type. Manuscript Submitted for Publication.
@article{azwa18, title = {Thirty Biologically Interpretable Clusters of Transcription Factors Distinguish Cancer Type}, author = {Abrams, Zachary B and Zucker, Mark and Min, Wang and Asiaee, Amir and Abruzzo, Lynne V and Coombes, Kevin R}, journal = {Manuscript submitted for publication} }
Background: Transcription factors are important regulators of gene expression and play critical roles in development, differentiation, and in many cancers. To carry out their regulatory programs, they must cooperate in networks and bind simultaneously to sites in promoter or enhancer regions of genes. We hypothesize that the mRNA co-expression patterns of transcription factors can be used both to learn how they cooperate in networks and to distinguish between cancer types. Results: We recently developed a new algorithm, Thresher, that combines principal component analysis, outlier filtering, and von Mises-Fisher mixture models to cluster genes (in this case, transcription factors) based on expression, determining the optimal number of clusters in the process. We applied Thresher to the RNA-Seq expression data of 486 transcription factors from more than 10,000 samples of 33 kinds of cancer studied in The Cancer Genome Atlas (TCGA). We found that 30 clusters of transcription factors from a 29-dimensional principal component space were able to distinguish between most cancer types, and could separate tumor samples from normal controls. Moreover, each cluster of transcription factors could be either (i) linked to a tissue-specific expression pattern or (ii) associated with a fundamental biological process such as cell cycle, angiogenesis, apoptosis, or cytoskeleton. Clusters of the second type were more likely to also be associated with embryonically lethal mouse phenotypes. Conclusions: Using our approach, we have shown that the mRNA expression patterns of transcription factors contain most of the information needed to distinguish different cancer types. The Thresher method is capable of discovering biologically interpretable clusters of genes. It can potentially be applied to other gene sets, such as signaling pathways, to decompose them into simpler, yet biologically meaningful, components.
High Dimensional Statistics
- Asiaee T., A., Chatterjee, S., & Banerjee, A. (2016). High Dimensional Structured Estimation with Noisy Designs. In 16th SIAM International Conference on Data Mining (SDM) (pp. 801–809). SIAM.
@inproceedings{ascb16, title = {High Dimensional Structured Estimation with Noisy Designs}, author = {Asiaee T., Amir and Chatterjee, Soumyadeep and Banerjee, Arindam}, booktitle = {16th SIAM International Conference on Data Mining (SDM)}, pages = {801--809}, year = {2016}, organization = {SIAM} }
Structured estimation methods, such as LASSO, have received considerable attention in recent years, and substantial progress has been made in extending such methods to general norms and non-Gaussian design matrices. In real-world problems, however, covariates are usually corrupted with noise, and there have been efforts to generalize structured estimation methods to the noisy covariate setting. In this paper, we first show that without any information about the noise in covariates, currently established techniques for bounding the statistical error of estimation fail to provide consistency guarantees. However, when information about the noise covariance is available or can be estimated, we prove consistency guarantees for any norm regularizer, which is a more general result than the state of the art. Next, we investigate the empirical performance of structured estimation, specifically LASSO, when covariates are noisy and empirically show that LASSO is not consistent or stable in the presence of additive noise. However, prediction performance improves quite substantially when the noise covariance is available for incorporation in the estimator.
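To give a feel for why knowing the noise covariance matters, here is a minimal one-dimensional sketch. It is not the paper's estimator (which handles high-dimensional, norm-regularized problems); it is the textbook errors-in-variables correction for a single covariate. The function names, sample size, and noise levels are illustrative assumptions.

```python
import random


def simulate(n, beta, noise_sd, rng):
    """Draw n samples where the covariate x is observed only as z = x + noise."""
    zs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)               # clean covariate (never observed)
        z = x + rng.gauss(0.0, noise_sd)      # noisy observed covariate
        y = beta * x + rng.gauss(0.0, 0.5)    # response depends on the clean x
        zs.append(z)
        ys.append(y)
    return zs, ys


def naive_estimate(zs, ys):
    """Least squares that ignores covariate noise; shrinks toward zero."""
    return sum(z * y for z, y in zip(zs, ys)) / sum(z * z for z in zs)


def corrected_estimate(zs, ys, noise_var):
    """Least squares with the known noise variance subtracted from the Gram term."""
    n = len(zs)
    gram = sum(z * z for z in zs) - n * noise_var
    return sum(z * y for z, y in zip(zs, ys)) / gram


rng = random.Random(0)
beta = 2.0        # assumed true parameter
noise_sd = 1.0    # assumed known standard deviation of the covariate noise
zs, ys = simulate(20000, beta, noise_sd, rng)

naive = naive_estimate(zs, ys)
corrected = corrected_estimate(zs, ys, noise_sd ** 2)
print(f"true beta = {beta}, naive = {naive:.3f}, corrected = {corrected:.3f}")
```

With unit-variance covariates and unit-variance noise, the naive estimate should land near half the true value, while the corrected one should land near the true value.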
Social Network Analysis
- Golnari, G., Asiaee T., A., Banerjee, A., & Zhang, Z.-L. (2015). Revisiting Non-Progressive Influence Models: Scalable Influence Maximization in Social Networks. In 31st Conference on Uncertainty in Artificial Intelligence (UAI) (pp. 316–325).
@inproceedings{gabz15, title = {Revisiting Non-Progressive Influence Models: Scalable Influence Maximization in Social Networks.}, author = {Golnari, Golshan and Asiaee T., Amir and Banerjee, Arindam and Zhang, Zhi-Li}, booktitle = {31st Conference on Uncertainty in Artificial Intelligence (UAI)}, pages = {316--325}, year = {2015} }
Influence maximization in social networks has been studied extensively in the computer science community for the last decade. However, almost all of the efforts have been focused on the progressive influence models, such as independent cascade (IC) and linear threshold (LT) models, which cannot capture the reversibility of choices. In this paper, we present the Heat Conduction (HC) model, which is a non-progressive influence model and has favorable real-world interpretations. Moreover, we show that HC unifies, generalizes, and extends the existing non-progressive models, such as the Voter model (Even-Dar & Shapira, 2007) and non-progressive LT (Kempe et al., 2003). We then prove that selecting the optimal seed set of influential nodes is NP-hard for HC but, by establishing the submodularity of influence spread, we can tackle the influence maximization problem with a scalable and provably near-optimal greedy algorithm. To the best of our knowledge, we are the first to present a scalable solution for influence maximization under the non-progressive LT model, as a special case of the HC model. In sharp contrast to the other greedy influence maximization methods, our fast and efficient C2Greedy algorithm benefits from two analytically computable steps: closed-form computation for finding the influence spread as well as the greedy seed selection. Through extensive experiments on several large real and synthetic networks, we show that C2Greedy outperforms the state-of-the-art methods, under the HC model, in terms of both influence spread and scalability.
- Asiaee T., A., Afshar, M., & Asadpour, M. (2013). Influence maximization for informed agents in collective behavior. In Distributed Autonomous Robotic Systems (pp. 389–402). Springer.
@incollection{asaa13, title = {Influence maximization for informed agents in collective behavior}, author = {Asiaee T., Amir and Afshar, Mohammad and Asadpour, Masoud}, booktitle = {Distributed Autonomous Robotic Systems}, pages = {389--402}, year = {2013}, publisher = {Springer} }
Control of collective behavior is an active topic in biology, social, and computer science. In this work we investigate how a minority of informed agents can influence and control the whole society through local interactions. The problem we specifically target is that a minority of people with a bounded budget for initiating new social relations attempt to control the collective behavior of a society and move the crowd toward a specific goal. Assuming that local interactions can only take place between friends, the minority has to initiate some new relations with the majority. The total cost of new relations is limited to a budget. The problem is then finding the optimal links in order to gain maximum impact on the society. We will model the problem as a diffusion process in a social network. The proof of NP-hardness of the problem for Local Interaction Game model of diffusion is presented. Simulations show that the proposed method surpasses the popular strategies based on degree and distance centrality in performance.
- Asiaee T., A., Tepper, M., Banerjee, A., & Sapiro, G. (2012). If you are happy and you know it... tweet. In 21st ACM international conference on Information and knowledge management (CIKM) (pp. 1602–1606). ACM.
@inproceedings{atbs12, title = {If you are happy and you know it... tweet}, author = {Asiaee T., Amir and Tepper, Mariano and Banerjee, Arindam and Sapiro, Guillermo}, booktitle = {21st ACM international conference on Information and knowledge management (CIKM)}, pages = {1602--1606}, year = {2012}, organization = {ACM} }
Extracting sentiment from Twitter data is one of the fundamental problems in social media analytics. Twitter’s length constraint renders determining the positive/negative sentiment of a tweet difficult, even for a human judge. In this work we present a general framework for per-tweet (in contrast with batches of tweets) sentiment analysis which consists of: (1) extracting tweets about a desired target subject, (2) separating tweets with sentiment, and (3) setting apart positive from negative tweets. For each step, we study the performance of a number of classical and new machine learning algorithms. We also show that the intrinsic sparsity of tweets allows performing classification in a low dimensional space, via random projections, without losing accuracy. In addition, we present weighted variants of all employed algorithms, exploiting the available labeling uncertainty, which further improve classification accuracy. Finally, we show that spatially aggregating our per-tweet classification results produces a very satisfactory outcome, making our approach a good candidate for batch tweet sentiment analysis.
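The random-projection idea mentioned in the abstract can be illustrated with a small sketch. It assumes a dense Gaussian projection matrix and a toy bag-of-words encoding; the paper's actual projection scheme and features are not stated here, so every name and size below is hypothetical. The point is only that distances between sparse vectors are roughly preserved after projecting to far fewer dimensions.

```python
import random


def make_projection(d, k, rng):
    """Return a k x d Gaussian matrix scaled so squared norms are preserved in expectation."""
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, sparse_vec):
    """Project a sparse vector (dict: index -> value) to k dense coordinates."""
    return [sum(row[j] * v for j, v in sparse_vec.items()) for row in matrix]


def sq_dist_sparse(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return sum((a.get(j, 0.0) - b.get(j, 0.0)) ** 2 for j in keys)


def sq_dist_dense(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
d, k = 2000, 200  # assumed vocabulary size and projected dimension

# Two toy "tweets" with 10 words each and no words in common.
idx = rng.sample(range(d), 20)
tweet_a = {j: 1.0 for j in idx[:10]}
tweet_b = {j: 1.0 for j in idx[10:]}

R = make_projection(d, k, rng)
original = sq_dist_sparse(tweet_a, tweet_b)
projected = sq_dist_dense(project(R, tweet_a), project(R, tweet_b))
ratio = projected / original
print(f"original = {original:.1f}, projected = {projected:.2f}, ratio = {ratio:.2f}")
```

A ratio close to 1 indicates that the geometry a distance-based classifier relies on survives the tenfold reduction in dimension.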
General Machine Learning
- Asiaee T., A., Goel, H., Ghosh, S., Yegneswaran, V., & Banerjee, A. (2018). Time Series Deinterleaving of DNS Traffic. In 1st Deep Learning and Security Workshop (DLS).
@inproceedings{aggy18, title = {Time Series Deinterleaving of DNS Traffic}, author = {Asiaee T., Amir and Goel, Hardik and Ghosh, Shalini and Yegneswaran, Vinod and Banerjee, Arindam}, booktitle = {1st Deep Learning and Security Workshop (DLS)}, year = {2018} }
Stream deinterleaving is an important problem with various applications in the cybersecurity domain. In this paper, we consider the specific problem of deinterleaving DNS data streams using machine-learning techniques, with the objective of automating the extraction of malware domain sequences. We first develop a generative model for user request generation and DNS stream interleaving. Based on these we evaluate various inference strategies for deinterleaving including augmented HMMs and LSTMs on synthetic datasets. Our results demonstrate that state-of-the-art LSTMs outperform more traditional augmented HMMs in this application domain.