This part of the project is responsible for analyzing and comparing multiple binary classification models for predicting the infection status of cells. We collect, compare, and analyze three publicly available single-cell RNA sequencing datasets. Many different models can be trained and many parameters tuned, but we focus on the approaches most likely to transfer well to new data. The tests indicate that the Linear Support Vector Classification model and the Feed-Forward Neural Network model are the most robust and the least prone to overfitting. We explore the underlying nature of the data and the obstacles it poses to improving model accuracy. We also identify the features that contribute the most to the classification and discuss possible biological implications.
Single-cell RNA sequencing is an important technique that has contributed immensely to the understanding of the heterogeneity of biological responses to external influences, as highlighted in the review by Gupta & Kuznicki (2020). One example of such an influence is an infection, in our case with Herpes simplex virus 1 (HSV-1), which is known to change the cellular RNA content on several levels (Wyler et al., 2019).
Single-cell sequencing creates datasets with thousands of data points (a.k.a. cells) and many variables (a.k.a. genes) that can be utilized in creating prediction models. Usually, researchers use single-cell sequencing results to highlight heterogeneity effects and explore different cell populations (Cao et al., 2022). The idea of our project is to create a model that would differentiate between infected and uninfected cells and test it with the data produced by our Wetlab team.
Ideally, we would use data from the same cell type, because differences in baseline transcription between cell types could already drastically affect the prediction results. After searching open-access databases, we identified only three datasets with single-cell sequencing data from human cells. There are also a few datasets with bulk sequencing data, which cannot be used for training predictive models but are helpful for comparison with the bulk data produced by our team. None of the available datasets used HeLa cells, which our Wetlab team uses. That means that if we used only one dataset to create the model, it would most probably fail to transfer well to our data. Instead, we can combine the existing datasets to try to make the model more robust.
Binary classification is a widespread and extensively researched task, and numerous methods can be applied to it, such as Logistic Regression, Support Vector Machines, Neural Networks, and others, described and compared in detail in Bahel et al. (2020). The algorithms include linear and non-linear approaches as well as probabilistic, eager, and lazy learners, and choosing the right one is not a trivial task. Some guidelines help narrow down the list of algorithms based on characteristics of the data, such as the number of data points (observations), but overall it is recommended to try out several methods to see which one is better suited. Algorithms can also be further tweaked and boosted to improve their performance. It is not feasible to cover all possible algorithms and modifications, so we limit this work to the most well-known algorithms - Linear Support Vector Classification, Decision Tree Classifier, Random Forest Classifier, Gaussian Naive Bayes, K-Neighbors Classifier, and, most importantly, a Feed-Forward Neural Network, for which we explore a range of parameters.
We utilize accuracy, recall, and precision as quality and comparison metrics.
Additionally, we perform a meta-analysis of the dataset to investigate the relationship between datasets and explain possible limits of classification models on the data.
Support Vector Machine is an algorithm that attempts to find a hyperplane in an M-dimensional space (M being the number of features, or genes in our case) that distinctly separates the data points. It is often compared to logistic regression, but it is based on geometrical rather than statistical properties and is less prone to overfitting (Ghosh et al., 2019). We use the version with a linear kernel, which is advised for a large feature space and an intermediate number of observations.
Decision Trees are a non-parametric algorithm that learns simple decision rules from the data to separate it into classes (Hastie et al., 2009). It is a useful algorithm that does not require heavy data preparation and is easily interpretable, but it is not stable, since even a small variance in the data can produce very different trees. That can lead to problems with generalization, which is also a concern in our case.
The problem of overfitting with Decision Trees is usually addressed with Random Forest Classifiers, which consist of multiple decision trees trained on subsamples of the data to average out the variance between trees (Louppe, 2014). Another interesting feature of this classifier is that, when building a tree, not all features are necessarily considered when creating a splitting rule.
We keep both methods so that we can observe whether the Random Forest Classifier offers a significant improvement over a single Decision Tree.
Naive Bayes is an algorithm based on Bayes' theorem and is of immense use when dealing with high-dimensional input, that is, a high number of features. However, the algorithm assumes strong independence between these features. The Gaussian variant of the method is designed for continuous data, which is assumed to follow a normal distribution (Zhang, 2004). These two assumptions allow for elegant model fitting, but accuracy can suffer when the data distribution is far from normal.
Nearest Neighbor Classification relies on a majority vote of the neighbors of each data point to assign its class. It is very different from the other algorithms because it does not create an internal model but simply stores the training data (Peterson, 2009).
This method is advantageous because it makes no assumptions about the data, but at the same time it can struggle with high-dimensional data due to the curse of dimensionality, which in simple terms means that distance carries very little meaning in higher dimensions, as explained in Aggarwal et al. (2001).
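As an illustration, the following minimal sketch shows how these five classifiers can be instantiated with their scikit-learn default parameters; the toy data only stands in for the real cells-by-genes matrix and is not part of the actual analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the real cells-by-genes matrix (rows = cells, columns = genes).
X, y = make_classification(n_samples=500, n_features=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "Linear Support Vector Classification": LinearSVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Neighbors Classifier": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                   # learn from the training cells
    print(name, clf.score(X_test, y_test))      # mean accuracy on held-out cells
```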
The network architecture is composed of multiple (hidden) layers, each of which receives a weighted input and provides a transformed version of it to the next layer. The layers apply linear and non-linear transformations, and there is a large array of possible hyperparameters that can be adjusted for better performance. Different classes of architecture arrange their layers in different structures, which makes choosing the correct architecture a crucial step in building a successful model. It is important to know that the optimal choice of architecture depends on the type of data and not necessarily on the task of the model (Goodfellow et al., 2016).
That is why we build and analyze the performance of Deep Feed-Forward Networks - the basic type of neural network, whose architecture consists of multiple Dense layers and in which information flows from the input layer through the hidden layers to the output layer with no feedback connections. This network is schematically presented in Figure 1.
Even though it might appear simple, this network still offers a myriad of parameters that can be tested and optimized, such as the number of layers, the number of neurons per layer, the optimizer settings, regularizers, the number of epochs, the batch size, the use of dropout, and so on.
There is no widely accepted algorithm for choosing such parameters, and many of these decisions are based on experience, values reported in other papers, and common heuristics. For example, the optimizer is the function that updates the weights of the neural network; Adam, introduced by Kingma & Ba (2014), is a popular choice, which is why we keep this parameter fixed. Parameter optimization can have minuscule or noticeable effects on the results, but generally it is hard to go from 50 percent to 99 percent accuracy just by tweaking parameters. That is why we define a set of parameters that we believe have a bigger impact on the most common problem with neural networks - overfitting. We want to ensure the generalization of the model, meaning that it performs well enough on all datasets, including the ones it was not trained on.
Principal Component Analysis (PCA) is a linear transformation method for dimensionality reduction that makes use of an orthogonal transformation to maintain the symmetry of the inner product of the data (Abdi & Williams, 2010). PCA not only decreases the number of dimensions but also maximizes the variance of the data and reduces noise. The reduced data can then be clustered or further reduced by other methods.
Uniform Manifold Approximation and Projection (UMAP), presented by McInnes et al. (2018), is another dimensionality reduction and visualization technique. Whereas PCA is usually used to reduce the data to 10-30 dimensions, UMAP reduces it further to 2 dimensions that can be easily visualized. It is a relatively new technique that is gradually replacing t-Distributed Stochastic Neighbor Embedding (t-SNE); like t-SNE, it performs a non-linear reduction. The intuitive explanation of UMAP is that it constructs a graph representation of the high-dimensional data and then optimizes a low-dimensional graph to have a structure as similar as possible to the high-dimensional one.
Accuracy, precision, and recall are among the most popular metrics for evaluating the performance of classifiers and are well described in the article by Powers (2020). They work well together and cover each other's blind spots. Accuracy measures how many of the data points were assigned the correct label; precision indicates how good the model is at predicting a specific class - in our case infected cells - by calculating the fraction of true positives among all positive predictions (true and false positives). Recall shows how often the model detects the specific class, i.e., the fraction of true positives among all actually positive cases.
Figure 2 offers a visual explanation of the metrics.
A well-trained model would score highly on all three metrics, and deviations between them could reveal a bias.
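For reference, here is a minimal sketch of how the three metrics can be computed with scikit-learn; the label vectors are made up for illustration (1 = infected, 0 = uninfected).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = infected (the positive class), 0 = uninfected.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # correct labels / all cells
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
```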
Table 1 provides an overview of the most important packages used to prepare and analyze data. The analysis was performed in Python.
| Package | Version | Usage | Reference |
| --- | --- | --- | --- |
| Scanpy - Single-Cell Analysis in Python | 1.19 | Data preprocessing, PCA, UMAP | Wolf et al. (2018) |
| GSEAPY - Gene Set Enrichment Analysis in Python | 0.13.0 | Pathway enrichment analysis | Fang (2020) |
| Scikit-learn | 1.1.1 | Classification methods, metrics, KFold | Pedregosa et al. (2011) |
| Keras | 2.4.3 | Feed-forward neural network, autoencoder | Chollet (2015) |
The data was produced by Wyler et al. (2019), Rybak-Wolf et al. (2021), and Drayman et al. (2019). The Wyler data is single-cell sequencing of primary normal human dermal fibroblast cells, uninfected and infected, harvested at three different time points (1, 3, and 5 hours post-infection) in two replicates. The Rybak-Wolf dataset used human induced pluripotent stem cells and much later time points (3 and 6 days post-infection). The Drayman dataset contains normal neonatal primary dermal fibroblast cells infected for 5 hours.
From the Wyler dataset, we kept cells of two types - uninfected and infected with 5 hours of incubation. We kept both replicates to diversify the data. The first vital step is the elimination of outliers that could easily skew the data. Figure 3 shows the number of genes with at least 1 count in a cell against the total counts per cell. As can be seen, more counts per cell allow us to identify more genes, but the cells with the highest counts are too few to have a meaningful impact on the analysis. That is why we filter out the cells with more than 6 thousand expressed genes. We also filtered out the cells with fewer than 200 expressed genes, as those are probably technical artifacts. Before training, another important step is to remove the features with zero counts, i.e., genes that are not expressed in any cell.
The counts were subsequently normalized and log-transformed to prevent highly active genes from having the biggest weight in the model. After that, we identified the most variable genes and used them as variables for dimensionality reduction and clustering. Principal Component Analysis was used for dimensionality reduction, and Uniform Manifold Approximation and Projection was used for visualization in 2D.
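A hedged Scanpy sketch of this preprocessing is shown below; the file name, the "condition" column, and the values not stated above (target_sum, n_top_genes, n_comps) are illustrative assumptions rather than the exact notebook settings.

```python
import scanpy as sc

adata = sc.read_h5ad("wyler_raw_counts.h5ad")     # hypothetical input file

# Outlier removal: drop cells with too few or too many expressed genes,
# then drop genes that are not expressed in any cell.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=6000)
sc.pp.filter_genes(adata, min_cells=1)

# Normalize per cell and log-transform so highly expressed genes do not dominate.
sc.pp.normalize_total(adata, target_sum=1e4)      # target_sum is an assumed value
sc.pp.log1p(adata)

# Highly variable genes, PCA for dimensionality reduction, UMAP for 2D visualization.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)   # n_top_genes is an assumed value
sc.pp.pca(adata, n_comps=30)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["condition", "n_genes"])  # "condition" is a hypothetical label column
```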
Figure 4 displays the cells embedded in 2D and colored by specific characteristics. Figure 4.2 uses different colors for the different cell groups - infected vs. uninfected in two different replicates. Red and green (infected) cluster together and are clearly separated from blue and orange (uninfected). An interesting observation is that the infected cells show no clear separation between replicates, which could point to similar cellular behavior under infection, whereas the uninfected cells can still be separated by replicate, showing how much influence the cellular baseline can have on the data.
Figure 4.1 also shows that the number of expressed genes has a high impact on the clustering, with cells having a low gene count (less than 1000) clustering together. Here we could try raising the lower filter on gene count to remove the blob on the right and ensure that the number of expressed genes does not outweigh individual genes. Figure 4.3 offers quality control with a housekeeping gene, confirming that no cells cluster with extreme behavior.
Figure 5 displays the data after filtering out the cells with fewer than 1000 expressed genes. Even though the extra cluster of cells is gone, we also lose the clearer separation between samples. That is why we proceed with all cells, including those with fewer than 1000 expressed genes.
A similar analysis is performed on the other two datasets - Rybak and Drayman.
For Rybak, we set the upper filter for the number of genes at 8000.
In Figure 7, the separation between control and infected cells is also visible, but it is not as clear as in the Wyler dataset, which could indicate potential effects on training.
For the Drayman dataset, we set the filter at 4000 genes. In Figure 9 we observe a clear separation between infected and non-infected cells and no visible bias related to the number of expressed genes in the first panel.
This first preliminary analysis shows that the data has a clear separation, and thus models trained on just one dataset should come close to 100 percent accuracy when tested on new cells from the same dataset.
However, the question of model robustness requires a more in-depth analysis.
In this section, we first take a look at the characteristics of the datasets and the implications for classification, and then provide results and analysis of the different classification techniques. We test the majority of the classification methods with their default parameters, but for the neural network we explore a range of different parameters, which is why it is separated into its own subchapter.
The meta-analysis of the available datasets is the most important part of the analysis, as it reveals relationships between the datasets that can influence the classification. First, let's take a look at the simple data parameters. The datasets have different numbers of observations - from 7 to 20 thousand cells - and feature numbers ranging from 20 to 50 thousand genes. Since we are required to train on features that are shared by every dataset, we end up with fewer than 20 thousand genes. That means that some of the meaningful genes for one dataset will not necessarily be included in the final feature set; however, the number of remaining features should still compensate for that. All datasets have a roughly 50/50 split of observations between the classes, which means that the accuracy will not be skewed. If one of the classes were overrepresented, which is often the case, the accuracy metric could be overconfident, but we have well-balanced datasets.
Since our goal is to develop a model that generalizes well, we need to analyze how plausible this is. The data fed to the trained model should be similar to the data it was trained on. Similarity can be looked at from different perspectives: one is biological, in the sense that the data show similar tendencies; the other concerns the underlying characteristics of the data, such as its distribution.
We can measure the similarity of the datasets with the help of various metrics. The datasets could in principle be compared line by line (cell by cell), but this is computationally infeasible. Another problem is the different sizes of the feature spaces. The following process is adapted from the column by James D. McCaffrey (2021). The differing feature numbers can be handled with a dimensionality reduction technique such as a deep neural autoencoder.
The same autoencoder architecture is trained on all three datasets. Then all data points are converted to the latent space using the encoder part of the model. Now each cell is represented not by 20 to 50 thousand features but by a vector of 10 variables. That still leaves us with datasets of up to 20 thousand vectors each. To reduce this further, for each dataset we create a frequency distribution for each variable of the latent representation. For example, the first variable has the following distribution (the values in the latent representation are rescaled to the 0 to 1 range):
| Latent value range | 0-0.1 | 0.1-0.2 | 0.2-0.3 | 0.3-0.4 | 0.4-0.5 | 0.5-0.6 | 0.6-0.7 | 0.7-0.8 | 0.8-0.9 | 0.9-1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fraction of cells | 3% | 18% | 28% | 21% | 15% | 8% | 8% | 1% | 1% | 1% |
So, given a vector of 10 variables, we get a 10 by 10 matrix for each dataset that describes the dataset from another perspective but still carries information about the underlying data distribution.
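A sketch of how such a frequency matrix could be built from the encoded cells, assuming the encoder has already produced a latent matrix of shape (cells, 10); the random data below only stands in for a real latent representation.

```python
import numpy as np

def frequency_matrix(latent, n_bins=10):
    """Turn an (n_cells, 10) latent matrix into a (10, n_bins) frequency matrix.

    Each latent variable is rescaled to [0, 1] and summarized by the fraction
    of cells falling into each of the n_bins equal-width intervals.
    """
    rows = []
    for j in range(latent.shape[1]):
        values = latent[:, j]
        values = (values - values.min()) / (values.max() - values.min() + 1e-12)
        counts, _ = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
        rows.append(counts / counts.sum())        # relative frequency per bin
    return np.vstack(rows)

# Example with a hypothetical latent representation of 20,000 cells.
latent = np.random.default_rng(0).normal(size=(20000, 10))
fm = frequency_matrix(latent)
print(fm.shape)   # (10, 10)
```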
Now we can compute the Wasserstein distance between these frequency distributions. The Wasserstein distance is a function defined between probability distributions on a given metric space M (Vaserstein, 1969). The Wasserstein distance alone will not tell us how close the datasets are, so we need to create a reference scale for comparison.
For that, we take one dataset and gradually substitute first 1, then 2, then 3, and so on, of its rows with random values. For each substitution, the experiment is repeated several times and the mean distance is taken. This gives us the reference scale shown in Figure 10.
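One possible sketch of the distance computation and the random-substitution scale, using SciPy's one-dimensional Wasserstein distance on corresponding rows of two frequency matrices (fm as built in the previous sketch); the exact averaging and substitution scheme used in the project may differ.

```python
import numpy as np
from scipy.stats import wasserstein_distance

bin_centers = np.linspace(0.05, 0.95, 10)   # centers of the ten 0.1-wide bins

def dataset_distance(fm_a, fm_b):
    """Mean 1D Wasserstein distance between corresponding rows of two frequency matrices."""
    return np.mean([
        wasserstein_distance(bin_centers, bin_centers, u_weights=row_a, v_weights=row_b)
        for row_a, row_b in zip(fm_a, fm_b)
    ])

def reference_point(fm, n_random, rng):
    """Replace n_random rows of fm with random frequency vectors and measure the distance."""
    noisy = fm.copy()
    idx = rng.choice(len(fm), size=n_random, replace=False)
    random_rows = rng.random((n_random, fm.shape[1]))
    noisy[idx] = random_rows / random_rows.sum(axis=1, keepdims=True)
    return dataset_distance(fm, noisy)

# Usage sketch: rng = np.random.default_rng(0); reference_point(fm, 3, rng)
```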
The Wasserstein distance between Wyler and Rybak is 0.099, between Wyler and Drayman 0.057, and between Rybak and Drayman 0.058. As we can infer from the scale, Wyler and Rybak have a dissimilarity of about 20 percent, whereas the other two pairings show only about 10 percent. This indicates that transferring a trained model onto new data should be possible, given that the tendencies in the data are similar enough.
Another way of looking at the general similarity between datasets is to explore the enriched pathways among the genes with high variability in each dataset. Even though the starting state of the cells can differ, the pathways activated or down-regulated by the infection should, in theory, be similar.
The pathway enrichment was performed with GSEApy. Figure 11 and Figure 12 show the significantly up-regulated pathways for the Wyler and Rybak datasets, respectively. Drayman does not have any significantly enriched pathways. Only the Wyler dataset shows the expected pathway - HSV-1 infection - whereas the others lack enrichment in related pathways, which points to likely problems with classification.
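A minimal GSEApy sketch of one way to run such an enrichment, using the Enrichr over-representation interface; the gene list and the gene-set library shown here are illustrative and not necessarily those behind Figures 11 and 12.

```python
import gseapy as gp

# Hypothetical list of highly variable genes extracted from one dataset.
hv_genes = ["RASD1", "COL6A2", "HSPA1B", "CEBPB"]

enr = gp.enrichr(
    gene_list=hv_genes,
    gene_sets=["KEGG_2021_Human"],   # assumed library; other libraries work the same way
    organism="human",
    outdir=None,                     # do not write result files to disk
)
print(enr.results[["Term", "Adjusted P-value"]].head())
```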
First, let's take a look at the models built from a single dataset - Wyler (2019).
This dataset contains more than 50 thousand features. Models can be built to include all of these features or only the highly variable genes. If we were to use only the highly variable genes, the model would not transfer to data with a different set of highly variable genes. The idea is therefore to include all possible features and see how well the model potentially transfers to other datasets.
Reproducibility of the results is extremely important, which is why we use the K-Fold training and testing method - the data is split into subsets and the model is trained and tested independently several times (in our case 5), and then the mean of each metric is taken. The same process is repeated for every model and for every experimental setup, i.e., every dataset combination.
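A hedged sketch of this K-Fold evaluation loop with scikit-learn; the toy data and the shuffling choice are assumptions, but the 5 splits and the averaging of the three metrics follow the description above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = make_classification(n_samples=1000, n_features=300, random_state=0)  # toy stand-in

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {"accuracy": [], "precision": [], "recall": []}

for train_idx, test_idx in kf.split(X):
    model = LinearSVC()                      # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    scores["accuracy"].append(accuracy_score(y[test_idx], y_pred))
    scores["precision"].append(precision_score(y[test_idx], y_pred))
    scores["recall"].append(recall_score(y[test_idx], y_pred))

for metric, values in scores.items():
    print(metric, round(np.mean(values), 3))   # mean over the 5 folds
```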
Table 2. Classification results for models trained and tested on the Wyler dataset.

| Model | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Linear Support Vector Classification | 99.8% | 99.8% | 99.9% |
| Decision Tree Classifier | 99.3% | 99.3% | 99.2% |
| Random Forest Classifier | 99.1% | 99.2% | 99.0% |
| Gaussian Naive Bayes | 85.5% | 99.5% | 78.4% |
| K-Neighbors Classifier | 91.4% | 85.9% | 97.3% |
As expected after the clear separation of the data in the UMAP plot, several models achieve very high results in all three metrics, as indicated in Table 2. The diminished accuracy of the Gaussian Naive Bayes model suggests that the data violates one of its two assumptions - most probably it does not follow a Gaussian (normal) distribution. The K-Neighbors Classifier has more trouble with misidentifying uninfected cells as infected compared to the other classifiers, but still does a good job of identifying most infected cells correctly.
The next step is to test how well a model trained on one type of data performs on another dataset. This experiment is repeated twice, once with data from Rybak and once with Drayman. Since not all features overlap between datasets, we can only take the common features, which reduces their number from 50 thousand to 20 thousand. This alone could already have an impact on accuracy and the other metrics.
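A short sketch of how the shared feature space can be built before cross-dataset testing, using toy AnnData objects in place of the real datasets (gene names live in .var_names).

```python
import numpy as np
import pandas as pd
import anndata as ad

# Toy stand-ins for two datasets with partially overlapping gene sets.
adata_a = ad.AnnData(np.random.rand(4, 3), var=pd.DataFrame(index=["RASD1", "COL6A2", "CEBPB"]))
adata_b = ad.AnnData(np.random.rand(5, 3), var=pd.DataFrame(index=["COL6A2", "CEBPB", "HSPA1B"]))

# Keep only the genes present in both datasets, in the same column order.
shared_genes = sorted(set(adata_a.var_names) & set(adata_b.var_names))
adata_a_common = adata_a[:, shared_genes].copy()
adata_b_common = adata_b[:, shared_genes].copy()
print(len(shared_genes), "shared genes:", shared_genes)
```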
Table 3. Models trained on the Wyler dataset and tested on the Rybak dataset.

| Model | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Linear Support Vector Classification | 80.8% | 85.6% | 78.4% |
| Decision Tree Classifier | 85.4% | 94.8% | 80.5% |
| Random Forest Classifier | 78.3% | 94.3% | 71.6% |
| Gaussian Naive Bayes | 60.9% | 90.0% | 57.0% |
| K-Neighbors Classifier | 61.2% | 89.5% | 57.4% |
Table 4. Models trained on the Wyler dataset and tested on the Drayman dataset.

| Model | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Linear Support Vector Classification | 70.2% | 82.4% | 67.0% |
| Decision Tree Classifier | 57.4% | 58.0% | 58.4% |
| Random Forest Classifier | 51.1% | 91.8% | 51.1% |
| Gaussian Naive Bayes | 46.1% | 55.5% | 47.5% |
| K-Neighbors Classifier | 49.2% | 94.8% | 50.0% |
The overall tendency is as expected: consistently lower results in all three metrics. However, these results offer several interesting insights. The Rybak data seem to be more similar to the Wyler data than the Drayman data when we compare the results from Tables 3 and 4, even though they have a bigger Wasserstein distance between them. This probably means that the data are closer from a biological perspective and the models are able to identify some relevant variables that behave similarly in those two datasets. For the Rybak dataset, all models have higher precision but lower recall, meaning that an "infected" label is more reliable than an "uninfected" one.
A similar but more pronounced trend is observed on the Drayman dataset for the following models: Linear Support Vector Classification, Random Forest Classifier, and K-Neighbors Classifier. If a cell is predicted to be infected, the confidence of this prediction is high, but if the label is uninfected, it is closer to random guessing.
The next step is to train on two datasets - Wyler and Rybak - and test on the Drayman dataset (since it performed worse). The hypothesis is that combining datasets makes the models more robust.
Table 5. Models trained on the combined Wyler and Rybak datasets and tested on the Drayman dataset.

| Model | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Linear Support Vector Classification | 72.6% | 78.9% | 70.8% |
| Decision Tree Classifier | 56.7% | 70.8% | 56.0% |
| Random Forest Classifier | 54.3% | 96.6% | 52.8% |
| Gaussian Naive Bayes | 47.5% | 50.6% | 48.4% |
| K-Neighbors Classifier | 51.0% | 99.5% | 51.0% |
In Table 5 we can observe, for most models and metrics, a small improvement of several percent after combining two datasets for training. We can be reasonably confident it is not a fluke, because the results are averaged over 5 independent test runs.
Linear Support Vector Classification consistently shows the best results when all three metrics are considered.
We use the datasets downloaded from the Rybak and Wyler papers to train the model and the dataset used in the Drayman paper to evaluate it. Each training and evaluation is done with K-Fold, so each training was repeated 5 times to ensure the stability of the results.
As described previously, there is a myriad of parameters that can be tested and optimized. The following parameters were chosen after consulting the literature and appear to be the best fit (a minimal model sketch using these settings is given after the notes below):
Optimizer: Adam.
Regularizer: L2.
Activation function of the last layer: Sigmoid.
Activation function of other layers: ReLU.
Loss function: binary cross-entropy.
Batch size: 64.
There is an interesting note to be made on the batch size. A literature and forum search turned up conflicting statements: some insist that smaller batch sizes tolerate a wider range of learning rates, while others state that increasing the batch size can help find the optimum faster than decaying the learning rate. The batch size also significantly affects the speed of training, which is why we set this parameter to a medium value, hoping to find a compromise between training time and identifying the optimal learning rate.
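Putting the listed choices together, here is a minimal Keras sketch of one such feed-forward architecture; the number of layers, units, dropout rate, L2 strength, epoch count, and the particular learning rate are illustrative, as these are exactly the knobs varied in the experiments below.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_genes = 20000   # size of the shared feature space (illustrative)

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(n_genes,),
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),    # binary output: infected vs. uninfected
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),   # one of the learning rates tested below
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()],
)

# Training sketch: model.fit(X_train, y_train, batch_size=64, epochs=20, validation_split=0.1)
```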
We analyzed the following modifications:
The learning rate is commonly viewed as one of the most important hyperparameters of neural networks, as explained by Smith (2017), which is why we try out 5 different learning rates, from 1e-06 to 1e-02, for all models.
| Learning rate | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| 1e-06 | 66.9% | 61.7% | 92.2% |
| 1e-05 | 70.6% | 65.4% | 90.5% |
| 1e-04 | 70.7% | 65.4% | 90.9% |
| 1e-03 | 70.4% | 66.3% | 87.0% |
| 1e-02 | 75.0% | 72.5% | 82.7% |
| Learning rate | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| 1e-06 | 71.1% | 65.6% | 91.6% |
| 1e-05 | 72.3% | 66.6% | 91.6% |
| 1e-04 | 72.1% | 66.5% | 91.8% |
| 1e-03 | 71.8% | 67.8% | 86.7% |
| 1e-02 | 73.2% | 68.5% | 87.9% |
| Learning rate | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| 1e-06 | 69.6% | 63.7% | 94.0% |
| 1e-05 | 69.6% | 64.0% | 93.5% |
| 1e-04 | 70.0% | 64.6% | 91.0% |
| 1e-03 | 73.1% | 69.4% | 85.5% |
| 1e-02 | 72.7% | 72.4% | 77.5% |
| Learning rate | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| 1e-06 | 71.3% | 65.8% | 91.0% |
| 1e-05 | 71.3% | 65.7% | 91.4% |
| 1e-04 | 70.8% | 65.1% | 92.0% |
| 1e-03 | 72.5% | 67.8% | 88.6% |
| 1e-02 | 72.2% | 68.5% | 84.7% |
The different neural network architectures achieve an accuracy from 69 to 75 percent, which even outperforms Linear Support Vector Classification by a few percent. However, another interesting trend is immediately visible in the data. Compared to the other classification models, the neural networks have high recall and reduced precision, meaning that they are more likely to detect infected cells but can also mislabel some non-infected cells as infected. Regarding the architecture of the networks, it seems that more layers and dropout improve the results by a percentage point or so (though the highest accuracy is observed in the simplest architecture). An increase in the learning rate is mostly correlated with increased accuracy and precision, but sacrifices recall.
There are also some fluctuations in the accuracy results of individual measurements, in some cases up to a 7 percent difference between runs with exactly the same parameters. This shows how essential it is to train the network multiple times to make sure that any improvement is not just a fluctuation.
Additionally, we selected one of the best runs and tried pruning the weights of the network, meaning that we deleted some of the weights of the trained model to see if this improves its accuracy on the Drayman dataset. Frankle & Carbin (2018) showed that pruning a network does not necessarily compromise the training accuracy, and there is evidence that pruning helps against overfitting, as demonstrated by Tian et al. (2019).
However, in our case pruning does not seem to meaningfully improve the performance of the network, yielding similar results: for the six-layer model with dropout and a learning rate of 1e-03, the average accuracy was 73.7%, precision 69.7%, and recall 86.2% - all within 2 percent of the regular training conditions. It does show a slight improvement in accuracy, but not a noteworthy one.
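For completeness, a hedged sketch of simple magnitude-based pruning of an already trained Keras model, zeroing the smallest weights in each Dense layer; this is only one way of deleting weights and is not necessarily the exact procedure used here.

```python
import numpy as np

def prune_dense_weights(model, fraction=0.2):
    """Set the smallest `fraction` of weights (by magnitude) in each Dense layer to zero.

    Assumes a model built from Dense and Dropout layers, as sketched above.
    """
    for layer in model.layers:
        params = layer.get_weights()
        if not params:                     # skip layers without weights (e.g. Dropout)
            continue
        kernel = params[0]
        threshold = np.quantile(np.abs(kernel), fraction)
        kernel[np.abs(kernel) < threshold] = 0.0
        params[0] = kernel
        layer.set_weights(params)
    return model

# Usage sketch: prune_dense_weights(trained_model, fraction=0.2), then re-evaluate on Drayman.
```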
An additional part of the analysis is the identification of the features that contributed the most to the Linear Support Vector Classification model.
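For a linear SVM, feature contributions can be read directly from the learned coefficients. A sketch, assuming clf is the fitted LinearSVC and gene_names is the matching list of gene symbols (both hypothetical variable names):

```python
import numpy as np

# Rank genes by the absolute value of their weight in the fitted LinearSVC.
weights = clf.coef_.ravel()                       # one weight per gene
top_idx = np.argsort(np.abs(weights))[::-1][:20]  # indices of the 20 strongest contributors
for i in top_idx:
    print(gene_names[i], round(weights[i], 4))    # sign shows the direction of the contribution
```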
RASD1 has been shown to be upregulated in infected cells and is responsible for reducing virion production (Wyler et al., 2019).
COL6A2 has an identified relationship with the virion host shutoff protein (Friedel et al., 2021). HSPA1B has been shown to be a binding partner of some viral and bacterial components (Carter, 2009).
However, there are no publications directly exploring the relationship between CEBPB and HSV-1. There is evidence that CEBPB is a regulator of some IFN-stimulated genes (Kalvakolanu & Gade, 2012), and the interferon system is extremely important as a viral defense system, especially considering that HSV-1 has ways to downregulate the IFN response (Danastas et al., 2020).
Our approach to the classification is to use two datasets as the training batch and test the models on data they have not seen before. We aim to explore how well models trained on different cell types and conditions transfer to new data. From a biological perspective, the cells already behave differently in the non-infected state, which could influence the accuracy of the prediction model. A model with decent accuracy could be used to analyze whether cells are infected or not; in our project, this could be applied to check whether cells with introduced siRNA would be classified as infected by our model. However, no classification model reached an accuracy higher than 75 percent. The two best approaches also behave differently with respect to precision and recall: Linear Support Vector Classification is more confident in not assigning the "infected" label to non-infected cells but can miss infected cells, whereas the Neural Network has a lower chance of missing an infected cell but also mislabels some non-infected cells. Depending on which bias is preferable, these classification models could be used for testing the cells created by our Wetlab team; however, due to monetary constraints, we could not proceed with single-cell sequencing of those cells.
The 75 percent accuracy could be a threshold that cannot be overcome because of the data itself rather than the models or parameter tuning. Evidence for this is the better accuracy obtained with the Rybak dataset as test data on the model trained exclusively on the Wyler dataset. In that case, no amount of further parameter tuning can improve the accuracy on the given data.
The number of available datasets is underwhelming for a good model and comprehensive testing. In the future, this project could be extended as more datasets become available. The structure of the code and notebooks allows for easy integration of new datasets into the analysis pipeline.
The role of CEBPB in the HSV-1 infection of cells could be examined more closely.