Research Post # 7

I did not write a post last week, so I am just going to update you on what I have been doing for the past two weeks.

Last time I mentioned that I used Node2Vec to generate 128-dimensional vectors for the nodes. The algorithm assigns similar vectors to nodes that are close to each other or highly connected. Previously, I fed these vectors into a tool called UMAP, which reduced them to two dimensions and visualized the data. Now, however, I am using these vectors to train machine learning models to predict whether a gene is disease-associated or not.
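
For anyone curious what that embedding step looks like in code, here is a minimal sketch, assuming the `node2vec` Python package and a placeholder graph; the graph, walk parameters, and seed are illustrative, not the project's actual settings.

```python
import networkx as nx
from node2vec import Node2Vec

# Placeholder graph standing in for the gene-interaction network
G = nx.fast_gnp_random_graph(n=100, p=0.05, seed=42)

# Run biased random walks, then fit a Word2Vec model over them
node2vec = Node2Vec(G, dimensions=128, walk_length=30, num_walks=10, p=1, q=1)
model = node2vec.fit(window=10, min_count=1)

# 128-dimensional vector for a node (node IDs are stored as strings)
vec = model.wv[str(0)]
print(vec.shape)  # (128,)
```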

I did some online research regarding commonly used classifiers, specifically k-nearest neighbors, multi-layer perceptron, support vector machines, decision trees, and naive Bayes.

To start, I decided to use a multi-layer perceptron (MLP) and achieved a 60% accuracy rate. Here's how an MLP works: there is an input layer, an output layer, and several "hidden layers" in between. Each hidden layer performs mathematical functions on its input and passes the result to the next hidden layer, and this continues until the output layer is reached. The math behind MLPs is advanced and requires calculus and linear algebra, so I will not go into it. An MLP also has more parameters than the other models.

Surprisingly, training the model wasn't very computationally intensive: it only took around 5 seconds on 30,000 edges (about 3,800 nodes). However, when I increased the sample size to 100,000 edges, the model actually decreased in accuracy, to about 40%. Typically, giving a machine learning model more data should increase accuracy, since it has more examples and more opportunities to find patterns. The accuracy likely decreased here because hyperparameter tuning for MLPs is difficult. I only performed basic tuning using GridSearchCV, a scikit-learn function that cycles through different combinations of model parameters and returns the combination with the best results. In reality, a successful MLP model requires much more extensive tuning, because there are so many different parameters that can be altered.
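
For reference, a basic MLP-plus-grid-search setup in scikit-learn might look like the sketch below; the placeholder data, grid values, and train/test split are assumptions for illustration, not the actual experiment.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# X: one 128-dimensional Node2Vec vector per gene; y: 1 = disease gene, 0 = not
# (random placeholders here, standing in for the real embeddings and labels)
rng = np.random.default_rng(0)
X = rng.normal(size=(3800, 128))
y = rng.integers(0, 2, size=3800)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A small grid over just a few of the MLP's many hyperparameters
param_grid = {
    "hidden_layer_sizes": [(64,), (128, 64)],
    "alpha": [1e-4, 1e-3],
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```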

Because of the difficulties in tuning the MLP model, I decided to move on to support vector machines (SVMs), which are much simpler than MLPs. Here is a good overview of SVMs. On the SVM, I achieved around a 62% accuracy rate, again using GridSearchCV for hyperparameter tuning.
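
The SVM version is nearly identical; a self-contained sketch with an illustrative hyperparameter grid (not the values I actually searched) might look like this:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Placeholder embeddings and labels, standing in for the real data
rng = np.random.default_rng(0)
X = rng.normal(size=(3800, 128))
y = rng.integers(0, 2, size=3800)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid over a few common SVM hyperparameters
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"], "gamma": ["scale", "auto"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```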

In addition to tuning the parameters of the models, I also experimented with different vector sizes and with variations of the random walk parameters on both models. (Note: since the MLP was doing worse with more edges, I only used 10,000 edges for the MLP model.) Increasing or decreasing the vector size produced no significant change in accuracy, and the same was true of changing the random walk parameters. I suspected that adding the other node features into the model would not increase accuracy, and when I tried adding publication count and ortholog count, the accuracy actually decreased to about 40%. This makes sense, because publication count and ortholog count are relatively arbitrary numbers with respect to the classification of the genes.
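
For concreteness, adding those extra node features just means concatenating them onto the embedding matrix as new columns. A minimal sketch with placeholder counts (the real values come from the dataset) is below.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3800, 128))                      # Node2Vec embeddings (placeholder)
publications = rng.integers(1, 500, size=(3800, 1))   # hypothetical publication counts
orthologs = rng.integers(0, 20, size=(3800, 1))       # hypothetical ortholog counts

# Append the two count columns to the 128-dimensional embeddings -> 130 features
X_augmented = np.hstack([X, publications, orthologs])
print(X_augmented.shape)  # (3800, 130)
```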

There are a few next steps. First, I am going to continue manipulating the random walk parameters, but rather than changing the p and q probabilities, I will increase the number of iterations. Next is applying cross-validation to the data (using a 70-20-10 split). Finally, I want to think about visualizing the results, although this is not a priority. By visualization, I mean comparing the model's predictions to the previously made visualization to see where it is making the largest errors and whether there are any patterns.
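
One way that 70-20-10 split could be implemented is with two chained calls to scikit-learn's train_test_split; this is just a sketch of the plan, not the project's final code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3800, 128))
y = rng.integers(0, 2, size=3800)

# First carve off 70% for training, then split the remaining 30% into 20%/10%
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # roughly 70/20/10 of the data
```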

So, that’s where I am going with this project. By the way, next week is technically my last week of the internship, but I will continue to work on this project, so I will continue to post updates. See you later.
