Research Post # 8

This was the last week of my summer internship at NYU. I have decided to continue working with the lab into the school year, so I will keep writing about my experiences and posting them here.

This week, at our weekly team meeting, I put together a deck about my project and presented it. It covered nearly everything from these posts, start to finish: I discussed the project's objective and significance, and then walked through the process I followed to reach the project goal: exploratory data analysis, preliminary data visualizations, node embedding visualizations, and the classifiers. You can read my previous blog posts to learn about each of these steps.

Ultimately, the next steps for this project are: trying other classifiers, applying Bayesian Optimization or GridSearchCV to optimize hyperparameters, creating a visualization of the whole dataset (and storing those embeddings), continuing to use cross validation for model evaluation, and visualizing the results to assess classifier performance. The first two steps are aimed at increasing the classifier's accuracy (a small sketch of a grid search is included below). Creating a visualization of the whole dataset and storing the embeddings is important so that different models can be compared on the same vectors and embeddings; it would also simply help to see the entire dataset visualized. Finally, cross validation is important for evaluating the model properly (specifics below), and visualizing the results matters for comparing models and deciding which one is best to use.
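To make the hyperparameter step concrete, here is a minimal sketch of a grid search over SVC parameters using scikit-learn's GridSearchCV. The names X_embeddings and y_labels are placeholders for the node-embedding vectors and their labels, and the parameter grid is only an illustrative guess, not the values I will actually use.

```python
# Hedged sketch: hyperparameter search with GridSearchCV.
# X_embeddings / y_labels are placeholders for the embedding vectors and node labels.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],          # regularization strength (illustrative values)
    "gamma": ["scale", 0.01, 0.1],   # RBF kernel width (illustrative values)
    "kernel": ["rbf"],
}

search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,                 # 5-fold cross validation inside the search
    scoring="accuracy",
)
search.fit(X_embeddings, y_labels)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```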

In the last post, I did not go over what cross validation (CV) is in detail, so here is a quick run-through. CV is a way to evaluate model performance more reliably and to guard against overfitting. Without CV, it is possible to tweak the parameters until the model performs optimally on one particular test set, which amounts to overfitting to that set. CV splits the training data into k folds; the model is trained on k-1 folds and then evaluated on the remaining fold. This is repeated for all k folds, so it can be computationally intensive, and the mean accuracy across the folds is reported. CV is important for presenting an accurate picture of a model's performance. To learn more about CV and its implementation, check this out. A minimal sketch of what this looks like in scikit-learn follows.
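Here is that sketch, using scikit-learn's cross_val_score and assuming placeholder arrays X_embeddings and y_labels for the embedding vectors and node labels:

```python
# Minimal sketch of k-fold cross validation with scikit-learn.
# X_embeddings / y_labels are placeholders for the embedding vectors and node labels.
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

clf = LinearSVC(max_iter=10000)

# Train on k-1 folds and evaluate on the held-out fold, for k=5 folds.
scores = cross_val_score(clf, X_embeddings, y_labels, cv=5, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

The function handles splitting the data, training on k-1 folds, and scoring on the held-out fold, and returns one score per fold.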

Apart from CV, I also implemented a kernel SVC in the hope that it would increase the model's accuracy. Prior to that, I had been using a linear SVC, which is restricted to a linear decision boundary, whereas a kernel SVC (for example, with an RBF kernel) can learn nonlinear boundaries. However, the kernel SVC did not lead to significantly higher accuracy. Next, I will implement a Random Forest Classifier and see whether it returns better results. I am also going to keep track of each classifier's results in a more formal way (until now I have been recording the results in code comments). Since I am trying so many models and comparing various parameters, it is important to track them formally and compare the results; a sketch of one way to do this follows.
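As a rough idea of what that more formal tracking could look like, here is a hedged sketch that runs the same 5-fold CV over a linear SVC, a kernel (RBF) SVC, and a Random Forest, and collects the scores in a small table. Again, X_embeddings and y_labels are placeholders, and the specific model settings are illustrative only.

```python
# Hedged sketch: compare several classifiers and record results in one table
# instead of code comments. X_embeddings / y_labels are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

models = {
    "Linear SVC": LinearSVC(max_iter=10000),
    "Kernel SVC (RBF)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

rows = []
for name, model in models.items():
    scores = cross_val_score(model, X_embeddings, y_labels, cv=5)
    rows.append({"model": name,
                 "mean_accuracy": scores.mean(),
                 "std": scores.std()})

results = pd.DataFrame(rows)
print(results.sort_values("mean_accuracy", ascending=False))
```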

Further, I will be visualizing the results, starting from the classifier comparison example offered by scikit-learn. If that approach does not adapt well to my data, I can always take the predictions, add them back to the dataset, and plot them using the already-generated vectors; overlaying predictions on the embeddings would let us see which data points are most frequently classified incorrectly. Either way, visualizing the results will help us see which data points the model most often misclassifies (e.g., peripheral nodes, hub nodes, disease nodes, etc.). A rough sketch of this overlay idea is below.
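The sketch below shows one way the overlay idea could work, assuming the 2D coordinates from the earlier embedding visualizations are available as an array named embedding_2d (this name, like X_embeddings and y_labels, is a placeholder): get out-of-fold predictions with cross_val_predict and color the points by whether they were classified correctly.

```python
# Hedged sketch: plot misclassified points on top of existing 2D embeddings.
# Assumes embedding_2d is an (n, 2) NumPy array of already-computed coordinates,
# and X_embeddings / y_labels are placeholder arrays as in the earlier sketches.
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Out-of-fold predictions, so every point is predicted by a model that never saw it.
y_pred = cross_val_predict(SVC(kernel="rbf"), X_embeddings, y_labels, cv=5)
misclassified = y_pred != y_labels

plt.scatter(embedding_2d[~misclassified, 0], embedding_2d[~misclassified, 1],
            c="lightgray", s=10, label="correct")
plt.scatter(embedding_2d[misclassified, 0], embedding_2d[misclassified, 1],
            c="red", s=20, label="misclassified")
plt.legend()
plt.title("Misclassified nodes on the 2D embedding")
plt.show()
```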

So, those are the major next steps; this project is expected to wrap up in the next month or so. After that, I will continue to post about the next project I am working on and keep you updated.
