Anusha Kumar

Research Post # 9

Hi everyone, it’s been a minute. I haven’t written for several weeks because I haven’t had much time given that school restarted and everything, but I have a ton of project updates.

So, we spoke with our collaborators at Johns Hopkins about the project, and it turns out we have been solving a different problem than the one intended. Over the summer, the genes fell into three categories: disease, non-disease, and unknown (non-OMIM), and we had been focused on building a classifier to sort the non-OMIM genes into disease or non-disease. However, OMIM is a manually curated database: researchers enter genes into it as they are studied. When the Johns Hopkins group coded the data scrape to find genes in publications across the internet, any gene that had never been entered into OMIM, or had been entered incorrectly, came back labeled non-OMIM. So "non-OMIM" just means a gene was mis-entered, overlooked, or never considered worth studying in the first place; it is not a meaningful label on its own.

The intended focus of the project was the non-disease category. Some genes entered as "non-disease" may actually be disease-associated, so "non-disease" is not a true label either. The real goal is to classify the "non-disease" genes into genuinely non-disease genes and disease genes. That turns the project into a one-class classification problem: the only trusted labels are the disease genes, and we have to classify everything labeled non-disease. The non-OMIM genes are not directly relevant anymore, but any method that can split the non-OMIM genes into "disease" and "non-disease" would also work on the non-disease genes, so solving one gives us the other.

Now that the problem statement has changed, there are a few ways to approach it. The first is to turn it into a regression problem: if we can figure out which disease genes are more confidently disease genes (based on attributes in the dataset), a regressor can predict the probability that a mutation in a gene results in disease. Another potential avenue is defining the non-OMIM genes as negative labels (so that we do not have only the "disease" label). However, this is likely not the best solution: we have no disease-association information about the non-OMIM genes (that is why they are non-OMIM), so they cannot really act as true negatives. We also discussed incorporating other features of the data that we had not been focusing on, such as the pLI score and publication count. pLI stands for the probability of being loss-of-function intolerant; it quantifies how likely it is that a gene cannot tolerate loss-of-function mutations, so a higher pLI score implies a lower tolerance to such mutations. Publication count can be used as a confidence metric for how certain we are that a gene is disease-associated.
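To make the probability-scoring idea concrete, here is a rough sketch. This is not our actual pipeline: the file name and the columns pLI, publication_count, and label are placeholders I made up for illustration, and treating the noisy "non-disease" labels as negatives during training is only a naive baseline.

import pandas as pd
from sklearn.linear_model import LogisticRegression

genes = pd.read_csv("genes.csv")                      # hypothetical gene table
features = ["pLI", "publication_count"]               # assumed feature columns

train = genes[genes["label"].isin(["disease", "non-disease"])]
model = LogisticRegression(max_iter=1000)
model.fit(train[features], (train["label"] == "disease").astype(int))

# Score the noisy "non-disease" genes: predict_proba gives P(disease) for each one
scored = genes[genes["label"] == "non-disease"].copy()
scored["p_disease"] = model.predict_proba(scored[features])[:, 1]
print(scored.sort_values("p_disease", ascending=False).head())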

Generally speaking, there are a ton of papers on disease gene classification and, more specifically, on one-class classification. I have read a few so far, but once I have a more concrete idea of how this project is actually moving forward, I will post an update.

By the way, we have also been finding that label propagation has actually been performing better than the classifiers run on the node2vec embeddings. That is interesting because it has a few possible implications: the labels may depend more on localized graph structure than on overall graph structure; the classifiers run on the embeddings may need more hyperparameter tuning; and the skew in the data is likely hurting the embedding-based classifiers more, since they consider the whole graph. We are going to continue exploring ways to address this, and our next meeting with JH is at the start of December, so I will post another update around then. I have to go read a bunch of papers now, so I will post back later.
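For anyone curious what a baseline label propagation can look like, here is a minimal sketch over a NetworkX graph. It is a simple neighbor-averaging scheme, not the exact implementation we run, and the seed labels and graph are placeholders.

import numpy as np
import networkx as nx

def propagate_labels(G, seed_labels, num_iters=20):
    # seed_labels: dict node -> 1 (disease) or 0 (non-disease); all other nodes start at 0.5
    nodes = list(G.nodes())
    index = {n: i for i, n in enumerate(nodes)}
    A = nx.to_numpy_array(G, nodelist=nodes)

    # Row-normalize so each node takes the average of its neighbors' scores
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    P = A / deg

    scores = np.full(len(nodes), 0.5)
    for n, y in seed_labels.items():
        scores[index[n]] = y

    for _ in range(num_iters):
        scores = P @ scores
        # Clamp the known labels back after every pass
        for n, y in seed_labels.items():
            scores[index[n]] = y

    return dict(zip(nodes, scores))

# Example on a toy graph: seed two nodes and let the labels spread
# scores = propagate_labels(nx.karate_club_graph(), {0: 1, 33: 0})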

Anusha Kumar

Research Post # 8

This was the last week of the summer internship at NYU. I have decided to continue working with the lab into the school year, so I will keep writing about my experiences and posting them.

This week, at our weekly team meeting, I put together a deck about my project and presented it. It incorporated most of what I have covered in these posts, from start to finish: the project objective and its significance, and the process I went through to work toward the project goal, including exploratory data analysis, preliminary data visualizations, node embedding visualizations, and the classifiers. You can see my previous blog posts to learn about each of these.

Ultimately, the next steps for this project are using other classifiers, applying Bayesian Optimization/GridSearchCV to optimize hyperparameters, creating a visualization of the whole dataset (and storing those embeddings), continuing to use cross validation for model evaluation, and visualizing the results to test classifier performance. The first two are steps to increase the accuracy of the classifier. Creating a visualization of the whole dataset and storing the embeddings is important so that different models can be compared on the same vectors and embeddings; it would also just help to see the entire dataset visualized. Finally, cross validation is important for properly evaluating a model (specifics below), and visualizing the results is important for comparing models and deciding which one to use.

In the last post I did not go over what cross validation (CV) actually is, so here is a quick run-through. CV is a way to evaluate model performance more reliably and to help prevent overfitting. Without CV, it is possible to tweak the parameters until the model performs optimally on one particular test set, which amounts to overfitting to that test set. What CV does is split the training data into k folds, train the model on k-1 folds, and evaluate it on the remaining fold; this is repeated so that each fold serves as the evaluation set once, which can be computationally intensive, and the result is the mean accuracy across folds. CV is important for presenting an accurate picture of the model's performance. To learn more about CV and its implementation, check this out.
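Here is a minimal example using scikit-learn's cross_val_score. The data is synthetic stand-in data with the same shape as the 128-dimensional embeddings, not our real dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 128-dimensional node2vec embeddings and disease labels
X, y = make_classification(n_samples=3800, n_features=128, random_state=0)

clf = LinearSVC(max_iter=10000)
scores = cross_val_score(clf, X, y, cv=5)   # train on 4 folds, evaluate on the held-out fold, repeat
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")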

Apart from CV, I also implemented a kernel SVC in the hope that it would increase the model's accuracy. Prior to that, I had been using Linear SVC, which can only learn a linear decision boundary, whereas a kernel SVC can fit non-linear ones. However, the kernel SVC did not lead to significantly higher accuracy. Next, I will implement a Random Forest classifier and see if it returns better results. I am also going to keep track of the results of each classifier in a more formal way (until now I have been commenting the results in the code). Since I am trying so many models and comparing various parameters, it is important to formally track and compare the results.
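As a sketch of what that more formal comparison could look like, here is a small loop over the classifiers mentioned above, again on synthetic stand-in data rather than our real embeddings; the hyperparameters are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data with the same shape as the embeddings
X, y = make_classification(n_samples=3800, n_features=128, random_state=0)

models = {
    "linear SVC": LinearSVC(max_iter=10000),
    "RBF-kernel SVC": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")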

Further, I will be visualizing the results using a classifier comparison plot like the one in the scikit-learn documentation. If that approach does not work for this data, I can always take the predictions, add them back into the dataset, and plot them on the already generated vectors - this will let us see which data points are most frequently misclassified. Either way, visualizing the results will help us see which kinds of data points the model most frequently misclassifies (e.g., peripheral nodes, hub nodes, disease nodes).

So, those are the major next steps - this project is supposed to end in the next month or so. After that, I will continue to post about the next project I am working on and keep you updated.

Anusha Kumar

Research Post # 7

I did not write a post last week, so I am just going to update you on what I have been doing for the past two weeks.

Last time I mentioned that I used Node2Vec to generate 128-dimensional vectors for the nodes. The algorithm assigns similar vectors to nodes that are closer to each other or have higher levels of connectivity. Previously, I fed these vectors into a tool called UMAP, which reduced them to two dimensions and visualized the data. Now, however, I am using these vectors to train machine learning models to predict whether a gene is disease-associated or not.

I did some online research regarding commonly used classifiers, specifically k-nearest neighbors, multi-layer perceptron, support vector machines, decision trees, and naive Bayes.

To start, I decided to use a multi-layer perceptron (MLP) and achieved about a 60% accuracy rate. Here is how an MLP works: there is an input layer, an output layer, and several "hidden layers" in between. Each hidden layer applies a mathematical function to its input and passes the output to the next layer, and this continues until the output layer is reached. The math behind MLPs involves calculus and linear algebra, so I will not go into it here. An MLP has many parameters, more than most other models. Surprisingly, training the model was not very computationally intensive: it only took around 5 seconds on 30,000 edges (about 3,800 nodes). However, as I increased the sample size to 100,000 edges, the accuracy actually decreased, to about 40%. Typically, when a machine learning model is given more data the accuracy should increase, because it has more examples and more opportunities to find patterns. The accuracy likely decreased on this MLP model because hyperparameter tuning for MLPs is difficult. I only performed basic hyperparameter tuning using GridSearchCV, a scikit-learn utility that cycles through different combinations of model parameters and returns the combination with the best results. In reality, a successful MLP model needs much more hyperparameter tuning, because there are so many different parameters that can be altered.
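Here is a minimal sketch of that setup - an MLP tuned with GridSearchCV on synthetic stand-in data. The parameter grid is illustrative, not the one I actually used.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 128-dimensional node2vec embeddings and labels
X, y = make_classification(n_samples=3800, n_features=128, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(64,), (128, 64)],
    "alpha": [1e-4, 1e-3],
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))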

Because of the difficulties in tuning the MLP model, I decided to move on to support vector machines (SVMs), which are much simpler than MLPs. Here is a good overview of SVMs. With the SVM, I achieved around a 62% accuracy rate, again using GridSearchCV for hyperparameter tuning.

In addition to tuning model parameters, I also experimented with different vector sizes and variations of the random walk parameters on both models. (Note: since the MLP was doing worse with more edges, I only used 10,000 edges with the MLP model.) Increasing or decreasing the vector size did not significantly change either model's accuracy, and the same was true of changing the random walk parameters. I suspected that adding other node features to the model would not increase accuracy, and when I tried adding publication count and ortholog count, the accuracy actually dropped to about 40%. This makes sense, because publication count and ortholog count are relatively arbitrary numbers with regard to the classification of the genes.

There are a few next steps. First, I am going to continue manipulating the random walk parameters, but rather than adjusting the p and q probabilities, I will increase the number of iterations. Next is applying cross validation to the data (using a 70-20-10 split). And finally, I will think about visualizing the results, although this is not a priority. By visualization, I mean comparing the model's predictions to the previously made visualization to see where it makes the largest errors and whether there are any patterns.

So, that’s where I am going with this project. By the way, next week is technically my last week of the internship, but I will continue to work on this project, so I will continue to post updates. See you later.

Anusha Kumar

Research Post # 6

Last week I got access to the university's High Performance Computing (HPC) cluster, and I am no longer running into computational issues with the data visualization. That means I was able to visualize the entire network (all the nodes and edges in the data!) without it taking an outrageous amount of time or storage. I also finalized the graph in the sense that I picked all the relevant features to incorporate: edge colors are based on ChatGPT-4's six categories of interaction type (for those who don't know, the data had 84 different interaction types, so ChatGPT grouped them into six categories), node colors are based on disease association with gradients reflecting publication counts, edge style (dotted/solid) is based on interaction category, and edges are weighted by their DIOPT scores (the number of organisms the interaction appears in).

The next steps are graph sparsification and node embedding. Last time I talked about how I used NetworkX's built-in dedensify method, which merges hub nodes. It did not occur to me at the time, but graph sparsification is really just removing certain nodes or edges based on some attribute. Some algorithms (like the ones in the paper I linked in my last post) are complex, but it is easy to prune manually. I pruned my graph on two different features: node degree and DIOPT score. For node degree, I created several graphs, each pruned to a different degree range (<10, <100, <1000, >10, >100, >1000). For DIOPT scores, I created two graphs: one keeping only edges with a DIOPT score greater than 2, and one keeping scores of 2 or less. I coded a loop to find the nodes and edges with these attributes and add them to a new graph. However, I found manual pruning to be extremely computationally intensive: even on HPC, the script took ~18 minutes per graph, since there was so much data to iterate over.
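For reference, here is a hedged sketch of how the pruning can be written with NetworkX subgraph views, which may be faster than building a new graph edge by edge inside a loop. The edge attribute name "diopt" is an assumption; the real attribute name in my graph may differ.

import networkx as nx

def prune_by_degree(G, max_degree):
    # Keep only nodes whose degree is below the threshold
    keep = [n for n, d in G.degree() if d < max_degree]
    return G.subgraph(keep).copy()

def prune_by_diopt(G, min_score):
    # Keep only edges whose DIOPT score is above the threshold
    keep = [(u, v) for u, v, s in G.edges(data="diopt", default=0) if s > min_score]
    return G.edge_subgraph(keep).copy()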

Apart from graph sparsification, my next major step is node embedding. I will be using the Node2Vec and Word2Vec algorithms and the UMAP software for this. Node embeddings are multi-dimensional (~200-dimensional) vectors assigned to each node by an algorithm. Node2Vec assigns vectors based on nodes' connectivity, grouping nodes with similar neighborhoods in the embedding space. In this sense, each edge is a 'feature' of a node, and the algorithm assigns highly interconnected nodes vectors that are close to each other. These vectors can then be plotted using a tool like UMAP or matplotlib.

Node2Vec is an algorithm that was developed by the Stanford Network Analysis Project (SNAP). It is similar to Word2Vec (available in the Python library gensim), which groups words, instead of nodes, based on their similarity to each other in context. You train Word2Vec by feeding it a corpus; it learns which words are frequently used together and assigns those words similar vectors. For example, it would assign more similar vectors to 'train' and 'station' than to 'train' and 'rock'.

In order to implement the Node2Vec algorithm, you first need to implement second order random walk on the graph. First order random walk is an algorithm that takes a node, randomly selects one of its neighbors, moves to that node, and repeats, ultimately creating a ‘path’. Here is some sample code for first order random walk:

# Random walk (first order)
import random

def random_walk(G, start_node, num_steps):
    walk = [start_node]
    current_node = start_node

    for _ in range(num_steps):
        neighbors = list(G.neighbors(current_node))
        if not neighbors:  # dead end: stop the walk early
            break
        next_node = random.choice(neighbors)
        walk.append(next_node)
        current_node = next_node

    return walk

Second-order random walk is very similar, except that it takes into account two parameters: the return parameter p and the in-out parameter q. A low p makes the walk more likely to step back to the node it just visited, while a low q makes it more likely to move outward to nodes farther from the previous node (and a high q keeps the walk close by). These parameters are adjustable. Here is sample code for second-order random walk:

# Random walk (second order)
import random
import numpy as np

def second_order_random_walk(graph, nodes, steps, n, p, q):
    # graph, nodes, length of walk, number of walks per node,
    # return parameter p, in-out parameter q
    all_walks = []
    for start_node in nodes:
        for _ in range(n):
            walk = [start_node]
            current_node = start_node
            previous_node = None

            for _ in range(steps):
                neighbors = list(graph.neighbors(current_node))
                if not neighbors:
                    break

                if previous_node is None:
                    # First step, no previous node
                    next_node = random.choice(neighbors)
                else:
                    # Adjust the probabilities based on the neighbors' connections to the previous node
                    probabilities = []
                    for neighbor in neighbors:
                        if neighbor == previous_node:
                            probabilities.append(1 / p)    # step back to the previous node
                        elif graph.has_edge(previous_node, neighbor):
                            probabilities.append(1)        # stay close to the previous node
                        else:
                            probabilities.append(1 / q)    # move farther away

                    # Normalize probabilities
                    probabilities = np.array(probabilities, dtype=float)
                    probabilities /= probabilities.sum()

                    # Choose next node based on the transition probabilities
                    next_node = np.random.choice(neighbors, p=probabilities)

                walk.append(next_node)
                previous_node = current_node
                current_node = next_node

            # Store the completed walk once, after it finishes
            all_walks.append(walk)

    return all_walks

Once you have implemented the second-order random walk, you generate a collection of walks across the graph - these are analogous to sentences in Word2Vec. The algorithm then picks up on groups of nodes that frequently appear near each other in the walks, assigns them similar vectors, and repeats this for all the nodes in the graph.
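Here is a hedged sketch of that step, using the second_order_random_walk function above on a small toy graph and gensim's Word2Vec (gensim 4 API). The parameters are arbitrary, not tuned for our data.

import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()                            # toy stand-in for the gene network
walks = second_order_random_walk(G, list(G.nodes()), steps=20, n=10, p=1, q=1)
sentences = [[str(node) for node in walk] for walk in walks]   # Word2Vec expects lists of tokens

model = Word2Vec(sentences, vector_size=128, window=10, min_count=1, sg=1, workers=4)
print(model.wv["0"][:5])                              # first few dimensions of node 0's embedding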

Keep in mind that each vector is multi-dimensional. To transform the high-dimensional vectors into low-dimensional space (2 or 3 dimensions), we use an algorithm called t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE is an unsupervised, non-linear method for reducing dimensions and visualizing high-dimensional data, and it is applied to all the node embeddings.
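Here is a minimal example of that reduction with scikit-learn's TSNE, using random vectors as a stand-in for the real node2vec embeddings.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.rand(500, 128)                 # stand-in for the node2vec vectors
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("Node embeddings reduced to 2-D with t-SNE")
plt.show()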

Once Node2Vec produces the node embeddings and t-SNE (or UMAP, which is itself a dimensionality-reduction algorithm rather than just plotting software) reduces them to two dimensions, we can plot the result as an asymmetric graph with communities of nodes grouped by their connectivity. This kind of plot lets us make qualitative analyses of the data and predict whether an unknown node is disease- or non-disease-associated by observing its connections and where it lies in the plot.

In general, when addressing a problem like gene classification, why is visualizing the data important? At first, my intuition was that it would just help tackle the larger problem of training the model and was ultimately busywork. Aren't machine learning models able to make predictions from graph data structures, like an adjacency matrix, without a visual representation of the data? On the contrary, visualizing the data is a critical step in solving the gene classification problem. A machine learning model can predict the classification of unknown genes without a visualization, but we can only sanity-check its output by comparing it to a visual representation: the model's results can be compared qualitatively against the plot, where it is easier to see how the genes group together. Furthermore, the node2vec approach classifies the genes in low-dimensional space, which is, at a high level, a form of label propagation that can capture higher-order structure and patterns than a pure baseline label propagation model.

That is where I’m at right now. Over the next few weeks I will implement node2vec and research the background of the algorithms at a high level. I’ll talk more about the specifics of the node2vec and t-SNE algorithms and their implementation in my next post.

Anusha Kumar

Research Post # 5

Just finished another week at my internship. As you may remember, I have been working on a data visualization and am continuing to do so. This week I experimented with graph sparsification. Unfortunately, my computer only has 8 GB of RAM, so I can't visualize all the data without it crashing or slowing down. I am supposed to get access to the university's High Performance Computing (HPC) servers soon, but in the meantime my professor suggested I try graph sparsification. Graph sparsification is a technique that makes a graph smaller, and therefore less memory-hungry, by pruning edges and nodes. I tried it out on smaller graphs and it works! There are a bunch of different algorithms that can be used and tuned for different results. I started out with a basic sparsification method built into the software I am using, NetworkX. This method merged densely connected nodes that were close to each other, reducing the total number of edges. In the end, the sparsified graph had about 200 fewer edges than the original graph (which started with 1,000 edges).

Graph sparsification is a really interesting concept because there are so many different algorithms for it. This paper does an in-depth analysis of different graph sparsification algorithms. The study uses 14 graphs obtained from real-world scenarios with different characteristics and 12 common graph sparsification algorithms (sparsifiers), and assesses their performance on 16 different graph metrics. The results show that no single sparsifier performs best at preserving all the graph characteristics; each algorithm does well in different cases. The paper covers the math behind each algorithm, the significance of each metric, and the impact of each algorithm on each metric.

Apart from graph sparsification, and in addition to creating a larger data visualization, I have been tasked with finding external platforms and software for graph visualization. NetworkX's drawing tools are mainly for displaying graphs - they do not do extensive analysis of the data itself. For this reason, all of its layouts (which can be seen in my last post) are more or less symmetrical and do not offer informative qualitative results. By using external software to position the nodes and then feeding those positions into NetworkX, the result may be a more asymmetric graph that shows us which communities form and helps predict the unknown genes. We tried community-based clustering, but that did not work well because the algorithm kept reading the entire graph as one giant community, which is uninformative and not useful for clustering.

Next steps are continuing with graph sparsification, using other software to visualize the network, and hopefully starting to get some informative visualizations. See you next week.

Anusha Kumar

Research Post # 4

I just finished my fourth week at the NYU Internship. Because of the July 4th weekend, this week was cut short, but we still did a lot. I continued to work on the visualization of my data and here are the updates.

I decided to color the edges of the graph according to interaction category (physical or genetic), with thicknesses based on interaction count (the number of organisms the interaction is present in). Once this was done, I experimented with the different layouts that the graphing software (NetworkX) offers: circular/spiral layouts, force-directed layouts, clustering by color, and a spectral layout. The results are linked here:

These are the five relevant layouts that I found. Each graph has 10,000 edges and 2,061 nodes. The nodes are colored red for disease-associated, blue for non-disease-associated, and gray for unknown. The edges are colored orange for physical gene interactions and green for genetic gene interactions, and their thicknesses are based on interaction count. Out of all the layouts, the force-directed one is likely the most informative because it directly shows which nodes are the most connected, allowing us to make predictions about whether a node is disease-associated. To elaborate, if a node is connected primarily to disease-associated nodes, it is likely also disease-associated, and if it is connected primarily to non-disease-associated nodes, it is likely not. The force-directed layout helps visualize this.

There were two force-directed layouts I experimented with, each using a different algorithm. The force-directed (spring layout) graph shown here is also known as Fruchterman-Reingold and uses a force-directed algorithm that treats the nodes like charged particles connected by springs, with attractive and repulsive forces. The attractive (spring) force comes into play when nodes are connected by an edge, pulling connected nodes closer together. In contrast, there is a repulsive force between every pair of nodes, pushing all nodes away from each other. The process is iterative, beginning with random node placement and then repeatedly adjusting each node's position based on these forces. The result is a clear graph, requiring relatively little computational power, with connected nodes closer together and disconnected nodes farther apart.
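Here is a minimal sketch of drawing a graph with the spring (Fruchterman-Reingold) layout in NetworkX; the toy graph and styling stand in for the real gene network.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()                          # toy stand-in for the gene network
pos = nx.spring_layout(G, iterations=50, seed=42)   # iterative force-directed placement
nx.draw(G, pos, node_size=40, width=0.5, node_color="tab:red", edge_color="gray")
plt.show()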

The second layout is called the Kamada-Kawai layout. It is also force-directed and, similar to the spring layout, uses the idea of attractive and repulsive forces (springs). However, the Kamada-Kawai layout positions nodes by minimizing an energy function based on graph distances, which makes it much more computationally intensive and impractical to use on larger graphs (which is why it is not in the pictures above).

Moving forward I will most likely use the spring layout, since it is informative and can be run on larger graphs. Other than the visualization work, I played around with the data a bit more, making plots and charts. Next week I will continue the visualization and hopefully get to some modeling.

See you next week!

Anusha Kumar

Research Post # 3

It’s Thursday, so that means I’m wrapping up my third week at the NYU internship! This week I had to code a visualization of the data (as a graph). I used the Python module NetworkX to code it. I took Theoretical Computer Science last year and we did a lot of graph theory and other theoretical concepts. I thought those concepts could only be applied in obscure fields; however, that class helped me build a strong understanding of the concepts behind data flow and data structures. So, when I had to code a visual representation of the data, it was surprisingly easy, since I understood how it worked.

I really enjoyed just coding for three or four hours straight. It gave me a lot of time and opportunities to just experiment with the code and try new things and conceptualize what was going on. I ultimately ended up with a graph with 1,000 edges and 1,462 nodes. Apart from just coding a visualization of the data, I also coded an adjacency matrix with all the nodes and played around with that. I was surprised to find that it was really sparse (primarily 0s), but in retrospect it does make sense for an adjacency matrix of 1462x1462 to have mostly 0s.

My professor suggested an interesting idea: instead of just drawing the graph, I could also find a way to color the adjacency matrix itself based on the connections and see whether there is any pattern there - this is another way of visualizing the data. He also made a few other suggestions about my graph, including changing the edges' colors and widths according to different attributes and weighting the edges. He also said that instead of making the nodes red/blue (to represent disease-associated/non-disease-associated), I could use different shades - a darker shade of red representing a higher likelihood that the gene actually is disease-associated, and a lighter shade representing a lower likelihood. I've found a feature in NetworkX that I might be able to use for this, so I will try it out and we'll see.
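Here is a rough sketch of the adjacency-matrix idea on a toy graph: pick an ordering of the nodes (by degree here; by disease label in the real data) and plot the matrix as an image. This is only an illustration, not the final implementation.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()                           # toy stand-in for the gene graph
order = sorted(G.nodes(), key=G.degree)              # ordering choice is up to us
A = nx.to_numpy_array(G, nodelist=order)

plt.imshow(A, cmap="Greys", interpolation="nearest") # dark cells mark edges
plt.title("Adjacency matrix, nodes ordered by degree")
plt.show()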

Something else that I have found is that, in research, you need to accept that you know very little, if anything. The entire point of the field of research is to dive into the unknowns and discover something new. Adapting to this kind of environment proved to be a slightly difficult for me at first. In school, you are expected to always know the answer and outperform your peers by studying harder and learning more. On the contrary, this lab’s environment is completely different from school - there is no competition for getting good grades or more points - everyone is working together and trying to help their peers succeed. No one knows all the answers. The students in the lab and the professor are constantly bouncing ideas off each other, trying to learn and optimize the outcome. I have been learning so much at the lab, not only about neuroscience and machine learning, but also how to approach problems and work in a team.

That’s all I have for this week.

Anusha Kumar

Research Post # 2

I’ve just finished my second week at the research internship I am doing at a university neuroscience lab. This week I was assigned my project, and it is really interesting!

A research group, separate from ours, coded a data scrape to find nearly every mention of a certain kind of gene and interaction found in neurons across scientific publications. The scrape came back with two datasets: one containing all the genes and their specific attributes, and one containing interactions between pairs of genes and the attributes of each interaction.

The gene data contains 19,867 genes, and each one has four attributes: its disease-association category (whether or not it is associated with disease), the number of publications it appeared in, its ortholog count, and the specific species the gene is present in. For clarification, orthologs are genes that originated from a common ancestor and were separated by a speciation event, resulting in the gene's presence in two species. This data covers nine species: Homo sapiens, C. elegans, Mus musculus, Danio rerio, S. pombe, Rattus norvegicus, Xenopus tropicalis, Drosophila melanogaster, and S. cerevisiae. In the dataset, 11,424 genes are not associated with disease, 4,796 genes are associated with disease, and 3,647 genes came back with an unknown category.

The gene interaction data contains 1,048,576 connections and six main attributes: the interaction category, the interaction type, the method the interaction was found by, the number of organisms the interaction is present in, the specific organisms the interaction appeared in, and whether the interaction is bidirectional.

I have been performing exploratory data analysis on the datasets in pandas to understand the distributions of the various attributes, their frequencies, unique values, and visual plots. Next, I will be coding the simplest form of a visualization of the datasets. The brain can be thought of as a graph or network with nodes and edges, so I will use the genes as nodes and the gene interactions as edges. To begin, I will take the first 10,000 edges and every corresponding node. This is enough data to create an adjacency matrix, which is a data structure used to represent graphs. A matrix can be thought of as a list of lists; for an undirected graph like this one, the adjacency matrix is symmetrical, with the list of nodes along both the horizontal and vertical axes. If there is an edge between two nodes, there is a 1 at their intersection, and if there is no edge, there is a 0. This adjacency matrix will be combined with a one-dimensional array containing a 0 or 1 for every node, representing non-disease-associated or disease-associated, which will be shown in the graph using different colors. Since this is the simplest visualization of the graph, the unknown nodes will be ignored for now. Using this graph, we will train a model on the data and eventually apply it to the unknown genes to predict the probability that each unknown gene is disease-associated.
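Here is a rough sketch of that setup. The file names and column names ("gene_a", "gene_b", "disease_associated") are assumptions for illustration; the real datasets may be organized differently.

import pandas as pd
import networkx as nx

edges = pd.read_csv("interactions.csv").head(10000)   # first 10,000 interactions (hypothetical file)
G = nx.from_pandas_edgelist(edges, source="gene_a", target="gene_b")

genes = pd.read_csv("genes.csv").set_index("gene")    # hypothetical gene table
# 1 = disease-associated, 0 = non-disease-associated; unknown genes are skipped here
labels = [int(genes.loc[n, "disease_associated"]) for n in G.nodes() if n in genes.index]

A = nx.to_numpy_array(G)                              # symmetric 0/1 adjacency matrix
print(A.shape, "adjacency matrix;", sum(labels), "disease-associated nodes")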

The goal of this project is to develop and train a machine learning model to predict the probability that the unknown genes are associated with disease. Knowing this can help medical professionals in a number of ways, including understanding the underlying biological mechanisms of diseases, identifying therapeutic targets, and developing drugs; disease-associated genes can even serve as biomarkers for diagnostic testing, potentially enabling personalized treatments.

See you next week.

Anusha Kumar

Research Post # 1

This week I read several research papers to gain background knowledge on neuroscience and on machine learning (ML) applications to neuroscience datasets. One paper that I found interesting was about the applications of human brain connectomics to clinical psychiatry (linked here).

Connectomics is the study of the connections between neurons in the human brain. Connectivity can be measured through noninvasive techniques such as functional magnetic resonance imaging (fMRI), and those measurements can serve as diagnostic tests, prognostic indicators, and therapeutic predictors.

There are two main models for applying connectomics to psychiatry: the "internal medicine" model and the "surgical" model. The internal medicine model essentially uses brain imaging as a tool to diagnose diseases: by observing abnormalities in the imaging, doctors can diagnose patients with more certainty. However, there are two main problems with this model. First, in studies of this method so far, sample sizes have been inadequate, leading to false positive findings; people were "diagnosed" when they did not really have the disease. Second, since every person's brain is different, there are consistent fluctuations in measurements across individuals, making it challenging to study connectivity reliably. Even for the same individual, repeating the test after a period of time may yield different results. This only underscores the unreliability of the internal medicine model.

In contrast, the "surgical" model focuses on using imaging as a method of treatment and does not face the same challenges. There are two key benefits of the surgical model. First, unlike the internal medicine model, consistent results have been obtained across a variety of subjects as well as over time in the same individual. This consistency demonstrates the reliability of the model, and reliability is critical because reliable results within individuals can potentially allow for personalized treatment. The second benefit is that brain imaging can identify which part of the brain is driving a disorder, even if the disorder itself is not well defined.

Surgeons can take advantage of the second benefit by targeting brain regions with neuromodulation techniques like deep brain stimulation (DBS) and transcranial magnetic stimulation (TMS). Neuromodulation is technology that acts directly on nerves, often to treat a mental illness; DBS and TMS are examples of neuromodulation procedures. After a comprehensive psychiatric evaluation, the brain regions most relevant to various psychiatric conditions (e.g., obsessive compulsive disorder, depression, anxiety) would be identified. Then an MRI would be used to localize a target brain region based on its connectivity to those relevant regions, and after that, DBS or TMS would be used to treat the patient.

There are several applications of connectomics in clinical psychiatry, primarily in treatment. As of now, the surgical model is feasible to incorporate into present-day psychiatry, and there is hope that connectomics will eventually evolve to support a successful internal medicine model as well. Researchers predict that psychiatrists may need to learn to use connectomics to monitor and modulate neurobehavioral systems as it becomes more relevant to the treatment and diagnosis of psychiatric conditions.
