Research Post # 6

Last week, I got access to the university’s High Performance Computing (HPC) cluster, and I am no longer running into computational issues with the data visualization code. That means I was able to visualize the entire network (all the nodes and edges in the data!) without an outrageous amount of time or storage. I also finalized the graph in the sense that I picked all the relevant features to incorporate: edge colors are based on ChatGPT-4’s 6 categories of interaction type (for those who don’t know, the data had 84 different interaction types, so ChatGPT grouped them into 6 categories), node colors are based on disease association with gradients indicating publication counts, edge style (dotted/solid) reflects interaction category, and edges are weighted by their DIOPT scores (the number of interactions across organisms).
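As a rough illustration of how those features map onto the graph, here is a minimal NetworkX sketch (the attribute names and values here are placeholders, not my actual schema):

# Hypothetical sketch: attaching visual attributes to a NetworkX graph
import networkx as nx

G = nx.Graph()
# Each edge carries an interaction category and a DIOPT score
G.add_edge("geneA", "geneB", category="binding", diopt=5)
# Each node carries a disease flag and a publication count
G.add_node("geneA", disease=True, publications=42)
G.add_node("geneB", disease=False, publications=7)

# Map the attributes to visual properties at draw time
edge_widths = [d["diopt"] for _, _, d in G.edges(data=True)]
node_colors = ["red" if d.get("disease") else "gray" for _, d in G.nodes(data=True)]
nx.draw(G, width=edge_widths, node_color=node_colors)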

The next steps are graph sparsification and node embedding. Last time I talked about using NetworkX’s built-in dedensify method, which merges hub nodes. It did not occur to me at the time, but graph sparsification is just removing certain nodes or edges based on some attribute. Some algorithms (like the one in the paper I linked in my last post) are complex, but it is easy to prune manually. I pruned my graph on two different features: node degree and DIOPT score. For node degree, I created several graphs, each pruned at a different threshold: degree <10, <100, <1000, and >10, >100, >1000. For DIOPT scores, I created two graphs: one keeping interactions with a DIOPT score greater than 2, and one keeping those with a score of 2 or less. I coded a loop that finds the nodes and edges matching each criterion and adds them to a new graph; a minimal sketch of this kind of pruning is below. However, manual pruning turned out to be extremely computationally intensive: even on the HPC, the script took ~18 minutes per graph, since there was so much data to iterate over.
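Here is a minimal sketch of that kind of pruning, assuming a NetworkX graph where DIOPT scores are stored under an edge attribute I’m calling 'diopt' (a placeholder name):

# Pruning sketch (assumes NetworkX; the 'diopt' edge attribute name is a placeholder)
import networkx as nx

def prune_by_degree(G, max_degree):
    # Keep only the nodes whose degree is below the threshold
    keep = [node for node, deg in G.degree() if deg < max_degree]
    return G.subgraph(keep).copy()

def prune_by_diopt(G, min_score):
    # Keep only the edges whose DIOPT score exceeds the threshold
    keep = [(u, v) for u, v, d in G.edges(data=True) if d.get("diopt", 0) > min_score]
    return G.edge_subgraph(keep).copy()

Filtering through the degree view and subgraph()/edge_subgraph() this way avoids rebuilding the graph element by element, which may be a good deal faster than iterating over everything in a manual loop.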

Apart from graph sparsification, my next major step is node embedding. I will be using the Node2Vec and Word2Vec algorithms and the UMAP library for this. Node embeddings are multi-dimensional (~200-dimensional) vectors assigned to each node by an algorithm. Node2Vec assigns vectors based on nodes’ features, placing nodes with similar features close together in the embedding space. In this sense, each edge is a ‘feature’ of a node, and the algorithm assigns highly interconnected nodes vectors that lie near each other. These vectors can then be reduced to two or three dimensions (e.g., with UMAP) and plotted with a library like matplotlib.

Node2Vec is an algorithm developed by the Stanford Network Analysis Project (SNAP). It is similar to Word2Vec from the Python library gensim, which groups words, instead of nodes, based on their similarity to each other in context. You train Word2Vec by feeding it a corpus; it learns which words are frequently used together and assigns those words similar vectors. For example, it would assign more similar vectors to ‘train’ and ‘station’ than to ‘train’ and ‘rock’.
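As a toy illustration of gensim’s Word2Vec API (the corpus here is made up, and far too small to produce meaningful similarities):

# Toy Word2Vec example with gensim
from gensim.models import Word2Vec

# Tiny made-up corpus; each sentence is a list of tokens
sentences = [
    ["the", "train", "left", "the", "station"],
    ["the", "train", "arrived", "at", "the", "station"],
    ["the", "rock", "sat", "in", "the", "field"],
]

# vector_size is the embedding dimension; window is the context size; sg=1 uses skip-gram
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vec = model.wv["train"]                   # the 50-dimensional vector for 'train'
similar = model.wv.most_similar("train")  # words whose vectors are nearby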

In order to implement the Node2Vec algorithm, you first need to implement a second-order random walk on the graph. A first-order random walk is an algorithm that takes a node, randomly selects one of its neighbors, moves to that node, and repeats, ultimately creating a ‘path’. Here is some sample code for a first-order random walk:

# Random walk (first order)
import random

def random_walk(G, start_node, num_steps):
    walk = [start_node]
    current_node = start_node

    for _ in range(num_steps):
        neighbors = list(G.neighbors(current_node))
        if not neighbors:  # dead end: stop the walk early
            break
        next_node = random.choice(neighbors)
        walk.append(next_node)
        current_node = next_node

    return walk
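For example, on one of NetworkX’s built-in sample graphs (the output varies from run to run, since the walk is random):

# Usage example on a small sample graph
import networkx as nx

G = nx.karate_club_graph()
walk = random_walk(G, start_node=0, num_steps=10)
print(walk)  # e.g. [0, 5, 16, 6, 4, ...] — different every run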

A second-order random walk is very similar, except it takes into account two adjustable parameters: the return parameter, p, and the in-out parameter, q. A low p makes the walk more likely to return to the node it just left (that move gets weight 1/p), and a low q makes it more likely to move to nodes farther from the previous node (weight 1/q) rather than staying among its neighbors. Here is sample code for a second-order random walk:

# Random walk (second order)
import random
import numpy as np

def second_order_random_walk(graph, nodes, steps, n, p, q):
    # graph, starting nodes, length of each walk, number of walks per node,
    # return parameter p, in-out parameter q
    all_walks = []
    for start_node in nodes:
        for _ in range(n):
            walk = [start_node]
            current_node = start_node
            previous_node = None

            for _ in range(steps):
                neighbors = list(graph.neighbors(current_node))
                if not neighbors:
                    break

                if previous_node is None:
                    # First step, no previous node: choose uniformly
                    next_node = random.choice(neighbors)
                else:
                    # Weight each neighbor by its relation to the previous node
                    probabilities = []
                    for neighbor in neighbors:
                        if neighbor == previous_node:
                            probabilities.append(1 / p)   # return to the previous node
                        elif graph.has_edge(previous_node, neighbor):
                            probabilities.append(1)       # stay close to the previous node
                        else:
                            probabilities.append(1 / q)   # move farther away

                    # Normalize the weights into probabilities
                    probabilities = np.array(probabilities, dtype=float)
                    probabilities /= probabilities.sum()

                    # Choose the next node by index so node types aren't coerced by NumPy
                    next_node = neighbors[np.random.choice(len(neighbors), p=probabilities)]

                walk.append(next_node)
                previous_node = current_node
                current_node = next_node

            all_walks.append(walk)  # record the finished walk (once per walk, not per step)

    return all_walks

Once you have implemented the second-order random walk, you generate a corpus of walks across the graph - these are analogous to sentences in Word2Vec. The algorithm then learns which nodes frequently appear near each other in the walks and assigns those nodes similar vectors, repeating this for all the nodes in the graph.
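Concretely, a common way to carry out this step is to hand the walk corpus to gensim’s Word2Vec. Here is a sketch, where the hyperparameters are illustrative and the node IDs are stringified because Word2Vec expects tokens:

# Sketch: training Word2Vec on the walk corpus to get node embeddings
from gensim.models import Word2Vec

# Generate the walk corpus with the second-order random walk above
walks = second_order_random_walk(G, list(G.nodes()), steps=80, n=10, p=1.0, q=0.5)

# Word2Vec expects sequences of strings, so stringify the node IDs
corpus = [[str(node) for node in walk] for walk in walks]

# Skip-gram (sg=1) over the walks yields one embedding vector per node
model = Word2Vec(corpus, vector_size=128, window=10, min_count=1, sg=1)
embedding = model.wv[str(0)]  # the embedding vector for node 0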

Keep in mind that each vector is multi-dimensional. To transform the high-dimensional vectors into a low-dimensional space (2 or 3 dimensions), we use an algorithm called t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE is an unsupervised, non-linear method for reducing dimensionality and visualizing high-dimensional data. It is applied to all the node embeddings.
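With scikit-learn and matplotlib, that reduction and plot might look like this (a sketch that assumes the trained Word2Vec model from the previous snippet):

# Sketch: reducing the embeddings to 2D with t-SNE and plotting them
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stack the per-node embeddings into one (num_nodes x dimensions) matrix
nodes = list(model.wv.index_to_key)
X = np.array([model.wv[node] for node in nodes])

# Reduce the embeddings to 2 dimensions for plotting
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.show()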

Once Node2Vec produces the node embeddings and t-SNE (or UMAP, an alternative dimensionality-reduction method) reduces the dimensions, we can plot the data as a scatter plot with communities of nodes grouped by their connectivity. This kind of plot lets us make qualitative analyses of the data and predict whether unknown nodes are disease- or non-disease-associated by observing their connections and where they lie on the plot.

In general, when addressing a problem like gene classification, why is visualizing the data important? At first, my intuition was that it would just help tackle the larger problem of training the model, and was ultimately busywork. Aren’t machine learning models able to make predictions from graph data structures, like an adjacency matrix, without a visual representation of the data? On the contrary, visualizing the data is a critical step in solving the gene classification problem. A machine learning model can predict the classification of unknown genes without a visualization, but we can only sanity-check its results by comparing them against a visual representation, where it is much easier to judge qualitatively how the genes cluster. Furthermore, the node2vec algorithm classifies the genes in low-dimensional space, which is, at a high level, a form of label propagation that can capture higher-order structures and patterns than a pure baseline label-propagation model.

That is where I’m at right now. Over the next few weeks I will implement node2vec and research the background of the algorithms at a high level. I’ll talk more about the specifics of the node2vec and t-SNE algorithms and their implementation in my next post.
