Research Post # 9

Nov 20

Hi everyone, it’s been a minute. I haven’t written for several weeks because school restarted and I haven’t had much time, but I have a ton of project updates.

So, we spoke with the collaborators at Johns Hopkins (JH) about the project, and it turns out that we have been solving a different problem than the one that was intended! Over the summer, we worked with three categories of genes: disease, non-disease, and unknown (non-OMIM). We had been focused on developing a classifier to sort the non-OMIM genes into disease or non-disease. However, the way the OMIM database works is that people manually enter genes that have been researched. When the JH team wrote the scraper that finds genes mentioned in publications across the internet, any gene that had not been entered into OMIM, or that had been entered incorrectly, was classified as non-OMIM. So the label “non-OMIM” can mean that a gene was entered incorrectly, overlooked, or considered irrelevant or not worth studying to begin with.

The focus of this project was intended to be on the non-disease category. When researchers entered genes into the database, some of the genes recorded as “non-disease” may actually have been “disease” genes, so “non-disease” is not a true label. The goal of the project was actually to separate the “non-disease” genes into true “non-disease” genes and “disease” genes. This turns the project into one-way label classification (a setup that is often called positive-unlabeled learning): the only true labels we have are the disease genes, and we have to classify the non-disease genes. The non-OMIM genes are not really relevant to the project. That said, if we can find a way to classify the non-OMIM genes into “disease” and “non-disease”, then we also have a way to classify the non-disease genes into “disease” and “non-disease”. If you have one, then you have the other.
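To make the new framing concrete, here is a minimal sketch of how the database statuses could be mapped to labels in this setup. The gene names, the record layout, and the to_pu_labels helper are all made up for illustration; this is not the project's actual code. The idea is just that “disease” becomes the only trusted (positive) label, and “non-disease” is treated as unlabeled rather than as a true negative.

```python
# Hypothetical gene records; names and statuses are invented for illustration.
genes = [
    {"gene": "GENE_A", "status": "disease"},
    {"gene": "GENE_B", "status": "non-disease"},
    {"gene": "GENE_C", "status": "non-OMIM"},
    {"gene": "GENE_D", "status": "non-disease"},
]


def to_pu_labels(records, include_non_omim=False):
    """Map database statuses to positive (1) / unlabeled (0) labels.

    "disease" is the only trusted label, so it becomes the positive class.
    "non-disease" is treated as unlabeled, not as a true negative.
    "non-OMIM" genes are skipped unless include_non_omim is True.
    """
    labels = {}
    for rec in records:
        if rec["status"] == "disease":
            labels[rec["gene"]] = 1
        elif rec["status"] == "non-disease":
            labels[rec["gene"]] = 0
        elif include_non_omim:
            labels[rec["gene"]] = 0
    return labels


pu_labels = to_pu_labels(genes)
print(pu_labels)
```

Passing include_non_omim=True would fold the non-OMIM genes into the unlabeled pool as well, which is one way to reuse the same method for both groups.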

So, now that the problem statement is different, there are a few ways to go about solving it. The first is to turn it into a regression problem. If we can figure out which disease genes are more confidently disease genes (based on attributes in the dataset), then a regressor can predict the probability that a mutation in a gene results in disease. Another potential avenue is to define the non-OMIM genes as negative labels (so that we have more than just the “disease” label). However, this is likely not the best solution, because we do not have disease association information about the non-OMIM genes (which is why they are non-OMIM), so they can’t really act as a true negative label. We also discussed incorporating other features of the data that we had not focused on as much, such as the pLI score and publication count. pLI stands for the probability of being loss-of-function intolerant, which essentially quantifies how likely it is that a gene cannot tolerate a loss-of-function mutation. A higher pLI score implies a lower tolerance to such mutations in the gene. Publication count can be used as a confidence metric for how certain we are that a gene is disease associated.
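As a rough illustration of the confidence idea, here is a small sketch that turns publication counts into a score between 0 and 1 that could serve as a regression target for the known disease genes. The log scaling, the publication_confidence helper, and the gene values are all my own assumptions for the example; we have not settled on how the confidence would actually be computed.

```python
import math


def publication_confidence(count, max_count):
    """Scale a publication count into [0, 1] with a log transform.

    The log transform is an assumed choice: it keeps a few heavily
    studied genes from dominating the scale.
    """
    if max_count <= 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)


# Hypothetical disease genes with made-up publication counts and pLI scores.
disease_genes = {
    "GENE_A": {"publications": 120, "pLI": 0.98},
    "GENE_B": {"publications": 3, "pLI": 0.10},
}

max_pubs = max(g["publications"] for g in disease_genes.values())

# Confidence targets a regressor could be trained to predict;
# pLI would be one of the input features rather than part of the target.
targets = {
    name: publication_confidence(g["publications"], max_pubs)
    for name, g in disease_genes.items()
}
print(targets)
```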

Generally speaking, there are a ton of papers on disease gene classification, and more specifically on one-way classification. I have read a few papers so far, and once I have a more concrete idea of how this project is actually moving forward, I will post an update. That’s it for today.

By the way, we have also been finding that label propagation actually performs better than the classifiers that are run on the node2vec embeddings. That is interesting because it has a few possible implications: the labels likely depend more on local graph structure than on overall graph structure; the classifiers run on the embeddings need more hyperparameter tuning; and the skew in the data is likely hurting the performance of the embedding-based classifiers, because those focus on the whole graph. We are going to continue to explore avenues to solve this problem, and our next meeting with JH is at the start of December, so I will post another update around then. I have to go read a bunch of papers now, so I will post back later.
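For anyone curious what label propagation looks like, here is a minimal sketch on a toy graph. It is a simplified variant that I am assuming for illustration (each node repeatedly averages its neighbors' scores, with a pull back toward the known disease seeds), not necessarily the exact version we are running. The graph, the propagate_labels helper, and the alpha and iterations values are all hypothetical.

```python
def propagate_labels(neighbors, seeds, alpha=0.8, iterations=50):
    """Spread scores from seed nodes to the rest of the graph.

    neighbors: dict mapping each node to a list of adjacent nodes.
    seeds: set of nodes with a known positive (disease) label.
    alpha: how much weight goes to neighbors versus the original seed label.
    """
    initial = {node: (1.0 if node in seeds else 0.0) for node in neighbors}
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            # Average of the neighbors' current scores (0 for isolated nodes).
            neighbor_mean = sum(scores[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            updated[node] = alpha * neighbor_mean + (1 - alpha) * initial[node]
        scores = updated
    return scores


# Toy path graph A - B - C - D, where only A is a known disease gene.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

scores = propagate_labels(toy_graph, seeds={"A"})
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

Scores fall off with distance from the seed, which is the kind of local behavior that could explain why this approach does well when disease labels cluster in small neighborhoods of the graph.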

Anusha Kumar
