Protein Function Prediction

Dr. K.E. Kannammal

A. Avanthika

A.J. Dhanushwaran

S. Agalya

M. Muneeshwaran

Keywords: Proteins, Secondary Structure Prediction, Word Embeddings, Deep Learning Architecture.


Abstract

Predictive language modeling is one of the fundamental techniques of natural language processing, learning directly from raw data. Gaining insight from sequence data, such as biological sequences, is a key challenge in genomics and proteomics, and protein secondary structure classification helps researchers understand protein function, an important step in drug development. Traditional approaches such as sequence models, probabilistic methods, and statistical methods are widely used to extract information from amino acid sequences; however, they depend on handcrafted features, which are difficult to engineer and ultimately limit accuracy. Our method instead uses word embeddings to generate vectors that capture the context of amino acids, improving the accuracy of secondary structure prediction. The embeddings are trained with the Continuous Bag of Words (CBOW) method, which encodes the ordering of amino acids in the protein chain. These vectors serve as input to a deep neural network classifier whose class labels are Helix, Sheet, and Coil. The NLP-based approach was evaluated on the GenBank dataset, and all experiments were carried out on Google Colab.
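The sketch below illustrates the pipeline described in the abstract: amino acid sequences are split into k-mer "words", a CBOW embedding (Word2Vec with sg=0) is trained over them, and the averaged vectors are fed to a small dense classifier with Helix/Sheet/Coil outputs. This is a minimal illustration under our own assumptions (toy sequences, 3-mers, a two-layer network), not the authors' implementation or dataset preparation.

```python
# Minimal sketch (not the paper's code): CBOW embeddings over amino-acid 3-mers,
# averaged into fixed-length vectors and classified as Helix / Sheet / Coil.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Toy protein sequences; in practice these would come from GenBank records.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
             "GSHMSLFDFFKNKGSAATATDRLKLILAKER"]

def to_kmers(seq, k=3):
    """Split a sequence into overlapping k-mer 'words' for the embedding model."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

corpus = [to_kmers(s) for s in sequences]

# CBOW (sg=0) learns a vector for each k-mer from its surrounding context window.
w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=0)

def embed(seq, k=3):
    """Average the k-mer vectors of a sequence into one fixed-length feature vector."""
    vecs = [w2v.wv[km] for km in to_kmers(seq, k) if km in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.stack([embed(s) for s in sequences])
y = np.array([0, 1])  # placeholder labels: 0 = Helix, 1 = Sheet, 2 = Coil

# Small dense classifier over the embedding vectors.
model = Sequential([
    Dense(32, activation="relu", input_shape=(64,)),
    Dense(3, activation="softmax"),  # Helix / Sheet / Coil
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
```

In a real setup the feature vector would typically be built per residue window rather than per whole sequence, so that each position receives its own Helix/Sheet/Coil label; the averaging step here is only a simplification to keep the example self-contained.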