Archives of Molecular Medicine and Genetics (AMMG)

Drug Repositioning Network System Using the Power of Network Analysis and Machine Learning to Predict New Indications for The Approved Drugs "Drug Repositioning and Rate the Level of The Drug Similarity"



Sherief Ahmed Hassan El-Rweney1*


1Computer science systems and information technology, Royal Holloway, University of London, UK


*Corresponding Author: Sherief Ahmed Hassan El-Rweney, Computer science systems and information technology, Royal Holloway, University of London, UK, TEL: +447717317626; FAX: +201023993902; E-mail: shriefelrweney22@hotmail.com


Citation:Sherief Ahmed Hassan El-Rweney (2017) Drug Repositioning Network System Using the Power of Network Analysis and Machine Learning to Predict New Indications for The Approved Drugs "Drug Repositioning and Rate the Level of The Drug Similarity".Arch Mol Med & Gen 1:103.


Copyright: © 2017 Sherief Ahmed Hassan El-Rweney, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Received date:December 04, 2017; Accepted date:December 30, 2017; Published date:January 06, 2018


Abstract

Statement of the Problem: Drug discovery is a lengthy process, taking on average 12 years for a drug to reach the market. But as Sir James Black OM once said, “the best way to discover a new drug is to start with the old one”. This observation leads to the concept of drug repositioning.


Drug repurposing, or repositioning, is finding a new clinical use for an approved drug. Many factors can be used to predict a new target disease, e.g. protein-protein interaction, chemical structure, gene expression and functional genomics, phenotype and side effects, genetic variation, and machine learning.


A protein-protein interaction (PPI) is a physical contact with molecular docking between proteins that occurs in a cell or in a living organism in vivo. There are two alternative approaches to detecting PPIs: binary (yeast two‐hybrid, Y2H) and co‐complex (tandem affinity purification coupled to mass spectrometry, TAP‐MS).


The Drug Repositioning System is built on binary protein-protein interactions to predict new targets for approved drugs. The system curates data sets for human PPIs, drugs and diseases from well-known online sources (PPIs from HPRD, drugs from DrugBank, diseases from DisGeNET) and relates the three data sets by gene name.


The Drug Repositioning Network System consists of two parts. The backend stores the curated data sets in a relational database and uses Big Data tools. The frontend is a web interface where end users can use several search engines to search the system for diseases, genes and drugs, and so predict new targets for approved drugs based on protein interactions. From the web interface, users can analyse their search results, build a network between genes, diseases and drugs, and generate statistics to answer their questions.


The Drug Repositioning System can answer many questions and generate statistics for them. The main question is: can we find new indications for existing approved drugs?


Drug similarity: with the Drug Repositioning System we are able to measure the percentage of drug similarity between any pair of interacting genes, based on the number of drugs they share, in order to rate the strength of the drug repositioning hypothesis, and then apply ROC analysis.


Introduction

Definitions of drug repositioning


• Drug repurposing, or repositioning, is finding a new clinical use for an approved drug. For repositioning, we use drugs that have already been approved, which serves as the first step in drug discovery.


• Drug repositioning (also referred to as drug repurposing, re‐profiling, therapeutic switching and drug re‐tasking) is the identification of new therapeutic indications for known drugs. These drugs can either be approved and marketed compounds used daily in a clinical setting, or they can be drugs that have been “shelved”, namely molecules that did not succeed in clinical trials. In this research we use only the approved drugs [1].


• Drug repositioning is the application of known drugs and compounds to treat new indications (i.e., new diseases). In other words, the goal of a repositioning initiative is to establish a link between a drug and a disease.


Drug discovery


• Drug discovery is a lengthy process, taking on average 12 years for a drug to reach the market. But as Sir James Black OM once said, “the best way to discover a new drug is to start with the old one” [3].


• EMBL-EBI defined the process of searching inside its databases during the phases of the drug design [4].


Basic Concept


In Figure 1, based on the interaction between ProteinA and ProteinB, we can build a network path from ProteinA to DiseaseB and DrugB, and a network path from ProteinB to DiseaseA and DrugA.


Figure 1
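The basic concept can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the protein, drug and disease names come from Figure 1, while the dictionaries and the function `repositioning_candidates` are hypothetical names introduced here.

```python
# Toy data mirroring Figure 1 (illustrative names, not curated records).
interactions = [("ProteinA", "ProteinB")]  # binary PPI pairs
drugs_for = {"ProteinA": {"DrugA"}, "ProteinB": {"DrugB"}}
diseases_for = {"ProteinA": {"DiseaseA"}, "ProteinB": {"DiseaseB"}}


def repositioning_candidates(interactions, drugs_for, diseases_for):
    """Pair each drug of one protein with the diseases of its interaction partner."""
    candidates = set()
    for a, b in interactions:
        # Follow the network path in both directions: a -> b and b -> a.
        for source, partner in ((a, b), (b, a)):
            for drug in drugs_for.get(source, ()):
                for disease in diseases_for.get(partner, ()):
                    candidates.add((drug, disease))
    return candidates


candidates = repositioning_candidates(interactions, drugs_for, diseases_for)
print(sorted(candidates))
```

Each resulting pair is only a hypothesis to be examined further, for example a drug acting on ProteinA proposed for the disease linked to ProteinB.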

Big concept


Using the power of network analysis and machine learning to predict new indications for the approved drugs.


• Drug Repositioning


• Protein–Protein Interactions PPI


• Network analysis


• Biological Networks


Aims of the research


Figure 2

Output example from our experiment:


On the left side of the picture, two groups of drugs target numerous diseases related to the HDAC6 gene. On the right side, one group of drugs targets two groups of diseases related to the TUBB gene. The link between TUBB and HDAC6 indicates the interaction between them. As a result, we can propose drug repositioning between the two genes.


Figure 3

Success story of drug repositioning, Thalidomide


Many drugs have been successfully repositioned in the past; classical examples include sildenafil (Viagra) and thalidomide [6].


A significant advantage of drug repositioning over traditional drug development is that, since the repositioned drug has already passed a significant number of toxicity and other tests, its safety is known and the risk of failure for reasons of adverse toxicology is reduced. More than 90% of drugs fail during development, and this is the most significant reason for the high costs of pharmaceutical R&D [6].


Drug repositioning has been growing in importance in the last few years for many reasons, for example:


Pharmaceutical companies see their drug pipelines drying up and realize that many previously promising technologies have failed to deliver ‘as advertised’.


Computational approaches based on virtual screening of comprehensive libraries of approved and other human use compounds against large numbers of protein targets simultaneously have been developed to enhance the efficiency and success rates of drug repositioning [6].


Thalidomide


Thalidomide was marketed to treat morning sickness in pregnant women. The drug was assumed to be safe, based on an in vivo study in rodents, but it caused severe skeletal birth defects in children born to women taking it. Over 15,000 newborns were affected, suffering from anatomical malformations. Because of this disastrous side effect, the molecule was quickly withdrawn, and the episode triggered important reforms in the drug regulatory system. The story could have ended here, if it were not for an incidental discovery by Jacob Sheskin. The practitioner was trying to treat patients affected by erythema nodosum leprosum, a particularly painful inflammatory condition characterised by red nodules under the skin. One evening in 1964, an affected patient could not sleep because the pain was so intense. As a last resort, Sheskin decided to use some thalidomide, as the compound was known for its potent sleep-inducing properties and was available in the hospital. The drug worked and the patient was well rested in the morning. To general surprise, all pain and soreness had disappeared overnight too. Sheskin further studied the action of thalidomide in clinical trials and showed that the drug can indeed treat erythema nodosum leprosum within two weeks in most subjects. Thalidomide found a new life and became the first and only drug approved for this indication [5].


Basic development points included in the research


This research goes through the following basic points to achieve the aims of the study:


a. Curate data for drugs, proteins and diseases from well-known online resources.


b. Clean the collected data, then perform data mining and compute statistics.


c. Build a large dataset containing drugs, proteins and diseases with the known interactions between them, together with a programming interface to query the dataset and answer questions.


d. Build a network medicine model to analyse the new targets for the approved drugs.


e. Strengthen the drug repositioning hypothesis between gene pairs by applying machine learning to calculate the percentage of drug similarity.


Research Background Discussed in This Work

a. Protein–Protein Interactions Essentials


b. Basic terminologies of networks and networks analysis


c. Biological Networks


d. Elements and principles of network theory


e. The principles of Network Medicine


Protein–Protein Interactions Essentials: Key Concepts to Building and Analyzing Interactome


PPI Definition


Physical contacts with molecular docking between proteins that occur in a cell or in a living organism in vivo.


Definition considerations:


1. Any protein in the ribosome or in the basal transcriptional apparatus shares a functional contact with the other proteins in the complex, but certainly not all the proteins in the particular complex interact.


2. The interaction interface should be intentional and not accidental.


3. The interaction interface should be non-generic.


4. That PPIs imply physical contact between proteins does not mean that such contacts are static or permanent.


5. Not all possible interactions will occur in any cell at any time.


Two Alternative Approaches to PPI Detection [7]


Binary and Co‐Complex:


Interactions between proteins are detected at either a large or small scale using two techniques:


-Binary: yeast two-hybrid (Y2H)


Measures direct interactions between proteins.


-Co-complex : tandem affinity purification coupled to mass spectrometry (TAP-MS)


Measures both direct and indirect interactions between proteins. Both techniques are widely applied in large‐scale investigations.


The following figure shows binary methods and co‐complex methods, the two approaches to determine PPIs.


The interactions shown in the left panel (green links) correspond to the true interactions existing between two groups of proteins (set A with four proteins and set B with three proteins). The interactions shown in the right panels correspond to the networks derived from the experimentally measured interactions existing between the six proteins analyzed: the network in the top right panel (blue links) presents the interactions obtained using a binary method; the network in the bottom right panel (red links) presents the interactions obtained using a co‐complex method. The red links are calculated by applying the spoke model to the TAP‐MS experimental data, but three of the interactions deduced (links with an X) do not occur.


Figure 4

The Main Databases and Repositories That Include PPIs [7]


As mentioned before, we build our drug repositioning network on protein interactions, so practical users have to know which types of interaction databases are available, what the differences between them are, and which are the most comprehensive and stable repositories.


Over the past few years, the number of known protein–protein interactions has increased substantially. To make this information more readily available, a number of publicly available databases have set out to collect and store protein–protein interaction data. Protein–protein interactions have been retrieved from six major databases, integrated, and the results compared. The six databases are the Biological General Repository for Interaction Datasets [BioGRID], the Molecular INTeraction database [MINT], the Biomolecular Interaction Network Database [BIND], the Database of Interacting Proteins [DIP], the IntAct molecular interaction database [IntAct] and the Human Protein Reference Database [HPRD].


With respect to human protein–protein interaction data, HPRD seems to be the most comprehensive. To obtain a complete dataset, however, interactions from all six databases have to be combined. To overcome this limitation, meta-databases such as the Agile Protein Interaction Database (APID) offer access to integrated protein–protein interaction datasets, although these also currently have certain restrictions.


A comparison of the main databases and repositories that include protein interactions is shown in Table 1


Table-1

Analysis of Coverage and Ways to Improve PPI Reliability [7]


-A first obstacle to evaluating the reliability of PPIs is the low coverage of the databases for each specific interactome.


-One way to increase coverage is to integrate data reported by different primary databases.


The following figure gives an example: the data on human PPIs coming from six different primary databases show only a small overlap [7].


Figure 5

Basic terminologies of networks and networks analysis

[8,9]


To build our study and understand the networks, we first go through the basic terminology of networks and network analysis.


Network science


Main definition


Network analysis is a relatively new area of data analysis.


From a data science point of view, a network is a collection of interconnected Objects. We may call the network objects “nodes,” “vertexes,” or “actors,” and call the connections between them “arcs,” “edges,” “links,” or “ties.” We may represent networks graphically and mathematically as graphs.


Mathematically speaking, a graph is a set of nodes connected with edges.


Figure 6

Graph Elements, Types, and Density


-Digraph: a directed graph, in which an edge may, for example, connect node A to node B but not the other way around.


-Multigraph: a graph with parallel edges, in which node A may be connected to node B by more than one edge.


-Simple graph: graph without loops and parallel edges


-Weighted graph: a graph in which weights are assigned to the edges. A weight is usually (but not necessarily) a number between 0 and 1, inclusive. The larger the weight, the stronger the connection between the nodes.


-Degree: number of edges connected (incident) to the node.


-In degree: the number of edges coming into the node (directed graphs only).


-Out degree: the number of edges going out of the node (directed graphs only).


-Graph density d (0≤d≤1): how close the graph is to a complete graph. Density for a directed graph with e edges and n nodes:


Figure 7

Density for undirected graph:


Figure 8

-Diameter: the largest distance between any two nodes in a graph.


-Connected component: set of all nodes in a graph such that there is a path from each node in the set to each other node in the set.


-A clique: set of nodes such that each node is directly connected to each other node in the set.


-A star: set of nodes such that one node is connected to all other nodes in the set, but the other nodes are not connected to each other.


-Neighbourhood G(A): the set of nodes directly connected to a node A.


-The local clustering coefficient of a node A: density of the neighborhood of A without the node A itself. The clustering coefficient of any node in a star is 0. The clustering coefficient of any node in a complete graph is 1.


-Network community: set of nodes such that the number of edges interconnecting these nodes is much larger than the number of edges crossing the community boundary. Modularity is a measure of the quality of a community structure.


-Centralities: measure of the importance of a node in a network. Often scaled to the range between 0 (an unimportant, peripheral node) and 1 (an important, central node).


Type of Centralities:


Degree centrality, closeness centrality, betweenness centrality, eigenvector centrality.
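The definitions above can be checked on toy graphs with networkx. The sketch below assumes the standard density formulas, d = e / (n(n-1)) for a directed graph and d = 2e / (n(n-1)) for an undirected graph, which are taken here to be the ones shown in Figures 7 and 8; the graphs themselves are invented for illustration.

```python
import networkx as nx

# Undirected toy graph: a triangle A-B-C plus a pendant node D.
G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
n, e = G.number_of_nodes(), G.number_of_edges()

# Density of an undirected graph: d = 2e / (n(n-1)).
undirected_density = 2 * e / (n * (n - 1))

# Density of a directed graph: d = e / (n(n-1)).
D = nx.DiGraph([("A", "B"), ("B", "C")])
nd, ed = D.number_of_nodes(), D.number_of_edges()
directed_density = ed / (nd * (nd - 1))

# Local clustering coefficient: density of a node's neighbourhood.
star_clustering = nx.clustering(nx.star_graph(4))          # every value is 0
complete_clustering = nx.clustering(nx.complete_graph(5))  # every value is 1

# Degree centrality, scaled to the range [0, 1].
centrality = nx.degree_centrality(G)
most_central = max(centrality, key=centrality.get)

print(undirected_density, directed_density, most_central)
```

The hand-computed densities agree with `nx.density`, which applies the same two formulas depending on whether the graph is directed.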


Network analysis sequence


1-Discrete entities and the relations between them are identified. The entities become the network nodes, and the relations become the network edges.


2-Measures are calculated: density, number of components, size of the giant connected component (GCC), diameter, centralities, clustering coefficients, and so on.


3-Network communities are identified, if the network turns out to be modular. Finally, the results are interpreted, and a report with appealing pictures is produced, using networkx tools for example.
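The three steps can be sketched with networkx as follows. The toy network (two cliques joined by one edge) and the choice of the greedy modularity algorithm for community detection are assumptions made for illustration; the text does not prescribe a particular algorithm.

```python
import networkx as nx
from networkx.algorithms import community

# Step 1: entities become nodes, relations become edges.
# Toy network: two 5-node cliques joined by a single edge.
G = nx.barbell_graph(5, 0)

# Step 2: calculate global measures.
measures = {
    "density": nx.density(G),
    "components": nx.number_connected_components(G),
    "diameter": nx.diameter(G),
    "avg_clustering": nx.average_clustering(G),
}

# Step 3: identify communities and score the partition by modularity.
communities = community.greedy_modularity_communities(G)
modularity = community.modularity(G, communities)

print(measures)
print([sorted(c) for c in communities], round(modularity, 3))
```

A clearly positive modularity suggests that the network is modular, which is the condition under which step 3 is meaningful.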


Tools for Exploring and Analyzing a Network


Here we explore some well-known tools as examples:


1-networkx


The networkx module, a third-party Python library, contains essential tools for creating, modifying, exploring, plotting, exporting, and importing networks. It supports simple and directed graphs and multigraphs.


To inspect the values inside a graph (here a graph object named borders), we call functions provided by networkx:


-To get the number of nodes: len(borders)


-To list nodes: borders.nodes()


-To list edges: borders.edges()


-List of neighbours: list(borders.neighbors("Germany"))


➾ ['Czech Republic', 'France', 'Netherlands, Kingdom of the', 'Denmark', 'Switzerland', 'Belgium', 'Netherlands', 'Luxembourg', 'Poland', 'Austria']


-To calculate the degree of a specific node: borders.degree("Poland")


➾ 7


-To list the degrees of all nodes: dict(borders.degree())


➾ {'Iran': 8, 'Nigeria': 4, 'Chad': 6, 'Bulgaria': 5, 'France': 14, 'Lebanon': 2, 'Namibia': 4, «...»}


-To get a dictionary of the clustering coefficients of all nodes: nx.clustering(borders)


➾ {'Iran': 0.2857142857142857, 'Nigeria': 0.5, 'Chad': 0.4, 'Bulgaria': 0.4, 'France': 0.12087912087912088, 'Lebanon': 1.0, 'Namibia': 0.5, «...»}


-To get the clustering coefficient of a single node: nx.clustering(borders, "Lithuania")


➾ 0.8333333333333334


-To list the connected components, from which subgraphs can be created: list(nx.connected_components(borders))


➾ [{'Iran', 'Chad', 'Bulgaria', 'Latvia', 'France', 'Western Sahara', «...»}]


-To calculate which node has the highest centrality (the comment names the top node):


nx.degree_centrality(borders) # People’s Republic of China


nx.in_degree_centrality(borders) # directed graphs only


nx.out_degree_centrality(borders) # directed graphs only


nx.closeness_centrality(borders) # France


nx.betweenness_centrality(borders) # France


nx.eigenvector_centrality(borders) # Russia


-Functions to detect maximal cliques, find_cliques(), and zero‐degree nodes, isolates():


list(nx.find_cliques(borders))


➾ [['Iran', 'Nagorno‐Karabakh Republic', 'Armenia', 'Azerbaijan'], ['Iran', 'Afghanistan', 'Pakistan'], «...»]


list(nx.isolates(borders))


➾ ['Penguinia']
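The calls above assume an existing graph object named borders, which the text does not construct. As a minimal, hypothetical stand-in, the sketch below builds a small border graph by hand (a handful of European countries plus the fictional isolate from the listing) so that the same calls can be run end to end. The outputs will differ from those shown above, which come from a much larger graph.

```python
import networkx as nx

# Hypothetical miniature of the `borders` graph: an edge joins two
# countries that share a land border (small hand-picked subset).
borders = nx.Graph()
borders.add_edges_from([
    ("Germany", "France"), ("Germany", "Poland"), ("Germany", "Austria"),
    ("Germany", "Switzerland"), ("France", "Switzerland"),
    ("Austria", "Switzerland"), ("Poland", "Lithuania"),
])
borders.add_node("Penguinia")  # fictional node with no borders (an isolate)

print(len(borders))                          # number of nodes
print(sorted(borders.neighbors("Germany")))  # neighbours of one node
print(borders.degree("Poland"))              # degree of one node
print(nx.clustering(borders, "Germany"))     # local clustering coefficient
print(list(nx.isolates(borders)))            # zero-degree nodes
print(list(nx.find_cliques(borders)))        # maximal cliques
```

In recent networkx versions, `neighbors()` and `isolates()` return iterators, which is why they are wrapped in `sorted()` or `list()` here.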


2-NetworKit


NetworKit is a highly efficient and parallelizable network analysis toolkit, better suited than networkx to the analysis of large networks.


NetworKit developers claim that “community detection in a 3 billion edge web graph can be performed on a 16‐core server in a matter of minutes”


NetworKit is integrated with matplotlib, scipy, numpy, pandas, and networkx.


3-Gephi


An interactive visualization and exploration platform for all kinds of networks and complex systems


Biological Networks


In this research, we use the network concept as one of the data science analysis tools for the primary analysis of our data set.


Biological Network types and Characteristics.


Networks of coupled dynamical systems have been used to model biological oscillators, Josephson junction arrays, excitable media, neural networks, spatial games, genetic control networks and many other self‐organizing systems. Ordinarily, the connection topology is assumed to be either completely regular or completely random. But many biological, technological and social networks lie somewhere between these two extremes [10].


A. Regular network (large world) and random network (small world) [10]


Networks in this middle ground can be modelled as regular networks ‘rewired’ to introduce increasing amounts of disorder. These systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs; they are called ‘small‐world’ networks. A simple ‘majority‐rule’ running on a small‐world graph can outperform all known human and genetic algorithm‐generated rules running on a ring lattice.


Figure 9

The figure illustrates the random rewiring procedure for interpolating between a regular ring lattice and a random network. Start with a ring of n vertices, each connected to its k nearest neighbors by undirected edges. Each vertex is visited in turn, together with the edge that connects it to its nearest neighbor in a clockwise sense.


With probability p, we reconnect this edge to a vertex chosen uniformly at random over the entire ring.


The procedure is repeated for different values of p. For p = 0, the original ring is unchanged (regular network); as p increases, the graph becomes increasingly disordered until, for p = 1, all edges are rewired randomly (random network).
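The rewiring procedure can be explored numerically with the Watts–Strogatz generator that ships with networkx. The sketch below is an illustration under assumed parameters (n = 200, k = 6, a fixed seed) and is not taken from [10]; it measures the average shortest path length and the average clustering coefficient for three values of p.

```python
import networkx as nx

n, k = 200, 6  # ring of n vertices, each joined to its k nearest neighbours

results = {}
for p in (0.0, 0.1, 1.0):
    # connected_watts_strogatz_graph retries the rewiring until the graph
    # is connected, so the average path length is well defined.
    G = nx.connected_watts_strogatz_graph(n, k, p, tries=100, seed=42)
    results[p] = (
        nx.average_shortest_path_length(G),  # characteristic path length
        nx.average_clustering(G),            # clustering coefficient
    )

for p, (path_length, clustering) in results.items():
    print(f"p={p:<4} L={path_length:.2f} C={clustering:.3f}")
```

With these settings the regular ring (p = 0) should show both a long path length and high clustering, the fully rewired graph (p = 1) should show both values low, and the intermediate case should keep much of the clustering while the path length drops.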


B. Characteristic path length L(p) and clustering coefficient C(p) [10]


The family of randomly rewired graphs is characterised by two quantities:


-Characteristic path length L(p): the number of edges in the shortest path between two vertices, averaged over all pairs of vertices.


Figure 10

-Clustering coefficient C(p): C measures the cliquishness of a typical friendship circle.


Figure 11

The local coefficient $C_v$ is the ratio of the number $E_v$ of edges that actually exist among the neighbors of v to the number of edges that could exist.


Suppose that a vertex v has $k_v$ neighbors; then at most $k_v(k_v-1)/2$ edges can exist between them (this occurs when every neighbor of v is connected to every other neighbor of v), so $C_v = E_v / (k_v(k_v-1)/2)$. Define C as the average of $C_v$ over all v.


Characteristics of the network for intermediate values of p


The graph is a small-world network: highly clustered, like a regular graph, yet with a small characteristic path length, like a random graph.


Figure 12

Characteristics of a regular network, for p close to 0


High clustering coefficient C(p) and high characteristic path length L(p): a large world.


Figure 13

Characteristics of a random network, for p close to 1


Low clustering coefficient C(p) and low characteristic path length L(p): a small world.


Figure 14

C. Graph theory of Erdős and Rényi (ER) [11]


Networks of complex topology have been described with the random graph theory of Erdős and Rényi (ER), but in the absence of data on large networks, the predictions of the ER theory were rarely tested in the real world.


To build the graph: start with N nodes and connect each pair of nodes with probability p. This creates a graph with approximately pN(N–1)/2 randomly placed links. The node degrees follow a Poisson distribution which indicates:


-That most nodes have approximately the same number of links (close to the average degree ⟨k⟩).


-The tail (high k region) of the degree distribution P(k) decreases exponentially, which indicates that nodes that significantly deviate from the average are extremely rare.


In the large real networks that have been studied, by contrast, independent of the system and the identity of its constituents, the probability P(k) that a vertex in the network interacts with k other vertices decays as a power law: $P(k) \sim k^{-\gamma}$.


As mentioned before, in the small‐world model introduced by Watts and Strogatz (WS) [10], N vertices form a one‐dimensional lattice, each vertex being connected to its two nearest and next‐nearest neighbours. With probability p, each edge is reconnected to a vertex chosen at random. The long‐range connections generated by this process decrease the distance between the vertices, leading to a small-world phenomenon.


A common feature of the ER and WS models is that the probability of finding a highly connected vertex (that is, a large k) decreases exponentially with k; thus, vertices with large connectivity are practically absent. In contrast, the power‐law tail characterizing P(k) for the networks studied indicates that highly connected (large k) vertices have a large chance of occurring, dominating the connectivity.
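The contrast between the two kinds of degree distribution can be illustrated with networkx generators. In the sketch below, the ER graph follows the construction described above; the Barabási–Albert preferential attachment generator is swapped in as a convenient way to obtain a power-law tail and is an assumption of this example, as are the sizes and seeds.

```python
import networkx as nx

N, p = 2000, 0.002

# Erdős–Rényi: each pair of nodes is linked with probability p.
er = nx.gnp_random_graph(N, p, seed=1)

# Scale-free comparison with a similar average degree (about 4).
ba = nx.barabasi_albert_graph(N, 2, seed=1)

expected_links = p * N * (N - 1) / 2
er_degrees = [d for _, d in er.degree()]
ba_degrees = [d for _, d in ba.degree()]

print("ER links:", er.number_of_edges(), "expected about", round(expected_links))
print("ER max degree:", max(er_degrees))
print("BA max degree:", max(ba_degrees))
```

Although both graphs have a similar average degree, the largest degree in the scale-free graph should be many times that of the ER graph, which reflects the presence of hubs.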


D. Network medicine: a network-based approach to human disease [12]


Most cellular components exert their functions through interactions with other cellular components, which can be located either in the same cell or across cells, and even across organs. The result is a complex network, the human interactome. Network‐based approaches to human disease have multiple potential biological and clinical applications. A better understanding of the effects of cellular interconnectedness on disease progression may lead to the identification of disease genes and disease pathways, which, in turn, may offer better targets for drug development.


Biological network maps [12,13]


In the following, we briefly discuss the most studied network maps and their limitations


1-Protein–protein interaction networks [12,13]


Figure 15

2-Metabolic networks [12,13]


Figure 16

3-Regulatory networks [12,13]


Nodes are either proteins or a putative DNA regulatory element and directed edges represent:


Techniques: ChIP-chip and ChIP-seq


• Databases: UniPROBE, JASPAR, TRANSFAC, BCI


Figure 17

Databases: PhosphoELM, PhosphoSite, PHOSIDA


4-RNA networks


They capture the interactions between RNAs and DNA in regulating gene expression.


Nodes represent small non-coding RNAs (miRNAs) or small interfering RNAs (siRNAs) and DNA regulatory elements. Links represent regulation.


Databases:


1.Predicted microRNA targets: TargetScan, PicTar, microRNA, miRBase, miRDB


2.Experimentally supported targets: TarBase, miRecords


Elements and principles of network theory

[12,13]


In the following, we summarize the aspects of network theory that pertain to biological networks.


1- Modules


Most networks show a high degree of clustering, implying the existence of topological modules that represent highly interlinked local regions in the network. Although the identification of such modules can be computationally challenging, a wide array of network‐clustering tools have emerged over the past few years.


2- Degree distribution and hubs


-In a random network most nodes have approximately the same number of links, and highly connected nodes (hubs) are rare.


-The fraction of nodes with a given degree, called the degree distribution, follows the well-known Poisson distribution.


-Real networks, such as human protein–protein interaction and metabolic networks, are scale free, which means that the degree distribution has a power-law tail; that is, the degree distribution P(k), with degree k, follows


$P(k) \sim k^{-\gamma}$, where $\gamma$ is called the degree exponent. Such networks contain a few highly connected hubs that hold the whole network together.


3-Small-world phenomena


Biological networks have the small-world property: there are relatively short paths between any pair of nodes, which means that most proteins (or metabolites) are only a few interactions (or reactions) from any other protein (or metabolite).


Motifs: a motif is a group of nodes that link to each other, forming a small subnetwork within a network.


4-Betweenness centrality


Nodes with a high betweenness centrality (a measure of the number of shortest paths that go through each node) are often called bottlenecks. In networks with directed edges, such as regulatory networks, bottlenecks tend to correlate with essentiality. The next section lists hypotheses and organizing principles that link network structure to biological function and disease.
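A bottleneck need not be a hub, which a small undirected example makes concrete. The sketch below uses a toy graph chosen for illustration (two 4-node cliques joined through a single bridge node) and finds the node with the highest betweenness centrality.

```python
import networkx as nx

# Two 4-node cliques (nodes 0-3 and 5-8) joined through one bridge node, 4.
G = nx.barbell_graph(4, 1)

betweenness = nx.betweenness_centrality(G)  # normalised to [0, 1] by default
degree = dict(G.degree())
bottleneck = max(betweenness, key=betweenness.get)

print(bottleneck, degree[bottleneck], round(betweenness[bottleneck], 3))
```

The bridge node has the lowest degree in the graph (2) but the highest betweenness, because every shortest path between the two cliques passes through it.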


The principles of Network Medicine

[12,13]


1-Hubs


Disease genes tend to avoid hubs and segregate at the functional periphery of the interactome. In humans, essential genes, not disease genes, are encoded in hubs.


2-Local Hypothesis


If a gene or molecule is involved in a specific biochemical process or disease, its direct interactors might also be suspected to have some role in the same biochemical process.


Proteins involved in the same disease have an increased tendency to interact with each other.


3-Corollary of the local hypothesis


Mutations in interacting proteins often lead to similar disease phenotypes.


4-Disease module hypothesis


Cellular components associated with a specific disease phenotype show a tendency to cluster in the same network neighborhood.


Network parsimony principle


Causal molecular pathways often coincide with the shortest molecular paths between known disease‐associated components.


5-Shared components hypothesis


Diseases that share disease‐associated cellular components (genes, proteins, metabolites or microRNAs) show phenotypic similarity and comorbidity.


Local clustering of disease genes: disease modules


The disease module concept follows from the network science principle of modules, mentioned before, and the network medicine principle of the ‘local’ hypothesis.


6-Genes associated with a specific disease tend to cluster in the same neighborhood.


As stated before, if a gene or molecule is involved in a specific biochemical process or disease, its direct interactors might also be suspected to have some role in the same biochemical process. In line with this ‘local’ hypothesis, proteins that are involved in the same disease show a high propensity to interact with each other.


Figure 18

Based on this principle we have 3 modules:


1.‘topological module’: a locally dense neighborhood in a network, such that nodes have a higher tendency to link to nodes within the same local neighborhood than to nodes outside it.


2.‘functional module’: nodes of similar or related function in the same network neighborhood, where function captures the role of a gene in defining detectable phenotypes.


3.‘disease module’: a group of network components that together contribute to a cellular function and disruption of which results in a particular disease phenotype [12].
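The ‘local’ hypothesis behind the disease module can be turned into a simple ranking: genes that interact with several known disease genes become candidates for the same disease. The sketch below is a hypothetical illustration with invented gene names and a function `candidate_genes` introduced here; it is not a method described in [12].

```python
from collections import Counter

# Toy PPI edges and a toy set of known disease genes (illustrative names only).
ppi_edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4"), ("G5", "G6")]
disease_genes = {"G1", "G2"}


def candidate_genes(ppi_edges, disease_genes):
    """Count, for each non-disease gene, how many known disease genes it interacts with."""
    hits = Counter()
    for a, b in ppi_edges:
        for gene, partner in ((a, b), (b, a)):
            if partner in disease_genes and gene not in disease_genes:
                hits[gene] += 1
    return hits.most_common()  # highest count first


ranking = candidate_genes(ppi_edges, disease_genes)
print(ranking)
```

Here G3 interacts with both known disease genes and is the only candidate, while G4 is one step further away and is not proposed.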


Drug repositioning network system Process

Different kinds of similarity can be used to predict another indication for an approved drug. This system relies on protein interactions to find new targets for existing drugs.


The following steps describe the drug repositioning system process:


Figure 19

Curate data for drugs, proteins and diseases from well-known online resources [14-16].


Figure 20

Raw Data filtering And Initial statistics


1-Drug Data set


The raw drug data set was filtered to the human species, yielding 1474 gene names mapped to their drugs.


The following chart shows the distribution of the connections between the drugs and each gene in the drug database.


After filtering the drug database by human species, we found 1474 genes. Each gene is connected to a group of drugs, and the total number of connections between all genes and their drug groups is 1611.


This means that some genes have more than one connection, each with a different group of drugs.


For example, the connections axis of the chart gives the number of connections for each gene, and the tail of the distribution indicates that the majority of genes are connected with one group of drugs. From the original data we find that the gene “HTR1B” has 3 connections with different groups of drugs.


Figure 21

2-Diseases Data set


The following chart shows the distribution of the connections between the diseases and each gene in the diseases database.


After filtering the diseases database against the human drug genes, we found 1219 genes. Each gene is connected to one disease or a group of diseases, and the total number of connections between all genes and their diseases is 31928.


This means that some genes have more than one connection, each with a different group of diseases.


For example, the connections axis of the chart gives the number of connections for each gene, and the tail of the distribution indicates that the majority of genes have fewer than 50 connections with diseases.


Note: the relation between genes and diseases is many to many: one gene can connect to many diseases, and one disease can connect to many genes.


From the original data we find that the gene “HRAS” has 230 connections with diseases, while “CFB” has 10 connections with diseases.


Figure 22

3-PPI Data set


The following Chart indicates the distribution of the binary Interactions between Proteins.


After filtering the protein interaction database against the human drug genes, we found 1472 genes. Each gene interacts with one or more other genes, and the total number of interactions between all genes is 10952.


This means that some genes interact with more than one other gene.


For example, the connections axis of the chart gives the number of interactions for each gene, and the tail of the distribution indicates that the majority of genes have fewer than 20 interactions with other genes.


Note: the interaction relation between genes is many-to-many, meaning that one gene can interact with many other genes and vice versa.


In the original data, the gene “SMAD” has 90 interactions with other genes, while “PYGM” has 3.
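Counting interactions per gene in a binary interaction list might look like the sketch below. It assumes that interactions are undirected and that a pair listed in both orders counts once. The text does not say how duplicates were handled, so that choice, the partner genes and the function name are assumptions.

```python
from collections import Counter

# Hypothetical binary interactions; GENE_A and GENE_B are placeholder partners.
interactions = [
    ("SMAD", "GENE_A"),
    ("GENE_A", "SMAD"),  # same pair in reverse order, counted once below
    ("SMAD", "GENE_B"),
    ("PYGM", "GENE_A"),
]


def interaction_degree(pairs):
    """Number of distinct interaction partners per gene (undirected, no self-loops)."""
    unique_pairs = {frozenset(pair) for pair in pairs if pair[0] != pair[1]}
    degree = Counter()
    for pair in unique_pairs:
        for gene in pair:
            degree[gene] += 1
    return degree


degree = interaction_degree(interactions)
print(degree["SMAD"], degree["PYGM"])
```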


Figure 23

Drug repositioning network System manual

In this part we explore the first prototype of the Drug Repositioning Network System and how to use it to predict and discover new targets.


The system has two main parts, a backend and a frontend. The backend uses Big Data tools and network analysis to answer the queries and questions that users submit through the frontend web interface and its search engines.


The backend acts as a supply chain for the search engines that end users rely on to extract many kinds of data, make predictions and analyses, and answer questions.


The frontend contains the search engines with which users can “Find new indications for approved drugs based on Gene interaction”. The purpose of this part is to achieve the main aims of this research and to answer questions such as “are there any new diseases that the approved drugs can target?” or “are there any approved drugs, other than its original drugs, that can target an existing disease?”. By answering these questions, we try to predict new indications for approved drugs.


The frontend also supports network analysis after building a route path between a pair of genes and their corresponding drugs and diseases.


The snapshot figures show an example of how to use the web interface of the Drug Repositioning system.


Case 1


-In the first snapshot


For example, the search for the “Obesity” disease in the first snapshot shows how the data appear in the columns of the disease table.


The “Obesity” disease may appear in the diseaseName1 column or in the diseaseName2 column, and the two gene interactors appear in the Gene1 and Gene2 columns.


If “Obesity” appears in the diseaseName1 column, then the gene in the Gene1 column, for example “TUB”, is the original gene of “Obesity”. The gene in the Gene2 column, for example “LCK”, is the interactor pair of “TUB”, and “LCK” is also the original gene of the disease in the diseaseName2 column, for example “Diarrhea”. It is noticeable that the “LCK” gene is connected to 12 diseases at the same time.


-As a result, our hypothesis predicts that all drugs that originally target the 12 diseases of the “LCK” gene can also target the “Obesity” disease.


-To confirm this, in the second snapshot,


We use the “Obesity” gene “TUB” to search in the “Drug search engine” for the original drugs related to the “Obesity” disease and for other predicted groups of drugs that can target it. For example, “DB02028” is an original drug for the “Obesity” disease, while “DB01064; DB05210; DB08059” form a drug group that is connected to the “PIK3R1” gene and are newly predicted drugs targeting the “Obesity” disease.
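The lookup walked through in Case 1 can be sketched as three small tables and one function. This is an illustrative reading of the hypothesis, not the system's implementation. The table layout and the function name are assumptions, and the TUB and PIK3R1 interaction is only implied by the example. The drug IDs and gene names are those quoted in the text.

```python
# Tables assumed for illustration; the real system queries its backend databases.
disease_genes = {"Obesity": {"TUB"}}
ppi = {"TUB": {"PIK3R1"}, "PIK3R1": {"TUB"}}  # interaction implied by the example
gene_drugs = {
    "TUB": {"DB02028"},
    "PIK3R1": {"DB01064", "DB05210", "DB08059"},
}


def predict_drugs_for_disease(disease, disease_genes, ppi, gene_drugs):
    """Return (original drugs, predicted drugs) for a disease.

    Original drugs target the disease's own genes. Predicted drugs target
    genes whose proteins interact with those genes.
    """
    original, predicted = set(), set()
    for gene in disease_genes.get(disease, ()):
        original |= gene_drugs.get(gene, set())
        for partner in ppi.get(gene, ()):
            predicted |= gene_drugs.get(partner, set())
    return original, predicted - original


original, predicted = predict_drugs_for_disease("Obesity", disease_genes, ppi, gene_drugs)
print(sorted(original))
print(sorted(predicted))
```

Starting instead from a drug ID, the same tables can be traversed in the opposite direction to list candidate diseases.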


First snapshot


Figure 24

Second snapshot


Figure 25

In snapshot 3


We search by drug ID in the “Drug search engine” and use the gene returned for the drug to search in the “Disease search engine”, which shows the diseases originally related to the drug and the other predicted diseases that the drug could target. For example, we can search the “Drug search engine” for “Aspirin” by ID, which can be obtained from the DrugBank site. Here the Aspirin ID is DB00945, and the “NFKB1” gene is connected to Aspirin. We can use this gene to search the “Disease search engine” for the diseases targeted by “Aspirin” and for other predicted diseases that “Aspirin” could target. We can repeat this exercise for each gene connected to “Aspirin”.


Snapshot 3


Figure 26

Case 2


Suppose we need to test a pair of genes, for example to find out whether there is a protein interaction between them. If there is a connection between them, we can build a network route path between the drugs and diseases of gene1 and the drugs and diseases of gene2.


In the system we search for gene1 to generate its related drugs and diseases, search for gene2 to generate its drugs and diseases, then generate the network path between them and perform drug repositioning between the two genes.


For example, in snapshot 4


We search for HDAC6 as gene1 and generate all its related drugs and diseases.


Then, in snapshot 5, we search for TUBB as gene2 and generate all its related drugs and diseases.


Finally, we click to generate the network route between them. If there is a protein interaction between the two genes, a connection is built between them and drug repositioning can be performed. Here we find that there is a connection between HDAC6 and TUBB, so drug repositioning is possible.
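The Case 2 steps can be sketched as a function that returns the edges of the combined network only when the two genes interact. The drug and disease names below are placeholders, and `build_route` is a hypothetical helper rather than part of the described system.

```python
# Placeholder data: only the HDAC6/TUBB interaction comes from the text.
ppi = {frozenset(("HDAC6", "TUBB"))}
gene_drugs = {"HDAC6": {"drug_1", "drug_2"}, "TUBB": {"drug_3"}}
gene_diseases = {"HDAC6": {"disease_1"}, "TUBB": {"disease_2", "disease_3"}}


def build_route(gene1, gene2, ppi, gene_drugs, gene_diseases):
    """Return the edge list linking both genes with their drugs and diseases.

    Returns None when no protein interaction is recorded for the pair,
    in which case no route (and no repositioning hypothesis) is produced.
    """
    if frozenset((gene1, gene2)) not in ppi:
        return None
    edges = [(gene1, gene2)]  # the interaction that bridges the two sides
    for gene in (gene1, gene2):
        edges += [(drug, gene) for drug in sorted(gene_drugs.get(gene, ()))]
        edges += [(gene, disease) for disease in sorted(gene_diseases.get(gene, ()))]
    return edges


route = build_route("HDAC6", "TUBB", ppi, gene_drugs, gene_diseases)
no_route = build_route("HDAC6", "PYGM", ppi, gene_drugs, gene_diseases)
print(route)
print(no_route)
```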


Snapshot 4


Figure 27

Snapshot 5


Figure 28

Discussions, experiments, questions and Answers

Question 1: Can we search for a specific disease to find its original drugs and newly predicted drugs, or to find the original genes of the disease and their pair interactors?


Answer:


Based on Case 1 in the Drug repositioning network System manual section, yes we can.


Question 2: Can we find new indications for existing approved drugs?


Answer:


Based on Case 1 in the Drug repositioning network System manual section, yes we can.


Question 3: Can we search to find out whether there is any relation between a drug and a disease, or between two genes?


Answer:


Based on Case 2 in the Drug repositioning network System manual section, yes we can.


Question 4: If we know that there is an interaction between the gene of a drug and the gene of a disease, can we make a basic analysis and plot?


Answer:


Based on Case 2 in the Drug repositioning network System manual section, yes we can.


For example, the following TUBB picture is the network plot of a search for the TUBB gene. It shows every disease and drug connected to the TUBB gene, with some initial statistics such as the number of nodes and connected edges in the network.


The HDAC6 picture is the network plot of a search for the HDAC6 gene. It shows every disease and drug connected to the HDAC6 gene, with some initial statistics such as the number of nodes and connected edges in the network.


Figure 29
Figure 30

The following network picture is an example of network analysis between TUBB and HDAC6, produced after our system found a connection between the two genes because their proteins interact.


On the left side of the picture, two groups of drugs target numerous diseases related to the “HDAC6” gene. On the right side, one group of drugs targets two groups of diseases related to the “TUBB” gene, and the link between “TUBB” and “HDAC6” indicates the interaction between them. Under our prediction, the two groups of drugs related to the “HDAC6” gene can target the two groups of diseases related to “TUBB”, and vice versa. If we study this network example, generated by the drug repositioning system, from the perspective of the protein-protein interaction network maps discussed in the section “Biological network maps”, we find that the protein-protein interaction map is part of our network: nodes represent proteins and edges represent a physical interaction between two proteins (“TUBB” and “HDAC6”), but each protein is also connected to its drugs and diseases.


With reference to section “3.3.2 Elements and principles of network theory”, our network has two main hubs (“TUBB” and “HDAC6”) that connect all the network groups with each other. But what if more than one pair of genes interact? And what if a huge number of genes in the same network interact, with more than one drug and disease connected to more than one gene at the same time? The answer leads us to the drug similarity or disease similarity hypotheses, which we discuss in the next section.


The network also follows the degree distribution and hubs principle. If we analyze the second picture, “Sample of Drug repositioning network degree distribution”, and its statistics, we find that the network's average in-degree and average out-degree are the same, 2.6875, while the numbers of nodes and edges differ but remain close. From the graph, our network, like human protein-protein interaction and metabolic networks, is scale-free, which means that the degree distribution has a power-law tail. This is because we have highly connected hubs that hold the whole network together.
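The equality of the two averages is a general property of directed networks: each edge adds one to the out-degree of its source and one to the in-degree of its target, so both averages equal the number of edges divided by the number of nodes. The sketch below checks this on a hypothetical toy edge list, not on the system's network.

```python
def average_degrees(edges):
    """Return (average in-degree, average out-degree) of a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    in_degree = {node: 0 for node in nodes}
    out_degree = {node: 0 for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    count = len(nodes)
    return sum(in_degree.values()) / count, sum(out_degree.values()) / count


# Toy directed edges (placeholders): 3 edges over 3 nodes.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
avg_in, avg_out = average_degrees(toy_edges)
print(avg_in, avg_out)  # both equal len(toy_edges) / number of nodes
```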


Figure 31
Figure 32

Current experiment, Drug similarity

Main hypothesis:


My experiment tests each pair of interacting genes to find the drugs shared between their related drug groups. From the number of shared drugs relative to the total number of drugs in the two groups, we calculate the percentage of similarity.


The percentage of drug similarity indicates how strong or weak the drug repositioning hypothesis is.


For example:


We know that there is an interaction between the proteins “TUBB” as gene interactor2 and “HDAC6” as interactor1.


Gene “TUBB” has the drug group (DrugA, DrugB, DrugC, DrugD).


Gene “HDAC6” has the drug group (DrugE, DrugA, DrugB).


The shared drugs between the two groups are (DrugA, DrugB).


The percentage of drug similarity between the two interactors = 2/8*100 = 25%.


The drug repositioning hypothesis level is 25%.


This means that the drugs for each gene can target, and are suitable for, the diseases of the other gene at a level of 25%.
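The similarity calculation can be sketched in Python. The function names are hypothetical, and the wording of the hypothesis leaves the denominator open to interpretation, so two readings are shown: the shared count over the sum of the two group sizes, and the shared count over the distinct drugs across both groups (the Jaccard index, which reaches 100% for identical groups). With the groups as listed these give 2/7 and 2/5 respectively, while the 25% quoted in the example corresponds to a denominator of 8.

```python
def shared_over_total(group_a, group_b):
    """Literal reading: shared drugs / (size of group A + size of group B), in percent."""
    a, b = set(group_a), set(group_b)
    total = len(a) + len(b)
    return 100.0 * len(a & b) / total if total else 0.0


def jaccard_percent(group_a, group_b):
    """Alternative reading: shared drugs / distinct drugs across both groups, in percent."""
    a, b = set(group_a), set(group_b)
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


tubb = ["DrugA", "DrugB", "DrugC", "DrugD"]
hdac6 = ["DrugE", "DrugA", "DrugB"]

literal = shared_over_total(tubb, hdac6)  # 2 shared over 4 + 3 listed drugs
jaccard = jaccard_percent(tubb, hdac6)    # 2 shared over 5 distinct drugs
print(round(literal, 1), round(jaccard, 1))
```

Which denominator the experiment actually used should be checked against the original data before comparing scores.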


The following picture of the actual experiment explains how to calculate the highest level of drug similarity (similarity = 100%), how to calculate similarity below 100%, and how to apply the ROC curve.


Figure 33

The following pictures explain how to calculate the highest level of drug similarity (similarity = 100%) using the Drug repositioning Network System.


Figure 34
Figure 35

Other factors considered to strengthen drug repositioning network system

Many approaches can be used to define drug similarity, and they are considered as future developments for the Drug repositioning network system.


Chemical structure similarity [2]


In the context of drug repositioning, one can, for instance, search only among approved compounds. This approach was used successfully by implementing an unsupervised machine learning algorithm to cluster chemicals based on their structure.


Figure 36

Drug repositioning using the chemical structure. Compounds with similar structures have similar biological activities. Drug A shares some similarity with molecule B, indicated by the blue areas. This observation leads to the conclusion that molecule A could be active on the canonical target of molecule B, and indicated accordingly.


Gene expression and functional genomics similarity [2]


Certain genes are over- or under-expressed, which is identifiable from the relative number of their messenger RNA (mRNA) molecules transcribed. Messenger RNA expression can reflect the activity of a drug, but it can also be used to characterize disease states.


This type of experiment is usually performed on a microarray.


Figure 37

Drug repositioning using gene expression. (A) Example of a result obtained from a gene expression experiment. Some of the probed genes are up-regulated (green), some of them down-regulated (red). (B) and (C) The gene expression data from the Connectivity Map provide a signature that can relate drugs by their functional aspect. For instance, drugs X and Y are considered similar because they share a significant number of up- and down-regulated genes. (D) An analogous reasoning can be made for the drug-disease relation: a disease signature can be treated by drugs with an anti-correlating signature.
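The idea of an anti-correlating signature in (D) can be illustrated with a plain Pearson correlation over expression changes for the same ordered set of genes. This is a simplified stand-in chosen for illustration. It is not the scoring method of the Connectivity Map, and all values below are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of expression changes."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


# Invented expression changes for four genes (positive = up, negative = down).
disease_signature = [2.0, 1.5, -1.0, -2.0]
drug_x_signature = [-1.8, -1.2, 0.9, 2.1]  # reverses the disease pattern
drug_y_signature = [1.9, 1.4, -1.1, -1.7]  # mimics the disease pattern

score_x = pearson(disease_signature, drug_x_signature)
score_y = pearson(disease_signature, drug_y_signature)
print(score_x < 0, score_y > 0)
```

Under this reading, a strongly negative score marks a drug as a repositioning candidate for the disease, while a positive score does not.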


Protein structure and molecular docking similarity


A series of recent studies focused on binding sites and compared their relative similarities, which is the approach we are going to use in our research.


The following picture shows drug repositioning using protein structure and binding site.


It is assumed that similar binding sites can bind the same ligand. For instance, knowing that protein X has a similar binding site to the one in protein Y, and that molecule Z binds to protein X, one can put forward the hypothesis that molecule Z should bind to protein Y too. Illustrations from the Protein Data Bank.


Figure 38

Phenotype and side effect-based similarity [2]


A phenotype is the set of characteristics or traits attributed to an organism. Examples of phenotypes are morphological, developmental, biochemical or physiological properties.


Drug repositioning using phenotype information. The following picture is an example using reported side effects: the more side effects two drugs share, the more similar the two drugs are. The similarity can be used to derive either potential off-targets or new indications [2,17].


Figure 39

Genetic variation-based


Genetic variations can also provide valuable insights regarding drug repositioning opportunities.


In the context of DNA sequencing methods and analysis pipelines, a genome-wide association study (GWAS) isolates common mutations in the DNA that are significantly associated with a phenotypic trait. A GWAS is used to relate a single-nucleotide polymorphism (SNP) to a disease.


The data about SNPs and their association with pathologies are indexed in databases, such as the one provided by the National Human Genome Research Institute (http://www.genome.gov/gwastudies/).


The basic idea is to use a GWAS database to find a new indication for a protein target. The association between a SNP and a trait from a GWAS gives the relation between a gene and a disease. Then, knowing that a drug targets the given gene product, the drug can be proposed for that disease.
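The GWAS-based reasoning amounts to joining two relations on the gene: SNP-to-trait associations and drug-to-target links. The sketch below uses the HMGCR, statin and LDL cholesterol example mentioned in this section. The SNP identifier is a placeholder, and the record layout and function name are assumptions.

```python
# (snp, gene, trait) records as they might be read from a GWAS catalogue.
# "rs_placeholder" is not a real identifier.
gwas_records = [("rs_placeholder", "HMGCR", "LDL cholesterol")]

# Drug (or drug class) -> genes whose products it targets.
drug_targets = {"statins": {"HMGCR"}}


def gwas_candidates(records, drug_targets):
    """Pair each drug with the traits associated with SNPs in its target genes."""
    candidates = set()
    for _snp, gene, trait in records:
        for drug, targets in drug_targets.items():
            if gene in targets:
                candidates.add((drug, trait))
    return candidates


candidates = gwas_candidates(gwas_records, drug_targets)
print(candidates)
```

When the resulting trait matches the known indication, the association confirms it. When the trait differs, it suggests a possible new indication.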


The following picture illustrates drug repositioning using genetic variation.


On side (A)


A single-nucleotide polymorphism (SNP) is associated with a phenotypic trait, here LDL cholesterol.


The gene where the SNP is found (HMGCR) encodes a protein targeted by statins (a drug class). Statins are indicated as cholesterol-lowering agents, which is confirmed by the trait associated with the SNP [2].


On side (B)


Sometimes the trait associated with the SNP diverges from the indication of the drug, as shown in the diagram (post-traumatic stress disorder against smoking cessation).


Figure 40

Disease network-based [2]


Diseases have been grouped together based on their cause, such as the pathology or infection, or on the biological dysfunction observed.


Similar diseases are treated in a similar fashion.


The relation holding between pathologies can generate drug repositioning hypotheses.


The following picture shows drug repositioning using disease relationships.


(A) The similarity calculated by looking at disease relationships (diseasome).


(B) The similarity calculated by looking at the commonly shared pathways.


(C) The similarity calculated by looking at the shared drugs used for the treatment of these diseases [2].


Figure 41

Machine learning and concepts combination [2]


The approach is to train a machine-learning algorithm and then generate predictions from the statistical model.


First a series of biomedical heuristics is defined, then the model is trained on known data and predictions are made.


In the following picture, a machine learning algorithm is trained over a series of features, such as chemical similarity, shared target proteins, etc.


After evaluation of the model, some repositioning predictions can be generated from the statistical learning.


Figure 42

Two recent studies address drug repositioning with machine learning.


The first method, called PREDICT, uses drug-drug and disease-disease associations separately.


The second method uses a Support Vector Machine (SVM) with two kinds of structural similarity, protein-protein interaction network distance and gene expression [2].


References

  1. Barratt MJ, Frail DE (2012) Drug repositioning: Bringing new life to shelved assets and existing drugs.
  2. Samuel Croset (2014) “Drug repositioning and indication discovery using description logics”.
  3. http://www.cresset-group.com/2012/12/new-from-old/
  4. Melissa Burke, Laura Huerta “Functional genomics (I): Introduction and designing experiments. Functional Genomics Case Studies/ Drug Discovery”
  5. Trent D. Stephens, Rock Brynner (2001) “Dark Remedy: The Impact of Thalidomide and Its Revival as a Vital Medicine”. Basic Books.
  6. https://en.wikipedia.org/wiki/Drug_repositioning
  7. De Las Rivas J, Fontanillo C (2010) “Protein–Protein Interactions Essentials: Key Concepts to Building and Analyzing Interactome Networks”. PLoS Comput Biol 6: e1000807.
  8. Dmitry Zinoviev (2016) “Data Science Essentials in Python: Collect - Organize - Explore - Predict - Value (The Pragmatic Programmers)”. https://www.amazon.com/Data-Science-Essentials-Python-Programmers/dp/1680501844
  9. Maksim Tsvetovat, Alexander Kouznetsov (2012) “Social Network Analysis for Startups: Finding Connections on the Social Web”.
  10. Duncan J Watts, Steven H Strogatz (1998) “Collective dynamics of ‘small-world’ networks”. NATURE 393: 440-442.
  11. Albert-László Barabási, Réka Albert (1999) “Emergence of Scaling in Random Networks”. Science 286: 509-512.
  12. Albert-László Barabási, Natali Gulbahce, Joseph Loscalzo (2011) “Network Medicine: A Network-based Approach to Human Disease”. Nat Rev Genet 12: 56-68.
  13. © A. Paccanaro, RHUL 2016, Network Science, biological networks
  14. https://www.drugbank.ca
  15. Prasad, T. S. K. et al. (2009) Human Protein Reference Database - 2009 Update. Nucleic Acids Research 37: D767-772.
  16. Janet Piñero, Àlex Bravo, Núria Queralt-Rosinach, Alba Gutiérrez- Sacristán, Jordi Deu-Pons, et al. (2016) DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research 45: D833-D839.
  17. Janet Piñero, Núria Queralt-Rosinach, Àlex Bravo, Jordi Deu-Pons, Anna Bauer-Mehren, et al. (2015) DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015: bav028.