1 Introduction

Target identification and validation is the top priority in drug discovery [1]. Molecules or drugs that interact with a rational target or selected combinations of targets have improved odds of therapeutic success. An analysis of AstraZeneca’s drug research and development programs showed that 82% of program terminations in preclinical studies were due to safety issues, of which 25% were target-related [2]. Meanwhile, 48% of safety failures in clinical trials are target-related. Therefore, guidance on the appropriate selection of candidate targets can help improve the success rate and portfolio value of drug discovery projects while also reducing time and cost [3].

Traditionally, target discovery has relied on wet experiments, a process that is time-consuming, expensive, and low in accuracy. With the development of bioinformatics, chemical informatics, and omics, computer-aided therapeutic target discovery methods or in silico methods have come to the fore [4,5,6]. By integrating big data with computational methods, computer-aided therapeutic target discovery greatly reduces the scope of experimental targets, shortens the drug discovery and development cycle, and reduces the experimental cost. At present, the two main categories of in silico methods for potential therapeutic target identification are comparative genomics [7] and network-based methods [8]. One of many important characteristics differentiating these methods is that comparative genomics is mostly used in infectious diseases, whereas network-based methods can be used not only in infectious diseases, but also in non-infectious diseases. Nonetheless, these categories of methods often complement each other in their advantages and disadvantages.

With the completely sequenced human genome, in addition to the completed genome sequences of many model organisms, there are increasing research-focused efforts to understand the function of a genome and molecular evolution. Finding potential therapeutic targets among cellular functions based on understanding their related biological processes in pathogens and their hosts has become imperative as antimicrobial resistance continues to spread rapidly. To identify therapeutic targets, comparative genomics combines the information contained in genome database resources and software to reveal fatal weaknesses of pathogens that affect their growth and reproduction in the host, such as genes essential for the survival, growth, and important functions of pathogens [9]. In addition, comparative genomics can also filter out homologs by comparing genomes of pathogens and hosts, avoiding the toxic and side-effects of newly designed drugs on the host, in turn, increasing the success rate of drug design [9].

With many pathogenic variants associated with disease in non-coding regions or difficult to target genes, the number of associations that are candidates for development into drugs is limited. Approaches that combine data from pathway databases or biological networks can broaden the number of potential targets to increase the number of associations that lead to effective treatments. As such, network-based strategies are among the state-of-the-art computation models for target identification and are also an important bridge connecting network pharmacology [10], network medicine [11], network biology [12], systems biology [13], and multi-omics data. By combining pathway analysis and the network graph theory concept, network-based strategies not only focus on the interactions (edges) between individual molecules (nodes) and coordinated pathways but also enable a systematic visual exploration of the biological (or biomedical) networks to identify the components of functional importance in the network. In this regard, network-based methods are invaluable in identifying biomarkers, discovering disease diagnosis targets, and finding potential therapeutic targets [14]. The main concept of network-based methods is to map all the relevant data to a visual network. Highly connected nodes (central nodes) that act as bridges between consecutive network components in a single network are predicted as essential proteins or genes of the pathogen (or biological process) and shown to be related to the modular structure of the physical and functional interaction network. Such nodes are hypothesized to be ideal therapeutic targets in the network because they maintain the network integrity [8]. Meanwhile, by searching for highly differential nodes in different networks, those nodes that specifically exist in disease cells can also be hypothesized as potential therapeutic targets [15].

Here, we provide a detailed review of the rationales of comparative genomics and network-based methods for the in silico identification of potential therapeutic targets (Fig. 1). We describe the commonly used databases, software, and applications and discuss these methods in the context of their advantages and disadvantages, contrasts and similarities, comparison with related target identification methods, and relevant published reviews and prospective studies. The information provided in this review will help readers and researchers quickly understand the rationales of in silico therapeutic target identification methods that could further advance research in this area.

Fig. 1
figure 1

Simplified workflow of in silico methods for identification of potential therapeutic targets

2 In silico Comparative Genomics and Network-Based Methods

2.1 Comparative Genomics Methods

In the past two decades, whole-cell screening (including large numbers of genetic screening) and in vitro screening of synthetic libraries have been used to identify novel lead compounds with powerful antimicrobial properties [16]. With the completely sequenced human genome, in addition to the completed genome sequences of numerous bacteria and fungi, the number of genes has been rapidly growing. Probing and comparing sequence characteristics between and within species have become a part of most biological queries [17]. Comparative genomics [7] and the recently emerged subtractive genomics (described later in Sect. 6.5) [18] are useful tools for the identification of potential therapeutic targets, such as conserved genes [17] and putative essential genes [9] that affect cell viability in pathogens. Comparative genomics approaches are based on the hypothesis that potential targets are critical in the survival of pathogens and constitute a key component of their metabolic pathways [19]. Moreover, to eliminate deleterious host responses, the target should have no conserved homolog in the human host [20]. Spaltmann et al. proposed two criteria for a gene to be considered a therapeutic target. First, the gene must be necessary for the survival and growth of the pathogen, thereby improving the therapeutic effect of the drug acting on the target. Second, the gene should exist in pathogens but not in mammals; in this way, the drug would have the potential to become a broad-spectrum antimicrobial agent [21]. A gene that meets these criteria can be found using a comparative genomics approach.

In Fig. 2, we have summarized the three main steps involved in comparative genomics-based identification of therapeutic targets [22]. The first step is the collection of metabolic pathway enzymes or essential genes of pathogens. It involves obtaining all the metabolic pathways that exist both in the host and pathogen from the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway Database [23]. Then, all pathogen pathways are compared with host pathways to determine any overlap [22]. Next, the metabolic pathways are classified. Pathways existing in both the pathogen and the host are removed and named shared pathways, while those existing in the pathogen but not in the host are pooled and named unique pathways [19]. Finally, the gene names and identification of all involved enzymes in the shared and unique pathways are identified and collected from the KEGG Genes Database [22].

Fig. 2
figure 2

The workflow of identifying potential therapeutic targets by comparative genomics

Step two is the retrieval analysis of the protein sequences and the use of the basic local alignment search tool (BLAST). First, the protein sequences of all enzymes involved in unique pathways are retrieved from the Universal Protein Resource (UniProt) database [24] in FASTA format. Then, each protein sequence is submitted to a BLASTp analysis (a protein–protein analysis that compares an amino acid sequence against a protein sequence database; discussed in further detail in Sect. 4.1) against the sequences of enzymes in the host metabolic pathways at a set E-value cutoff, the threshold to define a BLAST “hit.” BLAST results with no hits with host enzymes are identified as non-homologous enzymes of the pathogen [9].

The third and final step in the comparative genomics-based identification of therapeutic targets is the identification of essential non-homologous enzymes in the pathogen. To achieve this, the BLASTp analysis is carried out in the database of essential genes (DEG). The protein sequences with significant homology in the DEG database are described as protein sequences vital to the pathogen's survival [18].

Therapeutic targets identified by comparative genomics methods have two essential characteristics. One, the selected targets have significant impacts on some important physiological functions of the pathogen, ensuring the effectiveness of the newly designed drug. Two, by comparing the protein sequences between potential therapeutic targets and the host to identify whether there is homology, any toxic side effects on the human body when the drug interacts with the target can be avoided, in turn, improving the safety of the pharmacological effects of new drugs [20].

2.2 Network-Based Methods

The reason the network-based method can be used for therapeutic target identification is based on the assumption that the influence of specific locations in a biological network can spread along the edges (interactions) of the network [11]. The rationales of network-based methods for predicting therapeutic targets are centrality and differentia. Centrality refers to the analysis of network topological parameters when building a single network. A node in a more central position indicates that it plays a more integral in the network. For example, it may be an essential protein for pathogen survival and thus identified as a potential therapeutic target [8]. However, centrality sometimes cannot be applied directly to normal human protein networks because of the toxicity of acting on such critical nodes [10, 25]. To solve this problem, the direct screening and elimination process of homologous proteins involved in metabolism can be complemented with differential network analysis in which two or more networks are compared, such as normal cell and disease (mostly cancer) cell networks, different subtype networks of cancer, and tissue-specific networks. In this way, the node sets specific to disease cells or highly differential between networks are obtained and identified as potential therapeutic targets [26]. Differential network analysis can also screen out targets that exist in disease cells but not in normal cells or targets connected differentially in different networks to make the identified targets more selective, thereby improving therapeutic security. The highly differential nodes obtained in this way can be further analyzed using network topology to obtain highly centralized nodes that have been double-screened, increasing the reliability of the identified nodes [27].

According to the rationales of centrality and differentia, network-based methods can be divided into two approaches: the centrality-based approach and the differentia-based approach (Fig. 3). The first step in both approaches is network construction. Network construction refers to obtaining a large number of relevant data sets through data mining [28] or from various databases, websites, and experimental data and carrying out attribute mapping through network visualization tools, namely comprehensive data visualization [29]. Some types of constructed networks are protein–protein interaction (PPI) networks [30], gene interaction networks [31], and miRNA–mRNA interaction networks [32]. After the network is built, the processes of the two approaches diverge.

Fig. 3
figure 3

Simplified rationale and approaches of network-based methods for potential therapeutic target identification

The centrality-based approach uses some network analysis tools to (i) analyze the topological parameters of nodes in networks and (ii) select nodes with high degree centrality (hub nodes) and high betweenness centrality (bottlenecks), which are often integral in networks and thus can be selected as potential therapeutic targets [33]. The degree centrality of a node refers to the number of direct connections the node has with other nodes in the network [11], while the betweenness centrality of a node refers to the number of shortest paths that pass through the node in the network [34]. The centrality-based approach is most suitable for rapidly growing cells, such as pathogens and cancer cells [8]. In addition to the widely-used degree centrality and betweenness centrality, other parameters, such as closeness centrality, clustering coefficient, average shortest path, eigenvector centricity, and spectral gap centricity, can also be used as centrality indices to predict the importance of nodes, and thus to identify potential therapeutic targets [35, 36]. For further understanding of the definitions of the parameters mentioned above, two references are recommended [35, 36].

As mentioned above, differential network analysis requires the construction of two or more networks, including normal and disease cell networks [30] or networks of different subtypes of cancer [37]. After the construction of networks is completed, some algorithms can be applied to identify differential components between networks, to select nodes that exist in disease cell networks but not in normal cell networks, or to select nodes that are highly differentially connected between or among networks, as predicted potential therapeutic targets [26, 38].

Potential targets identified through centrality and differentia can be further prioritized by observing the lethality of the network when those nodes are removed [39]. Generally, network lethality after removal of a node is positively correlated with the connectivity of the node. When nodes with high degree centrality are deleted, the network diameter will increase rapidly [40]. When nodes with high betweenness centrality are deleted, (i) the average path length will decrease rapidly [41]; (ii) network topology, such as the characteristic path length, will change significantly; (iii) the ability of the remaining nodes to communicate with each other will be weakened, and (iv) the network will disintegrate [42]. Therefore, the more lethal the removal of a node to the network, the more important the node's role, and the greater its potential as a therapeutic target [39].

3 Databases

Data acquisition is indispensable to any research work. Therefore, we summarized the databases useful in comparative genomics and network-based methods for identifying potential therapeutic targets. Although some databases can be used for both types of in silico methods, we placed them in separate tables because the most popular features of these databases differ between the two approaches.

3.1 Comparative Genomics

The relevant databases for comparative genomics can be roughly divided into two categories: (i) general databases; those usually used in comparative genomics, such as DEG, KEGG [23], and UniProt; and (ii) specific databases, which mainly provide pathogenic gene sequences of bacteria and fungi, such as the Tuberculosis Database (TBDB), WormBase, and the Virulence Factors of Pathogenic Bacteria Database (VFDB). Table 1 lists the general and specific databases with brief descriptions, including the coverage, availability, latest update, and URL.

Table 1 General and specific databases for comparative genomics

DEG is a commonly used database in comparative genomics that contains 53,885 essential genes and 786 non-coding essential sequences critical to the survival and growth of bacteria, archaea, and eukaryotes for homology analyses [44]. DEG 15 is the most recent version of this database. It is worth noting that DEG has multiple built-in tools for data analysis and display, such as a subcellular location and distribution analysis tool, a pathway and genomics enrichment analysis tool, and a Venn maps generation tool for comparing genomes between experiments [54].

TBDB is an online platform for basic scientific research on tuberculosis and drug and vaccine discovery and development research. It contains genome sequence data and microarray and RT-PCR expression data, including over 3,000 Mycobacterium tuberculosis (Mtb) microarrays (2,700 from humans and mice and 260 for Streptomyces coelicolor) and 95 RT-PCR datasets, for numerous strains of Mtb, as well as data for more than 20 Mtb-related strains from in vitro tuberculosis-related experiments and tuberculosis-infected tissues. A wide range of tools is incorporated in the database for browsing, analyzing, searching, and downloading the data [51].

3.2 Network-Based Methods

There are many databases used in network-based methods. We roughly divided the databases into two categories: direct databases and indirect databases. Direct databases cover the interaction data and can be directly imported into network visualization software for network construction. Examples are the Search Tool for Retrieval of Interacting Genes/Proteins (STRING) [55] and the Molecular INTeraction (MINT) database [56]. Indirect databases do not directly cover interaction data but provide detailed annotation of network nodes allowing an in-depth exploration of the network. Some examples include the gene expression omnibus (GEO) [57] and DrugBank [58]. Table 2 (direct databases) and Table 3 (indirect databases) list the databases commonly used in network-based methods, with brief descriptions, including the coverage, availability, latest update, and URL.

Table 2 Frequently used direct databases in network-based methods
Table 3 Frequently-used indirect databases in network-based methods

STRING [55] is the most commonly used direct database in network-based methods. It houses a large number of known and predicted PPIs, including both physical and functional interactions. The data come from the following five main sources: genomic context analysis, high-throughput experimental data, conserved co-expression, artificial text mining, and known information in databases [55]. At the time of writing, STRING covers 24,584,628 proteins from 5090 organisms [55]. This database provides an intuitive and fast viewer for online use, supports online network visualization, and provides a user-friendly platform for data integration with knowledge from other public resources [55].

The GEO database [57] is the most commonly used indirect database in network-based methods. It is a universal public repository for archiving and freely distributing high-throughput microarray, next-generation sequencing, and other forms of high-throughput functional genomic data, with complete and clear annotations from the research community [57]. To date, the GEO database covers 162,671 series comprising 4,777,869 samples. It provides a powerful search engine for users to identify, analyze, and visualize related data of interest. It also supports sophisticated field queries, sample comparison applications, and gene expression profiles [57].

4 Software and Tools

4.1 Comparative Genomics

Table 4 lists the software and tools used in comparative genomics to identify targets. Brief descriptions, availability, the latest update, and the URL are also provided. In comparative genomics, the BLAST suite (BLASTn, BLASTp, BLASTx, tBLASTn, and tBLASTx) is widely used to analyze the functional and evolutionary relationship between nucleic acid and protein sequences [73]. BLAST is a free online tool that can also be downloaded offline from the National Center for Biotechnology Information (NCBI) website. BLASTn is for nucleic acid sequence alignment; BLASTp is for protein sequence alignment; BLASTx compares the six-frame conceptual translation products of a nucleotide query against a protein sequence database; tBLASTn compares a protein query sequence against a sequence database dynamically translated in all six reading frames, and tBLASTx compares the six-frame translation of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database [73, 74]. There are many specific search modules in NCBI besides those regular modules. For example, smartBLAST [75] can be used to query highly similar proteins, GlobalAlign module to compare two sequences in the entire sequence, CD-search [76] to find conservative domains in a sequence, and CDART to query sequences with similar conservative domain architecture [77]. Moreover, NCBI provides an independent program BLAST + for users that dramatically accelerates the speed of long sequences query and chromosome length databases query to address the problem of slow-speed BLAST online comparison [78]. Recently, Du et al. designed a cross-platform local BLAST visualization software developed in Python using the in-built graphical user interface (GUI) module TKinter [79]. BlastGUI, as it is known, utilizes BLAST + as a comparison tool to perform the local operation and sequence comparison visualization. This user-friendly tool allows users without familiarity in computational coding and basic computer skills to compare a sequence directly without additional formatting efforts [79]. BlastGUI preprocesses the input sequence, so the computational complexity of sequence comparison is low. To carry out the comparison, the user enters the file in FASTA format into the search box of BLAST. The maximum acceptable length of nucleotide and protein sequences is generally 1000–2000, and the maximum molecular weight of the protein is 10 to 100 kD. The sequence information can be obtained from NCBI free of charge. Alternatively, the NCBI BLAST uses the indirect BLAST algorithm to run a large number of BLAST searches without using a browser, and the comparison results are returned by e-mail [73].

Table 4 Software and tools of comparative genomics in the identification of potential therapeutic targets

4.2 Network-Based Methods

Table 5 provides brief descriptions, availability, latest update, and URL of software and tools for network-based methods used in previous target identification studies over the past 5 years. Among them, Cytoscape is the most widely used and representative software. Therefore, we chose it as an example for further description of the network-based methods. Cytoscape is a general-purpose platform to analyze and visualize complicated molecular interaction networks. It can be used for integrating massive molecular interaction data. Dynamic states and molecular interactions are mapped as attributes on nodes and edges, and static hierarchical data (such as protein function ontology) are supported by annotations [88]. The Cytoscape Core is the code that organizes, displays, reads, and writes networks but contains no biology-related functionality. It is equipped with basic functionality to lay out and query the network, visually integrate the network with expression profiles, phenotypes, and other molecular states, and link the network to databases of functional annotations [88]. This core functionality is extended by Cytoscape apps. Cytoscape allows users to import attributes from tables whose simplest format are tab-delimited text files containing one column of primary identifiers of network nodes and auxiliary columns of attributes needed mapping to the nodes [89]. To reduce the complexity of a large interaction network, users can create filters based on the attributes as needed and use the Cytoscape built-in function to search [89]. In addition to directly filtering nodes using the built-in topological parameters in Cytoscape, users can also use apps (formerly called plugins), such as stringApp [90], the Biological Networks Gene Oncology (BiNGO) tool [91], Molecular Complex Detection (MCODE) [92], and cytoHubba, a user-friendly interface to explore key nodes and subnetworks [93]. StringApp combines the resources of the STRING database and Cytoscape in the same workflow and facilitates the import of STRING molecular networks into Cytoscape for executing STRING analysis in the script file [90]. BiNGO provides a comprehensive set of annotation tools for Gene Ontology (GO)-level annotations of a variety of organisms. It enables the extraction of information about overexpression of a gene in biological networks and supports user-defined annotations and ontologies [91]. MCODE enables searches for densely connected regions within large PPI networks that may reflect molecular complexes. The method is based on connectivity data [92]. CytoHubba provides a one-stop calculation of 11 topological analysis methods to help users explore hub objects from complex biological networks [93]. These useful apps are freely available from the Cytoscape App Store (http://apps.cytoscape.org/).

Table 5 Software and tools of network-based methods for identification of potential therapeutic targets

5 Applications

5.1 Comparative Genomics

With the arrival of the post-genome era, target-based drug design strategy has gradually become the focus [102]. Both the improvement of the sequencing technology and the exponential explosion of the number of fully sequenced genomes has made it possible to select reasonable new therapeutic targets and vaccine candidates throughout the genome. Drug resistance is becoming increasingly widespread due to the continuous evolution of bacterial strains, such as Streptococcus pneumoniae and Mtb. Knowledge of therapeutic targets and drug candidates is useful for enhanced drug discovery and is becoming increasingly reliant on comparative genomics technology [103]. Table 6 lists recent applications of comparative genomics in finding therapeutic targets. We selected some specific examples to describe in this section.

Table 6 Examples of prediction of potential therapeutic targets by comparative genomics in recent years

Determining essential genes of pathogens is a common method to identify potential therapeutic targets. For example, Tilahun et al. [104] retrieved the protein-coding genes of Mtb from the Mtb database and identified the essential genes by a BLAST search of the retrieved protein-coding genes against DEG. Then, the corresponding protein sequences, obtained by searching in DEG, were used to perform a BLASTp search of human protein sequences to avoid host toxicity in the subsequent drug development. Finally, 572 essential genes with no homology to human genes were selected from 3958 genes of Mtb. Discovering potential therapeutic targets from the proteins encoded by essential genes can refine the search scope of therapeutic targets. The existence of homologous genes is a powerful predictor of biological importance [105] and a breakthrough in therapeutic target identification. For example, Satya et al. [48] sequenced the gene encoding 3-deoxy-D-arabinoheptulosonate-7-phosphate synthase (DAHPS) in Pseudomonas fragilis (Pf). Sequence analysis showed high homology (84%) of Pf-DAHPS with other Pseudomonas DAHPS, indicating that it was possible to design a broad-spectrum drug for the genus by targeting the DAHPS sequence. By analyzing the homology between the protein sequence encoded by DAHPS and human protein sequences, DAHPS, which does not exist in humans, was proposed to be an important potential antibacterial target. The predicted three-dimensional structure of Pseudomonas DAHPS may provide an option for reasonable drug design [48].

Comparative genomics can be used to understand the molecular mechanism of disease and predict targets for new drug design. For example, Zumla et al. [106] discovered that the sequence homology of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome with SARS-CoV and Middle East respiratory syndrome coronavirus (MERS-CoV) was about 82%, and the homology of structural proteins was over 90%. The high sequence homology revealed their common pathogenic mechanism. Therefore, the authors of the study designed and developed direct-acting antiviral drugs that target highly conserved enzymes in SARS-CoV-2, such as the main protease (MPRO) or 3C-like protease (3CLpro), the papain-like protease (PLpro), non-structural protein 12 (Nsp12), and RNA-dependent RNA polymerase (RdRP). Among them, ganciclovir and maraviroc, the drugs against MPRO, were considered effective for the treatment of coronavirus disease 2019 (COVID-19) [107].

Comparative genomics is used to find potential therapeutic targets for the development of human drugs and animal drugs. Damte et al. [108] selected five unique pathways of Mycoplasma hyopneumoniae strains in KEGG. They then used BLASTp in NCBI to compare the only two protein sequences in the unique pathways with the porcine protein sequences. It was found that the two protein sequences in the unique pathways were not homologous to the porcine protein sequences. Therefore, those essential proteins, which exist in M. hyopneumoniae but not in the host (pig), may be useful in drug design and vaccine production against M. hyopneumoniae. For more examples of comparative genomics used to identify potential targets, readers can refer to the list of references provided in Table 6.

5.2 Network-Based Methods

Different types of biological networks can be used to predict potential therapeutic targets by network-based methods, such as PPI networks, gene interaction networks and miRNA–mRNA interaction networks. Table 7 lists almost all applications since 2015 of network-based methods to predict potential therapeutic targets, including the databases, software and tools, network types, related pathogens/diseases/processes, and the identified targets. Some of the targets in Table 7 have been verified or used for drug design. Here, we select several examples of previous studies that have used different network types for further description.

Table 7 Applications of network-based methods for potential therapeutic target identification

PPI networks are the most widely used molecular networks in target discovery. For example, Huo et al. predicted proteins FGG, SLC9A3, MAPK14, FGF1, FGB, F13A1, and CASR as potential therapeutic targets for the treatment of coronary heart disease (CHD) by combining the centrality-based and differentia-based approaches [30]. They extracted PPIs related to Danshensu (one of the main active ingredients of Salvia miltiorrhiza, known as Danshen) from the STRING database, then integrated the data with the CHD gene expression profile and microarray data obtained from the GEO database to construct a non-CHD state co-expression protein interaction network (CePIN) and a CHD state CePIN on Cytoscape [30]. The non-CHD network contained 91 nodes and 98 edges, and the CHD state CePIN contained 99 nodes and 110 edges [30]. Then, topological analysis and network comparison were performed along with the calculation of network connectivity after the removal of candidate nodes. Finally, two bottleneck proteins, FGG and SLC9A3, existing only in the CHD state CePIN, were selected as the targets of Danshensu in the treatment of CHD and as the potential targets for new drug design [30]. In addition, MAPK14, FGF1, FGB, F13A1, and CASR, obtained through the differentia-based approach, also represented potential therapeutic targets for the treatment of CHD and had been confirmed to be related to CHD to some extent [30].

There are also examples of the use of the centrality based approach alone to identify potential therapeutic targets. For example, Moon et al. generated a list of 1089 differentially expressed genes from patients with diffuse systemic sclerosis by a literature search in Google Scholar and PubMed using specific keywords [125]. Then, using the centrality-based approach to build a PPI network, they identified 1068 interactions of those 1089 genes. Finally, a network centrality analysis identified four hub genes (CTGF, HCK, LYN, PDGFRB) as potential therapeutic targets [125]. In another example, Fathima et al. used non-apoptotic cell death genes of colon adenocarcinoma (COAD), glioblastoma multiforme (GBM), and small cell lung cancer (SCLC) screened from their transcriptome profiles to build three PPI networks [133]. Through centrality analysis, 4 of the top 10 hub proteins, which were not found or only found in one target database, were considered as novel valid therapeutic targets (FANCD2 and NCOA4 for COAD, IKBKB for GBM, and RHOA for GBM and SCLC) [133].

As mentioned above, PPI networks, gene interaction networks, and miRNA–mRNA interaction networks) have applications in predicting potential therapeutic targets. For example, Miryala et al. [31] identified 337 functional interactions of 60 antimicrobial resistance genes of Pseudomonas aeruginosa PA01 from the PathoSystems Resource Integration Center (PATRIC) tool, The Antibiotic Resistance Genes Database (ARDB) [142], the comprehensive antibiotic resistance database (CARD), the National database of antibiotic-resistant organisms (NDARO), and the STRING database. By constructing and analyzing the gene interaction network in Cytoscape, nine hub genes were obtained as potential therapeutic targets for new drug development [31]. Xue et al. [32] constructed a miRNA–mRNA interaction network using miRNA and mRNA expression data, and the clinical data of three cancer types downloaded from The cancer genome atlas (TCGA) database [143]. The top 20 miRNAs with the highest degree in each data set were annotated via miRCancer (a microRNA–cancer association database) [144], miR2Disease (a microRNA–disease database) [145], and the Human microRNA Disease Database (HMDD) [146]. After mapping the genes predicted as the targets of more than three miRNAs in the subnetworks to the human protein atlas database (HPAD) [147], eight genes (ASPG, AQP2, CNOT8, CTPS1, IFNAR2, MOCS2, PRSS37, and VCP) were finally identified as potential therapeutic targets [32].

5.3 Comprehensive Applications

In addition to using comparative genomics and network-based methods independently, they can also be combined for target identification. Table 8 lists recent applications of the combined methods for potential therapeutic target identification. We chose three of them as representatives for further description. Nayak et al. screened putative targets for pathogens causing bacterial pneumonia. By bit score, E-value threshold, and sequence length screening of the complete proteome of 13 pathogenic bacterial strains using comparative genomics, 74 proteins non-homologous to human and intestinal flora were identified [103]. An interaction network for the 74 proteins was constructed in Cytoscape, and 12 built-in central parameters of cytoHubba were used to prioritize the nodes, culminating in the identification of 20 genes as hub nodes. Among the 20 genes, 10 have been reported or confirmed as drug targets, and the remaining 10 were considered new potential therapeutic targets for the treatment of bacterial pneumonia [103]. Melak and Gakkhar used BLAST to perform comparative analysis for the H37RV protein-coding genes obtained from the TBDB against DEG and identified 572 essential genes non-homologous with humans [104]. Then, they prioritized the resulting proteins based on centrality measurement in the PPI network, resulting in the identification of 137 central proteins. Combining flux balance analysis of the reactome and structural assessment of targetability, secY (Rv0732), katG (Rv1908c), gltB (Rv3859c), and sirA (Rv2391) were identified as potential therapeutic targets against Mtb H37RV [104]. Gupta et al. [148] performed subtractive genomic and comparative genomics of 16 pathogenic Leptospira strains retrieved from NCBI against DEG and the Cluster of Essential Genes (CEG) [149] using the Cluster Database at High Identity with Tolerance (CD-Hit) and BLASTp to identify 34 common genes. After analyzing and comparing two extended PPI networks of two strains and multiple sequence alignment, eight proteins (lpxB, lpxK, kdtA, fliN, cobA, metX, thiL, and ubiA) were identified as putative therapeutic targets for drug design or vaccine development [148].

Table 8 Applications of comprehensive methods for potential therapeutic target identification

6 Discussion

6.1 Advantages and Prospects of the Two Categories of In silico Methods for Target Identification

Current trends in drug discovery focus on understanding disease mechanisms, followed by target identification and lead compound discovery [5]. Compared with wet experimental methods, in silico methods provide the technology to systematically explore all possible interactions and illuminate the pharmacological patterns [153]. Reliable target identification methods used in conjunction with drug discovery approaches will improve the efficiency of computer-aided drug discovery [5]. Here, we discuss the advantages and prospects of comparative genomics and network-based methods for identifying potential therapeutic targets.

One advantage of comparative genomics is that the definition of essential genes and unique metabolic pathways not only represents the essential issues of biology but is also of great significance in practical applications [111]. Furthermore, with the establishment of Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and exome sequencing technology, the number of sequenced human essential genes has increased remarkably [54]. In addition, with the development of bioinformatics and computer science, algorithms have been continuously optimized, generating convenient analysis tools for scientific researchers, and enhancing the potential for comparative genomics in potential therapeutic target identification.

Network-based methods have the advantage of generating visual interactive networks through given databases and are not limited by the lack of quantitative mechanical data [154]. Furthermore, network-based methods do not depend on negative samples and the three-dimensional structure of targets [155], which is time-efficient in the early work of target research. There is also promise that network-based methods will predict more than one target with simultaneous actions, such as a pair of essential proteins [154]. Moreover, network-based methods may be beneficial in identifying candidate multi-target sets in the development of multi-target drugs [156]. Compared with traditional wet experimental methods, which always limit cellular processes to a single component or signaling pathway, network-based methods can be used to identify potential therapeutic targets systematically [15].

6.2 Disadvantages/Limitations of the Two Categories of In silico Methods for Target Identification and Potential Solutions

Although comparative genomics and network-based methods have unique advantages and promising prospects to identify potential therapeutic targets, there are still some drawbacks. For comparative genomics, although this approach is commonly used in the development of drugs against drug-resistant bacteria, the failure rate of old antibiotics is much faster than the development of new antibiotics. Moreover, antibiotics are short-term therapies for the treatment of infections. Additionally, their value is considerably less than the drugs for chronic diseases, so the use of comparative genomics in the development of antibiotics is a long-debated topic [157]. Another issue is that although comparative genomics can reduce the number of experimental targets, making some attractive proteins become potential therapeutic targets, the range of potential targets screened by this method is still very wide and is limited by time and cost. It seems that most of these potential targets screened by comparative genomics will not be used for experimental validation. Therefore, it may be profitable to combine comparative genomics with network-based methods to narrow the scope of experimental targets further and reduce the time and material resources, thereby saving costs in the early stage of drug research and development.

Network-based methods are highly dependent on the accuracy of the source data, potentially requiring a great deal of labor to ensure its accuracy [158]. A promising direction to resolve this problem will be integrating different types and complementary data in the future [6]. Other drawbacks of network-based methods are that they cannot predict proteins or genes without interaction data, and the interactions cannot be quantified [155]. Improved network construction and analysis algorithms or mathematical modeling methods [159] may be required to overcome these issues.

6.3 Comparison and Contrast of the Two Categories of In silico Methods for Target identification

Comparative genomics and network-based methods have unique advantages and disadvantages in predicting targets. Comparative genomics almost exclusively searches within the range of pathogen-associated sequences, limiting the scope to the proteomes closely related to the pathogen. Conversely, network-based methods can be used in pathogens and construct a network for human disease-related proteins or genes. In contrast to comparative genomics, network-based methods can connect long-distance relationships through interactions [160], permitting research into the interplay of evolutionary drivers on a larger scale. Conversely, comparative genomics is usually superior to network-based methods in accuracy because comparative genomics directly compares sequences, which are always constant and almost have no deviation. However, there may be false positives and false negatives in the interaction data used in network-based methods [161], and the interactions are only qualitative [160], which may lead to bias. In summary, the combined use of comparative genomics and network-based methods may be more beneficial than either method alone to improve the accuracy and efficiency in target identification.

6.4 Previous Reviews and Prospective Studies on In silico Methods for Target Identification

We have collected five reviews on in silico methods for identifying potential therapeutic targets during 2016–2020, which will be briefly discussed in this section. Sekyere and Asante [7] reviewed comparative genomic analysis trans-complementation assays in the context of antibiotic resistance research and new drug discovery by describing the emergence of several new drug resistance genes, such as lsa(C), erm(44), VCC-1, mcr-1, mcr-2, mcr-3, mcr-4, blaKLUC-3, and blaKLUC-4. For readers interested in further understanding pathogen protein targets, Saha et al. reviewed the computational work and functional prediction from PPI networks applied to different infectious diseases with Plasmodium falciparum used as an example to analyze the process of protein target identification through the host–pathogen protein interactions [162]. Katsila et al. [5] surveyed chemical informatics and network-based methods for identifying therapeutic targets and introduced some databases and network computing tools for target identification. They also appraised the process of computer-aided drug design (CADD), including ligand-based drug design and structure-based drug design [5]. Readers interested in CADD can peruse their article for further understanding. Reisdorf et al. introduced database resources for identification, prioritization, and validation of disease targets, including emerging integrated bioinformatics platforms, such as Open Targets, and public resources, such as DrugBank and ChEMBL [163]. In comparison, the database resources we described focus more on classic or commonly used databases for applications. We also recommend the review by Agamah et al. [153], which examined current in silico methods for the identification of therapeutic targets and candidate drugs, including network-based analysis approaches, data mining, reverse docking, biospectra analysis, and ligand-based in silico target prediction and compared the different approaches and propounded the benefits of hybrid approaches.

6.5 Related Methods for Target Identification

In silico subtractive genomics (first mentioned in Sect. 2.1), also known as differential proteome mining, is a comparative genomics-based method [164]. Subtractive genomics gradually subtracts proteins from the complete proteome of pathogens to find rational targets [18]. The difference between subtractive genomics and comparative genomics is in the range of application of the two methods. Subtractive genomics has been widely used for developing potential anti-pathogen infection drugs [18], whereas comparative genomics can be used not only to identify potential targets of pathogens but also to understand the molecular basis of disease [106].

For network-based methods, in addition to the centrality-based and differentia-based approaches we reviewed above, there are also studies showing the use of network influence [165], controllability [166], and topological similarity strategy [167] in target identification, but the relevant applications are much fewer. Compared with network centrality, the network influence strategy focuses on the vulnerable nodes close to the central nodes in networks. Acting on these nodes may not be fatal but can have a major impact on the central nodes, so these nodes have the potential to be therapeutic targets [165]. The controllability strategy applies structural controllability theory to determine the minimum set of driver nodes in control of the entire network and identify indispensable nodes as prime targets for disease-causing mutations, viruses, and drugs [166]. The topological similarity strategy focuses on the nodes in the network with similar topological properties to the existing drug targets, which can be potentially developed as therapeutic targets [167].

Commonly used experimental methods for potential therapeutic target identification, especially for essential genes, include single-gene knockout, antisense RNA inhibition of gene expression, large-scale transposon mutagenesis, and CRISPR/Cas9 nuclease system knockout screening.The limitations of experimental methods in identifying essential genes are listed in Table 9 [168].

Table 9 Limitations of experimental methods in identifying essential genes

Current computational studies are based on the integration of prior knowledge, the sparseness of which is still limiting the integrality and accuracy of computational prediction [169]. Data reproducibility of in silico methods is also an essential issue but might be improved by external validation and detailed reports of experimental datasets [153]. It should be emphasized that computational methods complement laboratory-based methods and that the targets identified by in silico methods need to be experimentally validated.

6.6 Review and Prospection of Deep Learning Architecture in Target Identification

Deep learning (DL), a relatively new computational technique that has become a hot research topic, has been rapidly developed and widely used to predict potential therapeutic targets. DL is a subclass of machine learning (ML) algorithms. It uses artificial neural networks with many layers of nonlinear processing units for learning data representations [170]. Therapeutic target identification based on ML or DL is usually used to predict targets of drug repositioning, which means to predict new targets for existing drugs. There are two steps in the ML method to predict therapeutic targets. First, the compounds are transformed into an effective representation, a process called input features, followed by the construction of the feature vectors as input for the ML algorithm to learn the functional relationship between the input feature and the target property [171]. Compared with ML methods, DL reconstructs the original input information into a distributed representation through neurons in the hidden layer. Another characteristic of DL models is that they can automatically learn features upon completing classification and other tasks and learn more complex features when the number of layers increases. DL architectures are well-suited for target prediction because they allow for multitask learning and automatically construct complex features, which, for target prediction, are assumed to be pharmacophore descriptors. Multitask learning has the advantage of allowing for multi-label information and can, therefore, utilize relations between targets. It also permits hidden unit representations to be shared among prediction tasks, which is particularly valuable because some targets have very few measurements available, making single-target prediction ineffective. In addition, DL can boost the performance of tasks with a few training examples. The other advantage of deep networks is that they provide hierarchical representations of a compound, where higher levels represent more complex properties [172].

Convolutional neural networks (CNNs) are a representative DL architecture in potential target prediction. CNNs contain convolutional layers, pooling layers, and fully connected layers. Convolutional layers and pooling layers are responsible for the feature extraction, and fully connected layers are used to construct the nonlinear relationship of the extracted features for obtaining the output [171]. Another DL architecture is deep neural networks (DNNs), which contain multiple hidden layers, with each layer comprising hundreds of nonlinear process units. DNNs can deal with many input features, and the neurons in different layers of a DNN can automatically extract features at different hierarchical levels [173]. The third main DL architecture is auto-encoders, which is a neural network used for unsupervised learning. Auto-encoders contain an encoder part that transforms the input information into a limited number of hidden units and then couples a decoder neural network with the output layer having the same number of nodes as the input layer [174].

Several studies have reported DL for therapeutic target prediction in recent years [175,176,177]. For example, Wang et al. [178] constructed a framework that combines a biased support vector machine and a stacked auto-encoder DL model to identify drug target proteins. The stacked auto-encoders were trained to extract properties from the original protein representations, and the biased support vector machine was used to perform the potential target identification task. The framework identified 23% of the original non-drug target proteins as possible therapeutic target proteins. Zeng et al. [179] developed a DL method, named deepDTnet, for novel target identification. A DNN algorithm was used to learn the relationships between drugs and targets. The model was used to predict the new target for topotecan (an approved topoisomerase inhibitor of human retinoic-acid-receptor-related orphan receptor-gamma t, ROR-γt). Human ROR-γt was predicted as the target, and bioassay experiments showed high inhibitory activity (IC50 = 0.43 μM) on ROR-γt. Lee et al. [180] proposed a DL model named DeepConv-DTI (deep learning with convolution on protein sequences for prediction of drug–target interaction) based on CNN for drug–target interactions prediction, which can be used for target identification. The training dataset contained 11,950 compounds, 3,675 proteins, and 32,568 drug–target interactions. The CNN model is constructed to capture local residue patterns and concatenate protein features with drug features through the fully connected layers. The hyperparameters with an external validation dataset were then optimized. The possible drug–protein interactions are output.

Although DL has advantages in recognition, classification, and feature extraction from complex and noisy data, it still has limitations. First of all, DL is a “black box,” which makes it hard to explain the prediction result and inherent principles of why the compound is effectively targeted to the predicted target. Second, it needs a large number of experimental datasets of drug–target relationships for its training. However, there is currently a lack of experimental data of drug–target relationships [181]. Consequently, there is a risk of overfitting when training the model, leading to low accuracy of the prediction result. Third, DL is usually computationally intensive, time-consuming, and often requires access to and programming knowledge for graphics processing units. DL has recently been applied successfully in therapeutic target identification. However, due to the lack of large-scale studies or experimental data and the hyperparameter selection bias that comes with the high number of potential DL architectures, DL still has scope for improvement and development in research to predict potential therapeutic targets [172, 182].

7 Conclusion

In this review, we introduced, in detail, the two categories of in silico methods for potential therapeutic target identification—comparative genomics and network-based methods—and summarized the databases and software commonly used for these approaches. We also collected and highlighted some previous applications of these methods for therapeutic target identification. Additionally, we analyzed the advantages and disadvantages of the methods and their application prospects. Finally, we accentuated the characteristics of our review in the context of previously published relevant reviews and methods. The purpose of this review was to help readers quickly understand the rationales of in silico methods for potential therapeutic target identification, and become familiar with the available tool resources and the applications of these methods, to harness the full use of the existing tools for target prediction. We strongly believe that more accurate predictions due to users’ familiarity with existing resources will increase the importance of computational methods in the identification of potential therapeutic targets for future research. In turn, the failure rate due to target problems in drug development, the input–output ratio of drug discovery, and the cost of subsequent experiments can be expected to reduce and the drug development cycle time to shorten.