Isariotin b

Formula: C21H31NO8
Molecular weight: 425.50
Smiles: C1C(C(C2C(C3C(C1(C2=O)O)O3)O)O)NC(=O)C=CCCCCCCCCC(=O)O

Names

Mycotoxin name: Isariotin b
First synonym: Isariotin b
Synonyms: isariotin B, CHEMBL253124

Identifiers / External links

PubChem CID: 44445761
ChemSpiderID: 10196
ChEMBL: CHEMBL253124

Structure

Smiles: C1C(C(C2C(C3C(C1(C2=O)O)O3)O)O)NC(=O)C=CCCCCCCCCC(=O)O
Isomeric smiles: C1[C@@H]([C@@H]([C@H]2[C@@H]([C@H]3[C@@H]([C@@]1(C2=O)O)O3)O)O)NC(=O)/C=C/CCCCCCCCC(=O)O
Inchi: InChI=1S/C21H31NO8/c23-13(9-7-5-3-1-2-4-6-8-10-14(24)25)22-12-11-21(29)19(28)15(16(12)26)17(27)18-20(21)30-18/h7,9,12,15-18,20,26-27,29H,1-6,8,10-11H2,(H,22,23)(H,24,25)/b9-7+/t12-,15-,16-,17-,18-,20-,21+/m0/s1
Inchikey: GCOQLRRFBKJAPF-DBGSMAKUSA-N
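The identifiers above can be cross-checked against each other without a cheminformatics toolkit. The following is a minimal sketch, not a full InChI parser: it splits the InChI string on "/" into its layers, reads the formula and stereo layers, and checks only the shape of the InChIKey (14, 10 and 1 uppercase letters). The helper name split_inchi is hypothetical and introduced only for illustration.

```python
import re

# Strings copied from the Structure fields above.
INCHI = (
    "InChI=1S/C21H31NO8/c23-13(9-7-5-3-1-2-4-6-8-10-14(24)25)22-12-11-21(29)"
    "19(28)15(16(12)26)17(27)18-20(21)30-18/h7,9,12,15-18,20,26-27,29H,1-6,8,"
    "10-11H2,(H,22,23)(H,24,25)/b9-7+/t12-,15-,16-,17-,18-,20-,21+/m0/s1"
)
INCHIKEY = "GCOQLRRFBKJAPF-DBGSMAKUSA-N"


def split_inchi(inchi):
    """Split an InChI string into a dict of layers (hypothetical helper).

    The first layer after the version prefix is the molecular formula;
    every later layer is keyed by its one-letter prefix (c, h, b, t, m, s).
    """
    prefix, *layers = inchi.split("/")
    result = {"version": prefix.split("=", 1)[1], "formula": layers[0]}
    for layer in layers[1:]:
        result[layer[0]] = layer[1:]
    return result


layers = split_inchi(INCHI)
# Each comma-separated entry of the "t" layer is one tetrahedral stereocentre.
stereocentres = layers["t"].split(",")
# Shape check only: this does not verify that the key matches the InChI.
inchikey_ok = re.fullmatch(r"[A-Z]{14}-[A-Z]{10}-[A-Z]", INCHIKEY) is not None

print(layers["formula"], layers["b"], len(stereocentres), inchikey_ok)
```

The formula layer agrees with the Formula field, the "b" layer carries the E double bond written as /C=C/ in the isomeric SMILES, and the seven entries of the "t" layer correspond to the seven @-marked atoms in that SMILES.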

Physico-chemical properties

Formula: C21H31NO8
Molecular weight: 425.50
Monoisotopic mass: 425.20496695
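The two masses listed above follow directly from the formula. The sketch below assumes standard atomic weights rounded to three decimals and commonly tabulated masses of the most abundant isotopes; these constants are not taken from this page, so small differences in the last digits are expected. The function names are illustrative only.

```python
import re

# Assumed constants (not from this page): rounded standard atomic weights
# and masses of the most abundant isotope of each element, in g/mol (u).
AVERAGE_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
}


def parse_formula(formula):
    """Return element counts for a simple formula such as 'C21H31NO8'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def formula_mass(formula, table):
    """Sum element masses from `table`, weighted by the atom counts."""
    return sum(table[element] * n for element, n in parse_formula(formula).items())


average_mass = formula_mass("C21H31NO8", AVERAGE_MASS)
monoisotopic_mass = formula_mass("C21H31NO8", MONOISOTOPIC_MASS)

print(f"average mass:      {average_mass:.3f}")
print(f"monoisotopic mass: {monoisotopic_mass:.5f}")
```

With these constants the average mass comes out close to the MW values reported by SWISSADME and PKCSM in the table below, and the monoisotopic mass agrees with the listed value to within rounding.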

Endpoint	Tool	QSAR ID	Value	Unit	Comments	Reference
LogP	VEGA	MLogP	-0.27	Log(mol.L)	LogP model (MLogP)-prediction	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
nRing	ADMETLAB2		3.0			doi: 10.1093/nar/gkab255
LogP	VEGA	ALogP	0.3	Log(mol.L)	LogP model (ALogP)-prediction	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
MaxRing	ADMETLAB2		9.0			doi: 10.1093/nar/gkab255
LogP	PKCSM	iLOGP	0.0557	Log(mol.L)	LOGP	doi: 10.1021/acs.jmedchem.5b00104
nHet	ADMETLAB2		9.0			doi: 10.1093/nar/gkab255
LogP	SWISSADME	iLOGP	1.9	Log(mol.L)
fChar	ADMETLAB2		0.0			doi: 10.1093/nar/gkab255
LogP	SWISSADME	XLOGP3	0.3	Log(mol.L)
nRig	ADMETLAB2		16.0			doi: 10.1093/nar/gkab255
LogP	SWISSADME	WLOGP	0.06	Log(mol.L)
MW	ADMETLAB2		425.2			doi: 10.1093/nar/gkab255
nHAt	SWISSADME		30.0
LogP	SWISSADME	MLOGP	-0.66	Log(mol.L)
MW	SWISSADME		425.47
Flex	ADMETLAB2		0.75			doi: 10.1093/nar/gkab255
LogP	SWISSADME	Silicos-IT Log P	1.49	Log(mol.L)
MW	PKCSM		425.478			doi: 10.1021/acs.jmedchem.5b00104
nStereo	ADMETLAB2		7.0			doi: 10.1093/nar/gkab255
LogS	ADMETLAB2		-2.046	Log(mol.L)	logS: The logarithm of aqueous solubility value.: The prediction is based on Multi-task Graph Attention (MGA) framework MGA is composed of input, Relation graph convolution network (RGCN) layers, attention layer and fully-connected (FC) layers. In the Input, a node represents the information of an atom, and after passing RGCN layers, the node represents general features of circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in the tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For those endpoints predicted by the regression models concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform, and different performance measures for the training test regression model: R-square (R2) of 0.967, mean absolute error (MAE) of 0.399, and root mean squared error (RMSE) of 0.287. The logarithm of aqueous solubility value. The first step in the drug absorption process is the disintegration of the tablet or capsule, followed by the dissolution of the active drug. Low solubility is detrimental to good and complete oral absorption, and early measurement of this property is of great importance in drug discovery. This model is based on 4797 Total molecules, with 3836 in the training set, 480 test set, and 480 validation set drug like molecules. 
How to interpret: The predicted solubility of a compound is given as the logarithm of the molar concentration (log mol/L). Compounds in the range from -4 to 0.5 log mol/L will be considered proper.	doi: 10.1093/nar/gkab255
RatioCsp3	SWISSADME		0.76
LogD7.4	ADMETLAB2		0.57	Log(mol.L)		doi: 10.1093/nar/gkab255
LogS	ADMETSAR	ESOL	-3.1693	Log(mol.L)		DOI: 10.1093/bioinformatics/bty707
nARO	SWISSADME		0.0
VDW_Vol	ADMETLAB2		416.876			doi: 10.1093/nar/gkab255
LogS	PKCSM	Ali	-3.313	Log(mol.L)		doi: 10.1021/acs.jmedchem.5b00104
nHA	ADMETLAB2		9.0			doi: 10.1093/nar/gkab255
Dens	ADMETLAB2		1.02			doi: 10.1093/nar/gkab255
LogS	VEGA	ESOL	-1.33	Log(mol.L)		https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
nHA	SWISSADME		8.0
VSA	PKCSM		174.656			doi: 10.1021/acs.jmedchem.5b00104
LogS	SWISSADME	ESOL	-1.87	Log(mol.L)
nHA	PKCSM		7.0			doi: 10.1021/acs.jmedchem.5b00104
Mref	SWISSADME		105.83
LogS	SWISSADME	ALI	-3.15	Log(mol.L)
nHD	SWISSADME		5.0
TPSA	ADMETLAB2		156.69	Å²		doi: 10.1093/nar/gkab255
LogS	SWISSADME	Silicos-IT	-1.48	Log(mol.L)
nHD	ADMETLAB2		5.0			doi: 10.1093/nar/gkab255
TPSA	SWISSADME		156.69	Å²
nHD	PKCSM		5.0			doi: 10.1021/acs.jmedchem.5b00104
nRot	ADMETLAB2		12.0			doi: 10.1093/nar/gkab255
LogP	ADMETLAB2		0.507	Log(mol.L)	logP: The logarithm of the n-octanol/water distribution coefficient: The prediction is based on Multi-task Graph Attention (MGA) framework MGA is composed of input, Relation graph convolution network (RGCN) layers, attention layer and fully-connected (FC) layers. In the Input, a node represents the information of an atom, and after passing RGCN layers, the node represents general features of circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in the tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For those endpoints predicted by the regression models concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform, and different performance measures for the training test regression model: R-square (R2) of 0.980, mean absolute error (MAE) of 0.257, and root mean squared error (RMSE) of 0.193. The logarithm of the n-octanol/water distribution coefficient. log P possess a leading position with considerable impact on both membrane permeability and hydrophobic binding to macromolecules, including the target receptor as well as other proteins like plasma proteins, transporters, or metabolizing enzymes. This model is based on 12682 Total molecules, with 10145 in the training set, 1270 test set, and 1267 validation set drug like molecules. 
How to interpret: The predicted logP of a compound is the base-10 logarithm of its n-octanol/water partition ratio and is therefore dimensionless. Compounds with a logP in the range from 0 to 3 will be considered proper.	doi: 10.1093/nar/gkab255
nRot	SWISSADME		12.0
LogP	VEGA	Meylan-Kowwin	-1.49	Log(mol.L)	LogP model (Meylan-Kowwin)-prediction	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
nRot	PKCSM		11.0			doi: 10.1021/acs.jmedchem.5b00104
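The LogS rows above are base-10 logarithms of a molar solubility, so each can be converted to mol/L and, using a molar mass, to g/L. The sketch below does this for the seven LogS predictions in the table. It assumes the PKCSM MW value (425.478 g/mol) as the molar mass, and it applies the -4 to 0.5 range that the ADMETLAB2 comment gives for its own model to every tool, which is an assumption made only for illustration.

```python
# Predicted LogS values (log10 of mol/L) copied from the table above.
LOGS_PREDICTIONS = {
    "ADMETLAB2": -2.046,
    "ADMETSAR ESOL": -3.1693,
    "PKCSM Ali": -3.313,
    "VEGA ESOL": -1.33,
    "SWISSADME ESOL": -1.87,
    "SWISSADME ALI": -3.15,
    "SWISSADME Silicos-IT": -1.48,
}
MOLAR_MASS = 425.478  # g/mol; assumption: the PKCSM MW value from the table


def logs_to_solubility(log_s, molar_mass=MOLAR_MASS):
    """Return (mol/L, g/L) for a base-10 log molar solubility."""
    mol_per_l = 10.0 ** log_s
    return mol_per_l, mol_per_l * molar_mass


def in_proper_range(log_s, low=-4.0, high=0.5):
    """Range quoted in the ADMETLAB2 comment; reused for all tools here."""
    return low <= log_s <= high


for tool, log_s in LOGS_PREDICTIONS.items():
    mol_per_l, g_per_l = logs_to_solubility(log_s)
    flag = "in range" if in_proper_range(log_s) else "out of range"
    print(f"{tool:22s} LogS={log_s:8.4f}  {mol_per_l:.2e} mol/L  {g_per_l:7.3f} g/L  {flag}")
```

The spread between tools is about two log units, which corresponds to roughly a hundredfold difference in predicted solubility, so the converted values are best read as order-of-magnitude estimates.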

ADME

Category	Endpoint	Tool	QSAR ID	Value	Unit	Comments	Reference
Metabolism	CYP2C9-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1a2, 2d6, 2c9, 2c19, and 3a4. A compound was assigned as a CYP inhibitor if its AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) was ≤ 10 μM, and as a noninhibitor if its AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if its PubChem activity score was 0. Three subgroups of CYP substrates were collected by Carbon-Mangels et al., including 2d6, 2c9, and 3a4.[2] The models were built with MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Absorption	HIA	SWISSADME		-	Active/"-" = Inactive/Not predicted	The predictions for passive human gastrointestinal absorption (HIA) and blood-brain barrier (BBB) permeation both consist of the readout of the BOILED-Egg model (Daina, A. & Zoete, V. A BOILED-Egg To Predict Gastrointestinal Absorption and Brain Penetration of Small Molecules. ChemMedChem 11, 1117–1121 (2016)), an intuitive graphical classification model, which can be displayed in the SwissADME result page by clicking the red button appearing below the sketcher when all input molecules have been processed (refer to Graphical Output). These models are based on the computation of lipophilicity (WLOGP) and polarity (tPSA). Combining both best ellipses yields the BOILED-Egg predictive model for HIA and BBB, respectively. The white region is the physicochemical space of molecules with the highest probability of being absorbed by the gastrointestinal tract, and the yellow region (yolk) is the physicochemical space of molecules with the highest probability of permeating to the brain. Yolk and white areas are not mutually exclusive. Other binary classification models are included, which focus on the propensity for a given small molecule to be a substrate or inhibitor of proteins governing important pharmacokinetic behaviours. Gastrointestinal absorption: according to the white of the BOILED-Egg.	doi: 10.1002/cmdc.201600182
Transporter	Pgp Inhibitor	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	Pgp inhibitors: P-glycoprotein inhibitors. The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. P-glycoprotein (Pgp) is an essential cell membrane protein that extracts many foreign substances from the cell (Ambudkar et al., 2003). As such, it is a critical determinant of the pharmacokinetic properties of drugs. Cancer cells often overexpress Pgp, which increases the efflux of chemotherapeutic agents from the cell and prevents treatment by reducing the effective intracellular concentrations of such agents, a phenomenon known as multidrug resistance (Borst and Elferink, 2002). For this reason, identifying compounds that can either be transported out of the cell by Pgp (substrates) or impair Pgp function (inhibitors) is of great interest. The vNN model is based on a dataset that included 1,319 inhibitors and 937 non-inhibitors. The Pgp inhibitors and non-inhibitors were classified as positives and negatives, respectively. Overall accuracy was 85% using 10-fold CV, with a corresponding kappa value of 0.66. These models reliably predicted 76% of the compounds in their datasets to be Pgp inhibitors. The results were adapted: the "Yes" and "No" indicators were changed to "Active" and "-", respectively.	doi: 10.3389/fphar.2017.00889
Transporter	MATE1-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	MATE1 is an apically expressed poly-specific proton antiporter which mediates the efflux of diverse substrates, primarily organic cations, in the kidney and the liver. Following its relatively recent discovery, MATE1 has rapidly emerged as an important transporter in the renal and biliary excretion of endogenous and exogenous organic cations, particularly metformin. It appears that clinical inhibitors of organic cation transporters (OCTs) are also potent inhibitors of MATEs, and therefore modulation of the activity of both OCTs and MATEs, or predominantly of MATEs, may better describe DDIs currently ascribed to OCTs. The major focus of investigation for MATE1 has been on its role in renal drug disposition and elimination, notably on the renal elimination of metformin and the renal toxicity of cisplatin. Various studies of the impact of functional gene polymorphisms of MATE1, MATE2K, OCT1, and OCT2 on metformin pharmacokinetics, efficacy and safety, as well as preclinical assessments in Mate1 knockout mice, imply a significant role for these transporters. The recent FDA regulatory guideline now recommends evaluation of MATE1-mediated drug interactions for NCEs that undergo significant renal elimination. The human multidrug and toxin extrusion (MATE) transporter 1 contributes to the tissue distribution and excretion of many drugs. Inhibition of MATE1 may result in potential drug–drug interactions (DDIs) and alterations in drug exposure and accumulation in various tissues. In total 80 inhibitors and 738 non inhibitors were collected and the model was built by SubFP and random forest.	DOI: 10.1093/bioinformatics/bty707
Distribution	BBB permeant	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Blood brain barrier (BBB) model based on 1438/401 (positive/negative) molecules based on binary model. ADMET data are collected from literature and databases, represented by fingerprints and descriptors, and the models were built by machine (deep) learning methods. ADMETopt can be used to optimize the ADMET properties of a query compound by scaffold hopping. Robustness of the model: AUC: 0.944, Accuracy: 0.907, Sensitivity: 0.921, Specificity: 0.861.	DOI: 10.1093/bioinformatics/bty707
Absorption	Caco-2 Permeability	ADMETSAR		-	Active/"-" = Inactive/Not predicted	In total, 674 drug or drug-like molecules with experimental Caco-2 permeability values were used, with 303 positives and 371 negatives. The dataset was collected from Hai Pham The et al. (2011). The model is based on AtomPairs fingerprints with a support vector machine (SVM). The binary model's performance was AUC: 0.857, Accuracy: 0.768, Sensitivity: 0.73, Specificity: 0.799. The result of the prediction is binary: Active or - (inactive).	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP1A2-inh	PKCSM		-	Active/"-" = Inactive/Not predicted	CYP1A2 inhibitor: Cytochrome P450 1A2 isoform: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, were used for the qualitative predictions (classification tasks). Cytochrome P450 is an important detoxification enzyme in the body, mainly found in the liver. It oxidises xenobiotics to facilitate their excretion. Many drugs are deactivated by the cytochrome P450s, and some can be activated by them. Inhibitors of this enzyme, such as grapefruit juice, can affect drug metabolism and are contraindicated. It is therefore important to assess a compound's ability to inhibit the cytochrome P450. The CYP1A2 inhibitor model was built using over 14903 compounds whose ability to inhibit cytochrome P450 1A2 has been determined. A compound is considered to be a cytochrome P450 inhibitor if the concentration required to produce 50% inhibition is less than 10 uM. The best performing predictor in each task was chosen using a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictors assess whether a given molecule is likely to be a cytochrome P450 inhibitor for a given isoform. The qualitative outputs were relabeled: "Yes" was replaced with "Active" and "No" with "-".	doi: 10.1021/acs.jmedchem.5b00104
Metabolism	CYP3A4-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1a2, 2d6, 2c9, 2c19, and 3a4. A compound was assigned as a CYP inhibitor if its AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) was ≤ 10 μM, and as a noninhibitor if its AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if its PubChem activity score was 0. Three subgroups of CYP substrates were collected by Carbon-Mangels et al., including 2d6, 2c9, and 3a4.[2] The models were built with MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Excretion	CL	ADMETLAB2		2.813	ml/min/kg bw	CL: The clearance of a drug: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer, and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance measures for the regression model: R-square (R2) of 0.977, mean absolute error (MAE) of 0.740, and root mean squared error (RMSE) of 0.556. Clearance is an important pharmacokinetic parameter that defines, together with the volume of distribution, the half-life, and thus the frequency of dosing of a drug. This model is based on 831 total drug-like molecules, with 666 in the training set, 81 in the test set, and 84 in the validation set. How to interpret: The unit of predicted CL is ml/min/kg. > 15 ml/min/kg: high clearance; 5-15 ml/min/kg: moderate clearance; < 5 ml/min/kg: low clearance.	doi: 10.1093/nar/gkab255
Transporter	Pgp substrate	VEGA		-	Active/"-" = Inactive/Not predicted	The model provides a qualitative prediction of P-glycoprotein inhibitor/substrate activity. 96 molecular descriptors were used. To further reduce the likelihood of correlations between descriptors, a Kohonen top-map was used (Drgan et al., 2017). In this way, the remaining descriptors were mapped onto a network with a 7 by 7 architecture of neurons using the transpose of the descriptor matrix; two descriptors were selected from each neuron, those with the largest and the shortest Euclidean distance to the central neuron, yielding a final set of 96 molecular descriptors for further use. The dataset was collected mainly from the admetSAR database (http://lmmd.ecust.edu.cn/admetsar2) and from the work of Li et al. (doi.org/10.1021/mp400450m) and contains 1785 chemicals (training set). The P-Glycoprotein Activity Classification Model (NIC) is a multiclass classification model that combines a Counter Propagation Artificial Neural Network (CP ANN) with a genetic algorithm (GA) (Mora Lagares, L., Minovski, N., & Novic, M. (2019). Multiclass Classifier for P-Glycoprotein Substrates, Inhibitors, and Non-Active Compounds. Molecules, 24(10). doi:10.3390/molecules24102006). Training set performance: Accuracy = 0.95, Specificity = 0.95, Sensitivity = 0.95. The initial predictions were Inhibitor/Substrate/Non-active.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Transporter	BSEP-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	ABCB11, more commonly referred to as BSEP (Bile Salt Export Pump), is a uni-directional, ATP-dependent efflux transporter that plays an important role in the elimination of bile salts from the hepatocyte into the bile canaliculi for export into the gastrointestinal tract (GIT). It is almost exclusively expressed in the liver, with much lower levels reported in the kidney. It is mainly of relevance to hepatotoxicity, as BSEP inhibition by a drug and/or its metabolites can result in the buildup of bile salts in the liver, which can lead to cholestasis and drug-induced liver injury (DILI). Compared to other drug transporters, there are only a few identified drug substrates and inhibitors of BSEP; thus, its involvement in drug-drug interactions (DDI) is very limited. The relevance of in vitro BSEP inhibition as a predictor of clinical outcomes is not clearly established, but whenever cholestatic liver injury is observed in clinical or preclinical trials, characterization of BSEP interactions should be considered. In contrast with the FDA guidance, the EMA guidance recommends consideration of in vitro BSEP inhibition testing for NCEs. In total, 317 inhibitors and 290 noninhibitors were collected and the model was built with AtomPairs fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2D6-sub	PKCSM		-	Active/"-" = Inactive/Not predicted	CYP2D6 substrate: Cytochrome P450 substrate 2D6 isoform: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, were used for the qualitative predictions (classification tasks). The cytochrome P450s are responsible for the metabolism of many drugs. However, inhibitors of the P450s can dramatically alter the pharmacokinetics of these drugs. It is therefore important to assess whether a given compound is likely to be a cytochrome P450 substrate. The two main isoforms responsible for drug metabolism are 2D6 and 3A4. These models were built using 671 compounds whose metabolism by each cytochrome P450 isoform has been measured. The best performing predictor in each task was chosen using a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictor assesses whether a given molecule is likely to be metabolized by either P450. The qualitative outputs were relabeled: "Yes" was replaced with "Active" and "No" with "-".	doi: 10.1021/acs.jmedchem.5b00104
Absorption	Caco-2 Permeability	ADMETLAB2		2.0606	10⁻⁶ cm/s	Caco-2 Permeability. The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer, and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance measures for the regression model: R-square (R2) of 0.943, mean absolute error (MAE) of 0.152, and root mean squared error (RMSE) of 0.117. Caco-2: Before an oral drug reaches the systemic circulation, it must pass through intestinal cell membranes via passive diffusion, carrier-mediated uptake or active transport processes. The human colon adenocarcinoma cell line (Caco-2), as an alternative model for the human intestinal epithelium, has been commonly used to estimate in vivo drug permeability because of its morphological and functional similarities. Thus, Caco-2 cell permeability has also been an important index for an eligible candidate drug compound. This model is based on 2464 total drug-like molecules with Caco-2 permeability values, with 1970 in the training set, 247 in the test set, and 247 in the validation set, and predicts the logarithm of the apparent permeability coefficient (log Papp; log cm/s).	doi: 10.1093/nar/gkab255
Excretion	T1/2	VEGA		2.421	h	This study addresses the development of QSAR models for the prediction of total body elimination half-lives. The first aim of this work is the creation of statistically valid and predictive models for the prediction of half-lives in humans; the second aim is to show how QSAR predictions can be used for the refinement of chemical screening procedures for hazard assessment. The rate constant kT (h⁻¹) was converted to a normalized biotransformation half-life value (HLT, h), and then expressed in base 10 log units as LogHLT. The dataset was taken from the literature (J.A. Arnot, T.N. Brown, F. Wania, Estimating screening-level organic chemical half-lives in human. Environ Sci Technol. 2014; 48:723-730). The HL dataset consists of the union of several datasets to obtain a variety of discrete organic chemical structures with a range of HL values. The final data set is composed of 1105 chemicals with molar mass ranging from 30 (formaldehyde) to 960 (decabromodiphenyl ether) g/mol. The HLs span approximately 7.5 orders of magnitude from 0.05 h (0.002 d) for nitroglycerin to 2 × 10⁶ h (83 000 d) for 2,3,4,5,2′,3′,5′,6′-octachlorobiphenyl with a median of 7.6 h (0.32 d). The corresponding rate constants range from 14/h (330/d) to 3.5 × 10⁻⁷/h (8.3 × 10⁻⁶/d) with a median of 0.091/h (2.2/d). Eighty percent of the chemicals in the HLT QSAR data set are pharmaceuticals (measured HLT) and 20% are environmental contaminants (estimated or assumed to approximate HLT). LogHLT ranges from -1.30 to 6.30 for the training set and from -1.08 to 5.83 for the test set. After successful validation, the model was retrained on the entire dataset for implementation. The dataset was split into training (552) and test (553) sets. For more details see sections 6.6 and 7.6 of the QMRF. LogHLT (total elimination half-life in human), OLS-MLR method: split model developed on a training set of 552 compounds; full model developed on a training set of 1105 compounds. Split model equation: LogHLT = 0.5683 + 0.3299 ScCl + 0.6018 AATS7p + 0.2385 nF - 0.0043 TopoPSA - 0.0484 gmax + 0.0778 GGI1 - 0.2404 minsCl - 0.27 minsHsOH. Full model equation: LogHLT = 0.6577 + 0.351 ScCl + 0.5905 AATS7p - 0.0042 TopoPSA + 0.2105 nF - 0.0495 gmax - 0.4298 minsCl + 0.0686 GGI1 - 0.2927 minsHsOH. Statistics for goodness-of-fit: R2 = 0.78; CCCtr [9,10] = 0.88; RMSEtr = 0.62. The VEGA implementation returns the following statistics on the entire dataset (1105 compounds): R2 ext = 0.77; MAE = 0.489; MSE = 0.404; RMSE = 0.63.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Transporter	Pgp substrate	vNN-ADMET		Active	Active/"-" = Inactive/Not predicted	Pgp substrates: P-glycoprotein substrates. The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. P-glycoprotein (Pgp) is an essential cell membrane protein that extracts many foreign substances from the cell (Ambudkar et al., 2003). As such, it is a critical determinant of the pharmacokinetic properties of drugs. Cancer cells often overexpress Pgp, which increases the efflux of chemotherapeutic agents from the cell and prevents treatment by reducing the effective intracellular concentrations of such agents, a phenomenon known as multidrug resistance (Borst and Elferink, 2002). For this reason, identifying compounds that can either be transported out of the cell by Pgp (substrates) or impair Pgp function (inhibitors) is of great interest. The vNN model is based on a dataset that included measurements for 422 substrates and 400 non-substrates. The Pgp substrates and non-substrates were classified as positives and negatives, respectively. Overall accuracy was 79% using 10-fold CV, with a corresponding kappa value of 0.58. The model reliably predicted 65% of the compounds in its dataset to be Pgp substrates.	doi: 10.3389/fphar.2017.00889
Metabolism	CYP2C9-sub	ADMETLAB2		Active	Active/"-" = Inactive/Not predicted	CYP2C9 substrate: The prediction is based on Multi-task Graph Attention (MGA) framework MGA is composed of input, Relation graph convolution network (RGCN) layers, attention layer and fully-connected (FC) layers. In the Input, a node represents the information of an atom, and after passing RGCN layers, the node represents general features of circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in the tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For those endpoints predicted by the regression models concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform, and different performance of classification models in training and validation sets: AUC: 0.967, ACC: 0.904, SP: 0.911, Sen: 0.894, MCC: 0.801. Based on the chemical nature of biotransformation, the process of drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes and these isozymes metabolize approximately two-thirds of known drugs in human with 80% of this attribute to five isozymes––1A2, 3A4, 2C9, 2C19 and 2D6. Most of these CYPs responsible for phase I reactions are concentrated in the liver. 
This model is based on 811 (325/486) total drug-like molecules, with 647 (259/388) in the training set, 82 (33/49) in the test set, and 82 (33/49) in the validation set. Leave-cluster-out validation was used for model validation. How to interpret: Category 0: Non-substrate / Non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, within the range of 0 to 1. Empirical decision: If the prediction is >= 0.5, the endpoint is considered "Active". Otherwise it is considered inactive ("-").	doi: 10.1093/nar/gkab255
Absorption	F(20%)	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	F20%: human oral bioavailability of 20%. For any drug administered by the oral route, oral bioavailability is undoubtedly one of the most important pharmacokinetic parameters because it indicates the efficiency of drug delivery to the systemic circulation. Result interpretation: Molecules with a bioavailability ≥ 20% were classified as F20%- (Category 0), while molecules with a bioavailability < 20% were classified as F20%+ (Category 1).	doi: 10.1093/nar/gkab255
Distribution	VDss	ADMETLAB2		0.33	L/kg	VDss: Volume of distribution. The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance measures reported for the regression model: R-square (R2) of 0.895, mean absolute error (MAE) of 0.492, and root mean squared error (RMSE) of 0.330. The volume of distribution (VD) is a theoretical concept that connects the administered dose with the actual initial concentration present in the circulation, and it is an important parameter for describing the in vivo distribution of drugs. In practice, the distribution characteristics of an unknown compound can be inferred from its VD value, such as its binding to plasma proteins, its distribution in body fluids and its uptake in tissues. 
This model is based on 1086 drug-like molecules in total, with 872 in the training set, 107 in the test set, and 107 in the validation set. How to interpret: The unit of predicted VD is L/kg.	doi: 10.1093/nar/gkab255
Transporter	OATP1B3-inh	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	OATP1B3 is an uptake transporter exclusively expressed in the liver on the sinusoidal (basolateral) side of centrilobular hepatocytes. In conjunction with OATP1B1, it is responsible for the hepatic uptake of some important drug classes, notably statins, for the uptake of bile acids (in conjunction with OATP1B1 and NTCP) and bilirubin, as well as some other endogenous molecules. It is a mediator of drug interactions, but as it shares many substrates and inhibitors with another major hepatic uptake transporter, OATP1B1, its role may not be fully appreciated. The FDA and EMA recommend in vitro testing of OATP1B3 interactions for drug candidates that are eliminated in part via the liver and/or will be co-administered with OATP1B3 substrates. The OATP1B3 inhibition model was trained on 1743 inhibitors and 130 noninhibitors using Morgan fingerprints and a random forest.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP1A2-sub	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP1A2 substrate: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.985, ACC: 0.936, SP: 0.942, Sen: 0.929, MCC: 0.871. Based on the chemical nature of biotransformation, drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes, which metabolize approximately two-thirds of known drugs in humans, with 80% of this attributable to five isozymes: 1A2, 3A4, 2C9, 2C19 and 2D6. Most of the CYPs responsible for phase I reactions are concentrated in the liver. 
This model is based on 366 (176/190) drug-like molecules in total, with 292 (140/152) in the training set, 37 (18/19) in the test set, and 37 (18/19) in the validation set. Leave-cluster-out validation was used to validate the classification models. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, within the range of 0 to 1. Empirical decision: if the prediction is >= 0.5, the endpoint is considered “active”; otherwise it is considered inactive (“-”).	doi: 10.1093/nar/gkab255
Transporter	Pgp substrate	ADMETSAR		-	Active/"-" = Inactive/Not predicted	The P-glycoprotein substrates (pgps+) and nonsubstrates (pgps-) were collected from two research articles. In total, 718 pgps+ and 847 pgps- were obtained after data preparation, which included removing salts, duplicates and inorganic compounds. The model was built using Morgan fingerprints with a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2D6-inh	SWISSADME		-	Active/"-" = Inactive/Not predicted	CYP2D6 inhibitor: Cytochrome P450 inhibition (drug-drug interaction): The support vector machine (SVM) method (Cortes, C. & Vapnik, V., 1995) was applied to meticulously cleansed large datasets of known inhibitors/non-inhibitors. In similar contexts, SVM was found to perform better than other machine-learning algorithms for binary classification (Mishra et al. 2010). The models return “Yes” or “No” depending on whether the molecule under investigation has a higher probability of being an inhibitor or a non-inhibitor of a given CYP, respectively. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP2D6 (cytochrome P450 2D6) inhibitor: the SVM model was built on 3664 molecules (training set) and tested on 1068 molecules (test set). 10-fold CV: ACC = 0.79 / AUC = 0.85; external: ACC = 0.81 / AUC = 0.87. The results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.
Transporter	OCT1-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	OCT1 is primarily a hepatic uptake transporter, expressed on the sinusoidal membrane (blood side) of hepatocytes. It plays a key role in the disposition and hepatic clearance of mostly cationic drugs and endogenous compounds. It functions in conjunction with MATE1 that facilitates the biliary elimination of OCT1 substrates transported into the liver. Metformin is an important clinical substrate of OCT1. Genetic polymorphisms of OCT1 are associated with altered metformin pharmacokinetics, safety, and efficacy, but the contributions of other cation transporters and their functional SNPs are also important. Since the discovery of MATEs, DDIs ascribed to OCTs are being re-evaluated, and it is likely that some interactions may be re-assigned to MATEs. Regardless of this, the role of OCT1 as the first step in active hepatic extraction of cationic drugs remains important. Current FDA and EMA guidances do not specifically recommend evaluation of OCT1 liabilities, although investigation of OCT2 or OCTs in general is advised. It is appropriate to consider evaluating OCT1 interactions for drugs that are likely to be co-administered with OCT/MATE substrates, particularly metformin. Although there is no guidance for MATEs either, simultaneous evaluation of their interactions is also advisable.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2C19-inh	SWISSADME		-	Active/"-" = Inactive/Not predicted	CYP2C19 inhibitor: Cytochrome P450 inhibition (drug-drug interaction): The support vector machine (SVM) method (Cortes, C. & Vapnik, V., 1995) was applied to meticulously cleansed large datasets of known inhibitors/non-inhibitors. In similar contexts, SVM was found to perform better than other machine-learning algorithms for binary classification (Mishra et al. 2010). The models return “Yes” or “No” depending on whether the molecule under investigation has a higher probability of being an inhibitor or a non-inhibitor of a given CYP, respectively. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP2C19 (cytochrome P450 2C19) inhibitor: the SVM model was built on 9272 molecules (training set) and tested on 3000 molecules (test set). 10-fold CV: ACC = 0.80 / AUC = 0.86; external: ACC = 0.80 / AUC = 0.87.
Metabolism	UGT activity	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	Human uridine diphosphate (UDP)-glucuronosyltransferases (UGTs) are major phase II drug-metabolizing enzymes that catalyze the transfer of glucuronic acid from UDP-glucuronic acid to various substrates containing nucleophilic functional groups, e.g. alcohols, phenols, carboxylic acids, amines and thiols. To date, 22 human UGT proteins have been identified, and they are classified into four families: UGT1, UGT2, UGT3 and UGT8.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2D6-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1A2, 2D6, 2C9, 2C19, and 3A4. A compound was assigned as a CYP inhibitor if the AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) value was ≤ 10 μM, and it was considered a noninhibitor if the AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if it had a PubChem activity score equal to 0. Three subgroups of CYP substrates were collected by Carbon-Mangels et al., including 2D6, 2C9, and 3A4.[2] The models were built using MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Transporter	Renal OCT2 substrate	PKCSM		-	Active/"-" = Inactive/Not predicted	Renal OCT2 substrate: Organic Cation Transporter 2: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, were used for the qualitative predictions (classification tasks). Organic Cation Transporter 2 is a renal uptake transporter that plays an important role in the disposition and renal clearance of drugs and endogenous compounds. OCT2 substrates also have the potential for adverse interactions with coadministered OCT2 inhibitors. Assessing a candidate’s potential to be transported by OCT2 provides useful information regarding not only its clearance but also potential contraindications. This model was built using 906 compounds whose transport by OCT2 has been experimentally measured. The best performing predictor in each task was chosen based on a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictor assesses whether a given molecule is likely to be an OCT2 substrate. The qualitative outputs were relabeled: "Yes" was replaced by "Active" and "No" by "-".	doi: 10.1021/acs.jmedchem.5b00104
Metabolism	CYP2C19-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1A2, 2D6, 2C9, 2C19, and 3A4. A compound was assigned as a CYP inhibitor if the AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) value was ≤ 10 μM, and it was considered a noninhibitor if the AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if it had a PubChem activity score equal to 0. Three subgroups of CYP substrates were collected by Carbon-Mangels et al., including 2D6, 2C9, and 3A4.[2] The models were built using MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Transporter	Pgp Inhibitor	VEGA		-	Active/"-" = Inactive/Not predicted	The model provides a qualitative prediction of P-glycoprotein inhibition/substrate activity. To further reduce the likelihood of correlations between descriptors, a Kohonen top-map was used (Drgan et al., 2017). In this way, the remaining descriptors were mapped onto a network with a 7 by 7 architecture of neurons using the transpose of the descriptor matrix; two descriptors were selected from each neuron, those with the largest and the shortest Euclidean distance to the central neuron, yielding a final set of 96 molecular descriptors for further use. The dataset was collected mainly from the admetSAR database (http://lmmd.ecust.edu.cn/admetsar2) and from the work of Li et al. (doi.org/10.1021/mp400450m) and contains 1785 chemicals (training set). The P-Glycoprotein Activity Classification Model (NIC) is a multiclass classification model built with a Counter Propagation Artificial Neural Network (CP ANN) in combination with a genetic algorithm (GA): Mora Lagares, L., Minovski, N., & Novic, M. (2019). Multiclass Classifier for P-Glycoprotein Substrates, Inhibitors, and Non-Active Compounds. Molecules, 24(10). doi:10.3390/molecules24102006. On the training set, the model reached Accuracy = 0.95, Specificity = 0.95, Sensitivity = 0.95. The initial predictions were Inhibitor/Substrate/Non-active.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Metabolism	CYP2C9-inh	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP2C9 inhibitor: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.960, ACC: 0.880, SP: 0.849, Sen: 0.942, MCC: 0.755. Based on the chemical nature of biotransformation, drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes, which metabolize approximately two-thirds of known drugs in humans, with 80% of this attributable to five isozymes: 1A2, 3A4, 2C9, 2C19 and 2D6. Most of the CYPs responsible for phase I reactions are concentrated in the liver. 
This model is based on 12111 (4017/8094) drug-like molecules in total, with 9686 (3213/6473) in the training set, 1213 (402/811) in the test set, and 1212 (402/810) in the validation set. Leave-cluster-out validation was used to validate the classification models. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, within the range of 0 to 1. Empirical decision: if the prediction is >= 0.5, the endpoint is considered “active”; otherwise it is considered inactive (“-”).	doi: 10.1093/nar/gkab255
Distribution	BBB permeant	ADMETLAB2		Active	Active/"-" = Inactive/Not predicted	BBB Penetration: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.992, ACC: 0.957, SP: 0.948, Sen: 0.964, MCC: 0.912. Drugs that act in the CNS need to cross the blood–brain barrier (BBB) to reach their molecular target. By contrast, for drugs with a peripheral target, little or no BBB penetration might be required in order to avoid CNS side effects. This model is based on 2905 (1651 pos/1254 neg) drug-like molecules in total, with 2324 (1321/1003) in the training set, 290 (165/125) in the test set, and 291 (165/126) in the validation set. Leave-cluster-out validation was used to validate the classification models. 
How to interpret: The unit of BBB penetration is cm/s. Molecules with logBB > -1 were classified as BBB+ (Category 1), while molecules with logBB ≤ -1 were classified as BBB- (Category 0). The output value is the probability of being BBB+, within the range of 0 to 1.	doi: 10.1093/nar/gkab255
Metabolism	CYP3A4-inh	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP3A4 inhibitor: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.960, ACC: 0.891, SP: 0.869, Sen: 0.922, MCC: 0.781. Based on the chemical nature of biotransformation, drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes, which metabolize approximately two-thirds of known drugs in humans, with 80% of this attributable to five isozymes: 1A2, 3A4, 2C9, 2C19 and 2D6. Most of the CYPs responsible for phase I reactions are concentrated in the liver. 
This model is based on 12339 (5092/7247) drug-like molecules in total, with 9880 (4074/5806) in the training set, 1232 (510/722) in the test set, and 1227 (508/719) in the validation set. Leave-cluster-out validation was used to validate the classification models. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, within the range of 0 to 1. Empirical decision: if the prediction is >= 0.5, the endpoint is considered “active”; otherwise it is considered inactive (“-”).	doi: 10.1093/nar/gkab255
Absorption	Kp	PKCSM		0.4985	10^-6 cm/s	Skin Permeability (log Kp): The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression, were used for the quantitative predictions (regression tasks). Skin permeability is a significant consideration for the efficacy of many consumer products, and of interest for the development of transdermal drug delivery. The best performing predictor in each task was chosen based on a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. This predictor was built using 211 compounds whose in vitro human skin permeability has been measured. How to interpret the results: The model predicts whether a given compound is likely to be skin permeable, expressed as the skin permeability constant logKp (cm/h). A compound is considered to have a relatively low skin permeability if it has a logKp > -2.5. DOI: 10.1016/j.taap.2014.12.013. The data were converted to cm/s for easier comparison with other tools.	doi: 10.1021/acs.jmedchem.5b00104
Distribution	CNS permeability	PKCSM		0.1337	(µL/min/g brain)	CNS permeability: central nervous system (CNS) permeability (alternative to blood-brain barrier permeability). The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression, were used for the quantitative predictions (regression tasks). Measuring blood-brain permeability can be difficult because of confounding factors. The blood-brain permeability-surface area product (logPS) is a more direct measurement. It is obtained from in situ brain perfusions with the compound directly injected into the carotid artery. This avoids the systemic distribution effects that may distort brain penetration. This predictive model was built using 153 compounds whose logPS has been experimentally measured. PS is measured in mL/min/g brain: [PS = −F ln(1 – (Kin/F))], where F is the cerebral blood or perfusion flow rate and Kin is the unidirectional transfer constant; [Kin = (Qbr/Cpf)/T], in which Qbr is the concentration of compound in the brain, corrected for the vascular volume, Cpf is the concentration of compound in the perfusion fluid and T is the perfusion time. The best performing predictor in each task was chosen based on a 10-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: Compounds with a logPS > -2 are considered to penetrate the central nervous system (CNS), while those with logPS < -3 are considered unable to penetrate the CNS.	doi: 10.1021/acs.jmedchem.5b00104
Metabolism	CYP3A4-sub	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1A2, 2D6, 2C9, 2C19, and 3A4. A compound was assigned as a CYP inhibitor if the AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) value was ≤ 10 μM, and it was considered a noninhibitor if the AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if it had a PubChem activity score equal to 0. Three subgroups of CYP substrates were collected by Carbon-Mangels et al., including 2D6, 2C9, and 3A4.[2] The models were built using MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2C9-inh	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	CYP2C9 Cytochrome P450 inhibition (drug-drug interaction): The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is the variable nearest neighbor (vNN) method, which uses all nearest neighbors that meet a predetermined structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP2C9: CYP inhibitors were retrieved from ChEMBL (Bento et al., 2014) and classified as inhibitors if the IC50 was below 10 μM. The vNN model was applied to a dataset of 8,072 molecules with a Tanimoto-distance threshold value of 0.50: accuracy 0.91, sensitivity 0.55, specificity 0.96, kappa 0.54, coverage 0.76. The results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.	doi: 10.3389/fphar.2017.00889.
Excretion	T1/2	ADMETLAB2		Active	Active/"-" = Inactive/Not predicted	T1/2: half-life below or above 3 hours. The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.948, ACC: 0.869, SP: 0.822, Sen: 0.938, MCC: 0.746. The half-life of a drug is a hybrid concept that involves clearance and volume of distribution, and it is arguably more appropriate to have reliable estimates of these two properties instead. This model is based on 1219 (500/719) drug-like molecules in total, with 973 (399/574) in the training set, 124 (51/73) in the test set, and 122 (50/72) in the validation set. Leave-cluster-out validation was used to validate the classification models. 
How to interpret: Molecules with T1/2 > 3 hours were classified as T1/2- (Category 0), while molecules with T1/2 ≤ 3 hours were classified as T1/2+ (Category 1). The output value is the probability of being T1/2+, within the range of 0 to 1. Empirical decision: if the prediction is >= 0.5, the endpoint is considered “active”; otherwise it is considered inactive (“-”).	doi: 10.1093/nar/gkab255
Metabolism	CYP3A4-inh	SWISSADME		-	Active/"-" = Inactive/Not predicted	CYP3A4 inhibitor: Cytochrome P450 inhibition (drug-drug interaction): The support vector machine (SVM) method (Cortes, C. & Vapnik, V., 1995) was applied to meticulously cleansed large datasets of known inhibitors/non-inhibitors. In similar contexts, SVM was found to perform better than other machine-learning algorithms for binary classification (Mishra et al. 2010). The models return “Yes” or “No” depending on whether the molecule under investigation has a higher probability of being an inhibitor or a non-inhibitor of a given CYP, respectively. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP3A4 (cytochrome P450 3A4) inhibitor: the SVM model was built on 7518 molecules (training set) and tested on 2579 molecules (test set). 10-fold CV: ACC = 0.77 / AUC = 0.85; external: ACC = 0.78 / AUC = 0.86. The results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.
Distribution	BBB permeant	PKCSM		0.06	Brain/blood partition coefficient (no unit)	BBB permeability: blood-brain barrier (BBB) permeability. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression did the quantitative predictions (regression tasks). The brain is protected from exogenous compounds by the blood-brain barrier (BBB). The ability of a drug to cross into the brain is an important parameter to consider to help reduce side effects and toxicities or to improve the efficacy of drugs whose pharmacological activity is within the brain. Blood-brain permeability is measured in vivo in animal models as logBB, the logarithmic ratio of brain to plasma drug concentrations. LogBB = log (C brain/ C blood). This predictive model was built using 320 compounds whose logBB has been experimentally measured. The best performing predictor in each task was chosen based 10-fold cv. approach. The Weka toolkit was used for training and testing the models.	doi: 10.1021/acs.jmedchem.5b00104
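The logBB definition quoted above, LogBB = log (C brain / C blood), can be written out as a short sketch. The base-10 logarithm is assumed here, and the function names are hypothetical; the concentrations in the usage lines are made-up illustrative numbers, not measured data.

```python
import math


def log_bb(c_brain: float, c_blood: float) -> float:
    """Return logBB = log10(C_brain / C_blood).

    Both concentrations must be positive and in the same units.
    Base 10 is an assumption; the text only says "logarithmic ratio".
    """
    if c_brain <= 0 or c_blood <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_brain / c_blood)


def brain_blood_ratio(logbb: float) -> float:
    """Invert logBB back to the plain brain/blood concentration ratio."""
    return 10.0 ** logbb


if __name__ == "__main__":
    # Illustrative values only.
    print(log_bb(2.0, 1.0))
    print(brain_blood_ratio(0.0))
```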
Absorption	F(30%)	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	F30%: human oral bioavailability of 30%. For any drug administered by the oral route, oral bioavailability is one of the most important pharmacokinetic parameters because it indicates how efficiently the drug is delivered to the systemic circulation. Result interpretation: Molecules with a bioavailability ≥ 30% were classified as F30%- (Category 0), while molecules with a bioavailability < 30% were classified as F30%+ (Category 1). The output value is the probability of being F30%+, within the range of 0 to 1. Empirical decision: 0-0.3: excellent (green); 0.3-0.7: medium (yellow); 0.7-1.0(++): poor (red). If the prediction is greater than or equal to 0.5, the molecule is considered “Active”. Otherwise, the molecule is noted “-”.	doi: 10.1093/nar/gkab255
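The two readings of the F30% probability described above (the colour bands and the "Active" / "-" label) can be sketched together. The text does not say which band the boundary values 0.3 and 0.7 belong to, so the half-open intervals used below are an assumption, and both function names are hypothetical.

```python
def f30_band(p: float) -> str:
    """Return the empirical band for an F30%+ probability.

    Assumed intervals: [0, 0.3) excellent, [0.3, 0.7) medium, [0.7, 1] poor.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p < 0.3:
        return "excellent"
    if p < 0.7:
        return "medium"
    return "poor"


def f30_label(p: float) -> str:
    """Return the table label: "Active" if p >= 0.5, otherwise "-"."""
    return "Active" if p >= 0.5 else "-"


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, f30_band(p), f30_label(p))
```

Note that a "medium" probability can map to either label, since the 0.5 cut-off falls inside the 0.3-0.7 band.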
Metabolism	CYP-inh Pro	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1a2, 2d6, 2c9, 2c19, and 3a4. A compound was assigned as a CYP inhibitor if the AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) value was ≤ 10 μM, and it was considered a noninhibitor if the AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if it had a PubChem activity score equal to 0. Three subgroups of CYP substrates were collected by Carbon-Mangles et al., including 2d6, 2c9, and 3a4.[2] The models were built with MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP3A4-inh	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	CYP3A4 Cytochrome P450 inhibition (drug-drug interaction): The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is the variable nearest neighbor (vNN) method, which uses a predetermined similarity criterion: all nearest neighbors that meet a structural similarity criterion are used, and this criterion defines the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP3A4: CYP inhibitors were retrieved from ChEMBL (Bento et al., 2014) and classified as inhibitors if the IC50 was below 10 μM. The vNN model was built on a dataset of 10,373 molecules with a Tanimoto-distance threshold of 0.50: accuracy 0.88, sensitivity 0.76, specificity 0.92, kappa 0.68, coverage 0.78. Results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.	doi: 10.3389/fphar.2017.00889
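The idea behind the vNN method described above (use every training compound within a Tanimoto-distance threshold, and make no prediction when none qualifies) can be sketched as follows. This is a simplified illustration, not the vNN-ADMET implementation: fingerprints are represented as plain sets of integer feature IDs, the distance-weighted average with a Gaussian-like kernel and the smoothing factor `h` are assumptions not stated in the text, and all names are hypothetical.

```python
import math
from typing import Optional, Sequence, Set, Tuple


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Tanimoto distance = 1 - |A ∩ B| / |A ∪ B| for two feature sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def vnn_predict(
    query: Set[int],
    training: Sequence[Tuple[Set[int], float]],
    d0: float = 0.50,  # distance threshold quoted in the text
    h: float = 0.2,    # smoothing factor: assumed value, not from the text
) -> Optional[float]:
    """Weighted average of neighbor labels within distance d0.

    Returns None when no training compound meets the threshold, i.e. the
    query lies outside the applicability domain.
    """
    num = 0.0
    den = 0.0
    found = False
    for fingerprint, label in training:
        d = tanimoto_distance(query, fingerprint)
        if d <= d0:
            weight = math.exp(-((d / h) ** 2))
            num += weight * label
            den += weight
            found = True
    if not found:
        return None
    return num / den


if __name__ == "__main__":
    # Toy training set: (feature set, label), 1.0 = inhibitor, 0.0 = non-inhibitor.
    train = [({1, 2, 3, 4}, 1.0), ({1, 2, 3, 5}, 0.0), ({10, 11}, 1.0)]
    print(vnn_predict({1, 2, 3, 4}, train))  # dominated by the identical neighbor
    print(vnn_predict({20, 21}, train))      # None: outside the applicability domain
```

Returning `None` rather than a default class mirrors the "coverage" statistic reported for the model: only compounds with at least one qualifying neighbor receive a prediction.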
Metabolism	CYP2C9-inh	PKCSM		-	Active/"-" = Inactive/Not predicted	CYP2C9 inhibitor: Cytochrome P450 substrate 2C9 isoform: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, did the qualitative predictions (classification tasks). Cytochrome P450 is an important detoxification enzyme in the body, mainly found in the liver. It oxidises xenobiotics to facilitate their excretion. Many drugs are deactivated by the cytochrome P450’s, and some can be activated by it. Inhibitors of this enzyme, such as grapefruit juice, can affect drug metabolism and are contraindicated. It is therefore important to assess a compound’s ability to inhibit the cytochrome P450. Model for CYP2C9 inhibitor was built using from over 14709 compounds whose ability to inhibit the cytochrome P450 2C9 has been determined. A compound is considered to be a cytochrome P450 inhibitor if the concentration required to lead to 50% inhibition is less than 10 uM. The best performing predictor in each task was chosen based 5-fold cv approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictors will assess a given molecule to determine whether it is likely going to be a cytochrome P450 inhibitor, for a given isoform. Qualitative information were changed: "Yes" was replaced to "Active" and "No" prediction was replaced to "-".	doi: 10.1021/acs.jmedchem.5b00104
Distribution	FU	ADMETLAB2		38.7545	%	FU: Fraction unbound: The prediction is based on Multi-task Graph Attention (MGA) framework MGA is composed of input, Relation graph convolution network (RGCN) layers, attention layer and fully-connected (FC) layers. In the Input, a node represents the information of an atom, and after passing RGCN layers, the node represents general features of circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in the tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For those endpoints predicted by the regression models concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform, and different performance measures for the training test regression model: R-square (R2) of 0.861, mean absolute error (MAE) of 0.268, and root mean squared error (RMSE) of 0.197. The fraction unbound in plasma. Most drugs in plasma will exist in equilibrium between either an unbound state or bound to serum proteins. Efficacy of a given drug may be affect by the degree to which it binds proteins within blood, as the more that is bound the less efficiently it can traverse cellular membranes or diffuse. This model is based on 2575 Total molecules, with 2059 in the training set, 258 test set, and 258 validation set drug like molecules.	doi: 10.1093/nar/gkab255
Absorption	HOB	ADMETSAR		-	Active/"-" = Inactive/Not predicted	In total, 995 molecules were collected from Kim et al. (2014) , including 509 positive and 486 negative compounds. Compounds with logK(%F) > 0 were considered as positive. The model is based on Morgan fingerprint descriptor and random forest algorithm. The binary model’s performances were AUC: 0.752, Accuracy: 0.697, Sensitivity: 0.739, Specificity: 0.654. The result of the prediction is binary Active or - (inactive).	DOI: 10.1093/bioinformatics/bty707
Transporter	Pgp II Inhibitor	PKCSM		-	Active/"-" = Inactive/Not predicted	Pgp inhibitor II: P-glycoprotein is an ATP-binding cassette (ABC) transporter. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, were used for the qualitative predictions (classification tasks). The best performing predictor in each task was chosen based on a 5-fold cv approach. The Weka toolkit was used for training and testing the models. P-glycoprotein I and II inhibitors: Modulation of P-glycoprotein mediated transport has significant pharmacokinetic implications for Pgp substrates, which may either be exploited for specific therapeutic advantages or result in contraindications. These predictive models were built using 1273 and 1275 compounds that have been characterized for their ability to inhibit P-glycoprotein I and P-glycoprotein II transport, respectively. How to interpret the results: The predictor determines whether a given compound is likely to be a P-glycoprotein I/II inhibitor.	doi: 10.1021/acs.jmedchem.5b00104
Transporter	Pgp substrate	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Pgp substrate: substrate of P-glycoprotein. The prediction is based on Multi-task Graph Attention (MGA) framework MGA is composed of input, Relation graph convolution network (RGCN) layers, attention layer and fully-connected (FC) layers. In the Input, a node represents the information of an atom, and after passing RGCN layers, the node represents general features of circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in the tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For those endpoints predicted by the regression models concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform, and different performance of classification models in training and validation sets: AUC: 1.000, ACC 1.000, SP: 1.000, Sen: 1.000, MCC: 1.000. As described in the Pgp-inhibitor section, modulation of P-glycoprotein mediated transport has significant pharmacokinetic implications for Pgp substrates, which may either be exploited for specific therapeutic advantages or result in contraindications. This model is based on 1185 (586/599) molecules, 949 (471/478) Total in the training set, 118 (58/60) test set, and 118 (57/61) validation set drug like molecules. Leave-cluster-out validation of classification models was used for model validation. 
How to interpret: Category 0: Non-substrate / Non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being substrate / inhibitor, within the range of 0 to 1. Empirical decision: If the prediction >= 0.5 the endpoint is considered “active”. If not it is considered as inactive “-“.	doi: 10.1093/nar/gkab255
Distribution	BBB permeant	SWISSADME		-	Active/"-" = Inactive/Not predicted	BBB: blood-brain barrier: The predictions for passive human gastrointestinal absorption (HIA) and blood-brain barrier (BBB) permeation both consist in the readout of the BOILED-Egg model (Daina, A. & Zoete, V. A BOILED-Egg To Predict Gastrointestinal Absorption and Brain Penetration of Small Molecules. ChemMedChem 11, 1117–1121 (2016), an intuitive graphical classification model, which can be displayed in the SwissADME result page by clicking the red button appearing below the sketcher when all input molecules have been processed (refer to Graphical Output). This models are based on the computation of the lipophilicity (WLOGP) and polarity (tPSA). Combining both best ellipses yields the BOILED‐Egg predictive model for respectively HIA and BBB. The white region is the physicochemical space of molecules with highest probability of being absorbed by the gastrointestinal tract, and the yellow region (yolk) is the physicochemical space of molecules with highest probability to permeate to the brain. Yolk and white areas are not mutually exclusive. Other binary classification models are included, which focus on the propensity for a given small molecule to be substrate or inhibitor of proteins governing important pharmacokinetic behaviours. blood-brain barrier: according to the yolk of the BOILED-egg. doi/10.1002/cmdc.201600182. The "High", "low" and "null" predictions were replaced respectively by "Active", "-", and "NP" for not predicted.
Metabolism	CYP1A2-inh	SWISSADME		-	Active/"-" = Inactive/Not predicted	CYP1A2 inhibitor: Cytochrome P450 inhibition (drug-drug interaction). The Support Vector Machine (SVM) method (Cortes, C. & Vapnik, V., 1995) was applied to meticulously cleansed large datasets of known inhibitors/non-inhibitors. In similar contexts, SVM was found to perform better than other machine-learning algorithms for binary classification (Mishra et al. 2010). The models return “Yes” or “No” depending on whether the molecule under investigation is more likely to be an inhibitor or a non-inhibitor of a given CYP. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. Cytochrome P450 1A2 inhibitor: SVM model built on 9145 molecules (training set) and tested on 3000 molecules (test set). 10-fold CV: ACC = 0.83 / AUC = 0.90; external: ACC = 0.84 / AUC = 0.91. Results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.
Metabolism	CYP3A4-inh	PKCSM		-	Active/"-" = Inactive/Not predicted	CYP3A4 inhibitor: Cytochrome P450 3A4 isoform: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, were used for the qualitative predictions (classification tasks). Cytochrome P450 is an important detoxification enzyme in the body, mainly found in the liver. It oxidises xenobiotics to facilitate their excretion. Many drugs are deactivated by the cytochrome P450s, and some can be activated by them. Inhibitors of this enzyme, such as grapefruit juice, can affect drug metabolism and are contraindicated. It is therefore important to assess a compound’s ability to inhibit cytochrome P450. The model for CYP3A4 inhibition was built using over 18561 compounds whose ability to inhibit cytochrome P450 3A4 has been determined. A compound is considered to be a cytochrome P450 inhibitor if the concentration required to produce 50% inhibition is less than 10 uM. The best performing predictor in each task was chosen based on a 5-fold cv approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictors assess whether a given molecule is likely to be a cytochrome P450 inhibitor for a given isoform. The qualitative outputs were relabelled: "Yes" was replaced by "Active" and "No" was replaced by "-".	doi: 10.1021/acs.jmedchem.5b00104
Excretion	CL	PKCSM		27.227	ml/min/kg bw	CLtot: Total Clearance. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression, were used for the quantitative predictions (regression tasks). Drug clearance is measured by the proportionality constant CLtot, and occurs primarily as a combination of hepatic clearance (metabolism in the liver and biliary clearance) and renal clearance (excretion via the kidneys). It is related to bioavailability, and is important for determining dosing rates to achieve steady-state concentrations. This predictor was built using the total clearance data for 398 compounds. The best performing predictor in each task was chosen based on a train/test approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predicted total clearance log(CLtot) of a given compound is given in log(ml/min/kg).	doi: 10.1021/acs.jmedchem.5b00104
Distribution	PPB	ADMETLAB2		63.347	%	PPB: Plasma protein binding: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer, and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN extends the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform. Performance measures for the regression model: R-square (R2) of 0.961, mean absolute error (MAE) of 0.054, and root mean squared error (RMSE) of 0.037. One of the major mechanisms of drug uptake and distribution is through PPB, thus the binding of a drug to proteins in plasma has a strong influence on its pharmacodynamic behavior. PPB can directly influence the oral bioavailability because the free concentration of the drug is at stake when a drug binds to serum proteins in this process. This model is based on 4712 drug-like molecules with PPB values in total, with 3771 in the training set, 479 in the test set, and 480 in the validation set.	doi: 10.3389/fphar.2017.00889
Distribution	Kab	VEGA		1.1978	Adipose/blood partition coefficient (no unit)	Adipose tissue:blood partition coefficient (Kab). Key endpoint for predicting bioaccumulation and pharmacokinetics in humans and animals, since other organ:blood affinities can be estimated as a function of this parameter. The dataset comprises 101 in vivo Kab values measured in rats, retrieved from one review and several papers in the literature ([5]-[10]). It contains mono-constituent organic chemicals belonging to different categories and uses: drugs, plant protection products, polychlorinated biphenyls, volatile organic compounds. All chemical names were converted to SMILES using the CIR and REST nodes of KNIME v 3.5.1; CAS numbers were retrieved from ChemIDplus and PubChem. Consistency among the CAS numbers, the chemical names and the chemical structures of all substances was checked. All structures were standardized and normalized. All duplicates were removed. The experimental data (in vivo tissue:plasma partition coefficients) were converted to adipose tissue:blood partition coefficients by dividing the partition coefficients determined in plasma by the blood-to-plasma ratio. Due to the lack of experimental values of the blood-to-plasma ratio, this value was taken as 0.55 for acidic drugs and 1 for the remaining chemicals. All Kab values were converted to their base-10 logarithms. The dataset was split into training (63) and test (38) sets according to three criteria: 1) the covered range of the experimental activity; 2) the representativeness of the chemical structures (using PCA on PaDEL descriptors); 3) a balanced distribution between training and test set chemicals with respect to their ionisation state. The model is based on the Random Forest (RF) approach, a machine learning algorithm for regression. The number of trees selected for the RF was tuned by identifying the onset of the plateau of the curve describing the Q2LOO as a function of the number of trees. Six 2D descriptors computed with PaDEL-Descriptor v. 2.21 were used: ALogp2, ATSC1s, BCUTp-1l, minHBd, XLogP, WTPT-5. Robustness - Statistics obtained by leave-one-out cross-validation: Q2 LOO = 0.73. Robustness - Statistics obtained by other methods: MAE LOO = 0.41 (Mean Absolute Error calculated for leave-one-out).	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
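The data-preparation step described above (divide the tissue:plasma partition coefficient by the blood-to-plasma ratio, then take the base-10 logarithm) can be sketched as follows. The default ratios of 0.55 for acidic drugs and 1 for other chemicals come from the text; the function name and the example inputs are hypothetical and are not taken from the VEGA dataset.

```python
import math


def log_kab_from_plasma(k_adipose_plasma: float, is_acid: bool) -> float:
    """Convert an adipose:plasma partition coefficient to log10(Kab).

    Kab (adipose:blood) = K(adipose:plasma) / blood-to-plasma ratio,
    with the ratio assumed to be 0.55 for acids and 1.0 otherwise,
    as stated in the text.
    """
    if k_adipose_plasma <= 0:
        raise ValueError("partition coefficient must be positive")
    blood_to_plasma = 0.55 if is_acid else 1.0
    kab = k_adipose_plasma / blood_to_plasma
    return math.log10(kab)


if __name__ == "__main__":
    # Illustrative inputs only.
    print(log_kab_from_plasma(10.0, is_acid=False))
    print(log_kab_from_plasma(5.5, is_acid=True))
```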
Transporter	Pgp Inhibitor	ADMETSAR		-	Active/"-" = Inactive/Not predicted	The P-glycoprotein inhibitors (pgpi+) and noninhibitors (pgpi-) were collected from four research articles. In total, 1172 pgpi+ and 771 pgpi- compounds were obtained after data preparation, which included removing salts and repeated or inorganic compounds. The model was built with the AtomPairs fingerprint and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Transporter	BRCP-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	ABCG2, more commonly referred to as BCRP (Breast Cancer Resistance Protein), is an efflux transporter that serves two major drug transport functions. Firstly, it restricts the distribution of its substrates into organs such as the brain, testes, placenta, and across the gastrointestinal tract (GIT). Secondly, it eliminates its substrates from excretory organs, mediating both biliary and renal excretion, and occasionally direct gut secretion. Although less well studied than e.g. MDR1, BCRP is generally co-expressed with MDR1, and shares many of its substrates, inhibitors and inducers. Of its known substrates, rosuvastatin has been implicated in DDI, especially with perpetrator drugs that also inhibit OATPs (e.g. cyclosporine). It is probable that a synergy exists between the action of BCRP, MDR1, and the drug-metabolizing enzyme CYP3A4, particularly in the GIT. BCRP is included in the list of important drug transporters that both the FDA and EMA consider necessary to investigate regarding liabilities for NCEs. Drugs whose ADME, and bioavailability in particular, is influenced by BCRP may require clinical investigation to reveal a potential DDI with potent clinical BCRP inhibitors. For instance, since the GIT absorption of rosuvastatin is modulated by BCRP, it may be necessary to study the impact of BCRP inhibitors on the oral absorption of rosuvastatin. Because of the potential synergy between BCRP, CYP3A4, and MDR1, a clinical investigation examining the contribution of both drug transporters and enzymes to drug ADME may be necessary. In total 432 inhibitors and 538 noninhibitors were collected and the model was built by AtomPairs fingerprint and Support vector machine.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2D6-inh	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP2D6 inhibitor: The prediction is based on Multi-task Graph Attention (MGA) framework MGA is composed of input, Relation graph convolution network (RGCN) layers, attention layer and fully-connected (FC) layers. In the Input, a node represents the information of an atom, and after passing RGCN layers, the node represents general features of circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in the tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For those endpoints predicted by the regression models concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform, and different performance of classification models in training and validation sets: AUC: 0.973, ACC: 0.880, SP: 0.866 Sen: 0.958, MCC: 0.715. Based on the chemical nature of biotransformation, the process of drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes and these isozymes metabolize approximately two-thirds of known drugs in human with 80% of this attribute to five isozymes––1A2, 3A4, 2C9, 2C19 and 2D6. Most of these CYPs responsible for phase I reactions are concentrated in the liver. 
This model is based on 13073 (2535 positive /10538 negative), Total molecules, with 10471 (2032/8439) in the training set, 1304 (255/1051) test set, and 1298 (250/1048) validation set drug like molecules. Leave-cluster-out validation of classification models was used for model validation. How to interpret: Category 0: Non-substrate / Non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being substrate / inhibitor, within the range of 0 to 1. Empirical decision: If the prediction >= 0.5 the endpoint is considered “active”. If not it is considered as inactive “-“.	doi: 10.1093/nar/gkab255
Metabolism	CYP1A2-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1a2, 2d6, 2c9, 2c19, and 3a4. A compound was assigned as a CYP inhibitor if the AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) value was ≤ 10 μM, and it was considered a noninhibitor if the AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if it had a PubChem activity score equal to 0. Three subgroups of CYP substrates were collected by Carbon-Mangles et al., including 2d6, 2c9, and 3a4.[2] The models were built with MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2C19-inh	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP2C19 inhibitor: The prediction is based on Multi-task Graph Attention (MGA) framework MGA is composed of input, Relation graph convolution network (RGCN) layers, attention layer and fully-connected (FC) layers. In the Input, a node represents the information of an atom, and after passing RGCN layers, the node represents general features of circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) by introducing edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in the tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For those endpoints predicted by the regression models concrete predictive values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best performing models were incorporated into the online platform, and different performance of classification models in training and validation sets: AUC: 0.952, ACC: 0.877, SP: 0.845, Sen: 0.916, MCC: 0.758. Based on the chemical nature of biotransformation, the process of drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes and these isozymes metabolize approximately two-thirds of known drugs in human with 80% of this attribute to five isozymes––1A2, 3A4, 2C9, 2C19 and 2D6. Most of these CYPs responsible for phase I reactions are concentrated in the liver. 
This model is based on 12611 (5770/6841) Total molecules, with 10096 (4618/5478) in the training set, 1257 (577/680) test set, and 1258 (575/683) validation set drug like molecules. Leave-cluster-out validation of classification models was used for model validation. How to interpret: Category 0: Non-substrate / Non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being substrate / inhibitor, within the range of 0 to 1. Empirical decision: If the prediction >= 0.5 the endpoint is considered “active”. If not it is considered as inactive “-“.	doi: 10.1093/nar/gkab255
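The classification statistics quoted for these models (ACC, SP, Sen, MCC) are all derived from a confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical and the counts in the usage line are made-up numbers, not the ADMETlab 2.0 validation data.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity and MCC from counts.

    tp/fn count the positive (Category 1) compounds, tn/fp the negative
    (Category 0) compounds. Undefined ratios are reported as 0.0.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    acc = (tp + tn) / total
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sen": sen, "SP": sp, "MCC": mcc}


if __name__ == "__main__":
    # Illustrative counts only.
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```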
Transporter	Pgp substrate	SWISSADME		Active	Active/"-" = Inactive/Not predicted	Pgp substrates: P-glycoprotein substrates. Implementation within SwissADME enriched the graphical output with the prediction of P-gp substrates; P-gp is the most important active efflux mechanism involved in those biological barriers (refer to the SVM model described in Pharmacokinetics). As a result, the user conveniently obtains on the same graph a global evaluation of passive absorption (inside/outside the white), passive brain access (inside/outside the yolk) and active efflux from the CNS or to the gastrointestinal lumen by colour-coding: blue dots for P-gp substrates (PGP+) and red dots for P-gp non-substrates (PGP−). The SVM model was built on the training set (TR: 1033) and then applied to the test set (TS: 415). 10-fold cross-validation accuracy: 0.72; 10-fold cross-validation area under the receiver operating characteristic (ROC) curve: 0.77; external validation accuracy: 0.89; external validation area under the ROC curve: 0.94.
Distribution	VDss	PKCSM		0.2388	L/Kg	VDss (Human): Distribution Human Volume of Distribution at steady state. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression did the quantitative predictions (regression tasks). The steady state volume of distribution (VDss) is the theoretical volume that the total dose of a drug would need to be uniformly distributed to give the same concentration as in blood plasma. The higher the VD is, the more of a drug is distributed in tissue rather than plasma. It can be affected by renal failure and dehydration. This predictive model was built using the calculated steady state volume of distribution (VDss) in humans from 670 drugs. The best performing predictor in each task was chosen based on Leave-one-out cv and Train/test approach. The Weka toolkit was used for training and testing the models. The predicted logarithm of VDss of a given compound is given as the log L/kg. How to interpret the results: VDss is considered low if below 0.71 L/kg (log VDss < -0.15) and high if above 2.81 L/kg (log VDss > 0.45).	doi: 10.1021/acs.jmedchem.5b00104
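The interpretation rule above (low if log VDss < -0.15, high if log VDss > 0.45) can be sketched as follows. The input is assumed to be a linear VDss in L/kg, as the unit column of the table suggests; if the reported value is already a logarithm, the log step should be skipped. The function name and the "moderate" label for the in-between range are hypothetical.

```python
import math


def classify_vdss(vdss_l_per_kg: float) -> str:
    """Classify a steady-state volume of distribution given in L/kg.

    Thresholds follow the text: log10(VDss) < -0.15 is low (about
    0.71 L/kg), log10(VDss) > 0.45 is high (about 2.81 L/kg).
    """
    if vdss_l_per_kg <= 0:
        raise ValueError("VDss must be positive")
    log_vdss = math.log10(vdss_l_per_kg)
    if log_vdss < -0.15:
        return "low"
    if log_vdss > 0.45:
        return "high"
    return "moderate"  # label assumed; the text names only low and high


if __name__ == "__main__":
    # 0.2388 is the table value, read here as a linear L/kg figure (assumption).
    for value in (0.2388, 1.0, 5.0):
        print(value, classify_vdss(value))
```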
Transporter	OATP1B1-inh	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	OATP1B1 is an uptake transporter exclusively expressed on the sinusoidal side of hepatocytes. It is responsible for the hepatic uptake of drugs and endogenous compounds from the blood. OATP1B1 substrates often, but by no means always, contain a carboxylic acid moiety. Some important therapeutic drugs, most notably HMG-CoA reductase inhibitors, also known as statins, are substrates and/or inhibitors of OATP1B1. Inhibition of OATP1B1 can result in supra-proportional systemic exposure of the victim drug. This is particularly important for members of the statin class of medicines, where elevated blood concentrations due to inhibition of hepatic uptake can result in myopathy and rhabdomyolysis. Complex drug interactions involving OATP1B1, other uptake and efflux transporters, and drug-metabolizing enzymes (DMEs) have been described, as have clinically important genetic polymorphisms, resulting in label recommendations, dose adjustments, and product withdrawals. The FDA and EMA recommend in vitro testing of OATP1B1 interactions for drug candidates that are eliminated in part via the liver and/or will be co-administered with OATP1B1 substrates. The OATP1B1 inhibition model (OATP1B1i) was trained on 1657 inhibitors and 198 noninhibitors using Morgan fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2D6-sub	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1A2, 2D6, 2C9, 2C19, and 3A4. A compound was assigned as a CYP inhibitor if its AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) was ≤ 10 μM, and as a noninhibitor if its AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if its PubChem activity score was 0. Three subgroups of CYP substrates were collected by Carbon-Mangels et al., including 2D6, 2C9, and 3A4.[2] The models were built with MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Transporter	Pgp substrate	PKCSM		Active	Active/"-" = Inactive/Not predicted	Pgp Substrate: The P-glycoprotein is an ATP-binding cassette (ABC) transporter. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, did the qualitative predictions (classification tasks). The best performing predictor in each task was chosen based on a 5-fold cv approach. The Weka toolkit was used for training and testing the models. P-glycoprotein functions as a biological barrier by extruding toxins and xenobiotics out of cells. P-glycoprotein transport screening is performed using transgenic mdr knockout mice and in vitro cell systems. This model was built using 332 compounds that have been characterized for their ability to be transported by Pgp. How to interpret the results: The model predicts whether a given compound is likely to be a substrate of Pgp or not.	doi: 10.1021/acs.jmedchem.5b00104
Metabolism	CYP2D6-inh	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	CYP2D6 Cytochrome P450 inhibition (drug-drug interaction): The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP2D6: CYP inhibitors were taken from ChEMBL (Bento et al., 2014) and classified as inhibitors if the IC50 was below 10 μM. The vNN model was applied to a dataset of 7,805 molecules with a Tanimoto-distance threshold of 0.50: accuracy 0.89, sensitivity 0.61, specificity 0.94, kappa 0.57, coverage 0.75. Results were adapted: the "Yes" and "No" indicators were changed to "Active" and "-", respectively.	doi: 10.3389/fphar.2017.00889.
Transporter	OATP2B1-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	OATP2B1 is a ubiquitously expressed uptake transporter with broad substrate specificity. It mostly transports organic anionic endo- and xenobiotics, and its activity appears to be pH-dependent. OATP2B1 is primarily associated with the oral absorption of drugs, notably fexofenadine, whose PK is altered when intestinal OATPs and/or MDR1 are inhibited. Its expression in the liver and other tissues, as well as the results of in vitro studies, suggest a broader role in drug ADME, DDI, and toxicology; these aspects, however, are not well understood or characterized. The FDA and EMA guidances recommend evaluation of OATP drug interaction liabilities, but do not specifically recommend investigation of OATP2B1. The OATP2B1 inhibition model (OATP2B1i) was trained on 44 inhibitors and 175 noninhibitors using AtomPairs fingerprints and a random forest.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2C19-inh	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	CYP2C19 Cytochrome P450 inhibition (drug-drug interaction): The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP2C19: CYP inhibitors were taken from ChEMBL (Bento et al., 2014) and classified as inhibitors if the IC50 was below 10 μM. The vNN model was applied to a dataset of 8,155 molecules with a Tanimoto-distance threshold of 0.50: accuracy 0.87, sensitivity 0.64, specificity 0.93, kappa 0.58, coverage 0.76. Results were adapted: the "Yes" and "No" indicators were changed to "Active" and "-", respectively.	doi: 10.3389/fphar.2017.00889.
Metabolism	CYP3A4-sub	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP3A4 substrate: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best-performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.948, ACC: 0.887, SP: 0.920, Sen: 0.855, MCC: 0.776. Based on the chemical nature of biotransformation, drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes, and these isozymes metabolize approximately two-thirds of known drugs in humans, with 80% of this attributed to five isozymes: 1A2, 3A4, 2C9, 2C19 and 2D6. Most of the CYPs responsible for phase I reactions are concentrated in the liver.
This model is based on 979 (497/482) drug-like molecules in total, with 786 (397/389) in the training set, 97 (49/48) in the test set, and 96 (51/45) in the validation set. Leave-cluster-out validation was used for model validation. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, in the range 0 to 1. Empirical decision: if the prediction is >= 0.5 the endpoint is considered "Active"; otherwise it is considered inactive ("-").	doi: 10.1093/nar/gkab255
Metabolism	CYP2D6-inh	PKCSM		-	Active/"-" = Inactive/Not predicted	CYP2D6 inhibitor: Cytochrome P450 2D6 isoform: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, did the qualitative predictions (classification tasks). Cytochrome P450 is an important detoxification enzyme in the body, mainly found in the liver. It oxidises xenobiotics to facilitate their excretion. Many drugs are deactivated by the cytochrome P450s, and some can be activated by them. Inhibitors of this enzyme, such as grapefruit juice, can affect drug metabolism and are contraindicated. It is therefore important to assess a compound's ability to inhibit the cytochrome P450. The model for CYP2D6 inhibition was built from over 14741 compounds whose ability to inhibit cytochrome P450 2D6 has been determined. A compound is considered to be a cytochrome P450 inhibitor if the concentration required to produce 50% inhibition is less than 10 μM. The best performing predictor in each task was chosen based on a 5-fold cv approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictor assesses whether a given molecule is likely to be a cytochrome P450 inhibitor for a given isoform. Qualitative results were relabeled: "Yes" was replaced by "Active" and "No" by "-".	doi: 10.1021/acs.jmedchem.5b00104
Absorption	Caco-2 Permeability	PKCSM		1.5849	10-6 cm/s	Caco-2 Permeability. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression, did the quantitative predictions (regression tasks). The best performing predictor in each task was chosen based on 5-fold cv. The Weka toolkit was used for training and testing the models. The Caco-2 cell line is composed of human epithelial colorectal adenocarcinoma cells. The Caco-2 monolayer of cells is widely used as an in vitro model of the human intestinal mucosa to predict the absorption of orally administered drugs. This model is based on 674 drug-like molecules with Caco-2 permeability values and predicts the logarithm of the apparent permeability coefficient (log Papp; log cm/s). How to interpret: A compound is considered to have a high Caco-2 permeability if it has a Papp > 8 × 10^-6 cm/s. For the pkCSM predictive model, high Caco-2 permeability would translate into predicted log Papp values > 0.90.	doi: 10.1021/acs.jmedchem.5b00104
Transporter	OCT2-inh	ADMETSAR		-	Active/"-" = Inactive/Not predicted	OCT2 is a primarily renal uptake transporter that is expressed on the basolateral (blood) side of proximal tubule cells. It plays a key role in the disposition and renal clearance of mostly cationic drugs and endogenous compounds. It functions in conjunction with MATE1 and MATE2-K which facilitate the elimination of OCT2 substrates into the urine. Important clinical substrates include metformin and cisplatin. Gene polymorphisms of OCT2 are associated with altered metformin and cisplatin pharmacokinetics and toxicity, but the role of other cation transporters, and their functional SNPs are also important. Since the discovery of MATEs, DDIs ascribed to OCT2 are being re-evaluated, and it is likely that some interactions may be re-assigned to MATEs. Regardless of this, the role of OCT2 as the first step in active renal secretion of cationic drugs remains important. Current FDA and EMA guidelines recommend evaluation of OCT2 liabilities for drugs with high renal elimination, or which are likely to be co-administered with OCT2 substrates such as metformin. Simultaneous evaluation of MATE interactions is also advisable. In total 244 inhibitors and 633 noninhibitors were collected and the model was built by MACCS fingerprint and random forest.	DOI: 10.1093/bioinformatics/bty707
Metabolism	CYP2C19-inh	PKCSM		-	Active/"-" = Inactive/Not predicted	CYP2C19 inhibitor: Cytochrome P450 2C19 isoform: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, did the qualitative predictions (classification tasks). Cytochrome P450 is an important detoxification enzyme in the body, mainly found in the liver. It oxidises xenobiotics to facilitate their excretion. Many drugs are deactivated by the cytochrome P450s, and some can be activated by them. Inhibitors of this enzyme, such as grapefruit juice, can affect drug metabolism and are contraindicated. It is therefore important to assess a compound's ability to inhibit the cytochrome P450. The model for CYP2C19 inhibition was built from over 14576 compounds whose ability to inhibit cytochrome P450 2C19 has been determined. A compound is considered to be a cytochrome P450 inhibitor if the concentration required to produce 50% inhibition is less than 10 μM. The best performing predictor in each task was chosen based on a 5-fold cv approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictor assesses whether a given molecule is likely to be a cytochrome P450 inhibitor for a given isoform.	doi: 10.1021/acs.jmedchem.5b00104
Absorption	HIA	PKCSM		23.342	% of Absorption	Human Intestinal Absorption (HIA): The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression, did the quantitative predictions (regression tasks). The intestine is normally the primary site for absorption of a drug from an orally administered solution. This method is built to predict the proportion of a compound that is absorbed through the human small intestine. The best performing predictor in each task was chosen based on a train/test approach. The Weka toolkit was used for training and testing the models. The Human Intestinal Absorption (HIA) model is based on information for 552 drugs. How to interpret the results: For a given compound it predicts the percentage that will be absorbed through the human intestine.	doi: 10.1021/acs.jmedchem.5b00104
Absorption	Kp	VEGA	Potts and Guy method	0.0023	10-6 cm/s	The model is based on a dataset of 271 compounds. Following the criteria reported in the OECD guideline 428, only data obtained in compliance with the following features have been kept: - All the data are retrieved from “in vitro” experiments - Data are collected from human skin experiments - Studies concerned skin application of chemicals dissolved in water, aqueous solution, water gel, PBS and distilled water - The buffer solution at a pH of 7.4 - The permeation coefficients were measured under comparable circumstances. The model is an application of the Potts and Guy equation to the entire dataset. For this reason, a splitting into training and test set is not provided.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Metabolism	CYP2C9-sub	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Five subgroups of CYP inhibitors were collected by Cheng et al.,[1] including 1A2, 2D6, 2C9, 2C19, and 3A4. A compound was assigned as a CYP inhibitor if its AC50 (the compound concentration that leads to 50% of the activity of an inhibition control) was ≤ 10 μM, and as a noninhibitor if its AC50 was > 57 μM. In addition, a compound was regarded as a CYP inhibitor if it had a PubChem activity score between 40 and 100, and as a noninhibitor if its PubChem activity score was 0. Three subgroups of CYP substrates were collected by Carbon-Mangels et al., including 2D6, 2C9, and 3A4.[2] The models were built with MACCS fingerprints and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Absorption	HIA	ADMETSAR		-	Active/"-" = Inactive/Not predicted	The intestine is normally the primary site for absorption of a drug from an orally administered solution. This method is built to predict the proportion of a compound that is absorbed through the human small intestine. The entire dataset was collected from Shen et al. (2010) and includes 578 compounds (500 HIA+ and 78 HIA- compounds). How to interpret the results: For a given compound it predicts the percentage that will be absorbed through the human intestine. If the HIA% of a compound is less than 30%, it is labeled "-"; otherwise it is labeled "Active".	DOI: 10.1093/bioinformatics/bty707
Transporter	Pgp Inhibitor	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Pgp inhibitor: inhibitor of P-glycoprotein. The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best-performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 1.000, ACC: 0.994, SP: 0.993, Sen: 0.994, MCC: 0.987. P-glycoprotein, also known as MDR1 or ABCB1, is a membrane protein of the ATP-binding cassette (ABC) transporter superfamily. It is probably the most promiscuous efflux transporter, since it recognizes a number of structurally different and apparently unrelated xenobiotics; notably, many of them are also CYP3A4 substrates.
This model is based on 2209 (1315/894) drug-like molecules in total, with 1764 (1051/713) in the training set, 222 (132/90) in the test set, and 223 (132/91) in the validation set. Leave-cluster-out validation was used for model validation. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor.	doi: 10.1093/nar/gkab255
Metabolism	CYP1A2-inh	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP1A2 inhibitor: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best-performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.972, ACC: 0.914, SP: 0.898, Sen: 0.932, MCC: 0.828. Based on the chemical nature of biotransformation, drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes, and these isozymes metabolize approximately two-thirds of known drugs in humans, with 80% of this attributed to five isozymes: 1A2, 3A4, 2C9, 2C19 and 2D6. Most of the CYPs responsible for phase I reactions are concentrated in the liver.
This model is based on 12635 (5876 positive / 6759 negative) drug-like molecules in total, with 10111 (4702/5425) in the training set, 1261 (588/673) in the test set, and 1263 (586/677) in the validation set. Leave-cluster-out validation was used for model validation. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, in the range 0 to 1. Empirical decision: if the prediction is >= 0.5 the endpoint is considered "Active"; otherwise it is considered inactive ("-").	doi: 10.1093/nar/gkab255
Metabolism	CYP3A4-sub	PKCSM		-	Active/"-" = Inactive/Not predicted	CYP3A4 substrate: Cytochrome P450 substrate 3A4 isoform: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, did the qualitative predictions (classification tasks). The cytochrome P450s are responsible for the metabolism of many drugs. However, inhibitors of the P450s can dramatically alter the pharmacokinetics of these drugs. It is therefore important to assess whether a given compound is likely to be a cytochrome P450 substrate. The two main isoforms responsible for drug metabolism are 2D6 and 3A4. These models were built using 671 compounds whose metabolism by each cytochrome P450 isoform has been measured. The best performing predictor in each task was chosen based on a 5-fold cv approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictor assesses whether a given molecule is likely to be metabolized by either P450. Qualitative results were relabeled: "Yes" was replaced by "Active" and "No" by "-".	doi: 10.1021/acs.jmedchem.5b00104
Absorption	Kp	VEGA	Ten Berge method	0.004	10-6 cm/s	The model is based on a dataset of 271 compounds. Following the criteria reported in the OECD guideline 428, only data obtained in compliance with the following features have been kept: - All the data are retrieved from “in vitro” experiments - Data are collected from human skin experiments - Studies concerned skin application of chemicals dissolved in water, aqueous solution, water gel, PBS and distilled water - The buffer solution at a pH of 7.4 - The permeation coefficients were measured under comparable circumstances. The model is an application of the ten Berge equation to the entire dataset. For this reason, a splitting into training and test set is not provided.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Metabolism	CYP2C19-sub	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP2C19 substrate: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best-performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.974, ACC: 0.928, SP: 0.894, Sen: 0.977, MCC: 0.859. Based on the chemical nature of biotransformation, drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes, and these isozymes metabolize approximately two-thirds of known drugs in humans, with 80% of this attributed to five isozymes: 1A2, 3A4, 2C9, 2C19 and 2D6. Most of the CYPs responsible for phase I reactions are concentrated in the liver.
This model is based on 258 (107/151) drug-like molecules in total, with 206 (85/121) in the training set, 26 (11/15) in the test set, and 26 (11/15) in the validation set. Leave-cluster-out validation was used for model validation. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, in the range 0 to 1.	doi: 10.1093/nar/gkab255
Metabolism	CYP2D6-sub	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	CYP2D6 substrate: The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of input, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best-performing models were incorporated into the online platform. Performance of the classification model on the training and validation sets: AUC: 0.947, ACC: 0.893, SP: 0.849, Sen: 0.937, MCC: 0.788. Based on the chemical nature of biotransformation, drug metabolism reactions can be divided into two broad categories: phase I (oxidative reactions) and phase II (conjugative reactions). The human cytochrome P450 family (phase I enzymes) contains 57 isozymes, and these isozymes metabolize approximately two-thirds of known drugs in humans, with 80% of this attributed to five isozymes: 1A2, 3A4, 2C9, 2C19 and 2D6. Most of the CYPs responsible for phase I reactions are concentrated in the liver.
This model is based on 877 (435/442) drug-like molecules in total, with 703 (347/356) in the training set, 85 (44/41) in the test set, and 89 (44/45) in the validation set. Leave-cluster-out validation was used for model validation. How to interpret: Category 0: non-substrate / non-inhibitor; Category 1: substrate / inhibitor. The output value is the probability of being a substrate / inhibitor, in the range 0 to 1. Empirical decision: if the prediction is >= 0.5 the endpoint is considered "Active"; otherwise it is considered inactive ("-").	doi: 10.1093/nar/gkab255
Metabolism	HLM	vNN-ADMET		Active	Active/"-" = Inactive/Not predicted	HLM: Human liver microsomal stability. The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. The human liver is the most important organ for drug metabolism. For a drug to achieve effective therapeutic concentrations in the body, it cannot be metabolized too rapidly by the liver. Otherwise, it would need to be administered at high doses, which are associated with high toxicity. To identify and exclude rapidly metabolized compounds (Di et al., 2003), pharmaceutical companies commonly use the human liver microsomal (HLM) stability assay. This has led to the accumulation of a substantial body of HLM stability data in publicly accessible databases. However, our knowledge of how enzymes in the HLM assay metabolize drugs remains fragmentary. Therefore, we examined whether the vNN method could effectively predict drugs that are rapidly metabolized by the liver. We retrieved HLM data from the ChEMBL database (Bento et al., 2014), manually curated the data, and classified compounds as stable or unstable based on the reported half-life [T1/2 > 30 min was considered stable, and T1/2 < 30 min unstable (Liu et al., 2015)]. The final dataset contained 3,219 compounds. Of these, we classified 2,047 as stable and 1,166 as unstable. The HLM model performed with an overall accuracy of 81%; sensitivity and specificity values of 71 and 87%, respectively; and a high kappa value of 0.60. The HLM model reliably predicted 91% of the compounds in the HLM dataset when using 10-fold CV. 
Results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.	doi: 10.3389/fphar.2017.00889.
Absorption	Kp	SWISSADME		0.0021	10-6 cm/s	Skin permeation (log Kp): The model is a multiple linear regression (QSPR model) that aims to predict the skin permeability coefficient (Kp). It is adapted from Potts and Guy (1992, Pharm. Res.), who found that Kp correlates linearly with molecular size and lipophilicity (R2 = 0.67).
Absorption	MDCK Permeability	ADMETLAB2		54.9	10-6 cm/s	MDCK Permeability. The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input layer, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best-performing models were incorporated into the online platform. Performance measures for the regression model (training/test): R-squared (R2) of 0.934, mean absolute error (MAE) of 0.140, and root mean squared error (RMSE) of 0.105. Madin-Darby Canine Kidney (MDCK) cells have been developed as an in vitro model for permeability screening. Their apparent permeability coefficient, Papp, is widely considered to be the in vitro gold standard for assessing the uptake efficiency of chemicals into the body. Papp values of MDCK cell lines are also used to estimate the effect of the blood-brain barrier (BBB). 
This model is based on 1140 molecules in total, with 912 in the training set, 114 in the test set, and 114 in the validation set, all drug-like molecules with MDCK permeability values, and predicts the logarithm of the apparent permeability coefficient (log Papp; log cm/s). How to interpret: The unit of predicted MDCK permeability is cm/s. A compound is considered to have high passive MDCK permeability for Papp > 20 x 10-6 cm/s, medium permeability for 2-20 x 10-6 cm/s, and low permeability for < 2 x 10-6 cm/s. Empirical decision: > 2 x 10-6 cm/s: excellent (green); otherwise: poor (red).	doi: 10.1093/nar/gkab255
Metabolism	CYP2C9-inh	SWISSADME		-	Active/"-" = Inactive/Not predicted	CYP2C9 Cytochrome P450 inhibition (drug-drug interaction): The support vector machine (SVM) method (Cortes and Vapnik, 1995) was applied to meticulously cleansed large datasets of known inhibitors/non-inhibitors. In similar contexts, SVM was found to perform better than other machine-learning algorithms for binary classification (Mishra et al. 2010). The models return “Yes” or “No” if the molecule under investigation has a higher probability of being, respectively, an inhibitor or a non-inhibitor of a given CYP. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. CYP2C9: Cytochrome P450 2C9 inhibitor: SVM model built on 5940 molecules (training set) and tested on 2075 molecules (test set). 10-fold CV: ACC = 0.78 / AUC = 0.85; external: ACC = 0.71 / AUC = 0.81. Results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.
Distribution	FU	PKCSM		44.7	%	FU (Human): Human Fraction Unbound. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression, produced the quantitative predictions (regression tasks). Most drugs in plasma exist in equilibrium between an unbound state and a state bound to serum proteins. The efficacy of a given drug may be affected by the degree to which it binds proteins within blood, as the more that is bound, the less efficiently it can traverse cellular membranes or diffuse. This predictive model was built using the measured free proportion of 552 compounds in human blood (Fu). The best-performing predictor in each task was chosen based on leave-one-out CV and a train/test approach. The Weka toolkit was used for training and testing the models. How to interpret the results: For a given compound, the predicted fraction that would be unbound in plasma is calculated and expressed in %.	doi: 10.1021/acs.jmedchem.5b00104
Transporter	Pgp I Inhibitor	PKCSM		-	Active/"-" = Inactive/Not predicted	Pgp inhibitor I: P-glycoprotein I is an ATP-binding cassette (ABC) transporter. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, produced the qualitative predictions (classification tasks). The best-performing predictor in each task was chosen based on a 5-fold CV approach. The Weka toolkit was used for training and testing the models. P-glycoprotein I and II inhibitors: Modulation of P-glycoprotein-mediated transport has significant pharmacokinetic implications for Pgp substrates, which may either be exploited for specific therapeutic advantages or result in contraindications. These predictive models were built using 1273 and 1275 compounds that have been characterized for their ability to inhibit P-glycoprotein I and P-glycoprotein II transport, respectively.	doi: 10.1021/acs.jmedchem.5b00104
Distribution	BBB permeant	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. All information is available at doi: 10.3389/fphar.2017.00889. The dataset comprises 353 compounds whose BBB permeability values (logBB) were obtained from the literature (Muehlbacher et al., 2011; Naef, 2015).	doi: 10.3389/fphar.2017.00889.
Metabolism	CYP1A2-inh	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	CYP1A2 Cytochrome P450 inhibition (drug-drug interaction): The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Cytochrome P450 enzymes (CYPs) constitute a superfamily of proteins that play an important role in the metabolism and detoxification of xenobiotics (Brown et al., 2008). A drug should not be rapidly metabolized by CYPs if it is to maintain an effective concentration. In addition, it should not inhibit drug-metabolizing CYPs, because such an effect could elevate the concentration of a co-administered drug and potentially lead to drug overdose, an effect known as a drug-drug interaction. Cytochrome P450 inhibition (drug-drug interaction): CYP inhibitors were retrieved from ChEMBL (Bento et al., 2014), and compounds were classified as inhibitors if the IC50 was below 10 μM. The vNN model was built on a dataset of 7,558 molecules with a Tanimoto-distance threshold of 0.50: accuracy 0.90, sensitivity 0.70, specificity 0.95, kappa 0.66, coverage 0.75.	doi: 10.3389/fphar.2017.00889.
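
The interpretation rules quoted in the table above (the 0.5 probability cut-off used for classification endpoints, and the MDCK permeability bands) can be sketched as follows. This is only an illustration of the stated rules, not code from any of the cited tools; the function names are hypothetical, and Papp is assumed to be supplied in units of 10-6 cm/s, as in the table.

```python
def classify_probability(p: float, threshold: float = 0.5) -> str:
    """Map a predicted probability to the table's labels: "Active" or "-"."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # A prediction at or above the cut-off is reported as "Active".
    return "Active" if p >= threshold else "-"


def mdck_band(papp: float) -> str:
    """Band an MDCK Papp value (assumed unit: 10^-6 cm/s)."""
    if papp > 20:
        return "high"    # Papp > 20 x 10^-6 cm/s
    if papp >= 2:
        return "medium"  # 2-20 x 10^-6 cm/s
    return "low"         # < 2 x 10^-6 cm/s


if __name__ == "__main__":
    print(mdck_band(54.9))             # MDCK value listed for this compound
    print(classify_probability(0.37))  # arbitrary example probability
```

Under these bands, the MDCK value of 54.9 listed for this compound falls in the high-permeability range.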

Categories
All
Human toxicology
General Toxicology
Genotoxicity/Mutagenicity
Carcinogenicity
Organ toxicology
Developmental/Reproductive Toxicology
Cell toxicology
Endocrine Disruption

Category	Sub category	Endpoint	Tool	QSAR ID	Value	Unit	Comments	Reference
Cell toxicology	Mito-toxicity	MMP	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Mitochondrial membrane potential (MMP), one of the parameters for mitochondrial function, is generated by the mitochondrial electron transport chain, which creates an electrochemical gradient through a series of redox reactions. This gradient drives the synthesis of ATP, a crucial molecule for various cellular processes. Measuring MMP in living cells is commonly used to assess the effect of chemicals on mitochondrial function; decreases in MMP can be detected using lipophilic cationic fluorescent dyes. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise it is noted “-”.	doi: 10.1093/nar/gkab255
Organ toxicology	Hepatotoxicity	Liver NOAEL	VEGA		824.8975	mg/kg bw /d	No-information available	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Carcinogenicity		Carcino	VEGA	IRFMN-ISSCAN-CGX	Active	Active/"-" = Inactive/Not predicted	The dataset used for the extraction of the rules (structural alerts) was based on the ISSCAN database and the CGX dataset. The rules (structural alerts) were extracted with SARpy. The method is based on a set of 43 rules (structural alerts) related to carcinogenic activity. Qualitative information was changed: a "mutagenic" prediction, whatever its quality, was replaced by "Active", and a "Possible NON-Carcinogen" prediction, whatever its quality, was replaced by "-".	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		GR	VEGA_NRMEA		agonist	Agonist/Antagonist/a-anta = agonist and antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Organ toxicology	Cardiotoxicity	hERG Blocker	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	hERG: the human ether-a-go-go related gene. During cardiac depolarization and repolarization, a voltage-gated potassium channel encoded by hERG plays a major role in the regulation of the exchange of cardiac action potential and resting potential. hERG blockade may cause long QT syndrome (LQTS), arrhythmia, and Torsade de Pointes (TdP), which lead to palpitations, fainting, or even sudden death. Result interpretation: Molecules with IC50 more than 10 μM or less than 50% inhibition at 10 μM were classified as hERG- (Category 0), while molecules with IC50 less than 10 μM or more than 50% inhibition at 10 μM were classified as hERG+ (Category 1). The output value is the probability of being hERG+, within the range of 0 to 1.	doi: 10.1093/nar/gkab255
Human toxicology		MRTD	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	The prediction is based on the Multi-task Graph Attention (MGA) framework. MGA is composed of an input layer, relational graph convolution network (RGCN) layers, an attention layer and fully-connected (FC) layers. In the input, a node represents the information of an atom, and after passing through the RGCN layers, the node represents general features of the circular substructure centered on the atom. RGCN is an extension of the standard graph convolution network (GCN) that introduces edge features to enrich the messages used to update the hidden states in the network. Attention layers can assign different attention weights to different substructures, and then generate the customized fingerprints (CFP) from the general features for a specific task. The prediction results are mainly displayed in tabular format in the browser, with the 2D molecular structure and a radar plot summarizing the physicochemical quality of the compound. For endpoints predicted by regression models, concrete predicted values are provided. To obtain robust and accurate prediction models, the model training process was repeated ten times with random data splitting. The best-performing models were incorporated into the online platform. Performance of the classification model in the training and validation sets: AUC: 0.869, ACC: 0.787, SP: 0.766, Sen: 0.810, MCC: 0.575. The half-life of a drug is a hybrid concept that involves clearance and volume of distribution, and it is arguably more appropriate to have reliable estimates of these two properties instead. This model is based on 1197 (561/636) total molecules, with 957 (448/509) in the training set, 120 (56/64) in the test set, and 120 (57/63) in the validation set of drug-like molecules. Leave-cluster-out validation was used to validate the classification models. Result interpretation: MRTD is Active if MRTD ≤ 0.011 mmol/kg bw/day.	doi: 10.1093/nar/gkab255
Carcinogenicity		Fem rat carcino	VEGA		0.1186	[log(1/(mg/kg-day))]	no information	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Cell toxicology	Response to Stress	P53	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	P53, a tumor suppressor protein, is activated following cellular insult, including DNA damage and other cellular stresses. The activation of p53 regulates cell fate by inducing DNA repair, cell cycle arrest, apoptosis, or cellular senescence. The activation of p53, therefore, is a good indicator of DNA damage and other cellular stresses. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise it is noted “-”.	doi: 10.1093/nar/gkab255
Endocrine Disruption		ER	ADMETSAR		Active	Agonist/Antagonist/"-" = Inactive/Not predicted	Binary models for 6 targets implicated in endocrine disruption (ED), namely AR (androgen receptor), ER (estrogen receptor), TR (thyroid receptor), GR (glucocorticoid receptor), PPARγ (peroxisome proliferator-activated receptor γ) and aromatase. All the datasets were collected from Tox21, and a random under-sampling technique was used to achieve a balanced dataset for model training. A multi-label model was developed by combining the best single-label model of each target; the resulting model can be used to determine whether certain endocrine-disrupting chemicals can simultaneously modulate multiple receptors related to ED. Finally, all the binary models and the multi-label model were evaluated on the corresponding single-label test sets and a multi-label test set, respectively, with reasonable reliability.	DOI: 10.1093/bioinformatics/bty707
Organ toxicology	Hepatotoxicity	PXR up liver stea	VEGA		-	Active/"-" = Inactive/Not predicted	Data refer to ToxCast assays ATG_PXR_TRANS_up (AEID: 135) and ATG_PXRE_CIS_up (AEID: 103). Attagene (ATG) assays are cell-based, multiplexed-readout assays that use HepG2, a human liver cell line, with measurements taken 24 hours after chemical dosing in 24-well plates. A consensus of four single models based on Random Forest (RF) and Balanced Random Forest (BRF) was applied to the training dataset of 853 chemicals. The output statistics for goodness of fit were: Balanced Accuracy 0.99, Sensitivity 0.99, Specificity 1, MCC 0.99. TP 509, TN 422, FP 0, FN 3.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Ocular toxicity	EC	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Assessing the eye irritation/corrosion (EI/EC) potential of a chemical is a necessary component of risk assessment. Cornea and conjunctiva tissues comprise the anterior surface of the eye, and hence cornea and conjunctiva tissues are directly exposed to the air and easily suffer injury by chemicals. There are several substances, such as chemicals used in manufacturing, agriculture and warfare, ocular pharmaceuticals, cosmetic products, and household products, that can cause EI or EC. Result interpretation: Category 1: corrosives / irritants chemicals; Category 0: non-corrosives / non-irritants chemicals. The output value is the probability of being toxic, within the range of 0 to 1.	doi: 10.1093/nar/gkab255
Endocrine Disruption		RARa	VEGA_NRMEA		-	Agonist/Antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Developmental/Reproductive Toxicology	Reprotoxicity	Reprotox	ADMETSAR		Active	Agonist/Antagonist/"-" = Inactive/Not predicted	A reproductive toxicity data set of 1823 compounds (861 positive compounds and 962 negative compounds) was collected from the ECHA‐C&L Inventory and OECD‐eChemPortal.[1]	DOI: 10.1093/bioinformatics/bty707
General Toxicology		LD50/ROA	ADMETSAR		561.4368	mg/kg of bw	Rat oral acute toxicity. In total, 10207 molecules with LD50 (mol/kg) against rat were collected from Zhu's work.[1] The model was built with a graph convolutional neural network implemented in DeepChem. https://pubs.acs.org/doi/pdfplus/10.1021/tx900189p. The results are expressed in log(1/(mol/kg)) according to the website and to the publication by Zhu et al.; a conversion to mg/kg was therefore performed.	DOI: 10.1093/bioinformatics/bty707
Organ toxicology	Hepatotoxicity	DILI	ADMETSAR		-	Active/"-" = Inactive/Not predicted	In total, 3115 toxic molecules and 593 nontoxic molecules were collected from publications and databases such as DrugBank by Mulliner et al. All the molecules were prepared in Pipeline Pilot, removing inorganic compounds, large molecules (> 800 Da), and inorganic salts in mixtures. The "+" indicating a hepatotoxic effect was changed to "Active" in the table.	DOI: 10.1093/bioinformatics/bty707
General Toxicology		NOAEL	VEGA		1389.9526	mg/kg of bw/day	NOAEL: All doubtful or inorganic compounds, salts, and mixtures were eliminated, because the relationships between molecular structure and the NOAEL are very complex. We considered only data referring to 90 days of oral administration in rats and rejected reproductive toxicity studies. Note that replacing the 90-day study with shorter testing is an attractive alternative. Taking this into account, values for 28 days of treatment were also considered but, in order to have consistent data, they were divided by a factor of 3, as specified by the Scientific Committee on Consumer Safety (SCCS), to approximate the 90-day NOAEL. After the above selection, about four hundred substances remained, ranging from small molecules (e.g., 2–3 atoms) to extremely large molecules (e.g., 100 or more atoms), and including molecules with specific groups, such as [N+], [NH4+], [nH], etc., and molecules containing many various cycles / heterocycles. Under these circumstances, the following limitations were used in the selection of compounds for the work set: (i) molecules that were too large or too small were removed (in practice, molecules whose SMILES length is less than 70 and greater than 10 symbols were selected); (ii) molecules that have only one cycle or no cycles at all were selected; and (iii) molecules with special groups (indicated by square brackets) were removed from the work set. Thus, a dataset of 140 compounds was selected. All values were converted to decimal logarithms (-log NOAEL). The algorithm is the Monte Carlo method.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Hepatotoxicity	PPARa up liver stea	VEGA		Active	Active/"-" = Inactive/Not predicted	Data refer to ToxCast assay ATG_PPARa_TRANS_up (AEID: 132). Attagene (ATG) assays are cell-based, multiplexed-readout assays that use HepG2, a human liver cell line, with measurements taken 24 hours after chemical dosing in 24-well plates. A consensus of four single models based on Random Forest (RF) and Balanced Random Forest (BRF) was applied to the training dataset of 1057 chemicals. The output statistics for goodness of fit were: Balanced Accuracy 0.76, Sensitivity 0.60, Specificity 0.91, MCC 0.34. TP 30, TN 917, FP 90, FN 20.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Skin toxicity	SkinSen	PKCSM		-	Active/"-" = Inactive/Not predicted	SkinSen: Skin sensitization: The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, produced the qualitative predictions (classification tasks). Skin sensitization is a potential adverse effect of dermally applied products. Evaluating whether a compound that may come into contact with the skin can induce allergic contact dermatitis is an important safety concern. This predictor was built using 254 compounds that have been evaluated for their ability to induce skin sensitization. The best-performing predictor in each task was chosen based on a 5-fold CV approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The model predicts whether a given compound is likely to be associated with skin sensitization. Qualitative information was changed: "Yes" was replaced by "Active" and "No" was replaced by "-".	doi: 10.1021/acs.jmedchem.5b00104
Endocrine Disruption		TRa	VEGA_NRMEA		-	Agonist/Antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
General Toxicology		LD50/ROA	PKCSM		311.0836	mg/kg of bw	Oral rat acute toxicity (LD50). The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression, produced the quantitative predictions (regression tasks). It is important to consider the toxic potency of a potential compound. Lethal dosage values (LD50) are a standard measurement of acute toxicity used to assess the relative toxicity of different molecules. The LD50 is the amount of a compound, given all at once, that causes the death of 50% of a group of test animals. The model was built on over 10207 compounds tested in rats and predicts the LD50 (in mol/kg). The best-performing predictor in each task was chosen based on a 5-fold CV approach. The Weka toolkit was used for training and testing the models. How to interpret the results: the LD50 prediction is given in mol/kg. Values are also expressed in g/kg. (The prediction is based on the first version of the ADMETSAR algorithm.) Given the values obtained from the prediction, we assume that the results are expressed in log(1/(mol/kg)), as for ADMETSAR2. We therefore converted the values as mol/kg = 10^(-value). The results are finally expressed in mg/kg of bw.	doi: 10.1021/acs.jmedchem.5b00104
Organ toxicology	Hepatotoxicity	H-HT	VEGA		-	Active/"-" = Inactive/Not predicted	Only human data from the literature were considered: - Fourches et al. (2010) [2], which contains 950 hepatotoxicity data (drugs) on humans, rodents and non-rodent species. We selected only data referring to humans (650 data) and eliminated the rest. - United States Food and Drug Administration (US FDA) Human Liver Adverse Effects Database [3]. This contains 631 unique pharmaceuticals, 491 of which (non-proprietary data) have adverse drug reaction data for one or more of the 47 liver-effect Coding Symbols for Thesaurus of Adverse Reaction (COSTAR) term endpoints. Since only two compounds were labeled as M (marginally active), we eliminated them in order to reduce the uncertainty of the data set. The two datasets were merged: duplicates and compounds with contrasting experimental values were eliminated, and compounds with concordant experimental activity were counted once. The final data set was fairly balanced, with 510 compounds labeled as hepatotoxic and 440 as non-hepatotoxic. The final dataset was randomly split into a training set (760 mono-constituent organic compounds) and a test set, test set 1 (190 mono-constituent organic compounds). The external validation set (test set 2) was retrieved from the Liver Toxicity Knowledge Base (LTKB) Benchmark Dataset developed by the US FDA [7]. 101 chemicals were selected (after elimination of compounds already present in the dataset), 69 labeled as hepatotoxic and 32 labeled as non-hepatotoxic. The model implemented in VEGA merged test set 1 and test set 2 (external validation set), which therefore consisted of 291 substances (171 labeled hepatotoxic and 120 labeled non-hepatotoxic). The model is a decision tree based on structural alerts (SAs). NON-Toxic predictions (whatever the reliability) are changed to “-”, Toxic predictions (whatever the reliability) are changed to “Active”, and Unknown is changed to NP for no prediction.	
https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Genotoxicity/Mutagenicity	Mutagenicity	AMES	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	The Ames test for mutagenicity. Mutagenicity is closely related to carcinogenicity, and the Ames test is the most widely used assay for testing the mutagenicity of compounds. Result interpretation: Category 0: AMES negative (-); Category 1: AMES positive (+). The output value is the probability of being toxic, within the range of 0 to 1. We adapted the results: if the prediction is >= 0.5, the compound is considered Active.	doi: 10.1093/nar/gkab255
Endocrine Disruption		ER-RBA	VEGA		-	Agonist/Antagonist/"-" = Inactive/Not predicted	The tested substance is added to a system where radio-labeled reference hormone binds to a prescribed quantity of hormone receptor. The chemical concentration that inhibits 50% of the binding of the reference hormone to the receptor is measured and defined as IC50. Then, Relative Binding Affinity (RBA) between IC50 values of the chemical and natural hormone (E2) is defined as the endpoint when the IC50 concentration of natural hormone is set at 100. The final dataset comprised 806 single 2D structures, with the majority of the compounds considered inactive. The dataset was split into training (656 chemicals) and test (150 chemicals). Classification and regression tree (CART) uses the methodology of tree building as a hierarchical classification method. the model is based on 8 physicochemical descriptors. The model Statistics for goodness-of-fit were: Accuracy 0.86, Sensitivity 0.87, Specificity 0.85, MCC 0.70. TP 203, TN 356, FP 64, FN 31.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Genotoxicity/Mutagenicity	Mutagenicity	MicroN-In vivo	VEGA		Not predicted	Active/"-" = Inactive/Not predicted	The models have been developed using a set of 1228 compounds and their experimental results of in vivo micronucleus test, classified as genotoxic (378) and non-genotoxic (850). The model performs a consensus assessment based on the predictions of two single models: 1) SAR in python (SARpy) and 2) k-nearest neighbor (k-NN). The Statistics for goodness-of-fit: Accuracy 0.99, Sensitivity 0.99, Specificity 1.00, MCC 0.99. TP 363, TN 839, FP 2.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Nephrotoxicity	Nephro	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Drug-induced nephrotoxicity has been one of the main reasons for the failure of drug development. Early prediction of the nephrotoxicity for drug candidates is critical to the success of clinical trials. We manually collected 777 valid data, and divided the molecules into negative and positive according to clinical reports and in vivo assay. Our model can be applied to the prediction of nephrotoxicity of Chinese herbal medicines and chemical drugs.	DOI: 10.1093/bioinformatics/bty707
Genotoxicity/Mutagenicity	Mutagenicity	MicroN-In vivo	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	Genotoxicity testing of new chemical entities is an integral part of the drug development process and is a regulatory requirement prior to the approval of new drugs. The in vivo micronucleus assay is commonly used to detect chemical genotoxicity. Chemicals were labeled as positive or negative according to the result of the assay. A total of 641 chemicals with in vivo micronucleus assay results were collected from the available literature and databases. The performance of the binary models in admetSAR was: AUC 0.937, Accuracy 0.87, Sensitivity 0.819, Specificity 0.906.	DOI: 10.1093/bioinformatics/bty707
Cell toxicology	Mito-toxicity	MMP	vNN-ADMET		-	Agonist/Antagonist/"-" = Inactive/Not predicted	MMP: The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. MMP disruption (mitochondrial toxicity): the vNN-based MMP prediction model was developed using 6,261 compounds collected from a previous study that screened a library of 10,000 compounds (∼8,300 unique chemicals) at 15 concentrations, each in triplicate, to measure changes in the MMP in HepG2 cells (Attene-Ramos et al., 2015). The study found that 913 compounds decreased the MMP, whereas 5,395 compounds had no effect. The model made predictions for compounds that were well represented in the applicability domain, but not for any other compound. The model showed a high overall accuracy of 89% and a kappa value of 0.61, with a coverage of 69%. Results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.	doi: 10.3389/fphar.2017.00889.
General Toxicology		LD50/ROA	VEGA		3903.8	mg/kg of bw	LD50 values (mg/kg) were converted to logLD50 (mmol/kg) in order to have a distribution of data more suitable for modelling. The MW used for conversion was calculated as the sum of MWs of the main molecule and of its counterion, if present. There were data referring to the same chemical (i.e., InChi code) in the “Training Dataset” and “Complete LD50 inventory”. They have been defined as follows in this document: Duplicates: two or more records sharing the same InChI code in the “Training Dataset”. The final dataset was composed of 6280 substances, 5029 as training set (TS) and 1251 as validation set (VS). After the implementation in VEGA, the final dataset was composed of 6280 substances (training set). Defining the algorithm - OECD Principle 2 4.1.Type of model: The Acute Toxicity (LD50) model is a Regression model (kNN) based on 6280 substances retrieved from several sources. It is specific for the acute oral systemic toxicity tests in Rats. Explicit algorithm: Regression model (kNN).	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		AR	VEGA		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Androgen Receptor-mediated effect (IRFMN-COMPARA)-assessment. 1689 curated chemical structures with AR experimental activity were provided by the EPA’s National Center for Computational Toxicology as a training set to develop the in silico models. Experimental data were derived from a collection of 11 in vitro HTS assays exploring multiple points in the AR pathway including three receptor binding, two cofactor recruitment, one RNA transcription, three agonist-mode protein production and two antagonist-mode protein production. A chemical was considered as a binder if it was either an active agonist or antagonist. The model provides a qualitative prediction for Androgen Receptor (AR) effects mediated through the AR pathway. The data were used to generate binary classification models to discriminate active (both agonists and antagonists) compounds from inactive ones. It is a two steps model developed with SARpy. In the first step SARpy was used to model the two classes, identifying a set of 127 rules (17 for active and 110 for inactive). Then, a second set of 22 rules identifying active compounds is applied to unpredicted compounds only. Statistics for goodness-of-fit were for training set: accuracy 0.94, sensitivity 0.77, specificity 0.96, MCC 0.70. TP 155, TN 1402, FP 65, FN 42.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Genotoxicity/Mutagenicity	Mutagenicity	AMES	VEGA	KNN-Read-Across	Active	Active/"-" = Inactive/Not predicted	The read-across model was built with the k-Nearest Neighbor (KNN) application and is based on the similarity index of the k most similar compounds. Model based on 5,770 compounds. Qualitative information was transformed: a "mutagenic" prediction is reported as "Active" and a "non-mutagenic" prediction as "-", whatever the quality of the prediction.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Carcinogenicity		Carcino	VEGA	ISS	Active	Active/"-" = Inactive/Not predicted	Decisional algorithm based on rules of toxicity. The model was built as a set of 56 rules, taken from the work of Benigni and Bossa (ISS) as implemented in the software ToxTree, and is based on a training set of 797 compounds. Qualitative information was changed: a "mutagenic" prediction was replaced with "Active" and a "non-mutagenic" prediction with "-", whatever the quality of the prediction.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		PPARg	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	PPAR-gamma: The peroxisome proliferator-activated receptors (PPARs) are lipid-activated transcription factors of the nuclear receptor superfamily with three distinct subtypes, namely PPAR alpha, PPAR delta (also called PPAR beta) and PPAR gamma (PPARg). All these subtypes heterodimerize with the Retinoid X receptor (RXR), and these heterodimers regulate transcription of various genes. The PPAR-gamma receptor (glitazone receptor) is involved in the regulation of glucose and lipid metabolism. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise the molecule is noted “-”.	doi: 10.1093/nar/gkab255
Organ toxicology	Cardiotoxicity	hERG Blocker	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	hERG Blocker: human ether-à-go-go-related gene. The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. hERG: 282 known hERG blockers were retrieved from the literature, and compounds with an IC50 value of 10 μM or less were classified as blockers (Wang et al., 2012). We also collected a set of 404 compounds with IC50 values >10 μM from ChEMBL (Bento et al., 2014) and classified them as non-blockers (Czodrowski, 2013). hERG blockers and non-blockers were classified as positives and negatives, respectively. The hERG model performed with an overall accuracy of 84%, well-balanced sensitivity and specificity values (84 and 83%, respectively), and a kappa value of 0.68. The model reliably predicted 80% of the compounds in our dataset when using 10-fold CV. Results were adapted: the “Yes” and “No” indicators were changed to “Active” and “-”, respectively.	doi: 10.3389/fphar.2017.00889.
Organ toxicology	Skin toxicity	SkinSen	VEGA	NCSTOX	Active	Active/"-" = Inactive/Not predicted	no information	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Human toxicology		MRTD	vNN-ADMET		2.8167	mg/Kg of bw /day	MRTD: maximum recommended therapeutic dose (MRTD). The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, the vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Maximum recommended therapeutic dose: A basic principle of toxicology is that “the dose makes the poison.” For most drugs, the therapeutic dose is limited by toxicity, and the maximum recommended therapeutic dose (MRTD) is an estimated upper daily dose that is safe (Contrera et al., 2004). Investigators carry out toxicological experiments on animals to determine the toxic effects of a drug and the initial dose for human clinical trials. Unfortunately, there is a lack of correlation between animal and human toxicity data. Therefore, we investigated whether the vNN method could predict the MRTD values of new compounds based on known human MRTD data. If so, the values could be used to estimate the starting dose in phase I clinical trials, while significantly reducing the number of animals used in preliminary toxicology studies. Maximum Recommended Therapeutic Dose: MRTD values publicly disclosed by the FDA, mostly of single-day oral doses for an average adult with a body weight of 60 kg, for 1,220 compounds (most of which are small organic drugs). For modeling purposes we converted the MRTD unit from mg/kg-body weight/day to mol/kg-body weight/day via the molecular weight of the compound. However, the predicted values on the website are reported in mg/day based upon an average adult weighing 60 kg. We used an external test set of 160 compounds, which was collected by the FDA for validation. The total dataset for our model contained 1,184 compounds (Liu et al., 2012). The MRTD model reliably predicted 69% of the FDA MRTD dataset, with a Pearson's correlation coefficient (R) of 0.79 between the predicted and measured log(MRTD) values, and a mean deviation (mDev) of 0.56 log units, using 40-fold CV (Liu et al., 2012). To facilitate the comparison between tool predictions, the data expressed in mg/day for an adult of 60 kg were converted to mg/Kg of BW/day.	doi: 10.3389/fphar.2017.00889.
Cell toxicology	Response to Stress	HSE	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Heat shock factor response element. Various chemicals, environmental and physiological stress conditions may lead to the activation of the heat shock response/unfolded protein response (HSR/UPR). There are three heat shock transcription factors (HSFs) (HSF-1, -2, and -4) mediating transcriptional regulation of the human HSR. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise the molecule is noted “-”.	doi: 10.1093/nar/gkab255
Endocrine Disruption		AhR	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	NR-AhR: The Aryl hydrocarbon Receptor (AhR), a member of the family of basic helix-loop-helix transcription factors, is crucial to adaptive responses to environmental changes. AhR mediates cellular responses to environmental pollutants such as aromatic hydrocarbons through induction of phase I and II enzymes but also interacts with other nuclear receptor signaling pathways. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise the molecule is noted “-”.	doi: 10.1093/nar/gkab255
Organ toxicology	Hepatotoxicity	Liver LOAEL	VEGA		632.9946	mg/kg bw/d	No information available	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Carcinogenicity		Carcino	ADMETSAR	Trinary	-	Active/"-" = Inactive/Not predicted	The data set used for model building was compiled from CPDB, which contains a large number of chemical structures (1547 substances) with tumor data in rodents. For these chemicals, the carcinogenic potency is expressed as TD50 values. The data set was prepared in the following steps: (1) Removing mixtures, inorganic, salts and organometallic compounds; (2) Removing compounds that have inconsistent results in different experimental groups; (3) Removing compounds with molecular weights less than 40 or more than 600; (4) Only one stereoisomer was retained because the 2D fingerprints of a pair of stereoisomers are identical. Finally, 476 carcinogens and 440 noncarcinogens were collected. The trinary model was built with MACCS fingerprints and a support vector machine. The "Danger" and "Active" predictions were changed to "Active" and the "non-required" prediction was changed to "-".	DOI: 10.1093/bioinformatics/bty707
Organ toxicology	Ocular toxicity	EC	ADMETSAR		-	Active/"-" = Inactive/Not predicted	A total of 5220 chemicals (3874+/1346-) for a serious eye irritation (EI) dataset and 2299 chemicals (887+/1412-) as an eye corrosion (EC) dataset were collected from available databases and literature. The EI model was built by AtomPairs with support vector machine and the EC model was built by MACCS and support vector machine.	DOI: 10.1093/bioinformatics/bty707
Endocrine Disruption		PR	VEGA_NRMEA		-	Agonist/Antagonist/ a-anta agonist and antagonist /"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Organ toxicology	Skin toxicity	SkinSen Rules	VEGA		Michael Acceptor	Alerts		https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		ARO	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Aromatase: Endocrine disrupting chemicals (EDCs) interfere with the biosynthesis and normal functions of steroid hormones including estrogen and androgen in the body. Aromatase catalyzes the conversion of androgen to estrogen and plays a key role in maintaining the androgen and estrogen balance in many of the EDC-sensitive organs. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise the molecule is noted “-”.	doi: 10.1093/nar/gkab255
Carcinogenicity		Genotox-Carci-muta	ADMETLAB2		5.0	Number of structural alerts	Molecules containing these substructures may cause carcinogenicity or mutagenicity through genotoxic mechanisms. There are 117 substructures in this endpoint.	doi: 10.1093/nar/gkab255
Endocrine Disruption		ER	VEGA		-	Agonist/Antagonist/"-" = Inactive/Not predicted	The model was built as a set of rules, extracted with SARpy software from a dataset obtained from a collection of high-quality estrogen receptor (ER) signaling data (1529 chemicals screened across 18 high-throughput screening assays integrated into a single score) from the ToxCast program. The model is based on 59 rules. Statistics for goodness-of-fit after the implementation in VEGA were: n = 1529, not predicted = 241, Accuracy 0.97, Sensitivity 0.85, Specificity 0.97, MCC 0.70. TP 60, TN 1179, FP 38, FN 11.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Skin toxicity	SkinSen	ADMETSAR		-	Active/"-" = Inactive/Not predicted	A large data set of 1007 compounds and their experimental LLNA data were collected from two databases including the OECD's eChemPortal database and the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) database. The compounds were classified as negative and positive based on their EC3 values according to the following convention: Negative (without EC3) and Positive (with EC3).[1]	DOI: 10.1093/bioinformatics/bty707
Endocrine Disruption		TR	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Binary models for 6 targets implicated in endocrine disruption (ED), namely AR (androgen receptor), ER (estrogen receptor), TR (thyroid receptor), GR (glucocorticoid receptor), PPARγ (peroxisome proliferator-activated receptor γ) and Aromatase. All the datasets were collected from Tox21, and a random under-sampling technique was used to achieve a balanced dataset for model training. A multi-label model was developed by combining the best single-label model of each target; the resulting model can be used to distinguish whether certain endocrine disrupting chemicals can simultaneously modulate multiple receptors related to ED. Finally, all the binary models and the multi-label model were evaluated on the corresponding single-label test sets and a multi-label test set, respectively, with reasonable reliability.	DOI: 10.1093/bioinformatics/bty707
Developmental/Reproductive Toxicology	Developmental toxicity	Dev tox	VEGA		-	Agonist/Antagonist/"-" = Inactive/Not predicted	The data set was split into training (234 substances) and test sets (58 substances) using rational design, by CAESAR Partner Helmholtz-Zentrum für Umweltforschung, using ChemProp. QSAR classification model for Developmental Toxicity based on a Random Forest method implemented using WEKA open-source libraries. EPA descriptors were used for modeling; these are descriptors calculated using the Toxicity Estimation Software Tool (T.E.S.T.). The selected number of descriptors is 13. Statistics for goodness-of-fit for the training set were: n = 234; Accuracy 100%; FP rate 0%; FN rate 0%; PPV 100%; NPV 100%; Sensitivity 100%; Specificity 100%; Nb unpredicted 0.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		ARO	VEGA	TOX21	Agonist	Agonist/Antagonist/"-" = Inactive/Not predicted	no information	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Hepatotoxicity	DILI	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Drug-induced liver injury (DILI) has become the most common safety problem leading to drug withdrawal from the market over the past 50 years. Result interpretation: Category 0: DILI negative(-); Category 1: DILI positive(+). The output value is the probability of being toxic, within the range of 0 to 1. Empirical decision: 0-0.3: excellent (green); 0.3-0.7: medium (yellow); 0.7-1.0(++): poor (red). We adapted the results: if the prediction is >= 0.5, the compound is considered Active.	doi: 10.3389/fphar.2017.00889
Genotoxicity/Mutagenicity	Mutagenicity	AMES	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	Mutagens : Chemical mutagenicity (AMES test). The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Mutagens are chemicals that cause abnormal genetic mutations leading to cancer. A common way to assess a chemical's mutagenicity is the Ames test (Ames et al., 1973). This test has become the standard for assessing the safety of chemicals and drugs, and has been used to test thousands of molecules. We examined whether the vNN method could effectively use existing data to predict mutagenicity. Ames mutagenicity dataset consisting of 6,512 compounds, of which 3,503 were Ames-positive (Hansen et al., 2009), and developed a vNN Ames mutagenicity prediction model. The model performed well, with an overall accuracy of 82%; sensitivity and specificity values of 86 and 75%, respectively; and a high kappa value of 0.62. The model also reliably predicted 79% of the compounds in the Ames dataset when using 10-fold CV.	doi: 10.3389/fphar.2017.00889.
Endocrine Disruption		ER-LBD	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	ER-LBD: Estrogen receptor (ER), a nuclear hormone receptor, plays an important role in development, metabolic homeostasis and reproduction. Two subtypes of ER, ER-alpha and ER-beta, have similar expression patterns with some uniqueness in both types. Endocrine disrupting chemicals (EDCs) and their interactions with steroid hormone receptors like ER cause disruption of normal endocrine function. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise the molecule is noted “-”.	doi: 10.1093/nar/gkab255
Organ toxicology	Skin toxicity	SkinSen	VEGA	CAESAR	Active	Active/"-" = Inactive/Not predicted	Skin sensitisation in mouse (local lymph node assay model), OECD 429. The final dataset is composed of 209 mono-constituent organic compounds. The dataset was randomly split into a training set and a test set containing 80% (167) and 20% (42) of the compounds, respectively. The model consists of an Adaptive Fuzzy Partition (AFP) based on 8 descriptors. The AFP outputs two values that represent the degree of membership in the sensitizer and non-sensitizer classes, respectively. Statistics on the training set: Accuracy: 91%, Sensitivity: 95%, Specificity: 74%. TP 127, TN 27, FP 7, FN 6.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Cell toxicology	Mito-toxicity	MMP	ADMETSAR		Active	Agonist/Antagonist/"-" = Inactive/Not predicted	The chemicals associated with mitochondrial toxicity were collected from Pubchem bioassay database and DrugBank, and Zhang's dataset.[1] In total, 1440 positive chemicals which can cause membrane potential drop and 1089 negative chemicals that have been marketed but without related mitochondrial toxicity and side effects were collected from Zhao et al.[2] The model was built by MACCS and random forest.	DOI: 10.1093/bioinformatics/bty707
Genotoxicity/Mutagenicity	Mutagenicity	AMES	VEGA	SarPy-IRFMN	Active	Active/"-" = Inactive/Not predicted	Model based on a set of rules extracted from a set of 4,000 compounds that were used for defining structural alerts (SAs) by SARpy software without any ‘a priori’ knowledge. There are 112 rules for mutagenicity and 93 rules for non-mutagenicity. Qualitative information was transformed: a "mutagenic" prediction is reported as "Active" and a "non-mutagenic" prediction as "-", whatever the quality of the prediction.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Carcinogenicity		Carcino	VEGA	CAESAR	-	Active/"-" = Inactive/Not predicted	It uses a Counter Propagation Artificial Neural Network (CP ANN) consisting of two layers of neurons arranged in a two-dimensional rectangular matrix. The algorithm is based on 12 descriptors and 645 chemicals as the training set. Qualitative information was changed: a "mutagenic" prediction was replaced with "Active" and a "non-mutagenic" prediction with "-", whatever the quality of the prediction.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		PPARg	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	Binary models for 6 targets implicated in endocrine disruption (ED), namely AR (androgen receptor), ER (estrogen receptor), TR (thyroid receptor), GR (glucocorticoid receptor), PPARγ (peroxisome proliferator-activated receptor γ) and Aromatase. All the datasets were collected from Tox21, and a random under-sampling technique was used to achieve a balanced dataset for model training. A multi-label model was developed by combining the best single-label model of each target; the resulting model can be used to distinguish whether certain endocrine disrupting chemicals can simultaneously modulate multiple receptors related to ED. Finally, all the binary models and the multi-label model were evaluated on the corresponding single-label test sets and a multi-label test set, respectively, with reasonable reliability.	DOI: 10.1093/bioinformatics/bty707
Organ toxicology	Cardiotoxicity	hERG Blocker	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	The original chemicals with experimental IC50 values were collected from the literature and the ChEMBL database by Zhang et al.[1] Only patch-clamp-determined IC50 values on different mammalian cell lines were collected in this study. In total, 717 toxic molecules (IC50 < 30 μM) and 261 nontoxic molecules were collected. The model was built with AtomPairs and a support vector machine.	DOI: 10.1093/bioinformatics/bty707
Cell toxicology	Genome Instability	ATAD5	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	ATAD5: ATPase family AAA domain-containing protein 5. Cancer cells divide rapidly, and during every cell division they need to duplicate their genome by DNA replication. Failure to do so results in cancer cell death. Based on this concept, many chemotherapeutic agents were developed, but they have limitations such as low efficacy and severe side effects. Enhanced Level of Genome Instability Gene 1 (ELG1; human ATAD5) protein levels increase in response to various types of DNA damage. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise the molecule is noted “-”.	doi: 10.1093/nar/gkab255
Endocrine Disruption		AR-LBD	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	AR-LBD: Androgen receptor (AR), a nuclear hormone receptor, plays a critical role in AR-dependent prostate cancer and other androgen-related diseases. Endocrine disrupting chemicals (EDCs) and their interactions with steroid hormone receptors like AR may cause disruption of normal endocrine function as well as interfere with metabolic homeostasis, reproduction, developmental and behavioral functions. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered “Active”; otherwise the molecule is noted “-”.	doi: 10.1093/nar/gkab255
Carcinogenicity		Carcino	ADMETSAR	Binary	-	Active/"-" = Inactive/Not predicted	The data set used for model building was compiled from CPDB, which contains a large number of chemical structures (1547 substances) with tumor data in rodents. For these chemicals, the carcinogenic potency is expressed as TD50 values. The data set was prepared in the following steps: (1) Removing mixtures, inorganic, salts and organometallic compounds; (2) Removing compounds that have inconsistent results in different experimental groups; (3) Removing compounds with molecular weights less than 40 or more than 600; (4) Only one stereoisomer was retained because the 2D fingerprints of a pair of stereoisomers are identical. Finally, 476 carcinogens and 440 noncarcinogens were collected, and the binary model was built with Morgan fingerprints and the k-nearest neighbors method. The trinary model was built with MACCS fingerprints and a support vector machine. The "+" prediction was changed to "Active".	DOI: 10.1093/bioinformatics/bty707
Endocrine Disruption		MR	VEGA_NRMEA		-	Agonist/Antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Organ toxicology	Respiratory toxicity	Respiratory	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Among these safety issues, respiratory toxicity has become the main cause of drug withdrawal. Drug-induced respiratory toxicity is usually underdiagnosed because it may not have distinct early signs or symptoms in common medications and can occur with significant morbidity and mortality. Therefore, careful surveillance and treatment of respiratory toxicity is of great importance. Result interpretation: Category 1: respiratory toxicants; Category 0: non-respiratory toxicants. The output value is the probability of being toxic, within the range of 0 to 1.	doi: 10.1093/nar/gkab255
Endocrine Disruption		ARO	ADMETSAR		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Binary models for 6 targets implicated in endocrine disruption (ED), namely AR (androgen receptor), ER (estrogen receptor), TR (thyroid receptor), GR (glucocorticoid receptor), PPARγ (peroxisome proliferator-activated receptor γ) and Aromatase. All the datasets were collected from Tox21, and a random under-sampling technique was used to achieve a balanced dataset for model training. A multi-label model was developed by combining the best single-label model of each target; the resulting model can be used to distinguish whether certain endocrine disrupting chemicals can simultaneously modulate multiple receptors related to ED. Finally, all the binary models and the multi-label model were evaluated on the corresponding single-label test sets and a multi-label test set, respectively, with reasonable reliability.	DOI: 10.1093/bioinformatics/bty707
Carcinogenicity		Non-Genotox-Carc	ADMETLAB2		1.0	Number of structural alerts	Molecules containing these substructures may cause carcinogenicity through nongenotoxic mechanisms. There are 23 substructures in this endpoint.	doi: 10.1093/nar/gkab255
Organ toxicology	Hepatotoxicity	PPARg up liver stea	VEGA		-	Active/"-" = Inactive/Not predicted	Data refer to the ToxCast assay ATG_PPARg_TRANS_up (AEID: 134). Attagene (ATG) assays are cell-based, multiplexed-readout assays that use HepG2, a human liver cell line, with measurements taken 24 hours after chemical dosing in a 24-well plate. The consensus of four single models based on 1) Random Forest (RF) and Balanced Random Forest (BRF) was applied to the training dataset of 908 chemicals. The output statistics for goodness of fit were Balanced Accuracy: 0.97, Sensitivity: 0.99, Specificity: 0.94, MCC: 0.88. TP 211, TN 655, FP 41, FN 1.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Ocular toxicity	EI	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Assessing the eye irritation/corrosion (EI/EC) potential of a chemical is a necessary component of risk assessment. Cornea and conjunctiva tissues comprise the anterior surface of the eye, and hence cornea and conjunctiva tissues are directly exposed to the air and easily suffer injury by chemicals. There are several substances, such as chemicals used in manufacturing, agriculture and warfare, ocular pharmaceuticals, cosmetic products, and household products, that can cause EI or EC. Result interpretation: Category 1: corrosives / irritants chemicals; Category 0: non-corrosives / non-irritants chemicals. The output value is the probability of being toxic, within the range of 0 to 1.	doi: 10.1093/nar/gkab255
Endocrine Disruption		RARr	VEGA_NRMEA		-	Agonist/Antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Carcinogenicity		inhal carcino	VEGA		-	Active/"-" = Inactive/Not predicted	The RAIS database includes inhalation slope factor (ISF) values only for chemicals with carcinogenic effects, so chemicals with a defined ISF value were considered carcinogenic and compounds with no value were considered non-carcinogenic. The slope factor is an upper-bound estimate of risk per increment of dose for carcinogens; it can be used to assess the lifetime increase in cancer incidence in humans from inhalation exposure to a dose of a carcinogenic chemical. The final dataset for the classification model included 598 compounds (210 positive, 388 negative). The model is a classification and regression tree (CART) based on 9 molecular descriptors. Goodness-of-fit statistics: Accuracy = 0.81, Sensitivity = 0.73, Specificity = 0.86 (TP 154, TN 333, FP 55, FN 56).	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Hepatotoxicity	DILI	vNN-ADMET		-	Active/"-" = Inactive/Not predicted	DILI: The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach, the vNN method, uses all nearest neighbors that meet a predetermined structural similarity criterion, which also defines the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. Drug-induced liver injury (DILI) has been one of the most commonly cited reasons for drug withdrawals from the market. This application predicts whether a compound could cause DILI. The dataset of 1,431 compounds was obtained from four sources used by Xu et al. and contains both pharmaceuticals and non-pharmaceuticals. A compound was classified as causing DILI if it was associated with a high risk of DILI, and as not causing DILI if there was no such risk. More information is available at doi: 10.3389/fphar.2017.00889. "Yes"/"No" predictions are reported as "Active"/"-".	doi: 10.3389/fphar.2017.00889.
Genotoxicity/Mutagenicity	Mutagenicity	AMES	PKCSM		-	Active/"-" = Inactive/Not predicted	AMES test: mutagenicity prediction based on the Ames test. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, produced the qualitative predictions (classification tasks). The Ames test is a widely employed method to assess a compound's mutagenic potential using bacteria. A positive test indicates that the compound is mutagenic and therefore may act as a carcinogen. This predictive model was built on the Ames test results of over 8445 compounds. The best performing predictor in each task was chosen based on a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: It predicts whether a given compound is likely to be Ames positive and hence mutagenic.	doi: 10.1021/acs.jmedchem.5b00104
Endocrine Disruption		ERb	VEGA_NRMEA		-	Agonist/Antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Organ toxicology	Hepatotoxicity	NRF2 up liver stea	VEGA		-	Active/"-" = Inactive/Not predicted	Data refer to the ToxCast assay ATG_NRF2_ARE_CIS_up (AEID: 97). Attagene (ATG) assays are cell-based, multiplexed-readout assays that use HepG2, a human liver cell line, with measurements taken 24 hours after chemical dosing in 24-well plates. A consensus of four single models based on Random Forest (RF) and Balanced Random Forest (BRF) was applied to the training dataset of 853 chemicals. The goodness-of-fit statistics were Balanced Accuracy: 0.99, Sensitivity: 1.00, Specificity: 0.99, MCC: 0.98 (TP 276, TN 570, FP 7, FN 0).	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		VDR	VEGA_NRMEA		-	Agonist/Antagonist/ a-anta agonist and antagonist /"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Cell toxicology	Sub-loc	Sub-loc	ADMETSAR		Mitochondria		no information available	DOI: 10.1093/bioinformatics/bty707
Carcinogenicity		inhal carcino	VEGA		0.0955	mg/kg-day		https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		AR	ADMETLAB2		Active		AR: Androgen receptor (AR), a nuclear hormone receptor, plays a critical role in AR-dependent prostate cancer and other androgen-related diseases. Endocrine disrupting chemicals (EDCs) and their interactions with steroid hormone receptors like AR may disrupt normal endocrine function and interfere with metabolic homeostasis, reproduction, and developmental and behavioral functions. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered "Active"; otherwise it is noted "-".	doi: 10.1093/nar/gkab255
Genotoxicity/Mutagenicity	Mutagenicity	AMES	VEGA	ISS	Active	Active/"-" = Inactive/Not predicted	The model was built as a set of 69 rules taken from the work of Benigni and Bossa (ISS), as implemented in the ToxTree software, and is based on a training set of 670 compounds. Qualitative predictions were transformed: "mutagenic" is reported as "Active" and "non-mutagenic" as "-", regardless of the reliability of the prediction.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Carcinogenicity		Carcino	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Among various toxicological endpoints of chemical substances, carcinogenicity is of great concern because of its serious effects on human health. The carcinogenic mechanism of chemicals may be due to their ability to damage the genome or disrupt cellular metabolic processes. Many approved drugs have been identified as carcinogens in humans or animals and have been withdrawn from the market. Result interpretation: Category 1: carcinogens; Category 0: non-carcinogens. Chemicals are labelled as active (carcinogens) or inactive (non-carcinogens) according to their TD50 values. The output value is the probability of being toxic, within the range of 0 to 1. We adapted the results, if Prediction Value >= 0.5 the compound is considered as “Active”, if not the value is replaced by “-“.	doi: 10.1093/nar/gkab255
Organ toxicology	Cardiotoxicity	hERG II Blocker	PKCSM		-	Active/"-" = Inactive/Not predicted	hERG II Inhibitor: human ether-a-go-go gene II inhibitor. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, produced the qualitative predictions (classification tasks). Inhibition of the potassium channels encoded by hERG (human ether-a-go-go gene) is the principal cause of acquired long QT syndrome, which can lead to fatal ventricular arrhythmia. Inhibition of hERG channels has resulted in the withdrawal of many substances from the pharmaceutical market. This predictor was built using hERG II inhibition information for 806 compounds. The best performing predictor in each task was chosen based on a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictor determines whether a given compound is likely to be a hERG II inhibitor. Qualitative predictions were changed: "Yes" was replaced by "Active" and "No" by "-".	doi: 10.1021/acs.jmedchem.5b00104
Organ toxicology	Skin toxicity	SkinSen	VEGA	IRFMN-JRC	Active	Active/"-" = Inactive/Not predicted	Skin sensitisation on mouse (local lymph node assay model) OECD 429. The training set contains 264 compounds. The test set counts 68 compounds. The model consists in Decision trees based on 8 descriptors. Statistics for goodness-of-fit: Training set: n = 264; Accuracy = 0.80; Specificity = 0.79; Sensitivity = 0.81 TP: 145, TN: 66, FP:18, FN: 35.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Carcinogenicity		oral carcino	VEGA		-	Active/"-" = Inactive/Not predicted	The RAIS database includes oral slope factor (OSF) values only for chemicals with carcinogenic effects, so chemicals with a defined OSF value were considered carcinogenic and compounds with no value were considered non-carcinogenic. The slope factor is an upper-bound estimate of risk per increment of dose for carcinogens; it can be used to assess the lifetime increase in cancer incidence in humans from oral or inhalation exposure to a dose of a carcinogenic chemical. The final dataset for the classification model included 593 compounds (257 positive, 336 negative). The model is a classification and regression tree (CART) based on 7 molecular descriptors. Goodness-of-fit statistics: Accuracy = 0.81, Sensitivity = 0.82, Specificity = 0.79 (TP 211, TN 267, FP 69, FN 46).	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Human toxicology		MRTD	PKCSM		0.174	mg/Kg of bw /day	MRTD: Max. Recommended Therapeutic Dose (MRTD). The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression did the quantitative predictions (regression tasks). The maximum recommended tolerated dose (MRTD) provides an estimate of the toxic dose threshold of chemicals in humans. The model is trained using 1222 experimental data points from human clinical trials and predicts the logarithm of the MRTD (log mg/kg/day). This will help guide the maximum recommended starting dose for pharmaceuticals in phase I clinical trials, which are currently based on extrapolations from animal data. The best performing predictor in each task was chosen based 10-fold CV approach. The Weka toolkit was used for training and testing the models. How to interpret the results: For a given compound, a MRTD of less than or equal to 0.477 log(mg/kg/day) is considered low, and high if greater than 0.477 log(mg/kg/day). To facilitate the comparison between tool predictions, the data expressed in log(mg/kg/day) were transformed in mg/Kg of BW/day.	doi: 10.1021/acs.jmedchem.5b00104
Cell toxicology	Oxydative stress	ARE	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	ARE: Oxidative stress has been implicated in the pathogenesis of a variety of diseases ranging from cancer to neurodegeneration. The antioxidant response element (ARE) signaling pathway plays an important role in the amelioration of oxidative stress. The CellSensor ARE-bla HepG2 cell line (Invitrogen) can be used for analyzing the Nrf2/antioxidant response signaling pathway. Nrf2 (NF-E2-related factor 2) and Nrf1 are transcription factors that bind to AREs and activate these genes. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered "Active"; otherwise it is noted "-".	doi: 10.1093/nar/gkab255
Carcinogenicity		Male rat carcino	VEGA		-6.0682	[log(1/(mg/kg-day))]	no information	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
General Toxicology		LD50/ROA	ADMETLAB2		Active	Active/"-" = Inactive/Not predicted	Determination of acute toxicity in mammals (e.g. rats or mice) is one of the most important tasks for the safety evaluation of drug candidates. Result interpretation: Category 0: low-toxicity, > 500 mg/kg; Category 1: high-toxicity; < 500 mg/kg. The output value is the probability of being toxic, within the range of 0 to 1. We adapted the results, if prediction >= 0.5 the compound is considered as Active.	doi: 10.1093/nar/gkab255
Endocrine Disruption		AR-LBD	ADMETSAR		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Binary models for 6 targets implicated in endocrine disruption (ED), namely AR (androgen receptor), ER (estrogen receptor), TR (thyroid receptor), GR (glucocorticoid receptor), PPARγ (peroxisome proliferator-activated receptor γ) and aromatase. All datasets were collected from Tox21, and random under-sampling was used to balance each dataset for model training. A multi-label model was developed by combining the best single-label model for each target; it can be used to determine whether an endocrine-disrupting chemical modulates several ED-related receptors simultaneously. The binary models and the multi-label model were evaluated on the corresponding single-label test sets and a multi-label test set, respectively, with reasonable reliability.	DOI: 10.1093/bioinformatics/bty707
Carcinogenicity		Carcino	VEGA	IRFMN-Antares	Active	Active/"-" = Inactive/Not predicted	The model has been built as a set of 127 rules, extracted with SARpy software (based on molecular fragments) from a dataset obtained from the carcinogenicity database of EU-funded project ANTARES. 1,543 compounds were used as dataset. Qualitative information were changed: "mutagenic" whatever the quality of the prediction was replaced to "Active" and "Possible non-mutagenic" prediction was replaced to "-" whatever the quality of the prediction.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		GR	ADMETSAR		-	Active/"-" = Inactive/Not predicted	Binary models for 6 targets implicated in endocrine disruption (ED), namely AR (androgen receptor), ER (estrogen receptor), TR (thyroid receptor), GR (glucocorticoid receptor), PPARγ (peroxisome proliferator-activated receptor γ) and aromatase. All datasets were collected from Tox21, and random under-sampling was used to balance each dataset for model training. A multi-label model was developed by combining the best single-label model for each target; it can be used to determine whether an endocrine-disrupting chemical modulates several ED-related receptors simultaneously. The binary models and the multi-label model were evaluated on the corresponding single-label test sets and a multi-label test set, respectively, with reasonable reliability.	DOI: 10.1093/bioinformatics/bty707
Organ toxicology	Respiratory toxicity	Respiratory	ADMETSAR		Active	Active/"-" = Inactive/Not predicted	In total, 2529 compounds (1440+/1089-) were obtained from three databases. The positive data are compounds that have adverse effects on the human respiratory system, and the negative data are substances that are harmless to the respiratory system, including respiratory non-sensitizers and skin non-sensitizers.[1]	DOI: 10.1093/bioinformatics/bty707
Organ toxicology	Skin toxicity	SkinSen Rules	ADMETLAB2		3.0	Number of alerts	Molecules containing these substructures may cause skin irritation. There are 155 substructures in this endpoint.	doi: 10.1093/nar/gkab255
Carcinogenicity		oral carcino	VEGA		0.182	mg/kg BW - day	The RAIS database includes oral slope factor (OSF) values only for chemicals with carcinogenic effects, so chemicals with a defined OSF value were considered carcinogenic and compounds with no value were considered non-carcinogenic. The slope factor is an upper-bound estimate of risk per increment of dose for carcinogens; it can be used to assess the lifetime increase in cancer incidence in humans from oral or inhalation exposure to a dose of a carcinogenic chemical. The final dataset for the model included 315 compounds, of which 226 were used for training. The model is a multi-layer perceptron artificial neural network (MLP-ANN) based on 12 molecular descriptors. Goodness-of-fit statistics: R2 0.70, RMSE 0.88.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		EDC-s	VEGA		-	Agonist/Antagonist/"-" = Inactive/Not predicted	No description	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
General Toxicology		LOAEL	PKCSM		805.3784	mg/kg of bw/day	LOAEL: Toxicity Oral Rat Chronic Toxicity (LOAEL). The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Gaussian Processes and Model Tree Regression did the quantitative predictions (regression tasks). It is important to consider the toxic potency of a potential compound. Exposure to low-moderate doses of chemicals over long periods of time is of significant concern in many treatment strategies. Chronic studies aim to identify the lowest dose of a compound that results in an observed adverse effect (LOAEL), and the highest dose at which no adverse effects are observed (NOAEL). This predictor was built using the LOAEL results from 445 compounds. The best performing predictor in each task was chosen based Leave-one-out approach. The Weka toolkit was used for training and testing the models How to interpret the results: For a given compound, the predicted log Lowest Observed Adverse Effect (LOAEL) in log (mg/kg_bw/day) will be generated and convert in mg/kg bw/day. The LOAEL results need to be interpreted relative to the bioactive concentration and treatment lengths required.	doi: 10.1021/acs.jmedchem.5b00104
Endocrine Disruption		ER	ADMETLAB2		-	Agonist/Antagonist/"-" = Inactive/Not predicted	ER: Estrogen receptor (ER), a nuclear hormone receptor, plays an important role in development, metabolic homeostasis and reproduction. Endocrine disrupting chemicals (EDCs) and their interactions with steroid hormone receptors like ER cause disruption of normal endocrine function. Therefore, it is important to understand the effect of environmental chemicals on the ER signaling pathway. Traditional multitask graph neural network (GNN) methods usually handle homogeneous tasks, such as pure regression or classification tasks. However, in ADMET prediction, both regression tasks and classification tasks are needed. Therefore, a multi-task graph attention (MGA) framework was used to simultaneously learn the regression and classification tasks for ADMET predictions in this study. Result interpretation: Category 1: actives; Category 0: inactives. The output value is the probability of being active, within the range of 0 to 1. Empirical decision: if the prediction is greater than or equal to 0.5, the molecule is considered "Active"; otherwise it is noted "-".	doi: 10.1093/nar/gkab255
Genotoxicity/Mutagenicity	Mutagenicity	Chro-Ab	VEGA		Active	Active/"-" = Inactive/Not predicted	Data for chromosomal aberrations determined by in vitro test using Chinese hamster lung (CHL) and ovary (CHO) cells, with and without metabolic activation (metabolic system S9). After the implementation in VEGA, the dataset was split in training (442 chemicals) and test (35 chemicals). One-variable model based on SMILES-derived descriptors. Training set: n = 442, Balanced Accuracy 0.77, Sensitivity 0.72, Specificity 0.81, MCC 0.54. TP 149, TN 191, FP 44, FN 58.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Ocular toxicity	EI	ADMETSAR		-	Active/"-" = Inactive/Not predicted	A total of 5220 chemicals (3874+/1346-) for a serious eye irritation (EI) dataset and 2299 chemicals (887+/1412-) as an eye corrosion (EC) dataset were collected from available databases and literature. The EI model was built by AtomPairs with support vector machine and the EC model was built by MACCS and support vector machine.	DOI: 10.1093/bioinformatics/bty707
Endocrine Disruption		RARb	VEGA_NRMEA		-	Agonist/Antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Developmental/Reproductive Toxicology	Repro/dev toxicity	Repro/dev tox	VEGA		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Data collection is described in "A Framework for Identifying Chemicals with Structural Features Associated with Potential to Act as Developmental or Reproductive Toxicants", Wu et al. 2013 (DOI:10.1021/tx400226u). The final dataset contains 685 substances: from the original dataset (n = 716), substances were selected on the basis of their structure (e.g. polymers, inorganic compounds and organometals were excluded) and on having data for at least one endpoint (developmental toxicity, reproductive toxicity). The model is structure-based and does not make use of descriptors. Goodness-of-fit statistics: Sensitivity 89%, Specificity 44%, Accuracy 85%, MCC 0.27.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Endocrine Disruption		ARO	VEGA	IRFMN	-	Agonist/Antagonist/"-" = Inactive/Not predicted	This model is based on a cell-based assay in an aromatase breast cancer cell line (MCF-7 aro) that measures the inhibition of the conversion of testosterone to estradiol catalyzed by aromatase. The control used for this assay is Letrozole (IC50 = 9.44 ± 1.4 nM (n = 27)). The final dataset has 3254 compounds, with 281 active agonists, 170 active antagonists, and 2803 inactive. The model is built on 18 descriptors. Goodness-of-fit statistics were: Accuracy 0.94, MCC 0.74.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Hepatotoxicity	DILI	PKCSM		-	Active/"-" = Inactive/Not predicted	DILI: Drug-induced liver injury. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, produced the qualitative predictions (classification tasks). Drug-induced liver injury is a major safety concern for drug development and a significant cause of drug attrition. This predictor was built using the liver-associated side effects of 531 compounds observed in humans. A compound was classed as hepatotoxic if it had at least one pathological or physiological liver event strongly associated with disrupted normal function of the liver. The best performing predictor in each task was chosen based on a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: It predicts whether a given compound is likely to be associated with disrupted normal function of the liver. Qualitative predictions were changed: "Yes" was replaced by "Active" and "No" by "-".	doi: 10.1021/acs.jmedchem.5b00104
Genotoxicity/Mutagenicity	Mutagenicity	AMES	ADMETSAR		-	Active/"-" = Inactive/Not predicted	In total 4866 mutagens and 3482 non-mutagens were collected from literature and CPDB and CCRIS by Xu et al. The model was built by Morgan fingerprint and random forest.	DOI: 10.1093/bioinformatics/bty707
Endocrine Disruption		ERa	VEGA_NRMEA		-	Agonist/Antagonist/ a-anta agonist and antagonist /"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Genotoxicity/Mutagenicity	Mutagenicity	MicroN-In vitro	VEGA		Active	Active/"-" = Inactive/Not predicted	The dataset includes 380 mono-constituent organic compounds with experimental data collected from peer-reviewed literature, SCCS and EFSA opinions, ECVAM guidelines and reviews, and the eChemPortal inventory. The sources were carefully revised to ensure their quality and reliability and, to the authors' knowledge, most of the selected data can be classified with a Klimisch score of 1 because the studies followed test procedures in accordance with validated standard methods. The In vitro Micronucleus Activity (IRFMN/VERMEER) model (version 1.0.0) provides a qualitative prediction of genotoxicity as induction of micronuclei in mammalian cells in vitro. It is based on a set of rules extracted from a set of compounds by the SARpy software without any a priori knowledge. The rule set comprises 82 active structural alerts (SAs) for genotoxic (active/positive) compounds and 56 inactive structural alerts for non-genotoxic (inactive/negative) compounds. The training set of 293 molecules (171 active, 122 inactive) gave Accuracy 0.88; Specificity 0.73; Sensitivity 0.97; Matthews correlation coefficient 0.75.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Skin toxicity	SkinSen	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	Skin sensitization is a potential adverse effect for dermally applied products. The evaluation of whether a compound, that may encounter the skin, can induce allergic contact dermatitis is an important safety concern.	doi: 10.1093/nar/gkab255
Endocrine Disruption		TRb	VEGA_NRMEA		-	Agonist/Antagonist/"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Cell toxicology	Cytotoxicity	Cyto- tox	vNN-ADMET		-	Agonist/Antagonist/"-" = Inactive/Not predicted	Cytotoxicity (HepG2): The k-nearest neighbor (k-NN) method is widely used to develop QSAR models (Zheng and Tropsha, 2000). An alternative approach is to use a predetermined similarity criterion, vNN method, which uses all nearest neighbors that meet a structural similarity criterion to define the model's applicability domain (Liu et al., 2012, 2015; Liu and Wallqvist, 2014). When no nearest neighbor meets the criterion, the vNN method makes no prediction. We developed a cytotoxicity prediction model, using a training dataset of in vitro toxicity against HepG2 cells for 6,097 structurally diverse compounds, which we collected from Chemical European Biology Laboratory (ChEMBL) (Bento et al., 2014). In developing our model, we considered compounds with an IC50 of 10 μM or less in the in vitro assay as cytotoxic. We classified cytotoxic compounds as positives and non-toxic compounds as negatives. The cytotoxicity model performed well, with an overall accuracy of 84% and a kappa value of 0.64. Because compounds in the dataset achieved only sparse coverage of the chemical space, the model only predicted compounds that were well represented in the dataset. Results were adapted and the “Yes” and “No” indicators were changed respectively to “Active” and “-“.	doi: 10.3389/fphar.2017.00889.
Endocrine Disruption		AR	VEGA_NRMEA		-	Agonist/Antagonist/ a-anta agonist and antagonist /"-" = Inactive/Not predicted		https://www.vegahub.eu/wp/wp-content/uploads/2019/12/VEGA_NRMEA_model_Introduction.pdf
Organ toxicology	Hepatotoxicity	H-HT	ADMETLAB2		-	Active/"-" = Inactive/Not predicted	The human hepatotoxicity. Drug induced liver injury is of great concern for patient safety and a major cause for drug withdrawal from the market. Adverse hepatic effects in clinical trials often lead to a late and costly termination of drug development programs. 2304 molecules (1299 + /1005 - ) were used among them 1850 (1044 + /806 - ) were used for the training dataset. Performance of classification models in training was AUC: 0.975, ACC: 0.895, SP: 0.976, Sen: 0.835, MCC: 0.802	doi: 10.1093/nar/gkab255
Genotoxicity/Mutagenicity	Mutagenicity	AMES	VEGA	CAESAR	Active	Active/"-" = Inactive/Not predicted	The model combines 2 steps: first data mining with a Support Vector Machine (SVM), then expert knowledge coded as structural alerts (SA). A set of 3,367 chemicals was used to determine 41 descriptors. Qualitative predictions were transformed: "mutagenic" is reported as "Active" and "non-mutagenic" as "-", regardless of the reliability of the prediction.	https://www.vegahub.eu/portfolio-item/vega-qsar-models-qrmf/
Organ toxicology	Cardiotoxicity	hERG I Blocker	PKCSM		-	Active/"-" = Inactive/Not predicted	hERG I Inhibitor: human ether-a-go-go gene I inhibitor. The prediction is based on molecular properties (Molecular Weight; Heavy Atom count; LogP; Heteroatoms count; Rotatable Bonds count; Ring count; TPSA; Labute ASA; Fluorine atom Count; Toxicophore [1-36]; Pharmacophore count) calculated using the RDKit cheminformatics toolkit and used for training the predictive models. Two different algorithms, Random Forest and Logistic Regression, produced the qualitative predictions (classification tasks). Inhibition of the potassium channels encoded by hERG (human ether-a-go-go gene) is the principal cause of acquired long QT syndrome, which can lead to fatal ventricular arrhythmia. Inhibition of hERG channels has resulted in the withdrawal of many substances from the pharmaceutical market. This predictor was built using hERG I inhibition information for 368 compounds. The best performing predictor in each task was chosen based on a 5-fold cross-validation approach. The Weka toolkit was used for training and testing the models. How to interpret the results: The predictor determines whether a given compound is likely to be a hERG I inhibitor. Qualitative predictions were changed: "Yes" was replaced by "Active" and "No" by "-".	doi: 10.1021/acs.jmedchem.5b00104
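
The conventions used in the Comments column of the table above can be sketched in a few lines of Python. This is an illustrative sketch, not code from any of the cited tools: the function names are hypothetical, the 0.5 cut-off is the empirical decision quoted in the ADMETLAB2 rows, and the dose conversion assumes that log(mg/kg/day) means a base-10 logarithm. The goodness-of-fit helper uses the standard definitions of sensitivity, specificity, balanced accuracy and MCC, applied to the counts reported for the VEGA Chro-Ab training set (TP 149, TN 191, FP 44, FN 58).

```python
import math


def to_label(probability, threshold=0.5):
    """Map a predicted probability to the table's "Active" / "-" notation."""
    return "Active" if probability >= threshold else "-"


def log_dose_to_dose(log_value):
    """Convert a dose in log10(mg/kg/day) to mg/kg/day (base 10 is an assumption)."""
    return 10 ** log_value


def fit_statistics(tp, tn, fp, fn):
    """Standard classification statistics from a confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Counts reported for the VEGA Chro-Ab training set in the table above.
stats = fit_statistics(tp=149, tn=191, fp=44, fn=58)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")

# The MRTD cut-off of 0.477 log(mg/kg/day), expressed in mg/kg/day.
print(f"MRTD cut-off: {log_dose_to_dose(0.477):.2f} mg/kg/day")
```

Rounded to two decimals, the statistics computed this way agree with the values listed for that model (sensitivity 0.72, specificity 0.81, balanced accuracy 0.77, MCC 0.54), and the MRTD cut-off corresponds to roughly 3 mg/kg/day under the base-10 assumption.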

Select an endpoint:

Endpoint	Tool	Value	Unit	Comments	Reference
Synth	SWISSADME	5.63	score	Synthetic accessibility score: from 1 (very easy) to 10 (very difficult), based on 1024 fragmental contributions (FP2) modulated by size and complexity penalties, trained on 12,782,590 molecules and tested on 40 external molecules (r2 = 0.94)
Fsp3	ADMETLAB2	0.762	score		doi: 10.1093/nar/gkab255
Muegge	SWISSADME	1.0	Nb of alert
Bioavailability Score	SWISSADME	0.11	Probability
MCE-18	ADMETLAB2	51.081	score		doi: 10.1093/nar/gkab255
Brenk	SWISSADME	2.0	Nb of alert
Natural Product-likeness	ADMETLAB2	2.015	score		doi: 10.1093/nar/gkab255
Leadlikeness	SWISSADME	2.0	Nb of alert
Alarm_NMR	ADMETLAB2	0.0	Nb of alert		doi: 10.1093/nar/gkab255
BMS	ADMETLAB2	0.0	Nb of alert		doi: 10.1093/nar/gkab255
Chelating	ADMETLAB2	0.0	Nb of alert		doi: 10.1093/nar/gkab255
PAINS	ADMETLAB2	0.0	Nb of alert	PAINS. Pan Assay Interference Compounds (PAINS) is one of the most famous frequent hitters filters, which comprises 480 substructures derived from the analysis of FHs determined by six target-based HTS assay. By application of these filters, it is easier to screen false positive hits and to flag suspicious compounds in screening databases. One of the most authoritative medicine magazines Journal of Medicinal Chemistry even requires authors to provide the screening results with the PAINS alerts of active compounds when submitting manuscripts. Results interpretation: If the number of alerts is not zero.	doi: 10.1093/nar/gkab255
PAINS	SWISSADME	0.0	Nb of alert	Pan Assay Interference Structures: implemented from Baell JB. & Holloway GA. 2010 J. Med. Chem.
Lipinski	ADMETLAB2	Accepted	Result	Lipinski Rule: Content: MW≤500; logP≤5; Hacc≤10; Hdon≤5. Results interpretation: If two properties are out of range, poor absorption or permeability is possible; one violation is acceptable. Empirical decision: < 2 violations: excellent (green); ≥ 2 violations: poor (red)	doi: 10.1093/nar/gkab255
Lipinski	SWISSADME	0.0	Nb of alert	Lipinski (Pfizer) filter: implemented from Lipinski CA. et al. 2001 Adv. Drug Deliv. Rev: 5 rules: MW ≤ 500; MLOGP ≤ 4.15; N or O ≤ 10; NH or OH ≤ 5.
Pfizer	ADMETLAB2	Accepted	Result		doi: 10.1093/nar/gkab255
GSK	ADMETLAB2	Rejected	Result		doi: 10.1093/nar/gkab255
GoldenTriangle	ADMETLAB2	Accepted	Result		doi: 10.1093/nar/gkab255
Ghose	SWISSADME	0.0	Nb of alert
QED	ADMETLAB2	0.177	score		doi: 10.1093/nar/gkab255
Synth	ADMETLAB2	5.345	score	Synth: Synthetic accessibility score is designed to estimate ease of synthesis of drug-like molecules, based on a combination of fragment contributions and a complexity penalty. The score is between 1 (easy to make) and 10 (very difficult to make). The synthetic accessibility score (SAscore) is calculated as a combination of two components: SAscore = fragmentScore − complexityPenalty. Results interpretation: high SAscore: ≥ 6, difficult to synthesize; low SAscore: < 6, easy to synthesize. Empirical decision: ≤ 6: excellent (green); > 6: poor (red)	doi: 10.1093/nar/gkab255
Veber	SWISSADME	2.0	Nb of alert
Egan	SWISSADME	1.0	Nb of alert
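
The Lipinski rule quoted in the ADMETLAB2 row above (MW≤500; logP≤5; Hacc≤10; Hdon≤5, with fewer than 2 violations tolerated) can be written as a small check. This is a minimal sketch with hypothetical function names, not the ADMETLAB2 or SWISSADME implementation. Mapping "fewer than 2 violations" to the table's "Accepted" label is an assumption. The example inputs use the molecular weight (425.50) and XLOGP3 value (0.3) reported on this page; the acceptor and donor counts (9 and 5) are hand-counted from the SMILES as N+O and NH+OH and should be treated as assumptions.

```python
def lipinski_violations(mw, logp, h_acceptors, h_donors):
    """Return the list of Lipinski criteria that the compound violates."""
    checks = {
        "MW <= 500": mw <= 500,
        "logP <= 5": logp <= 5,
        "Hacc <= 10": h_acceptors <= 10,
        "Hdon <= 5": h_donors <= 5,
    }
    return [rule for rule, passed in checks.items() if not passed]


def lipinski_decision(violations):
    """Fewer than 2 violations is treated as "Accepted" (assumed label mapping)."""
    return "Accepted" if len(violations) < 2 else "Rejected"


# Isariotin B: MW and logP from this page; Hacc/Hdon hand-counted (assumption).
violations = lipinski_violations(mw=425.50, logp=0.3, h_acceptors=9, h_donors=5)
print(len(violations), lipinski_decision(violations))
```

With these inputs no criterion is violated, which is consistent with the "Accepted" result from ADMETLAB2 and the 0 alerts from SWISSADME listed in the table.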

Fungi

Fungi id	Species
391	Cordyceps fumosorosea