Instructions for preparing inputs for the OpenFE public benchmark
Overview & Aims
This page provides guidance on preparing the Schrodinger benchmark files for use in the OpenFE 2024 public dataset industry benchmark.
Whilst each partner organization will be responsible for preparing inputs using their own tooling, this guidance aims to provide a set of rules to standardize the process and ensure that scientifically equivalent simulation inputs are generated.
This page will be continually updated based on feedback from benchmark partners. Please reach out should you have any questions or require additional information whilst preparing your inputs.
Checklist
By the end of these instructions you should have:
A PDB file containing the system protein, crystallographic waters and ions (named protein.pdb)
An SDF file containing the ligands being mutated (named ligands.sdf)
(Optional) An SDF file containing any system cofactors (named cofactors.sdf)
Input preparation instructions
Note
Instructions for preparing the systems using Maestro are provided in Section Input preparation using Maestro.
Input data source
All input data are sourced from the v2.0 release of the Schrodinger binding free energy benchmarks.
For your convenience, a snapshot of these inputs have been placed under the OpenFE Public Benchmark repository. Specifically, input PDB, SDF and edges CSV files can be found under the relevant original structures sub-folder.
We recommend that you clone this repository and use those input files.
Extracting cofactors
In its benchmark inputs, Schrodinger places cofactor molecules within the PDB file. Currently OpenFE cannot parse such inputs as small molecules are expected as separate inputs.
Should your input PDB have cofactors, please extract these from the PDB and store them
within an SDF file named cofactors.sdf. You may be required to manually assign and/or
correct bond orders such that they represent the expected state of the molecules and the
SDF files are readable by RDKit.
Note
Waters and ions do not need to be extracted from the PDB file.
Capping proteins
Whether or not proteins have to be capped depends on how the protein’s termini are handled in the Schrodinger provided PDB files.
- If the termini are neutral, e.g. NH2 or C(=O)H, or have ACE/NME caps:
Assign ACE and/or NME caps to the termini
Note
In some cases Schrodinger / Maestro may call the NME cap NMA, if this happen, the cap should be renamed to NME.
- If the termini are charged, e.g. COO- or NH3+:
Keep the termini in their charged state
Note
In cases where this is observed, e.g. JNK1, there is a possibility that interactions between the perturbed ligand and the termini may form. Keeping the termini charged should retain the intended interaction.
Note
Some tools, e.g. Pymol, are known to erroneously place caps out of order with the protein chain or TER cards between the protein chain and the cap. Please ensure that the caps appear in order (i.e. before or after the relevant termini residue), and that there are no TER cards between the termini residue and the cap.
Non canonical amino acids (PTMs)
Some of the input structures have non-canonical amino acids, specifically TPO, PTR, or TYS residues. The OpenFE team has reviewed all such cases and the amino acids were found to be far from the binding site.
As the OpenFE software cannot easily handle PTMs at this stage, these residues should be modified back to their canonical alternative. Tools such as Pymol offer the ability to mutate residues in this manner.
Protonating inputs
All files in the Schrodinger input files are considered to be in their intended protonation state. No additional protonation steps should be carried out.
Note
Care should be taken that any processing steps do not alter the protonation state. For example, PyMol is known to change histidines from HID to HIE during capping.
To ensure that protonation states were not altered, please ensure that the prepared PDB file has the same number of protein atoms outside of capping groups.
Setting residue names
Where possible, residues should be assigned PDB-compliant names.
Example 1: Waters named SPC (e.g. in the case of Thrombin in the JACS set), should be renamed to HOH.
Example 2: Capping groups named NMA should be renamed to NME (e.g. in the case of PTP1B in the JACS set).
Fixing hydrogen atom names
In some cases, hydrogen names may need to be manually altered to match expected, i.e. PDB compliant, names.
These exact cases can be difficult to identify, running the validation script (see below), will help identify these. Please reach out to the OpenFE team should you encounter any unknown hydrogen names.
Example 1: GLY termini hydrogens being named 3HA and HA instead of HA3 and HA2.
Example 2: HIS (in the HID state) hydrogens being named 1HD, 2HD, and 1HE instead of HD1, HD2, and HE1.
Validating prepared files
To ensure that prepared files can be run using OpenFE, a short MD simulation validation script has been provided under utils/input_validation.py. In an environment with OpenFE 1.0 installed, please run this script by calling:
# If you don’t have cofactors
python input_validation.py --pdb protein.pdb
# If you have cofactors
python input_validation.py --pdb protein.pdb --cofactors cofactors.sdf
If the script outputs “SIMULATION COMPLETE”, then your inputs are suitable for use with OpenFE. If they do not, then there is likely an issue with the input file. Please report the error message emitted when contacting the OpenFE team for advice on how to fix any issues.
Note
This script runs a very short simulation, it is recommended that it is executed on a machine with a CUDA-enabled GPU.
Preparing the ligand file
For some datasets, the Schrodinger public binding free energy benchmark set includes multiple binding modes (e.g. different rotamers) and protonation states of ligands. For this current study, we will only consider a single conformation and protonation state for each of the ligands.
If the dataset contains ligands in multiple conformations or protonation states, the state that likely contributes the most to binding should be identified (by looking at previous results) and the less favorable state should be removed from the input ligands.sdf file.
The FEP+ ligand predictions can be found here.
In the following, we will go into the details on how to extract the necessary information for ligands with multiple binding modes, multiple protonation states, and multiple stereo isomers.
1. Multiple binding modes
For ligands that were run in multiple binding modes, the table of FEP+ ligand predictions reports only the binding mode that was calculated to contribute more to binding.
Example: JNK1 (JACS set)
Opening the Table of ligand predictions
The table shows the experimental and calculated binding free energies for 21 ligands, while there had been 38 nodes in the FEP+ network
Remove all ligands from the
ligands.sdffile that are not listed in this tablee.g.
18637-1is present in the table but not18637-1 flip, therefore we would remove18637-1 flipIt may also be helpful to look at the Table of edge predictions to identify the ligand pairs for which multiple binding modes had been used
e.g. first edge between ligand
18637-1and its alternate binding mode18637-1 flip
2. Multiple protonation states
For ligands for which multiple protonation states were included in the ligand network, the table of FEP+ ligand predictions reports calculated binding free energies from all states. The values include a pka correction as described in work by Oliveira et al. For this study we will be using a single protonation state per ligand, choosing the protonation state that had been used in the original studies by Schindler et al. (Merck set), Chen et al. (charge annihilation set), and Cappel et al. (MCS docking set).
From the systems picked by industry partners, as of writing these instructions, the following systems have ligands in multiple protonation states:
Merck set: EG5, TNKS2
MCS docking set: HNE
Charge annihilation set: JNK1, EGFR, DLK, JAK1, TYK2, ITK, CDK2
If you picked one of these systems, please reach out to us with any questions regarding the protonation state assignment!
3. Multiple stereo isomers
For ligands where multiple stereo isomers where included in the ligand network,
the table of FEP+ ligand predictions reports results from both stereo isomers.
In this case we will keep the stereo isomer with the more negative calculated binding free energy and remove the other stereo isomer from the ligands.sdf file.
Example: MUP-1 (fragments dataset)
Opening the Table of ligands predictions
For ligand
SBTthere are results from two stereo isomers,SBT_RandSBT_SThe calculated binding free energy of ligand
SBT_Sis more negative than for ligandSBT_R(-9.14 vs. -8.77 kcal/mol)In this case we would remove ligand
SBT_Rfrom theligands.sdffile
Input preparation using Maestro
If you are preparing the input files using Maestro, the following steps can be carried out to prepare the protein, cofactor, and ligand files:
Read in original inputs (protein and ligands)
Duplicate the inputs (to have a reference)
If multiple ligand conformations or protonation states, extract preferred ligand input
Extract all cofactors into a single maestro entry
Check the sequence to see if it contains any non-natural amino acids, mutate back if present
Check all caps to see if any need to be charged or not. Leave residue if charged
Open the Protein preparation wizard, if caps are needed
- Run only step 3 - “Preprocess”. The only three options activated are:
Cap termini
Convert selomethionines to methionines
Include peptides when capping termini
Note: Not changing protonation states!
- If applicable, convert waters SPC to HOH
Highlight all waters
Builder > Other Edits > Change Atom Properties > Property: Residue/Chain Name
Residue name “HOH”
- Change the cap names if not ACE, NME:
Highlight all Cterm caps
Builder > Other Edits > Change Atom Properties > Property: Residue/Chain Name
Residue name “NME”
Export outputs, named “ligand.sdf”, “protein.pdb” and optionally “cofactors.sdf”
Run input through validation script
Submitting prepared input files
All prepared inputs should be submitted to the OpenFE Public Benchmark github repository, more specifically to the prepared_structures subfolder. This should be done via Pull Request, with a folder for each prepared system including the protein PDB, ligand SDF, and if available cofactor SDF file. A short bullet point summary of any remediation steps, including any software used, should also be included as a markdown file. Further details can be found in the Contributing inputs page.
If necessary, you may email the OpenFE team with this information and the Pull Request will be opened on your behalf.
Once the Pull Request is opened, the OpenFE team will carry out a minimal review of the contents, including a short validation that the alchemical transformations will work. If all checks pass, the Pull Request will be merged and you should be ready to start the next step in the benchmarking process (setting up the alchemical network).