Structure verification (CHECK)

Introduction.

The command CHECK will bring you to the CHECK menu. This menu holds options that each check one or more aspects of protein structures. Most checks detect exceptional situations, such as a contact that is seldom seen in the database, but hard errors, such as a wrong SCALE matrix in a PDB file, can also be detected.

Several of the commands in this menu are also executable from another menu. For example CHICHK evaluates and checks torsion angles. This option can also be called as EVACHI from the CHIANG (torsion angle) menu.

A few options in the check menu are so-called 'terminal' options. That means that they can destroy the status of the soup, and will definitely leave WHAT IF in an undefined state after the option has finished.

There is a complete description of the output of all checks on the World Wide Web at:

    http://www.sander.embl-heidelberg.de/rob/checkhelp/

Please have a look at those documents if anything is unclear.

X-ray or NMR?

These checks were written for X-ray structures, but most of them work perfectly fine for NMR structures. Interpretation of the results might be different, though, and our experience with NMR structures is limited.

Most checks will automatically generate a table of values for each individual model if a multiple model NMR structure is checked.

Completely checking a protein (FULCHK)

The command FULCHK will cause WHAT IF to write a complete report about a protein structure. You will get the output in LaTeX format in a file "pdbout.tex", and in plain text format in "pdbout.txt". Obviously pictures can only be given in the LaTeX output. If you want to use the LaTeX output, you will need the latex program and some others. For your convenience suitable versions of these programs are archived on our anonymous ftp site "swift.embl-heidelberg.de" in the directory "/whatif/support".

To use the LaTeX output, you can type:


    latex pdbout           (to reformat the file)
    xdvi pdbout            (to preview the output)
    dvips pdbout           (to make postscript output)
    lpr pdbout.ps          (or a similar command, to print the postscript file)

A maximum of 100 lines will be given in any table. If more than 100 problems should be listed, the table is truncated at 50 lines, and the total number of lines is written at the bottom. Since most tables are sorted such that the worst values are at the top, this should not be a problem. If you want to see the whole list anyway, you can sometimes get it by running the individual check while creating a logfile (see DOLOG), and in any case by setting WHAT IF parameter 593 (the limit on the number of lines in a table) to a higher value (see SETWIF).

FULCHK is a terminal option. That means that you cannot run FULCHK in the middle of a WHAT IF session. You run FULCHK on one molecule, preferably in a "fresh" WHAT IF. After FULCHK has finished, you are immediately asked to terminate the session with FULLSTOP.

Running only a subset of the checks (FSTCHK)

The command FSTCHK does the same as the FULCHK option. However, rather than running all checks, only a subset of all checks is executed. You can control which options are skipped and which are executed with the TODO.CHK file (of which there is an example in your dbdata directory of the WHAT IF account). In this file the first three characters of each line are the Check-Id, and columns 4-6 are either 'YES' or 'NO'. The rest of each line is free; in the example file you can find out what the check does and how long it normally takes.
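As an illustration of the file layout described above, the following Python sketch reads TODO.CHK-style lines into a table of switches. It is not part of WHAT IF: the function name and the sample identifiers are hypothetical, and only the column layout (characters 1-3 for the Check-Id, columns 4-6 for YES or NO) is taken from the description. The example file in the dbdata directory remains the authoritative reference.

```python
def parse_todo_chk(lines):
    """Return {check_id: enabled} from TODO.CHK-style lines.

    Assumed layout (from the description above): characters 1-3 hold
    the Check-Id, columns 4-6 hold YES or NO, the rest is free text.
    """
    selection = {}
    for line in lines:
        if len(line.strip()) < 4:
            continue  # skip blank or too-short lines
        check_id = line[0:3]
        flag = line[3:6].strip().upper()
        if flag not in ("YES", "NO"):
            raise ValueError(f"bad YES/NO field in line: {line!r}")
        selection[check_id] = (flag == "YES")
    return selection


# Hypothetical sample lines; the identifiers are made up for illustration.
sample = [
    "AAAYES  example check, fast",
    "BBBNO   example check, slow",
]
todo = parse_todo_chk(sample)
```

A line whose YES/NO field cannot be read raises an error here rather than being silently skipped, so that a mistyped entry does not quietly disable a check.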

Starting a new summary file (NEWCHK)

Most checking options write a summary to a file that can be inspected by, for example, a simple perl script such as the one used in our WWW version of the CHECK procedures. The file is called 'check.db'. WHAT IF keeps adding its results to the end of this file. The command NEWCHK closes the old copy of this file if it exists. It also closes any TEX files that were made already. If you want to keep those files you should rename them BEFORE you run any other check option, because the check options will not hesitate for a millisecond to overwrite the old files.

Individual Checks: Atoms and coordinates

Checking chain names (CHNCHK)

This check verifies the chain names in the PDB file. All residues with a certain chain name should be consecutive in the file, otherwise an error message will be given. Another error message will be given if any residue has a lower number than the previous residue in the same chain.
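The two rules above can be sketched in a few lines of Python. This is a minimal illustration, not the WHAT IF implementation: the function name and the input format (a list of chain name and residue number pairs in file order) are assumptions, and insertion codes are ignored.

```python
def check_chains(residues):
    """residues: list of (chain_id, residue_number) in file order.

    Returns a list of human-readable problem descriptions.
    """
    problems = []
    finished = set()      # chains that have already been left behind
    prev_chain = None
    prev_number = None
    for chain, number in residues:
        if chain != prev_chain:
            if prev_chain is not None:
                finished.add(prev_chain)
            if chain in finished:
                # Rule 1: residues of one chain must be consecutive.
                problems.append(
                    f"chain {chain!r} is not consecutive at residue {number}")
            prev_number = None
        elif prev_number is not None and number < prev_number:
            # Rule 2: residue numbers must not decrease within a chain.
            problems.append(
                f"residue {number} in chain {chain!r} follows residue {prev_number}")
        prev_chain, prev_number = chain, number
    return problems


good = check_chains([("A", 1), ("A", 2), ("B", 1), ("B", 2)])
bad = check_chains([("A", 1), ("A", 2), ("B", 1), ("A", 3), ("A", 2)])
```

In the second call chain A reappears after chain B and then its numbering goes down, so both kinds of message are produced.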

Checking coordinate rounding (CRDCHK)

The coordinate rounding check looks for "0"s at the end of the coordinates in the PDB file. If there are many atoms with round coordinates (up to 0.1 Angstrom), this probably means the structure (or a subset) was not refined at all.
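The idea can be sketched as follows. The sketch works on the coordinate text as written in the file, so that trailing zeros are not lost in a floating point conversion. The function names are hypothetical, and what fraction counts as "many" is left open because it is not stated here.

```python
def is_round(text):
    """True if a three-decimal coordinate string ends in two zeros."""
    whole, dot, decimals = text.strip().partition(".")
    return dot == "." and len(decimals) == 3 and decimals.endswith("00")


def fraction_rounded(coord_fields):
    """Fraction of coordinate strings that carry no information beyond 0.1 Angstrom."""
    if not coord_fields:
        return 0.0
    return sum(1 for text in coord_fields if is_round(text)) / len(coord_fields)


fraction = fraction_rounded(["12.300", "4.127", "-7.000", "0.250"])
```

If the last two digits were random, only about one coordinate in a hundred would end in two zeros, so a much larger fraction stands out clearly.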

Nomenclature checks (NAMCHK)

The command NAMCHK allows you to check the names of atoms. All atoms with non-IUPAC names will be listed. This involves simple torsion angle calculations (like for the PHE side chain) as well as checks for the exchange of atoms (like CG and OG in the THR side chain).

Atomic occupancy check (WGTCHK)

WGTCHK checks whether all atomic occupancies are between 0 and 1.

B-factor check (XBFCHK)

XBFCHK verifies the B-factors in the structure. If many buried atoms have a B-factor below 5.0, a warning is given. This either means that the structure has been determined at low temperature, or that there are problems in the refinement. If the average B-factor for buried atoms is very high or very low, another warning is given. Finally, the distribution of B-factors (basically the differences between B-factors of bonded atoms) is analyzed. If the result is very strange, a warning is printed. If this warning appears, the B-factors should probably be constrained during the refinement. Because these strange observed differences cannot be caused by thermal motion, adding constraints could improve the behaviour of the refinement.

Individual Checks: Symmetry related

Checking distance of atoms to symmetry axes (AXACHK)

The command AXACHK will verify for each atom in the structure whether it has a distance bigger than 0.7 Angstrom to all proper symmetry axes. Any atom coming closer than this distance must form a "bump" to a symmetry related copy of itself. The only exception is a water molecule that is exactly on an axis; therefore WHAT IF will not complain in such a case.
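The geometric test behind this check is a point-to-line distance. The Python sketch below assumes that an axis is given as a point on the axis plus a direction vector in Cartesian coordinates; how WHAT IF represents its axes is not described here. The tolerance for "exactly on an axis" is likewise an assumption made for the illustration.

```python
import math

AXIS_LIMIT = 0.7      # Angstrom, from the description above
ON_AXIS_TOL = 1e-3    # assumed tolerance for "exactly on an axis"


def distance_to_axis(atom, axis_point, axis_direction):
    """Distance from a point to a line: |(p - a) x d| / |d|."""
    rel = [atom[i] - axis_point[i] for i in range(3)]
    d = axis_direction
    cross = (rel[1] * d[2] - rel[2] * d[1],
             rel[2] * d[0] - rel[0] * d[2],
             rel[0] * d[1] - rel[1] * d[0])
    return math.hypot(*cross) / math.hypot(*d)


def too_close(atom, axis_point, axis_direction, is_water=False):
    """True if the atom should be reported for this axis."""
    dist = distance_to_axis(atom, axis_point, axis_direction)
    if is_water and dist < ON_AXIS_TOL:
        return False  # a water sitting on the axis is allowed
    return dist < AXIS_LIMIT


# An axis along z through the origin, and an atom 0.5 Angstrom away from it.
dist = distance_to_axis((0.5, 0.0, 3.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```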

Checking validity of water molecules (H2OCHK)

The option H2OCHK will perform three checks on all water molecules in the soup.

For all clusters of water molecules H2OCHK will verify whether they are free-floating in the unit-cell, or touch the protein somewhere. If a cluster is free-floating this is reported as a problem: it is very unlikely that such clusters can be seen in the X-ray density, so the listed water molecules are probably refinement artefacts.

For all water molecules the closest protein molecule is located. If this is a molecule that is symmetry related to the ones given in the input file, a warning is given. For optimum usability of the file the listed waters should be moved such that they are closest to the untransformed protein molecule. See the MOVWAT option for this.

For all water molecules all possible Hydrogen bonding partners are located. All water molecules that lack the possibility to form any hydrogen bond are listed.

Matthews' coefficient (MVMCHK)

The volume of the unit cell for a normal protein structure is about 1.8-4.5 cubic Angstrom per Dalton. This check will calculate this so-called Matthews' coefficient for the current structure, and complain if it is outside these boundaries. Most of the time, this check triggers because "Z" (multiplicity of the unit cell) is given incorrectly on the CRYST1 card of the PDB file.
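The calculation itself is short: the unit cell volume divided by the product of Z and the molecular weight. The Python sketch below uses the standard formula for the volume of a general (triclinic) cell and the boundaries quoted above. The function names and the example cell are hypothetical, and the molecular weight is taken per molecule counted by Z.

```python
import math

LOW, HIGH = 1.8, 4.5   # cubic Angstrom per Dalton, boundaries quoted above


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in cubic Angstrom; angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(
        1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)


def matthews_coefficient(volume, z, mol_weight):
    """Cell volume per Dalton of protein."""
    return volume / (z * mol_weight)


def is_normal(vm):
    return LOW <= vm <= HIGH


# Hypothetical orthorhombic cell with four molecules of 20 kDa each.
volume = cell_volume(50.0, 60.0, 70.0, 90.0, 90.0, 90.0)
vm = matthews_coefficient(volume, 4, 20000.0)
vm_wrong_z = matthews_coefficient(volume, 1, 20000.0)
```

With Z given as 1 instead of 4 the same cell yields a value four times higher and falls outside the boundaries, which is the typical way this check is triggered.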

Non-crystallographic symmetry checks (NCSCHK)

Similar protein molecules should be folded in similar ways. Within one protein structure that means that if there are two or more identical molecules in the asymmetric unit, these should be similar. Normally this is ensured by using non-crystallographic symmetry constraints. The NCS check in WHAT IF will generate 2 plots (if they are potentially interesting) for each pair of "identical" molecules, so that you can judge whether they are indeed identical in structure. There is no automatic interpretation (yet).

Verifying symmetry information (SYMCHK)

The command SYMCHK is a killer command. That means that it starts by wiping out the soup. It will then prompt you for the name of the PDB file for which the symmetry information should be checked. This file will be read and checked.

This option checks the internal consistency of the SCALE and CRYST1 cards in the PDB file, and it checks whether the crystal can be reconstructed from the atomic coordinates and the provided symmetry information. It also checks whether the cell complies with rules set by the IUCr, and whether there is extra symmetry between so-called independent molecules.

Individual Checks: Geometry related

Bond angle check (ANGCHK)

The bond angle check will compare each bond angle in protein residues with the Engh and Huber bond angle parameters [See Engh and Huber, Acta Cryst. A47, 392-400 (1991)] and print a table of all bond angles that differ by more than 4 standard deviations from the expected values. It will also calculate an RMS Z-score for all angles, telling you how well the bond angles in general have been restrained.

Checking bondlengths (BNDCHK)

The command BNDCHK does not require any additional input. It will perform a number of checks on the chemical bonds in the structure.

First it will check whether all atoms in all protein and nucleic acid residues are present.

After that it will compare each bond in protein residues with the Engh and Huber distance parameters [See Engh and Huber, Acta Cryst. A47, 392-400 (1991)] and print a table of all bonds that differ by more than 4 standard deviations from the expected values.

As a third check, the RMS deviation from the mean Engh and Huber parameters is determined (expressed in standard deviations). This RMS value is expected to be around 1.0. If it is bigger than 1.5 or smaller than 0.666 WHAT IF will complain.
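The second and third checks amount to a Z-score per bond and an RMS over all Z-scores. A minimal Python sketch follows. The function names are hypothetical, and the ideal values and standard deviations in the example are made-up numbers for illustration, not the Engh and Huber parameters.

```python
import math


def bond_z_scores(bonds):
    """bonds: list of (observed, ideal, sigma), all in Angstrom."""
    return [(obs - ideal) / sigma for obs, ideal, sigma in bonds]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_is_suspicious(rms_z):
    """Thresholds from the description above."""
    return rms_z > 1.5 or rms_z < 0.666


# Made-up example values (observed, ideal, sigma).
bonds = [
    (1.54, 1.52, 0.02),
    (1.50, 1.52, 0.02),
    (1.34, 1.33, 0.01),
    (1.42, 1.33, 0.01),   # nine standard deviations too long
]
z = bond_z_scores(bonds)
outliers = [value for value in z if abs(value) > 4.0]
rms_z = rms(z)
```

An RMS Z-score near 1.0 means the bond lengths scatter about as much as the reference distribution; a much lower value points to restraints that were too tight, a much higher one to restraints that were too loose.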

Lastly, BNDCHK will determine whether the deviation from the Engh and Huber bond lengths is significantly correlated with the direction of the bond in the crystallographic unit cell. If such a correlation is found, a new unit cell is calculated in which the correlation is gone. If this message appears, the cell used during refinement is probably not accurate enough. We do not have any experience on what to do about it, though.

Checking torsion angles (CHICHK)

The command CHICHK is equivalent to the EVACHI command in the CHIANG menu.

All torsion angles in the molecule will be compared with the distribution of the same torsion angle in 150 of the 300 best refined proteins from the PDB. You will get a score for 'normality' and not for 'correctness' or energetics. In this score 0.0 means that this torsion angle value is as normal as it can be, and negative values represent less common conformations. Residue values below -2.0 warrant investigation, below -3.0 something strange must be happening.

For this analysis all torsion angles in the residue except omega are used.

Another part of the CHICHK verifies the phi/psi combination versus a Ramachandran plot. Residues that are in forbidden areas of the Ramachandran plot will be listed. Also, a separate check on omega values will be performed (for PRO and non-PRO residues), and residues with unusual values are listed.

Omega check (OMECHK)

The Omega angle is often fairly strictly restrained to 180 degrees. That is not good: there should be a little flexibility allowed. This check verifies that the variation in omega angles observed in the protein is "normal".

Side Chain planarity check (PLNCHK)

The planarity of side chains of protein residues is verified against a database distribution. If any side chain deviates more than 4.0 standard deviations from planarity, this fact is reported. For this check any hydrogen atoms are ignored (but see PL3CHK).

Side Chain planarity check 2 (PL2CHK)

For each atom connected to an aromatic ring system the distance of the atom to the least squares plane of the ring is calculated, and compared with a database distribution. If any value deviates more than 3.0 standard deviations from the plane, this fact is reported.

Side Chain planarity check 3 (PL3CHK)

We do not have "normal planarity" data for DNA/RNA bases and protein side chains with hydrogen atoms. In the PL3CHK, the planarity of these systems is calculated, and a complaint is expressed if the RMS deviation is larger than 0.1 Angstrom.

Proline puckering check (PUCCHK)

Proline rings are not flat. They have a fairly precise puckering pattern. This check verifies whether all proline residues in the structure look normal. Since there is no hard data, the results are a bit subjective.

Ramachandran analysis (RAMCHK)

In a phi/psi plot, residues are normally found in a few areas at the left side. If you look more carefully, some residue types occupy tighter areas than others. WHAT IF has 60 different "normal" areas, one for each combination of the 20 residue types and 3 secondary structure types. Instead of showing all 60 plots, it will calculate a Z-score that tells you whether the Ramachandran plot "looks OK". That Z-score is calculated by the RAMCHK option.

Checking the hand of chiral atoms (HNDCHK)

The command HNDCHK can be used to check for wrong handedness of chiral atoms in the twenty naturally occurring residues. All atoms with deviating chirality will be listed. Not only the "opposite" chirality will be detected, but also "too flat" or "too puckered" atoms.

Individual Checks: DG Database related

Backbone conformation check (BBCCHK)

The backbone conformation check will check whether there are fragments of 5 C-alpha coordinates that are not represented by other structures in the WHAT IF database. It will calculate an overall normality Z-score that expresses this.

Chi-1/Chi-2 correlation check (C12CHK)

The Chi-1/Chi-2 correlation check verifies for each residue whether the chi-1 and chi-2 angle combination is normal for this residue type in this secondary structure element. A final Z-score will be calculated that expresses how normal the chi-1 and chi-2 angles in this structure are.

Checking for peptide plane flips (FLPCHK)

The command FLPCHK causes WHAT IF to compare all local backbone conformations (5 residue stretches) with similar (RMSD on alpha carbons less than 0.5 Angstrom) conformations in the database. The RMSD of the backbone oxygen in the structure and the database positions is given. If this value for a residue is above 1.5, manual inspection of the peptide plane seems advisable. In brackets the number of hits in the database is listed. This number should normally be 80, as that is the maximal number of hits WHAT IF looks for. If this number is considerably less than 80, the RMSD value for the oxygen position becomes a less sensitive measure of quality.

Looking for rotamer normality (ROTCHK)

The command ROTCHK will compare for all residues their chi-1 rotamer with the distribution of observed rotamers for the same residue type in a similar local backbone conformation in the database. A normality index will be listed. If this index is lower than 0.5 a warning will be given. A few values are expected to appear for every structure, but normality values lower than 0.2 should occur only extremely sparingly!

Individual Checks: Packing and Folding

Checking for unusual short distances (BMPCHK)

The command BMPCHK activates a bump check that is rather different from the bump functions used by e.g. the DEBUMP option.

From a study of WHAT IF's database of high quality structures it was determined that no pair of non-hydrogen-bonded atoms should have an inter-atomic distance more than 0.4 Angstrom shorter than the sum of the two Van der Waals radii. For hydrogen bonded atoms this limit was found to be 0.55 Angstrom.

In the BMPCHK, all interatomic distances between non-bonded atoms are calculated, and verified against these rules. If two atoms do come closer, the amount by which the contact is too short is printed in a table. In the table it will be indicated whether the bump is between symmetry relatives (inter) or within the given asymmetric unit (intra).

A bump will never be reported between two atoms for which the sum of their atomic occupancies is less than 1.0.
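The rules of this check can be sketched in Python as follows. The limits of 0.4 and 0.55 Angstrom and the occupancy rule are taken from the description above. The function names are hypothetical and the Van der Waals radii in the example are illustrative values, not the ones WHAT IF uses. Finding symmetry relatives and deciding which atoms are hydrogen bonded are left out.

```python
import math


def bump_excess(pos1, pos2, radius1, radius2, hydrogen_bonded=False):
    """Return by how much the contact is too short (Angstrom), or 0.0."""
    allowed_overlap = 0.55 if hydrogen_bonded else 0.40
    limit = radius1 + radius2 - allowed_overlap
    distance = math.dist(pos1, pos2)
    return max(0.0, limit - distance)


def reportable(occupancy1, occupancy2):
    """Bumps are not reported if the occupancies sum to less than 1.0."""
    return occupancy1 + occupancy2 >= 1.0


# Two atoms 2.5 Angstrom apart, with illustrative radii of 1.8 and 1.4.
a, b = (0.0, 0.0, 0.0), (2.5, 0.0, 0.0)
plain = bump_excess(a, b, 1.8, 1.4)
hbond = bump_excess(a, b, 1.8, 1.4, hydrogen_bonded=True)
```

The same pair of atoms is a larger bump when it is not hydrogen bonded, because less overlap of the Van der Waals spheres is tolerated in that case.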

Checking first generation packing quality (QUACHK)

The command QUACHK is similar to the OLDQUA option in the QUALITY menu. It activates the packing quality control. See the chapter on QUALITY control for an explanation. In short:

Every residue with a quality value below -5.0 is suspicious. A sequence of residues with low quality scores is "interesting".

Every molecule with a global quality below -2.7 is guaranteed wrong. A molecule with a quality below -2.0 might be misfolded or poorly refined. Every molecule with a global quality below -1.2 does not belong in a database of reliable structures.

Checking second generation packing quality (NQACHK)

The command NQACHK is similar to the NEWQUA option in the QUALITY menu. It activates the second generation packing quality control. See the chapter on QUALITY control for an explanation. In short:

Every residue with an "all-all" quality value below -2.5 is suspicious. A sequence of residues with low quality scores is "interesting".

Every molecule with a quality Z-score for all atoms below -5.0 is guaranteed wrong. A molecule with a quality below -3.0 might be misfolded or poorly refined. Every molecule with a global quality below -2.0 does not belong in a database of reliable structures.

Inside/Outside distribution check (INOCHK)

Most PHE residues are expected to be buried. Most LYS residues are expected to be exposed. The INOCHK tests whether this protein is "normal" in this aspect. It will report a normality RMS Z-score for the whole structure. Inside-out structures, membrane proteins and misthreaded structures will trigger this check.

The way it works: for each residue the accessibility is calculated. These values are divided by the "vacuum accessibility" of the residue type, resulting in an "accessibility fraction". These fractions are then sorted from low to high. We expect PHE residues to appear near the beginning of the array. Using the mean and standard deviation of the location in the array from the WHAT IF database, a Z-score is calculated for each residue. (At this moment, the average location of PHE is 0.301 into the array, with a standard deviation of 0.197. For GLU these values are 0.593 and 0.209, showing a much higher tendency to be relatively outside.) The Z-scores for the residues are used to calculate an RMS Z-score for the structure.

Due to the way this check works, for NMR ensembles the result is slightly different if a model is examined alone or in the context of other models. We think the result in the context of other models might actually be a more accurate result, so we will not "fix" this.
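The procedure described under "The way it works" can be sketched in Python. Only the PHE and GLU statistics quoted there are used; values for the other residue types would have to come from the WHAT IF database. The function name, the accessibility numbers in the example and the exact definition of the "location in the array" (here the mid-rank position scaled to the range 0 to 1) are assumptions made for the illustration.

```python
import math

# (mean, standard deviation) of the relative location in the sorted array.
# Only the two residue types quoted in the text are included here.
POSITION_STATS = {"PHE": (0.301, 0.197), "GLU": (0.593, 0.209)}


def inside_outside_rms_z(residues):
    """residues: list of (residue_type, accessibility, vacuum_accessibility)."""
    fractions = [(acc / vac, rtype) for rtype, acc, vac in residues]
    fractions.sort(key=lambda item: item[0])       # most buried first
    n = len(fractions)
    z_scores = []
    for rank, (_, rtype) in enumerate(fractions):
        if rtype not in POSITION_STATS:
            continue                               # no statistics available
        position = (rank + 0.5) / n                # assumed: mid-rank in 0..1
        mean, sd = POSITION_STATS[rtype]
        z_scores.append((position - mean) / sd)
    if not z_scores:
        raise ValueError("no residues with known statistics")
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))


# Made-up accessibilities: a "normal" pair and an inside-out pair.
normal = inside_outside_rms_z([("PHE", 10.0, 200.0), ("GLU", 150.0, 180.0)])
inverted = inside_outside_rms_z([("PHE", 190.0, 200.0), ("GLU", 10.0, 180.0)])
```

With PHE buried and GLU exposed the RMS Z-score stays low; swapping the two gives a clearly higher value, which is the behaviour expected for inside-out structures.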

Individual Checks: Other tools

Verifying normality of surfaces (ACCCHK)

The command ACCCHK will calculate and evaluate accessible surfaces. It will indicate whether the distribution of polar and apolar accessible and buried atoms looks normal or not. At present I am not sure yet how to interpret the numbers. This option is not used for the FULCHK report. Please see INOCHK for an alternative where a better interpretation exists.

Unsatisfied buried H-bond donors and acceptors (BPOCHK)

The command BPOCHK will cause WHAT IF to list all buried unsatisfied hydrogen bond donors or acceptors. This check uses a very straightforward definition of a hydrogen bond. A more sophisticated check of unsatisfied hydrogen bond potential is part of the HNQCHK.

Hydrogen bond network checks (HNQCHK)

HNQCHK performs a set of commands from the HBONDS menu in a row, having to do with the HB2 options. For this a complete calculation is done of the optimal hydrogen bond network in the protein. A number of warnings can be generated from the result.

The optimization of the hydrogen bond network considers two possibilities for the side-chain conformations of HIS, ASN and GLN residues. The X-ray experiment cannot see the difference between the two conformations. If the orientation of the side chain of one of these residues in the optimized H-bond network is different from the orientation in the input file, a warning is given.

If any buried hydrogen bond donors do not have an acceptor, they are listed. In high resolution structures these do not occur, because it is energetically highly unfavourable!

If any polar side chain acceptor does not accept a hydrogen bond, the atom is listed.

From the optimized hydrogen bond network the protonation state of the HIS residues (HISD, HISE or HISH) can be deduced. Also, from the geometry of the HIS ring it is often possible to see which Engh and Huber parameters have been used for refinement. All these assignments are printed in a table. If the two assignments for a residue differ it is good to verify whether the correct parameters have been used for the refinement.