Chemistry - PubChem, InChI, SMILES, and uniqueness
Solution 1:
Unfortunately, PubChem is right: the two structures have the same standard InChI string and key, because the zwitterion and the neutral form contain the same total number of protons. The discrepancy is therefore by design.
I had also always assumed that InChI was designed to distinguish between such forms, but this turns out to be one of the limitations of the system. The issue is addressed in section 13.2 of the Technical FAQ of the InChI Trust:
The different protonation states of the same compound will have InChIKeys differing only by the protonation indicator (unless both states have a number of inserted/removed protons greater than 12; in this case the protonation flag will also be the same, ‘A’).
This is exemplified below by standard InChIKeys as well as standard InChI strings for neutral, zwitterionic, anionic and cationic states of glycine (note that neutral and zwitterionic states do not differ in the total number of protons so they have the same standard InChI/InChIKey):
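As a rough illustration of the protonation indicator described in the quote, here is a minimal Python sketch that splits an InChIKey into its blocks. The helper names (parse_inchikey, same_except_protonation) are made up for this example, and the letter-to-proton mapping (N for neutral, O/P/... for added protons, M/L/... for removed protons, A for anything beyond 12) is my reading of the InChI documentation, so check it against the Technical Manual before relying on it. The first key is glycine's standard InChIKey; the other two are formed simply by swapping the final character, which is what the quote says should happen.

```python
import re

# Layout assumed here: 14-letter skeleton hash, 8-letter hash of the other
# layers, a standard/non-standard flag (S or N), a version letter, and a
# final protonation letter.
INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key):
    """Split an InChIKey into its blocks (illustrative helper, not an official API)."""
    match = INCHIKEY.match(key)
    if match is None:
        raise ValueError("not a well-formed InChIKey: %r" % key)
    skeleton, other_layers, std_flag, version, proton_flag = match.groups()
    if proton_flag == "A":
        proton_delta = None  # more than 12 protons inserted or removed
    else:
        proton_delta = ord(proton_flag) - ord("N")  # N -> 0, O -> +1, M -> -1
    return {
        "skeleton": skeleton,
        "other_layers": other_layers,
        "standard": std_flag == "S",
        "version": version,
        "proton_delta": proton_delta,
    }


def same_except_protonation(key_a, key_b):
    """True if two keys agree on everything but the protonation letter."""
    return key_a[:-1] == key_b[:-1]


glycine_neutral = "DHMQDGOQFOQNFH-UHFFFAOYSA-N"  # also the zwitterion's key
glycine_anion = "DHMQDGOQFOQNFH-UHFFFAOYSA-M"    # last letter swapped: one proton removed
glycine_cation = "DHMQDGOQFOQNFH-UHFFFAOYSA-O"   # last letter swapped: one proton added

for key in (glycine_neutral, glycine_anion, glycine_cation):
    print(key, parse_inchikey(key)["proton_delta"])
```

Because the neutral and zwitterionic forms have the same net proton count, nothing in the key, not even the last letter, can tell them apart.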
Solution 2:
Standard InChI is intended to ignore tautomeric forms. As Martin indicates, this also means zwitterions are considered identical to the neutral form.
Unlike you and Martin, I'm not sure I see this as a bug, since predicting the most stable tautomer or zwitterion/neutral is a complicated issue.
If you want to keep track of zwitterions, I think SMILES is a better format, since you can specify exactly what you want as far as explicit hydrogens and charges. Canonical SMILES are toolkit-specific, though, so you'll need to stick to one particular toolkit to get a consistent canonical ordering.
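To show how the explicit charges in a SMILES string keep the zwitterion information that standard InChI discards, here is a small standard-library sketch. It is not a real SMILES parser: it only reads the charge written inside bracket atoms such as [NH3+] or [O-], and the function names are invented for this example. It will also flag any net-neutral, charge-separated component (for instance a nitro group written as [N+](=O)[O-]), so treat it as a heuristic and use a toolkit for anything serious.

```python
import re

# Contents of a bracket atom, e.g. NH3+ from [NH3+]
BRACKET_ATOM = re.compile(r"\[([^\[\]]+)\]")
# Charge at the end of a bracket atom: '+', '-', '++', '--', '+2', '-1',
# optionally followed by an atom class such as ':1'
CHARGE = re.compile(r"(\+{2,}|-{2,}|[+-]\d*)(?::\d+)?$")


def formal_charges(smiles):
    """Return the explicit formal charges found in bracket atoms, in order."""
    charges = []
    for content in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(content)
        if match is None:
            continue  # bracket atom without a charge, e.g. [13CH4]
        token = match.group(1)
        sign = 1 if token[0] == "+" else -1
        digits = token.lstrip("+-")
        magnitude = int(digits) if digits else len(token)
        charges.append(sign * magnitude)
    return charges


def looks_zwitterionic(smiles):
    """Heuristic: some connected component is net neutral but charge-separated."""
    for component in smiles.split("."):
        charges = formal_charges(component)
        has_plus = any(c > 0 for c in charges)
        has_minus = any(c < 0 for c in charges)
        if has_plus and has_minus and sum(charges) == 0:
            return True
    return False


print(looks_zwitterionic("C1=CC=C(C=C1)CC(C(=O)[O-])[NH3+]"))  # zwitterionic phenylalanine
print(looks_zwitterionic("NC(Cc1ccccc1)C(=O)O"))               # neutral phenylalanine
```

Splitting on '.' keeps salts such as [Na+].[Cl-] from being counted, since the opposite charges there sit on different components.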
Moreover, there's a complicated relationship between CIDs and InChIs / InChIKeys. There are other cases where PubChem will have separate records for compounds that might be "the same" under InChI.
- Axial or non-traditional stereochemistry. For example hexahelicene (CID=98863) should have two enantiomers, but the InChI reflects no stereochemistry.
- "Extra" stereo centers. PubChem takes depositions in the SD file format, which allows 2D stereochemistry in wedge/hash notation. If you take the actual 3D geometry, you might realize the wedge/hash were used for appearance, not for indicating a stereo center (i.e., multiple CIDs generate the same InChI from the full 3D molecule).
There are also cases where PubChem indicates an InChI key computed from the 2D depiction in the SD file, but there are missing or undefined stereo centers.
So I'd say that because of incomplete stereochemistry and/or inconsistencies in representations, PubChem CIDs will not always match up with "structural uniqueness" and this is by design, both for PubChem and InChI.
The moral of the story, frankly, is that chemistry is complicated and coming up with "perfect" unique identifiers is incredibly hard.
Solution 3:
Martin's answer led me to discover an important extension of InChI that does allow for specification of some tautomer and zwitterion identification.
InChI identifiers that begin with InChI=1S/... are standard InChI. In standard InChI, the identifier "must be the same for any arrangement of mobile hydrogen atoms" (quoting Section 6 of the Technical FAQ of the InChI Trust). However, non-standard InChIs are also possible. These start with InChI=1/... (note the missing S). In non-standard InChI, there can be an extra layer beginning with /f, called the fixed-hydrogen layer.
Playing around with RDKit (via its Python API), I was able to produce a non-standard InChI that I think corresponds to PubChem compound 6925665.
from rdkit import Chem

# phenylalanine zwitterion, written with explicit charges
zwitterion_phe_smiles = 'C1=CC=C(C=C1)CC(C(=O)[O-])[NH3+]'
zwitterion_phe_mol = Chem.MolFromSmiles(zwitterion_phe_smiles)

# produces "standard" InChI, so not explicitly zwitterionic
print(Chem.MolToInchi(zwitterion_phe_mol))
# output: InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)

# produces "non-standard" InChI with fixed-H layer, so the zwitterion can be identified
print(Chem.MolToInchi(zwitterion_phe_mol, options='/FixedH'))
# output: InChI=1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/f/h10H

# going from the respective InChIs back to SMILES:
# the non-standard InChI gives the zwitterion, the standard InChI gives the neutral form
zwitterion_nonstandard_inchi = 'InChI=1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/f/h10H'
standard_inchi = 'InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)'
print(Chem.MolToSmiles(Chem.MolFromInchi(zwitterion_nonstandard_inchi)))
# output: [NH3+]C(Cc1ccccc1)C(=O)[O-]
print(Chem.MolToSmiles(Chem.MolFromInchi(standard_inchi)))
# output: NC(Cc1ccccc1)C(=O)O
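If you only need to tell which kind of InChI you are holding, you don't need a toolkit at all. The following is a minimal sketch that splits an InChI string on '/' and reports whether it is standard and whether it carries a fixed-hydrogen (/f) part. The function name describe_inchi and the returned dictionary keys are my own invention, and the parsing assumes nothing beyond the layout described above (version, then formula, then layers tagged by their first letter).

```python
def describe_inchi(inchi):
    """Split an InChI string into version, formula, main layers and fixed-H layers.

    Illustrative helper only; it does no chemical validation.
    """
    prefix, sep, body = inchi.partition("=")
    if prefix != "InChI" or not sep:
        raise ValueError("not an InChI string: %r" % inchi)
    parts = body.split("/")
    version = parts[0]                       # '1S' for standard, '1' for non-standard
    formula = parts[1] if len(parts) > 1 else ""
    layers = parts[2:]
    # Everything from the first layer starting with 'f' onward belongs to
    # the fixed-hydrogen part of a non-standard InChI.
    fixed_start = next(
        (i for i, layer in enumerate(layers) if layer.startswith("f")), None
    )
    if fixed_start is None:
        main_layers, fixed_h_layers = layers, []
    else:
        main_layers, fixed_h_layers = layers[:fixed_start], layers[fixed_start:]
    return {
        "standard": version.endswith("S"),
        "formula": formula,
        "main_layers": main_layers,
        "fixed_h_layers": fixed_h_layers,
    }


nonstandard = ("InChI=1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7"
               "/h1-5,8H,6,10H2,(H,11,12)/f/h10H")
standard = ("InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7"
            "/h1-5,8H,6,10H2,(H,11,12)")

for inchi in (standard, nonstandard):
    info = describe_inchi(inchi)
    print(info["standard"], info["fixed_h_layers"])
```

For the two phenylalanine strings, the main layers come out identical; the only difference is the version prefix and the trailing /f/h10H, which is where the zwitterion information lives.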
So tautomer-specific InChIs are possible for zwitterions. Worryingly, however, some kinds of tautomerism are not handled even by non-standard InChI. Again quoting from Section 6 of the FAQ:
In its current state, InChI recognizes the most common form of H migration (for the full list, see Table 6, Section IVb of the InChI Technical Manual). However, several ways of tautomeric migration that are not supported by default may appear important for some chemists. In particular, these are keto-enol and long-range tautomerisms.
Why PubChem chose to list standard InChI instead of non-standard InChI is not entirely clear to me. I suppose it is difficult to figure out programmatically when a non-standard InChI would be required. Ideally, PubChem would have both standard and non-standard InChIs for each of its molecules, but I'm not sure when, or if, that change will ever be made.