ReCGen: Compound structure generator

Refined Compound Generator

ReCGen provides three functions: a fragment DB creation tool, a DB merging tool, and a structure generation tool. Figure 1 shows the relationship between these tools. The ReCGen web service uses a subset of these tools.

Figure 1: The whole process from fragment creation to structure generation.

The number and types of virtual compound structures that ReCGen outputs depend on the fragment DB used. To generate structures, first fragment a compound library (SD file) with the Fragment DB Creation Tool to build a database of ECFP fragments. The compound library should be as structurally diverse as possible. The web application version uses a fragment database built from DrugBank compounds. Next, generate new structures from a MOL file that contains the core structure and the part to be transformed, using the fragment DB that has been created. The DB merge tool combines fragment DBs created from different chemical libraries into a single, larger fragment DB.
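
The merge step can be pictured with a minimal sketch. It assumes a hypothetical one-table schema (a `fragments` table whose `record` column holds the fragment text and serves as the key); the actual ReCGen schema is not described here, and duplicate detection by exact text is a simplification.

```python
import sqlite3

# Hypothetical schema: one row per fragment, keyed by its text record.
SCHEMA = "CREATE TABLE IF NOT EXISTS fragments (record TEXT PRIMARY KEY)"

def merge_fragment_dbs(target_path, source_paths):
    """Copy fragments from each source DB into the target, skipping duplicates.

    Returns the number of distinct fragments in the merged DB.
    """
    target = sqlite3.connect(target_path)
    target.execute(SCHEMA)
    for path in source_paths:
        source = sqlite3.connect(path)
        rows = source.execute("SELECT record FROM fragments").fetchall()
        source.close()
        # INSERT OR IGNORE drops rows whose key already exists in the target.
        target.executemany(
            "INSERT OR IGNORE INTO fragments (record) VALUES (?)", rows
        )
    target.commit()
    count = target.execute("SELECT COUNT(*) FROM fragments").fetchone()[0]
    target.close()
    return count
```

Merging two libraries that share some fragments yields a DB no larger than the sum of the two, because shared fragments are stored once.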

ECFP Fragment

An ECFP fragment is defined as a substructure centered on one atom that encompasses the atoms within a given number of bonds (linkages) of that atom. The number of bonds can be varied within a certain range for each central atom, producing fragments that range from small to large, which makes it possible to create fragment libraries with a variety of structures.

Figure 2. ECFP Fragment

Compared with RECAP rule-based generation, structure generation using ECFP fragments differs in the following ways.
(1) It can generate highly novel chemical structures.
(2) It can make precise structural changes and generate structures around a reference compound.

Fragmentation treats the molecular structure as a graph and is computed with a depth-first search. Figure 3 shows examples of ECFP fragments (26 types) generated from the structural formula of aspirin. The "U" and "Th" in the figure are markers that indicate the cut surface of the fragment; they are not actual uranium and thorium atoms bonded to it.

Figure 3. ECFP fragments generated from aspirin structure.
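
The idea of collecting the atoms around a center by depth-first search can be sketched as follows. This is an illustration only, not ReCGen's implementation: the molecule is assumed to be given as a plain adjacency dictionary of heavy-atom indices, the function names are hypothetical, and fragments are deduplicated by their atom sets for simplicity.

```python
def atoms_within(adjacency, center, max_bonds):
    """Depth-limited DFS: map each atom within max_bonds of center to its bond distance."""
    best = {center: 0}
    stack = [(center, 0)]
    while stack:
        atom, depth = stack.pop()
        if depth > best[atom]:
            continue  # stale entry: a shorter path to this atom was found later
        if depth == max_bonds:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[atom]:
            if neighbor not in best or depth + 1 < best[neighbor]:
                best[neighbor] = depth + 1
                stack.append((neighbor, depth + 1))
    return best

def enumerate_fragments(adjacency, radii):
    """Collect the distinct atom sets obtained for every center atom and radius."""
    fragments = set()
    for center in adjacency:
        for radius in radii:
            fragments.add(frozenset(atoms_within(adjacency, center, radius)))
    return fragments

def cut_bonds(adjacency, atoms):
    """Bonds that cross the fragment boundary, i.e. where a cut-surface marker is needed."""
    return {(a, b) for a in atoms for b in adjacency[a] if b not in atoms}

# Heavy atoms of acetic acid: 0 = CH3, 1 = carboxyl C, 2 = carbonyl O, 3 = hydroxyl O.
acetic_acid = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
methyl_fragment = frozenset(atoms_within(acetic_acid, 0, 1))  # atoms 0 and 1
boundary = cut_bonds(acetic_acid, methyl_fragment)            # bonds 1-2 and 1-3
```

Varying the radius per center, as in `enumerate_fragments(acetic_acid, [0, 1, 2])`, is what produces the mix of small and large fragments.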

Fragment Notation

There are various file formats for chemical structure information, but the standard one used is the SDF format (MOL format). ReCGen uses this structural format to represent ECFP fragments.

ReCGen's fragmentation can cut aromatic or saturated rings partway through, so it is necessary to keep information about whether each atom in an ECFP fragment was part of an aromatic ring or another ring structure. Without this information, the appropriate molecular structure cannot be reconstructed from the ECFP fragment. We have therefore extended the standard specification of the SDF format to represent ECFP fragments.

Figure 4 shows an example of an ECFP fragment in SDF format. The "U" and "Th" shown in blue in Figure 4 are heavy atoms that mark the cut surface of the fragment. They encode the kind of structure the original atom belonged to when the compound was fragmented, as shown in Table 1. In addition, the bond valence values shown in red are special numbers assigned when the original bond belonged to an aromatic or non-aromatic ring. The details of the bond valence values are given in Table 2.

Figure 4. Internal representation of ECFP fragment.

Chemical symbol   Description
U                 Atoms in the chain structure
Th                Atoms in the aromatic ring structure
Pa                Atoms in non-aromatic ring structures

Table 1. Heavy atoms

Bond valence   Description
10             Bonds in aromatic rings
11             Single bonds in non-aromatic rings
12             Double bonds in non-aromatic rings
13             Triple bonds in non-aromatic rings

Table 2. Bond valence
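
The two tables can be read as simple lookup rules. The sketch below encodes them in Python; the function names are hypothetical, and the handling of chain bonds (keeping the standard MOL bond types 1, 2, and 3) is an assumption, since the tables only list ring bonds.

```python
# Table 1: marker atom -> environment of the original atom.
MARKER_ATOMS = {
    "U": "chain",
    "Th": "aromatic ring",
    "Pa": "non-aromatic ring",
}

# Table 2: extended bond value -> meaning.
RING_BOND_VALUES = {
    10: "bond in an aromatic ring",
    11: "single bond in a non-aromatic ring",
    12: "double bond in a non-aromatic ring",
    13: "triple bond in a non-aromatic ring",
}

def marker_for(in_ring, aromatic):
    """Pick the cut-surface marker for an atom from its ring environment."""
    if aromatic:
        return "Th"
    return "Pa" if in_ring else "U"

def bond_value(order, in_ring, aromatic):
    """Return the bond value to write for a bond of the given order (1, 2, or 3)."""
    if aromatic:
        return 10
    if in_ring:
        return 10 + order  # 11, 12, 13 for single, double, triple ring bonds
    return order           # assumption: chain bonds keep the standard MOL value
```

For example, a double bond in a cyclohexene ring would be written as 12, while the same double bond in an open chain would stay 2 under this assumption.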

Fragment DB

As the bond path length increases, ECFP fragmentation produces an increasingly varied set of substructure fragments. The number of ECFP fragments that ReCGen generates can therefore be huge, and the computation time required for structure generation grows accordingly. For this reason, ReCGen uses SQLite, an RDBMS, to manage and search the fragment structures.

In order to find a fragment structure that can be properly connected to the input compound, it is necessary to perform a substructure search of the fragment DB. Fingerprints are generally used to efficiently search for compounds. ReCGen also uses fingerprints to search for substructures, but since ECFP fragments contain special-purpose atoms such as U atoms, we have created our own fingerprints. Fingerprints used for searching are as follows.

The presence or absence of a total of 342 substructures is represented by 1 or 0, resulting in a 342-bit fingerprint.

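  • 18 atom types are defined for the structural formula: C, Car, N, Nar, O, Oar, S, Sar, P, Par, F, Cl, Br, I, (B, Si, Se), Th, Pa, U. The bracketed (B, Si, Se) group counts as one type, so the 18 types give 18 × 19 / 2 = 171 unordered pairs.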
  • Whether atomic pairs with path length 1 are included as substructures (171 types)
  • Whether atomic pairs with path length 2 are included as substructures (171 types)
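
The bit count can be checked with a short sketch. Counting the bracketed (B, Si, Se) group as a single type, the list above contains 18 atom types, and unordered pairs of 18 types (same-type pairs included) number 171. The ordering of pairs and bits below is an assumption chosen for illustration, as are the function names; the original does not specify the layout.

```python
from itertools import combinations_with_replacement

ATOM_TYPES = ["C", "Car", "N", "Nar", "O", "Oar", "S", "Sar", "P", "Par",
              "F", "Cl", "Br", "I", "B/Si/Se", "Th", "Pa", "U"]

# Unordered pairs, including same-type pairs such as Car-Car.
PAIRS = list(combinations_with_replacement(ATOM_TYPES, 2))
PAIR_INDEX = {frozenset(pair): i for i, pair in enumerate(PAIRS)}
N_PAIRS = len(PAIRS)  # 171

def bit_position(a, b, path_length):
    """Bit index of the pair (a, b): path length 1 first, then path length 2."""
    return (path_length - 1) * N_PAIRS + PAIR_INDEX[frozenset((a, b))]

def fingerprint(pairs_len1, pairs_len2):
    """Build the fingerprint as a Python int from the pairs found at each path length."""
    fp = 0
    for path_length, pairs in ((1, pairs_len1), (2, pairs_len2)):
        for a, b in pairs:
            fp |= 1 << bit_position(a, b, path_length)
    return fp
```

Because the pairs are unordered, `bit_position("C", "Car", 1)` and `bit_position("Car", "C", 1)` refer to the same bit.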

Figure 5 shows an example for nicotinic acid amide (nicotinamide). In this example, 11 of the 342 bits are set to 1. In the SQLite implementation, the 342-bit fingerprint is stored as six 64-bit integers.

Path length   Fingerprint structure
1             Car-Car, Car-Nar, C-N, C-O, C-Car
2             Car--Car, Car--Nar, Car--N, Car--O, C--Car, N--O

Figure 5. Example of atom types of fingerprint used for searching
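
How a 342-bit fingerprint might be stored and screened in SQLite can be sketched as below. The table layout, column names, and function names are hypothetical. The sketch also assumes 57 bits per integer (6 × 57 = 342), which keeps every value inside SQLite's signed 64-bit INTEGER range; the original only states that six 64-bit integers are used.

```python
import sqlite3

WORDS, BITS = 6, 57          # assumption: 57 fingerprint bits per 64-bit integer
MASK = (1 << BITS) - 1

def split_fingerprint(fp):
    """Split a fingerprint (Python int) into six 57-bit words, lowest bits first."""
    return tuple((fp >> (i * BITS)) & MASK for i in range(WORDS))

def create_db():
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"w{i} INTEGER NOT NULL" for i in range(WORDS))
    conn.execute(f"CREATE TABLE fragments (id INTEGER PRIMARY KEY, {cols})")
    return conn

def add_fragment(conn, frag_id, fp):
    conn.execute(
        "INSERT INTO fragments VALUES (?, ?, ?, ?, ?, ?, ?)",
        (frag_id, *split_fingerprint(fp)),
    )

def screen(conn, query_fp):
    """Ids of fragments whose fingerprint contains every bit set in query_fp."""
    words = split_fingerprint(query_fp)
    where = " AND ".join(f"(w{i} & ?) = ?" for i in range(WORDS))
    params = [value for word in words for value in (word, word)]
    sql = f"SELECT id FROM fragments WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]
```

A fingerprint screen of this kind only rules fragments out; candidates that pass would still need a full substructure match.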

Principle of structure generation

The structural transformation implemented in ReCGen does not simply combine fragments, but combines them with certain overlapping parts (sticky areas). By using this method, the generation of structures that cannot be synthesized in reality can be suppressed to some extent. Specifically, as shown in Figure 6, the following four steps are used to generate the new structure.

  1. Cut out the fragment in the sticky area
  2. Create atomic index correspondences
  3. Align the input structure and the fragment
  4. Remove nonessential atoms and create bonds to complete the structure

In step 1, the sticky-area fragment is defined as the ECFP4 fragment covering path lengths up to 2, centered on At. If there are multiple At atoms positioned close to each other, they are handled by extending the sticky area, as described in detail later. The fragment to be combined is retrieved by a substructure search of the fragment DB for fragments that contain the sticky area. In step 2, the atoms to be superimposed are assigned based on the substructure search results. In step 3, the two molecules are superimposed at the corresponding atoms. In step 4, the fragment and the input structure are combined into a complete structure. Steps 2 through 4 are repeated for each fragment obtained in the DB search.

Figure 6. Principle of ECFP fragment binding

If the input structure contains more than one At, the sticky area is extended. For example, for an input structure with two At atoms, as shown in Fig. 7, the sticky areas built from each At share an overlapping atom. In such cases, the sticky areas are joined into a single sticky area. With this extension, ReCGen can handle both ring formation and branching structures.

Figure 7. Extension of sticky area

Conversely, as shown in Fig. 8, when the two sticky areas share no atoms, no extension is needed and only a branching structure can be formed.

Figure 8. An example of a branch that can always be created without being extended
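
The two cases above (overlapping sticky areas that are joined, and separate sticky areas that are left alone) can be illustrated with a small graph sketch. It assumes the input structure is an adjacency dictionary, that a sticky area is the set of atoms within path length 2 of its center including the center itself, and that the function names are hypothetical.

```python
from collections import deque

def sticky_area(adjacency, center, radius=2):
    """Atoms within `radius` bonds of `center`, found by breadth-first search."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)

def merge_overlapping(areas):
    """Join sticky areas that share at least one atom; keep the others separate."""
    merged = []
    for area in areas:
        area = set(area)
        untouched = []
        for other in merged:
            if area & other:
                area |= other   # overlapping areas become one extended area
            else:
                untouched.append(other)
        merged = untouched + [area]
    return merged

# Toy input: backbone atoms 1..8, with marker atoms 0, 9 and 10 attached
# to backbone atoms 1, 3 and 8 respectively.
toy = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4, 9], 4: [3, 5], 5: [4, 6],
       6: [5, 7], 7: [6, 8], 8: [7, 10], 9: [3], 10: [8]}
areas = merge_overlapping([sticky_area(toy, c) for c in (0, 9, 10)])
```

Here the areas around atoms 0 and 9 share atom 2 and are joined, while the area around atom 10 stays on its own.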

Multi-step generation

Multi-step generation means that a fragment is added to the input structure once, and then another fragment is added to the result, as shown in Figure 9. As mentioned above, basic structure generation in ReCGen defines a sticky area from the cut surface of the input structure and then joins ECFP fragments so that the sticky areas overlap. The cut surface of an ECFP fragment can also be used to define a sticky area, so ECFP fragments can be classified by the number of sticky areas that can be defined on them. We call a cut surface of an ECFP fragment where a sticky area can be defined a "connector". As can be seen in the middle row of Fig. 9, after an ECFP fragment is added once, the structure may have zero, one, or more connectors. ReCGen proceeds to the next level of generation only if the number of connectors in the structure with the ECFP fragment is the same as the number in the input structure.

Figure 9. Principle of two-step generation
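
The rule for moving to the next level can be sketched as a loop that keeps only structures whose connector count matches the input. The sketch is deliberately abstract: `attach` and `count_connectors` are hypothetical callables supplied by the caller, and the toy example uses strings with "*" as a stand-in for a connector, which is not ReCGen's representation.

```python
def multi_step_generate(input_structure, attach, count_connectors, steps):
    """Return the structures produced at each step as a list of lists."""
    target = count_connectors(input_structure)
    frontier = [input_structure]
    levels = []
    for _ in range(steps):
        produced = [new for parent in frontier for new in attach(parent)]
        levels.append(produced)
        # Only structures with the same number of connectors as the input
        # are carried forward to the next step.
        frontier = [s for s in produced if count_connectors(s) == target]
        if not frontier:
            break
    return levels

# Toy stand-ins: a structure is a string and "*" marks a connector.
TOY_FRAGMENTS = ["C*", "O", "N(*)*"]

def toy_attach(structure):
    if "*" not in structure:
        return []
    return [structure.replace("*", frag, 1) for frag in TOY_FRAGMENTS]

def toy_count(structure):
    return structure.count("*")

levels = multi_step_generate("C*", toy_attach, toy_count, steps=2)
```

In the toy run, the first step yields structures with one, zero, and two connectors, and only the one-connector structure is extended in the second step.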

Example of structure generation

ReCGen allows for various types of structure generation. The structure shown in Fig. 10 is an example of input that can be handled by ReCGen. If you save the file in MOL format using an appropriate structure drawing tool, it can be used as an input structure file for ReCGen.

  • Example of two-point conversion

  • Example of two-step conversion

  • Two-step structural interpolation

  • Structure formation from a core structure

