About Them, Chemical Structures

Over the past week, some of the key learnings and experiments from our master's project were:

  • When presenting a research topic, it helps to accompany the presentation with visual aids. For example, we could have shown more images from our example dataset to convey the complexity of the topic. Many of the images are blurry or otherwise hard to decipher, which complicates the task and makes additional pre-processing steps worth considering, such as using GANs to improve image quality. (Attaching more examples of the images below.)
  • For instance, in the last photo, the "Cl" symbol is blurred, and the "N" and "H" symbols lie so close to each other that it's hard to make out the boundary between them.
  • As we started working with the dataset, we realized that enlarging it could yield better results: every image can also be included in its different rotations and flips. In effect, the same chemical formula appears in the dataset several times, but in different orientations. This increases the size and variety of the dataset without the need to look for external examples (at least for the initial end-to-end product). So far, we've been able to implement the following types of rotations and flips of the same image, using the OpenCV library:

However, as one can see, there's a problem with flipping characters such as "S" and "OH" in a way that preserves their orientation (this is the next step of the problem that I don't yet know how to solve; I would appreciate any help with this!). I did hear mentions of Google Image API, but I'm not sure how much it can help with this rule-based type of augmentation.
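To make the issue concrete, here is a minimal sketch of the eight rotation/flip variants of one image. It is not our actual pipeline: it uses plain Python lists instead of OpenCV arrays so that it runs without dependencies, and the helper names (`rotate_90_clockwise`, `flip_horizontal`, `augment`) are made up for illustration. With OpenCV, the corresponding calls would be `cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)` and `cv2.flip(img, 1)`.

```python
# Pure-Python sketch: an image is a list of rows (lists of pixel values).
# All function names here are hypothetical helpers, not OpenCV APIs.

def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise: reverse the row order, then transpose."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror left-to-right (a flip around the vertical axis)."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the 8 rotation/flip variants as (name, image, is_mirrored) tuples."""
    variants = []
    current = [list(row) for row in image]
    for quarter_turns in range(4):
        angle = 90 * quarter_turns
        # Pure rotation: text labels end up rotated, but not mirrored.
        variants.append((f"rot{angle}", current, False))
        # Rotation followed by a flip: text labels end up mirrored.
        variants.append((f"rot{angle}_flip", flip_horizontal(current), True))
        current = rotate_90_clockwise(current)
    return variants


# Toy 3x3 "image" with no symmetry, so all 8 variants are distinct.
toy = [
    [1, 2, 0],
    [0, 3, 0],
    [0, 0, 0],
]
variants = augment(toy)
distinct = {tuple(map(tuple, img)) for _, img, _ in variants}
mirrored_names = [name for name, _, mirrored in variants if mirrored]
print(len(distinct), mirrored_names)
```

This does not solve the orientation problem. It only records which variants contain mirrored glyphs (every variant that includes a flip), so those could be handled separately, for example by excluding them or by treating the text labels differently from the bond lines.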

  • I did try Pytesseract on individual characters, but more often than not it failed to decipher the symbol correctly.
  • Another key learning is that, apart from the numerical accuracy metric (the Levenshtein distance), it would be interesting to explore semantic evaluation, i.e. how close the results are to what chemists expect, without the need to go into the nitty-gritty numerical details. This once again ties in with the usefulness of our final product to customers.
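For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal dynamic-programming sketch; the function names and the example strings are illustrative assumptions, not the exact labels or code from our project.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string as the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical prediction vs. reference: only the halogen differs.
predicted, reference = "C6H5Br", "C6H5Cl"
print(levenshtein(predicted, reference))
print(round(normalized_similarity(predicted, reference), 3))
```

The example also hints at why a purely numerical metric may not be enough: two strings that are only two edits apart can still describe different molecules, which is where a semantic evaluation would add information.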

Would really love to hear feedback and any useful comments about the augmentation issue!

Thanks

2 thoughts on “About Them, Chemical Structures”

  1. Steve Kaisler

    I have three thoughts that may help you:

    You don't have to rely on just image processing to work this problem. You can also use collateral databases to assist you in recognizing the molecule that is represented by a particular structure. This includes additional information about the individual elements comprising the molecule.

    You should also consider edge and line detection algorithms, such as a Hough transform, to recover the edges more cleanly.

    Let's pick the first structure above. What data can we glean from it?
    a. You can pick out the letters from the image: O, N, Br, NH, OH, and H.
    b. OH indicates a hydroxyl group, which may point to an alcohol, a phenol, or an acid.
    c. NH is an indicator of an amine.
    d. N has valence 3, O has valence 2, H has valence 1, and bromine has valence 1.
    e. The molecule is built around a benzene ring, which suggests it is an aromatic compound.

    Now, you can search a chemical database and ask: What molecules have a benzene ring with a bromine atom attached? Perhaps even specifying position 6.

    The ring with 4 Cs and an amine attached is another indicator of a submolecule.

    So, there are two approaches: You can segment the molecule and search a database for the submolecules. You can also ask what submolecules with a benzene ring have an OH radical attached.

    You can also try to do pattern matching to help identify submolecules. As part of this, you can fill in some of the atoms which do not have labels, because there are only a small number of molecules with those configurations.

    Another approach is to start with a benzene ring with bromine attached and ask how it can evolve into different molecules through chemical reactions.

    Another approach is to start with the benzene ring with bromine, determine its formula, then start searching existing molecules for that substring. On the one hand you are starting with the image, but on the other hand you are starting with a partial formula and trying to find where they meet in the middle.

    So, what I am suggesting is that problems of this nature need multiple analytics and methods rather than a focus on just one.
