Ideas about Datasets

Fun fact: I’m writing this blog post on a bus heading from NYC back to DC.

Choosing appropriate datasets is definitely not easy. However, as a team we have done fairly extensive research on our topic (recognition of chemical structures) and already have an idea of what we will use.

First, since the original competition on optical recognition of chemical structures ran on Kaggle, about four million images of chemical structures are readily available for download, which should be enough at least for a first, simple end-to-end product (https://www.kaggle.com/c/bms-molecular-translation/data).
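
Since the Kaggle labels are InChI strings, a quick sanity check on any label is to pull out its molecular-formula layer and count the atoms. The snippet below is only a minimal sketch of that idea: the helper names `formula_layer` and `element_counts` are our own (hypothetical), and it assumes a single-component formula without dots or leading multipliers.

```python
import re
from collections import Counter

# One element symbol (e.g. "C", "Cl") followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_layer(inchi: str) -> str:
    """Return the molecular-formula layer of an InChI string.

    Assumption: the string looks like "InChI=<version>/<formula>/...".
    """
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")
    return parts[1]


def element_counts(formula: str) -> dict[str, int]:
    """Count atoms per element in a simple formula such as "C2H6O".

    Assumption: one component only (no "." and no leading multiplier).
    """
    counts = Counter()
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)


# Ethanol, used here purely as an illustrative label.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(element_counts(formula_layer(ethanol)))
```

Counts like these only describe the composition, not the structure, but they give a cheap way to check whether labels from two sources could be describing the same molecule.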

Second, since the topic has been studied before, a lot of external data is available, such as the USPTO patent datasets, which contain images related to our topic (https://developer.uspto.gov/data). PubChem is another publicly available source (https://pubchem.ncbi.nlm.nih.gov).

ChemSpider is another freely available source of chemical structure images (http://www.chemspider.com). We haven’t actually seen it used in any of the studies we came across, but the data does exist.

2 thoughts on “Ideas about Datasets”

  1. skaisler

    Narmin:

    Some thoughts: I would suggest focusing on just one class of compounds/chemical structures for your initial prototype.

    In your next blog, perhaps you can give some thoughts on how you might recognize chemical structure.

    I looked on PubChem for example structures. Here is an arbitrary one that I picked.

    https://pubchem.ncbi.nlm.nih.gov/compound/73
    which is: adenosine-3',5'-bisphosphate
    Looking at it and then its formula: C10H15N5O10P2
    One sees there is a significant difference between the image and chemical formula.
    One sees what appear to be a couple of benzene (?) rings in the structure, but you don't get this from the chemical formula.

    If you look at https://pubchem.ncbi.nlm.nih.gov/compound/73#section=Structures, you see that the major atoms are labeled.

    I would suggest that you focus on using just the 2D images for this project.

    I would suggest that this problem appears to be a pattern matching problem. Since the structure is a graph, you might want to consider matching subgraph images to the structure as an initial step in determining its composition.

    Note that Section 3.1.4 on that page gives you an expanded formula description of the structure in terms of atoms and bonds between them (with carbon atoms not marked).

    I will be interested to see how you will proceed.
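
The subgraph-matching suggestion in the comment above could look roughly like the sketch below. It assumes a graph of labeled atoms and bonds has already been extracted from the image, which is the hard part and is not shown here. It ignores bond orders, and the function name `find_subgraph` and the data layout are hypothetical choices for illustration. It uses plain backtracking, which is fine for small patterns but not tuned for speed.

```python
def find_subgraph(pattern_atoms, pattern_bonds, target_atoms, target_bonds):
    """Return one mapping pattern-node -> target-node, or None if no match.

    atoms: dict mapping node id -> element symbol
    bonds: iterable of (node, node) pairs; bond order is ignored here
    """

    def adjacency(atoms, bonds):
        adj = {node: set() for node in atoms}
        for a, b in bonds:
            adj[a].add(b)
            adj[b].add(a)
        return adj

    p_adj = adjacency(pattern_atoms, pattern_bonds)
    t_adj = adjacency(target_atoms, target_bonds)
    order = list(pattern_atoms)  # pattern nodes are mapped in this order

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        p = order[len(mapping)]
        for t in target_atoms:
            if t in mapping.values():
                continue  # each target atom may be used only once
            if pattern_atoms[p] != target_atoms[t]:
                continue  # element labels must agree
            # Every already-mapped pattern neighbour must be bonded to t.
            if all(mapping[q] in t_adj[t] for q in p_adj[p] if q in mapping):
                mapping[p] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[p]  # backtrack
        return None

    return extend({})


# Heavy atoms of ethanol as the target: C0 - C1 - O2.
target_atoms = {0: "C", 1: "C", 2: "O"}
target_bonds = [(0, 1), (1, 2)]

# Pattern: a carbon bonded to an oxygen.
pattern_atoms = {"a": "C", "b": "O"}
pattern_bonds = [("a", "b")]

print(find_subgraph(pattern_atoms, pattern_bonds, target_atoms, target_bonds))
```

A library of small patterns (rings, common functional groups) could be matched this way one by one, which fits the suggestion of starting with a single class of compounds.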

  2. pless

    ok, so I like this post. You've done a good job identifying possible datasets. What are their similarities and differences? Are the labels really the same? Can you merge them all into just one? Can you use one to train your system and test effectively on the other one?

    I hope that you had fun in NYC!
