The project I will be doing is about translating images of chemical molecules into InChI, a standard text representation that describes the molecule.
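To make the target format concrete, here is a minimal sketch of how an InChI string breaks down into layers separated by slashes. The ethanol string is a standard example; the helper name `split_inchi` is hypothetical, and the parsing is deliberately simplified (it assumes a formula layer is present and that each later layer prefix appears only once).

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers (simplified)."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue
        # Later layers start with a one-letter prefix,
        # e.g. 'c' for connections and 'h' for hydrogens.
        layers[part[0]] = part[1:]
    return layers


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = split_inchi(ethanol)
print(layers)
```

A model for this task has to produce the whole string, so every layer counts toward a correct prediction.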
The data source for this project is published on Kaggle as part of the Bristol-Myers Squibb competition. The dataset contains around 4 million images, already arranged in a convenient file structure. This is the main data source I will use.
After a careful literature review, I identified another potential data source: PubChem, a database that is the biggest source of data for related research projects. InChI representations are readily available in the database and could be used to increase the number of training images.
Another potential data source identified during the literature review is USPTO, which practitioners have used heavily. I will review this source in detail and may also use it to obtain data.
Additionally, I plan to use data augmentation to increase the number of training images. Although augmentation is a technique rather than a data source, it is worth noting because it will help prepare the training set.
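As a rough illustration of the augmentation idea, the sketch below rotates an image by a random multiple of 90 degrees and adds light salt-and-pepper noise, keeping the InChI label unchanged. Everything here is an assumption for illustration: the function names, the toy binary "image" stored as nested lists, the noise probability, and the placeholder label. A real pipeline would more likely use an image library. Flips are left out on purpose, since mirroring a drawing may change the stereochemistry it depicts.

```python
import random


def rotate90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def add_noise(image, flip_prob, rng):
    """Flip each binary pixel with probability flip_prob (salt-and-pepper noise)."""
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in image]


def augment(image, label, rng, flip_prob=0.02):
    """Return a randomly rotated, lightly noised copy; the label stays the same."""
    out = image
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        out = rotate90(out)
    return add_noise(out, flip_prob, rng), label


# Toy 2x3 binary image with a placeholder label (not a real drawing).
toy = [[1, 0, 0],
       [1, 1, 0]]
rng = random.Random(0)  # seeded so the example is repeatable
aug_image, aug_label = augment(toy, "InChI=1S/H2O/h1H2", rng)
```

Each call yields a new training pair for the same molecule, which is how augmentation increases the number of training items without adding a new data source.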
Fidan:
I replied to Narmin with some suggestions.
Please see her blog for my suggestions.
This is a great post... identifying multiple data sources. Can you be clearer about what the labels are? I wonder if you can train your network to provide the chemical names, BUT ALSO train a network with more general labels: "This is an acid", "This is a lipid".
This would make sense if you can get access to labels like this automatically from one of your datasets.