
In this post, I will try to summarize what I have learned and applied during the past two weeks. First of all, our team prepared the initial briefing of our project and presented it to the group. It was exceptionally useful, since we received valuable feedback to act on. For the final briefing, we decided to add more visualizations, possibly a GIF of our data augmentation process, and an additional metric to measure our success.

We have prepared the initial data augmentation, which you can find in Narmin's blog, so I will not repeat it here 🙂 I was responsible for choosing and running one of the existing solutions from Kaggle. My final choice was the TPU notebook, which was the fastest of all the candidates. It uses an encoder/decoder architecture with attention units. The score was moderate, and it will be interesting to see what happens once we add our augmentations.
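
To give a feel for the attention idea at the heart of that notebook, here is a minimal, framework-free sketch of scaled dot-product attention for a single query. This is my own toy illustration, not the notebook's actual code, which uses a deep-learning framework:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query: list[float]; keys, values: list[list[float]].
    Returns the attention-weighted sum of the value vectors.
    """
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim)]
```

In a decoder, the query comes from the current decoding step and the keys/values come from the encoder's outputs, so each generated character can "look at" the relevant part of the image features.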

I have also read a lot about Vision Transformers and find them really exciting! We will probably try to implement one for our project soon. I am looking forward to putting the theoretical knowledge I gained into practice!
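
As a rough illustration of the core idea: a Vision Transformer first cuts an image into fixed-size patches, which are then treated like tokens in a sequence. A minimal sketch of that patching step (names and shapes are my own, not from any specific implementation) might look like:

```python
def image_to_patches(img, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch blocks, each flattened row-major.

    This is the first step of a Vision Transformer, before the
    patches are linearly embedded and fed to the transformer.
    """
    h, w = len(img), len(img[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = [img[r + i][c + j]
                     for i in range(patch)
                     for j in range(patch)]
            patches.append(block)
    return patches
```

A 4x4 image with patch size 2, for example, yields four patches of four pixels each, which would then play the role of a four-token input sequence.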


The project I will be doing is connected to the translation of chemical molecule images into InChI, a standard text-based representation that describes a molecule.
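
To give a concrete feel for the format, here is a small sketch using the well-known standard InChI of ethanol (an example chosen purely for illustration, not from our dataset). Each slash-separated layer encodes one aspect of the molecule, such as the chemical formula or the atom connectivity:

```python
# The standard InChI of ethanol, used here purely for illustration.
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# Slash-separated layers: version prefix, chemical formula,
# connectivity layer ("c..."), and hydrogen layer ("h...").
prefix, formula, *layers = inchi.split("/")
print(formula)  # C2H6O
```

Our model's job is essentially to produce such a string, character by character, from a rendered image of the molecule.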
The data source for this project is readily available on Kaggle under the Bristol-Myers Squibb competition. The dataset includes around 4 million images, arranged in a convenient file structure to work with. This is the primary data source to be used.
After a careful literature review, another source of potential data was identified. There is a database called PubChem, which is one of the largest sources of data for related research projects. InChI representations are readily available there and could be used to increase the number of training images.
Another potential data source identified during the literature review is USPTO, which has been heavily used by practitioners. This source will be reviewed in detail and may also be used to obtain data.
Additionally, data augmentation is planned, which will increase the number of training images. Although it is a technique rather than a data source, it is worth noting here because augmentation will help prepare the training set.
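
As a toy illustration only (the actual augmentation pipeline for this project is a separate piece of work), simple transforms such as rotations, flips, and pixel noise on a small binary image might look like:

```python
import random

def rotate90(img):
    # Rotate a 2-D image (list of rows) 90 degrees clockwise.
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    # Mirror each row horizontally.
    return [row[::-1] for row in img]

def add_noise(img, p=0.05, seed=None):
    # Flip each binary pixel with probability p (salt-and-pepper noise),
    # simulating scanning artifacts in molecule drawings.
    rng = random.Random(seed)
    return [[1 - px if rng.random() < p else px for px in row]
            for row in img]

img = [[0, 1, 0],
       [1, 1, 1],
       [0, 0, 0]]
augmented = [rotate90(img), hflip(img), add_noise(img, p=0.1, seed=0)]
```

Each augmented copy keeps the same InChI label as the original image, so every transform effectively multiplies the size of the training set.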

The first week of lectures is over. Just two classes, but so much information obtained. The first class was a great intro to the subject itself, from which I learned some basic course information.

One of the most useful concepts learned was George Heilmeier's Catechism, which I believe is necessary for any project. Its simple questions help identify whether an idea is really worth pursuing or doomed from the start.

Additionally, even though I knew that documentation is important for any project, I was persuaded once more that investing time in good documentation is sometimes worth more than other improvements; it is one of the key things to pay attention to when working on a project. Experimental and creative research types are both interesting in their own way and may be more or less important depending on the discipline.

Last but not least, the MHET project was interesting to learn about, and the fact that it is being developed in a language other than Python or R made it stand out for me, as I am less familiar with other tools.

Overall, this was a week full of new concepts and people. I would like to thank everyone who took part in it 🙂

Hey there! My name is Fidan Musazade, and I am currently a graduate student at ADA and GWU studying computer science. At the same time, I am a Leading Data Scientist at the International Bank of Azerbaijan (IBAR).

I was born in Baku, Azerbaijan, and lived there until 2010, when I moved to Minsk, Belarus, where I spent four amazing years. Afterward, I returned to Azerbaijan and started my university life. I obtained my Bachelor's degree in Business Administration but later decided to shift to a more technical field, so I started working as a Data Scientist at IBAR in early 2019.

My primary interests in computer science are NLP and both text-to-speech and speech-to-text analysis. I have written a paper on existing text-to-speech systems and how they can be applied in the banking sector, but I have not published it anywhere yet.

Other than that, I have a keen interest in learning new languages. I am fluent in Russian, Turkish, Azerbaijani, English, and German, and I have some limited knowledge of Spanish and Belarusian. Hopefully, this list will be expanded someday. Additionally, I love swimming and sunbathing 🙂

Looking forward to an exciting semester!

