
Some of the key learnings and experiments from the past week of work on our Master's project:

  • Whenever presenting a research topic, it's better to accompany the presentation with visual aids. For example, we could have shown more images from our example dataset to convey the complexity of the topic. Many of the images are blurry or otherwise hard to decipher, which complicates the research task and makes it necessary to consider additional pre-processing steps, such as using GANs to improve image quality. (Attaching more examples of the images below.)
  • For instance, in the last photo, one can see that the "Cl" symbol is blurred, and the "N" and "H" symbols lie so close to each other that it's hard to make out the line separating them.
  • As we started working with the dataset, we realized that enlarging it could yield better results: every image can also be included in its rotated and flipped versions. So, in a way, the same chemical formula can be present in the dataset several times but displayed in different orientations. This increases the size and variety of the dataset without the need to look for external examples (at least for the initial end-to-end product). So far, we've been able to implement the following types of rotations and flips of the same image, using the OpenCV library:

However, as one can see, there's a problem with flipping characters such as "S" and "OH": the flip mirrors the letters as well, so they lose their correct orientation. This is the next step of the problem, and I don't yet know how to solve it - would appreciate any help with this! I did hear mentions of the Google Image API, but I'm not sure how much it can help with this rule-based type of augmentation.
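To make the rotation / flip idea concrete, here is a minimal sketch of how the eight orientations of one image could be generated. It is not our actual project code: it uses NumPy instead of OpenCV so that it runs on a toy array, and the function name `dihedral_variants` and the variant names are made up for illustration. Since OpenCV images are NumPy arrays, the same variants could presumably be produced with `cv2.rotate` and `cv2.flip`.

```python
import numpy as np


def dihedral_variants(image):
    """Return the 8 rotation / mirror variants of an image array.

    `image` is an H x W (grayscale) or H x W x C array, which is also
    how OpenCV represents images. The names used as keys are
    hypothetical, chosen only for this sketch.
    """
    variants = {}
    for mirrored in (False, True):
        # np.fliplr mirrors left-to-right (same idea as cv2.flip(img, 1)).
        base = np.fliplr(image) if mirrored else image
        for k in range(4):
            # np.rot90 rotates counterclockwise by k * 90 degrees.
            name = f"rot{90 * k}" + ("_mirrored" if mirrored else "")
            variants[name] = np.rot90(base, k)
    return variants


if __name__ == "__main__":
    # A tiny 2 x 3 "image" standing in for a real structure drawing.
    toy = np.arange(6).reshape(2, 3)
    for name, variant in dihedral_variants(toy).items():
        print(name, variant.shape)
```

Note that the four `_mirrored` variants are exactly the ones where letters such as "S" come out reversed, while the four pure rotations turn the letters but never mirror them. Restricting the augmentation to the pure rotations might therefore be a first workaround, although it does not keep the labels upright.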

  • I did try Pytesseract on individual characters, but more often than not it failed to decipher the symbol correctly.
  • Another key learning is that apart from the numerical accuracy metric, i.e. the Levenshtein distance, it would be interesting to explore semantic evaluation, i.e. how close the results are to what chemists expect, without going into the nitty-gritty numerical details. This once again ties in with the usefulness of our final product to customers.
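For reference, the Levenshtein distance mentioned above counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Below is a small self-contained sketch of the standard dynamic-programming version; the function name and the two example strings are only for illustration and are not taken from our code or dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    # previous[j] = distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    truth = "InChI=1S/H2O/h1H2"      # water
    predicted = "InChI=1S/H2S/h1H2"  # hydrogen sulfide
    print(levenshtein(predicted, truth))  # prints 1
```

The example also hints at why a semantic evaluation could be worth exploring: a distance of just 1 looks like a near-perfect prediction numerically, yet the two strings describe different compounds.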

Would really love to hear feedback and any useful comments about the augmentation issue!

Thanks

Fun fact: I’m writing this blog on a bus heading from NYC back to DC.

Choosing appropriate datasets is definitely not easy - however, we (as a team) have done quite extensive research on our topic (recognition of chemical structures) and already have a good idea of what we will use.

Firstly, since the original competition on optical recognition of chemical structures ran on Kaggle, about four million images containing chemical structures are readily available for download, at least for the very first simple end-to-end product (https://www.kaggle.com/c/bms-molecular-translation/data).

Secondly, since the topic has been studied before, a lot of external data is available - such as the USPTO dataset of different patents containing images related to our topic (https://developer.uspto.gov/data). PubChem is another publicly available source to use (https://pubchem.ncbi.nlm.nih.gov).

ChemSpider (http://www.chemspider.com) is another freely available source of chemical structure images. We haven’t actually seen it used in any of the studies we've read, but the dataset does exist.

Some of my key learnings from this week include the following:

  1. Any official scientific research, such as our Master's thesis, requires a complete understanding of why it's being done, what it will change in the world around us and how (i.e. what's new?), what the risks are and how we can mitigate them, what the financial costs are, and whether there are any competitors in the market. Research performed for a business purpose, for example, must have a solid foundation because resources (both financial and human) are limited. And, since businesses care a lot about success, the research process has to follow a time-stamped plan with regular evaluations and set milestones. In other words, there must be a goal and a timeline for reaching it.
  2. Our Master's thesis should not be some mandated topic communicated to us from the top down, but instead something we feel passionate about and something that can make a change in the world, i.e. leave an impact, even if it's a minuscule one. In the end, it's all about our purpose in life and career.
  3. The thesis is not just some theoretical paper but mostly a practical application: an end-to-end system that solves a particular problem. In other words, there must be a problem statement, and our thesis is the solution. (A quote I remembered: "Don't start writing until you know the solution.")
  4. Data collection comes with a lot of challenges - not only in terms of finding sources but mostly in terms of making sure that ethical requirements are met, e.g. when it comes to private data such as patient health data. The rules around the collection of such data must be researched beforehand and appropriate measures taken.
  5. Documentation is important for posterity as well as for yourself and your colleagues (you might forget why you coded things up this or the other way, as time passes by!).
  6. Visuals matter a lot when you're trying to relay the important information about your paper to the audience. People respond to pictures more than to plain text and numbers, so it's good to consider adding extra visual elements.
  7. There are two types of research to choose from: (a) running experiments, e.g. trying out machine learning techniques that haven't been tried before, and (b) creating your own system or using new data. It seems like our research topic is more of an experimental one but does have a creative side to it - Professor Pless suggested adding an extra application layer that will make it possible to tell which part of an InChI string, for example, relates to a particular part of the chemical structure.

Hi! I am Narmin Jamalova, a graduate student in the ADA-GWU joint Master's program in Computer Science & Data Analytics.

I was born and raised in Baku, Azerbaijan. Fun fact about me: As a kid, my sphere of interests encompassed everything dinosaur- and archeology-related: I could name any dinosaur species just by looking at its picture, intensively studied the history of the Ancient World and even got as far as learning how to read ancient Egyptian hieroglyphs in the hope of some day becoming an archeologist. As I grew up, my interests changed (although I still think archeology is an amazing research ground!).

Within Computer Science, I am very interested in the domains of computer vision and machine learning and, in particular, in their application to improving the energy systems of tomorrow, which will form the core of our lives and help reduce carbon impact. Examples include accurate predictions of renewable power supplies, corrosion identification, optimization of energy consumption and smart solar grids.

As a hobby, I engage in poem- and story-writing (not for publication but mostly for self-reflection).

A brief extract from one of my poems for anyone who might be interested 🙂 :

"Leave some space for me, oh mountains!

I must grow to match your height!

Need to stretch to seize the light,

See into the world you hide ...

Lie and tell me that I'm worth

Of thousand ice caves in the North.

Can I change the pass of time?

Lie that I'm one of a kind.

I know the truth - and I don't mind.

Clutching onto life so dear,

I shall be fooled - I'm a believer.

And each day I shall go on,

Climb an even higher stone.

Till I match your height, oh mountains!

Will I then see all the world?.."
