
In machine learning, data plays a crucial role. No matter how good an algorithm is, if the data provided is not good enough, the result will be a poor model. On the other hand, with the right data, even an "average" algorithm can produce a highly accurate model. There are a few points to consider when validating the data.

Data has to be clean, meaning: 1. It must be complete: there should not be any empty fields. 2. Its accuracy should be audited, and any inconsistencies should be fixed. 3. Similar data should have the same format. For example, all times and events must be in 'timestamp' format.
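As a rough illustration of the completeness and format checks, here is a minimal sketch using only the Python standard library. The records, the field names (`id`, `amount`, `event_time`) and the choice of ISO 8601 as the required timestamp format are assumptions made for the example, not part of any real dataset.

```python
from datetime import datetime

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "amount": "12.50", "event_time": "2021-03-01T10:15:00"},
    {"id": 2, "amount": "", "event_time": "2021-03-01T11:00:00"},
    {"id": 3, "amount": "7.00", "event_time": "01/03/2021 12:30"},
]


def find_issues(rows, time_field="event_time"):
    """Return a list of (row id, problem description) pairs."""
    issues = []
    for row in rows:
        # Completeness: no field may be missing or empty.
        for key, value in row.items():
            if value is None or str(value).strip() == "":
                issues.append((row["id"], f"empty field: {key}"))
        # Format consistency: timestamps must parse as ISO 8601 (assumed format).
        raw_time = row.get(time_field)
        try:
            datetime.fromisoformat(raw_time)
        except (TypeError, ValueError):
            issues.append((row["id"], f"bad timestamp: {raw_time!r}"))
    return issues


issues = find_issues(records)
for row_id, problem in issues:
    print(row_id, problem)
```

Auditing accuracy (point 2) usually needs domain knowledge or a trusted reference source, so it is not covered by this sketch.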

Data availability: Teams in a project usually run several experiments, so they need data that is available and easy to access. Therefore, data should be kept in a place such as a data lake or a data warehouse, where every member of the team can access it.

Data Relevancy: For a machine learning project, data has to be relevant to the problem under consideration. There should be enough data to sufficiently describe the domain in which the problem exists. For example, in an anomaly detection project on card transactions, information about only the card issuer and the location of the transaction might not be enough. Additional data such as transaction amount, date, etc. would likely result in much better performance.

Data Bias: A machine learning algorithm only learns from the data it is given. Therefore, if there is a bias in the data, it will lead to unexpected results. There is a famous, often-retold example in which the task was to distinguish US tanks from Russian tanks. In the lab the project seemed promising, with high accuracy, but in the field it failed. The reason was that the training data was biased: the US tank images were all taken on a sunny day and the Russian tank images were all taken on a cloudy day. Hence, the model learned to distinguish a sunny day from a cloudy day, not friendly tanks from enemy tanks.
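One simple way to catch this kind of bias before training is to check whether some side attribute lines up perfectly with the label. The sketch below assumes each image comes with hypothetical metadata of the form (label, weather); both the metadata and the helper name `confounded_values` are invented for illustration.

```python
from collections import Counter

# Hypothetical image metadata as (label, weather) pairs.
samples = [
    ("us_tank", "sunny"), ("us_tank", "sunny"), ("us_tank", "sunny"),
    ("russian_tank", "cloudy"), ("russian_tank", "cloudy"),
    ("russian_tank", "cloudy"),
]


def confounded_values(pairs):
    """Return attribute values that only ever appear with a single label."""
    labels_per_value = {}
    for label, value in pairs:
        labels_per_value.setdefault(value, set()).add(label)
    return sorted(v for v, labels in labels_per_value.items() if len(labels) == 1)


# Cross-tabulation of label against weather.
counts = Counter(samples)
for (label, weather), n in sorted(counts.items()):
    print(label, weather, n)

# Every weather value here maps to exactly one label, which is a warning sign.
suspicious = confounded_values(samples)
print("attribute values tied to one label:", suspicious)
```

A non-empty result does not prove the model will learn the wrong signal, but it is a reason to collect more varied data, for example images of both tank types in both kinds of weather.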

Data Time Window: When a model is trained on data captured in the wrong time period, it won't be able to produce good results because the data lacks variance. For example, if a stock dataset only covers the last month, the model will not learn the dynamics of the stock market. In this case, a few years of data would include many different recurring events, which can be useful. Too long a window can be a problem too: it can lead the model to learn outdated information and hence perform poorly.
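Selecting a training window can be as simple as filtering rows by date. The sketch below is a minimal example under assumed inputs: the price rows, the `day` and `close` field names, and the three-year window are all hypothetical, and the right window length depends on the problem.

```python
from datetime import date, timedelta

# Hypothetical daily closing prices.
prices = [
    {"day": date(2015, 1, 5), "close": 101.2},
    {"day": date(2019, 6, 3), "close": 143.7},
    {"day": date(2021, 5, 10), "close": 160.4},
]


def in_window(rows, end, years_back):
    """Keep rows whose date falls within roughly `years_back` years before `end`."""
    start = end - timedelta(days=365 * years_back)  # approximation, ignores leap days
    return [row for row in rows if start <= row["day"] <= end]


# Keep about three years of history: long enough to include recurring events,
# short enough to leave out the 2015 row, which may be outdated.
training_rows = in_window(prices, end=date(2021, 6, 1), years_back=3)
print([row["day"].isoformat() for row in training_rows])
```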

The main goal is to build a demonstrable system that exhibits one or more solutions to the selected problem. The focus should be on a specific problem and on finding solutions to it. It is also crucial to have an end-to-end system as soon as possible, even if it is not the perfect solution. Later it can be developed further to achieve better results. When choosing a topic, it is important to take into account the data that is going to be used and the testing approach. There are some areas in which finding data is hard, if not impossible, for reasons such as privacy. The testing approach has to be clear and easy to understand. The suggested system should be tested and compared against the state-of-the-art techniques in the domain. This way it can be concluded whether the system is a good solution or not.

In the technical approach, there are 7 stages to successfully complete the MS project:

1. Define the Problem: In this step the problem has to be defined. State-of-the-art techniques have to be analyzed, and what the solution will add beyond them has to be determined. The way the success of the suggested solution will be measured also needs to be identified in this stage.

2. Data Collection – Curation and Analysis: The data needed should be considered in this stage. The amount of data needed and its source must be known. It is important to avoid proprietary data or data that needs a license agreement to use. Data usually comes in a different format than the desired one. Therefore, how the data will be converted into a usable format needs to be decided.

3. Methods and Tools Selection: The resources (hardware, time) and the tools (e.g. Python, PostgreSQL) need to be selected.

4. System (Software/Hardware(?)) Development: It is important to have a brief outline of tasks and milestones beforehand. This makes it easy to see which phase the project is in along the way. For this cohort there are three major milestones:

–December Demonstration to ADA/GWU Faculty

–May Demonstration to ADA/GWU Faculty (Virtual)

–Delivery of Documentation

5. Test, Retest, and Test Again: During development of the system, tests have to be done incrementally. Rather than leaving testing to the end and realizing that a few things do not work as expected, it is simpler to test incrementally, so that the root of a problem is known as soon as it appears. It is also crucial to record the tests, their results, and the decisions made regarding the progress.

6. Operational Demonstration: In this stage the functionality of the system should be demonstrated. To achieve this:

–Identify inputs you will use

–Describe expected results

–Demonstrate system output

–Discuss differences (if any) between expected output and actual output

For this cohort, the major milestones are noted in stage 4.
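The four demonstration steps above can be sketched as a small comparison harness. Everything here is hypothetical: `toy_system` is a stand-in for the real system, and the inputs, expected labels and the 1000 threshold are invented for illustration.

```python
def compare(inputs, expected, system):
    """Run `system` on each input and record expected vs. actual output."""
    report = []
    for value, want in zip(inputs, expected):
        got = system(value)
        report.append(
            {"input": value, "expected": want, "actual": got, "match": got == want}
        )
    return report


def toy_system(amount):
    """Hypothetical stand-in: flag transaction amounts above 1000."""
    return "anomaly" if amount > 1000 else "normal"


inputs = [20, 5000, 1000]                   # identify the inputs
expected = ["normal", "anomaly", "anomaly"]  # describe the expected results
report = compare(inputs, expected, toy_system)  # demonstrate the system output

# Discuss differences: list every case where the output was not as expected.
mismatches = [row for row in report if not row["match"]]
for row in mismatches:
    print(f"input={row['input']}: expected {row['expected']}, got {row['actual']}")
```

In this toy run the boundary input 1000 comes back as "normal" while "anomaly" was expected. A difference like this is the kind that should be discussed during the demonstration.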

7. Documentation: Documentation is needed so that the project can be understood easily, both by the people working on it and by people who want to understand what has been done and what the intention behind it was.

I am Habil Gadirli from Baku, Azerbaijan. I received my B.Sc. in Computer Science from ADA University in 2019. For the next year and a half I worked at WeTravel as a Backend Engineer. Since September 2020 I have been a student in the dual-degree master's program "Master of Science in Computer Science and Data Analytics" offered by ADA University and George Washington University. One of my great interests is learning how to build scalable, fast, and maintainable architectures, and how companies such as Visa, Google, and Amazon manage to handle vast amounts of traffic and transactions. Other than that, I am also interested in Machine Learning and Computer Vision. I am planning to focus my master's thesis on Computer Vision. Outside of Computer Science, I follow news in the crypto world and football (soccer). I also look into how successful startups are built and the story behind each of them. Recently, I worked on an IOI-granted project together with Professor Jamaladdin Hasanov. It is in the 15th volume of the IOI journal and can be accessed through the following link: https://ioinformatics.org/journal/v15_2021_23_36.pdf
