Background: Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to screen qualified candidates. Finding suitable candidates for an open role from a database of thousands of resumes can be a tough task. Automated resume categorization can speed up the candidate selection process. Such automation can ease the tedious process of fair screening and shortlisting the right candidates, and aid quick decision making. Objective: The aim is to correctly classify the professional domains of candidates by training an ML model on a resume dataset. Task Flow: NLTK-based pre-processing and EDA (data cleaning and stop-word removal), feature extraction, model building (multinomial Naive Bayes classifier), report analysis.
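A minimal sketch of such a pipeline, assuming a hypothetical resumes.csv with 'Resume' text and 'Category' label columns:

```python
# Minimal sketch of the resume-classification pipeline (file and column names are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

df = pd.read_csv("resumes.csv")                      # assumed file with 'Resume' and 'Category' columns
X_train, X_test, y_train, y_test = train_test_split(
    df["Resume"], df["Category"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)      # fit TF-IDF on training text only
X_test_vec = vectorizer.transform(X_test)

clf = MultinomialNB().fit(X_train_vec, y_train)      # multinomial Naive Bayes classifier
print(classification_report(y_test, clf.predict(X_test_vec)))
```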
Background: Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. The Bike Sharing dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. This dataset consists of 17,389 instances, each with 16 features. Objective: Predict the bike-sharing counts per hour based on features including weather, day, time, humidity, wind speed, season, etc. Task Flow: EDA & visualization, pre-processing and data engineering, feature scaling and one-hot encoding using a Pipeline, implement linear regression by finding the coefficients using the normal equation, model building (Naive Bayes classifier), report analysis.
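A brief sketch of the normal-equation step on stand-in data (the real features would first be scaled and one-hot encoded):

```python
# Closed-form linear regression via the normal equation: theta = (X^T X)^-1 X^T y.
import numpy as np

def fit_normal_equation(X, y):
    X_b = np.c_[np.ones((X.shape[0], 1)), X]          # prepend a bias column of ones
    return np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y    # pseudo-inverse for numerical stability

def predict(X, theta):
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    return X_b @ theta

# Toy usage with random data standing in for the encoded bike-sharing features.
X = np.random.rand(100, 5)
y = X @ np.array([3.0, 1.5, -2.0, 0.5, 4.0]) + 7.0
theta = fit_normal_equation(X, y)
print(theta)  # first entry is the intercept, the rest are feature coefficients
```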
Background: Credit risk arises when a corporate or individual borrower fails to meet their debt obligations. From the lender's perspective, credit risk could disrupt its cash flows or increase collection costs, since the lender may be forced to hire a debt collection agency to enforce the collection. The loss may be partial or complete, where the lender incurs a loss of part of the loan or the entire loan extended to the borrower. Credit scoring algorithms, which calculate the probability of default, are the best methods that banks use to determine whether or not a loan should be granted. Objective: Predict loan defaulters using a Logistic Regression model on the credit risk data and calculate credit scores. Task Flow: EDA & visualization, data pre-processing, data engineering, Logistic Regression from scratch using the gradient method and using sklearn, credit scoring, performance metrics, SHAP implementation for Logistic Regression.
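A brief from-scratch sketch of logistic regression with batch gradient descent (illustrative only; the credit-scorecard scaling is a separate design choice):

```python
# Logistic regression from scratch with batch gradient descent (illustrative sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    X_b = np.c_[np.ones((X.shape[0], 1)), X]          # add bias column
    w = np.zeros(X_b.shape[1])
    for _ in range(epochs):
        p = sigmoid(X_b @ w)                           # predicted default probabilities
        grad = X_b.T @ (p - y) / len(y)                # gradient of the log-loss
        w -= lr * grad
    return w

def predict_proba(X, w):
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    return sigmoid(X_b @ w)

# The probability of default can then be mapped to a credit score,
# e.g. score = offset - factor * log-odds (the scaling constants are a business choice).
```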
Background: Dementia is a general term for loss of memory and other mental abilities severe enough to interfere with daily life. It is caused by physical changes in the brain. Alzheimer's is the most common type of dementia, but there are many kinds. Objective: Predict dementia using an SVM model on brain MRI features. Task Flow: Perform data exploration, preprocessing and visualization, and implement an SVM classifier on the data using the OASIS Longitudinal brain MRI dataset.
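A minimal sketch of the SVM step, using synthetic stand-ins for the MRI-derived features and labels:

```python
# Sketch of the SVM classifier; random data stands in for MRI-derived measures
# (e.g. Age, MMSE, eTIV, nWBV, ASF) and the Demented / Nondemented label.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))                    # stand-in feature matrix
y = (X[:, 0] + X[:, 3] > 0).astype(int)          # stand-in binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```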
Background: Speech Emotion Recognition (SER) is the task of recognizing the emotion from speech, irrespective of the semantics. Humans can efficiently perform this task as a natural part of speech communication; however, the ability to conduct it automatically using programmable devices is a field of active research. Objective: Build an ensemble ML model to recognize emotion from speech data. Task Flow: Load the TESS and RAVDESS audio data and extract features and labels, train and test the model with TESS + RAVDESS data, record the team audio samples and add them to the TESS + RAVDESS data, train and test the model with the TESS + RAVDESS + team-recorded (combined) data, test each of the models with a live audio sample recording, classification report and metrics, choice of C for SVM. Speech code library used: Librosa.
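A short sketch of MFCC feature extraction with Librosa; 'audio.wav' is a placeholder path for one TESS/RAVDESS clip, and a full pipeline would loop over all files:

```python
# Sketch of MFCC feature extraction with librosa (path is a placeholder).
import numpy as np
import librosa

def extract_features(path, n_mfcc=40):
    y, sr = librosa.load(path, sr=None)                       # load waveform at native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames) coefficients
    return np.mean(mfcc, axis=1)                              # average over time -> fixed-size vector

features = extract_features("audio.wav")
print(features.shape)   # (40,) feature vector fed to the ensemble / SVM classifier
```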
Background: In Retail and E-Commerce (B2C), and more broadly in B2B, one of the key elements shaping the business strategy of a firm is an understanding of customer behavior. More specifically, understanding the customers based on different business metrics: how much they spend (revenue), how often they spend (frequency), are they new or existing customers, what are their favorite products, etc. Such understanding in turn helps direct marketing, sales, account management and product teams to support customers on a personalized level and improve the product offering. Objective: Perform customer segmentation for an online retailer using an unsupervised clustering technique. Determine the optimum number of segments for the customer base and identify customer segments based on the overall buying behavior using the Online Retail dataset. Task Flow: Understand EDA and analytics-based insights from the data, feature engineering and transformation, apply the k-means algorithm to identify a specific number of clusters, train a supervised algorithm on the clustered data, evaluation on test data, report analysis.
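A brief sketch of the elbow-method and clustering steps, assuming an RFM-style feature table (recency, frequency, monetary value) has already been engineered:

```python
# Sketch of choosing k via the elbow method and segmenting customers.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
rfm = rng.gamma(shape=2.0, scale=50.0, size=(500, 3))     # stand-in for RFM features

X = StandardScaler().fit_transform(rfm)
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # plot these against k to find the "elbow"

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)   # final segmentation
```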
Background: Object oriented programming is based around the concept of "objects". Objects have two kinds of attributes (accessed via . syntax): data attributes (or instance variables) and function attributes (or methods). Object data is typically modified by object methods. Objective: Build OOP-based classes and methods and use them to implement Linear Regression for solving real-world data related queries. Task Flow: Define a class, add methods which take lists of values as input and return the mean, variance, covariance, estimated coefficients, and predicted values; using the class LinearRegression, calculate the estimated coefficients, fit the model, and predict the values on the Pizza Franchise dataset.
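A minimal sketch of such a class, with toy numbers standing in for the Pizza Franchise data:

```python
# A minimal object-oriented simple linear regression class along the lines described above.
class LinearRegression:
    def mean(self, values):
        return sum(values) / len(values)

    def variance(self, values):
        m = self.mean(values)
        return sum((v - m) ** 2 for v in values)

    def covariance(self, x, y):
        mx, my = self.mean(x), self.mean(y)
        return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    def coefficients(self, x, y):
        b1 = self.covariance(x, y) / self.variance(x)     # slope
        b0 = self.mean(y) - b1 * self.mean(x)             # intercept
        return b0, b1

    def fit(self, x, y):
        self.b0, self.b1 = self.coefficients(x, y)
        return self

    def predict(self, x):
        return [self.b0 + self.b1 * xi for xi in x]

# Toy usage with made-up numbers standing in for the Pizza Franchise columns.
model = LinearRegression().fit([2, 6, 8, 8, 12], [58, 105, 88, 118, 117])
print(model.b0, model.b1, model.predict([10]))
```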
Background: Predicting the full-load electrical power output of a base load power plant is important in order to maximize the profit from the available megawatt hours. The base load operation of a power plant is influenced by four main parameters, which are used as input variables in the dataset: ambient temperature, atmospheric pressure, relative humidity, and exhaust steam pressure. These parameters affect the electrical power output, which is considered the target variable. Objective: Implement Multiple Linear Regression using MPI and OpenMP on the Combined Cycle Power Plant dataset. The dataset contains the values for Ambient Temperature, Exhaust Vacuum, Ambient Pressure, Relative Humidity and Energy Output. Task Flow: Identify the features and target and split the data into train and test, implement multiple linear regression by estimating the coefficients on the given data, use the MPI package to distribute the data and implement a communicator, define functions for each objective and make a script (.py) file to execute using the MPI command, use the OpenMP component to predict the data and calculate the error on the predicted data, implement Linear Regression from sklearn and compare the results.
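A hedged sketch of distributing the normal-equation sums with mpi4py; the file name, run command and random stand-in data are illustrative:

```python
# Sketch of distributing the normal-equation computation with mpi4py
# (run with e.g. `mpiexec -n 4 python mlr_mpi.py`; file name is illustrative).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    X = np.random.rand(1000, 4)                       # stand-in for the CCPP features
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 4.0
    X_chunks = np.array_split(np.c_[np.ones(len(X)), X], size)
    y_chunks = np.array_split(y, size)
else:
    X_chunks = y_chunks = None

X_local = comm.scatter(X_chunks, root=0)              # each rank receives a slice of rows
y_local = comm.scatter(y_chunks, root=0)

xtx_local = X_local.T @ X_local                       # partial X^T X and X^T y on each rank
xty_local = X_local.T @ y_local
xtx = comm.reduce(xtx_local, op=MPI.SUM, root=0)
xty = comm.reduce(xty_local, op=MPI.SUM, root=0)

if rank == 0:
    theta = np.linalg.solve(xtx, xty)                 # combined coefficients on the root rank
    print(theta)
```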
Background: Prediction of taxi fares is an interesting challenge and can be modelled using a large dataset such as the 2016 NYC Taxi Fares dataset, which includes information capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. Working with large datasets requires parallelized task execution. Here we use Dask, an open-source project that provides abstractions over NumPy arrays, Pandas DataFrames and regular lists and allows us to run operations on them in parallel using multicore processing. Objective: Using Dask, perform exploratory data analysis and linear regression modeling to predict taxi fares. Task Flow: Read the dataset using the Dask library, perform data analysis, prepare the dataset for model implementation, remove outliers from the training set based on coordinates, predict the test data and calculate the mean squared error and r2 score, perform report analysis.
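A simplified sketch using Dask for the partitioned read and sklearn for a baseline model; the file path, column names and the simple filter (instead of coordinate-based outlier removal) are assumptions:

```python
# Sketch: lazy, partitioned read with Dask, then a baseline linear model.
import dask.dataframe as dd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = dd.read_csv("nyc_taxi_2016.csv")                       # assumed path; read in parallel partitions
df = df[(df.fare_amount > 0) & (df.passenger_count > 0)]    # simplified outlier filtering
features = ["passenger_count", "trip_distance"]             # assumed engineered features

sample = df[features + ["fare_amount"]].dropna().compute()  # materialize to pandas for sklearn
X, y = sample[features], sample["fare_amount"]
model = LinearRegression().fit(X, y)
pred = model.predict(X)
print(mean_squared_error(y, pred), r2_score(y, pred))
```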
Background: Traffic sign recognition is a challenging, real-world problem relevant for AI-based transportation systems. Traffic signs show a wide range of variations between classes in terms of color, shape, and the presence of pictograms or text. However, there exist subsets of classes (e.g., speed limit signs) that are very similar to each other. Further, the classifier has to be robust against large variations in visual appearance due to changes in illumination, partial occlusions, rotations, weather conditions etc. Using a comprehensive traffic sign detection dataset, here we perform the classification of traffic signs using a Multilayer Perceptron (MLP). Objective: Implement the Multi-Layer Perceptron (MLP) algorithm to classify images using the German Traffic Sign Detection Benchmark (GTSDB) dataset. Task Flow: Loading the data, normalizing the features, train the MLP classifier on the features, tune the hyper-parameters, classification report, implement simple neural networks using Keras, experiment with dropout, regularization and batch normalization.
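A brief Keras sketch of an MLP with dropout, L2 regularization and batch normalization; the flattened 32x32 RGB input and 43 sign classes are assumptions about the preprocessing:

```python
# Sketch of a Keras MLP with dropout, L2 regularization and batch normalization.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

num_classes, input_dim = 43, 32 * 32 * 3            # assumed class count and flattened image size
model = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(256, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.2, epochs=20, batch_size=64)
```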
Background: In pandemic and post-pandemic situations, wearing a mask is compulsory for everyone in order to prevent the transmission of coronavirus. Detection of people who are not wearing masks is a challenge due to large populations. This face mask detection project can be used in schools, hospitals, banks, airports etc., as a digitized scanning tool. Objective: To build and implement a Convolutional Neural Network model to classify between masked/unmasked/partially masked faces. Task Flow: Analyze the shape and distribution of the datasets, load the images using ImageDataGenerator, build the CNN model using Keras, transfer learning (use the pre-trained models VGG16 or ResNet50), capture the live image, make the mask on/off prediction, report analysis.
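A brief transfer-learning sketch with a frozen VGG16 base; the directory name and the three-class layout are assumptions about how the data is organized:

```python
# Sketch of transfer learning with a frozen VGG16 base for mask classification.
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
train_data = train_gen.flow_from_directory("mask_data/", target_size=(224, 224),
                                           class_mode="categorical", subset="training")

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                                     # freeze pre-trained weights

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(3, activation="softmax"),                 # masked / unmasked / partially masked
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_data, epochs=5)
```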
Background: Applications such as surveillance, video retrieval and human-computer interaction require methods for recognizing human actions in various scenarios. In the area of robotics, the tasks of autonomous navigation or social interaction could also take advantage of the knowledge extracted from live video recordings. Typical scenarios include scenes with cluttered, moving backgrounds, a nonstationary camera, scale variations, individual variations in appearance and clothing of people, changes in lighting and viewpoint, and so forth. All of these conditions introduce challenging problems that can be addressed using deep learning (computer vision) models. The dataset consists of labelled videos of 6 human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). Objective: Train a CNN-LSTM based deep neural net to recognize the action being performed in a video. Task Flow: Generate the frames of the video, define and build the neural network, build the time-distributed model and DenseNet (to model the information from the sequential video frames, use a hybrid architecture consisting of convolutions for spatial processing as well as recurrent layers for temporal processing: specifically, a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) consisting of GRU layers), use a pre-trained model for feature extraction, report analysis.
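A minimal sketch of the hybrid spatial/temporal architecture: a small CNN applied per frame via TimeDistributed, followed by GRU layers; the frame count, image size and layer sizes are illustrative:

```python
# Sketch of a hybrid CNN + GRU video classifier for the 6 action classes.
from tensorflow import keras
from tensorflow.keras import layers

frames, height, width, channels, num_classes = 20, 64, 64, 3, 6

cnn = keras.Sequential([                         # per-frame spatial feature extractor
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

model = keras.Sequential([
    layers.Input(shape=(frames, height, width, channels)),
    layers.TimeDistributed(cnn),                 # apply the CNN to every frame
    layers.GRU(64, return_sequences=True),       # temporal modelling over the frame sequence
    layers.GRU(32),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```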
Background: Anomaly detection (including outlier detection) is the identification of rare items, events, or observations which raise suspicion by differing significantly from the bulk of the data; it arises in areas such as statistics, signal processing, finance, economics, manufacturing, networking, and data processing. Deep learning can be used to tackle this problem, and over the years researchers have come up with various models for analyzing and detecting such anomalies in sequential data. Here we train our model on Yahoo Finance sequential data to detect anomalies or outliers. Objective: Perform PCA-based stock analytics and detect stock price anomalies by implementing an LSTM autoencoder. Task Flow: Use PCA to identify the top components of the stocks and an LSTM autoencoder neural network to detect/predict anomalies (sudden price changes) in the S&P 500 index. Specifically: PCA analysis (PART A) - load and pre-process the prices data, apply PCA, apply t-SNE and visualize with a graph; anomaly detection (PART B) - load and preprocess the data, create time series data, build an LSTM autoencoder, train the autoencoder, detect anomalies in the S&P 500 index data.
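A brief sketch of the LSTM autoencoder and reconstruction-error thresholding, using random windows as a stand-in for the scaled S&P 500 series; the window length and threshold are illustrative:

```python
# Sketch of an LSTM autoencoder for anomaly detection on windowed price data.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 30, 1
X = np.random.rand(200, timesteps, n_features)      # stand-in for windowed, scaled prices

model = keras.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.LSTM(64),                                 # encoder: compress the window
    layers.RepeatVector(timesteps),                  # repeat latent vector for each timestep
    layers.LSTM(64, return_sequences=True),          # decoder: reconstruct the window
    layers.TimeDistributed(layers.Dense(n_features)),
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, X, epochs=3, batch_size=32, verbose=0)

recon_error = np.mean(np.abs(model.predict(X) - X), axis=(1, 2))
anomalies = recon_error > np.percentile(recon_error, 95)   # threshold on reconstruction error
```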
Background: Deep reinforcement learning combines artificial neural networks with a framework of reinforcement learning that helps software agents learn how to reach their goals. That is, it unites function approximation and target optimization, mapping states and actions to the rewards they lead to. Reinforcement learning refers to goal-oriented algorithms, which learn how to achieve a complex objective (goal) or how to maximize along a particular dimension over many steps; for example, they can maximize the points won in a game over many moves. Objective: Build an environment for the agent and perform stock trading using Deep Reinforcement Learning (FinRL package) Task Flow: Data Loading, Preprocess data, Exploratory data analysis, Top 10 stocks with high volume, Daily returns of the stocks, Train & trade data split, Build environment, Implement DRL algorithms, Trading, Backtesting performance, DashBoard, Report analysis
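A deliberately simplified single-stock environment illustrating the state/action/reward mapping described above; FinRL's own environments and DRL algorithms are far richer, and all names here are illustrative:

```python
# Toy trading environment: state = (price, cash, shares), action = hold/buy/sell,
# reward = change in portfolio value. Not FinRL; a minimal stand-in for intuition.
import numpy as np

class SimpleTradingEnv:
    def __init__(self, prices, cash=10_000.0):
        self.prices, self.cash0 = np.asarray(prices, dtype=float), cash

    def reset(self):
        self.t, self.cash, self.shares = 0, self.cash0, 0
        return self._state()

    def _state(self):
        return np.array([self.prices[self.t], self.cash, self.shares])

    def step(self, action):                        # 0 = hold, 1 = buy one share, 2 = sell one share
        price = self.prices[self.t]
        before = self.cash + self.shares * price
        if action == 1 and self.cash >= price:
            self.shares += 1; self.cash -= price
        elif action == 2 and self.shares > 0:
            self.shares -= 1; self.cash += price
        self.t += 1
        after = self.cash + self.shares * self.prices[self.t]
        reward = after - before                    # reward is the change in portfolio value
        done = self.t == len(self.prices) - 1
        return self._state(), reward, done

env = SimpleTradingEnv(prices=np.cumsum(np.random.randn(100)) + 100)
state, done = env.reset(), False
```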
Background: Lots of data around the world is available in forms other than the traditional table structure of a relational database. A NoSQL database provides a mechanism for storage and retrieval of that kind of data. The primary goal of movie recommendation systems is to filter and predict only those movies that a corresponding user is most likely to want to watch. The algorithms for these recommendation systems use data about the user from the system’s database; this data is used to predict the user's future behavior based on information from the past. Objective: Using the Cassandra NoSQL database, explore the MovieLens dataset and build a movie recommendation engine. Task Flow: Database connection, insert data into the database, querying the database, visualizing the data, analyzing the data, implement the recommender function.
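A short sketch of connecting and querying with the DataStax Cassandra driver; the keyspace, table and column names here are illustrative, not the project's actual schema:

```python
# Sketch: connect to Cassandra, insert a rating, and query a user's ratings.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])                       # local Cassandra node (assumed address)
session = cluster.connect("movielens")                 # assumed keyspace

session.execute("""
    INSERT INTO ratings (user_id, movie_id, rating)
    VALUES (%s, %s, %s)
""", (1, 31, 4.5))                                     # parameterized CQL insert

rows = session.execute("SELECT movie_id, rating FROM ratings WHERE user_id = 1")
for row in rows:
    print(row.movie_id, row.rating)                    # feed these into the recommender function

cluster.shutdown()
```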
Background: Data analysis is important in business to understand the problems facing an organization and to explore data in meaningful ways. Data in itself is merely facts and figures. Data analysis organizes, interprets, structures and presents the data into useful information that provides context for the data. The dataset chosen for this mini-project is the Real Estate Valuation dataset. Objective: Perform simple analytics on the data and predict the house price per unit area based on real estate valuation data. Task Flow: Start a Spark session, fetch the data using handyspark, derive the insights, data visualization, feature scaling, feature engineering, train and evaluate the model.
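A plain-PySpark sketch of the modelling step (handyspark adds EDA convenience on top of this); the file path and column names are assumptions about the prepared data:

```python
# Sketch: assemble features, scale them, and fit a Spark ML linear regression.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("real_estate").getOrCreate()
df = spark.read.csv("real_estate_valuation.csv", header=True, inferSchema=True)   # assumed path

features = ["house_age", "distance_to_mrt", "num_convenience_stores"]             # assumed columns
assembler = VectorAssembler(inputCols=features, outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="price_per_unit_area")     # assumed label

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, scaler, lr]).fit(train)
print(model.stages[-1].coefficients)
model.transform(test).select("prediction", "price_per_unit_area").show(5)
```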
Background: The dataset chosen for this mini-project is a 10% subset of the KDD Cup 1999 dataset (Computer network intrusion detection). The task is to build a network intrusion detector, a predictive model capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. Objective: Perform complex analytics on a network intrusion dataset using Pyspark. Task Flow: Create a spark session, create an RDD from a file, RDD basic operations, create a DataFrame, register the DataFrame as a temporary table, feature scaling, find highly correlated columns, analysis report.
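A brief sketch of the RDD and DataFrame steps on the KDD Cup 1999 10% subset; the field indices follow the usual KDD layout (duration first, label last), and the file name is the commonly used one:

```python
# Sketch: RDD basic operations, a DataFrame, a temporary table, and one correlation.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("kdd_intrusion").getOrCreate()
sc = spark.sparkContext

raw = sc.textFile("kddcup.data_10_percent")                  # one comma-separated record per line
parsed = raw.map(lambda line: line.split(","))
normal = parsed.filter(lambda f: f[-1] == "normal.")         # last field is the connection label
print(normal.count())

rows = parsed.map(lambda f: Row(duration=float(f[0]), src_bytes=float(f[4]),
                                dst_bytes=float(f[5]), label=f[-1]))
df = spark.createDataFrame(rows)
df.createOrReplaceTempView("connections")                    # register as a temporary table
spark.sql("SELECT label, COUNT(*) AS n FROM connections GROUP BY label ORDER BY n DESC").show()
print(df.stat.corr("src_bytes", "dst_bytes"))                # one pairwise correlation
```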
Background: The airline industry is a very competitive market that has grown rapidly in the past 2 decades. Airline companies resort to traditional customer feedback forms, which in turn are very tedious and time consuming. This is where Twitter data serves as a good source to gather customer feedback tweets and perform sentiment analysis. The dataset chosen for this mini-project is Twitter US Airline Sentiment. This dataset comprises tweets for 6 major US airlines, and a multi-class classification can be performed to categorize the sentiment (neutral, negative, positive). Objective: Perform sentiment classification by analyzing the tweets data with Pyspark. Task Flow: Mainly, it includes analysis of the text data using pyspark, deriving the insights and visualizing the data, feature extraction and data classification (using pyspark, handyspark and NLTK tools): Create a spark session, load the data, EDA & visualization, preprocessing and cleaning, feature extraction, encode the labels, train the classifiers, model prediction and evaluation, deployment.
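A brief sketch of a PySpark ML text-classification pipeline for the airline tweets; the 'text' and 'airline_sentiment' column names follow the published dataset schema, and the file path is an assumption:

```python
# Sketch: tokenize, remove stop words, TF-IDF features, label encoding, logistic regression.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("airline_sentiment").getOrCreate()
df = spark.read.csv("Tweets.csv", header=True, inferSchema=True).dropna(subset=["text"])

stages = [
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 14),
    IDF(inputCol="tf", outputCol="features"),
    StringIndexer(inputCol="airline_sentiment", outputCol="label"),
    LogisticRegression(featuresCol="features", labelCol="label"),
]
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=stages).fit(train)
pred = model.transform(test)
print(MulticlassClassificationEvaluator(metricName="f1").evaluate(pred))
```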
Background: Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items by looking for combinations of items that occur together frequently in transactions. For the purposes of customer centricity, market basket analysis examines collections of items to identify affinities that are relevant within the different contexts of the customer touch points. The dataset chosen for this mini-project is the Instacart dataset. Objective: Extract association rules and find groups of frequently purchased items from a large-scale grocery orders dataset. Task Flow: The aim of this work is to extract summary-level insight from a given dataset, integrate the data and identify the underlying pattern or structure, understand the fundamentals of market basket analysis, and construct "rules" that provide concrete recommendations for businesses. To achieve these objectives, an association rules and Apriori algorithm based solution is developed: loading the data, data integration, EDA and data wrangling, create a basket, apply the Apriori algorithm.
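A short sketch of the basket construction and Apriori/association-rule steps with mlxtend, using a toy order frame in place of the Instacart data; support and lift thresholds are illustrative:

```python
# Sketch: build a one-hot basket matrix, mine frequent itemsets, derive association rules.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2, 3, 3],
    "product":  ["milk", "bread", "milk", "eggs", "bread", "bread", "eggs"],
})
basket = (orders.assign(qty=1)
                .pivot_table(index="order_id", columns="product",
                             values="qty", fill_value=0)
                .astype(bool))                                  # one-hot basket matrix

frequent = apriori(basket, min_support=0.3, use_colnames=True)  # frequent itemsets
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```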
Background: Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. It is good practice to understand the data first and try to gather as many insights from it as possible. The dataset chosen for this mini-project is a French retail company's quarterly sales data. Objective: Perform Exploratory Data Analysis (EDA) of the time series data using visualizations and statistical methods. Task Flow: Loading the data, data preprocessing, EDA and data visualization, check for time series stationarity, detrend the time series, visualize lag plots, report analysis.
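A brief sketch of the stationarity check, detrending and lag-plot steps, using a synthetic quarterly series as a stand-in for the retail sales data:

```python
# Sketch: ADF stationarity test, first-difference detrending, and a lag plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import lag_plot

idx = pd.date_range("2012-01-01", periods=40, freq="QS")
sales = pd.Series(np.linspace(100, 200, 40) + np.random.randn(40) * 5, index=idx)

stat, pvalue = adfuller(sales)[:2]            # Augmented Dickey-Fuller test
print("ADF p-value:", pvalue)                 # a large p-value suggests non-stationarity

detrended = sales.diff().dropna()             # first-difference to remove the trend
print("ADF p-value after differencing:", adfuller(detrended)[1])

lag_plot(sales, lag=1)                        # lag plot to inspect autocorrelation
plt.show()
```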
Background: Speculating on the Bitcoin market may offer the opportunity to obtain substantial returns, but it may also entail a very high risk. So judging the best time to enter the market is extremely important in order to make profits and not lose too much money. The price of Bitcoin changes every day, just like the price of fiat currencies. However, Bitcoin price changes are on a greater scale than fiat currency changes. As a result, getting an idea of the future price trend can be extremely important. The dataset chosen for this mini-project provides the history of daily Bitcoin prices. Objective: Perform EDA and forecast the Bitcoin price using an ARMA model on the time series data. Task Flow: Loading the data, perform EDA, test stationarity using the ADF test, identify trends and seasonality, visualize the auto-correlation plot, train the AR model, train the ARMA model, predictions and report analysis.
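A minimal sketch of an ARMA fit with statsmodels (ARIMA with d=0), using a synthetic series in place of the differenced Bitcoin prices; the (2, 0, 1) order is illustrative:

```python
# Sketch: fit an ARMA(2,1) model and forecast one week ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

returns = pd.Series(np.random.randn(365))          # stand-in for the differenced/stationary series

model = ARIMA(returns, order=(2, 0, 1))            # AR(2), no differencing, MA(1) = ARMA(2,1)
fitted = model.fit()
print(fitted.summary())

forecast = fitted.forecast(steps=7)                # 7-step-ahead forecast
print(forecast)
```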
Background: The most important function of air pollution early warning systems is to report the air quality to relevant departments when the air quality reaches the early warning standard. Air quality forecasting is an effective way of protecting public health by providing an early warning against harmful air pollutants. Urban air pollution events can be forecasted by meteorological elements to provide an early warning. The dataset chosen for this mini project is the air-quality data from the Beijing Municipal Environmental Monitoring Center. This dataset includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. Objective: Implement ARIMA model to forecast the air quality using Beijing air quality dataset. Task Flow: Loading the data, Perform EDA, Identify trends and seasonality, Time series stationarity, Autocorrelation plot analysis, Implement ARIMA model, Implement SARIMAX model, Report analysis. Tools used: pmdarima, statsmodels
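A brief sketch of order selection with pmdarima and a seasonal SARIMAX fit, using a synthetic hourly pollutant series as a stand-in for one monitoring-site column; m=24 assumes a daily seasonal cycle:

```python
# Sketch: automatic ARIMA order search, then a SARIMAX fit and next-day forecast.
import numpy as np
import pandas as pd
import pmdarima as pm
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range("2017-01-01", periods=500, freq="H")
pm25 = pd.Series(50 + 10 * np.sin(np.arange(500) * 2 * np.pi / 24)
                 + np.random.randn(500) * 3, index=idx)      # stand-in hourly pollutant series

auto = pm.auto_arima(pm25, seasonal=True, m=24, suppress_warnings=True)  # daily seasonality
print(auto.order, auto.seasonal_order)

model = SARIMAX(pm25, order=auto.order, seasonal_order=auto.seasonal_order).fit(disp=False)
print(model.forecast(steps=24))                               # next-day hourly forecast
```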
Background: Customer churn describes subscribers to a service who decide to discontinue their service in a certain time frame. Churn prediction consists of detecting which customers are likely to cancel a subscription to a service based on how they use the service. Businesses often have to invest substantial amounts in attracting new clients, so every time a client leaves it represents a significant investment loss, and both time and effort then need to be channeled into replacing them. Being able to predict when a client is likely to leave, and offering them incentives to stay, can offer huge savings to a business. The dataset chosen for this mini-project is a customer churn dataset representing the trips of the users and the drivers' ratings, along with the luxury cars used. Objective: Analyze and preprocess the data and build a machine learning model to predict customer churn. Task Flow: Loading the data, data exploration and analysis, feature engineering, data preprocessing, train the ML models, model optimization: hyperparameter tuning, factors driving customers to churn and report analysis.
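A short sketch of the training and hyperparameter-tuning step, with synthetic stand-ins for the engineered churn features; the random forest and grid values are illustrative choices:

```python
# Sketch: train/test split, grid search over a random forest, and a churn report.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                       # stand-in engineered features
y = (X[:, 0] + X[:, 2] > 0.5).astype(int)            # stand-in churn labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {"n_estimators": [100, 300], "max_depth": [5, 10, None]},
                    cv=5, scoring="f1")
grid.fit(X_train, y_train)
print(grid.best_params_)
print(classification_report(y_test, grid.best_estimator_.predict(X_test)))
# grid.best_estimator_.feature_importances_ can then highlight the churn drivers.
```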
Background: The universe of stocks can truly baffle investors who wish to make the best selection of stocks for their portfolios. It is a daunting task to make a prudent selection of stocks, given the vastness of the choices and the diverse behavioral characteristics of each of these stocks with respect to itself and to one another. Also, an investor needs to determine the optimal weights, which will ensure maximum return and minimum risk for the portfolio invested in. In addition, the investor needs to know how much to invest in each one of the assets in the portfolio. The dataset chosen for this mini-project is the Dow Jones Industrial Average (DJIA) Index dataset. Objective: Analyze and preprocess the data and build a finance portfolio to select the optimal portfolio of diversified assets. Task Flow: The aim of this work is to build a finance portfolio, optimize and find the maximum return and minimum risk of a portfolio, cluster the asset parameters to group similar assets, and select the optimal portfolio of diversified assets. To achieve these objectives, we use a clustering and efficient frontier based solution: loading the data, data summarization, compute stock returns, portfolio return and portfolio risk, cluster the assets using K-Means, diversification index, efficient frontier, Sharpe ratio, visualize portfolio and report analysis.
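A brief sketch of computing portfolio return, risk and Sharpe ratio from daily prices; the random prices and random weights stand in for the DJIA constituents and the optimized (efficient-frontier) weights:

```python
# Sketch: annualized portfolio return, volatility, and Sharpe ratio for one weight vector.
import numpy as np
import pandas as pd

prices = pd.DataFrame(100 + np.cumsum(np.random.randn(252, 4), axis=0),
                      columns=["AAPL", "MSFT", "JNJ", "XOM"])       # stand-in constituents
returns = prices.pct_change().dropna()

weights = np.random.dirichlet(np.ones(4))                           # candidate weights (sum to 1)
ann_return = np.dot(returns.mean(), weights) * 252                  # annualized portfolio return
ann_vol = np.sqrt(weights @ returns.cov().values @ weights * 252)   # annualized portfolio risk
sharpe = ann_return / ann_vol                                       # Sharpe ratio (risk-free rate ~ 0)
print(ann_return, ann_vol, sharpe)
# Repeating this over many random or optimized weight vectors traces out the efficient frontier.
```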
With the onset of the COVID-19 pandemic, social media, specifically Twitter, has rapidly become a crucial communication tool for information generation, dissemination, and consumption of different views, opinions and emotions on outbreak-related incidents. Tweet analytics and sentiment prediction could help in policymaking, health interventions and law enforcement based on inputs regarding food scarcity, new disease symptoms, increased crimes etc. Through tweet analytics, businesses can profit by understanding the market need for specific commodities and thereby increase the production of the in-demand items. NGOs will be able to better direct their efforts by organizing rehabilitation camps based on tweet analytics. The objective of this project is two-fold: 1. perform COVID-relevant tweet analytics and derive meaningful insights (Tweet Analytics); 2. develop a real-time system for sentiment prediction on Twitter streaming data relevant to the COVID pandemic (Sentiment Analysis). Tweet Analytics: Determine the top 5 topics the public is talking about on Twitter and in which cities these topics are discussed most. For example, topics could be vaccinations, COVID deaths, pharma availability, government support etc.
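A minimal sketch of topic extraction with LDA, using a tiny placeholder corpus in place of the cleaned COVID tweets; the 5-topic setting mirrors the "top 5 topics" goal:

```python
# Sketch: count-vectorize tweets and extract topics with Latent Dirichlet Allocation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["vaccine slots available in the city today",          # placeholder tweet corpus
          "new covid deaths reported in the district",
          "pharmacy running out of essential medicines",
          "government announces new support package",
          "second dose vaccination drive this weekend"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=5, random_state=42).fit(X)

terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [terms[j] for j in comp.argsort()[-3:][::-1]]          # top words per topic
    print(f"Topic {i}: {top}")
```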
Teaches professionals how to unlock the power of data to solve complex business problems and make data-driven decisions. Designed by IISc, the #1 ranked university (NIRF) and a premier academic institution for world-class education in science, engineering, and design. Delivered by TalentSprint with its deep understanding of modern technologies, access to industry experts, and a state-of-the-art technology platform. Delivered in an executive-friendly format. Unique 5-step learning process of LIVE online faculty-led interactive sessions, capstone projects, mentorship, hackathons, and presentations to ensure fast-track learning.