Projects

Data Science Projects

Project 1 Walmart Sales Forecasting

Background: Sales forecasting is the process of using a company’s sales records from the past few years to predict its short-term or long-term sales performance. It is a widespread corporate practice in which objectives are identified, action plans are drawn up, and budgets and resources are allocated to them. Sales executives often find themselves scrambling for answers about the forecast during business reviews. The sales forecast model helps them find those answers up front and come prepared with numbers and predictions to share with the leadership team.

Problem Statement: The goal of this analysis is to predict future sales for Walmart stores based on varying features and events.

  • Build a machine learning model that learns from past records and events and predicts outcomes accurately.
  • Predict the sales forecast for a store and its departments for a specific week of the year, considering whether that week falls before or after a holiday.

Modeling: The ML algorithms listed below are used to train the model, which is evaluated using weighted mean absolute error (WMAE). The model with the lowest RMSE score and the best accuracy score is taken as the baseline. The following ML algorithms are used:

  • KNN Regression
  • Decision Tree
  • Random Forest
  • Gradient Boosting Machine (XGBoost Regressor)
  • ARIMA - Auto Regressive Integrated Moving Average
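The weighted mean absolute error mentioned under Modeling can be sketched as follows. This is a minimal illustration, not the project's actual evaluation code: the function name, the toy numbers, and the holiday weight of 5 are assumptions, since the text above does not state which weights were used.

```python
from typing import Sequence


def weighted_mean_absolute_error(
    actual: Sequence[float],
    predicted: Sequence[float],
    is_holiday: Sequence[bool],
    holiday_weight: float = 5.0,  # assumed weight; not stated in the project text
) -> float:
    """Return sum(w_i * |y_i - yhat_i|) / sum(w_i).

    Holiday weeks get weight `holiday_weight`; all other weeks get weight 1.
    """
    if not (len(actual) == len(predicted) == len(is_holiday)):
        raise ValueError("all inputs must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")
    weights = [holiday_weight if holiday else 1.0 for holiday in is_holiday]
    weighted_errors = sum(
        w * abs(a - p) for w, a, p in zip(weights, actual, predicted)
    )
    return weighted_errors / sum(weights)


# Toy weekly sales for one store and department (made-up numbers).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
is_holiday = [False, False, True]
wmae = weighted_mean_absolute_error(actual, predicted, is_holiday)
print(f"WMAE: {wmae:.2f}")
```

Because the holiday week carries a larger weight, its forecast error dominates the score, which matches the project's emphasis on weeks around holidays.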

Project 2 California House Price Prediction

Buying a house is on everyone’s checklist. One of the most important factors in buying a house is household income. Other factors include location, distance from work, and the size of the house. Buyers weigh these factors while looking for a house on the market, so the predicted value of a house depends on them. Using these factors as predictors of housing prices can also help governments, private companies, insurance companies, and real estate agents invest accordingly.

Project 3 Analyzing the single-cell RNA sequencing (sc-RNASeq) data and assigning cell type identity

Recent advancements in next-generation sequencing (NGS) technologies have made single-cell sequencing an increasingly powerful tool for understanding biology and cellular function, and for disease diagnosis, therapy response prediction, and treatment selection. RNA sequencing (RNA-seq) is a genomic approach for detecting and quantifying messenger RNA (mRNA) molecules in biological samples and is helpful in exploring cellular responses. There are two main types of RNA sequencing: bulk RNA-seq and single-cell RNA-seq (scRNASeq). Historically, sequencing technology only enabled an averaged analysis of a whole cell population (bulk RNA-seq). Today, in contrast, tens of thousands of individual cells from a single tissue sample or patient can be analyzed, giving researchers an unprecedented opportunity to understand individual cell populations and their behavior in diseased tissue. The advantage of scRNASeq is that it can identify rare cell populations and mutations, which helps in understanding cancer and cardiac disorders. Identifying distinct and new cell populations is important for developing new therapies and diagnostics.

However, analyzing the large volumes of data generated by these experiments requires specialized statistical and computational methods. We will use data from 10X Genomics and identify the different cell type clusters based on the expression of different genes. Cells are treated as rows (samples) and gene expression values as columns (variables). This white paper describes the single-cell RNA-seq data analysis for identifying different cell populations in normal human peripheral blood mononuclear cells (PBMCs) using Seurat, an R-based platform.

Project 4 Cancer_Detection-A-Machine-Learning-and-Deep-Learning-Approach

OVERVIEW Cancer is a deadly disease. Researchers and clinicians are trying to find methods to detect it at an early stage. Early diagnosis plays an important role in planning treatment and improving the patient’s survival rate. Cancer can be localized or metastatic (spread to distant organs). One of the most important early diagnostic steps is examining the lymph nodes to find out whether the cancer has metastasized. This is done by H&E (hematoxylin and eosin) staining of histological slides of lymph nodes taken from biopsies.

GOAL Currently, pathologists manually examine the slides in the laboratory and decide whether the patient has metastatic cancer. Reading the slides and writing a report relies on human judgement, which can be inconsistent and can vary from person to person and from day to day. A computational model that reads the slides could therefore automate the process and give unbiased results.

DATA SOURCE The data for this project were downloaded from the Kaggle website: https://www.kaggle.com/c/histopathologic-cancer-detection/data

Project 5 - Predicting-Movie-Ratings-from-Reviews

Naive Bayes is used to predict the movie rating from reviews. Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class label.
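The independence assumption can be illustrated with a small multinomial Naive Bayes classifier written from scratch. This is only a sketch under stated assumptions: the class name `NaiveBayesText`, the whitespace tokenization, the add-one smoothing, and the toy reviews with "positive"/"negative" labels are all hypothetical and are not taken from the project notebook.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of per-word log likelihoods: the sum is valid
            # only because words are treated as independent given the class.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for illustration).
reviews = [
    "great acting and a great story",
    "wonderful film loved it",
    "boring plot and terrible acting",
    "terrible film hated it",
]
ratings = ["positive", "positive", "negative", "negative"]

model = NaiveBayesText().fit(reviews, ratings)
print(model.predict("loved the great story"))
print(model.predict("boring and terrible"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.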

Project 6 - AirBnb_2019_Austin_Crimes_Map

Airbnb, Inc. is an online marketplace for arranging or offering lodging, primarily homestays, or tourism experiences. The company does not own any of the real estate, nor does it host events; it acts as a broker, receiving commissions from each booking. The company is based in San Francisco, California, United States. This project looks at various aspects of Airbnb listings and examines which combination of features most accurately predicts the nightly rate of a given listing. I have focused only on the data collected for the city of Austin, Texas, in 2019 (through November). In addition, this notebook shows the crimes reported within the areas covered by the Airbnb dataset. Finally, a map is generated using the boundaries and zip codes for better visualization. Whenever we plan a trip, we look for a few major features: reviews, location, property type, parking, included breakfast, cancellation policy, and a few others. The analysis in this notebook includes most of these features.

Project 7 - INFOGRAPHIC - Air Travel – How safe is Air Travel?

The infographic is intended for an external audience as a quick reference with easy-to-understand visuals and data. The idea behind it is to convey a positive message about airline safety, support that message with data, and make it appealing and easily understandable for the public.

The link provided above is the Blog describing the infographic.

Project 8 - Identification of significant variables to drive the price of used cars on eBay

Data Set: Used Cars Database from Kaggle

Question: Identification of significant variables to drive the price of used cars on eBay

Project 9 - St Louis Crime -2017 and 2018

This project performs an exploratory data analysis (EDA) in Python on the UCR Part 1 crime data for 2017 and 2018 from the St. Louis County Police Department.

The crimes covered include homicide/non-negligent manslaughter, rape, robbery, aggravated assault, burglary, larceny, motor vehicle theft, arson, and human trafficking.

For further details and to download the data please click the Link

Project 10 - Building Weather App

In this project, I wrote a Python script that uses an API to create a “weather app”. The user enters a zip code, and the weather for that zip code is printed.
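The flow of such a script might look like the sketch below. The project text does not say which weather API was used, so the OpenWeatherMap current-weather endpoint here is an assumption, as are the function names, the output format, and the sample payload (which mirrors only the response fields the sketch reads).

```python
import json
import urllib.parse
import urllib.request

# Assumed provider; the project text does not name the API it calls.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_url(zip_code: str, api_key: str, country: str = "us") -> str:
    """Build the request URL for a zip code lookup in imperial units."""
    query = urllib.parse.urlencode(
        {"zip": f"{zip_code},{country}", "appid": api_key, "units": "imperial"}
    )
    return f"{BASE_URL}?{query}"


def fetch_weather(zip_code: str, api_key: str) -> dict:
    """Call the API and return the decoded JSON payload (needs network access)."""
    with urllib.request.urlopen(build_url(zip_code, api_key), timeout=10) as response:
        return json.load(response)


def format_report(payload: dict) -> str:
    """Turn the JSON payload into the one-line report shown to the user."""
    city = payload["name"]
    temp = payload["main"]["temp"]
    description = payload["weather"][0]["description"]
    return f"{city}: {temp:.0f} F, {description}"


# Hand-written sample payload, so the formatting can be tried without a network call.
sample_payload = {
    "name": "St. Louis",
    "main": {"temp": 71.6},
    "weather": [{"description": "clear sky"}],
}
print(format_report(sample_payload))
```

In an interactive script, the zip code would come from `input()` and the report from `format_report(fetch_weather(zip_code, api_key))`, where `api_key` is a key issued by the chosen provider.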