This portfolio is a compilation of notebooks which I created for data analysis or for exploring machine learning algorithms. Standalone projects are listed in a separate category.
Titanic: Machine Learning from Disaster is a knowledge competition on Kaggle.
Many people started practicing machine learning with this competition, and so did I.
This is a binary classification problem: based on information about Titanic passengers we predict whether they survived or not.
General description and data are available on Kaggle.
The Titanic dataset provides interesting opportunities for feature engineering.
Ghouls, Goblins, and Ghosts... Boo! is a knowledge competition on Kaggle. This is a multiclass classification problem: based on information about monsters we predict their types. A fun competition for Halloween. General description and data are available on Kaggle.
This dataset has a small number of samples, so careful feature selection and model ensembling are necessary for high accuracy.
Otto Group Product Classification Challenge is a knowledge competition on Kaggle. This is a multiclass classification problem: based on information about products we predict their category. General description and data are available on Kaggle.
The data is obfuscated, so the main question lies in selecting the model for prediction.
In the real world it is common to meet data in which some classes are frequent and others are rare. When the imbalance is severe, predicting the rare classes can be difficult with standard classification methods. In this notebook I analyse such a situation. I can't share the data used in this analysis.
Banks strive to increase the efficiency of their contacts with customers. One area that requires this is offering new products to existing clients (cross-selling). Instead of offering new products to all clients, it is a good idea to predict the probability of a positive response. The offers can then be sent only to those clients for whom the probability of response is higher than some threshold value.
In this notebook I try to solve this problem.
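The thresholding step described above can be sketched as follows. This is only an illustration: the client ids, the probabilities, and the `select_clients` helper are hypothetical, and the probabilities are assumed to come from some already fitted classifier.

```python
def select_clients(probabilities, threshold=0.5):
    """Return ids of clients whose predicted response probability exceeds the threshold."""
    return [client_id for client_id, p in probabilities.items() if p > threshold]


# Hypothetical predicted probabilities of a positive response.
predicted = {"client_a": 0.82, "client_b": 0.15, "client_c": 0.55, "client_d": 0.40}

targets = select_clients(predicted, threshold=0.5)
print(targets)
```

Raising the threshold reduces the number of offers sent and usually raises the share of positive responses among them, at the cost of missing some interested clients.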
House Prices: Advanced Regression Techniques is a knowledge competition on Kaggle. This is a regression problem: based on information about houses we predict their prices. General description and data are available on Kaggle.
The dataset has a lot of features and many missing values. This gives interesting possibilities for feature transformation and data visualization.
Loan Prediction is a knowledge and learning hackathon on Analytics Vidhya. Dream Housing Finance company deals in home loans. The company wants to automate the loan eligibility process (in real time) based on the customer details provided when filling in the online application form. Based on the customer's information we predict whether they should receive a loan or not. General description and data are available on Analytics Vidhya.
Caterpillar Tube Pricing is a competition on Kaggle. This is a regression problem: based on information about tube assemblies we predict their prices. General description and data are available on Kaggle.
The dataset consists of many files, so there is an additional challenge in combining the data and selecting the features.
Bag of Words Meets Bags of Popcorn is a sentiment analysis problem. Based on the texts of reviews we predict whether they are positive or negative. General description and data are available on Kaggle.
The data provided consists of raw reviews and a class label (1 or 2), so the main part of the work is cleaning the texts.
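A minimal sketch of what such cleaning might look like, using only the standard library. The specific steps (unescaping HTML entities, dropping tags, keeping letters only, lowercasing) are an assumption about typical review preprocessing, not necessarily the steps used in the notebook, and the sample review is made up.

```python
import html
import re


def clean_review(text):
    """Unescape entities, strip HTML tags, keep letters only, lowercase, collapse spaces."""
    text = html.unescape(text)                # e.g. "&#39;" becomes "'"
    text = re.sub(r"<[^>]+>", " ", text)      # drop tags such as <br />
    text = re.sub(r"[^a-zA-Z]+", " ", text)   # replace digits and punctuation with spaces
    return " ".join(text.lower().split())


# Made-up raw review for illustration.
raw = "This movie was <br />GREAT!!! 10/10, I&#39;d watch it again."
print(clean_review(raw))
```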
Natural language processing in machine learning helps to accomplish a variety of tasks, one of which is extracting information from texts. This notebook is an overview of several text exploration methods, using the English translation of the Japanese light novel "Fate/Zero" as an example.
This notebook shows how a new text can be generated from a given corpus using the idea of Markov chains. I start with simple first-order chains and improve the model with each step to generate better text.
This notebook shows how a text can be summarized by choosing several of its most important sentences. I explore various methods of doing this using a news article as an example.
Clustering is an approach to unsupervised machine learning, and KMeans is one of the clustering algorithms. In this notebook I demonstrate how it works. The data used describes various types of seeds and their parameters. It is available here.
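As a rough illustration of the idea, a word-level Markov chain can be built and sampled as below. The toy corpus and the function names `build_chain` and `generate` are hypothetical; setting `order=1` gives a first-order chain, and larger values condition on more preceding words.

```python
import random
from collections import defaultdict


def build_chain(words, order=1):
    """Map each state (a tuple of `order` words) to the words that follow it in the corpus."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, start, length, seed=None):
    """Walk the chain from `start`, sampling up to `length` further words."""
    rng = random.Random(seed)
    state = tuple(start)
    output = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:  # dead end: this state never had a successor
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Toy corpus for illustration only.
corpus = "the cat sat on the mat and the cat ate the fish".split()
chain = build_chain(corpus, order=1)
print(generate(chain, start=["the"], length=8, seed=0))
```

Because followers are stored with repetition, frequent continuations are sampled proportionally more often.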
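One simple way to pick "important" sentences is to score each sentence by the average corpus frequency of its words. The sketch below assumes that scoring rule and a naive sentence splitter; it is not necessarily one of the methods from the notebook, and it skips stop-word removal for brevity. The `summarize` function and the sample text are hypothetical.

```python
import re
from collections import Counter


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


# Made-up text for illustration.
text = "Cats sleep. Cats sleep a lot. Dogs bark."
print(summarize(text, n_sentences=2))
```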
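To show the mechanics, here is a from-scratch sketch of the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the notebook's code, and the toy 2D points stand in for the seeds data; the `kmeans` function and its parameters are assumptions made for this example.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The result depends on the initial centroids, so on less clearly separated data it is common to run the algorithm several times and keep the best clustering.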
This is a simple example of a feedforward neural network with regularization. It is based on Andrew Ng's lectures on Coursera. I used data from Kaggle's challenge "Ghouls, Goblins, and Ghosts... Boo!"; it is available here.
I have a dataset with telematic information about 10 cars driving during one day. I visualise the data, search for insights, and analyse the behavior of each driver. I can't share the data, but here is the notebook. Note that the folium map can't be rendered natively by GitHub, but nbviewer.jupyter can render it.
Recommenders are systems that predict users' ratings for items. There are several approaches to building such systems, and one of them is Collaborative Filtering.
This notebook shows several examples of collaborative filtering algorithms.
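As one illustrative variant, user-based collaborative filtering predicts a missing rating as a similarity-weighted average of other users' ratings for the same item. The sketch below assumes cosine similarity over co-rated items; the rating data and the function names are hypothetical, and the notebook may use different algorithms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two rating dicts, computed over co-rated items only."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`, or None if unavailable."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


# Made-up ratings on a 1-5 scale: ann's tastes resemble bob's more than eve's.
ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "eve": {"a": 1, "b": 5, "c": 2},
}
print(predict(ratings, "ann", "c"))
```

The prediction for ann lands closer to bob's rating of item "c" than to eve's, because bob's past ratings match ann's more closely.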