- Select a language, for example R or Python
- Select a machine learning package, along with the associated packages for data manipulation, charting, output, etc.
- Get the data and explore it using Exploratory Data Analysis (EDA) techniques to build an initial understanding of the data and draw some first inferences
- Break your original data set into a training set and a testing set. Clarify what you want to predict for the testing set: for example, whether to give a loan to a customer based on their profile, OR which services to offer customers based on their past behavior recorded in the data sets. The testing set is typically smaller than the training set. The prediction output (result) column is available to the model only in the training set; in the testing set that column is either absent or held back until you check accuracy
- Identify the dependent and independent variables, and look for skewness and outliers in the data. Check whether any numeric values should be converted into categorical values because they have only a few states. Typical examples are levels like 1, 2, 3 or YES/NO type fields / columns
- Plot histograms, box plots, etc. to help with the previous step
- Fill in missing values using one of several techniques: either a simpler option such as the mean, median or mode, depending on the type of data, OR a machine learning algorithm to predict the replacement values or to create dummy fields / columns
- Move on to feature engineering: create completely new variables from the available data AND / OR transform existing ones, for example by adding thresholds to remove outliers. Find the important features and check the relevance of the newly created ones. If a new feature is highly correlated with earlier features / variables, it will mostly not give you new inferences, so it is worth doing some more manipulation to create variables that do lead to new inferences / results / observations
- Select your statistical model and create the tasks for machine learning (ML)
- Train your ML tasks on the training data using the selected algorithm, such as a decision tree, regression or random forest, chosen for its fit and suitability to the problem
- Predict on your testing data set using the model trained in the previous step
- Check your accuracy by comparing the results observed in the real situation with the predictions from the previous step
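
The steps above can be sketched end to end in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a recommended implementation: the loan data is synthetic, the field names (income, debt, repaid, net) are hypothetical, and a hand-written one-threshold "decision stump" stands in for a real algorithm. In practice a machine learning package would handle the splitting, imputation and training.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical loan data: income and debt in thousands; repaid is the outcome to predict.
rows = []
for _ in range(200):
    income = random.uniform(20, 100)
    debt = random.uniform(0, 50)
    repaid = 1 if income - debt > 30 else 0
    if random.random() < 0.1:
        income = None  # simulate a missing value
    rows.append({"income": income, "debt": debt, "repaid": repaid})

# Break the data into a larger training set and a smaller testing set.
random.shuffle(rows)
train, test = rows[:160], rows[160:]

# Fill in missing incomes with the median of the known training incomes.
median_income = statistics.median(
    r["income"] for r in train if r["income"] is not None
)
for r in train + test:
    if r["income"] is None:
        r["income"] = median_income
    # Feature engineering: one new variable derived from two existing ones.
    r["net"] = r["income"] - r["debt"]


def predict(row, threshold):
    """Decision stump: predict repayment when net income exceeds the threshold."""
    return 1 if row["net"] > threshold else 0


def accuracy_of(data, threshold):
    """Share of rows whose prediction matches the recorded outcome."""
    hits = sum(predict(r, threshold) == r["repaid"] for r in data)
    return hits / len(data)


# Train: pick the threshold that scores best on the training set only.
best_threshold = max((r["net"] for r in train), key=lambda t: accuracy_of(train, t))

# Predict on the testing set; its outcome column is not used here.
predictions = [predict(r, best_threshold) for r in test]

# Check accuracy by comparing the predictions with the real outcomes.
accuracy = sum(p == r["repaid"] for p, r in zip(predictions, test)) / len(test)
print(f"threshold={best_threshold:.1f} test accuracy={accuracy:.2f}")
```

The median is computed from the training set alone and then applied to both sets, so nothing learned from the testing set leaks into training.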
This is part 1 of the series on Machine Learning. Treat it as a generic guideline. We will often need to tailor it to particular situations and data sets, in which case the steps will be enhanced / substituted / refined as required.
Reach out to me at neil@techandtrain.com if you want to discuss Data Science / R / Java / Python / etc., want to arrange training for MBA / BE / MCA / MSc students, or are interested in a workshop for your managers / executives on Data Science / R / Java / AWS / Excel / etc.