Data Analysis Process in Analytics / Data Science

This article is based on my understanding of the Wikipedia article on Data Analysis & my experience in Data Science / Analytics / AI / ML – https://en.wikipedia.org/wiki/Data_analysis

Areas such as Data Mining, Predictive Analysis, Exploratory Data Analysis, Text Analytics, Business Intelligence, Confirmatory Data Analysis and Data Visualization overlap with data analysis.

Before you start solving an industry, academic or research problem in Data Science / Analytics / AI / ML / Decision Science, there is a fundamental step where many students & professionals struggle: data analysis. In this article, I provide a step-by-step approach to analyzing your data. Jumping straight into programming various algorithms or neural networks on your data can be counterproductive and should be avoided. The initial stage should involve robust data analysis via the steps given below, followed by model building, which can use custom algorithms, already proven ones, or a derivative of some popular models. For each point discussed below, I have either added information from my industry experience towards the end of the point, on top of my interpretation of the Wikipedia article, or added new points after the interpretations.

Your steps for data analysis should generally be:

  1. Set up your data analysis process at a high level with your objectives in mind: inspecting the data, cleaning it, processing it (which could include dimensionality reduction / feature engineering), transforming it, modelling it and communicating the results. Many forget to include a functional feedback loop in this process to improve data quality – that must be included too.
  2. The next step is to understand what the data is telling us. Data could be quantitative (numbers), textual or a mix of both, and the treatment for each of the three is different. For quantitative / numerical data, we try to understand whether it is time-series, ranking, part-to-whole, deviation, frequency distribution, correlation, nominal, or geographical / geospatial data. For textual or mixed data, we need approaches such as text mining, sentiment analysis and natural language processing to get insights around frequency of words, influential words & sentences by weight, trends, categories, clusters and more. Most of this article revolves around quantitative or numerical data and not textual data; this point gives only a very brief idea of textual data analysis.
  3. The next step is to apply quantitative techniques to the data: sanity checks, audit / reconciliation of totals via formulas, and checks on relationships within the data, such as whether variables are related in terms of correlation / sufficiency / necessity / etc. I would suggest using RStudio or a similar tool for this step.
  4. After this, we perform actions like filtering, sorting, checking ranges and classes, summarizing, and looking at clusters, relationships, context, extremes, etc. At this stage, exploratory data analysis techniques are very useful, using libraries that provide graphical representations. Excel & Tableau come in handy here.
  5. Our next step is to check for biases, separate facts from opinions, and identify any numerically incorrect / irrelevant inferences that are being projected and need correction / improvement. This needs a detailed study of the data from a domain / functional perspective, combined with statistical analysis. Working with a business / functional consultant in this phase is especially useful.
  6. Some areas we need to take care of include the quality of the data, the quality of measurements, transforming variables / observations to a log scale or similar (like the Richter scale used for earthquakes), and mapping to objectives and characteristics. This is an intuitive step where visualizing the data through various transformations in R / Python / etc., using libraries like ggplot2, Plotly, Matplotlib, etc., helps.
  7. Next comes checking for outliers, missing values and randomness, then analyzing & plotting various charts based on the type of data, whether categorical or continuous. This is statistical analysis & visualization, where I find R to be most suited.
  8. Building models around our data analysis steps could involve linear and non-linear models, checking values via hypothesis testing, and mapping to algorithms to process, predict, cluster, find trends and so on. Tools like R / Python with libraries like scikit-learn, NumPy, Pandas, mlr, caret, Keras, TensorFlow, etc. help here.
  9. While running the models, take care of cross-validation & sensitivity analysis – for supervised learning, this can generally be done using options available in the model training & testing phase.
  10. A feedback loop to improve data & results, accuracy analysis and improvement, pipeline building, interpretation of results & functional mapping to the domain are additional things we need to consider on top of the basics given in the Wikipedia article. Dimensionality reduction techniques like PCA and SVD also need to be explored in detail, as they are helpful in this analysis.
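
As a minimal sketch of the checks in steps 3, 6 and 7 (correlation, log transformation, missing values and outliers), the Python below uses only the standard library. The toy columns `income` and `spend`, the helper names and the 1.5 × IQR outlier rule are illustrative assumptions and do not come from the Wikipedia article; Pandas or R offer equivalent functions for real work.

```python
import math
import statistics

# Hypothetical toy data; None marks a missing observation.
income = [32.0, 35.5, None, 41.0, 38.2, 400.0, 36.8, 39.9]
spend = [20.1, 22.0, 25.0, 27.5, 24.9, 210.0, None, 26.3]


def count_missing(values):
    """Number of missing (None) entries in a column."""
    return sum(v is None for v in values)


def iqr_outliers(values, k=1.5):
    """Non-missing values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


def log_transform(values):
    """Natural log of each positive value; missing entries stay None."""
    return [None if v is None else math.log(v) for v in values]


def pearson(xs, ys):
    """Pearson correlation over rows where both columns are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = statistics.fmean(x for x, _ in pairs)
    my = statistics.fmean(y for _, y in pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)


print("missing in income:", count_missing(income))
print("outliers in income:", iqr_outliers(income))
print("log-scale income:", log_transform(income))
print("correlation income vs spend:", pearson(income, spend))
```

Running the correlation before and after removing the flagged outliers is a quick way to see how much a single extreme value drives the result.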

Additional information on top of what is in the Wikipedia article:

  1. Explainable AI / ML – https://en.wikipedia.org/wiki/Explainable_artificial_intelligence
  2. Interpretable ML – https://statmodeling.stat.columbia.edu/2018/10/30/explainable-ml-versus-interpretable-ml/
  3. Tools / languages / products to use: R, Python, Pandas, NumPy, Tableau and so on
  4. EDA – https://en.wikipedia.org/wiki/Exploratory_data_analysis
  5. Which chart to use – https://www.tableau.com/learn/whitepapers/which-chart-or-graph-is-right-for-you
  6. List of charts – https://python-graph-gallery.com/all-charts/
  7. Confirmatory data analysis – https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
  8. Singular Value Decomposition – https://en.wikipedia.org/wiki/Singular_value_decomposition
  9. Dimensionality Reduction – https://en.wikipedia.org/wiki/Dimensionality_reduction
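
To make the last two references concrete, here is a minimal sketch of dimensionality reduction using PCA computed through SVD with NumPy. The function name `pca_svd` and the synthetic three-column data are assumptions made for illustration; libraries such as scikit-learn provide ready-made PCA implementations.

```python
import numpy as np


def pca_svd(X, n_components):
    """Project X onto its top principal components using SVD.

    Returns (scores, components, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    # PCA requires centered columns.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T
    # Squared singular values are proportional to the variance per direction.
    ratio = (S ** 2) / np.sum(S ** 2)
    return scores, components, ratio[:n_components]


# Synthetic data (an assumption): the second column is almost a multiple
# of the first, and the third column is low-variance noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    rng.normal(scale=0.1, size=200),
])

scores, components, ratio = pca_svd(X, n_components=1)
print("reduced shape:", scores.shape)
print("variance explained by first component:", ratio[0])
```

Because two of the three columns carry nearly the same information, a single component retains most of the variance, which is the situation where dimensionality reduction pays off.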

Email me: Neil@TechAndTrain.com

Visit my creations:

  • www.TechAndTrain.com
  • www.QandA.in
  • www.TechTower.in

By Neil Harwani

Interested in movies, music, history, computer science, software, engineering and technology
