
Data Science Project Approach



Stages of a Data Science Project – CRISP-DM (Cross-Industry Standard Process for Data Mining)

1.Explore the problem space and define the problem statement precisely.

a.Identify the right problems to solve

b.Prioritize problems using management techniques

c.State the problems precisely

2.Explore the solution space – models, processes and procedures, sources of data, etc. List and evaluate the alternative solutions. Decide whether ML is required.

3.Mine the Data

a.Prepare the data

b.Survey the data

4.Model the Data

5.Evaluate and tune the model (Ref: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining)

6.Deploy and implement the solution – how will it integrate with the existing IT solutions, data flows, etc.?

Prepare the Data - Steps
1.Data preparation includes finding data and assembling the data set, and manipulating the data to enhance its utility for mining. The output of this step is the Analytics Base Table (ABT). It can be covered in the following steps:

a.Data discovery – hunting for and locating potentially good data for the given problem spec.

b.Data characterization – describes the data in terms of data type, range, distribution of values for a feature, missing values, outliers, data pollution, etc. for each attribute of the data. It helps assess the quality and reliability of the data. (Ref: https://www.mapr.com/blog/big-c-big-data-top-8-reasons-characterization-‘roight’-your-data)

c.Address data quality issues and enrich data

d.Data set assembly – builds a standard representation for the incoming data so that it can be mined. Create the ABT and the data quality report.
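For illustration, a minimal sketch of assembling an ABT and producing a simple data quality report with pandas (the file names and column names below are hypothetical):

```python
import pandas as pd

# Assemble the ABT: join customer attributes with transaction aggregates
# (file names and column names here are hypothetical)
customers = pd.read_csv("customers.csv")
transactions = pd.read_csv("transactions.csv")
spend = (transactions.groupby("customer_id")["amount"]
         .agg(["sum", "mean", "count"])
         .reset_index())
abt = customers.merge(spend, on="customer_id", how="left")

# A simple data quality report: type, missing values and cardinality per attribute
quality_report = pd.DataFrame({
    "dtype": abt.dtypes.astype(str),
    "n_missing": abt.isna().sum(),
    "pct_missing": abt.isna().mean().round(3),
    "n_unique": abt.nunique(),
})
print(quality_report)
print(abt.describe(include="all"))
```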

Survey the Data
The data survey examines and reports on the general properties of the manifold** in state space. The purpose is to help the data scientist get a feel for the data in terms of any trends and shapes it exhibits.

It guides the data scientist in identifying areas for further exploration and locating areas of interest.

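For example (a minimal sketch, assuming the prepared ABT is available as a pandas DataFrame named `abt`), a quick survey can combine summary statistics with a 2-D projection to get a rough view of the shape of the data:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assume `abt` is the prepared Analytics Base Table; keep numeric columns only
numeric = abt.select_dtypes(include="number").dropna()

# Summary statistics and correlations reveal ranges, skew and broad trends
print(numeric.describe())
print(numeric.corr())

# A 2-D PCA projection gives a rough picture of the data "manifold":
# clusters, elongated directions, possible outliers
projected = PCA(n_components=2).fit_transform(numeric)
plt.scatter(projected[:, 0], projected[:, 1], s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("2-D projection of the ABT")
plt.show()
```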
Note - ** Manifold is a term that refers to a mathematical space formed by a smooth hypersurface; in this context, it is the surface that the data points trace out in state space.

Data quality and enriching data

1.Representative sample – The data file represents the universe but is only a sample. For models built on this data to be effective, the sample should be a true representative of the population. Deciding the sample size (volume of data) is therefore critical. If the entire universe / population could be analysed, the model could be 100% confident, but analysing the whole population is rarely possible. If we are OK with, say, 95% confidence, statistics helps us work out the sample size we will need (see the first sketch after this list).

2.Categorical variables – have two or more categories with no specific order, e.g. gender. Categorical variables with an order are called ordinal variables. We can face two types of challenges here (this item and the next): to model categorical variables we may numerate (encode) them, and doing so introduces an order which does not exist in nature (see the encoding sketch after this list).

3.Ordinal variables – have categories along with an order that exists in the physical world, e.g. low, medium and high income groups. When these variables are numerated, the quantum difference between the numerated levels may or may not reflect the real difference between the categories.

4.Normalization – refers to the process of ensuring all attributes in an ABT are on the same scale. Two or more attributes become comparable only when they are on the same scale. There are various techniques for normalizing data, such as range normalization and Z-score normalization. Some models, such as ANNs, need normalization, whereas others benefit from it even though it is not mandatory (see the preprocessing sketch after this list).

5.Missing and empty values – Missing and empty values are not the same: a value may be genuinely missing, or the respondent may have consciously decided not to share it, leaving it empty. Such values can be ignored, the entire record can be dropped, or the value can be filled with the mean or mode, predicted from other attributes, etc. (mean imputation is also shown in the preprocessing sketch below).

6.Data Width – refers to the number of attributes in the data. The more attributes, the more information a data set is likely to carry, but too many attributes can bring a machine learning algorithm to its knees. The challenge is deciding which attributes to remove.

7.Data Pollution – Data in the attributes does not seem right. This may result from manual errors or from errors in the design of the data structures (e.g. using “B” to denote a business entity rather than an individual and putting that value in the gender column, because the original design did not expect non-human entities).
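On the representative-sample point above, a minimal sketch of the standard sample-size formula for estimating a proportion (assuming 95% confidence, a 5% margin of error and the most conservative proportion of 0.5):

```python
import math

# Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2
z = 1.96   # z-score for 95% confidence
p = 0.5    # most conservative (worst-case) proportion
e = 0.05   # margin of error we are willing to accept

n = math.ceil(z**2 * p * (1 - p) / e**2)
print(n)   # 385 records needed for a large population
```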
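For the categorical and ordinal variables above, a minimal pandas sketch (column names and category labels are hypothetical) contrasting one-hot encoding, which avoids introducing a spurious order, with an explicit ordered mapping for an ordinal variable:

```python
import pandas as pd

# Hypothetical data with one categorical and one ordinal attribute
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "income_group": ["Low", "High", "Medium", "Low"],
})

# Categorical: one-hot encode so no artificial order is introduced
df = pd.get_dummies(df, columns=["gender"], prefix="gender")

# Ordinal: map to numbers that respect the real-world order, remembering that
# the gaps between the numbers may not reflect the true differences
df["income_group"] = df["income_group"].map({"Low": 1, "Medium": 2, "High": 3})

print(df)
```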
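And for normalization and missing values, a minimal preprocessing sketch with pandas and scikit-learn (hypothetical numeric attributes), showing mean imputation followed by range (min-max) and Z-score normalization:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric attributes with a missing value
X = pd.DataFrame({"age": [25, 40, np.nan, 60],
                  "income": [30000, 52000, 61000, 75000]})

# Missing values: fill with the column mean (median, mode or a predictive
# model are alternatives, as noted above)
X_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X),
                        columns=X.columns)

# Range (min-max) normalization: rescale every attribute to [0, 1]
X_range = pd.DataFrame(MinMaxScaler().fit_transform(X_filled), columns=X.columns)

# Z-score normalization: zero mean, unit standard deviation per attribute
X_z = pd.DataFrame(StandardScaler().fit_transform(X_filled), columns=X.columns)

print(X_range)
print(X_z)
```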

Tidy data for machine learning …

1.It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003).

2.Data preparation is not just a first step, but must be repeated many times over the course of the analysis as new problems come to light or new data is collected.

3.Part of the challenge is the breadth of activities it encompasses: from outlier checking, to date parsing, to missing value imputation

4.Ref: http://vita.had.co.nz/papers/tidy-data.pdf

5.Ref: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

6.https://statsguys.wordpress.com/2014/01/03/first-post/ - Data cleaning

7.http://datatechblog.com/2015/08/data-analysis-using-r-gathering-organizing-munging/
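As a small illustration of the tidy-data idea from the references above (a minimal sketch with hypothetical column names), reshaping a "wide" table so that each variable is a column and each observation is a row:

```python
import pandas as pd

# A "messy" wide table: one column per year instead of a single 'year' variable
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [100, 150],
    "2020": [110, 160],
})

# Tidy (long) form: every row is one observation (country, year, value)
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")
print(tidy)
```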

The technical attributes of the data

1.Characteristics of data in terms of -

a.Volume of data. What is the likely volume of data that will be processed? The higher the volume of data, the more accurate the model is likely to be, but the resource requirements for storage and computing will also be higher.

b.Velocity of data. Is the data at rest or in motion when it is ingested into the system? What is the rate at which it is created, and does it have to be processed on the fly?

c.Variety of data. Data can come from various sources whose formats may differ widely. The structure of the data may not lend itself to easy joining. Data can take a variety of formats, such as numeric, categorical (string), or Boolean (true/false).

2.Tools and technologies required to capture, ingest, store and process the data. For example, will the project require ingestion tools such as Flume or Sqoop, R scripts, HDFS, MapReduce, Spark, etc.?

Where and how does it all fit in Big Data