The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making.

Data Collection

Data Understanding

Initial data analysis

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information.

Univariate analysis

Central tendency

  • mean
  • median
  • mode (众数)

Dispersion

  • range

  • quartiles (四分位数)

  • variance

  • standard deviation
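A minimal sketch of these summary statistics, assuming a small numeric pandas Series:

import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# central tendency
print(s.mean(), s.median(), s.mode().tolist())

# dispersion
print(s.max() - s.min())              # range
print(s.quantile([0.25, 0.5, 0.75]))  # quartiles
print(s.var(), s.std())               # sample variance and standard deviation (ddof=1)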

Univariate distributions

discrete distributions

  • Uniform distribution (discrete)
  • Bernoulli distribution

continuous distributions

  • Uniform distribution (continuous)
  • Normal distribution
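A minimal sketch of these four distributions with scipy.stats (the parameter values are arbitrary):

from scipy import stats

print(stats.randint(1, 7).pmf(3))    # discrete uniform on {1, ..., 6}
print(stats.bernoulli(0.3).pmf(1))   # Bernoulli with p = 0.3
print(stats.uniform(0, 1).pdf(0.5))  # continuous uniform on [0, 1]
print(stats.norm(0, 1).pdf(0.0))     # standard normal
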
Graphical methods

Histograms (直方图)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
comp1 = np.random.standard_normal(200)
comp2 = 10 + 2 * np.random.standard_normal(200)
values = pd.Series(np.concatenate([comp1, comp2]))
values.plot.hist(bins=100, color="black")
plt.show()
fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(values, bins=100, rwidth=0.9)
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
plt.show()
import seaborn as sns

# bins: the number of bars
sns.histplot(values, bins=100, color="black")
plt.show()

Stem-and-leaf display (茎叶图)

import pandas as pd
import stemgraphic

x = [
    12, 127, 28, 42, 39, 113, 42, 18, 44, 118, 44, 37, 113, 124, 37, 48, 127, 36, 29, 31,
    125, 139, 131, 115, 105, 132, 104, 123, 35, 113, 122, 42, 117, 119, 58, 109, 23, 105,
    63, 27, 44, 105, 99, 41, 128, 121, 116, 125, 32, 61, 37, 127, 29, 113, 121, 58, 114,
    126, 53, 114, 96, 25, 109, 7, 31, 141, 46, 13, 27, 43, 117, 116, 27, 7, 68, 40, 31, 115,
    124, 42, 128, 52, 71, 118, 117, 38, 27, 106, 33, 117, 116, 111, 40, 119, 47, 105, 57,
    122, 109, 124, 115, 43, 120, 43, 27, 27, 18, 28, 48, 125, 107, 114, 34, 133, 45, 120,
    30, 127, 31, 116, 146
]
y = pd.Series(x)

fig, ax = stemgraphic.stem_graphic(y)

Bivariate and multivariate analysis

Quantitative measures of dependence

Correlation and Covariance
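A minimal sketch, assuming two related numeric pandas Series x and y:

import numpy as np
import pandas as pd

x = pd.Series(np.random.standard_normal(100))
y = 2 * x + pd.Series(np.random.standard_normal(100))

print(x.cov(y))    # sample covariance
print(x.corr(y))   # Pearson correlation coefficient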

Descriptions of conditional distributions

Conditional discrete distributions

Conditional continuous distributions

Graphical methods

Cross-tabulations (交叉表) and contingency tables (列联表)

Scatterplots (散点图)
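A minimal sketch of both, using a small hypothetical DataFrame:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'gender': ['men', 'women', 'men', 'women'],
                   'pet': ['dog', 'cat', 'dog', 'bird'],
                   'height': [180, 165, 175, 160],
                   'weight': [80, 60, 75, 55]})

# cross-tabulation / contingency table of two categorical variables
print(pd.crosstab(df['gender'], df['pet'], margins=True))

# scatterplot of two quantitative variables
df.plot.scatter(x='height', y='weight')
plt.show()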

Exploratory Data Analysis (EDA)

The six practices of EDA are iterative and non-sequential.

Discovering

Data professionals familiarize themselves with the data so they can start conceptualizing how to use it. Review the data and ask questions about it.

Structuring

Structuring refers to categorizing and organizing data columns based on the data already in the data set.

For example, grouping data into months or quarters rather than years, or categorizing items into metal and non-metal categories and getting a total count for each type.
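A minimal sketch of the month/quarter grouping mentioned above, assuming a DataFrame with a hypothetical date column and an amount column:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2023-01-05', '2023-02-17', '2023-04-09']),
                   'amount': [100, 250, 80]})

# categorize rows into quarters instead of years and aggregate
print(df.groupby(df['date'].dt.to_period('Q'))['amount'].sum())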

Cleaning

Missing values, misspellings, duplicate entries, and extreme outliers are all fairly common issues that need to be addressed during data cleaning.

Joining

The process of augmenting or adjusting data by adding values from other datasets.

Add more value or context to the data by adding more information from other data sources.
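A minimal sketch of this kind of join, with hypothetical table and column names:

import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [120, 80, 200]})
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'region': ['east', 'west', 'east']})

# augment the orders with customer information from another dataset
enriched = orders.merge(customers, on='customer_id', how='left')
print(enriched)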

Validating

The process of verifying that the data is consistent and of high quality. Validating data means checking for misspellings and inconsistent number or date formats, and checking that the data-cleaning process didn’t introduce new errors.

Presenting

Making the cleaned dataset or data visualizations available to others for analysis or further modeling.

In the workplace, presenting might look like preparing visuals and a slide presentation to share with your team.

Box plot

Histogram

Multi-vari chart

Run chart

Pareto chart

Quantitative techniques

Median polish

Trimean

Ordination
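As an illustration of one of these, the trimean is a weighted average of the median and the two quartiles, TM = (Q1 + 2·Q2 + Q3) / 4; a minimal sketch:

import numpy as np

x = np.array([1, 3, 4, 5, 6, 7, 9, 12, 15])
q1, q2, q3 = np.percentile(x, [25, 50, 75])
trimean = (q1 + 2 * q2 + q3) / 4
print(trimean)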

Confirmatory Data Analysis (CDA)

A statistical hypothesis test (统计假设检验) is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis.

t-Test

A t-test is a statistical hypothesis test (significance test); it is most commonly applied when the test statistic follows a Student’s t-distribution under the null hypothesis.

The t-test’s most common application is to test whether the means of two populations (总体) are different. The t-test quantifies the difference between the arithmetic means of the two samples.

The most frequently used t-tests are one-sample and two independent sample tests.

Assume the null hypothesis, that the samples are drawn from populations with the same population mean, is true. A p-value larger than a chosen threshold (alpha, e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance, so we do not reject the null hypothesis of equal population means. If the p-value is smaller than the threshold, we have evidence against the null hypothesis of equal population means.

Assumptions:

  • Whether the two sample data groups are independent.
  • Whether the data elements in the respective groups follow a normal distribution.
  • Whether the two samples have similar variances. This assumption is also known as the homogeneity of variance assumption.

Before conducting the two-sample t-test, we need to check whether the two data groups have equal variance. As a rule of thumb, if the ratio of the larger sample variance to the smaller sample variance is less than 4:1, we can treat the variances as equal.

import numpy as np
import scipy.stats as stats

data_group1 = np.array([14, 15, 15, 16, 13, 8, 14,
                        17, 16, 14, 19, 20, 21, 15,
                        15, 16, 16, 13, 14, 12])
data_group2 = np.array([15, 17, 14, 17, 14, 8, 12,
                        19, 19, 14, 17, 22, 24, 16,
                        13, 16, 13, 18, 15, 13])
                        
# rule of thumb: treat the variances as equal if the larger is less than 4 times the smaller
variance1, variance2 = np.var(data_group1), np.var(data_group2)
equal_variance = max(variance1, variance2) / min(variance1, variance2) < 4

alpha = 0.05
result = stats.ttest_ind(a=data_group1, b=data_group2, equal_var=equal_variance)
if result.pvalue > alpha:
    print('We do not have sufficient evidence to say that the mean height of students between the two data groups is different.')
else:
    print('We have sufficient evidence to say that the mean height of students between the two data groups is different.')

ANOVA (Analysis of variance)

Analysis of variance (ANOVA) is used to analyze the differences among group means; it tests whether two or more means differ.

The null hypothesis in ANOVA is always that there is no difference in means.

Assumptions:

  • The samples are independent.
  • Each sample is from a normally distributed population.
  • The population standard deviations of the groups are all equal. This property is known as homoscedasticity.

The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. With a significance level (α) of 0.05, a p-value below 0.05 indicates statistical significance (统计显着性) at that level.

One-Way ANOVA

The simplest experiment suitable for ANOVA analysis is the completely randomized experiment with a single factor.

from scipy.stats import f_oneway

group1 = [89, 89, 88, 78, 79]
group2 = [93, 92, 94, 89, 88]
group3 = [89, 88, 89, 93, 90]
group4 = [81, 78, 81, 92, 82]

result = f_oneway(group1, group2, group3, group4)
if result.pvalue < 0.05:
	print('The null hypothesis is rejected. This implies that we have sufficient proof to say that there exists a difference in the factor among groups.')

Two-Way ANOVA

ANOVA generalizes to the study of the effects of multiple factors.

import numpy as np
import pandas as pd

Two factors: fertilizers and watering

# note: as written, Fertilizer and Watering take identical values in every row
# (the two factors are perfectly collinear); a real two-way design would cross
# them, e.g. 'Watering': np.tile(['daily', 'weekly'], 15)
df = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                   'Watering': np.repeat(['daily', 'weekly'], 15),
                   'height': [14, 16, 15, 15, 16, 13, 12, 11, 14,
                              15, 16, 16, 17, 18, 14, 13, 14, 14,
                              14, 15, 16, 16, 17, 18, 14, 13, 14,
                              14, 14, 15]})
import statsmodels.api as sm
import statsmodels.formula.api as smf
  
# Performing two-way ANOVA
model = smf.ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
                data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)

aov_table:

                                 sum_sq    df             F    PR(>F)
C(Fertilizer)              4.390651e-13   1.0  1.589719e-13  1.000000
C(Watering)                8.630952e-02   1.0  3.125000e-02  0.860956
C(Fertilizer):C(Watering)  3.333333e-02   1.0  1.206897e-02  0.913305
Residual                   7.733333e+01  28.0           NaN       NaN

Chi-Squared Test (卡方检验)

A chi-squared test is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variables (two dimensions of the contingency table) are independent in influencing the test statistic (values within the table).

The null hypothesis of the chi-squared test is that no relationship exists between the categorical variables in the population; they are independent.

Define a significance level to determine whether the relationship between the variables is statistically significant. Generally, a significance level (alpha) of 0.05 is chosen.

If the p-value for the test is greater than the alpha value, we fail to reject the null hypothesis.

The Contingency Table (列联表): the number of men and women buying different types of pets

        dog   cat   bird   total
men     207   282   241    730
women   234   242   232    708
total   441   524   473    1438
from scipy import stats

data = [[207, 282, 241], [234, 242, 232]]
chisq, pvalue, df, expected = stats.chi2_contingency(data)

alpha = 0.05
print('p value is {}'.format(pvalue))
if pvalue <= alpha:
    print('Dependent (reject the null hypothesis)')
else:
    print('Independent (the null hypothesis holds true)')

Data Visualization

The more data you have, the more visualizations you’ll need to create to understand how each variable impacts the others.

When you start looking at new data, a popular and valuable tactic is to visualize it first: for example, plot time series data on line charts to understand periodicity, or use scatter plots to get a good idea of the distribution of the data.
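A minimal sketch of those two quick looks, assuming a hypothetical daily time series with two numeric columns:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range('2023-01-01', periods=90, freq='D')
df = pd.DataFrame({'sales': np.random.standard_normal(90).cumsum(),
                   'profit': np.random.standard_normal(90).cumsum()}, index=idx)

# line chart of a time series to look for trends and periodicity
df['sales'].plot()
plt.show()

# scatter plot to see how two variables are distributed together
df.plot.scatter(x='sales', y='profit')
plt.show()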

It is your job to discover the important points, trends, biases, and stories that need to be shared, then design data visualizations in an effective way for different audiences. For example, the manufacturing supervisors might need to review time series data plotted out over time to identify the manufacturing delays. Meanwhile, the executive leadership team would likely be more interested in financial impact analysis. Those two data visualizations would need to be designed differently based on the needs of your audience.

At the same time, understand that the balance of words and data visuals in a presentation impacts the business decisions made from data insights.

However, visualizations can also cause confusion or even misrepresent the data. For instance, imagine a data professional develops a visualization and changes the scale of the axis or the ratio of the graph height and width to make the line chart look flat or steep. It is essential that your visualizations not mislead your audiences.

Remember, all data has stories to tell.

Tableau

Matplotlib

Seaborn

Plotly

Power BI

Data Preparation

Data Cleaning

Dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve.

It is always good to be proactive and create your data backup before you start your data clean-up.

Data Cleaning Checklist

Remove Irrelevant Data

Removing Duplicates

Handling Missing Data

Remove Extra Spaces And Blanks

Fixing Misspellings

Standardize Capitalization

Fixing Incorrect Punctuation and Other Typos

Detecting and Filtering Outliers
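A minimal pandas sketch covering several of the checklist items above (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({'name': [' Tony ', 'tom', 'Tony ', 'anna', None],
                   'price': [10.0, 12.5, 10.0, 10_000.0, 11.0]})

df['name'] = df['name'].str.strip()        # remove extra spaces and blanks
df['name'] = df['name'].str.capitalize()   # standardize capitalization
df = df.drop_duplicates()                  # removing duplicates
df = df.dropna(subset=['name'])            # handling missing data (here: drop)

# detecting and filtering extreme outliers with a simple quantile rule
upper = df['price'].quantile(0.95)
df = df[df['price'] <= upper]
print(df)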

Cleaning data with SQL

  • DISTINCT

    SELECT
    	DISTINCT customer_id
    FROM
    	customer_data.customer_purchase
    
  • LENGTH()

    SELECT
    	country
    FROM
    	customer_data.customer_purchase
    WHERE
    	LENGTH(country) > 2
    
  • SUBSTR()

    SELECT
    	DISTINCT customer_id
    FROM
    	customer_data.customer_purchase
    WHERE
    	SUBSTR(country,1,2) = 'US'
    
  • TRIM()

    SELECT
    	DISTINCT customer_id
    FROM
    	customer_data.customer_purchase
    WHERE
    	TRIM(state) = 'OH'
    
  • CAST()

    SELECT
    	CAST(date AS date) AS date_only,
    	purchase_price
    FROM
    	customer_data.customer_purchase
    WHERE
    	date BETWEEN '2020-12-01' AND '2020-12-31'
    
  • CONCAT()

    SELECT
    	CONCAT(product_code, product_color) AS new_product_code
    FROM
    	customer_data.customer_purchase
    
  • COALESCE()

    -- product if product is not null else product_code
    SELECT
    	COALESCE(product, product_code) AS product_info
    FROM
    	customer_data.customer_purchase
    
  • CASE

    SELECT
    	customer_id,
    	CASE
    		WHEN first_name = 'Tnoy' THEN 'Tony'
    		WHEN first_name = 'Tmo' THEN 'Tom'
    		ELSE first_name
    		END AS cleaned_name
    FROM
    	customer_data.customer_name
    

Verifying And Reporting

Verification

It involves rechecking your clean dataset, doing some manual clean ups if needed, and taking a moment to sit back and really think about the original purpose of the project.

See the big picture when verifying data-cleaning:

  • consider the business problem you’re trying to solve with the data
  • consider the goal of the project
  • consider whether your data is capable of solving the problem and meeting the project objectives.

Data Cleaning Report

Reporting your data-cleaning efforts is essential for tracking alterations to the data.

Report your data-cleaning efforts by documenting your cleaning process and keeping a changelog. A changelog is a file containing a chronologically ordered list of modifications made to a project.

Changelog

A file containing a chronologically ordered list of modifications made to a project.

Data Transformation

Categorical Data

collinearity treatment
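A minimal sketch of one common treatment: one-hot encode a categorical column and drop the first level to avoid the perfect collinearity of the dummy-variable trap (column names are hypothetical):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# one-hot encoding; drop_first=True removes one redundant dummy column
encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
print(encoded)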

Data Reduction

Dimensionality Reduction Methods

PCA

T-SNE

UMAP
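A minimal sketch of PCA with scikit-learn (t-SNE is available as sklearn.manifold.TSNE; UMAP requires the separate umap-learn package):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.standard_normal((100, 10))

# project the 10-dimensional data onto its first 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)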

Data Modeling

Model evaluation

Model fine-tuning