The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making.
Data Collection
Data Understanding
Initial data analysis
A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information.
Univariate analysis
Central tendency
- mean
- median
- mode
Dispersion
- range
- quartiles
- variance
- standard deviation
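These summaries can be computed directly with pandas; a minimal sketch, using a small made-up series for illustration:

```python
# Minimal sketch: central tendency and dispersion with pandas (values are made up)
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(values.mean())                        # mean
print(values.median())                      # median
print(values.mode().tolist())               # mode(s)
print(values.max() - values.min())          # range
print(values.quantile([0.25, 0.5, 0.75]))   # quartiles
print(values.var())                         # sample variance
print(values.std())                         # sample standard deviation
```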
Univariate distributions
discrete distributions
- Uniform distribution (discrete)
- Bernoulli distribution
continuous distributions
- Uniform distribution (continuous)
- Normal distribution
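A minimal sketch of these four distributions with scipy.stats; the parameters are chosen arbitrarily for illustration.

```python
# Minimal sketch: the four distributions above via scipy.stats (arbitrary parameters)
from scipy import stats

print(stats.randint(1, 7).pmf(3))     # discrete uniform on {1,...,6}: P(X = 3) = 1/6
print(stats.bernoulli(0.3).pmf(1))    # Bernoulli(p = 0.3): P(X = 1) = 0.3
print(stats.uniform(0, 1).pdf(0.5))   # continuous uniform on [0, 1]: density at 0.5
print(stats.norm(0, 1).cdf(1.96))     # standard normal: P(X <= 1.96) is about 0.975
```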
Graphical methods
Histograms
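A minimal matplotlib sketch of a histogram, using a simulated normal sample as placeholder data:

```python
# Minimal sketch: histogram of a simulated normal sample
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=1000)

plt.hist(sample, bins=30, edgecolor="black")
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Histogram")
plt.show()
```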
Stem-and-leaf display
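A minimal sketch of a text stem-and-leaf display (stems are tens digits, leaves are ones digits), on made-up data:

```python
# Minimal sketch: stem-and-leaf display, stems = tens digits, leaves = ones digits
from collections import defaultdict

data = [12, 15, 21, 24, 24, 28, 31, 33, 39, 40]

stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
# 1 | 2 5
# 2 | 1 4 4 8
# 3 | 1 3 9
# 4 | 0
```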
Bivariate and multivariate analysis
Quantitative measures of dependence
Correlation and Covariance
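With pandas, covariance and Pearson correlation are one-liners; the sketch below uses two hypothetical columns x and y.

```python
# Minimal sketch: covariance and Pearson correlation with pandas (hypothetical columns)
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

print(df.cov())                 # covariance matrix
print(df.corr())                # Pearson correlation matrix
print(df["x"].corr(df["y"]))    # single pairwise correlation coefficient
```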
Descriptions of conditional distributions
Conditional discrete distributions
Conditional continuous distributions
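Empirically, a conditional distribution can be described by summarizing one variable within the groups defined by another; a minimal pandas sketch with made-up data:

```python
# Minimal sketch: empirical conditional distributions via group-wise summaries (made-up data)
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [1.2, 1.5, 1.1, 2.4, 2.8, 2.6],
})

print(df.groupby("group")["value"].describe())                     # conditional summary statistics
print(df.groupby("group")["value"].value_counts(normalize=True))   # conditional relative frequencies
```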
Graphical methods
Cross-tabulations and contingency tables
Scatterplots
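A minimal sketch of both methods with pandas and matplotlib; the columns and values are hypothetical.

```python
# Minimal sketch: cross-tabulation and scatterplot (hypothetical data)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "gender": ["men", "women", "men", "women", "men"],
    "pet":    ["dog", "cat", "cat", "bird", "dog"],
    "age":    [25, 31, 42, 28, 36],
    "spend":  [120, 95, 180, 60, 150],
})

print(pd.crosstab(df["gender"], df["pet"]))   # contingency table of two categorical variables

plt.scatter(df["age"], df["spend"])           # scatterplot of two quantitative variables
plt.xlabel("age")
plt.ylabel("spend")
plt.show()
```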
Exploratory Data Analysis (EDA)
The six practices of EDA are iterative and non-sequential.
Discovering
Data professionals familiarize themselves with the data so they can start conceptualizing how to use it. Review the data and ask questions about it.
Structuring
Structuring refers to categorizing and organizing data columns based on the data already in the data set.
For example, structuring could mean categorizing data into months or quarters rather than years, or grouping items into metal and non-metal categories and getting a total count for each type, as sketched below.
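A minimal pandas sketch of this kind of structuring; the column names and values are made up.

```python
# Minimal sketch: structuring, bucketing dates into quarters and counting by category (made-up data)
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-15", "2023-04-02", "2023-04-20", "2023-09-30"]),
    "material":   ["metal", "non-metal", "metal", "metal"],
})

df["quarter"] = df["order_date"].dt.to_period("Q")   # categorize dates into quarters
print(df.groupby("quarter").size())                  # count of records per quarter
print(df["material"].value_counts())                 # total count per material category
```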
Cleaning
Missing values, misspellings, duplicate entries, and extreme outliers are all fairly common issues that need to be addressed when cleaning a data set.
Joining
The process of augmenting or adjusting data by adding values from other datasets.
Add more value or context to the data by bringing in information from other data sources, as in the merge sketch below.
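A minimal pandas merge sketch; the tables and columns are hypothetical.

```python
# Minimal sketch: joining, enriching purchases with customer details from another table (hypothetical)
import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120, 95, 180]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["OH", "CA", "NY"]})

enriched = purchases.merge(customers, on="customer_id", how="left")  # add context from another dataset
print(enriched)
```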
Validating
The process of verifying that the data is consistent and high quality: checking for misspellings and inconsistent number or date formats, and confirming that the data-cleaning process didn't introduce new errors.
Presenting
Making the cleaned dataset or data visualizations available to others for analysis or further modeling.
In the workplace, presenting might look like preparing visuals and a slide presentation to share with your team.
Box plot
Histogram
Multi-vari chart
Run chart
Pareto chart
Quantitative techniques
Median polish
Trimean
Ordination
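As one example from this list, the trimean is a robust location estimate: (Q1 + 2 × median + Q3) / 4. A minimal numpy sketch on made-up data:

```python
# Minimal sketch: the trimean, (Q1 + 2*median + Q3) / 4, on made-up data
import numpy as np

data = np.array([2, 4, 4, 5, 7, 9, 11, 12, 15])
q1, q2, q3 = np.percentile(data, [25, 50, 75])
trimean = (q1 + 2 * q2 + q3) / 4
print(trimean)
```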
Confirmatory Data Analysis (CDA)
A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis.
t-Test
A t-test is a statistical hypothesis (significance) test. It is most commonly applied when the test statistic would follow a normal distribution if a scaling term in the statistic were known; because that term is estimated from the data, the test statistic follows a Student's t-distribution.
The t-test’s most common application is to test whether the means of two populations (总体) are different. The t-test quantifies the difference between the arithmetic means of the two samples.
The most frequently used t-tests are one-sample and two independent sample tests.
Assume the null hypothesis, that the samples are drawn from populations with equal means, is true. A p-value larger than a chosen threshold (alpha, e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance, so we do not reject the null hypothesis of equal population means. If the p-value is smaller than the threshold, we have evidence against the null hypothesis of equal population means.
Assumptions:
- The two samples (data groups) are independent of each other.
- The data in each group follow a normal distribution.
- The two samples have similar variances. This is also known as the homogeneity of variance assumption.
Before conducting a two-sample t-test, check whether the two groups have similar variances. A common rule of thumb is that if the ratio of the larger sample variance to the smaller sample variance is less than 4:1, the groups can be treated as having equal variance, as in the sketch below.
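A minimal sketch of a two independent sample t-test with scipy.stats, using made-up samples and the 4:1 variance rule of thumb described above:

```python
# Minimal sketch: two independent sample t-test with scipy (made-up samples)
import numpy as np
from scipy import stats

group_a = np.array([20.1, 22.4, 19.8, 21.5, 23.0, 20.9])
group_b = np.array([23.2, 24.1, 22.8, 25.0, 23.9, 24.6])

# Rule of thumb: treat variances as equal if the larger is less than 4x the smaller;
# otherwise fall back to Welch's t-test (equal_var=False).
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
equal_var = max(var_a, var_b) / min(var_a, var_b) < 4

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(t_stat, p_value)
if p_value < 0.05:
    print("Reject the null hypothesis of equal population means.")
else:
    print("Fail to reject the null hypothesis of equal population means.")
```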
ANOVA (Analysis of variance)
Analysis of variance (ANOVA) is used to analyze the differences among group means; it tests whether two or more means differ.
The null hypothesis in ANOVA is always that there is no difference in means.
Assumptions:
- The samples are independent.
- Each sample is from a normally distributed population.
- The population standard deviations of the groups are all equal. This property is known as homoscedasticity.
The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. With a significance level (α) of 0.05, a p-value below 0.05 indicates statistical significance at the 0.05 alpha level.
One-Way ANOVA
The simplest experiment suitable for ANOVA analysis is the completely randomized experiment with a single factor.
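A minimal one-way ANOVA sketch with scipy.stats.f_oneway, on three made-up treatment groups:

```python
# Minimal sketch: one-way ANOVA with scipy (three made-up treatment groups)
from scipy import stats

group1 = [85, 86, 88, 75, 78, 94]
group2 = [91, 92, 93, 85, 87, 84]
group3 = [79, 78, 88, 94, 92, 85]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)   # p < 0.05 suggests at least one group mean differs
```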
Two-Way ANOVA
ANOVA generalizes to the study of the effects of multiple factors.
Example: two factors, fertilizers and watering. The resulting ANOVA table (aov_table) summarizes both main effects and their interaction (see the statsmodels sketch below).
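A minimal two-way ANOVA sketch with statsmodels; the fertilizer/watering data below are invented for illustration.

```python
# Minimal sketch: two-way ANOVA with statsmodels (invented fertilizer/watering data)
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "fertilizer": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "watering":   ["low", "low", "high", "high", "low", "low", "high", "high"],
    "height":     [4.5, 4.8, 6.1, 6.3, 5.0, 5.2, 7.4, 7.1],
})

model = ols("height ~ C(fertilizer) + C(watering) + C(fertilizer):C(watering)", data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)   # main effects plus interaction
print(aov_table)
```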
Chi-Squared Test
A chi-squared test is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variables (two dimensions of the contingency table) are independent in influencing the test statistic (values within the table).
The null hypothesis of the chi-squared test is that no relationship exists between the categorical variables in the population; they are independent.
Define a significance level to determine whether the relationship between the variables is statistically significant. Generally, a significance level (alpha) of 0.05 is chosen.
If the p-value for the test is greater than the alpha value, we fail to reject the null hypothesis; if it is smaller, we reject the null hypothesis of independence.
The contingency table: the number of men and women buying different types of pets
|       | dog | cat | bird | total |
|-------|-----|-----|------|-------|
| men   | 207 | 282 | 241  | 730   |
| women | 234 | 242 | 232  | 708   |
| total | 441 | 524 | 473  | 1438  |
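The table above can be tested directly with scipy.stats.chi2_contingency; a minimal sketch:

```python
# Minimal sketch: chi-squared test of independence on the contingency table above
import numpy as np
from scipy import stats

observed = np.array([
    [207, 282, 241],   # men:   dog, cat, bird
    [234, 242, 232],   # women: dog, cat, bird
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value, dof)
if p_value < 0.05:
    print("Reject the null hypothesis: gender and pet choice are not independent.")
else:
    print("Fail to reject the null hypothesis of independence.")
```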
Data Visualization
The more data you have, the more visualizations you'll need to create to understand how each variable impacts the others.
When you start looking at new data, a popular and valuable tactic is to visualize it, for example plotting time series data on line charts to understand periodicity, or using scatter plots to get a good idea of the data distribution.
It is your job to discover the important points, trends, biases, and stories that need to be shared, then design data visualizations in an effective way for different audiences. For example, the manufacturing supervisors might need to review time series data plotted out over time to identify the manufacturing delays. Meanwhile, the executive leadership team would likely be more interested in financial impact analysis. Those two data visualizations would need to be designed differently based on the needs of your audience.
Also understand that the balance of words and data visuals in a presentation impacts the business decisions made from data insights.
However, visualizations can also cause confusion or even misrepresent the data. For instance, imagine a data professional develops a visualization and changes the scale of the axis or the ratio of the graph height and width to make the line chart look flat or steep. It is essential that your visualizations not mislead your audiences.
Remember, all data has stories to tell.
Tableau
Matplotlib
Seaborn
Plotly
Power BI
Data Preparation
Data Cleaning
Dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve.
It is always good to be proactive and create your data backup before you start your data clean-up.
Data Cleaning Checklist
Remove Irrelevant Data
Removing Duplicates
Handling Missing Data
Remove Extra Spaces And Blanks
Fixing Misspellings
Standardize Capitalization
Fixing Incorrect Punctuation and Other Typos
Detecting and Filtering Outliers
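A minimal pandas sketch covering several checklist items; the file name, column names, and replacement values are hypothetical.

```python
# Minimal sketch: common cleaning steps with pandas (hypothetical file and columns)
import pandas as pd

df = pd.read_csv("customer_purchase.csv")                    # hypothetical input file

df = df.drop(columns=["internal_note"], errors="ignore")     # remove irrelevant data
df = df.drop_duplicates()                                     # remove duplicates
df["state"] = df["state"].str.strip()                         # remove extra spaces and blanks
df["first_name"] = df["first_name"].replace({"Tnoy": "Tony", "Tmo": "Tom"})   # fix misspellings
df["country"] = df["country"].str.upper()                     # standardize capitalization
df["purchase_price"] = df["purchase_price"].fillna(df["purchase_price"].median())  # handle missing data

# Detect and filter outliers with the 1.5 * IQR rule
q1, q3 = df["purchase_price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["purchase_price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```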
Cleaning data with SQL
- DISTINCT

```sql
SELECT DISTINCT customer_id
FROM customer_data.customer_purchase
```

- LENGTH()

```sql
SELECT country
FROM customer_data.customer_purchase
WHERE LENGTH(country) > 2
```

- SUBSTR()

```sql
SELECT DISTINCT customer_id
FROM customer_data.customer_purchase
WHERE SUBSTR(country, 1, 2) = 'US'
```

- TRIM()

```sql
SELECT DISTINCT customer_id
FROM customer_data.customer_purchase
WHERE TRIM(state) = 'OH'
```

- CAST()

```sql
SELECT CAST(date AS date) AS date_only,
       purchase_price
FROM customer_data.customer_purchase
WHERE date BETWEEN '2020-12-01' AND '2020-12-31'
```

- CONCAT()

```sql
SELECT CONCAT(product_code, product_color) AS new_product_code
FROM customer_data.customer_purchase
```

- COALESCE()

```sql
-- product if product is not null, else product_code
SELECT COALESCE(product, product_code) AS product_info
FROM customer_data.customer_purchase
```

- CASE

```sql
SELECT customer_id,
       CASE
           WHEN first_name = 'Tnoy' THEN 'Tony'
           WHEN first_name = 'Tmo' THEN 'Tom'
           ELSE first_name
       END AS cleaned_name
FROM customer_data.customer_name
```
Verifying And Reporting
Verification
It involves rechecking your clean dataset, doing some manual clean ups if needed, and taking a moment to sit back and really think about the original purpose of the project.
See the big picture when verifying data-cleaning:
- consider the business problem you’re trying to solve with the data
- consider the goal of the project
- consider whether your data is capable of solving the problem and meeting the project objectives.
Data Cleaning Report
Reporting your data-cleaning efforts is essential for tracking alterations to the data.
Report your data-cleaning efforts by documenting your cleaning process and keeping a changelog. A changelog is a file containing a chronologically ordered list of modifications made to a project.
Changelog
A file containing a chronologically ordered list of modifications made to a project.
Data Transformation
Categorical Data
collinearity treatment
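A minimal sketch of one common transformation: one-hot encoding a categorical column with pandas, dropping one level to avoid the perfect collinearity of the dummy-variable trap.

```python
# Minimal sketch: one-hot encoding with pandas; drop_first removes one redundant (collinear) dummy
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(encoded)
```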
Data Reduction
Dimensionality Reduction Methods
PCA
t-SNE
UMAP
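A minimal scikit-learn sketch of PCA and t-SNE on a random feature matrix (purely illustrative); UMAP follows the same fit_transform pattern via the separate umap-learn package.

```python
# Minimal sketch: PCA and t-SNE with scikit-learn on a random feature matrix
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))     # 100 samples, 10 features (made up)

X_pca = PCA(n_components=2).fit_transform(X)                    # linear projection to 2 components
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)   # nonlinear 2-D embedding

print(X_pca.shape, X_tsne.shape)   # both (100, 2)
# UMAP (umap-learn): umap.UMAP(n_components=2).fit_transform(X)
```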