Libraries
pacman::p_load() loads packages and installs missing ones. This is convenient for now, but in production/reproducible work you might want a fixed renv environment instead. See: renv
Ames Dataset
This problem presents a data set describing the sale of individual residential property in Ames, Iowa from 2006 to 2010. The data set contains 1460 observations and a large number of explanatory variables (nominal, ordinal, discrete, and continuous) involved in assessing home values. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this Kaggle competition challenges you to predict the final price of each home. The full dataset as provided on the Kaggle website has been provided on Vula in the file ames.csv or in the following link:
Kaggle AMES competition
1) Load the data + quick missingness summary (variable-level)
# A tibble: 81 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 PoolQC 1453 99.5
2 MiscFeature 1406 96.3
3 Alley 1369 93.8
4 Fence 1179 80.8
5 FireplaceQu 690 47.3
6 LotFrontage 259 17.7
7 GarageType 81 5.55
8 GarageYrBlt 81 5.55
9 GarageFinish 81 5.55
10 GarageQual 81 5.55
# ℹ 71 more rows
Here we should look for variables with very high percentage missingness which may need special handling (drop, impute, or treat missing as meaningful). Identify patterns like “structural missingness” (e.g., PoolArea missing because most houses have no pool).
2) Visualize missingness patterns (row-level + structure)
If a few variables dominate missingness, you can focus cleaning there. If certain variables always go missing together, it might indicate a data collection rule.
3) SmartEDA: automated overview tables
Are the variable types correct? Are there “coded” missing values (like 0 or 999) that aren’t NA?
Index Variable_Name Variable_Type Sample_n Missing_Count Per_of_Missing
1 1 Id integer 1460 0 0.000
2 2 MSSubClass integer 1460 0 0.000
3 3 MSZoning character 1460 0 0.000
4 4 LotFrontage integer 1201 259 0.177
5 5 LotArea integer 1460 0 0.000
6 6 Street character 1460 0 0.000
7 7 Alley character 91 1369 0.938
8 8 LotShape character 1460 0 0.000
9 9 LandContour character 1460 0 0.000
10 10 Utilities character 1460 0 0.000
11 11 LotConfig character 1460 0 0.000
12 12 LandSlope character 1460 0 0.000
13 13 Neighborhood character 1460 0 0.000
14 14 Condition1 character 1460 0 0.000
15 15 Condition2 character 1460 0 0.000
16 16 BldgType character 1460 0 0.000
17 17 HouseStyle character 1460 0 0.000
18 18 OverallQual integer 1460 0 0.000
19 19 OverallCond integer 1460 0 0.000
20 20 YearBuilt integer 1460 0 0.000
21 21 YearRemodAdd integer 1460 0 0.000
22 22 RoofStyle character 1460 0 0.000
23 23 RoofMatl character 1460 0 0.000
24 24 Exterior1st character 1460 0 0.000
25 25 Exterior2nd character 1460 0 0.000
26 26 MasVnrType character 1452 8 0.005
27 27 MasVnrArea integer 1452 8 0.005
28 28 ExterQual character 1460 0 0.000
29 29 ExterCond character 1460 0 0.000
30 30 Foundation character 1460 0 0.000
31 31 BsmtQual character 1423 37 0.025
32 32 BsmtCond character 1423 37 0.025
33 33 BsmtExposure character 1422 38 0.026
34 34 BsmtFinType1 character 1423 37 0.025
35 35 BsmtFinSF1 integer 1460 0 0.000
36 36 BsmtFinType2 character 1422 38 0.026
37 37 BsmtFinSF2 integer 1460 0 0.000
38 38 BsmtUnfSF integer 1460 0 0.000
39 39 TotalBsmtSF integer 1460 0 0.000
40 40 Heating character 1460 0 0.000
41 41 HeatingQC character 1460 0 0.000
42 42 CentralAir character 1460 0 0.000
43 43 Electrical character 1459 1 0.001
44 44 X1stFlrSF integer 1460 0 0.000
45 45 X2ndFlrSF integer 1460 0 0.000
46 46 LowQualFinSF integer 1460 0 0.000
47 47 GrLivArea integer 1460 0 0.000
48 48 BsmtFullBath integer 1460 0 0.000
49 49 BsmtHalfBath integer 1460 0 0.000
50 50 FullBath integer 1460 0 0.000
51 51 HalfBath integer 1460 0 0.000
52 52 BedroomAbvGr integer 1460 0 0.000
53 53 KitchenAbvGr integer 1460 0 0.000
54 54 KitchenQual character 1460 0 0.000
55 55 TotRmsAbvGrd integer 1460 0 0.000
56 56 Functional character 1460 0 0.000
57 57 Fireplaces integer 1460 0 0.000
58 58 FireplaceQu character 770 690 0.473
59 59 GarageType character 1379 81 0.055
60 60 GarageYrBlt integer 1379 81 0.055
61 61 GarageFinish character 1379 81 0.055
62 62 GarageCars integer 1460 0 0.000
63 63 GarageArea integer 1460 0 0.000
64 64 GarageQual character 1379 81 0.055
65 65 GarageCond character 1379 81 0.055
66 66 PavedDrive character 1460 0 0.000
67 67 WoodDeckSF integer 1460 0 0.000
68 68 OpenPorchSF integer 1460 0 0.000
69 69 EnclosedPorch integer 1460 0 0.000
70 70 X3SsnPorch integer 1460 0 0.000
71 71 ScreenPorch integer 1460 0 0.000
72 72 PoolArea integer 1460 0 0.000
73 73 PoolQC character 7 1453 0.995
74 74 Fence character 281 1179 0.808
75 75 MiscFeature character 54 1406 0.963
76 76 MiscVal integer 1460 0 0.000
77 77 MoSold integer 1460 0 0.000
78 78 YrSold integer 1460 0 0.000
79 79 SaleType character 1460 0 0.000
80 80 SaleCondition character 1460 0 0.000
81 81 SalePrice integer 1460 0 0.000
No_of_distinct_values
1 1460
2 15
3 5
4 110
5 1073
6 2
7 3
8 4
9 4
10 2
11 5
12 3
13 25
14 9
15 8
16 5
17 8
18 10
19 9
20 112
21 61
22 6
23 8
24 15
25 16
26 5
27 327
28 4
29 5
30 6
31 5
32 5
33 5
34 7
35 637
36 7
37 144
38 780
39 721
40 6
41 5
42 2
43 6
44 753
45 417
46 24
47 861
48 4
49 3
50 4
51 3
52 8
53 4
54 4
55 12
56 7
57 4
58 6
59 7
60 97
61 4
62 5
63 441
64 6
65 6
66 3
67 274
68 202
69 120
70 20
71 76
72 8
73 4
74 5
75 5
76 21
77 12
78 5
79 9
80 6
81 663
4) Exporting summary tables to LaTeX (for reports)
This is especially important to make results “publication-ready”. Helps with controlling formatting rather than pasting screenshots.
Or using flextable() package to obtain the summary for numerical variables:
Vname | Group | TN | nNeg | nZero | nPos | NegInf | PosInf | NA_Value | Per_of_Missing | sum | min | max | mean | median | SD | CV | IQR | Skewness | Kurtosis |
|---|
BsmtFinSF1 | All | 1,460 | 0 | 467 | 993 | 0 | 0 | 0 | 0.00 | 647,714 | 0 | 5,644 | 443.64 | 383.5 | 456.10 | 1.03 | 712.25 | 1.68 | 11.08 |
BsmtFinSF2 | All | 1,460 | 0 | 1,293 | 167 | 0 | 0 | 0 | 0.00 | 67,962 | 0 | 1,474 | 46.55 | 0.0 | 161.32 | 3.47 | 0.00 | 4.25 | 20.04 |
BsmtUnfSF | All | 1,460 | 0 | 118 | 1,342 | 0 | 0 | 0 | 0.00 | 828,171 | 0 | 2,336 | 567.24 | 477.5 | 441.87 | 0.78 | 585.00 | 0.92 | 0.47 |
EnclosedPorch | All | 1,460 | 0 | 1,252 | 208 | 0 | 0 | 0 | 0.00 | 32,053 | 0 | 552 | 21.95 | 0.0 | 61.12 | 2.78 | 0.00 | 3.09 | 10.39 |
GarageArea | All | 1,460 | 0 | 81 | 1,379 | 0 | 0 | 0 | 0.00 | 690,551 | 0 | 1,418 | 472.98 | 480.0 | 213.80 | 0.45 | 241.50 | 0.18 | 0.91 |
GarageYrBlt | All | 1,460 | 0 | 0 | 1,379 | 0 | 0 | 81 | 5.55 | 2,728,360 | 1,900 | 2,010 | 1,978.51 | 1,980.0 | 24.69 | 0.01 | 41.00 | -0.65 | -0.42 |
GrLivArea | All | 1,460 | 0 | 0 | 1,460 | 0 | 0 | 0 | 0.00 | 2,212,577 | 334 | 5,642 | 1,515.46 | 1,464.0 | 525.48 | 0.35 | 647.25 | 1.37 | 4.87 |
Id | All | 1,460 | 0 | 0 | 1,460 | 0 | 0 | 0 | 0.00 | 1,066,530 | 1 | 1,460 | 730.50 | 730.5 | 421.61 | 0.58 | 729.50 | 0.00 | -1.20 |
LotArea | All | 1,460 | 0 | 0 | 1,460 | 0 | 0 | 0 | 0.00 | 15,354,569 | 1,300 | 215,245 | 10,516.83 | 9,478.5 | 9,981.26 | 0.95 | 4,048.00 | 12.20 | 202.54 |
LotFrontage | All | 1,460 | 0 | 0 | 1,201 | 0 | 0 | 259 | 17.74 | 84,130 | 21 | 313 | 70.05 | 69.0 | 24.28 | 0.35 | 21.00 | 2.16 | 17.38 |
MasVnrArea | All | 1,460 | 0 | 861 | 591 | 0 | 0 | 8 | 0.55 | 150,551 | 0 | 1,600 | 103.69 | 0.0 | 181.07 | 1.75 | 166.00 | 2.67 | 10.04 |
OpenPorchSF | All | 1,460 | 0 | 656 | 804 | 0 | 0 | 0 | 0.00 | 68,124 | 0 | 547 | 46.66 | 25.0 | 66.26 | 1.42 | 68.00 | 2.36 | 8.46 |
SalePrice | All | 1,460 | 0 | 0 | 1,460 | 0 | 0 | 0 | 0.00 | 264,144,946 | 34,900 | 755,000 | 180,921.20 | 163,000.0 | 79,442.50 | 0.44 | 84,025.00 | 1.88 | 6.51 |
ScreenPorch | All | 1,460 | 0 | 1,344 | 116 | 0 | 0 | 0 | 0.00 | 21,989 | 0 | 480 | 15.06 | 0.0 | 55.76 | 3.70 | 0.00 | 4.12 | 18.37 |
TotalBsmtSF | All | 1,460 | 0 | 37 | 1,423 | 0 | 0 | 0 | 0.00 | 1,543,847 | 0 | 6,110 | 1,057.43 | 991.5 | 438.71 | 0.41 | 502.50 | 1.52 | 13.20 |
WoodDeckSF | All | 1,460 | 0 | 761 | 699 | 0 | 0 | 0 | 0.00 | 137,597 | 0 | 857 | 94.24 | 0.0 | 125.34 | 1.33 | 168.00 | 1.54 | 2.98 |
X1stFlrSF | All | 1,460 | 0 | 0 | 1,460 | 0 | 0 | 0 | 0.00 | 1,697,435 | 334 | 4,692 | 1,162.63 | 1,087.0 | 386.59 | 0.33 | 509.25 | 1.38 | 5.72 |
X2ndFlrSF | All | 1,460 | 0 | 829 | 631 | 0 | 0 | 0 | 0.00 | 506,609 | 0 | 2,065 | 346.99 | 0.0 | 436.53 | 1.26 | 728.00 | 0.81 | -0.56 |
YearBuilt | All | 1,460 | 0 | 0 | 1,460 | 0 | 0 | 0 | 0.00 | 2,878,051 | 1,872 | 2,010 | 1,971.27 | 1,973.0 | 30.20 | 0.02 | 46.00 | -0.61 | -0.44 |
YearRemodAdd | All | 1,460 | 0 | 0 | 1,460 | 0 | 0 | 0 | 0.00 | 2,897,904 | 1,950 | 2,010 | 1,984.87 | 1,994.0 | 20.65 | 0.01 | 37.00 | -0.50 | -1.27 |
5) Numeric summaries with SmartEDA + display with flextable
See what summary information there is available for you. PPeek at column names of the summary output so you can select/format them.
[1] "Vname" "Group" "TN" "nNeg"
[5] "nZero" "nPos" "NegInf" "PosInf"
[9] "NA_Value" "Per_of_Missing" "sum" "min"
[13] "max" "mean" "median" "SD"
[17] "CV" "IQR" "Skewness" "Kurtosis"
Select only a few informative columns (Vname, min, max, CV) and customise header row for readability. You’re modelling good communication: don’t overwhelm people with 20 columns.
Variable | Minimum | Maximum | Coefficient of Variation |
|---|
BsmtFinSF1 | 0 | 5,644 | 1.03 |
BsmtFinSF2 | 0 | 1,474 | 3.47 |
BsmtUnfSF | 0 | 2,336 | 0.78 |
EnclosedPorch | 0 | 552 | 2.78 |
GarageArea | 0 | 1,418 | 0.45 |
GarageYrBlt | 1,900 | 2,010 | 0.01 |
GrLivArea | 334 | 5,642 | 0.35 |
Id | 1 | 1,460 | 0.58 |
LotArea | 1,300 | 215,245 | 0.95 |
LotFrontage | 21 | 313 | 0.35 |
MasVnrArea | 0 | 1,600 | 1.75 |
OpenPorchSF | 0 | 547 | 1.42 |
SalePrice | 34,900 | 755,000 | 0.44 |
ScreenPorch | 0 | 480 | 3.70 |
TotalBsmtSF | 0 | 6,110 | 0.41 |
WoodDeckSF | 0 | 857 | 1.33 |
X1stFlrSF | 334 | 4,692 | 0.33 |
X2ndFlrSF | 0 | 2,065 | 1.26 |
YearBuilt | 1,872 | 2,010 | 0.02 |
YearRemodAdd | 1,950 | 2,010 | 0.01 |
6) Summary tables per group (group-aware EDA)
You can have your summary tables per group. Some EDA questions are conditional: e.g., “Do houses with different heating types have different distributions for some numeric variable?”
You need to decide to use Group based on which variable? Are group differences meaningful, or just noise? Do missing values differ by group (potential bias)?
Variable | Heating Group | Minimum | Maximum | Coefficient of Variation |
|---|
BsmtFinSF1 | All | 0 | 5,644 | 1.03 |
BsmtFinSF2 | All | 0 | 1,474 | 3.47 |
BsmtUnfSF | All | 0 | 2,336 | 0.78 |
EnclosedPorch | All | 0 | 552 | 2.78 |
GarageArea | All | 0 | 1,418 | 0.45 |
GarageYrBlt | All | 1,900 | 2,010 | 0.01 |
GrLivArea | All | 334 | 5,642 | 0.35 |
Id | All | 1 | 1,460 | 0.58 |
LotArea | All | 1,300 | 215,245 | 0.95 |
LotFrontage | All | 21 | 313 | 0.35 |
MasVnrArea | All | 0 | 1,600 | 1.75 |
OpenPorchSF | All | 0 | 547 | 1.42 |
SalePrice | All | 34,900 | 755,000 | 0.44 |
ScreenPorch | All | 0 | 480 | 3.70 |
TotalBsmtSF | All | 0 | 6,110 | 0.41 |
WoodDeckSF | All | 0 | 857 | 1.33 |
X1stFlrSF | All | 334 | 4,692 | 0.33 |
X2ndFlrSF | All | 0 | 2,065 | 1.26 |
YearBuilt | All | 1,872 | 2,010 | 0.02 |
YearRemodAdd | All | 1,950 | 2,010 | 0.01 |
If LotFrontage is missing more often for low- or high-priced houses, the missingness may not be random (important for modelling decisions).
8) Example: zeros / rare values (PoolArea)
Understand the distribution of PoolArea. Often PoolArea is 0 for most houses (no pool) and positive for a few.
PoolArea n
1 0 1453
2 480 1
3 512 1
4 519 1
5 555 1
6 576 1
7 648 1
8 738 1
9) Faceting to compare relationships across subgroups (PoolArea)
Does the LotFrontage–SalePrice relationship differ by PoolArea category? Faceting is a quick “small multiples” strategy to explore subgroup structure.
10) Missingness by group + missingness by factor
If missingness differs across groups, analyses can become biased.
# A tibble: 160 × 4
# Groups: Street [2]
Street variable n_miss pct_miss
<chr> <chr> <int> <num>
1 Pave PoolQC 1447 99.5
2 Pave MiscFeature 1403 96.5
3 Pave Alley 1363 93.7
4 Pave Fence 1173 80.7
5 Pave FireplaceQu 686 47.2
6 Pave LotFrontage 258 17.7
7 Pave GarageType 80 5.50
8 Pave GarageYrBlt 80 5.50
9 Pave GarageFinish 80 5.50
10 Pave GarageQual 80 5.50
# ℹ 150 more rows