Chapter 4 Cluster Analysis - SA Crime dataset

4.1 Data cleaning and getting ready for analysis

4.2 Exploratory data analysis

This first section prepares the spatial and crime data for clustering.

#setwd("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats")

###################### DATA ######################
# polygon police
policeboundary = st_read("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/station_boundaries_points/station_boundaries/Police_bounds.shp")
## Reading layer `Police_bounds' from data source 
##   `C:\Users\01438475\OneDrive - University of Cape Town\UCTcourses\Honours\Analytics\UnsupervisedLearning\Cluster\crimestats\station_boundaries_points\station_boundaries\Police_bounds.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1172 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 16.45189 ymin: -34.83305 xmax: 32.89099 ymax: -22.12503
## Geodetic CRS:  WGS 84
policeboundary = policeboundary |> clean_names()
tm_shape(policeboundary)+tm_polygons()

# point police
policepoints = st_read("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/station_boundaries_points/station_points/Police_points.shp")
## Reading layer `Police_points' from data source 
##   `C:\Users\01438475\OneDrive - University of Cape Town\UCTcourses\Honours\Analytics\UnsupervisedLearning\Cluster\crimestats\station_boundaries_points\station_points\Police_points.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1172 features and 5 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 16.49218 ymin: -34.80059 xmax: 32.74348 ymax: -22.34331
## Geodetic CRS:  WGS 84
policepoints = policepoints |> clean_names()
tm_shape(policepoints)+tm_dots()

# pop polygon
pop = st_read("C:/Users/01438475/OneDrive - University of Cape Town/ASDA/DataSets/census-2011-population/Census 2011 WardLvl Pop Labour Force_region.shp")
## Reading layer `Census 2011 WardLvl Pop Labour Force_region' from data source 
##   `C:\Users\01438475\OneDrive - University of Cape Town\ASDA\DataSets\census-2011-population\Census 2011 WardLvl Pop Labour Force_region.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 4277 features and 66 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 16.45189 ymin: -34.83417 xmax: 32.94498 ymax: -22.12503
## Geodetic CRS:  WGS 84
pop = pop |> clean_names() |> 
  mutate(coct = as.numeric(str_sub(ward_id, 1, 3)))

tm_shape(pop)+tm_polygons()

popCoct = pop |> filter(coct == 191)
tm_shape(popCoct)+tm_polygons(fill = "pop")

# za boundary
zaboundary = st_read("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/zaf_admin_boundaries.shp/zaf_admin4.shp")
## Reading layer `zaf_admin4' from data source 
##   `C:\Users\01438475\OneDrive - University of Cape Town\UCTcourses\Honours\Analytics\UnsupervisedLearning\Cluster\crimestats\zaf_admin_boundaries.shp\zaf_admin4.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 4392 features and 39 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 16.45189 ymin: -34.83417 xmax: 32.94498 ymax: -22.12503
## Geodetic CRS:  WGS 84
tm_shape(zaboundary)+tm_polygons()

# ward boundary
wardsboundary = st_read("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/Wards/Wards.shp")
## Reading layer `Wards' from data source 
##   `C:\Users\01438475\OneDrive - University of Cape Town\UCTcourses\Honours\Analytics\UnsupervisedLearning\Cluster\crimestats\Wards\Wards.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 116 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 2037950 ymin: -4077020 xmax: 2115607 ymax: -3958026
## Projected CRS: WGS 84 / Pseudo-Mercator
tm_shape(wardsboundary)+tm_polygons()

# crime df
crime2025 <- read_excel("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/crime2025.xlsx", sheet = "Sheet5")
head(crime2025)
## # A tibble: 6 × 10
##   `Row Labels`  Assault with the intent to…¹ `Attempted murder` `Common assault`
##   <chr>                                <dbl>              <dbl>            <dbl>
## 1 Albertinia                              19                  1               37
## 2 Ashton                                  23                  1               27
## 3 Athlone                                 46                 25              137
## 4 Atlantis                                51                 12              341
## 5 Barrydale                               20                  0               23
## 6 Beaufort West                           89                  5              164
## # ℹ abbreviated name:
## #   ¹​`Assault with the intent to inflict grievous bodily harm`
## # ℹ 6 more variables: `Common robbery` <dbl>, Murder <dbl>,
## #   `Robbery with aggravating circumstances` <dbl>, `Sexual offences` <dbl>,
## #   district <chr>, province <chr>
crime2025 = crime2025 |> clean_names() 
crime2025 <- crime2025 |>
       mutate(name = str_to_sentence(str_to_lower(row_labels)))

# match with police points
policepoints <- policepoints |>
       mutate(name = str_to_sentence(str_to_lower(compnt_nm)))

policeboundary <- policeboundary |>
       mutate(name = str_to_sentence(str_to_lower(compnt_nm)))

final_boundary <- policeboundary |>
     left_join(crime2025, by = "name")

final_points <- policepoints |>
  left_join(crime2025, by = "name")


head(final_points)
## Simple feature collection with 6 features and 16 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 25.52347 ymin: -33.84109 xmax: 27.62384 ymax: -29.2362
## Geodetic CRS:  WGS 84
##      compnt_nm location_x location_y create_dt version         name row_labels
## 1   BOTSHABELO   26.71606  -29.23620  20251205   1.4.2   Botshabelo       <NA>
## 2 KHUBUSIDRIFT   27.62384  -32.56819  20251205   1.4.2 Khubusidrift       <NA>
## 3  STUTTERHEIM   27.42741  -32.57118  20251205   1.4.2  Stutterheim       <NA>
## 4   MOTHERWELL   25.58419  -33.79664  20251205   1.4.2   Motherwell       <NA>
## 5     KWADWESI   25.52347  -33.84109  20251205   1.4.2     Kwadwesi       <NA>
## 6        NXUBA   25.62571  -32.17560  20251205   1.4.2        Nxuba       <NA>
##   assault_with_the_intent_to_inflict_grievous_bodily_harm attempted_murder
## 1                                                      NA               NA
## 2                                                      NA               NA
## 3                                                      NA               NA
## 4                                                      NA               NA
## 5                                                      NA               NA
## 6                                                      NA               NA
##   common_assault common_robbery murder robbery_with_aggravating_circumstances
## 1             NA             NA     NA                                     NA
## 2             NA             NA     NA                                     NA
## 3             NA             NA     NA                                     NA
## 4             NA             NA     NA                                     NA
## 5             NA             NA     NA                                     NA
## 6             NA             NA     NA                                     NA
##   sexual_offences district province                   geometry
## 1              NA     <NA>     <NA>  POINT (26.71606 -29.2362)
## 2              NA     <NA>     <NA> POINT (27.62384 -32.56819)
## 3              NA     <NA>     <NA> POINT (27.42741 -32.57118)
## 4              NA     <NA>     <NA> POINT (25.58419 -33.79664)
## 5              NA     <NA>     <NA> POINT (25.52347 -33.84109)
## 6              NA     <NA>     <NA>  POINT (25.62571 -32.1756)
final_points = final_points |> filter(!is.na(province))
tm_shape(final_points)+tm_dots()

tm_shape(final_points)+tm_dots(col = "sexual_offences")

tm_shape(final_points)+tm_dots(fill = "sexual_offences")

final_pointsCoCT = final_points |> filter(district=="City of Cape Town District")
tm_shape(final_pointsCoCT)+tm_dots(fill = "sexual_offences")

tm_shape(popCoct)+tm_polygons()+tm_shape(final_pointsCoCT)+tm_dots(fill = "attempted_murder")

tm_shape(popCoct)+tm_polygons()+tm_shape(final_pointsCoCT)+tm_dots()

intersections = st_intersects(final_pointsCoCT, popCoct)
final_pointsCoCT$population <- popCoct[unlist(intersections), "pop",
                       drop = TRUE] # drop geometry
final_pointsCoCT
## Simple feature collection with 63 features and 17 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 18.34645 ymin: -34.19313 xmax: 18.86941 ymax: -33.56479
## Geodetic CRS:  WGS 84
## First 10 features:
##          compnt_nm location_x location_y create_dt version            name
## 1       RONDEBOSCH   18.46833  -33.96262  20251205   1.4.2      Rondebosch
## 2       MUIZENBERG   18.46663  -34.10872  20251205   1.4.2      Muizenberg
## 3          LWANDLE   18.86718  -34.11561  20251205   1.4.2         Lwandle
## 4    SAMORA MACHEL   18.57275  -34.01742  20251205   1.4.2   Samora machel
## 5         PHILIPPI   18.54054  -34.00107  20251205   1.4.2        Philippi
## 6           HARARE   18.66930  -34.05193  20251205   1.4.2          Harare
## 7        KLEINVLEI   18.71879  -33.98820  20251205   1.4.2       Kleinvlei
## 8           STRAND   18.83286  -34.11592  20251205   1.4.2          Strand
## 9  LINGELETHU-WEST   18.66004  -34.04298  20251205   1.4.2 Lingelethu-west
## 10         MOWBRAY   18.47228  -33.95012  20251205   1.4.2         Mowbray
##         row_labels assault_with_the_intent_to_inflict_grievous_bodily_harm
## 1       Rondebosch                                                       0
## 2       Muizenberg                                                      20
## 3          Lwandle                                                      86
## 4    Samora Machel                                                      59
## 5         Philippi                                                      47
## 6           Harare                                                     107
## 7        Kleinvlei                                                      96
## 8           Strand                                                      26
## 9  Lingelethu-West                                                      53
## 10         Mowbray                                                       3
##    attempted_murder common_assault common_robbery murder
## 1                 0              8              3      0
## 2                14            103             31     22
## 3                22            111             20     26
## 4                 9             38             13     18
## 5                16            122             24     31
## 6                13            160             16     36
## 7                27            303             61     25
## 8                 3            129             21      5
## 9                14            185             30     25
## 10                0              7              6      0
##    robbery_with_aggravating_circumstances sexual_offences
## 1                                       5               5
## 2                                      57               9
## 3                                      87              38
## 4                                      88              22
## 5                                      40              22
## 6                                     139              72
## 7                                     115              27
## 8                                      35              21
## 9                                     153              30
## 10                                     11               0
##                      district     province                   geometry
## 1  City of Cape Town District Western Cape POINT (18.46833 -33.96262)
## 2  City of Cape Town District Western Cape POINT (18.46663 -34.10872)
## 3  City of Cape Town District Western Cape POINT (18.86718 -34.11561)
## 4  City of Cape Town District Western Cape POINT (18.57275 -34.01742)
## 5  City of Cape Town District Western Cape POINT (18.54054 -34.00107)
## 6  City of Cape Town District Western Cape  POINT (18.6693 -34.05193)
## 7  City of Cape Town District Western Cape  POINT (18.71879 -33.9882)
## 8  City of Cape Town District Western Cape POINT (18.83286 -34.11592)
## 9  City of Cape Town District Western Cape POINT (18.66004 -34.04298)
## 10 City of Cape Town District Western Cape POINT (18.47228 -33.95012)
##    population
## 1       23734
## 2       24489
## 3       39177
## 4       43695
## 5       46151
## 6       28971
## 7       41077
## 8       33367
## 9       34699
## 10      33076

The main statistical issue is that crime counts are not directly comparable across police areas because each area has a different population size. Therefore, we convert selected crime counts into rates per 100k people.

For police area \(i=1,\ldots,N\) and crime type \(j=1,\ldots,P\), define

\[ \text{Rate}_{ij}=\frac{X_{ij}}{Pop_i}\times 100{,}000, \]

where \(X_{ij}\) is the observed number of crimes of type \(j\) (variable) in area \(i\), and \(Pop_i\) is the population assigned to that police area. This standardises crime counts across areas of different sizes in terms of population. Allows fair comparison between, say Khayelitsha (very large population) and Albertinia (small population). The clustering dataset can therefore be written as

\[ \mathbf{X^*}=\begin{bmatrix} x^*_{11} & x^*_{12} & \cdots & x^*_{1p}\\ x^*_{21} & x^*_{22} & \cdots & x^*_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x^*_{n1} & x^*_{n2} & \cdots & x^*_{np} \end{bmatrix}, \]

where each row is a police station area and each column is a crime-rate variable (\(X^*\)).

Key takeaway: We should not cluster directly on raw crime counts because high-population areas may appear as high-crime areas simply because more people live there.

tm_shape(final_pointsCoCT)+tm_dots(fill = "population")

final_pointsCoCT = final_pointsCoCT |> 
  mutate(assault_per100k = assault_with_the_intent_to_inflict_grievous_bodily_harm*100000/population,
         attempted_murder_per100k = attempted_murder*100000/population,
         common_assault_per100k = common_assault*100000/population,
         common_robbery_per100k = common_robbery*100000/population,
         murder_per100k = murder*100000/population,
         robbery_per100k = robbery_with_aggravating_circumstances*100000/population,
         sexual_offences_per100k = sexual_offences*100000/population)

tm_shape(final_pointsCoCT)+tm_dots(fill = "assault_per100k")

tm_shape(final_pointsCoCT)+tm_dots(fill = "attempted_murder_per100k")

tm_shape(final_pointsCoCT)+tm_dots(fill = "common_assault_per100k")

tm_shape(final_pointsCoCT)+tm_dots(fill = "common_robbery_per100k")

tm_shape(final_pointsCoCT)+tm_dots(fill = "murder_per100k")

tm_shape(final_pointsCoCT)+tm_dots(fill = "robbery_per100k")

tm_shape(final_pointsCoCT)+tm_dots(fill = "sexual_offences_per100k")

clustervars = final_pointsCoCT |> st_drop_geometry() |> select(c(6,18:24))

4.2.1 Examine the correlations

Before clustering, we examine the spatial distribution of crime variables and the relationships among the crime-rate variables. This helps us understand whether clustering is likely to reveal meaningful structure.

The exploratory questions are:

  1. Do crime rates vary spatially across police areas?
  2. Are some crime types strongly correlated?
  3. Is dimension reduction useful before clustering?

The sample correlation between crime variables \(j\) and \(k\) is

\[ r_{jk}=\frac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)} {\sqrt{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}\sqrt{\sum_{i=1}^{n}(x_{ik}-\bar{x}_k)^2}}. \]

The full correlation matrix is

\[ \mathbf{R}=\left(r_{jk}\right)_{p\times p}. \]

Strong positive correlations suggest that some crimes tend to be high in the same areas. This can lead to redundancy in the clustering variables and motivates the use of PCA.

corrplot(cor(clustervars[,2:8], 
             use = "complete.obs"), 
         method = "color")

4.2.2 Perform Principal Component Analysis (PCA)

Principal component analysis (PCA) transforms the original variables into uncorrelated linear combinations. PCA is a way to summarise the main directions of variation before or alongside clustering.

pca_crime <- PCA(clustervars[,2:8], graph = FALSE)

fviz_eig(pca_crime)
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

fviz_pca_var(
  pca_crime,
  col.var = "contrib",
  repel = TRUE
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

4.2.3 Extract PCA scores

The PCA scores are the coordinates of each observation in the new principal-component space. If most variation is captured by the first two components, then plotting PC1 against PC2 gives a useful low-dimensional view of the data. These scores are useful for visualising clusters found by algorithms such as k-means.

pca <- prcomp(clustervars[,2:8], scale. = TRUE, center = TRUE)
pca_scores <- as.data.frame(pca$x)
biplot(pca)

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2421 0.9536 0.72301 0.52131 0.33804 0.30030 0.25456
## Proportion of Variance 0.7181 0.1299 0.07468 0.03882 0.01632 0.01288 0.00926
## Cumulative Proportion  0.7181 0.8480 0.92271 0.96154 0.97786 0.99074 1.00000

4.3 Cluster Analysis

Cluster analysis aims to divide the \(n\) police areas into groups that are internally similar and externally different. A clustering solution with \(K\) groups can be written as

\[ C=\{C_1,C_2,\ldots,C_K\}, \]

where each \(C_k\) is a cluster and every observation belongs to one cluster.

The broad goal is:

\[ \text{high similarity within clusters, and low similarity between clusters.} \]

In this lecture, we compare centroid-based K-means, density-based, medoid-based and hierarchical clustering approaches.

nstart = 25

Firstly, we define which variables will be used for cluster analysis:

data <- scale(clustervars[,2:8])

K-means is sensitive to scaling and initial starting values, which is why the data are standardised and nstart is used.

4.3.1 Kmeans

K-means partitions the data into \(K\) clusters by minimising the within-cluster sum of squares:

\[ W_K=\sum_{k=1}^{K}\sum_{i\in C_k}\|\mathbf{x}_i-\boldsymbol{\mu}_k\|^2, \]

where \(\boldsymbol{\mu}_k\) is the centroid of cluster \(k\):

\[ \boldsymbol{\mu}_k=\frac{1}{|C_k|}\sum_{i\in C_k}\mathbf{x}_i. \]

The algorithm alternates between two steps:

  1. Assignment step: assign each observation to its nearest centroid.

\[ C_k=\left\{i: \|\mathbf{x}_i-\boldsymbol{\mu}_k\|^2 \leq \|\mathbf{x}_i-\boldsymbol{\mu}_\ell\|^2, \; \forall \ell\right\}. \]

Therefore, \(C_k\)th cluster is made up of all \(i\)th observations that are assigned to \(k\)th cluster since the distance to \(k\)th cluster is the shortest.

  1. Update step: recompute each centroid using the observations assigned to that cluster.

4.3.1.1 Kmeans with 2 clusters

set.seed(123)
km2 <- kmeans(as.data.frame(data), 2, nstart = nstart)
# k-means group number of each observation
km2$cluster
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  1  1  2  2  1  2  1  1  1  1  2  1  1  2  1  2  1  1  2  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  1  1  1  2  1  2  1  2  2  2  2  1  2  1  1  1  1  2  2  1  2  1  1  1  1  1 
## 53 54 55 56 57 58 59 60 61 62 63 
##  1  1  1  1  1  1  1  2  1  2  1
4.3.1.1.1 Visualise on the map

Mapping the cluster labels allows us to assess whether the clusters have a spatial pattern. Although k-means does not explicitly use spatial adjacency here, nearby police areas may still fall into similar clusters if their crime-rate profiles are similar.

final_pointsCoCT$cluster2 = as.factor(km2$cluster)
tm_shape(final_pointsCoCT)+tm_dots(fill = "cluster2")

clustervector = factor(km2$cluster)
crime_kmeans <- clustervars[,2:8] |> 
  mutate(cluster = factor(km2$cluster))

fviz_cluster(
  km2,
  data = data,
  geom = "point"
  #ellipse.type = "norm"
)

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2421 0.9536 0.72301 0.52131 0.33804 0.30030 0.25456
## Proportion of Variance 0.7181 0.1299 0.07468 0.03882 0.01632 0.01288 0.00926
## Cumulative Proportion  0.7181 0.8480 0.92271 0.96154 0.97786 0.99074 1.00000
4.3.1.1.2 Cluster Profiling

After clustering, we summarise the variables within each cluster. For cluster \(k\) and variable \(j\), the cluster mean is

\[ \bar{x}_{kj}=\frac{1}{n_k}\sum_{i\in C_k}x_{ij}, \]

where \(n_k=|C_k|\).

These profiles help us move from a numerical cluster label such as “Cluster 1” to a substantive interpretation such as “high violent-crime-rate cluster” or “lower crime-rate cluster”.

crime_kmeans |> 
  group_by(cluster) |> 
  summarise(
    n = n(),
    mean_assault = mean(assault_per100k, na.rm = TRUE),
    mean_attmurder = mean(attempted_murder_per100k, na.rm = TRUE),
    mean_comassault = mean(common_assault_per100k, na.rm = TRUE),
    mean_commrobb = mean(common_robbery_per100k, na.rm = TRUE),
    mean_robb = mean(robbery_per100k, na.rm = TRUE),
    mean_murder = mean(murder_per100k, na.rm = TRUE),
    mean_sexual = mean(sexual_offences_per100k, na.rm = TRUE)
    
  )
## # A tibble: 2 × 9
##   cluster     n mean_assault mean_attmurder mean_comassault mean_commrobb
##   <fct>   <int>        <dbl>          <dbl>           <dbl>         <dbl>
## 1 1          44         68.6           19.2            197.          64.8
## 2 2          19        313.            72.2            677.         232. 
## # ℹ 3 more variables: mean_robb <dbl>, mean_murder <dbl>, mean_sexual <dbl>
# or simply

class.means <- apply(clustervars[,2:8], 2, 
                     function(x) 
                       tapply (x, clustervector, mean))
class.means
##   assault_per100k attempted_murder_per100k common_assault_per100k
## 1        68.61626                 19.16964               196.9548
## 2       312.53703                 72.22646               676.7743
##   common_robbery_per100k murder_per100k robbery_per100k sexual_offences_per100k
## 1               64.82612        15.8700        95.21795                30.45958
## 2              232.25331       108.1782       441.52001               121.54682
plot (c(1,ncol(data)),range(class.means),type="n",xlab="",ylab="Average proportion of protein intake",xaxt="n")
axis (side=1, 1:ncol(data), colnames(data), las=2)
#ensure you list enough colours for the number of clusters
colvec <- c("blue","red")
for (i in 1:nrow(class.means))
  lines (1:ncol(data),class.means[i,],col=colvec[i])

4.3.1.2 Kmeans with 3 Clusters

Increasing the number of clusters allows more detailed profiles. For \(K=3\), the objective becomes

\[ W_3=\sum_{k=1}^{3}\sum_{i\in C_k}\|\mathbf{x}_i-\boldsymbol{\mu}_k\|^2. \]

Compared with \(K=2\), this may split a broad high-crime or low-crime group into more nuanced subgroups.

km3 <- kmeans(data, centers = 3, nstart = nstart)
4.3.1.2.1 Visualise on the map
final_pointsCoCT$cluster3 = as.factor(km3$cluster)
tm_shape(final_pointsCoCT)+tm_dots(fill = "cluster3")

4.3.1.2.2 Visualise on 2D
pca_scores$cluster = as.factor(km3$cluster)
ggplot(pca_scores, aes(x= PC1, y=PC2, colour = cluster))+geom_point()
## Warning: <ggplot> %+% x was deprecated in ggplot2 4.0.0.
## ℹ Please use <ggplot> + x instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

4.3.2 DBSCAN

DBSCAN defines clusters as dense regions of points separated by sparse regions. It depends on two parameters: \(\varepsilon\) and MinPts.

The \(\varepsilon\)-neighbourhood of observation \(i\) is

\[ N_\varepsilon(i)=\{j: d(\mathbf{x}_i,\mathbf{x}_j)\leq \varepsilon\}. \]

A point is a core point if

\[ |N_\varepsilon(i)|\geq \text{MinPts}. \]

A point is a border point if it is not a core point but lies within the \(\varepsilon\)-neighbourhood of a core point. A point is labelled as noise if it is neither a core point nor a border point.

DBSCAN is useful when clusters are irregularly shaped and when outliers are meaningful. In crime analysis, the “noise” points may themselves be interesting unusual police areas.

# add long-lat as well
spatial_data <- final_pointsCoCT |> 
  st_drop_geometry()|>
  select(c(6,2,3,18:24))
#minPts = dim+1 = 9+1
# We inspect the k-NN distance plot for k = minPts - 1
kNNdistplot(scale(spatial_data[,-1]), minPts = 9)

db <- fpc::dbscan(scale(spatial_data[,-1]), eps = 3.5, MinPts = 10)

spatial_data <- spatial_data |> 
  mutate(spatial_cluster = factor(db$cluster))

ggplot(spatial_data, aes(x = location_x, 
                         y = location_y, 
                         colour = spatial_cluster)) +
  geom_point(alpha = 0.7, size = 2) +
  coord_equal() +
  theme_minimal() +
  labs(
    title = "Spatial DBSCAN clusters of crime data",
    colour = "Cluster"
  )

plot(db,scale(spatial_data[,4:10]) , main = "DBSCAN", frame = FALSE)

4.3.3 PAM

PAM is a k-medoids method. Unlike k-means, which represents each cluster by a centroid, PAM represents each cluster by an actual observed data point called a medoid.

The objective is

\[ \min_{m_1,\ldots,m_K}\sum_{k=1}^{K}\sum_{i\in C_k}d(\mathbf{x}_i,\mathbf{m}_k), \]

where \(\mathbf{m}_k\) is the medoid of cluster \(k\).

PAM is usually more robust to outliers than k-means because medoids are actual observations rather than averages.

pam.out <- pam(data, 2)
pam.out$cluster
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  2  2  1  1  2  2  1  2  1  1  1  1  2  1  1  2  1  2  1  1  2  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  1  1  1  2  1  2  1  2  2  2  2  2  2  1  2  1  1  2  2  1  2  1  1  1  1  1 
## 53 54 55 56 57 58 59 60 61 62 63 
##  1  1  1  1  2  1  1  2  1  2  1

4.3.4 CLARA

CLARA extends PAM by applying k-medoids to several subsamples and then assigning all observations to their nearest medoids. For a given set of medoids, the average cost is

\[ \frac{1}{n}\sum_{i=1}^{n}\min_{k=1,\ldots,K}d(\mathbf{x}_i,\mathbf{m}_k). \]

CLARA is mainly used when PAM becomes computationally expensive.

clarax <- clara(data, 2, samples=10)
# Silhouette plot
plot(silhouette(clarax), col = 2:3, main = "Silhouette plot")

### K-medoids ### Hierarchical

Hierarchical clustering creates a sequence of nested partitions. Agglomerative hierarchical clustering starts with each observation in its own cluster and repeatedly merges the two closest clusters.

At the start:

\[ \mathcal{C}^{(0)}=\{\{1\},\{2\},\ldots,\{n\}\}. \]

At each step, the two closest clusters are merged according to a linkage rule. #### Computing pairwise distance matrices

Most clustering methods require a measure of dissimilarity. For Euclidean distance,

\[ d_{ii'}=\|\mathbf{x}_i-\mathbf{x}_{i'}\|=\sqrt{\sum_{j=1}^{P}(x_{ij}-x_{i'j})^2}. \]

The distance matrix is

\[ \mathbf{D}=\left(d_{ii'}\right)_{n\times n}. \]

dist.out <- dist(data, method = "euclidean")

4.3.4.1 Single linkage

Single linkage defines the distance between two clusters as the minimum pairwise distance:

\[ d(A,B)=\min_{i\in A, i'\in B}d(i,i'). \]

This method can detect elongated clusters, but it is prone to chaining.

out.single.euc <- hclust(daisy(data,metric="euclidean"),method="single") 

# try other metric="euclidean"
plot(out.single.euc)

4.3.4.1.1 Cut the tree
 # decide to cut the tree at height 1.5
out.single.euc <- cutree(out.single.euc, h=1.5)
 # view cluster allocation
names (out.single.euc) <- rownames(data)
sort(out.single.euc)
##  1  2  3  4  5  7  8  9 10 11 12 13 15 16 18 19 20 21 23 24 25 26 27 28 29 30 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 31 33 38 40 41 42 43 46 48 49 50 51 52 53 54 55 56 57 58 59 60 61 63  6 35 14 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  3 
## 22 17 32 39 34 36 37 44 45 47 62 
##  3  4  5  5  6  7  8  9 10 11 12

4.3.4.2 Complete linkage

Complete linkage defines the distance between two clusters as the maximum pairwise distance:

\[ d(A,B)=\max_{i\in A, i'\in B}d(i,i'). \]

This tends to produce compact clusters.

hc <- hclust(dist.out, method = "complete")
plot(hc, labels = F,-1)
rect.hclust(hc, k = 2, border = 2:3) # Add rectangle around 2 clusters, try with 3?

4.3.4.3 Centroid method

Centroid linkage uses the distance between cluster centroids:

\[ d(A,B)=\|\boldsymbol{\mu}_A-\boldsymbol{\mu}_B\|. \]

where

\[ \boldsymbol{\mu}_A=\frac{1}{|A|}\sum_{i\in A}\mathbf{x}_i. \]

# Centroid clustering
out.centroid.euc <- hclust(daisy(data,metric="euclidean"),method="centroid")
plot(out.centroid.euc)

4.3.4.4 Different distance measures

Different distance measures can lead to different cluster structures. This is important because clustering is not only about the algorithm; it also depends strongly on how similarity is defined.

4.3.4.4.1 Manhattan distance

Manhattan distance is

\[ d_{ii'}=\sum_{j=1}^{p}|x_{ij}-x_{i'j}|. \]

It is less dominated by large squared differences than Euclidean distance.

out.centroid.city <- hclust(daisy(data,metric="manhattan"),method="centroid")
plot(out.centroid.city)

4.3.4.4.2 Correlation as a dissimilarity

A correlation-based dissimilarity can be defined as

\[ d_{ii'}=1-\operatorname{cor}(\mathbf{x}_i,\mathbf{x}_{i'}). \]

This compares the shape or profile of observations rather than their absolute magnitude.

out.centroid.cor <- hclust(as.dist(1-cor(t(data))),method="centroid")
plot(out.centroid.cor)

4.4 Internal validation: assessing cluster quality without external labels

Internal validation uses only the data and the cluster allocation to assess whether the clustering solution is compact and well separated.

4.4.1 Manual calculations from first principles

4.4.1.1 Within-cluster sum of squares

For a fixed number of clusters \(K\), the within-cluster sum of squares is

\[ W_K=\sum_{k=1}^{K}\sum_{i\in C_k}\|\mathbf{x}_i-\boldsymbol{\mu}_k\|^2. \]

For \(K=1\), all observations belong to one cluster, so

\[ W_1=\sum_{i=1}^{n}\|\mathbf{x}_i-\bar{\mathbf{x}}\|^2. \]

As \(K\) increases, \(W_K\) decreases monotonically. The elbow method looks for the point after which increasing \(K\) gives only small improvements.

4.4.1.1.1 Manual
source("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/within_variation_calc.R")

cluster = km2$cluster

# different number of clusters:
totwithinss = NA
wss0 <- sum(rowSums((sweep(data, 2, colMeans(data), "-"))^2))
totwithinss[1] = wss0

for (k in 2:10){
  km <- kmeans(data, centers = k, nstart = nstart)
  cluster = km$cluster
  totwithinss[k]= within_variation_manual(data, cluster)$total_within_variation
}


wss_df <- data.frame(
  k = 1:10,
  WSS = totwithinss
)

ggplot(wss_df, aes(x = k, y = WSS)) +
  geom_line() +
  geom_point(size = 3) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title = "Elbow Plot for K-means Clustering",
    x = "Number of clusters (K)",
    y = "Total within-cluster sum of squares"
  ) +
  theme_minimal()

4.4.1.1.2 Using R function
fviz_nbclust(data, kmeans, method = "wss") +
  labs(title = "Elbow method for k-means")
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

4.4.1.2 Silhouette width

For observation \(i\), let \(a(i)\) be the average distance from \(i\) to all other observations in its own cluster, and let \(b(i)\) be the smallest average distance from \(i\) to observations in another cluster.

The silhouette width is

\[ s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}. \]

Interpretation:

  • \(s(i)\approx 1\): well assigned;
  • \(s(i)\approx 0\): lies between clusters;
  • \(s(i)<0\): may be assigned to the wrong cluster.

The average silhouette width is

\[ \bar{s}=\frac{1}{n}\sum_{i=1}^{n}s(i). \]

4.4.1.2.1 Manual
source("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/silhoutte_score_calc.R")

sil.manual = manual_silhouette(data,km2$cluster)
mean(sil.manual$silhouette)
## [1] 0.504938
sil.df <- as.data.frame(sil.manual[, 1:3])
colnames(sil.df) <- c("observation", "cluster", "silhouette_width")

ggplot(sil.df, aes(
  x = reorder(observation, silhouette_width),
  y = silhouette_width,
  fill = as.factor(cluster)
)) +
  geom_col(show.legend = TRUE) +
  coord_flip() +
  labs(
    title = "Silhouette Plot",
    x = "Observations",
    y = "Silhouette width",
    fill = "Cluster"
  ) +
  theme_minimal()

For silhouette, choose the K with the largest average silhouette width

4.4.1.2.2 Using the cluster results
sil.silhoutte <- silhouette(km2$cluster, dist(data))
mean(sil.silhoutte[, 3])
## [1] 0.504938
k.max <- 10
sil <- rep(0, k.max)
# Compute the average silhouette width for
# k = 2 to k = 10
for(i in 2:k.max){
km.res <- kmeans(data, centers = i, nstart = 50)
ss <- silhouette(km.res$cluster, dist(data))
sil[i] <- mean(ss[, 3])
}
# Plot the average silhouette width
plot(1:k.max, sil, type = "b", pch = 19,
frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)

4.4.1.2.3 Using R function
############ built in R code ##############
library(cluster)

fviz_nbclust(data, kmeans, method = "silhouette") +
  labs(title = "Average silhouette method")

avg.sil <- numeric(9)

for (k in 2:10) {
  km <- kmeans(data, centers = k, nstart = 50)
  sil <- silhouette(km$cluster, dist(data))
  avg.sil[k] <- mean(sil[, 3])
}

plot(
  1:10, avg.sil[1:10],
  type = "b",
  pch = 19,
  xlab = "Number of clusters (K)",
  ylab = "Average silhouette width",
  main = "Average Silhouette Method"
)

4.4.1.3 Gap statistics

The gap statistic compares the observed within-cluster variation with what would be expected under a reference distribution with no cluster structure.

For \(K\) clusters,

\[ \operatorname{Gap}(K)=E^*\{\log(W_K^*)\}-\log(W_K), \]

where \(W_K\) is the observed within-cluster variation and \(W_K^*\) is the within-cluster variation from simulated reference datasets.

A common decision rule is to choose the smallest \(K\) such that

\[ \operatorname{Gap}(K)\geq \operatorname{Gap}(K+1)-s_{K+1}. \]

Unlike WSS, we do not simply choose the largest gap value. We look for the smallest number of clusters after which improvement becomes small relative to uncertainty.

4.4.1.3.1 Using R function
set.seed(123)

gap <- clusGap(
  data,
  FUNcluster = kmeans,
  K.max = 10,
  B = 100,
  nstart = 25
)

plot(gap, main = "Gap Statistic for K-means")

Here, red bars show the simulation uncertainty (standard error), points are the estimated gap. Unlike WSS and Sil, we don’t choose the max. You choose the smallest K where the gain levels off.

What we see here is a big jump from K=1 to K=2, moderate increase up to K=3,4. After that, increases are gradual from K ≥ 6, improvements are very small relative to uncertainty. So we are looking for the point where, adding another cluster does not give a statistically meaningful improvement. In this one there is a big improvement from K = 2 but, too simple to stop here since K = 3 or 4, we still have meaningful gain. After 5 clusters the gains become marginal. We stop increasing K when the gap stops increasing significantly relative to its uncertainty.

Pick the first point where the curve flattens within the error bars.

4.5 External validation and cluster stability

External validation compares one clustering solution with another. In this lecture, the comparison is made through bootstrap resampling, so the goal is cluster stability rather than agreement with known class labels.

4.5.1 Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) compares two partitions of the same observations. Suppose \(U=\{U_1,\ldots,U_R\}\) and \(V=\{V_1,\ldots,V_C\}\) are two clusterings. Let \(n_{rc}\) be the number of observations in both \(U_r\) and \(V_c\).

The Rand-type agreement is based on pairs of observations. The adjusted form is:

\[ \operatorname{ARI}=\frac{ \sum_{r,c}\binom{n_{rc}}{2}- \frac{\sum_r\binom{a_r}{2}\sum_c\binom{b_c}{2}}{\binom{n}{2}} }{ \frac{1}{2}\left[\sum_r\binom{a_r}{2}+\sum_c\binom{b_c}{2}\right] - \frac{\sum_r\binom{a_r}{2}\sum_c\binom{b_c}{2}}{\binom{n}{2}} }, \]

where \(a_r=\sum_c n_{rc}\) and \(b_c=\sum_r n_{rc}\).

Interpretation:

  • \(\operatorname{ARI}=1\): perfect agreement;
  • \(\operatorname{ARI}\approx 0\): agreement close to random;
  • \(\operatorname{ARI}<0\): worse than random agreement.

4.5.1.1 Manual Bootstrapped Data

source("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/ari_calc.R")

B <- 100
ari.btstrpvalues <- numeric(B)
data = data

# original clustering solution
km2 <- kmeans(data, centers = 2, nstart = 50)
kmeans.full.clusters <- km2$cluster

for (b in 1:B) {
  
  # Bootstrap with replacement
  idx <- sample(1:nrow(data), size = 63, replace = TRUE)
  x.btstrp <- data[idx, ]
  
  # bootsrap clustering solution
  kmeans.btstrp <- kmeans(x.btstrp, centers = 2, nstart = 50)
  kmeans.btstrp.clusters <- kmeans.btstrp$cluster
  
  # Compare the two on the common observations
  ari.btstrpvalues[b] <- adjustedRandIndex(kmeans.full.clusters[idx], 
                                           kmeans.btstrp.clusters)
  
}

mean(ari.btstrpvalues)
## [1] 0.6867179
sd(ari.btstrpvalues)
## [1] 0.2632329
hist(ari.btstrpvalues, 
     main = "ARI values across subsamples", xlab = "ARI")

4.5.1.2 Using R function

adjustedRandIndex(km2$cluster[idx], kmeans.btstrp.clusters)
## [1] 0.5583426
#[1] 0.2969786
manual_ARI(km2$cluster[idx], kmeans.btstrp.clusters)
## [1] 0.5583426
#[1] 0.2969786

4.5.2 JACCARD

The Jaccard index compares whether pairs of observations are placed together in two clusterings.

Let:

  • \(M_{11}\) be the number of pairs placed together in both clusterings;
  • \(M_{10}\) be the number of pairs placed together in the first clustering but not the second;
  • \(M_{01}\) be the number of pairs placed together in the second clustering but not the first.

Then

\[ J=\frac{M_{11}}{M_{11}+M_{10}+M_{01}}. \]

A larger Jaccard value indicates greater cluster stability or agreement.

4.5.2.1 Manual

source("C:/Users/01438475/OneDrive - University of Cape Town/UCTcourses/Honours/Analytics/UnsupervisedLearning/Cluster/crimestats/jaccard_calc.R")
set.seed(123)
km2 <- kmeans(data, 2, nstart = 50)

manual_jaccard(km2$cluster, kmeans.btstrp.clusters)
## [1] 0.4682081
# Bootstrap stability

set.seed(123)
B <- 20
jaccard_vals <- numeric(B)

n <- nrow(data)

for (b in 1:B) {
  
  idx <- sample(1:n, replace = TRUE)
  
  km1 <- kmeans(data, 2, nstart = 10)
  km2 <- kmeans(data[idx, ], 2, nstart = 10)
  
  common <- intersect(1:n, idx)
  
  jaccard_vals[b] <- manual_jaccard(
    km1$cluster[common],
    km2$cluster[match(common, idx)]
  )
}

mean(jaccard_vals)
## [1] 0.8092595

4.5.2.2 Using R function

library(fpc)
set.seed(123)
out <- clusterboot(
  data,
  B = 100,
  clustermethod = kmeansCBI,
  krange = 2,
  seed = 123
)
## boot 1 
## boot 2 
## boot 3 
## boot 4 
## boot 5 
## boot 6 
## boot 7 
## boot 8 
## boot 9 
## boot 10 
## boot 11 
## boot 12 
## boot 13 
## boot 14 
## boot 15 
## boot 16 
## boot 17 
## boot 18 
## boot 19 
## boot 20 
## boot 21 
## boot 22 
## boot 23 
## boot 24 
## boot 25 
## boot 26 
## boot 27 
## boot 28 
## boot 29 
## boot 30 
## boot 31 
## boot 32 
## boot 33 
## boot 34 
## boot 35 
## boot 36 
## boot 37 
## boot 38 
## boot 39 
## boot 40 
## boot 41 
## boot 42 
## boot 43 
## boot 44 
## boot 45 
## boot 46 
## boot 47 
## boot 48 
## boot 49 
## boot 50 
## boot 51 
## boot 52 
## boot 53 
## boot 54 
## boot 55 
## boot 56 
## boot 57 
## boot 58 
## boot 59 
## boot 60 
## boot 61 
## boot 62 
## boot 63 
## boot 64 
## boot 65 
## boot 66 
## boot 67 
## boot 68 
## boot 69 
## boot 70 
## boot 71 
## boot 72 
## boot 73 
## boot 74 
## boot 75 
## boot 76 
## boot 77 
## boot 78 
## boot 79 
## boot 80 
## boot 81 
## boot 82 
## boot 83 
## boot 84 
## boot 85 
## boot 86 
## boot 87 
## boot 88 
## boot 89 
## boot 90 
## boot 91 
## boot 92 
## boot 93 
## boot 94 
## boot 95 
## boot 96 
## boot 97 
## boot 98 
## boot 99 
## boot 100
out$bootmean
## [1] 0.9355853 0.8327497

4.6 Using a consensus measure to choose the number of clusters

NbClust combines many different internal validation indices. Each index recommends a number of clusters, and the final choice is based on a voting rule.

If index \(m\) recommends \(K_m\), then the consensus choice is

\[ K^*=\arg\max_K \sum_{m=1}^{M} I(K_m=K), \]

where \(I(\cdot)\) is an indicator function.

Consensus methods are useful because no single index is universally best. However, the final choice should still be interpreted in light of the data context and substantive meaning of the clusters. Using the NbClust package which uses a vote to choose the number of clusters. The following example determine the number of clusters using all statistics:

res.nb <- NbClust(data, distance = "euclidean",min.nc = 2, max.nc
= 10, method = "complete", index ="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 5 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

When all statistics in the NbClust package are allowed to vote, the majority (in this case 7 out of 24) propose that the ‘optimal’ number of clusters should be 3.

4.7 Final lecture summary

The main conceptual comparison across the methods is:

Method Main idea Strength Limitation
K-means Minimise within-cluster variance Simple and interpretable Sensitive to scaling and outliers
DBSCAN Find dense regions Finds noise and irregular shapes Sensitive to eps and MinPts
PAM / k-medoids Use real observations as centres More robust than k-means More computationally expensive
CLARA Sample-based k-medoids Scales PAM to larger datasets Depends on samples
Hierarchical Build nested cluster structure Dendrogram is interpretable No single automatic best cut

A good cluster analysis should combine:

  1. careful variable construction;
  2. standardisation;
  3. more than one clustering algorithm;
  4. internal validation;
  5. stability assessment;
  6. substantive interpretation of cluster profiles.

4.8 Interpreting Crime Clusters in the City of Cape Town

4.8.1 What do the clusters represent?

Clusters group police stations with similar crime profiles:

\[ \text{Cluster} = \text{areas with similar multivariate crime rates} \]

Each cluster is defined by its centroid (average crime profile):

\[ \bar{x}_{k,j} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij} \]

4.9 Cluster 1: High-Intensity Violent Crime Areas

Characteristics: - High: - Murder - Attempted murder - Assault (GBH) - Moderate: - Robbery - Lower: - Property-related crimes

Interpretation: - Likely high social vulnerability areas - Possible links to: - Dense population - Socioeconomic stress - Informal settlements

Policy implication: - Targeted violence prevention - Increased visible policing - Social interventions (youth, employment)

4.10 Cluster 2: Mixed / Transitional Crime Areas

Characteristics: - Moderate across most crime types - No dominant crime category

Interpretation: - “Middle ground” areas - Possibly: - Transitional neighbourhoods - Mixed land use

Policy implication: - Balanced policing strategy - Early intervention to prevent escalation

4.11 Cluster 3 (if K = 3): Property & Opportunistic Crime Areas

Characteristics: - Higher: - Robbery - Common assault - Lower: - Murder / serious violent crime

Interpretation: - Likely: - Commercial zones - Transport hubs - Tourist areas

Policy implication: - Focus on: - Surveillance - Urban design (lighting, visibility) - Theft prevention strategies

4.12 Spatial Insight

  • Clusters are not random
  • Spatial grouping suggests underlying structure

\[ \text{Crime} = f(\text{space}, \text{socioeconomics}, \text{mobility}) \]

4.13 Important Teaching Point

Clusters are descriptive, not causal

  • They summarise patterns
  • They do not explain why patterns occur

4.14 Key Takeaway

  • Clustering helps us:
    • Reduce complexity
    • Identify types of crime environments
    • Inform targeted policy

4.15 In-class Discussion

class.means
##   assault_per100k attempted_murder_per100k common_assault_per100k
## 1        68.61626                 19.16964               196.9548
## 2       312.53703                 72.22646               676.7743
##   common_robbery_per100k murder_per100k robbery_per100k sexual_offences_per100k
## 1               64.82612        15.8700        95.21795                30.45958
## 2              232.25331       108.1782       441.52001               121.54682

Question for students:

Which cluster would you prioritise for policing resources — and why?