K-means cluster analysis

We run a k-means clustering algorithm on our database of firms to identity groups of similar firms across different dimensions. Using this analysis we can spot patterns that identify similar firms that may not be possible to spot through exploratory data analysis.

K-means clustering is the most commonly used clustering technique. Given a number of clusters \( k \) in which to split the data it will identify a set of centroids and the data points closest to each centroid that minimizes variance within clusters. The data asigned to each centroid constitutes a cluster. We will repeat this analysis for many values of \( k \) and visually determine the optimal value based on the decrease in variance that comes with each additional cluster. Naturally, the variance will always decrease with larger values of \( k \), but we expect to see natural value after which the decrease is negligible.

Another common method for identifying groups of similar observations is hierarchical clustering, which is in some aspects superior to \( k \)-means clustering. However, this technique requires computing a full \( n\times n \) matrix of pair-wise distances. With our 200 thousand observations this would be unfeasible.

After assigning each firm to a cluster we will then try to understand what each cluster represents by comparing summary statistics. This will allow us to characterize the clusters as typical firm types.

require(data.table, quietly = TRUE)
setwd("~/Dubrovnik/")
load("APR-processed-v4.rda")  ## loads d
names(d)  ## print column names as a reference
##  [1] "year"             "EURRSD"           "deflator"        
##  [4] "id"               "capital"          "costs"           
##  [7] "empl"             "fixedAssets"      "foreignOwners"   
## [10] "interest"         "liabLT"           "liabST"          
## [13] "liabSTfin"        "loansLT"          "revenues"        
## [16] "totalAssets"      "wages"            "state.connection"
## [19] "state.owner.type" "state.share"      "state.legal.form"
## [22] "priv.method"      "reg.code"         "region"          
## [25] "district.code"    "district"         "town.code"       
## [28] "town"             "muni.code"        "muni"            
## [31] "estab.date"       "removal.date"     "sector.mod"      
## [34] "ind.code"         "trade.sector"     "trade.name"      
## [37] "legal.form"       "status"           "sitc"            
## [40] "compet.cms"       "market.growing"   "entrepreneur"    
## [43] "restructuring"    "export.rank"      "export"          
## [46] "is.sme"
setkey(d, "id")

Before we start the clustering analysis, we remove non-SME companies. Companies that have never reported have already been removed.

table(d$is.sme)  ## how many observations (firm-year) are there?
## 
##  FALSE   TRUE 
##   5640 854777
d <- d[is.sme == TRUE]  ## only keep SMEs

As a first step we need to choose which variables (features) of firms we will use as dimensions along which to perform the clustering. The variables are:

At the same time, we need to decide whether the clustering of firms will be done at the firm-year level or whether the cluster is firm-specific and it does not change with time. For the purpose of this analysis we will consider each firm in a given year as a distinct observation. This will allow us to see whether this classification is permanent throughout the life of a firm in the observation period and, if not how firms change from one year to the other.

# we add 1 to denominators to avoid Inf and -Inf
d.clust <- d[, list(id, year, lrev = log(revenues + 1), empl, lassets = log(totalAssets + 
    1), pmargin = (revenues - costs)/(revenues + 1), lfape = log(fixedAssets + 
    1) - log(empl + 1))]
rm(d)  ## removing large dataset to free memory

Before clustering we need to normalize all dimensions by subtracting their mean and scaling their standard deviation to 1.

# get names of all columns except ID
sc.cols <- setdiff(names(d.clust), c("id", "year"))
d.clust <- d.clust[, `:=`((sc.cols), lapply(.SD, scale)), .SDcols = sc.cols]

We can now proceed with the clustering analysis where we try a varying number of clusters, starting with 3 until 15 and visually inspect the reduction in within-cluster sum of squares. The algorithm does not guarantee a global minimum because it does not consider all combinations, so we set it to perform each clustering 15 times and report only the best.

sc.cols <- setdiff(names(d.clust), c("id", "year"))
set.seed(62014)  ## for replicability
k.set <- 3:10
# create vector of within-cluster sum of squares
within.ss <- numeric(length(k.set))
names(within.ss) <- k.set
# create data.table with clusters
clusters <- data.table(id = d.clust$id, year = d.clust$year, matrix(nrow = nrow(d.clust), 
    ncol = length(k.set)))
setnames(clusters, c("id", "year", paste("k", k.set, sep = ".")))

# loop over k.set values
for (k in k.set) {
    kmc <- kmeans(d.clust[, sc.cols, with = FALSE], centers = k, nstart = 25, 
        iter.max = 15)
    c.name = paste("k", k, sep = ".")  ## make name for indexing clusters
    clusters[, `:=`(c.name, kmc$cluster), with = FALSE]
    within.ss[as.character(k)] <- kmc$tot.withinss
    print(k)
}
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 42738850)

require(ggplot2, quietly = TRUE)
ggplot(data.frame(k.set, within.ss), aes(x = k.set, y = within.ss/1e+06)) + 
    geom_line() + xlab("Number of clusters") + ylab("Within-cluster sum of squares (M)") + 
    ggtitle("Within-cluster SS from K-means clustering\n          without location data") + 
    scale_x_continuous(breaks = 1:max(k.set)) + theme_bw() + theme(panel.grid.minor.x = element_blank())

plot of chunk unnamed-chunk-5


# save cluster assignments
save(clusters, within.ss, file = "APR-km-clusters-v4.rda")