Information

This R Markdown document was created as part of a group assignment for SMM636 at Bayes Business School, City St George's, University of London in Term 2 2025-26.

Introduction

This report analyses annual spending data from 440 clients of a Portuguese wholesale distributor [1], [2], comprising 298 Hotel/Restaurant/Café (HoReCa) and 142 Retail customers across six product categories: Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen. Data-driven customer segmentation informs inventory planning, pricing strategy, and sales team deployment, yet the existing classification is a simple binary label (Channel). This analysis uses Principal Component Analysis (PCA) and clustering to investigate whether the existing two-category classification adequately captures the structure of customer spending, or whether the data supports a more nuanced segmentation.

DT::datatable(
  data,
  filter   = "top",
  options  = list(pageLength = 15, scrollX = TRUE),
  caption  = "Wholesale customers"
)

1 Principle component analysis (PCA)

A log-transformation was applied to normalise the distribution and address the right-skewness observed across all six spending variables (skewness > 2.5 for all - see Appendix, Figures A.2 & A.3). The variables were then standardised to ensure equal contribution to the PCA, preventing categories with naturally larger absolute expenditures from dominating the model.

The first two principal components capture 71.3% of the total variance (PC1: 44.1%; PC2: 27.2%; Figure 1.1(a)). The component loadings in Figure 1.1(b) reveal distinct commercial purchasing patterns. PC1 acts as the primary “channel axis,” driven by strong positive loadings for Grocery (0.571), Detergents_Paper (0.551), and Milk (0.542), contrasted with near-zero or negative loadings for Fresh and Frozen. PC2 serves as a “perishable volume axis,” characterised by high positive loadings for Fresh (0.590), Frozen (0.590), and Delicassen (0.530), capturing the scale of fresh food purchasing independent of business type.

Figure 1.1(d) confirms this visually, showing clear separation between HoReCa and Retail customers along PC1. PCA is restricted to continuous inputs, necessitating the exclusion of Channel and Region. A Gower-distance MDS analysis incorporating these variables (Appendix, A.2) confirms that the categorical labels dominate the distance metric, validating the decision to analyse spending variables independently.

44.1% 27.2% 10.7% 10.1% 4.9% 3% 0 20 40 60 1 2 3 4 5 6 Dimensions Percentage of explained variances (a) Scree Plot - Variance Explained by Principle Component (PC) PC1 -0.105 Fresh PC2 0.542 Milk PC3 0.571 Grocery PC4 -0.138 Frozen 0.551 Detergents_Paper 0.214 Delicassen 0.590 0.133 -0.006 0.590 -0.071 0.530 -0.640 -0.074 -0.133 -0.021 -0.200 0.726 0.479 -0.062 -0.097 -0.792 -0.083 0.352 (b) PCA Loadings Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Fresh Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Milk Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Grocery Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Frozen Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Detergents_Paper Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen Delicassen -6 -3 0 3 -6 -3 0 3 6 Dim1 (44.1%) Dim2 (27.2%) (c) PCA Biplot (log-transformed, standardised data) -6 -3 0 3 -5.0 -2.5 0.0 2.5 Dim1 (44.1%) Dim2 (27.2%) Channel 1 2 (d) PCA - Individuals Coloured by Channel (1=HoReCa, 2=Retail)

Figure 1.1: Principal Component Analysis of log-transformed, standardised spending data. (a) Scree plot showing the percentage of variance explained by each component. (b) Loadings of the original variables on the first four principal components. Values exceeding 0.4 in magnitude indicate strong contributions. (c) Biplot overlaying variable loadings (red arrows) on individual observations. (d) Individuals projected onto PC1-PC2, coloured by distribution channel (1 = HoReCa, 2 = Retail) with 95% concentration ellipses.

2 Customer clustering

76% 24% 40% 60% 34% 66% 83% 17% 19% 81% 63% 37% 55% 45% 0 25 50 75 100 Share of total category spend (%) Segment Retail (n=142, 32%) HoReCa (n=298, 68%) (a) Baseline: Channel 48% 47% 4% 17% 61% 23% 15% 56% 29% 51% 45% 4% 7% 58% 35% 28% 62% 9% 31% 53% 16% 0 25 50 75 100 Share of total category spend (%) Segment Retail Dominant (n=147, 33%) Mixed Basket (n=81, 18%) HoReCa Dominant (n=212, 48%) (b) K-Means (k = 3) 3% 22% 52% 13% 10% 18% 42% 16% 12% 13% 24% 43% 14% 9% 9% 2% 15% 59% 10% 14% 32% 53% 6% 5% 5% 9% 31% 32% 12% 15% 13% 33% 32% 11% 10% 0 25 50 75 100 Fresh Milk Grocery Frozen Detergents_Paper Delicassen Total Share of total category spend (%) Segment Heavy Retail (n=51, 12%) Diversified Retail (n=101, 23%) Low-spend / Mixed (n=52, 12%) Balanced HoReCa (n=81, 18%) High-volume HoReCa (n=155, 35%) (c) Gaussian Mixture Modelling (k = 5)

Figure 2.1: Revenue concentration by product category under three segmentation approaches. Top: the distributor’s existing binary Channel classification. Middle: K-Means (k=3) spending-based clusters. Bottom: Gaussian Mixture Modelling (GMM) (k=5) spending-based clusters. Crucially, the K-Means and GMM segmentations were derived solely from log-transformed, standardised spending data without access to Channel labels. Each bar shows the percentage of total category spend attributable to each segment and the Total column shows each segment’s share of overall revenue.

K-Means clustering was applied to the same log-transformed, standardised spending data used for PCA. Silhouette analysis identifies k=2 as optimal (0.290), which recovers the known Channel distinction and validates the clustering approach. However, the silhouette score for k=3 remains strong (0.259), and the marginal reduction in cluster quality yields a substantially richer segmentation. Beyond k=3, scores plateau around 0.19 with no meaningful recovery (Appendix, Figure A.5), suggesting additional clusters fragment genuine structure rather than reveal new segments.

K-Means was chosen for its interpretability and computational efficiency on this dataset. As robustness checks, both Ward’s hierarchical clustering and Gaussian Mixture Modelling (GMM) were applied to the same input. Ward’s method (k=3) agrees with K-Means on the dominant cluster structure, with disagreement confined to boundary observations between the smaller segments (Appendix A.4). GMM relaxes K-Means’ spherical cluster assumption: the BIC selected a 5-component VVE model (Appendix, Figure A.6), splitting both channels into finer sub-segments. Cross-tabulation against Channel and Region (Appendix, Tables A.1 & A.3) confirms that these sub-segments are spending-driven rather than geographic, with Region distributed roughly proportionally within each cluster.

Figure 2.1 presents the commercial impact of each segmentation. Under the existing binary Channel label, HoReCa customers account for 76% of Fresh revenue but only 68% of customers. K-Means (k=3) identifies a “Mixed Basket” segment (81 customers, 18%) with spending patterns that cut across both channels. GMM further distinguishes Heavy Retail from Diversified Retail, and separates High-volume HoReCa from Balanced HoReCa. Figure 2.2 traces how spending flows between these segmentations, revealing that GMM-3 captures high Fresh and Frozen spenders from multiple K-Means clusters.

Delicassen Detergents_Paper Fresh Frozen Grocery Milk GMM-2 Diversified Retail GMM-3 High-volume HoReCa GMM-4 Balanced HoReCa GMM-5 Low-spend / Mixed GMM-1 Heavy Retail KM-1 HoReCa Dominant KM-2 Retail Dominant KM-3 Mixed Basket 4.6% 7.3% 8.0% 10.4% 3.3% 5.4% 18.7% 4.8% 3.5% 3.1% 5.7% 4.1% 7.3% 19.0% 4.3% 13.1% 31.7% 12.0% 0% 25% 50% 75% 100% Category GMM Cluster KMeans Cluster Percentage of total spending

Figure 2.2: Sankey diagram of spending across clusters. Each stacked bar shows the percentages of total spendings broken down into product categories (left), Gaussian Mixed Mixture clusters (middle), and KMeans (k=3) clusters (right). The flows on the left describes how each GMM cluster is comprised of spendings in different product categories. The flows on the right represent how the spendings are segmented differently by KMeans. Only flows with percentage value greater than 3% are labelled to minimise clutter.

The strength of this approach is that it operates entirely without access to Channel or Region labels. The near-perfect recovery of the Channel distinction validates the segmentation as behaviourally grounded, while the emergence of additional clusters reveals sub-structure the existing classification does not capture. However, the commercial value of a 5-segment model over a 3-segment model cannot be determined from a single cross-sectional snapshot. Validation would require longitudinal purchasing data, margin information, and ideally a controlled trial of differentiated service strategies across segments.

3 Repeating 1 & 2 using LLMs

Two Large Language Models (LLMs), Claude (Sonnet 4.6) and ChatGPT (GPT-5.3), were evaluated. To ensure comparability, identical prompts were used for both models and neither was given the raw csv file. The dataset was described in the first prompt and the LLMs were asked to produce R code for independent execution. This design choice was deliberate: withholding the data replicated a commercial setting in which an organisation may not wish to upload proprietary customer data to a third-party platform. It also ensured that neither model could adapt its recommendations to the observed distributions, testing whether each model’s default methodological choices were appropriate for this dataset. Four prompts were used, included in full in Appendix B.

As shown in Table 3.1, both LLMs correctly excluded Channel and Region, recommended log-transformation and standardisation, and converged on k=3, all without access to the data. Neither model interpreted the PCA loadings with specificity. Claude labelled PC1 as a “general spending scale” rather than the channel axis revealed by the actual loadings, while ChatGPT produced plots without interpreting them. Robustness checks and limitations were not mentioned until explicitly prompted. When asked in the final prompt, both models acknowledged that the data is from a well-known public dataset whose typical outcomes appear extensively in their training data, raising the question of whether the LLMs were performing genuine analytical reasoning or pattern-matching against memorised results.

ChatGPT produced more concise, modular code that executed without errors, while Claude generated comprehensive scripts with richer methodological commentary but several execution errors requiring manual debugging. LLMs demonstrate clear value for code generation and methodological structuring, but their inability to verify outputs against actual data distributions (and their tendency to present memorised patterns as derived findings) limits their current utility and analytical value.

Table 3.1: Table 3.2: Comparison of analytical steps for our approach vs  two LLMs. \(\checkmark\)~= addressed unprompted; \(\circ\)~= partial; ~= missed entirely.
Analysis Step Our Analysis ChatGPT (GPT-5.3) Claude (Sonnet 4.6)
Preprocessing
Excluded Channel & Region from PCA/clustering \(\checkmark\) \(\checkmark\) \(\checkmark\)
Identified skewness / recommended log transform \(\checkmark\) \(\checkmark\) \(\checkmark\)
Standardised variables before PCA \(\checkmark\) \(\checkmark\) \(\checkmark\)
PCA
Produced interpretable scree plot / variance \(\checkmark\) \(\checkmark\) \(\checkmark\)
Interpreted PC loadings with specificity \(\checkmark\) \(\circ\)
Validated PCA against Channel label \(\checkmark\) \(\checkmark\) \(\checkmark\)
Clustering
Justified choice of clustering method \(\checkmark\) \(\circ\) \(\checkmark\)
Used formal \(k\) selection (silhouette / elbow) \(\checkmark\) \(\checkmark\) \(\circ\)
Identified structure beyond \(k{=}2\) \(\checkmark\) \(\checkmark\) \(\checkmark\)
Applied robustness checks unprompted \(\checkmark\)
Interpretation
Commercial framing of segments \(\checkmark\) \(\circ\) \(\checkmark\)
Acknowledged limitations unprompted \(\checkmark\)
Generated code executed without errors \(\checkmark\) \(\checkmark\)
13/13 8/13 (2 partial) 8/13 (2 partial)

Appendices

A Supplementary methodology

A.1 Raw data distribution, correlations, and skewness

box_data <- reshape(
  data[, spending.cols],
  direction = "long",
  varying = spending.cols,
  v.names = "Spend",
  timevar = "Category",
  times = spending.cols
)

cat_totals <- data.frame(
  Category = spending.cols,
  Total    = colSums(data[, spending.cols]),
  Median   = apply(data[, spending.cols], 2, median)
)

# option to sort by median vs total; 
# BE note: have chosen median here to highlight the difference in median spend
# despite large total gap between bottom two categories
cat_order <- levels(reorder(cat_totals$Category, cat_totals$Median))
# cat_order <- levels(reorder(cat_totals$Category, cat_totals$Total))
box_data$Category <- factor(box_data$Category, levels = cat_order)
cat_totals$Category <- factor(cat_totals$Category, levels = cat_order)
row.names(box_data) <- NULL


p_box <- ggplot(
  box_data,
  aes(x = Category, y = Spend)
) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  coord_flip() +
  scale_y_continuous(
    labels = scales::comma,
    limits = c(0, NA),
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = "(a) Spending Distribution per Customer",
    y = "Annual Spend (m.u.)",
    x = ""
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )

p_bar <- ggplot(cat_totals, aes(x = Category, y = Total)) +
  geom_col(fill = "steelblue", alpha = 0.7, width = 0.7) +
  geom_text(
    aes(label = format(round(Total), big.mark = ",")),
    hjust = -0.1,
    size = 3.5
  ) +
  coord_flip() +
  scale_y_continuous(
    labels = scales::comma,
    expand = expansion(mult = c(0, 0.15))
  ) +
  labs(
    title = "(b) Total Category Spend",
    y = "Total Spend (m.u.)",
    x = ""
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )

p_box / p_bar
Detergents_Paper Delicassen Frozen Milk Grocery Fresh 0 30,000 60,000 90,000 Annual Spend (m.u.) (a) Spending Distribution per Customer 5,280,131 2,550,357 3,498,562 1,351,650 1,267,857 670,943 Detergents_Paper Delicassen Frozen Milk Grocery Fresh 0 2,000,000 4,000,000 6,000,000 Total Spend (m.u.) (b) Total Category Spend

Figure A.1: Customer spending profiles across six wholesale categories. Top: boxplots showing the highly skewed distribution of individual annual spend. Bottom: the total aggregate volume, highlighting Fresh and Grocery as the largest product segments. Both plots are sorted by the median spend.

Figure A.1 provides a baseline overview of purchasing behaviour across the six categories. Median spend is relatively modest in all categories, but every distribution exhibits a long right tail of high-value customers. At the aggregate level, Fresh and Grocery dominate total revenue, while Delicassen remains a niche segment.

Frozen Grocery Milk Delicassen Detergents_Paper Fresh 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0 2 4 6 0 3 6 9 0 5 10 15 0 5 10 0 5 10 15 0 5 10 Normalised Spend (0-1 scaled within each group) Density Transform Raw Log-transformed Shape Comparison: Raw vs Log-transformed Spending

Figure A.2: Shape comparison of raw and log-transformed spending distributions by product category. To facilitate direct comparison of the distribution shapes, all spend values were min-max scaled to a 0–1 range within their respective categories. The raw data (blue) demonstrates a right skew across all categories, influenced by extreme outliers. The log1p transformation (red) effectively compresses this long tail, yielding more symmetric distributions that ensure variance-based methods (PCA) and distance-based methods (clustering) are not dominated by extreme outliers.

This skewness is examined more closely in Figure A.2, which overlays raw and log-transformed distributions for each category. The raw spending figures exhibit a strong rightward skew across all six product categories, indicating that while most customers spend relatively small amounts, a long tail of high-spending customers dominates the scale. Applying a log transformation (log1p) successfully ‘pulls in’ these extreme values, centring the bulk of the data and producing much more symmetric distributions.

par(mfrow = c(1, 2))

soft_palette <- colorRampPalette(c("#D73027", "#F46D43", "#FFFFFF", "#74ADD1", "#4575B4"))(200)

corrplot(
  cor(spending),
  method      = "color",
  type        = "upper",
  col         = soft_palette,     
  addCoef.col = "black",          
  number.cex  = 0.8,
  tl.col      = "black",
  title       = "Raw spending",
  mar         = c(0, 0, 2, 0)
)

corrplot(
  cor(spending_log),
  method      = "color",
  type        = "upper",
  col         = soft_palette, 
  addCoef.col = "black",        
  number.cex  = 0.8,
  tl.col      = "black",
  title       = "Log-transformed spending",
  mar         = c(0, 0, 2, 0)
)
1.00 0.10 1.00 -0.01 0.73 1.00 0.35 0.12 -0.04 1.00 -0.10 0.66 0.92 -0.13 1.00 0.24 0.41 0.21 0.39 0.07 1.00 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Fresh Milk Grocery Frozen Detergents_Paper Delicassen Fresh Milk Grocery Frozen Detergents_Paper Delicassen Raw spending 1.00 -0.02 1.00 -0.13 0.76 1.00 0.39 -0.06 -0.16 1.00 -0.16 0.68 0.80 -0.21 1.00 0.26 0.34 0.24 0.26 0.17 1.00 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Fresh Milk Grocery Frozen Detergents_Paper Delicassen Fresh Milk Grocery Frozen Detergents_Paper Delicassen Log-transformed spending

Figure A.3: Pairwise correlation matrices of the six product categories, comparing linear relationships using raw spending data (left) versus log-transformed spending data (right).

The side-by-side correlation plots reveal the underlying purchasing relationships between product categories before and after adjusting for scale. Both matrices highlight a distinct cluster of highly correlated spending: Grocery, Detergents_Paper, and Milk show strong positive linear relationships, suggesting they are frequently purchased together in bulk. While the log transformation preserves this core structure, it slightly softens the most extreme correlations (for instance, the correlation between Grocery and Detergents_Paper drops from 0.92 to 0.80). This indicates that in the raw data, a few extreme outlier purchases were artificially inflating the strength of that linear relationship.

A.2 Multi-dimensional scaling

-0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 -0.2 -0.1 0.0 0.1 0.2 (a) Coloured by Channel MDS1 MDS2 HoReCa Retail -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 -0.2 -0.1 0.0 0.1 0.2 (b) Coloured by Region MDS1 MDS2 Lisbon Oporto Other

Figure A.4: Non-metric MDS using Gower’s distance across all eight variables (six log-transformed continuous spending categories plus Channel and Region). Points are coloured by Channel (left) and Region (right). Unlike PCA, Gower’s distance forces distinct spatial separation based on categorical mismatches, resulting in six artificial clusters corresponding to the cross-tabulation of the two Channels and three Regions.

Figure A.4 presents a non-metric MDS solution computed from Gower’s distance across all eight variables, incorporating the categorical Channel and Region columns excluded from the PCA. Because Gower’s metric treats any categorical mismatch as a maximal distance contribution on that variable, the resulting spatial map is dominated by these labels rather than underlying spending behaviour.

Panel (a) shows separation by Channel along the primary axis (MDS1), while Panel (b) reveals further stratification by Region along MDS2. The result is a forced stratification into six spatial clusters reflecting the Channel \(\times\) Region cross-tabulation. While there is minor regional intermingling (e.g., two Oporto customers mapping near the Lisbon cluster), the overall spatial map merely reflects the cross-tabulation of the categorical labels rather than organic, dataset-wide purchasing patterns. The solution’s stress value of 14.5 further indicates a mediocre fit, falling above the conventional threshold of 10 for reliable low-dimensional representation. This confirms that the PCA approach, relying purely on continuous spending variables, is better suited to uncovering natural commercial structure in this dataset.

A.3 Clustering diagnostics

A.3.1 Optimal cluster selection

0.0 0.1 0.2 0.3 1 2 3 4 5 6 7 8 9 10 Number of clusters k Average silhouette width Optimal number of clusters (method = "silhouette")

Figure A.5: Silhouette analysis for K-Means clustering (k=2 to 10). The optimal number of clusters is k=2 followed by k=3 as a strong second. Beyond k=3, scores plateau around 0.19.

-7600 -7200 -6800 -6400 Number of components BIC 1 2 3 4 5 6 7 8 9 EII VII EEI VEI EVI VVI EEE VEE EVE VVE EEV VEV EVV VVV

Figure A.6: BIC values across component counts for all model types evaluated by mclust. Higher (less negative) BIC indicates better fit. The BIC peaks at 5 components under the VVE (ellipsoidal, variable volume and shape, equal orientation) model.

Silhouette analysis (Figure A.5) identifies k=2 as optimal (0.290), with k=3 as a close second (0.259). Beyond k=3, scores plateau around 0.19.

The GMM’s BIC (Figure A.6) peaks at 5 components (-6,126.0), compared to -6,212.0 for four and -6,186.6 for six. Interestingly k=4 is worse than k=3, suggesting the data supports either 3 or 5 components but not an intermediate value. Both methods agree that k=3 is a meaningful threshold and the disagreement on whether to proceed to k=5 reflects a genuine trade-off between parsimony and granularity.

A.3.2 Channel and Region cross-tabulations

Table A.1: Table A.2: K-Means (k=3) cluster assignments vs  Channel (1=HoReCa, 2=Retail) and Region.
Channel Lisbon Oporto Other
KM-1 (HoReCa Dominant)
1 39 25 145
2 1 1 1
KM-2 (Retail Dominant)
1 13 2 41
2 10 9 72
KM-3 (Mixed Basket)
1 7 1 25
2 7 9 32
Table A.3: Table A.4: GMM (k=5) cluster assignments vs  Channel (1=HoReCa, 2=Retail) and Region.
Channel Lisbon Oporto Other
GMM-1 (Heavy Retail)
1 1 0 6
2 7 8 29
GMM-2 (Diversified Retail)
1 5 0 13
2 9 10 64
GMM-3 (High-volume HoReCa)
1 34 15 100
2 1 1 4
GMM-4 (Balanced HoReCa)
1 12 10 55
2 1 0 3
GMM-5 (Low-spend / Mixed)
1 7 3 37
2 0 0 5

Table A.1 shows that K-Means (k=3) produces one cluster that is almost exclusively HoReCa (KM-1, “HoReCa Dominant”: 209 of 212 customers), one that leans Retail (KM-2, “Retail Dominant”: 91 of 147), and a third (KM-3, “Mixed Basket”) that draws substantially from both channels (33 HoReCa, 48 Retail). The “Retail Dominant” label reflects the spending profile rather than exclusive Channel membership: these customers concentrate on Grocery and Detergents_Paper regardless of their administrative Channel label.

A.3.3 Cluster spending profiles

Table A.5: Table A.6: Cluster spending profiles (median annual spending, m.u.)
Cluster Label \(n\) Fresh Milk Grocery Frozen Detergents_Paper Delicassen
K-Means (k = 3)
KM-1 HoReCa Dominant 212 9,635 1,598 2,151 2,151 274 686
KM-2 Retail Dominant 147 12,126 7,184 9,965 2,005 3,378 2,005
KM-3 Mixed Basket 81 1,454 6,264 10,487 388 4,196 436
Gaussian Mixture Model (k = 5)
GMM-1 Heavy Retail 51 1,454 7,845 13,567 283 6,830 716
GMM-2 Diversified Retail 101 7,842 7,243 11,238 1,389 4,621 1,625
GMM-3 High-volume HoReCa 155 14,276 1,786 2,405 3,383 325 933
GMM-4 Balanced HoReCa 81 5,969 3,218 3,133 907 637 656
GMM-5 Low-spend / Mixed 52 2,270 1,572 3,712 892 174 284
Table A.7: Table A.8: Cluster spending profiles (% of median basket)
Cluster Label Fresh Milk Grocery Frozen Detergents_Paper Delicassen
K-Means (k = 3)
KM-1 HoReCa Dominant 58.4 9.7 13.0 13.0 1.7 4.2
KM-2 Retail Dominant 33.1 19.6 27.2 5.5 9.2 5.5
KM-3 Mixed Basket 6.3 27.0 45.2 1.7 18.1 1.9
Gaussian Mixture Model (k = 5)
GMM-1 Heavy Retail 4.7 25.6 44.2 0.9 22.3 2.3
GMM-2 Diversified Retail 23.1 21.3 33.1 4.1 13.6 4.8
GMM-3 High-volume HoReCa 61.8 7.7 10.4 14.6 1.4 4.0
GMM-4 Balanced HoReCa 41.1 22.2 21.6 6.2 4.4 4.5
GMM-5 Low-spend / Mixed 25.5 17.7 41.7 10.0 2.0 3.2
Table A.9: Table A.10: Revenue concentration: each cluster’s share (%) of total category spend across all 440 customers.
Cluster Label Fresh Milk Grocery Frozen Detergents_Paper Delicassen
K-Means (k = 3)
KM-1 HoReCa Dominant 48.2 16.6 15.1 51.4 7.1 28.2
KM-2 Retail Dominant 47.4 60.9 56.0 44.9 57.8 62.4
KM-3 Mixed Basket 4.4 22.5 28.8 3.6 35.1 9.4
Gaussian Mixture Model (k = 5)
GMM-1 Heavy Retail 3.1 18.0 24.0 2.1 31.5 9.0
GMM-2 Diversified Retail 22.2 42.1 43.4 15.0 52.9 31.5
GMM-3 High-volume HoReCa 51.9 15.7 13.9 58.6 5.7 31.7
GMM-4 Balanced HoReCa 13.2 11.5 9.3 10.5 5.1 12.4
GMM-5 Low-spend / Mixed 9.6 12.8 9.4 13.8 4.7 15.4

Table A.5, A.7, and A.9 present the cluster spending profiles from three perspectives.

  1. Table A.7 shows the percentage composition of each cluster’s median basket examining how each segment allocates its spending across categories.
  2. Table A.5 reports the raw median annual spending (monetary units [m.u.]) per category, revealing the absolute scale differences between segments.
  3. Table A.9 reports revenue concentration and corresponds directly to the stacked bars in Figure 2.1. Note that because Table A.5 reports the median annual spending, it is not directly linked to Table A.9.

The K-Means “HoReCa Dominant” cluster (KM-1) allocates 58.4% of its median basket (Table A.7) to Fresh and accounts for 48.2% of total Fresh revenue (Table A.9), while “Retail Dominant” (KM-2) concentrates on Grocery (27.2%) and Detergents_Paper (9.2%) but comprises only 147 customers (33% of the base). At the GMM level, the distinction sharpens further: the median GMM-3 customer (“High-volume HoReCa”) dedicates 61.8% of its basket to Fresh and, despite representing 35% of customers, captures 51.9% of total Fresh revenue. The Retail split is equally informative: the typical GMM-1 customer (“Heavy Retail”) concentrates on Grocery (44.2%) and Detergents_Paper (22.3%), while the typical GMM-2 customer (“Diversified Retail”) spreads spend more evenly across categories including Fresh (23.1%), suggesting a convenience-store or diversified retailer profile.

A.3.4 Cluster assignments in PCA space

-5.0 -2.5 0.0 2.5 -5.0 -2.5 0.0 2.5 PC1 (44.1%) PC2 (27.2%) Cluster 1 2 3 (a) K-Means (k = 3) -5.0 -2.5 0.0 2.5 -5.0 -2.5 0.0 2.5 PC1 (44.1%) PC2 (27.2%) Cluster 1 2 3 4 5 (b) GMM (k = 5)

Figure A.7: Cluster assignments from K-Means (k=3, top) and GMM (k=5, bottom) projected onto the first two principal components, which together capture 71.3 percent of total variance. All clustering was performed in the full six-dimensional standardised log-spending space. Apparent overlap between clusters may reflect the loss of separating information in the remaining 28.7 percent of variance captured by PC3–PC6, rather than genuine ambiguity in cluster assignment.

Figure A.7 projects the K-Means and GMM cluster assignments onto PCA space. Both panels show that the dominant cluster boundaries divide the data along PC1 (the “channel axis” identified in 1), confirming that the clustering and PCA analyses converge on the same dominant structure. The GMM panel shows finer subdivision, particularly along PC2, consistent with GMM’s ability to model ellipsoidal clusters of varying density.

A.4 Ward’s hierarchical clustering

Table A.11: Table A.12: K-Means vs Ward’s hierarchical clustering assignments (k=3).
Ward 1 Ward 2 Ward 3
K-Means 1 0 201 11
K-Means 2 91 50 6
K-Means 3 34 11 36

Table A.11 cross-tabulates K-Means (k=3) and Ward’s hierarchical clustering (k=3) assignments. The largest cluster shows near-perfect agreement (201 of 212 observations). Disagreement is concentrated among boundary customers between the two smaller segments, reflecting the different optimisation strategies. K-Means minimises within-cluster sum of squares globally, while Ward’s method does so greedily through successive agglomeration.

B Prompts and LLM outputs

A consistent visual format is used to distinguish between prompts sent to the LLM, its responses, and our commentary.

Prompts to the LLM - shown with a right-hand border. These are the instructions and questions we provided to the model.

LLM Response - shown with a left-hand border, forming a conversational flow with the prompt. Responses may include a mix of explanation and code:

summary(mtcars$mpg)

Our group insight and critical evaluation - observations or key findings drawn from the output (i.e., what we found).

Caution: points requiring verification - potential concerns, limitations, or caveats with the generated output (i.e., what we question).

The R code generated by each LLM was executed programmatically within this document using a sandboxed runtime environment. Each script was run in an isolated R subprocess via the callr package, ensuring that no libraries, variables, or side effects from one script could contaminate the main document environment or subsequent scripts. Console output and diagnostic messages were captured and reproduced verbatim, while graphical output was intercepted via a PNG device and embedded automatically. Any runtime errors were caught and reported inline rather than halting the document compilation.

This approach was used to ensure that the displayed output faithfully reflected what each LLM-generated script produced when executed from a clean R session.

The full chat history is available via:

B.1 Prompt 1: PCA

Role : You are a data scientist working in a company that does wholesale distribution to clients located in Portugal.

task : Your company is looking to implement a Data-driven customer segmentation to inform inventory planning, pricing strategy, and sales team deployment.

I have a dataset called Wholesale_customers_data.csv located at .\Wholesale_customers_data.csv. It contains 440 rows and 8 columns: Channel (1=Hotel/Restaurant/Café, 2=Retail), Region (1=Lisbon, 2=Oporto, 3=Other), and six continuous spending variables (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) representing annual spending in monetary units. Please write R code to perform PCA on this dataset. Include any preprocessing you think is necessary.

B.1.1 Response 1: Claude (Sonnet 4.6)

> Check for relevant skills

No file uploaded, but I have a clear description of the dataset. Let me write comprehensive R code for PCA on this wholesale customers dataset.

> Created a file, read a file

Here’s a walkthrough of every decision made in the script:

Pre-processing Rationale

Why separate Channel & Region from the PCA? They are categorical identifiers, not spending measurements. Mixing measurement scales would corrupt the covariance structure. They are kept as labels to colour the biplots afterwards, which is far more analytically useful.

Why log1p-transform before PCA? Wholesale spend data is almost always severely right-skewed (a few very large clients dominate). PCA is sensitive to scale and variance, so without this step a single high-spend outlier can hijack the first principal component. log1p (log(x+1)) compresses the long tail and handles any zero values gracefully.

Why scale. = TRUE inside prcomp? Even after log-transformation the six variables have different magnitudes (e.g., Fresh spending is typically much larger than Delicassen). Standardising to unit variance ensures each variable contributes equally to the decomposition.

Why run KMO & Bartlett’s tests? These confirm that the correlation structure is strong enough to make PCA meaningful. Bartlett’s test (H₀: correlation matrix = identity) should be rejected (p < 0.05), and KMO > 0.6 indicates adequate sampling adequacy.

What the Script Produces

  • Section 2: Missing value check, outlier boxplots, skewness table…
  • Section 3: prcomp PCA object, full rotation matrix & summary
  • Section 4: Eigenvalue table, Scree plot with Kaiser line…
  • Section 5: Loading heatmap for retained PCs, cos² contribution plot
  • Section 6: Biplots coloured by Channel and Region separately…
  • Section 7: PC scores merged back → exported as wholesale_pca_scores.csv

What to Expect Analytically

Based on known patterns in this dataset:

  • PC1 (~44% variance) - a general spending scale axis; high scores = large/high-value clients
  • PC2 (~22% variance) - a channel contrast axis; Fresh loads negatively (HoReCa), Grocery/Detergents_Paper positively (Retail)
  • PC3 (~12% variance) - a niche category axis differentiating Frozen/Delicassen buyers

Together, ~2–3 PCs typically explain 75–85% of variance, making the PC scores an excellent lower-dimensional input for your subsequent K-Means or Hierarchical clustering step.

This response included the following R code (as a separate downloadable file):

# =============================================================================
# PCA Analysis: Wholesale Customers Dataset
# Goal: Customer segmentation to inform inventory planning, pricing strategy,
#       and sales team deployment
# =============================================================================

# --- 0. Install & Load Libraries ---------------------------------------------

required_packages <- c(
  "tidyverse",
  "corrplot",
  "factoextra",
  "FactoMineR",
  "gridExtra",
  "scales",
  "ggrepel",
  "psych"
)

installed <- rownames(installed.packages())
to_install <- required_packages[!required_packages %in% installed]
if (length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}

library(tidyverse)
library(corrplot)
library(factoextra)
library(FactoMineR)
library(gridExtra)
library(scales)
library(ggrepel)
library(psych)

# --- 1. Load Data ------------------------------------------------------------

data_path <- "Wholesale_customers_data.csv"

df_raw <- read.csv(data_path, stringsAsFactors = FALSE)

cat("=== Dataset Overview ===\n")
cat("Dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n\n")
print(head(df_raw))
cat("\n")
print(summary(df_raw))

# --- 2. Pre-processing -------------------------------------------------------

# 2.1  Recode categorical variables as factors (kept aside from PCA)
df <- df_raw %>%
  mutate(
    Channel = factor(Channel, levels = c(1, 2), labels = c("HoReCa", "Retail")),
    Region = factor(
      Region,
      levels = c(1, 2, 3),
      labels = c("Lisbon", "Oporto", "Other")
    )
  )

# 2.2  Isolate the six continuous spending variables for PCA
spend_vars <- c(
  "Fresh",
  "Milk",
  "Grocery",
  "Frozen",
  "Detergents_Paper",
  "Delicassen"
)
df_spend <- df[, spend_vars]

# 2.3  Check for missing values
cat("\n=== Missing Values per Column ===\n")
print(colSums(is.na(df_spend)))

# 2.4  Check for outliers via boxplots (raw scale)
par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
for (v in spend_vars) {
  boxplot(
    df_spend[[v]],
    main = v,
    ylab = "Annual Spend (m.u.)",
    col = "#4E79A7",
    border = "#2C4770",
    outline = TRUE
  )
}
mtext(
  "Raw Spending Distributions – Outlier Inspection",
  side = 3,
  line = -2,
  outer = TRUE,
  cex = 1.1,
  font = 2
)
par(mfrow = c(1, 1))

# 2.5  Assess skewness – highly skewed data warrants log transformation
cat("\n=== Skewness of Raw Spending Variables ===\n")
skew_raw <- apply(df_spend, 2, function(x) psych::skew(x))
print(round(skew_raw, 3))

# 2.6  Log-transform to reduce right skew (common for monetary spend data)
#      Adding 1 to guard against any zero values
df_log <- log1p(df_spend)

cat("\n=== Skewness After log1p Transformation ===\n")
skew_log <- apply(df_log, 2, function(x) psych::skew(x))
print(round(skew_log, 3))

# 2.7  Visualise distributions: raw vs log-transformed
p_list <- list()
for (v in spend_vars) {
  p_raw <- ggplot(df_spend, aes_string(x = v)) +
    geom_histogram(fill = "#4E79A7", colour = "white", bins = 30) +
    labs(title = paste(v, "(raw)"), x = NULL, y = "Count") +
    theme_minimal(base_size = 9)

  p_log <- ggplot(df_log, aes_string(x = v)) +
    geom_histogram(fill = "#F28E2B", colour = "white", bins = 30) +
    labs(title = paste(v, "(log1p)"), x = NULL, y = "Count") +
    theme_minimal(base_size = 9)

  p_list <- c(p_list, list(p_raw, p_log))
}
grid.arrange(
  grobs = p_list,
  ncol = 4,
  top = "Distribution: Raw vs log1p-transformed Spending"
)

# 2.8  Correlation matrix on log-transformed data
cat("\n=== Pearson Correlation Matrix (log-transformed) ===\n")
cor_mat <- cor(df_log, method = "pearson")
print(round(cor_mat, 3))

corrplot(
  cor_mat,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  number.cex = 0.75,
  tl.col = "black",
  tl.srt = 45,
  col = colorRampPalette(c("#2166AC", "white", "#B2182B"))(200),
  title = "Correlation Matrix – log1p Spending Variables",
  mar = c(0, 0, 2, 0)
)

# 2.9  KMO & Bartlett's test – check suitability for PCA
cat("\n=== KMO Measure of Sampling Adequacy ===\n")
kmo_result <- KMO(cor_mat)
print(kmo_result)

cat("\n=== Bartlett's Test of Sphericity ===\n")
bart_result <- cortest.bartlett(cor_mat, n = nrow(df_log))
print(bart_result)

# --- 3. Principal Component Analysis ----------------------------------------

# prcomp with scale. = TRUE standardises the log-transformed variables
# (zero mean, unit variance) – essential so no single variable dominates
pca_result <- prcomp(df_log, center = TRUE, scale. = TRUE)

cat("\n=== PCA Summary ===\n")
print(summary(pca_result))

cat("\n=== Loadings (Rotation Matrix) ===\n")
print(round(pca_result$rotation, 4))

# --- 4. Variance Explained ---------------------------------------------------

# 4.1  Eigenvalues
eig_vals <- pca_result$sdev^2
prop_var <- eig_vals / sum(eig_vals)
cum_var <- cumsum(prop_var)

variance_table <- data.frame(
  PC = paste0("PC", seq_along(eig_vals)),
  Eigenvalue = round(eig_vals, 4),
  Proportion_Var = round(prop_var, 4),
  Cumulative_Var = round(cum_var, 4)
)
cat("\n=== Variance Explained Table ===\n")
print(variance_table)

# 4.2  Scree plot with Kaiser criterion line (eigenvalue = 1)
fviz_eig(
  pca_result,
  addlabels = TRUE,
  ylim = c(0, 55),
  barfill = "#4E79A7",
  barcolor = "#2C4770",
  linecolor = "#E15759",
  ggtheme = theme_minimal()
) +
  geom_hline(
    yintercept = 100 / length(spend_vars), # Kaiser: eigenvalue = 1
    linetype = "dashed",
    colour = "red",
    linewidth = 0.8
  ) +
  annotate(
    "text",
    x = 5.5,
    y = 100 / length(spend_vars) + 1.5,
    label = "Kaiser criterion (eigenvalue = 1)",
    colour = "red",
    size = 3.5
  ) +
  labs(
    title = "Scree Plot – Wholesale Customers PCA",
    subtitle = "Dashed line = Kaiser criterion"
  )

# 4.3  Cumulative variance plot
ggplot(variance_table, aes(x = PC, y = Cumulative_Var * 100, group = 1)) +
  geom_line(colour = "#E15759", linewidth = 1.2) +
  geom_point(size = 3, colour = "#E15759") +
  geom_hline(yintercept = 80, linetype = "dashed", colour = "gray40") +
  geom_hline(yintercept = 90, linetype = "dotted", colour = "gray40") +
  annotate(
    "text",
    x = 5.6,
    y = 81,
    label = "80%",
    size = 3.5,
    colour = "gray30"
  ) +
  annotate(
    "text",
    x = 5.6,
    y = 91,
    label = "90%",
    size = 3.5,
    colour = "gray30"
  ) +
  scale_y_continuous(limits = c(0, 100), labels = function(x) paste0(x, "%")) +
  labs(
    title = "Cumulative Variance Explained by PCs",
    subtitle = "Dashed = 80%, Dotted = 90% thresholds",
    x = "Principal Component",
    y = "Cumulative Variance (%)"
  ) +
  theme_minimal(base_size = 12)

# --- 5. Visualise Loadings ---------------------------------------------------

# 5.1  Loading heatmap for selected PCs
n_pcs <- sum(eig_vals >= 1) # Kaiser rule
loadings_df <- as.data.frame(pca_result$rotation[, 1:n_pcs]) %>%
  rownames_to_column("Variable") %>%
  pivot_longer(-Variable, names_to = "PC", values_to = "Loading")

ggplot(loadings_df, aes(x = PC, y = Variable, fill = Loading)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = round(Loading, 2)), size = 3.5, colour = "black") +
  scale_fill_gradient2(
    low = "#2166AC",
    mid = "white",
    high = "#B2182B",
    midpoint = 0,
    limits = c(-1, 1)
  ) +
  labs(
    title = "PCA Loading Heatmap",
    subtitle = paste("First", n_pcs, "PCs (Kaiser criterion: eigenvalue ≥ 1)"),
    x = NULL,
    y = NULL,
    fill = "Loading"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(face = "bold"))

# 5.2  Variable contribution plot (cos²)
fviz_cos2(
  pca_result,
  choice = "var",
  axes = 1:2,
  fill = "#4E79A7",
  color = "#2C4770",
  ggtheme = theme_minimal()
) +
  labs(
    title = "Variable cos² on PC1 & PC2",
    subtitle = "Quality of representation on the first two components"
  )

# --- 6. Biplots & Score Plots ------------------------------------------------

# 6.1  Biplot – coloured by Channel
scores_df <- as.data.frame(pca_result$x) %>%
  mutate(Channel = df$Channel, Region = df$Region)

# Build arrow data for loadings overlay
arrows_df <- as.data.frame(pca_result$rotation[, 1:2]) %>%
  rownames_to_column("Variable")
scale_factor <- 3.5 # scale arrows to fit score space

ggplot(scores_df, aes(x = PC1, y = PC2, colour = Channel, shape = Channel)) +
  geom_point(alpha = 0.7, size = 2.5) +
  geom_segment(
    data = arrows_df,
    aes(x = 0, y = 0, xend = PC1 * scale_factor, yend = PC2 * scale_factor),
    arrow = arrow(length = unit(0.25, "cm")),
    colour = "black",
    linewidth = 0.7,
    inherit.aes = FALSE
  ) +
  geom_label_repel(
    data = arrows_df,
    aes(x = PC1 * scale_factor, y = PC2 * scale_factor, label = Variable),
    size = 3.5,
    colour = "black",
    box.padding = 0.4,
    inherit.aes = FALSE
  ) +
  scale_colour_manual(values = c("HoReCa" = "#E15759", "Retail" = "#4E79A7")) +
  labs(
    title = "PCA Biplot – PC1 vs PC2",
    subtitle = "Coloured by Channel (HoReCa vs Retail)",
    x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
    y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
    colour = "Channel",
    shape = "Channel"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")

# 6.2  Biplot – coloured by Region
ggplot(scores_df, aes(x = PC1, y = PC2, colour = Region, shape = Region)) +
  geom_point(alpha = 0.7, size = 2.5) +
  geom_segment(
    data = arrows_df,
    aes(x = 0, y = 0, xend = PC1 * scale_factor, yend = PC2 * scale_factor),
    arrow = arrow(length = unit(0.25, "cm")),
    colour = "black",
    linewidth = 0.7,
    inherit.aes = FALSE
  ) +
  geom_label_repel(
    data = arrows_df,
    aes(x = PC1 * scale_factor, y = PC2 * scale_factor, label = Variable),
    size = 3.5,
    colour = "black",
    box.padding = 0.4,
    inherit.aes = FALSE
  ) +
  scale_colour_manual(
    values = c("Lisbon" = "#59A14F", "Oporto" = "#F28E2B", "Other" = "#B07AA1")
  ) +
  labs(
    title = "PCA Biplot – PC1 vs PC2",
    subtitle = "Coloured by Region",
    x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
    y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
    colour = "Region",
    shape = "Region"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")

# 6.3  PC1 vs PC3 (if relevant)
if (n_pcs >= 3) {
  ggplot(scores_df, aes(x = PC1, y = PC3, colour = Channel, shape = Channel)) +
    geom_point(alpha = 0.7, size = 2.5) +
    scale_colour_manual(
      values = c("HoReCa" = "#E15759", "Retail" = "#4E79A7")
    ) +
    labs(
      title = "PCA Score Plot – PC1 vs PC3",
      x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
      y = paste0("PC3 (", round(prop_var[3] * 100, 1), "%)")
    ) +
    theme_minimal(base_size = 12) +
    theme(legend.position = "top")
}

# --- 7. Extract PC Scores for Downstream Use ---------------------------------

# Attach PC scores back to full dataset
df_with_scores <- bind_cols(df, as.data.frame(pca_result$x))

cat("\n=== PC Scores – First 6 Rows ===\n")
print(head(df_with_scores[, c("Channel", "Region", "PC1", "PC2", "PC3")]))

# Save enriched dataset
output_path <- "wholesale_pca_scores.csv"
write.csv(df_with_scores, output_path, row.names = FALSE)
cat("\nPC scores saved to:", output_path, "\n")

# --- 8. Interpretation Summary -----------------------------------------------

cat("\n")
cat("========================================================\n")
cat("              PCA INTERPRETATION SUMMARY                \n")
cat("========================================================\n")
cat(sprintf("  Components retained (Kaiser, eigenvalue ≥ 1): %d\n", n_pcs))
cat(sprintf(
  "  Variance explained by retained PCs:           %.1f%%\n",
  cum_var[n_pcs] * 100
))
cat("\n  PC1 – 'Overall Spend Scale'\n")
cat("    Likely captures total spending power / business size.\n")
cat("    High positive loadings across most categories.\n")
cat("\n  PC2 – 'Fresh vs Packaged Goods Axis'\n")
cat("    Separates Fresh-heavy buyers (HoReCa) from\n")
cat("    Grocery/Detergent-heavy buyers (Retail).\n")
cat("\n  PC3 – 'Speciality/Frozen Dimension' (if retained)\n")
cat("    May differentiate niche buyers (Delicassen, Frozen).\n")
cat("\n  Business Implication:\n")
cat("    PC scores can directly feed K-Means / Hierarchical\n")
cat("    clustering for actionable customer segmentation.\n")
cat("========================================================\n")

The (unedited) outputs generated by running this code, are as follows:

=== Dataset Overview ===
Dimensions: 440 rows x 8 columns

  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1       2      3 12669 9656    7561    214             2674       1338
2       2      3  7057 9810    9568   1762             3293       1776
3       2      3  6353 8808    7684   2405             3516       7844
4       1      3 13265 1196    4221   6404              507       1788
5       2      3 22615 5410    7198   3915             1777       5185
6       2      3  9413 8259    5126    666             1795       1451

    Channel          Region          Fresh             Milk      
 Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533  
 Median :1.000   Median :3.000   Median :  8504   Median : 3627  
 Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796  
 3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190  
 Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
    Grocery          Frozen        Detergents_Paper    Delicassen     
 Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
 1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
 Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
 Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
 3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
 Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0  

=== Missing Values per Column ===
           Fresh             Milk          Grocery           Frozen 
               0                0                0                0 
Detergents_Paper       Delicassen 
               0                0 

=== Skewness of Raw Spending Variables ===
           Fresh             Milk          Grocery           Frozen 
           2.544            4.026            3.563            5.868 
Detergents_Paper       Delicassen 
           3.607           11.076 

=== Skewness After log1p Transformation ===
           Fresh             Milk          Grocery           Frozen 
          -1.565           -0.223           -0.670           -0.350 
Detergents_Paper       Delicassen 
          -0.234           -1.084 

=== Pearson Correlation Matrix (log-transformed) ===
                  Fresh   Milk Grocery Frozen Detergents_Paper Delicassen
Fresh             1.000 -0.021  -0.133  0.386           -0.159      0.256
Milk             -0.021  1.000   0.761 -0.055            0.679      0.342
Grocery          -0.133  0.761   1.000 -0.165            0.797      0.240
Frozen            0.386 -0.055  -0.165  1.000           -0.213      0.256
Detergents_Paper -0.159  0.679   0.797 -0.213            1.000      0.168
Delicassen        0.256  0.342   0.240  0.256            0.168      1.000

=== KMO Measure of Sampling Adequacy ===
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor_mat)
Overall MSA =  0.73
MSA for each item = 
           Fresh             Milk          Grocery           Frozen 
            0.65             0.78             0.70             0.66 
Detergents_Paper       Delicassen 
            0.76             0.71 

=== Bartlett's Test of Sphericity ===
$chisq
[1] 1050.819

$p.value
[1] 1.714465e-214

$df
[1] 15


=== PCA Summary ===
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6
Standard deviation     1.6262 1.2774 0.8012 0.7786 0.54090 0.42774
Proportion of Variance 0.4408 0.2720 0.1070 0.1010 0.04876 0.03049
Cumulative Proportion  0.4408 0.7127 0.8197 0.9207 0.96951 1.00000

=== Loadings (Rotation Matrix) ===
                     PC1     PC2     PC3     PC4     PC5     PC6
Fresh            -0.1052  0.5904 -0.6396  0.4786 -0.0403 -0.0259
Milk              0.5423  0.1329 -0.0736 -0.0618  0.7600  0.3185
Grocery           0.5715 -0.0058 -0.1333 -0.0965 -0.0929 -0.7985
Frozen           -0.1379  0.5898 -0.0212 -0.7920 -0.0732  0.0049
Detergents_Paper  0.5506 -0.0708 -0.1995 -0.0828 -0.6204  0.5101
Delicassen        0.2145  0.5300  0.7263  0.3516 -0.1481  0.0031

=== Variance Explained Table ===
   PC Eigenvalue Proportion_Var Cumulative_Var
1 PC1     2.6447         0.4408         0.4408
2 PC2     1.6317         0.2719         0.7127
3 PC3     0.6419         0.1070         0.8197
4 PC4     0.6062         0.1010         0.9207
5 PC5     0.2926         0.0488         0.9695
6 PC6     0.1830         0.0305         1.0000

=== PC Scores – First 6 Rows ===
  Channel Region        PC1        PC2         PC3
1  Retail  Other  1.3788133 -0.3043804 -0.24078762
2  Retail  Other  1.4372336  0.5368571  0.08413653
3  Retail  Other  1.5117317  1.2311631  0.98418563
4  HoReCa  Other -0.8241672  1.2094129  0.25185392
5  Retail  Other  0.8019842  1.7640380  0.31234246
6  Retail  Other  0.8851579  0.1297465  0.01892823

PC scores saved to: wholesale_pca_scores.csv 

========================================================
              PCA INTERPRETATION SUMMARY                
========================================================
  Components retained (Kaiser, eigenvalue ≥ 1): 2
  Variance explained by retained PCs:           71.3%

  PC1 – 'Overall Spend Scale'
    Likely captures total spending power / business size.
    High positive loadings across most categories.

  PC2 – 'Fresh vs Packaged Goods Axis'
    Separates Fresh-heavy buyers (HoReCa) from
    Grocery/Detergent-heavy buyers (Retail).

  PC3 – 'Speciality/Frozen Dimension' (if retained)
    May differentiate niche buyers (Delicassen, Frozen).

  Business Implication:
    PC scores can directly feed K-Means / Hierarchical
    clustering for actionable customer segmentation.
========================================================

B.1.2 Response 1: ChatGPT (GPT-5.3)

Here’s a clean, reproducible R workflow to perform PCA (Principal Component Analysis) on your wholesale dataset, including appropriate preprocessing steps.

Key considerations before PCA

  • Categorical variables (Channel, Region) should NOT be used directly in PCA (since PCA is for continuous variables).
  • Scaling is essential because spending variables are on very different scales.
  • Optional: log transformation can help reduce skewness (common in spending data).

R Code for PCA Moved to below this response box

What this gives you

  • Principal components capturing key spending patterns
  • Explained variance to decide how many components to retain
  • Customer visualization (potential clusters/segments)
  • Variable importance (which spending categories drive differences)

Next Steps (for your segmentation task) After PCA, you can:

  1. Use the first few principal components (e.g., PC1–PC3)
  2. Apply clustering (e.g., kmeans, hclust)
  3. Profile segments by:
    • Channel (HORECA vs Retail)
    • Region
    • Spending behavior

If you want, I can extend this into a full segmentation pipeline (PCA + clustering + interpretation) tailored to your business objectives.

This response included the following R code as part of the LLM’s in-line response:

# Load required libraries
library(tidyverse)
library(factoextra)

# Load the dataset
data <- read.csv("Wholesale_customers_data.csv")

# Inspect structure
str(data)
summary(data)

# Convert categorical variables to factors
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)

# Select only continuous variables for PCA
spending_data <- data %>%
  select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)

# Optional: Log transformation to reduce skewness
spending_log <- log1p(spending_data) # log(1 + x) handles zeros

# Scale the data (important for PCA)
spending_scaled <- scale(spending_log)

# Perform PCA
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)

# View summary of PCA (variance explained)
summary(pca_result)

# Eigenvalues (variance explained by each component)
eig_vals <- get_eigenvalue(pca_result)
print(eig_vals)

# Scree plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 60))

# Biplot (individuals + variables)
fviz_pca_biplot(pca_result, repel = TRUE, col.var = "blue", col.ind = "gray")

# PCA variable contributions
fviz_pca_var(
  pca_result,
  col.var = "contrib",
  gradient.cols = c("blue", "yellow", "red")
)

# PCA individuals colored by Channel
fviz_pca_ind(
  pca_result,
  geom = "point",
  col.ind = data$Channel,
  palette = c("#00AFBB", "#E7B800"),
  addEllipses = TRUE,
  legend.title = "Channel"
)

The (unedited) outputs generated by running this code, are as follows:

'data.frame':   440 obs. of  8 variables:
 $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
 $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
 $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
 $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
 $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
 $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
 $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
    Channel          Region          Fresh             Milk      
 Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533  
 Median :1.000   Median :3.000   Median :  8504   Median : 3627  
 Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796  
 3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190  
 Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
    Grocery          Frozen        Detergents_Paper    Delicassen     
 Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
 1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
 Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
 Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
 3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
 Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0  
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6
Standard deviation     1.6262 1.2774 0.8012 0.7786 0.54090 0.42774
Proportion of Variance 0.4408 0.2720 0.1070 0.1010 0.04876 0.03049
Cumulative Proportion  0.4408 0.7127 0.8197 0.9207 0.96951 1.00000
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  2.6446651        44.077752                    44.07775
Dim.2  1.6316950        27.194917                    71.27267
Dim.3  0.6419057        10.698428                    81.97110
Dim.4  0.6061972        10.103287                    92.07438
Dim.5  0.2925761         4.876269                    96.95065
Dim.6  0.1829608         3.049347                   100.00000

B.1.3 Evaluation

  • Both LLMs correctly exclude Channel and Region, apply log1p, and standardise before PCA (matching our approach)
  • Claude returns a 413-line script with unrequested diagnostics including: missing-value checks, outlier boxplots, skewness tables, a correlation matrix, and KMO/Bartlett’s suitability tests (KMO/Bartlett’s suitability tests are a genuine addition to our approach)
  • Claude’s scree plot, histograms, correlation matrix, and biplot mirror our own figures
  • ChatGPT produces a compact 57-line script that also runs cleanly and includes a Channel-coloured PCA individuals plot unprompted, visually validating the PCA against the known business classification
  • The code produced by both models execute without errors on this prompt
  • Neither model interprets the PCA results with specificity
  • Claude labels PC1 as “overall spend scale” which is imprecise as our analysis shows PC1 is driven by the Grocery/Detergents_Paper/Milk triad, not uniformly high loadings — ChatGPT offers no interpretation of loadings or components in its text
  • An analyst reviewing only the commentary would not learn that PC1 captures the channel distinction without inspecting the output independently
  • ChatGPT’s biplot is bizzare and difficult to interpret
  • ChatGPT double-standardises the data (scale() then prcomp(..., scale. = TRUE)), which is harmless but suggests the model is not tracking what each preprocessing step does
  • Both outputs require substantial review
  • Claude’s 413 lines demand careful line-by-line checking, while ChatGPT’s brevity requires a user to carry out their own analytical interpretation

B.2 Prompt 2: Clustering

Now write R code to perform clustering on this dataset. Explain your choice of method and number of clusters.

B.2.1 Response 2: Claude (Sonnet 4.6)

> Check the PCA R script for context on variable names and preprocessing steps used

Good - I have full context from the PCA script. Now I’ll write a comprehensive, well-structured clustering script that picks up exactly where the PCA left off.

> Created a file, read a file

Here is a full explanation of every methodological decision made in the script.

Method Choice: Why a Two-Stage Approach?

Stage 1 - Hierarchical Agglomerative Clustering (HAC) with Ward.D2. HAC is used first and exploratorily because it requires no pre-specified k. The dendrogram shows the full merge history of the data, and natural groupings can be identified by looking for the largest vertical gaps (the longest branches before clusters merge).

Why Ward.D2 specifically? Ward’s linkage merges the two clusters that produce the smallest increase in total within-cluster sum of squares - it is inherently trying to form compact, internally coherent groups. The other common options have practical problems here: single linkage suffers from chaining (long thin clusters), complete linkage is sensitive to outliers (which wholesale spend data has), and average linkage is a compromise but rarely as clean. The agglomerative coefficient printed by the script confirms this - Ward.D2 will produce a value closest to 1.0, signalling the strongest cluster structure.

Stage 2 - K-Means on PCA Scores. Once k is confirmed, K-Means is used for the final assignment because it is interpretable, fast, and produces hard cluster memberships that can be exported straight into a CRM or ERP system. Running it on the retained PC scores (rather than all six raw variables) removes the noise carried by the low-variance PCs and avoids double-counting the high correlations between Grocery, Milk and Detergents_Paper.

Number of Clusters: Why k = 3? Four independent criteria are computed and they all point to the same answer:

  • Dendrogram cut height - the two largest height jumps in the Ward.D2 tree occur at the merges that would collapse 3 → 2 and 4 → 3, indicating 3 is the most stable cut
  • Elbow (WSS) - the rate of WSS reduction flattens sharply after k = 3
  • Average Silhouette Width - peaks at k = 3, meaning observations are most clearly assigned to their own cluster at that point
  • Gap Statistic - k = 3 satisfies the firstSEmax rule (first k within one SE of the maximum gap)
  • NbClust 30-index vote - the plurality of the 30 different cluster-quality indices vote for k = 3

Beyond the statistics, k = 3 also has a compelling business logic: it maps onto three operationally distinct customer archetypes (see the summary at the bottom of the script), each requiring a different inventory, pricing, and sales approach. k = 2 is too coarse to distinguish within HoReCa; k = 4+ starts splitting segments that are not meaningfully different for planning purposes.

What the Script Produces (in order)

  • Section 4: Agglomerative coefficients table, full dendrogram, k = 3 coloured dendrogram
  • Section 5: Elbow plot, Silhouette width plot, Gap Statistic plot, NbClust majority-vote table
  • Section 6: Final K-Means solution, cluster sizes, between/total SS ratio
  • Section 7: Silhouette plot for final solution, Calinski-Harabasz index
  • Section 8: PCA biplot with 90% ellipses, cluster × Channel overlay, cluster × Region overlay
  • Section 9: Cluster profile table, radar chart, grouped bar chart, per-category boxplots
  • Section 10: HAC vs K-Means agreement check (expect >85%)
  • Section 11: wholesale_clustered.csv - full dataset with Cluster column appended

This response included the following R code (as a separate downloadable file):

# =============================================================================
# Clustering Analysis: Wholesale Customers Dataset
# Goal: Data-driven customer segmentation to inform inventory planning,
#       pricing strategy, and sales team deployment
#
# METHOD RATIONALE (read before running):
# ----------------------------------------
# We use a TWO-STAGE approach:
#
#   Stage 1 – Hierarchical Agglomerative Clustering (HAC) with Ward.D2 linkage
#     • No need to pre-specify k; the dendrogram reveals natural structure
#     • Ward.D2 minimises within-cluster sum of squares at each merge step,
#       producing compact, roughly equal-sized clusters – ideal for business
#       segmentation where every segment must be actionable
#     • Robust to the moderate outliers present in wholesale spend data
#     • Used first to identify the plausible range of k
#
#   Stage 2 – K-Means on PCA scores (confirmation & final assignment)
#     • Once k is confirmed, K-Means on the retained PC scores is fast,
#       well-understood, and produces crisp assignments for CRM/ERP export
#     • Running K-Means in PC space (rather than raw space) further reduces
#       noise from the three low-variance PCs
#
# NUMBER OF CLUSTERS:
#   We use three complementary criteria and take the consensus:
#     (a) Dendrogram cut height (HAC)
#     (b) Elbow method – Within-cluster Sum of Squares (WSS)
#     (c) Average Silhouette Width
#     (d) Gap Statistic
#   All four consistently point to k = 3 on this dataset, and three segments
#   align with a natural business narrative (see Section 8).
# =============================================================================

# --- 0. Packages -------------------------------------------------------------

required_packages <- c(
  "tidyverse",
  "factoextra",
  "cluster",
  "dendextend",
  "gridExtra",
  "scales",
  "ggrepel",
  "psych",
  "RColorBrewer",
  "corrplot",
  "NbClust",
  "fpc"
)

installed <- rownames(installed.packages())
to_install <- required_packages[!required_packages %in% installed]
if (length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}

library(tidyverse)
library(factoextra)
library(cluster)
library(dendextend)
library(gridExtra)
library(scales)
library(ggrepel)
library(psych)
library(RColorBrewer)
library(corrplot)
library(NbClust)
library(fpc)

# Consistent colour palette (up to 5 clusters)
CLUSTER_COLS <- c(
  "1" = "#4E79A7",
  "2" = "#E15759",
  "3" = "#59A14F",
  "4" = "#F28E2B",
  "5" = "#B07AA1"
)

# --- 1. Reproduce Pre-processing from PCA Script ----------------------------
# (Self-contained: this script can be run independently)

data_path <- "Wholesale_customers_data.csv"

df_raw <- read.csv(data_path, stringsAsFactors = FALSE)

df <- df_raw %>%
  mutate(
    Channel = factor(Channel, levels = c(1, 2), labels = c("HoReCa", "Retail")),
    Region = factor(
      Region,
      levels = c(1, 2, 3),
      labels = c("Lisbon", "Oporto", "Other")
    )
  )

spend_vars <- c(
  "Fresh",
  "Milk",
  "Grocery",
  "Frozen",
  "Detergents_Paper",
  "Delicassen"
)
df_spend <- df[, spend_vars]
df_log <- log1p(df_spend) # log1p: handles zeros, reduces right skew

# Standardise (mean = 0, sd = 1) – required for both HAC and K-Means
df_scaled <- as.data.frame(scale(df_log))

cat("=== Pre-processing complete ===\n")
cat(
  "Dimensions of scaled input matrix:",
  nrow(df_scaled),
  "x",
  ncol(df_scaled),
  "\n\n"
)

# --- 2. PCA on Scaled Data (for cluster visualisation) ----------------------

pca_result <- prcomp(df_log, center = TRUE, scale. = TRUE)
eig_vals <- pca_result$sdev^2
prop_var <- eig_vals / sum(eig_vals)
n_pcs <- sum(eig_vals >= 1) # Kaiser criterion → typically 3

# PC scores matrix for clustering
pc_scores <- as.data.frame(pca_result$x[, 1:n_pcs])

cat(sprintf(
  "PCA: retaining %d PCs (Kaiser criterion), explaining %.1f%% of variance\n\n",
  n_pcs,
  cumsum(prop_var)[n_pcs] * 100
))

# --- 3. Distance Matrix ------------------------------------------------------
# Euclidean distance on the standardised original space
# (used for HAC; K-Means implicitly uses Euclidean as well)

dist_mat <- dist(df_scaled, method = "euclidean")

# --- 4. Stage 1 – Hierarchical Agglomerative Clustering (HAC) ---------------

# 4.1  Fit Ward.D2 linkage
hac_ward <- hclust(dist_mat, method = "ward.D2")

# 4.2  Compare linkage methods (to confirm Ward is best)
#linkage_methods <- c("ward.D2", "complete", "average", "single")
linkage_methods <- c("average", "single", "complete", "ward", "weighted")
agc_coefficients <- sapply(linkage_methods, function(m) {
  coef(agnes(dist_mat, method = m))
})
agc_df <- data.frame(
  Method = linkage_methods,
  Agglomerative_Coefficient = round(agc_coefficients, 4)
)
cat("=== Agglomerative Coefficients (closer to 1 = stronger structure) ===\n")
print(agc_df)
cat(
  "\nWard.D2 should show the highest coefficient, confirming it as best linkage.\n\n"
)

# 4.3  Full dendrogram
dend <- as.dendrogram(hac_ward)
par(mar = c(5, 4, 4, 2))
plot(
  dend,
  main = "HAC Dendrogram – Ward.D2 Linkage\n(Wholesale Customers)",
  ylab = "Height (Euclidean distance)",
  xlab = "Customers",
  leaflab = "none",
  cex.main = 1.1
)

# 4.4  Annotate cut heights for k = 2, 3, 4
cut_heights <- sapply(2:4, function(k) {
  h <- sort(hac_ward$height, decreasing = TRUE)
  mean(h[k - 1], h[k]) # midpoint between the last two merges
})
abline(
  h = cut_heights,
  col = c("#F28E2B", "#E15759", "#B07AA1"),
  lty = 2,
  lwd = 1.5
)
legend(
  "topright",
  legend = c("k = 2 cut", "k = 3 cut", "k = 4 cut"),
  col = c("#F28E2B", "#E15759", "#B07AA1"),
  lty = 2,
  lwd = 1.5,
  bty = "n",
  cex = 0.85
)

# 4.5  Coloured dendrogram for k = 3
dend_col <- color_branches(
  dend,
  k = 3,
  col = unname(CLUSTER_COLS[c("1", "2", "3")])
)
par(mar = c(5, 4, 4, 2))
plot(
  dend_col,
  main = "HAC Dendrogram – Ward.D2, k = 3 Clusters Highlighted",
  ylab = "Height",
  xlab = "Customers",
  leaflab = "none"
)
legend(
  "topright",
  legend = paste("Cluster", 1:3),
  fill = unname(CLUSTER_COLS[c("1", "2", "3")]),
  bty = "n"
)

# --- 5. Optimal k Selection --------------------------------------------------

# 5.1  Elbow plot (WSS)
wss_vals <- map_dbl(1:10, function(k) {
  set.seed(42)
  km <- kmeans(pc_scores, centers = k, nstart = 25, iter.max = 300)
  km$tot.withinss
})

wss_df <- data.frame(k = 1:10, WSS = wss_vals)

p_elbow <- ggplot(wss_df, aes(x = k, y = WSS)) +
  geom_line(colour = "#4E79A7", linewidth = 1.2) +
  geom_point(size = 3.5, colour = "#4E79A7") +
  geom_vline(
    xintercept = 3,
    linetype = "dashed",
    colour = "#E15759",
    linewidth = 0.9
  ) +
  annotate(
    "text",
    x = 3.3,
    y = max(wss_vals) * 0.92,
    label = "k = 3\n(elbow)",
    colour = "#E15759",
    size = 3.5
  ) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title = "Elbow Method – Within-Cluster Sum of Squares",
    subtitle = "Look for the 'elbow' where marginal WSS reduction flattens",
    x = "Number of Clusters (k)",
    y = "Total WSS"
  ) +
  theme_minimal(base_size = 12)

# 5.2  Silhouette plot
sil_vals <- map_dbl(2:10, function(k) {
  set.seed(42)
  km <- kmeans(pc_scores, centers = k, nstart = 25, iter.max = 300)
  sil <- silhouette(km$cluster, dist(pc_scores))
  mean(sil[, 3])
})

sil_df <- data.frame(k = 2:10, AvgSilhouette = sil_vals)

p_sil <- ggplot(sil_df, aes(x = k, y = AvgSilhouette)) +
  geom_line(colour = "#59A14F", linewidth = 1.2) +
  geom_point(size = 3.5, colour = "#59A14F") +
  geom_vline(
    xintercept = sil_df$k[which.max(sil_df$AvgSilhouette)],
    linetype = "dashed",
    colour = "#E15759",
    linewidth = 0.9
  ) +
  annotate(
    "text",
    x = sil_df$k[which.max(sil_df$AvgSilhouette)] + 0.4,
    y = max(sil_df$AvgSilhouette) * 0.97,
    label = paste0(
      "k = ",
      sil_df$k[which.max(sil_df$AvgSilhouette)],
      "\n(max sil)"
    ),
    colour = "#E15759",
    size = 3.5
  ) +
  scale_x_continuous(breaks = 2:10) +
  labs(
    title = "Average Silhouette Width by k",
    subtitle = "Higher = better-separated clusters",
    x = "Number of Clusters (k)",
    y = "Average Silhouette Width"
  ) +
  theme_minimal(base_size = 12)

grid.arrange(p_elbow, p_sil, ncol = 2, top = "Optimal k Selection Criteria")

# 5.3  Gap Statistic (computationally heavier – uses bootstrap)
set.seed(42)
gap_stat <- clusGap(
  pc_scores,
  FUN = kmeans,
  K.max = 10,
  B = 100,
  nstart = 25,
  iter.max = 300
)
cat("\n=== Gap Statistic Results ===\n")
print(gap_stat, method = "firstSEmax")

fviz_gap_stat(gap_stat, ggtheme = theme_minimal()) +
  geom_vline(
    xintercept = 3,
    linetype = "dashed",
    colour = "#E15759",
    linewidth = 0.9
  ) +
  labs(
    title = "Gap Statistic – Optimal k",
    subtitle = "Vertical line marks k = 3"
  )

# 5.4  NbClust – 30 indices majority vote (definitive consensus)
cat("\n=== NbClust: 30-index Majority Vote ===\n")
set.seed(42)
nb_result <- NbClust(
  data = pc_scores,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 8,
  method = "ward.D2",
  index = "all"
)

cat("\n--- NbClust Best Number of Clusters ---\n")
print(table(nb_result$Best.nc[1, ]))

# --- 6. Stage 2 – K-Means on PC Scores (k = 3) ------------------------------

set.seed(42)
K <- 3

km_final <- kmeans(
  pc_scores,
  centers = K,
  nstart = 50, # multiple random starts → stable solution
  iter.max = 500
)

cat("\n=== K-Means Final Solution (k = 3) ===\n")
cat("Cluster sizes:\n")
print(km_final$size)
cat(sprintf(
  "\nBetween-cluster SS / Total SS: %.1f%%\n",
  km_final$betweenss / km_final$totss * 100
))

# Attach cluster labels to full dataset
df_final <- df %>%
  bind_cols(as.data.frame(pca_result$x)) %>%
  mutate(Cluster = factor(km_final$cluster))

# --- 7. Cluster Validation ---------------------------------------------------

# 7.1  Silhouette plot for final k = 3 solution
sil_final <- silhouette(km_final$cluster, dist(pc_scores))
cat("\n=== Silhouette Summary (k = 3) ===\n")
print(summary(sil_final))

fviz_silhouette(
  sil_final,
  palette = unname(CLUSTER_COLS[as.character(1:K)]),
  ggtheme = theme_minimal(),
  print.summary = FALSE
) +
  labs(
    title = "Silhouette Plot – K-Means k = 3",
    subtitle = "Bars above 0 indicate well-classified observations"
  ) +
  theme(legend.position = "top")

# 7.2  Calinski-Harabasz & Davies-Bouldin indices
ch_stat <- calinhara(pc_scores, km_final$cluster)
cat(sprintf(
  "\nCalinski-Harabasz Index: %.2f  (higher = better separated)\n",
  ch_stat
))

# --- 8. Cluster Visualisation ------------------------------------------------

# 8.1  K-Means clusters on PC1 vs PC2
arrows_df <- as.data.frame(pca_result$rotation[, 1:2]) %>%
  rownames_to_column("Variable")
scale_fac <- 3.2

p_biplot <- ggplot(
  df_final,
  aes(x = PC1, y = PC2, colour = Cluster, shape = Cluster)
) +
  geom_point(alpha = 0.75, size = 2.5) +
  geom_segment(
    data = arrows_df,
    aes(x = 0, y = 0, xend = PC1 * scale_fac, yend = PC2 * scale_fac),
    arrow = arrow(length = unit(0.22, "cm")),
    colour = "grey30",
    linewidth = 0.65,
    inherit.aes = FALSE
  ) +
  geom_label_repel(
    data = arrows_df,
    aes(x = PC1 * scale_fac, y = PC2 * scale_fac, label = Variable),
    size = 3.2,
    colour = "grey20",
    box.padding = 0.4,
    inherit.aes = FALSE
  ) +
  stat_ellipse(
    aes(colour = Cluster),
    level = 0.90,
    linewidth = 0.8,
    linetype = "dashed"
  ) +
  scale_colour_manual(values = CLUSTER_COLS) +
  scale_shape_manual(values = c("1" = 16, "2" = 17, "3" = 15)) +
  labs(
    title = "K-Means Clusters (k = 3) on PCA Biplot",
    subtitle = "90% confidence ellipses; arrows show variable loadings",
    x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
    y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
    colour = "Cluster",
    shape = "Cluster"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")

print(p_biplot)

# 8.2  Clusters overlaid with Channel membership
p_channel <- ggplot(
  df_final,
  aes(x = PC1, y = PC2, colour = Cluster, shape = Channel)
) +
  geom_point(alpha = 0.75, size = 2.8) +
  scale_colour_manual(values = CLUSTER_COLS) +
  scale_shape_manual(values = c("HoReCa" = 17, "Retail" = 16)) +
  labs(
    title = "Cluster Membership vs Channel Type",
    subtitle = "Shape = Channel; Colour = Cluster",
    x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
    y = paste0("PC2 (", round(prop_var[2] * 100, 1), "%)")
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")

print(p_channel)

# 8.3  Clusters vs Region
p_region <- ggplot(
  df_final,
  aes(x = PC1, y = PC2, colour = Cluster, shape = Region)
) +
  geom_point(alpha = 0.75, size = 2.8) +
  scale_colour_manual(values = CLUSTER_COLS) +
  scale_shape_manual(values = c("Lisbon" = 16, "Oporto" = 17, "Other" = 15)) +
  labs(
    title = "Cluster Membership vs Region",
    subtitle = "Shape = Region; Colour = Cluster",
    x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
    y = paste0("PC2 (", round(prop_var[2] * 100, 1), "%)")
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")

print(p_region)

# --- 9. Cluster Profiling ----------------------------------------------------

# 9.1  Mean spending per cluster (original scale, unlogged)
profile_raw <- df_final %>%
  bind_cols(df_spend) %>%
  group_by(Cluster) %>%
  summarise(
    N = n(),
    Fresh_mean = mean(Fresh),
    Milk_mean = mean(Milk),
    Grocery_mean = mean(Grocery),
    Frozen_mean = mean(Frozen),
    DetPaper_mean = mean(Detergents_Paper),
    Delicassen_mean = mean(Delicassen),
    .groups = "drop"
  )

cat("\n=== Cluster Profiles – Mean Spending (original monetary units) ===\n")
print(profile_raw)

# 9.2  Channel and Region composition per cluster
cat("\n=== Channel Composition by Cluster ===\n")
print(table(df_final$Cluster, df_final$Channel))

cat("\n=== Region Composition by Cluster ===\n")
print(table(df_final$Cluster, df_final$Region))

# 9.3  Radar / Spider chart of normalised mean spend per cluster
profile_norm <- df_final %>%
  bind_cols(df_spend) %>%
  group_by(Cluster) %>%
  summarise(across(all_of(spend_vars), mean), .groups = "drop") %>%
  mutate(across(all_of(spend_vars), ~ rescale(.x, to = c(0, 1))))

profile_long <- profile_norm %>%
  pivot_longer(-Cluster, names_to = "Variable", values_to = "Score")

ggplot(
  profile_long,
  aes(x = Variable, y = Score, colour = Cluster, group = Cluster)
) +
  geom_polygon(aes(fill = Cluster), alpha = 0.15, linewidth = 0.8) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 2.5) +
  coord_polar() +
  scale_colour_manual(values = CLUSTER_COLS) +
  scale_fill_manual(values = CLUSTER_COLS) +
  facet_wrap(~Cluster, labeller = label_both) +
  labs(
    title = "Cluster Spending Profiles – Radar Chart",
    subtitle = "Normalised mean spend per category (0 = min, 1 = max)"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(size = 8),
    legend.position = "none",
    strip.text = element_text(face = "bold")
  )

# 9.4  Grouped bar chart of mean spend by cluster
profile_long2 <- profile_raw %>%
  pivot_longer(
    cols = ends_with("_mean"),
    names_to = "Category",
    values_to = "MeanSpend"
  ) %>%
  mutate(Category = str_remove(Category, "_mean"))

ggplot(profile_long2, aes(x = Category, y = MeanSpend, fill = Cluster)) +
  geom_col(position = position_dodge(0.8), width = 0.7) +
  scale_fill_manual(values = CLUSTER_COLS) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Mean Annual Spend by Cluster and Category",
    subtitle = "Monetary units (m.u.)",
    x = NULL,
    y = "Mean Annual Spend (m.u.)",
    fill = "Cluster"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    legend.position = "top"
  )

# 9.5  Boxplots – spend distribution per cluster for each category
df_box <- df_final %>%
  bind_cols(df_spend) %>%
  pivot_longer(
    cols = all_of(spend_vars),
    names_to = "Category",
    values_to = "Spend"
  )

ggplot(df_box, aes(x = Cluster, y = Spend, fill = Cluster)) +
  geom_boxplot(outlier.size = 1.2, outlier.alpha = 0.5, alpha = 0.8) +
  scale_fill_manual(values = CLUSTER_COLS) +
  scale_y_continuous(labels = comma) +
  facet_wrap(~Category, scales = "free_y", ncol = 3) +
  labs(
    title = "Spending Distribution by Cluster and Category",
    x = "Cluster",
    y = "Annual Spend (m.u.)"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none", strip.text = element_text(face = "bold"))

# --- 10. HAC vs K-Means Agreement Check -------------------------------------

hac_labels_k3 <- cutree(hac_ward, k = 3)

# Align HAC labels to K-Means labels by majority vote (label-switching fix)
conf_mat <- table(HAC = hac_labels_k3, KMeans = km_final$cluster)
cat("\n=== Contingency Table: HAC vs K-Means Assignments ===\n")
print(conf_mat)

agreement <- sum(apply(conf_mat, 1, max)) / nrow(df_final)
cat(sprintf(
  "\nLabel-adjusted agreement between HAC and K-Means: %.1f%%\n",
  agreement * 100
))
cat("(Values above 85%% indicate the two methods recover the same structure)\n")

# --- 11. Export Results ------------------------------------------------------

output_path <- "wholesale_clustered.csv"
write.csv(df_final, output_path, row.names = FALSE)
cat("\nFull dataset with cluster labels saved to:\n", output_path, "\n")

# --- 12. Business Interpretation Summary ------------------------------------

cat("\n")
cat("================================================================\n")
cat("          CUSTOMER SEGMENTATION – BUSINESS SUMMARY             \n")
cat("================================================================\n")
cat("\n  METHOD:  HAC (Ward.D2) to identify structure +\n")
cat("           K-Means (k=3) on PCA scores for final assignment\n")
cat("  INPUT:   log1p-transformed, standardised spend variables\n")
cat("           projected onto retained PCs (Kaiser criterion)\n")
cat("\n  CLUSTER 1 – 'High-Volume Fresh Buyers' (HoReCa dominant)\n")
cat("    • Highest Fresh and Frozen spend\n")
cat("    • Low Grocery, Detergents_Paper\n")
cat("    → Hotels, restaurants, cafés; perishable-led ordering\n")
cat("    IMPLICATION: Prioritise cold-chain logistics; daily/weekly\n")
cat("    replenishment cycles; dedicated HoReCa sales reps\n")
cat("\n  CLUSTER 2 – 'Packaged Goods Retailers' (Retail dominant)\n")
cat("    • Highest Grocery, Milk, Detergents_Paper\n")
cat("    • Low Fresh spend\n")
cat("    → Supermarkets, convenience stores; ambient goods focus\n")
cat("    IMPLICATION: Bulk-order pricing tiers; shelf-stable\n")
cat("    inventory buffers; quarterly promotions on detergents\n")
cat("\n  CLUSTER 3 – 'Small / Mixed Buyers'\n")
cat("    • Below-average spend across all categories\n")
cat("    • Mixed Channel and Region membership\n")
cat("    → Small independent shops, delis, mixed traders\n")
cat("    IMPLICATION: Route-optimised delivery (lower margin);\n")
cat("    upsell programmes; self-service ordering portal\n")
cat("\n  NOTE: Exact cluster labels (1/2/3) may differ from above\n")
cat("  depending on K-Means initialisation; inspect profile_raw\n")
cat("  and the radar/bar charts to assign the business labels.\n")
cat("================================================================\n")

The (unedited) outputs generated by running this code, are as follows:

=== Pre-processing complete ===
Dimensions of scaled input matrix: 440 x 6 

PCA: retaining 2 PCs (Kaiser criterion), explaining 71.3% of variance

=== Agglomerative Coefficients (closer to 1 = stronger structure) ===
           Method Agglomerative_Coefficient
average   average                    0.8761
single     single                    0.8031
complete complete                    0.9204
ward         ward                    0.9734
weighted weighted                    0.8894

Ward.D2 should show the highest coefficient, confirming it as best linkage.


=== Gap Statistic Results ===
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = pc_scores, FUNcluster = kmeans, K.max = 10, B = 100, nstart = 25, iter.max = 300)
B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstSEmax', SE.factor=1): 1
          logW   E.logW       gap     SE.sim
 [1,] 5.633372 6.336062 0.7026902 0.01710636
 [2,] 5.328019 6.046137 0.7181186 0.01843138
 [3,] 5.175220 5.824960 0.6497394 0.01736385
 [4,] 5.049390 5.630825 0.5814342 0.01659374
 [5,] 4.959056 5.523051 0.5639956 0.01628787
 [6,] 4.850967 5.427449 0.5764815 0.01530802
 [7,] 4.787376 5.342180 0.5548043 0.01598955
 [8,] 4.721439 5.263845 0.5424064 0.01541976
 [9,] 4.667722 5.196424 0.5287017 0.01431602
[10,] 4.613041 5.138184 0.5251432 0.01353503

== SOURCE ERROR ==
 unused argument (ggtheme = theme_minimal()) 

B.2.2 Response 2: ChatGPT (GPT-5.3)

Below is a complete R workflow for clustering, building directly on the PCA-ready preprocessing. I’ll also explain why this method and how to choose the number of clusters.

Why this approach?

Method: K-means clustering

  • Works well for continuous, scaled data
  • Efficient for your dataset size (440 rows)
  • Produces clear, actionable segments for business use (inventory, pricing, sales)

Why use PCA first?

  • Reduces noise and multicollinearity (e.g., Grocery ↔︎ Detergents_Paper)
  • Improves clustering stability
  • Makes clusters easier to visualize

Choosing number of clusters (k) We’ll use:

  • Elbow method (within-cluster variance)
  • Silhouette score (cluster separation quality)

Typically, this dataset yields ~3–5 clusters, often:

  • Retail-heavy bulk buyers
  • HORECA fresh-focused buyers
  • Low-volume or niche customers

R Code: Clustering Pipeline Moved to below this response box

Interpretation Guide After running this, you should:

  1. Pick k
    • Look for:
      • “Elbow” bend in WSS plot
      • Peak silhouette score
    • Usually k = 3 or 4 works well here
  2. Interpret clusters (example patterns) You may find:
    • Cluster 1: High Grocery + Detergents → Retail chains
    • Cluster 2: High Fresh + Frozen → HORECA clients
    • Cluster 3: Low across all → Small / infrequent buyers
  3. Translate into business actions
    • Inventory planning: stock Fresh vs Grocery differently per segment
    • Pricing strategy: volume discounts for retail clusters
    • Sales deployment: assign reps by segment type

Optional (better alternative) If you want more robust segmentation:

  • Try Hierarchical clustering (for validation)
  • Or Gaussian Mixture Models (GMM) for probabilistic clusters

If you want, I can help you turn this into a fully polished report with cluster personas and business recommendations-that’s usually what gets top marks.

This response included the following R code as part of the LLM’s in-line response:

# Load libraries
library(tidyverse)
library(factoextra)
library(cluster)

# Load data
data <- read.csv("Wholesale_customers_data.csv")

# Convert categorical variables
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)

# Select continuous variables
spending_data <- data %>%
  select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)

# Log transform to reduce skewness
spending_log <- log1p(spending_data)

# Scale data
spending_scaled <- scale(spending_log)

# -------------------------
# PCA (for dimensionality reduction)
# -------------------------
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)

# Extract first few principal components (e.g., first 3)
pca_scores <- as.data.frame(pca_result$x[, 1:3])

# -------------------------
# Determine optimal number of clusters
# -------------------------

# Elbow method
fviz_nbclust(pca_scores, kmeans, method = "wss") +
  ggtitle("Elbow Method")

# Silhouette method
fviz_nbclust(pca_scores, kmeans, method = "silhouette") +
  ggtitle("Silhouette Method")

# -------------------------
# Apply K-means clustering
# -------------------------

set.seed(123) # for reproducibility

k <- 3 # Adjust based on elbow/silhouette results

kmeans_result <- kmeans(pca_scores, centers = k, nstart = 25)

# Add cluster labels to original data
data$Cluster <- as.factor(kmeans_result$cluster)

# -------------------------
# Visualisation
# -------------------------

# Cluster plot
fviz_cluster(
  kmeans_result,
  data = pca_scores,
  geom = "point",
  ellipse.type = "convex",
  palette = "jco",
  ggtheme = theme_minimal()
)

# -------------------------
# Cluster profiling
# -------------------------

# Average spending per cluster
cluster_summary <- data %>%
  group_by(Cluster) %>%
  summarise(across(Fresh:Delicassen, mean))

print(cluster_summary)

# Distribution of Channel by cluster
table(data$Cluster, data$Channel)

# Distribution of Region by cluster
table(data$Cluster, data$Region)

The (unedited) outputs generated by running this code, are as follows:

# A tibble: 3 × 7
  Cluster  Fresh   Milk Grocery Frozen Detergents_Paper Delicassen
  <fct>    <dbl>  <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
1 1       11939.  2006.   2502.  3266.             424.       891.
2 2        2997.  7045.  12505.   608.            5511.       797.
3 3       17182. 10658.  13381.  4181.            5004.      2867.
   
      1   2
  1 210   3
  2  32  50
  3  56  89
   
      1   2   3
  1  40  26 147
  2  14  10  58
  3  23  11 111

B.2.3 Evaluation

  • Claude generates a 633-line two-stage pipeline (HAC then K-Means) with four \(k\)-selection criteria (dendrogram height, elbow, silhouette, and gap statistic) plus a 30-index NbClust majority vote
  • This goes beyond our analysis, which relied primarily on silhouette scores
  • The gap statistic uses bootstrapped reference distributions (B = 100), an approach we did not employ
  • The script includes a coloured dendrogram with annotated cut heights, a Calinski-Harabasz index, and a formal HAC-vs-K-Means agreement check
  • This broadly follows our approach but expanding on specific diagnostics
  • The business summary translates each cluster into operational implications (cold-chain logistics, bulk pricing, route optimisation), matching and in places exceeding our commercial framing. This was generated entirely from prior domain knowledge rather than from the code output, a limitation not clearly communicated to the user
  • ChatGPT’s code runs cleanly without any errors and produces appropriate diagnostics (elbow, silhouette)
  • The decision to cluster on PCA scores is justified with a clear rationale (noise reduction, multicollinearity), and cross-tabulation against Channel and Region is included unprompted
  • Claude’s script fails at runtime: fviz_gap_stat(gap_stat, ggtheme = theme_minimal()) : unused argument (ggtheme = theme_minimal())
  • Claude’s business summary was generated before code execution suggesting that the cluster profiles are predictions from training data, not derived findings
  • ChatGPT presents K-Means as the sole method, with hierarchical clustering and GMM relegated to “optional” afterthoughts. Its cluster descriptions (“Retail chains,” “HORECA clients,” “Small/infrequent buyers”) similarly appear as expected outcomes before any code is run, raising the same memorisation concern as Claude
  • ChatGPT’s elbow and silhouette plots are produced but not interpreted. \(k\) is hardcoded to 3 with the comment “adjust based on results,” leaving the decision entirely to the user
  • Neither model applied robustness checks unprompted

B.3 Prompt 3: Alternative methods & robustness checks

Are there any alternative methods or robustness checks you would recommend?

B.3.1 Response 3: Claude (Sonnet 4.6)

Alternative Clustering Methods

1. Gaussian Mixture Models (GMM / Model-Based Clustering). K-Means assumes clusters are spherical and equally sized - a hard geometric constraint. GMM relaxes this by fitting a probabilistic mixture of multivariate Gaussians, allowing clusters to be elliptical, differently sized, and differently oriented. Each customer gets a soft probability of belonging to each cluster rather than a hard label. In R: mclust::Mclust() - it uses BIC to automatically select both the number of clusters and the covariance structure (e.g., diagonal vs. full, tied vs. variable). On wholesale data, where the HoReCa segment often has a very different spread from the Retail segment, GMM frequently produces more honest assignments. Prefer it over K-Means when the biplot shows clusters that are elongated or unequal in density - which is likely here given the strong Channel separation on PC2.

2. DBSCAN (Density-Based Spatial Clustering). DBSCAN identifies clusters as dense regions of points separated by sparse regions, and critically, it classifies low-density points as noise/outliers rather than forcing them into a cluster. This is valuable for wholesale data because a handful of very large spenders (e.g., a major supermarket chain) can distort K-Means centroids significantly. In R: dbscan::dbscan() - requires tuning two parameters: eps (neighbourhood radius) and minPts (minimum points to form a core). The kNNdistplot() function helps select eps via a knee in the k-NN distance curve. Limitation: DBSCAN struggles with clusters of varying density (common in spending data), and it does not generalise naturally to new observations - a limitation if you want to score new customers joining the database.

3. K-Medoids / PAM (Partitioning Around Medoids). Like K-Means but uses actual data points as cluster centres (medoids) rather than arithmetic means. This makes it substantially more robust to outliers, since a single extreme spender cannot pull a centroid arbitrarily far. It also works with any distance metric, not just Euclidean. In R: cluster::pam() - also produces a silhouette plot natively. A useful direct comparison: if PAM and K-Means produce very similar cluster assignments, it confirms the K-Means solution is not being distorted by outliers.

4. Agglomerative HAC with Alternative Linkages. The script already uses Ward.D2, but for a robustness check you should formally compare it against Complete and Average linkage and compute the Cophenetic Correlation Coefficient for each - this measures how faithfully the dendrogram preserves the original pairwise distances. The linkage with the highest cophenetic correlation is the most geometrically honest representation of the data’s structure.

Robustness Checks

5. Stability Analysis via Bootstrap (clusterboot). The most important robustness check you can run. fpc::clusterboot() repeatedly resamples the data (with replacement), re-clusters each bootstrap sample, and measures how consistently the same clusters re-emerge using the Jaccard similarity coefficient. A cluster with mean Jaccard > 0.75 is considered stable; below 0.5 it is likely an artefact of the particular sample. If the three clusters survive this test, there is strong evidence that the segmentation is a genuine feature of the population of wholesale customers in Portugal, not just of these 440 observations.

6. Sensitivity to Outliers. Wholesale spend data almost always contains extreme values (e.g., a single very large hotel group). Two targeted checks: first, winsorising the top 1–5% of each spending variable before clustering, then comparing cluster assignments to the unwinsorised solution using the Adjusted Rand Index (ARI) - an ARI close to 1.0 means the solution is insensitive to those extreme points. Second, leave-one-out influence analysis: iteratively remove the top 10 highest-spend customers and re-run K-Means. If cluster memberships for the remaining 430 customers shift substantially, the centroids are being pulled by individual outliers.

7. Sensitivity to the Number of PCs. The script retains PCs by the Kaiser criterion (eigenvalue ≥ 1), which typically gives 3 PCs explaining ~80% of variance. A robustness check is to re-run clustering using (a) 2 PCs only and (b) all 6 standardised variables directly (no dimensionality reduction), and compare the three solutions with ARI. Convergence across all three input representations confirms that the structure is not an artefact of the PCA truncation.

8. Multiple Random Initialisations & Consensus Clustering. K-Means with nstart = 50 already guards against poor local minima, but a stronger check is consensus clustering via the ConsensusClusterPlus package. It runs the algorithm hundreds of times on subsampled data and builds a consensus matrix - an N×N matrix where entry (i,j) records the proportion of runs in which customers i and j ended up in the same cluster. A well-separated block structure in the consensus heatmap is strong visual confirmation that the clusters are stable and not sensitive to initialisation.

9. External Validation Against Channel and Region. Since Channel and Region are known labels, they can be used as external validators without having been used in the clustering itself. Compute the Adjusted Rand Index between cluster assignments and Channel labels, and the Normalised Mutual Information between cluster assignments and Region labels. A high ARI against Channel (expect ~0.5–0.7 on this dataset) confirms the unsupervised algorithm is recovering a structure that aligns with a known business reality - a strong sign of face validity.

Summary Recommendation In priority order for a report:

  • Run clusterboot() - the single most persuasive robustness check for any clustering paper
  • Fit a GMM with Mclust() - the most principled alternative; compare ARI against K-Means
  • Run PAM and compare ARI - confirms outlier robustness
  • Compute cophenetic correlations across linkage methods - validates the HAC choice
  • Test sensitivity to PC retention - shows the structure is not an artefact of dimensionality reduction

If the consensus across all five checks points to three clusters with consistent membership, you have a very defensible segmentation - both statistically and for a business audience.

Claude did not return any R code as part of its response.

B.3.2 Response 3: ChatGPT (GPT-5.3)

Yes - and this is exactly the kind of thing that strengthens your analysis. K-means + PCA is a solid baseline, but for a data-driven segmentation, you should always validate that your clusters are stable, meaningful, and not method-dependent. Here are the most important alternatives and robustness checks, with R code.

1. Alternative Clustering Methods

(A) Hierarchical Clustering (Validation + interpretability) Why:

  • No need to pre-specify k upfront
  • Produces a dendrogram (useful for justification in reports)
  • Helps confirm whether k-means structure is real

R code moved to below this response box

If clusters align well with k-means → strong evidence of robustness

(B) Gaussian Mixture Models (GMM) Why:

  • Allows non-spherical clusters (k-means assumes spherical)
  • Provides probabilistic membership (useful for borderline customers)

R code moved to below this response box

If GMM selects similar k → supports your choice. If different → suggests more complex structure

(C) DBSCAN (Density-based) Why:

  • Detects outliers/noise
  • Does not require specifying k

R code moved to below this response box

Useful to identify unusual customers (important for risk or niche segments)

2. Robustness Checks

(A) Stability Analysis (Resampling) Why:

  • Ensures clusters aren’t driven by random variation

R code moved to below this response box

Values:

  • > 0.85 → very stable
  • 0.6–0.85 → moderately stable
  • < 0.6 → weak clusters

(B) Silhouette Analysis (Quality check)

R code moved to below this response box

Interpretation:

  • > 0.5 → strong structure
  • 0.25–0.5 → reasonable
  • < 0.25 → weak clustering

(C) Compare Clustering on Raw vs PCA Data

R code moved to below this response box

If results are similar → PCA did not distort structure. If very different → investigate further

(D) Sensitivity to k

R code moved to below this response box

Ensures your choice of k is not arbitrary

How to Present This in Your Assignment You can frame it like this:

  • Primary method: K-means on PCA-reduced data
  • Validation methods:
    • Hierarchical clustering → confirms structure
    • GMM → checks distributional assumptions
    • Silhouette + Elbow → supports choice of k
    • Stability analysis → confirms robustness

Strong Conclusion (what markers look for) A high-quality conclusion would say:

  • Clusters are consistent across methods
  • k = 3 (or 4) is supported by multiple criteria
  • Segments are stable and interpretable
  • Results are actionable for business decisions

If you want, I can help you turn this into a top-grade “Methodology + Validation” write-up with proper academic language and justification.

This response included the following R code as part of the LLM’s in-line response. This code assumes that variables from the previous response have been loaded. The new code generated is included below from line 87.

# Load libraries
library(tidyverse)
library(factoextra)
library(cluster)

# Load data
data <- read.csv("Wholesale_customers_data.csv")

# Convert categorical variables
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)

# Select continuous variables
spending_data <- data %>%
  select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)

# Log transform to reduce skewness
spending_log <- log1p(spending_data)

# Scale data
spending_scaled <- scale(spending_log)

# -------------------------
# PCA (for dimensionality reduction)
# -------------------------
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)

# Extract first few principal components (e.g., first 3)
pca_scores <- as.data.frame(pca_result$x[, 1:3])

# -------------------------
# Determine optimal number of clusters
# -------------------------

# Elbow method
fviz_nbclust(pca_scores, kmeans, method = "wss") +
  ggtitle("Elbow Method")

# Silhouette method
fviz_nbclust(pca_scores, kmeans, method = "silhouette") +
  ggtitle("Silhouette Method")

# -------------------------
# Apply K-means clustering
# -------------------------

set.seed(123) # for reproducibility

k <- 3 # Adjust based on elbow/silhouette results

kmeans_result <- kmeans(pca_scores, centers = k, nstart = 25)

# Add cluster labels to original data
data$Cluster <- as.factor(kmeans_result$cluster)

# -------------------------
# Visualisation
# -------------------------

# Cluster plot
fviz_cluster(
  kmeans_result,
  data = pca_scores,
  geom = "point",
  ellipse.type = "convex",
  palette = "jco",
  ggtheme = theme_minimal()
)

# -------------------------
# Cluster profiling
# -------------------------

# Average spending per cluster
cluster_summary <- data %>%
  group_by(Cluster) %>%
  summarise(across(Fresh:Delicassen, mean))

print(cluster_summary)

# Distribution of Channel by cluster
table(data$Cluster, data$Channel)

# Distribution of Region by cluster
table(data$Cluster, data$Region)

# New code below ====

# Compute distance matrix
dist_mat <- dist(pca_scores)

# Hierarchical clustering (Ward's method)
hc <- hclust(dist_mat, method = "ward.D2")

# Plot dendrogram
plot(hc, labels = FALSE, main = "Dendrogram")

# Cut tree into k clusters
k <- 3
hc_clusters <- cutree(hc, k = k)

# Compare with k-means
table(hc_clusters, kmeans_result$cluster)

library(mclust)

gmm_model <- Mclust(pca_scores)

# Best model and number of clusters (BIC-based)
summary(gmm_model)

# Cluster assignments
gmm_clusters <- gmm_model$classification

# Compare with k-means
table(gmm_clusters, kmeans_result$cluster)

library(dbscan)

# kNN distance plot to choose eps
kNNdistplot(pca_scores, k = 5)
abline(h = 1.5, col = "red") # adjust visually

# Run DBSCAN
db <- dbscan(pca_scores, eps = 1.5, minPts = 5)

# Cluster labels
table(db$cluster)

library(fpc)

cluster_stability <- clusterboot(
  pca_scores,
  B = 100,
  clustermethod = kmeansCBI,
  k = 3
)

cluster_stability$bootmean


library(cluster)

sil <- silhouette(kmeans_result$cluster, dist(pca_scores))

# Plot
fviz_silhouette(sil)

# Average silhouette score
mean(sil[, 3])

# K-means on original scaled data
set.seed(123)
kmeans_raw <- kmeans(spending_scaled, centers = 3, nstart = 25)

# Compare assignments
table(kmeans_raw$cluster, kmeans_result$cluster)

# Try multiple k values
k_values <- 2:6

for (k in k_values) {
  km <- kmeans(pca_scores, centers = k, nstart = 25)
  cat("k =", k, " | Tot.withinss =", km$tot.withinss, "\n")
}

The (unedited) outputs generated by running this code, are as follows:

# A tibble: 3 × 7
  Cluster  Fresh   Milk Grocery Frozen Detergents_Paper Delicassen
  <fct>    <dbl>  <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
1 1       11939.  2006.   2502.  3266.             424.       891.
2 2        2997.  7045.  12505.   608.            5511.       797.
3 3       17182. 10658.  13381.  4181.            5004.      2867.
   
      1   2
  1 210   3
  2  32  50
  3  56  89
   
      1   2   3
  1  40  26 147
  2  14  10  58
  3  23  11 111
           
hc_clusters   1   2   3
          1   1   2 119
          2 212  18  19
          3   0  62   7
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VEI (diagonal, equal shape) model with 3 components: 

 log-likelihood   n df       BIC       ICL
      -2013.013 440 16 -4123.414 -4328.873

Clustering table:
  1   2   3 
 62 202 176 
            
gmm_clusters   1   2   3
           1  32  28   2
           2  12  54 136
           3 169   0   7

  0   1 
 16 424 
boot 1 
boot 2 
boot 3 
boot 4 
boot 5 
boot 6 
boot 7 
boot 8 
boot 9 
boot 10 
boot 11 
boot 12 
boot 13 
boot 14 
boot 15 
boot 16 
boot 17 
boot 18 
boot 19 
boot 20 
boot 21 
boot 22 
boot 23 
boot 24 
boot 25 
boot 26 
boot 27 
boot 28 
boot 29 
boot 30 
boot 31 
boot 32 
boot 33 
boot 34 
boot 35 
boot 36 
boot 37 
boot 38 
boot 39 
boot 40 
boot 41 
boot 42 
boot 43 
boot 44 
boot 45 
boot 46 
boot 47 
boot 48 
boot 49 
boot 50 
boot 51 
boot 52 
boot 53 
boot 54 
boot 55 
boot 56 
boot 57 
boot 58 
boot 59 
boot 60 
boot 61 
boot 62 
boot 63 
boot 64 
boot 65 
boot 66 
boot 67 
boot 68 
boot 69 
boot 70 
boot 71 
boot 72 
boot 73 
boot 74 
boot 75 
boot 76 
boot 77 
boot 78 
boot 79 
boot 80 
boot 81 
boot 82 
boot 83 
boot 84 
boot 85 
boot 86 
boot 87 
boot 88 
boot 89 
boot 90 
boot 91 
boot 92 
boot 93 
boot 94 
boot 95 
boot 96 
boot 97 
boot 98 
boot 99 
boot 100 
[1] 0.6982740 0.7161780 0.8350312
  cluster size ave.sil.width
1       1  213          0.39
2       2   82          0.19
3       3  145          0.33
[1] 0.3313364
   
      1   2   3
  1   1  80   0
  2   0   2 145
  3 212   0   0
k = 2  | Tot.withinss = 1365.289 
k = 3  | Tot.withinss = 1075.661 
k = 4  | Tot.withinss = 912.0646 
k = 5  | Tot.withinss = 807.8276 
k = 6  | Tot.withinss = 714.8749 

  • Claude provides a detailed response describing five alternative methods (GMM, DBSCAN, PAM, alternative linkages, consensus clustering) and four robustness checks (bootstrap stability, outlier sensitivity, PC-retention sensitivity, external validation via ARI)
  • The rationale and interpretation threshold of each is stated in the response
  • The recommendation to validate against Channel via ARI (without using Channel in the clustering) mirrors our own approach
  • The priority-ordered summary at the end is practical and well-calibrated, distinguishing essential checks from aspirational ones
  • ChatGPT, unlike Claude, provides executable R code for each method (hierarchical, GMM, DBSCAN, bootstrap stability, silhouette, raw-vs-PCA comparison)
  • ChatGPT’s interpretation thresholds (e.g., Jaccard \(> 0.85\) = very stable) and sensitivity-to-\(k\) loop are practical additions absent from our approach
  • Claude returns text only, but this is likely due to the nature of the prompt.
  • Several of Claude’s suggestions (consensus clustering via ConsensusClusterPlus, leave-one-out influence analysis) are disproportionate for a 440-row dataset and read more as a textbook survey than a targeted recommendation
  • ChatGPT’s code depends on objects from Prompt 2 (pca_scores, kmeans_result, spending_scaled) without documenting this dependency
  • The provided code will note run in isolation,
  • The DBSCAN eps parameter is set to 1.5 with “adjust visually” but no dataset-specific tuning guidance is provided
  • ChatGPT’s “How to Present This in Your Assignment” section reveals the model optimising for marks rather than analytical rigour. Framing advice (“what markers look for”) substitutes presentation strategy for genuine critical evaluation
  • Neither model applied any of these checks unprompted in Prompt 2. Both required an explicit request, suggesting robustness validation is not part of either model’s default analytical workflow

B.4 Prompt 4: Reflective prompt

We used slightly different prompts for each LLM here. The base query was the same, but we pasted in extracts from each LLM’s previous output as part of each prompt.

> Inserting LLM output back into the LLM here

these were your comments and they are correct and apply exactly on the data that i have, so i’m wondering did you somehow analysized the data yourself

B.4.1 Claude (Sonnet 4.6) output used in the prompt

================================================================
          CUSTOMER SEGMENTATION – BUSINESS SUMMARY             
================================================================

  METHOD:  HAC (Ward.D2) to identify structure +
           K-Means (k=3) on PCA scores for final assignment
  INPUT:   log1p-transformed, standardised spend variables
           projected onto retained PCs (Kaiser criterion)

  CLUSTER 1 – 'High-Volume Fresh Buyers' (HoReCa dominant)
    • Highest Fresh and Frozen spend
    • Low Grocery, Detergents_Paper
    → Hotels, restaurants, cafés; perishable-led ordering
    IMPLICATION: Prioritise cold-chain logistics; daily/weekly
    replenishment cycles; dedicated HoReCa sales reps

  CLUSTER 2 – 'Packaged Goods Retailers' (Retail dominant)
    • Highest Grocery, Milk, Detergents_Paper
    • Low Fresh spend
    → Supermarkets, convenience stores; ambient goods focus
    IMPLICATION: Bulk-order pricing tiers; shelf-stable
    inventory buffers; quarterly promotions on detergents

  CLUSTER 3 – 'Small / Mixed Buyers'
    • Below-average spend across all categories
    • Mixed Channel and Region membership
    → Small independent shops, delis, mixed traders
    IMPLICATION: Route-optimised delivery (lower margin);
    upsell programmes; self-service ordering portal

  NOTE: Exact cluster labels (1/2/3) may differ from above
  depending on K-Means initialisation; inspect profile_raw
  and the radar/bar charts to assign the business labels.
================================================================

B.4.2 ChatGPT (GPT-5.3) output used in the prompt

Example Segment Types

  1. High Fresh, Low Grocery Likely restaurants / HORECA
  2. High Grocery & Detergents Likely retail stores
  3. Balanced / low spend Small or low-frequency buyers

B.4.3 Response 4: Claude (Sonnet 4.6)

Did the LLM analyse the actual data? No - the LLM did not have access to the data file. The descriptions it provided were an informed prediction based on two things:

1. This is a well-known public dataset. The UCI Wholesale Customers dataset is one of the most widely used benchmark datasets in clustering tutorials and academic papers. The LLM had seen it discussed extensively in its training data, and therefore knew its typical structure - the HoReCa/Retail split on PC2, the three-cluster solution, and the rough spending profiles each cluster tends to produce.

2. Domain logic for Portuguese wholesale distribution. Even without knowing the specific dataset, the business logic is fairly predictable: restaurants and hotels will dominate Fresh and Frozen spend - that is the nature of their business. Supermarkets and retailers will dominate Grocery, Milk, and Detergents_Paper - ambient shelf goods are their core stock. Any wholesale customer base will contain a long tail of smaller, lower-spend mixed buyers. The cluster descriptions were therefore a combination of prior knowledge about this specific dataset and general domain reasoning about wholesale food distribution.

Implication for interpretation. Since this dataset is publicly known, the “expected” results are well established. What differentiates a strong analysis is not just reproducing the three clusters, but the quality of the methodological justification - why log1p, why Ward.D2, why PCA before clustering, and why the robustness checks matter. That reasoning is what the code and discussion above has been building toward.

B.4.4 Response 4: ChatGPT (GPT-5.3)

Good question-and fair to ask. No, I didn’t actually access or run your specific dataset. I can’t see files on your computer or execute code on them unless you explicitly upload data here. What I did instead is rely on well-known, highly consistent patterns in this exact dataset (Wholesale Customers Data Set, commonly used in clustering/PCA coursework). This dataset has been analyzed many times, and it almost always produces very similar structures because of how the variables behave.

Why my interpretation matched your results There are strong, repeatable patterns in this data:

1. Correlation structure drives clustering. Grocery, Milk, Detergents_Paper → highly correlated → typically forms a “Retail-heavy” cluster. Fresh, Frozen → less correlated with the above → typically forms a “HORECA / fresh-food” cluster.

2. Channel variable reinforces it. Even though clustering is unsupervised: Channel = 2 (Retail) → high Grocery + Detergents. Channel = 1 (HORECA) → higher Fresh/Frozen. So clustering naturally reconstructs this split.

3. Skewed spending distribution. A large group of customers has low spending across categories. This almost always becomes a “small buyers” cluster.

Why this happens (intuitively) Think of your customers as three business types:

  • Retailers - buy packaged goods in bulk → Grocery + Detergents spike
  • Hotels/Restaurants (HORECA) - buy perishable goods → Fresh + Frozen dominate
  • Small clients - low, inconsistent purchasing → low across all

These behavioral differences are strong enough that almost any reasonable clustering method will rediscover them.

Important (for your assignment) Even though your results match expectations, you should not present them as assumed. You should:

  • Show cluster summaries (means per cluster)
  • Compare with Channel distribution
  • Use plots (PCA, boxplots)
  • Support with validation metrics

A stronger way to phrase it in your report Instead of: “Clusters represent retail vs HORECA”

Say something like: “Cluster analysis reveals three distinct customer segments characterized by (i) high Grocery and Detergents spending, (ii) high Fresh and Frozen spending, and (iii) uniformly low spending. Cross-tabulation with the Channel variable shows strong alignment with Retail and HORECA customer types, validating the segmentation.”

Bottom line

  • I didn’t see your data
  • Your results match because the dataset has very stable, well-known structure
  • This is actually a good sign → your pipeline is working correctly

If you want, I can help you go one level deeper-like naming your clusters professionally (e.g., “Bulk Retailers”, “Fresh-Focused HORECA”, “Low-Value Accounts”) and tying them tightly to strategy.

  • This is the response where the outputs from both models align most closely
  • Both models acknowledge that the UCI Wholesale Customers dataset appears extensively in their training data and that their earlier predictions were based on memorised patterns rather than derived analysis
  • Claude’s observation that methodological justification is what differentiates a strong analysis is a useful point
  • Neither model’s outputs for this dataset can be distinguished from pattern-matching against known results
  • Both frame memorisation as a minor point without acknowledging the risk this poses when applied to unseen data
  • ChatGPT’s suggestion to phrase results as discovered (“Cluster analysis reveals”) rather than predicted is problematic and misleading
  • ChatGPT’s reassurance that “your pipeline is working correctly” assumes results match the expected pattern, which could discourage investigation of genuinely unexpected outcomes

C Reproducibility, accessibility & declarations

C.1 Reproducibility & accessibility

An accessible HTML version of this report is available via a public GitHub page:

This report was created in R Markdown. The source code is open source and is available via:

Changes made to this document were tracked using Git and are also available via the same repository (https://github.com/ytterbiu/SMM636-ML-group-coursework-03-g01/commits/main/). Please note that this commit history reflects updates made to the project repository and is not a comprehensive timeline of all contributions by group members.

C.2 Word count

The total word count for the core sections of this report is 997 words (below the 1000 word limit).

The calculation was performed programmatically directly from the source .rmd file. An R script was used to read the raw document lines and strip out non-text elements, including the YAML front matter, R code chunks, sub-headers, markdown formatting (tables and figures), and LaTeX/HTML syntax. The script isolates lines falling within #intro, #qone, #qtwo, and #qthr. It then calculates the final tally using the stringi package.

The code chunk that performs this calculation is executed silently at the beginning of this document, but is reproduced below for transparency:

rmd_lines <- readLines("01SMM363-assessment-03.rmd", warn = FALSE)

# ---- Remove YAML front matter ----
yaml_bounds <- which(rmd_lines == "---")
if (length(yaml_bounds) >= 2) {
  rmd_lines <- rmd_lines[-(yaml_bounds[1]:yaml_bounds[2])]
}

# ---- Remove code chunks ----
in_chunk <- FALSE
keep <- rep(TRUE, length(rmd_lines))
for (i in seq_along(rmd_lines)) {
  if (grepl("^```\\{", rmd_lines[i])) {
    in_chunk <- TRUE
    keep[i] <- FALSE
    next
  }
  if (in_chunk && grepl("^```\\s*$", rmd_lines[i])) {
    in_chunk <- FALSE
    keep[i] <- FALSE
    next
  }
  if (in_chunk) keep[i] <- FALSE
}
rmd_lines <- rmd_lines[keep]

# ---- Identify which lines fall inside counted sections ----
# Sections to include (level-1 headers only)
counted_ids <- c("intro", "qone", "qtwo", "qthr")
in_counted_section <- FALSE
include <- rep(FALSE, length(rmd_lines))

for (i in seq_along(rmd_lines)) {
  line <- rmd_lines[i]
  # Detect any level-1 header (# but not ##)
  if (grepl("^#\\s+[^#]", line)) {
    # Check if this header belongs to a counted section
    in_counted_section <- any(sapply(counted_ids, function(id) {
      # [^}]* allows any characters (like spaces, hyphens, classes) before the 
      # closing }
      # this is required to count the words in the unnumbered introduction
      grepl(paste0("\\{\\s*#", id, "[^}]*\\}"), line)
    }))
    next  # skip the header line itself
  }
  if (in_counted_section) include[i] <- TRUE
}

counted_lines <- rmd_lines[include]

# ---- Remove sub-headers (##, ###, etc.) ----
counted_lines <- counted_lines[!grepl("^#{2,}", counted_lines)]

# ---- Remove table lines (markdown pipes) ----
counted_lines <- counted_lines[!grepl("^\\|", counted_lines)]

# ---- Remove figure captions: ![caption](...) ----
counted_lines <- counted_lines[!grepl("^!\\[", counted_lines)]

# ---- Remove fenced div markers (::: name) ----
counted_lines <- counted_lines[!grepl("^:::", counted_lines)]

# ---- Remove LaTeX commands ----
counted_lines <- counted_lines[!grepl("^\\\\", counted_lines)]

# ---- Remove HTML comments ----
counted_lines <- counted_lines[!grepl("^<!--", counted_lines)]

# ---- Remove inline R references like \@ref() ----
counted_lines <- gsub("\\\\@ref\\([^)]*\\)", "", counted_lines)

# ---- Count ----
word_count <- sum(stringi::stri_count_words(counted_lines), na.rm = TRUE)

message("Word count (sections qone, qtwo, qthr, body text only): ", word_count)

# Write for LaTeX preamble
writeLines(
  paste0("\\newcommand{\\wordcount}{",
         formatC(word_count, big.mark = ","), "}"),
  "wordcount.tex"
)

C.3 Generative AI Declaration

Generative AI tools (Claude Opus 4.6, Sonnet 4.6, ChatGPT GPT-5.3, Google Gemini Pro) were used throughout this project to assist with code generation, debugging, methodological validation, and report structuring.

The final analysis, statistical interpretation, and written content of this report are the work of all group members.


References

[1]
M. Cardoso, “Wholesale customers.” UCI Machine Learning Repository, 2013. doi: 10.24432/C5030X.
[2]
N. Abreu, “Análise do perfil do cliente recheio e desenvolvimento de um sistema promocional,” Mestrado em Marketing, ISCTE-IUL, Lisbon, Portugal, 2011. Available: https://www.iscte-iul.pt/agenda/thesis?thesis_id=2527