This report analyses annual spending data from 440 clients of a Portuguese
wholesale distributor [1], [2], comprising 298
Hotel/Restaurant/Café (HoReCa) and 142 Retail customers across six product
categories: Fresh,
Milk, Grocery,
Frozen, Detergents_Paper, and
Delicassen. Data-driven
customer segmentation informs inventory planning, pricing strategy, and
sales team deployment, yet the existing classification is a simple binary label
(Channel).
This analysis uses Principal Component Analysis (PCA) and clustering to
investigate whether the existing two-category classification adequately captures
the structure of customer spending, or whether the data supports a more nuanced
segmentation.
A log-transformation was applied to normalise the distribution and address the right-skewness observed across all six spending variables (skewness > 2.5 for all - see Appendix, Figures A.2 & A.3). The variables were then standardised to ensure equal contribution to the PCA, preventing categories with naturally larger absolute expenditures from dominating the model.
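This preprocessing step can be sketched in a few lines of R. The values below are illustrative only (not the actual dataset), and the frame name `spending` is hypothetical:

```r
# Illustrative spending values only -- not the actual dataset
spending <- data.frame(
  Fresh            = c(12669, 7057, 6353),
  Milk             = c(9656, 9810, 8808),
  Grocery          = c(7561, 9568, 7684),
  Frozen           = c(214, 1762, 2405),
  Detergents_Paper = c(2674, 3293, 3516),
  Delicassen       = c(1338, 1776, 7844)
)

spending_log    <- log1p(spending)      # log(x + 1): compresses the right tail
spending_scaled <- scale(spending_log)  # standardise: zero mean, unit variance
```

After `scale()`, every column has mean zero and unit variance, so no category dominates the PCA by virtue of its absolute spend level.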
The first two principal components capture 71.3% of the total variance (PC1:
44.1%; PC2: 27.2%; Figure 1.1(a)). The component loadings in
Figure 1.1(b) reveal distinct commercial purchasing patterns.
PC1 acts as the primary “channel axis,” driven by strong positive loadings for
Grocery (0.571),
Detergents_Paper (0.551), and
Milk (0.542),
contrasted with near-zero or negative loadings for
Fresh and Frozen.
PC2 serves as a “perishable volume axis,” characterised by high positive
loadings for Fresh (0.590), Frozen
(0.590), and Delicassen (0.530), capturing the scale of
fresh food purchasing independent of business type.
Figure 1.1(d) confirms this visually, showing clear
separation between HoReCa and Retail customers along PC1. PCA is restricted to
continuous inputs, necessitating the exclusion of Channel
and Region. A Gower-distance MDS analysis incorporating
these variables (Appendix, A.2) confirms that the categorical
labels dominate the distance metric, validating the decision to analyse spending
variables independently.
Figure 1.1: Principal Component Analysis of log-transformed, standardised spending data. (a) Scree plot showing the percentage of variance explained by each component. (b) Loadings of the original variables on the first four principal components. Values exceeding 0.4 in magnitude indicate strong contributions. (c) Biplot overlaying variable loadings (red arrows) on individual observations. (d) Individuals projected onto PC1-PC2, coloured by distribution channel (1 = HoReCa, 2 = Retail) with 95% concentration ellipses.
Figure 2.1: Revenue concentration by product category under three segmentation approaches. Top: the distributor’s existing binary Channel classification. Middle: K-Means (k=3) spending-based clusters. Bottom: Gaussian Mixture Modelling (GMM) (k=5) spending-based clusters. Crucially, the K-Means and GMM segmentations were derived solely from log-transformed, standardised spending data without access to Channel labels. Each bar shows the percentage of total category spend attributable to each segment and the Total column shows each segment’s share of overall revenue.
K-Means clustering was applied to the same log-transformed, standardised spending data used for PCA. Silhouette analysis identifies k=2 as optimal (0.290), which recovers the known Channel distinction and validates the clustering approach. However, the silhouette score for k=3 remains strong (0.259), and the marginal reduction in cluster quality yields a substantially richer segmentation. Beyond k=3, scores plateau around 0.19 (Appendix, Figure A.5), suggesting that additional clusters fragment genuine structure rather than revealing new segments.
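The silhouette sweep over candidate k can be sketched as follows; synthetic exponential data stands in for the real spending matrix, so the scores themselves will differ from those reported above:

```r
library(cluster)  # silhouette()

set.seed(1)
# Synthetic stand-in for the 440 x 6 log-transformed, standardised data
X <- scale(log1p(matrix(rexp(440 * 6, rate = 1e-4), ncol = 6)))
d <- dist(X)

sil_scores <- sapply(2:10, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])  # average silhouette width
})
names(sil_scores) <- paste0("k=", 2:10)
```

`nstart = 25` restarts K-Means from multiple random initialisations, which guards against a poor local optimum distorting the silhouette comparison.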
K-Means was chosen for its interpretability and computational efficiency on this dataset. As robustness checks, both Ward’s hierarchical clustering and Gaussian Mixture Modelling (GMM) were applied to the same input. Ward’s method (k=3) agrees with K-Means on the dominant cluster structure, with disagreement confined to boundary observations between the smaller segments (Appendix A.4). GMM relaxes K-Means’ spherical cluster assumption: the BIC selected a 5-component VVE model (Appendix, Figure A.6), splitting both channels into finer sub-segments. Cross-tabulation against Channel and Region (Appendix, Tables A.1 & A.3) confirms that these sub-segments are spending-driven rather than geographic, with Region distributed roughly proportionally within each cluster.
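The GMM selection step can be sketched with mclust, which searches over component counts and covariance structures (such as VVE) and selects by BIC automatically. Again, synthetic data stands in for the real spending matrix, so the selected model will differ from the 5-component VVE reported above:

```r
library(mclust)  # Mclust(): BIC-based Gaussian mixture model selection

set.seed(1)
# Synthetic stand-in for the log-transformed, standardised spending data
X <- scale(log1p(matrix(rexp(200 * 6, rate = 1e-4), ncol = 6)))

gmm <- Mclust(X, G = 1:6)  # fit 1..6 components across covariance families
gmm$G          # selected number of components
gmm$modelName  # selected covariance structure (e.g. "VVE")
```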
Figure 2.1 presents the commercial impact of each segmentation. Under the existing binary Channel label, HoReCa customers account for 76% of Fresh revenue but only 68% of customers. K-Means (k=3) identifies a “Mixed Basket” segment (81 customers, 18%) with spending patterns that cut across both channels. GMM further distinguishes Heavy Retail from Diversified Retail, and separates High-volume HoReCa from Balanced HoReCa. Figure 2.2 traces how spending flows between these segmentations, revealing that GMM-3 captures high Fresh and Frozen spenders from multiple K-Means clusters.
Figure 2.2: Sankey diagram of spending flows across segmentations. Each stacked bar shows the percentage of total spending broken down by product category (left), Gaussian Mixture Model (GMM) cluster (middle), and K-Means (k=3) cluster (right). The flows on the left describe how each GMM cluster is composed of spending across product categories; the flows on the right show how the same spending is segmented differently by K-Means. Only flows exceeding 3% are labelled to minimise clutter.
The strength of this approach is that it operates entirely without access to Channel or Region labels. The near-perfect recovery of the Channel distinction validates the segmentation as behaviourally grounded, while the emergence of additional clusters reveals sub-structure the existing classification does not capture. However, the commercial value of a 5-segment model over a 3-segment model cannot be determined from a single cross-sectional snapshot. Validation would require longitudinal purchasing data, margin information, and ideally a controlled trial of differentiated service strategies across segments.
Two Large Language Models (LLMs), Claude (Sonnet 4.6) and ChatGPT (GPT-5.3),
were evaluated. To ensure comparability, identical prompts were used for both
models and neither was given the raw CSV file. The dataset was described in
the first prompt and the LLMs were asked to produce R code for independent
execution. This design choice was deliberate: withholding the data replicated a
commercial setting in which an organisation may not wish to upload proprietary
customer data to a third-party platform. It also ensured that neither model
could adapt its recommendations to the observed distributions, testing whether
each model’s default methodological choices were appropriate for this dataset.
Four prompts were used, included in full in Appendix B.
As shown in Table 3.1, both LLMs correctly excluded Channel and Region, recommended log-transformation and standardisation, and converged on k=3, all without access to the data. Neither model interpreted the PCA loadings with specificity. Claude labelled PC1 as a “general spending scale” rather than the channel axis revealed by the actual loadings, while ChatGPT produced plots without interpreting them. Robustness checks and limitations were not mentioned until explicitly prompted. When asked in the final prompt, both models acknowledged that the data is from a well-known public dataset whose typical outcomes appear extensively in their training data, raising the question of whether the LLMs were performing genuine analytical reasoning or pattern-matching against memorised results.
ChatGPT produced more concise, modular code that executed without errors, while Claude generated comprehensive scripts with richer methodological commentary but several execution errors requiring manual debugging. LLMs demonstrate clear value for code generation and methodological structuring, but their inability to verify outputs against actual data distributions (and their tendency to present memorised patterns as derived findings) limits their current utility and analytical value.
| Analysis Step | Our Analysis | ChatGPT (GPT-5.3) | Claude (Sonnet 4.6) |
|---|---|---|---|
| Preprocessing | |||
| Excluded Channel & Region from PCA/clustering | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Identified skewness / recommended log transform | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Standardised variables before PCA | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| PCA | |||
| Produced interpretable scree plot / variance | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Interpreted PC loadings with specificity | \(\checkmark\) | \(\circ\) | |
| Validated PCA against Channel label | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Clustering | |||
| Justified choice of clustering method | \(\checkmark\) | \(\circ\) | \(\checkmark\) |
| Used formal \(k\) selection (silhouette / elbow) | \(\checkmark\) | \(\checkmark\) | \(\circ\) |
| Identified structure beyond \(k{=}2\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Applied robustness checks unprompted | \(\checkmark\) | ||
| Interpretation | |||
| Commercial framing of segments | \(\checkmark\) | \(\circ\) | \(\checkmark\) |
| Acknowledged limitations unprompted | \(\checkmark\) | ||
| Generated code executed without errors | \(\checkmark\) | \(\checkmark\) | |
| Total | 13/13 | 8/13 (2 partial) | 8/13 (2 partial) |
# Assumes `data` holds the full dataset and `spending.cols` names the six
# spending columns, e.g.
# spending.cols <- c("Fresh", "Milk", "Grocery", "Frozen",
#                    "Detergents_Paper", "Delicassen")
library(ggplot2)    # plotting
library(patchwork)  # the `/` operator used to stack p_box over p_bar
box_data <- reshape(
  data[, spending.cols],
direction = "long",
varying = spending.cols,
v.names = "Spend",
timevar = "Category",
times = spending.cols
)
cat_totals <- data.frame(
Category = spending.cols,
Total = colSums(data[, spending.cols]),
Median = apply(data[, spending.cols], 2, median)
)
# option to sort by median vs total;
# BE note: have chosen median here to highlight the difference in median spend
# despite large total gap between bottom two categories
cat_order <- levels(reorder(cat_totals$Category, cat_totals$Median))
# cat_order <- levels(reorder(cat_totals$Category, cat_totals$Total))
box_data$Category <- factor(box_data$Category, levels = cat_order)
cat_totals$Category <- factor(cat_totals$Category, levels = cat_order)
row.names(box_data) <- NULL
p_box <- ggplot(
box_data,
aes(x = Category, y = Spend)
) +
geom_boxplot(fill = "steelblue", alpha = 0.7) +
coord_flip() +
scale_y_continuous(
labels = scales::comma,
limits = c(0, NA),
expand = expansion(mult = c(0, 0.05))
) +
labs(
title = "(a) Spending Distribution per Customer",
y = "Annual Spend (m.u.)",
x = ""
) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
p_bar <- ggplot(cat_totals, aes(x = Category, y = Total)) +
geom_col(fill = "steelblue", alpha = 0.7, width = 0.7) +
geom_text(
aes(label = format(round(Total), big.mark = ",")),
hjust = -0.1,
size = 3.5
) +
coord_flip() +
scale_y_continuous(
labels = scales::comma,
expand = expansion(mult = c(0, 0.15))
) +
labs(
title = "(b) Total Category Spend",
y = "Total Spend (m.u.)",
x = ""
) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
p_box / p_bar
Figure A.1: Customer spending profiles across six wholesale categories. Top: boxplots showing the highly skewed distribution of individual annual spend. Bottom: the total aggregate volume, highlighting Fresh and Grocery as the largest product segments. Both plots are sorted by median spend.
Figure A.1 provides a baseline overview of purchasing
behaviour across the six categories. Median spend is relatively modest in all
categories, but every distribution exhibits a long right tail of high-value
customers. At the aggregate level, Fresh and
Grocery dominate total revenue, while
Delicassen remains a niche segment.
Figure A.2: Shape comparison of raw and log-transformed spending distributions by product category. To facilitate direct comparison of the distribution shapes, all spend values were min-max scaled to a 0–1 range within their respective categories. The raw data (blue) demonstrates a right skew across all categories, influenced by extreme outliers. The log1p transformation (red) effectively compresses this long tail, yielding more symmetric distributions that ensure variance-based methods (PCA) and distance-based methods (clustering) are not dominated by extreme outliers.
This skewness is examined more closely in Figure A.2,
which overlays raw and log-transformed distributions for each category. The raw
spending figures exhibit a strong rightward skew across all six product
categories, indicating that while most customers spend relatively small amounts,
a long tail of high-spending customers dominates the scale. Applying a log
transformation (log1p) successfully ‘pulls in’ these extreme values, centring
the bulk of the data and producing much more symmetric distributions.
library(corrplot)  # correlation heatmaps
# Assumes `spending` (raw) and `spending_log` (log1p-transformed) data frames
par(mfrow = c(1, 2))
soft_palette <- colorRampPalette(c("#D73027", "#F46D43", "#FFFFFF", "#74ADD1", "#4575B4"))(200)
corrplot(
cor(spending),
method = "color",
type = "upper",
col = soft_palette,
addCoef.col = "black",
number.cex = 0.8,
tl.col = "black",
title = "Raw spending",
mar = c(0, 0, 2, 0)
)
corrplot(
cor(spending_log),
method = "color",
type = "upper",
col = soft_palette,
addCoef.col = "black",
number.cex = 0.8,
tl.col = "black",
title = "Log-transformed spending",
mar = c(0, 0, 2, 0)
)
Figure A.3: Pairwise correlation matrices of the six product categories, comparing linear relationships using raw spending data (left) versus log-transformed spending data (right).
The side-by-side correlation plots reveal the underlying purchasing
relationships between product categories before and after adjusting for scale.
Both matrices highlight a distinct cluster of highly correlated spending:
Grocery, Detergents_Paper, and
Milk show strong positive linear relationships,
suggesting they are frequently purchased together in bulk. While the log
transformation preserves this core structure, it slightly softens the most
extreme correlations (for instance, the correlation between
Grocery and Detergents_Paper
drops from 0.92
to 0.80).
This indicates that in the raw data, a few extreme outlier purchases were
artificially inflating the strength of that linear relationship.
Figure A.4: Non-metric MDS using Gower’s distance across all eight variables (six log-transformed continuous spending categories plus Channel and Region). Points are coloured by Channel (left) and Region (right). Unlike PCA, Gower’s distance forces distinct spatial separation based on categorical mismatches, resulting in six artificial clusters corresponding to the cross-tabulation of the two Channels and three Regions.
Figure A.4 presents a non-metric MDS solution computed
from Gower’s distance across all eight variables, incorporating the categorical
Channel and Region columns excluded
from the PCA. Because Gower’s metric treats any categorical mismatch as a
maximal distance contribution on that variable, the resulting spatial map is
dominated by these labels rather than underlying spending behaviour.
Panel (a) shows separation by Channel along the primary
axis (MDS1), while Panel (b) reveals further stratification by
Region along MDS2. The result is a forced stratification
into six spatial clusters reflecting the Channel \(\times\) Region
cross-tabulation. While there is minor regional intermingling (e.g., two Oporto
customers mapping near the Lisbon cluster), the overall spatial map merely
reflects the cross-tabulation of the categorical labels rather than organic,
dataset-wide purchasing patterns. The solution’s stress value of
14.5 further indicates a
mediocre fit, falling above the conventional threshold of 10 for reliable
low-dimensional representation. This confirms that the PCA approach, relying
purely on continuous spending variables, is better suited to uncovering natural
commercial structure in this dataset.
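The Gower-distance MDS described above can be sketched as follows, with synthetic data standing in for the six log-spending columns plus the two categorical labels:

```r
library(cluster)  # daisy(): Gower dissimilarity for mixed data types
library(MASS)     # isoMDS(): non-metric multidimensional scaling

set.seed(1)
# Synthetic stand-in: six continuous columns plus two categorical labels
df <- data.frame(
  matrix(log1p(rexp(100 * 6, rate = 1e-4)), ncol = 6),
  Channel = factor(sample(1:2, 100, replace = TRUE)),
  Region  = factor(sample(1:3, 100, replace = TRUE))
)

gower_d <- daisy(df, metric = "gower")        # mixed-type dissimilarity
mds_fit <- isoMDS(as.matrix(gower_d), k = 2)  # 2-D non-metric solution
mds_fit$stress  # stress (%); values above ~10 suggest a mediocre fit
```

Because `daisy()` gives each categorical mismatch a full unit contribution on that variable, the two factor columns dominate the resulting map, which is the behaviour reported in Figure A.4.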
Figure A.5: Silhouette analysis for K-Means clustering (k=2 to 10). The optimal number of clusters is k=2 followed by k=3 as a strong second. Beyond k=3, scores plateau around 0.19.
Figure A.6: BIC values across component counts for all model types evaluated by mclust. Higher (less negative) BIC indicates better fit. The BIC peaks at 5 components under the VVE (ellipsoidal, variable volume and shape, equal orientation) model.
Silhouette analysis (Figure A.5) identifies k=2 as optimal (0.290), with k=3 as a close second (0.259). Beyond k=3, scores plateau around 0.19.
The GMM’s BIC (Figure A.6) peaks at 5 components (-6,126.0), compared to -6,212.0 for four and -6,186.6 for six. Interestingly, k=4 scores worse than k=3, suggesting the data supports either 3 or 5 components but not an intermediate value. Both methods agree that k=3 is a meaningful threshold, and the disagreement on whether to proceed to k=5 reflects a genuine trade-off between parsimony and granularity.
| Channel | Lisbon | Oporto | Other |
|---|---|---|---|
| KM-1 (HoReCa Dominant) | |||
| 1 | 39 | 25 | 145 |
| 2 | 1 | 1 | 1 |
| KM-2 (Retail Dominant) | |||
| 1 | 13 | 2 | 41 |
| 2 | 10 | 9 | 72 |
| KM-3 (Mixed Basket) | |||
| 1 | 7 | 1 | 25 |
| 2 | 7 | 9 | 32 |
| Channel | Lisbon | Oporto | Other |
|---|---|---|---|
| GMM-1 (Heavy Retail) | |||
| 1 | 1 | 0 | 6 |
| 2 | 7 | 8 | 29 |
| GMM-2 (Diversified Retail) | |||
| 1 | 5 | 0 | 13 |
| 2 | 9 | 10 | 64 |
| GMM-3 (High-volume HoReCa) | |||
| 1 | 34 | 15 | 100 |
| 2 | 1 | 1 | 4 |
| GMM-4 (Balanced HoReCa) | |||
| 1 | 12 | 10 | 55 |
| 2 | 1 | 0 | 3 |
| GMM-5 (Low-spend / Mixed) | |||
| 1 | 7 | 3 | 37 |
| 2 | 0 | 0 | 5 |
Table A.1 shows that K-Means (k=3) produces one
cluster that is almost exclusively HoReCa
(KM-1, “HoReCa Dominant”:
209 of
212 customers), one that leans
Retail (KM-2, “Retail Dominant”:
91 of
147), and a third
(KM-3, “Mixed Basket”) that draws substantially
from both channels
(33 HoReCa,
48 Retail). The “Retail Dominant”
label reflects the spending profile rather than exclusive Channel membership:
these customers concentrate on Grocery and
Detergents_Paper regardless of their administrative
Channel label.
| Cluster | Label | \(n\) | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|
| K-Means (k = 3) | ||||||||
| KM-1 | HoReCa Dominant | 212 | 9,635 | 1,598 | 2,151 | 2,151 | 274 | 686 |
| KM-2 | Retail Dominant | 147 | 12,126 | 7,184 | 9,965 | 2,005 | 3,378 | 2,005 |
| KM-3 | Mixed Basket | 81 | 1,454 | 6,264 | 10,487 | 388 | 4,196 | 436 |
| Gaussian Mixture Model (k = 5) | ||||||||
| GMM-1 | Heavy Retail | 51 | 1,454 | 7,845 | 13,567 | 283 | 6,830 | 716 |
| GMM-2 | Diversified Retail | 101 | 7,842 | 7,243 | 11,238 | 1,389 | 4,621 | 1,625 |
| GMM-3 | High-volume HoReCa | 155 | 14,276 | 1,786 | 2,405 | 3,383 | 325 | 933 |
| GMM-4 | Balanced HoReCa | 81 | 5,969 | 3,218 | 3,133 | 907 | 637 | 656 |
| GMM-5 | Low-spend / Mixed | 52 | 2,270 | 1,572 | 3,712 | 892 | 174 | 284 |
| Cluster | Label | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|
| K-Means (k = 3) | |||||||
| KM-1 | HoReCa Dominant | 58.4 | 9.7 | 13.0 | 13.0 | 1.7 | 4.2 |
| KM-2 | Retail Dominant | 33.1 | 19.6 | 27.2 | 5.5 | 9.2 | 5.5 |
| KM-3 | Mixed Basket | 6.3 | 27.0 | 45.2 | 1.7 | 18.1 | 1.9 |
| Gaussian Mixture Model (k = 5) | |||||||
| GMM-1 | Heavy Retail | 4.7 | 25.6 | 44.2 | 0.9 | 22.3 | 2.3 |
| GMM-2 | Diversified Retail | 23.1 | 21.3 | 33.1 | 4.1 | 13.6 | 4.8 |
| GMM-3 | High-volume HoReCa | 61.8 | 7.7 | 10.4 | 14.6 | 1.4 | 4.0 |
| GMM-4 | Balanced HoReCa | 41.1 | 22.2 | 21.6 | 6.2 | 4.4 | 4.5 |
| GMM-5 | Low-spend / Mixed | 25.5 | 17.7 | 41.7 | 10.0 | 2.0 | 3.2 |
| Cluster | Label | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|
| K-Means (k = 3) | |||||||
| KM-1 | HoReCa Dominant | 48.2 | 16.6 | 15.1 | 51.4 | 7.1 | 28.2 |
| KM-2 | Retail Dominant | 47.4 | 60.9 | 56.0 | 44.9 | 57.8 | 62.4 |
| KM-3 | Mixed Basket | 4.4 | 22.5 | 28.8 | 3.6 | 35.1 | 9.4 |
| Gaussian Mixture Model (k = 5) | |||||||
| GMM-1 | Heavy Retail | 3.1 | 18.0 | 24.0 | 2.1 | 31.5 | 9.0 |
| GMM-2 | Diversified Retail | 22.2 | 42.1 | 43.4 | 15.0 | 52.9 | 31.5 |
| GMM-3 | High-volume HoReCa | 51.9 | 15.7 | 13.9 | 58.6 | 5.7 | 31.7 |
| GMM-4 | Balanced HoReCa | 13.2 | 11.5 | 9.3 | 10.5 | 5.1 | 12.4 |
| GMM-5 | Low-spend / Mixed | 9.6 | 12.8 | 9.4 | 13.8 | 4.7 | 15.4 |
Tables A.5, A.7, and A.9 present the cluster spending profiles from three perspectives.
The K-Means “HoReCa Dominant” cluster
(KM-1) allocates 58.4% of
its median basket
(Table A.7) to Fresh and accounts
for 48.2% of total Fresh revenue
(Table A.9),
while “Retail Dominant” (KM-2) concentrates on
Grocery
(27.2%) and
Detergents_Paper
(9.2%) but comprises only
147 customers
(33%
of the base). At the GMM level, the distinction sharpens further: the median
GMM-3 customer (“High-volume HoReCa”) dedicates 61.8%
of its basket
to Fresh and, despite representing
35% of customers, captures
51.9% of total Fresh revenue. The Retail split is
equally informative: the typical GMM-1 customer (“Heavy Retail”)
concentrates on Grocery (44.2%)
and Detergents_Paper
(22.3%), while the typical GMM-2 customer
(“Diversified Retail”) spreads spend more evenly across categories including
Fresh (23.1%), suggesting a
convenience-store or diversified retailer profile.
Figure A.7: Cluster assignments from K-Means (k=3, top) and GMM (k=5, bottom) projected onto the first two principal components, which together capture 71.3% of total variance. All clustering was performed in the full six-dimensional standardised log-spending space. Apparent overlap between clusters may reflect the loss of separating information in the remaining 28.7% of variance captured by PC3–PC6, rather than genuine ambiguity in cluster assignment.
Figure A.7 projects the K-Means and GMM cluster assignments onto PCA space. Both panels show that the dominant cluster boundaries divide the data along PC1 (the “channel axis” identified in Section 1), confirming that the clustering and PCA analyses converge on the same dominant structure. The GMM panel shows finer subdivision, particularly along PC2, consistent with GMM’s ability to model ellipsoidal clusters of varying density.
| | Ward 1 | Ward 2 | Ward 3 |
|---|---|---|---|
| K-Means 1 | 0 | 201 | 11 |
| K-Means 2 | 91 | 50 | 6 |
| K-Means 3 | 34 | 11 | 36 |
Table A.11 cross-tabulates K-Means (k=3) and Ward’s hierarchical clustering (k=3) assignments. The largest cluster shows near-perfect agreement (201 of 212 observations). Disagreement is concentrated among boundary customers between the two smaller segments, reflecting the different optimisation strategies. K-Means minimises within-cluster sum of squares globally, while Ward’s method does so greedily through successive agglomeration.
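The agreement check behind Table A.11 can be sketched as follows, with a synthetic stand-in for the spending matrix (the counts will therefore differ from the table above):

```r
set.seed(1)
# Synthetic stand-in for the 440 x 6 log-transformed, standardised data
X <- scale(log1p(matrix(rexp(440 * 6, rate = 1e-4), ncol = 6)))

km_cl   <- kmeans(X, centers = 3, nstart = 25)$cluster            # partitional
ward_cl <- cutree(hclust(dist(X), method = "ward.D2"), k = 3)     # agglomerative

table(KMeans = km_cl, Ward = ward_cl)  # cross-tabulation of assignments
```

`ward.D2` is the hclust method matching Ward's original criterion on squared Euclidean distances, which makes it the natural comparator for K-Means.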
A consistent visual format is used to distinguish between prompts sent to the LLM, its responses, and our commentary.
Prompts to the LLM - shown with a right-hand border. These are the instructions and questions we provided to the model.
LLM Response - shown with a left-hand border, forming a conversational flow with the prompt. Responses may include a mix of explanation and code:
Our group insight and critical evaluation - observations or key findings drawn from the output (i.e., what we found).
Caution: points requiring verification - potential concerns, limitations, or caveats with the generated output (i.e., what we question).
The R code generated by each LLM was executed programmatically within this
document using a sandboxed runtime environment. Each script was run in an
isolated R subprocess via the callr package, ensuring that no libraries,
variables, or side effects from one script could contaminate the main document
environment or subsequent scripts. Console output and diagnostic messages were
captured and reproduced verbatim, while graphical output was intercepted via a
PNG device and embedded automatically. Any runtime errors were caught and
reported inline rather than halting the document compilation.
This approach was used to ensure that the displayed output faithfully reflected what each LLM-generated script produced when executed from a clean R session.
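The isolation mechanism can be sketched with callr; the function and arguments below are illustrative, not the actual harness used for the LLM scripts:

```r
library(callr)

# Evaluate a function in a fresh R subprocess: no libraries, globals, or
# options leak between the parent session and the child process
out <- callr::r(function(x) x^2, args = list(x = 9))
out  # 81, computed in the isolated subprocess
```

Errors raised in the subprocess surface as catchable conditions in the parent session, which is what allows runtime failures to be reported inline rather than halting document compilation.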
The full chat history is available via:
Role : You are a data scientist working in a company that does wholesale distribution to clients located in Portugal.
task : Your company is looking to implement a Data-driven customer segmentation to inform inventory planning, pricing strategy, and sales team deployment.
I have a dataset called Wholesale_customers_data.csv located at
.\Wholesale_customers_data.csv. It contains 440 rows and 8 columns: Channel
(1=Hotel/Restaurant/Café, 2=Retail), Region (1=Lisbon, 2=Oporto, 3=Other), and
six continuous spending variables (Fresh, Milk, Grocery, Frozen,
Detergents_Paper, Delicassen) representing annual spending in monetary units.
Please write R code to perform PCA on this dataset. Include any preprocessing
you think is necessary.
> Check for relevant skills
No file uploaded, but I have a clear description of the dataset. Let me write comprehensive R code for PCA on this wholesale customers dataset.
> Created a file, read a file
Here’s a walkthrough of every decision made in the script:
Pre-processing Rationale
Why separate Channel & Region from the PCA? They are categorical identifiers, not spending measurements. Mixing measurement scales would corrupt the covariance structure. They are kept as labels to colour the biplots afterwards, which is far more analytically useful.
Why log1p-transform before PCA? Wholesale spend data is almost always severely right-skewed (a few very large clients dominate). PCA is sensitive to scale and variance, so without this step a single high-spend outlier can hijack the first principal component. log1p (log(x+1)) compresses the long tail and handles any zero values gracefully.
Why scale. = TRUE inside prcomp? Even after log-transformation the six variables have different magnitudes (e.g., Fresh spending is typically much larger than Delicassen). Standardising to unit variance ensures each variable contributes equally to the decomposition.
Why run KMO & Bartlett’s tests? These confirm that the correlation structure is strong enough to make PCA meaningful. Bartlett’s test (H₀: correlation matrix = identity) should be rejected (p < 0.05), and KMO > 0.6 indicates adequate sampling adequacy.
What the Script Produces
- prcomp PCA object, full rotation matrix & summary
- wholesale_pca_scores.csv
What to Expect Analytically
Based on known patterns in this dataset:
Together, ~2–3 PCs typically explain 75–85% of variance, making the PC scores an excellent lower-dimensional input for your subsequent K-Means or Hierarchical clustering step.
This response included the following R code (as a separate downloadable file):
# =============================================================================
# PCA Analysis: Wholesale Customers Dataset
# Goal: Customer segmentation to inform inventory planning, pricing strategy,
# and sales team deployment
# =============================================================================
# --- 0. Install & Load Libraries ---------------------------------------------
required_packages <- c(
"tidyverse",
"corrplot",
"factoextra",
"FactoMineR",
"gridExtra",
"scales",
"ggrepel",
"psych"
)
installed <- rownames(installed.packages())
to_install <- required_packages[!required_packages %in% installed]
if (length(to_install) > 0) {
install.packages(to_install, dependencies = TRUE)
}
library(tidyverse)
library(corrplot)
library(factoextra)
library(FactoMineR)
library(gridExtra)
library(scales)
library(ggrepel)
library(psych)
# --- 1. Load Data ------------------------------------------------------------
data_path <- "Wholesale_customers_data.csv"
df_raw <- read.csv(data_path, stringsAsFactors = FALSE)
cat("=== Dataset Overview ===\n")
cat("Dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n\n")
print(head(df_raw))
cat("\n")
print(summary(df_raw))
# --- 2. Pre-processing -------------------------------------------------------
# 2.1 Recode categorical variables as factors (kept aside from PCA)
df <- df_raw %>%
mutate(
Channel = factor(Channel, levels = c(1, 2), labels = c("HoReCa", "Retail")),
Region = factor(
Region,
levels = c(1, 2, 3),
labels = c("Lisbon", "Oporto", "Other")
)
)
# 2.2 Isolate the six continuous spending variables for PCA
spend_vars <- c(
"Fresh",
"Milk",
"Grocery",
"Frozen",
"Detergents_Paper",
"Delicassen"
)
df_spend <- df[, spend_vars]
# 2.3 Check for missing values
cat("\n=== Missing Values per Column ===\n")
print(colSums(is.na(df_spend)))
# 2.4 Check for outliers via boxplots (raw scale)
par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
for (v in spend_vars) {
boxplot(
df_spend[[v]],
main = v,
ylab = "Annual Spend (m.u.)",
col = "#4E79A7",
border = "#2C4770",
outline = TRUE
)
}
mtext(
"Raw Spending Distributions – Outlier Inspection",
side = 3,
line = -2,
outer = TRUE,
cex = 1.1,
font = 2
)
par(mfrow = c(1, 1))
# 2.5 Assess skewness – highly skewed data warrants log transformation
cat("\n=== Skewness of Raw Spending Variables ===\n")
skew_raw <- apply(df_spend, 2, function(x) psych::skew(x))
print(round(skew_raw, 3))
# 2.6 Log-transform to reduce right skew (common for monetary spend data)
# Adding 1 to guard against any zero values
df_log <- log1p(df_spend)
cat("\n=== Skewness After log1p Transformation ===\n")
skew_log <- apply(df_log, 2, function(x) psych::skew(x))
print(round(skew_log, 3))
# 2.7 Visualise distributions: raw vs log-transformed
p_list <- list()
for (v in spend_vars) {
p_raw <- ggplot(df_spend, aes_string(x = v)) +
geom_histogram(fill = "#4E79A7", colour = "white", bins = 30) +
labs(title = paste(v, "(raw)"), x = NULL, y = "Count") +
theme_minimal(base_size = 9)
p_log <- ggplot(df_log, aes_string(x = v)) +
geom_histogram(fill = "#F28E2B", colour = "white", bins = 30) +
labs(title = paste(v, "(log1p)"), x = NULL, y = "Count") +
theme_minimal(base_size = 9)
p_list <- c(p_list, list(p_raw, p_log))
}
grid.arrange(
grobs = p_list,
ncol = 4,
top = "Distribution: Raw vs log1p-transformed Spending"
)
# 2.8 Correlation matrix on log-transformed data
cat("\n=== Pearson Correlation Matrix (log-transformed) ===\n")
cor_mat <- cor(df_log, method = "pearson")
print(round(cor_mat, 3))
corrplot(
cor_mat,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 0.75,
tl.col = "black",
tl.srt = 45,
col = colorRampPalette(c("#2166AC", "white", "#B2182B"))(200),
title = "Correlation Matrix – log1p Spending Variables",
mar = c(0, 0, 2, 0)
)
# 2.9 KMO & Bartlett's test – check suitability for PCA
cat("\n=== KMO Measure of Sampling Adequacy ===\n")
kmo_result <- KMO(cor_mat)
print(kmo_result)
cat("\n=== Bartlett's Test of Sphericity ===\n")
bart_result <- cortest.bartlett(cor_mat, n = nrow(df_log))
print(bart_result)
# --- 3. Principal Component Analysis ----------------------------------------
# prcomp with scale. = TRUE standardises the log-transformed variables
# (zero mean, unit variance) – essential so no single variable dominates
pca_result <- prcomp(df_log, center = TRUE, scale. = TRUE)
cat("\n=== PCA Summary ===\n")
print(summary(pca_result))
cat("\n=== Loadings (Rotation Matrix) ===\n")
print(round(pca_result$rotation, 4))
# --- 4. Variance Explained ---------------------------------------------------
# 4.1 Eigenvalues
eig_vals <- pca_result$sdev^2
prop_var <- eig_vals / sum(eig_vals)
cum_var <- cumsum(prop_var)
variance_table <- data.frame(
PC = paste0("PC", seq_along(eig_vals)),
Eigenvalue = round(eig_vals, 4),
Proportion_Var = round(prop_var, 4),
Cumulative_Var = round(cum_var, 4)
)
cat("\n=== Variance Explained Table ===\n")
print(variance_table)
# 4.2 Scree plot with Kaiser criterion line (eigenvalue = 1)
fviz_eig(
pca_result,
addlabels = TRUE,
ylim = c(0, 55),
barfill = "#4E79A7",
barcolor = "#2C4770",
linecolor = "#E15759",
ggtheme = theme_minimal()
) +
geom_hline(
yintercept = 100 / length(spend_vars), # Kaiser: eigenvalue = 1
linetype = "dashed",
colour = "red",
linewidth = 0.8
) +
annotate(
"text",
x = 5.5,
y = 100 / length(spend_vars) + 1.5,
label = "Kaiser criterion (eigenvalue = 1)",
colour = "red",
size = 3.5
) +
labs(
title = "Scree Plot – Wholesale Customers PCA",
subtitle = "Dashed line = Kaiser criterion"
)
# 4.3 Cumulative variance plot
ggplot(variance_table, aes(x = PC, y = Cumulative_Var * 100, group = 1)) +
geom_line(colour = "#E15759", linewidth = 1.2) +
geom_point(size = 3, colour = "#E15759") +
geom_hline(yintercept = 80, linetype = "dashed", colour = "gray40") +
geom_hline(yintercept = 90, linetype = "dotted", colour = "gray40") +
annotate(
"text",
x = 5.6,
y = 81,
label = "80%",
size = 3.5,
colour = "gray30"
) +
annotate(
"text",
x = 5.6,
y = 91,
label = "90%",
size = 3.5,
colour = "gray30"
) +
scale_y_continuous(limits = c(0, 100), labels = function(x) paste0(x, "%")) +
labs(
title = "Cumulative Variance Explained by PCs",
subtitle = "Dashed = 80%, Dotted = 90% thresholds",
x = "Principal Component",
y = "Cumulative Variance (%)"
) +
theme_minimal(base_size = 12)
# --- 5. Visualise Loadings ---------------------------------------------------
# 5.1 Loading heatmap for selected PCs
n_pcs <- sum(eig_vals >= 1) # Kaiser rule
loadings_df <- as.data.frame(pca_result$rotation[, 1:n_pcs]) %>%
rownames_to_column("Variable") %>%
pivot_longer(-Variable, names_to = "PC", values_to = "Loading")
ggplot(loadings_df, aes(x = PC, y = Variable, fill = Loading)) +
geom_tile(colour = "white", linewidth = 0.5) +
geom_text(aes(label = round(Loading, 2)), size = 3.5, colour = "black") +
scale_fill_gradient2(
low = "#2166AC",
mid = "white",
high = "#B2182B",
midpoint = 0,
limits = c(-1, 1)
) +
labs(
title = "PCA Loading Heatmap",
subtitle = paste("First", n_pcs, "PCs (Kaiser criterion: eigenvalue ≥ 1)"),
x = NULL,
y = NULL,
fill = "Loading"
) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(face = "bold"))
# 5.2 Variable contribution plot (cos²)
fviz_cos2(
pca_result,
choice = "var",
axes = 1:2,
fill = "#4E79A7",
color = "#2C4770",
ggtheme = theme_minimal()
) +
labs(
title = "Variable cos² on PC1 & PC2",
subtitle = "Quality of representation on the first two components"
)
# --- 6. Biplots & Score Plots ------------------------------------------------
# 6.1 Biplot – coloured by Channel
scores_df <- as.data.frame(pca_result$x) %>%
mutate(Channel = df$Channel, Region = df$Region)
# Build arrow data for loadings overlay
arrows_df <- as.data.frame(pca_result$rotation[, 1:2]) %>%
rownames_to_column("Variable")
scale_factor <- 3.5 # scale arrows to fit score space
ggplot(scores_df, aes(x = PC1, y = PC2, colour = Channel, shape = Channel)) +
geom_point(alpha = 0.7, size = 2.5) +
geom_segment(
data = arrows_df,
aes(x = 0, y = 0, xend = PC1 * scale_factor, yend = PC2 * scale_factor),
arrow = arrow(length = unit(0.25, "cm")),
colour = "black",
linewidth = 0.7,
inherit.aes = FALSE
) +
geom_label_repel(
data = arrows_df,
aes(x = PC1 * scale_factor, y = PC2 * scale_factor, label = Variable),
size = 3.5,
colour = "black",
box.padding = 0.4,
inherit.aes = FALSE
) +
scale_colour_manual(values = c("HoReCa" = "#E15759", "Retail" = "#4E79A7")) +
labs(
title = "PCA Biplot – PC1 vs PC2",
subtitle = "Coloured by Channel (HoReCa vs Retail)",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
colour = "Channel",
shape = "Channel"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
# 6.2 Biplot – coloured by Region
ggplot(scores_df, aes(x = PC1, y = PC2, colour = Region, shape = Region)) +
geom_point(alpha = 0.7, size = 2.5) +
geom_segment(
data = arrows_df,
aes(x = 0, y = 0, xend = PC1 * scale_factor, yend = PC2 * scale_factor),
arrow = arrow(length = unit(0.25, "cm")),
colour = "black",
linewidth = 0.7,
inherit.aes = FALSE
) +
geom_label_repel(
data = arrows_df,
aes(x = PC1 * scale_factor, y = PC2 * scale_factor, label = Variable),
size = 3.5,
colour = "black",
box.padding = 0.4,
inherit.aes = FALSE
) +
scale_colour_manual(
values = c("Lisbon" = "#59A14F", "Oporto" = "#F28E2B", "Other" = "#B07AA1")
) +
labs(
title = "PCA Biplot – PC1 vs PC2",
subtitle = "Coloured by Region",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
colour = "Region",
shape = "Region"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
# 6.3 PC1 vs PC3 (if relevant)
if (n_pcs >= 3) {
p_pc13 <- ggplot(scores_df, aes(x = PC1, y = PC3, colour = Channel, shape = Channel)) +
geom_point(alpha = 0.7, size = 2.5) +
scale_colour_manual(
values = c("HoReCa" = "#E15759", "Retail" = "#4E79A7")
) +
labs(
title = "PCA Score Plot – PC1 vs PC3",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
y = paste0("PC3 (", round(prop_var[3] * 100, 1), "%)")
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
# Explicit print: a ggplot built inside an if block is not auto-printed when sourced
print(p_pc13)
}
# --- 7. Extract PC Scores for Downstream Use ---------------------------------
# Attach PC scores back to full dataset
df_with_scores <- bind_cols(df, as.data.frame(pca_result$x))
cat("\n=== PC Scores – First 6 Rows ===\n")
print(head(df_with_scores[, c("Channel", "Region", "PC1", "PC2", "PC3")]))
# Save enriched dataset
output_path <- "wholesale_pca_scores.csv"
write.csv(df_with_scores, output_path, row.names = FALSE)
cat("\nPC scores saved to:", output_path, "\n")
# --- 8. Interpretation Summary -----------------------------------------------
cat("\n")
cat("========================================================\n")
cat(" PCA INTERPRETATION SUMMARY \n")
cat("========================================================\n")
cat(sprintf(" Components retained (Kaiser, eigenvalue ≥ 1): %d\n", n_pcs))
cat(sprintf(
" Variance explained by retained PCs: %.1f%%\n",
cum_var[n_pcs] * 100
))
cat("\n PC1 – 'Overall Spend Scale'\n")
cat(" Likely captures total spending power / business size.\n")
cat(" High positive loadings across most categories.\n")
cat("\n PC2 – 'Fresh vs Packaged Goods Axis'\n")
cat(" Separates Fresh-heavy buyers (HoReCa) from\n")
cat(" Grocery/Detergent-heavy buyers (Retail).\n")
cat("\n PC3 – 'Speciality/Frozen Dimension' (if retained)\n")
cat(" May differentiate niche buyers (Delicassen, Frozen).\n")
cat("\n Business Implication:\n")
cat(" PC scores can directly feed K-Means / Hierarchical\n")
cat(" clustering for actionable customer segmentation.\n")
cat("========================================================\n")

The (unedited) outputs generated by running this code are as follows:
=== Dataset Overview ===
Dimensions: 440 rows x 8 columns
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
Channel Region Fresh Milk
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
Median :1.000 Median :3.000 Median : 8504 Median : 3627
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
Grocery Frozen Detergents_Paper Delicassen
Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
=== Missing Values per Column ===
Fresh Milk Grocery Frozen
0 0 0 0
Detergents_Paper Delicassen
0 0
=== Skewness of Raw Spending Variables ===
Fresh Milk Grocery Frozen
2.544 4.026 3.563 5.868
Detergents_Paper Delicassen
3.607 11.076
=== Skewness After log1p Transformation ===
Fresh Milk Grocery Frozen
-1.565 -0.223 -0.670 -0.350
Detergents_Paper Delicassen
-0.234 -1.084
=== Pearson Correlation Matrix (log-transformed) ===
Fresh Milk Grocery Frozen Detergents_Paper Delicassen
Fresh 1.000 -0.021 -0.133 0.386 -0.159 0.256
Milk -0.021 1.000 0.761 -0.055 0.679 0.342
Grocery -0.133 0.761 1.000 -0.165 0.797 0.240
Frozen 0.386 -0.055 -0.165 1.000 -0.213 0.256
Detergents_Paper -0.159 0.679 0.797 -0.213 1.000 0.168
Delicassen 0.256 0.342 0.240 0.256 0.168 1.000
=== KMO Measure of Sampling Adequacy ===
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor_mat)
Overall MSA = 0.73
MSA for each item =
Fresh Milk Grocery Frozen
0.65 0.78 0.70 0.66
Detergents_Paper Delicassen
0.76 0.71
=== Bartlett's Test of Sphericity ===
$chisq
[1] 1050.819
$p.value
[1] 1.714465e-214
$df
[1] 15
=== PCA Summary ===
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1.6262 1.2774 0.8012 0.7786 0.54090 0.42774
Proportion of Variance 0.4408 0.2720 0.1070 0.1010 0.04876 0.03049
Cumulative Proportion 0.4408 0.7127 0.8197 0.9207 0.96951 1.00000
=== Loadings (Rotation Matrix) ===
PC1 PC2 PC3 PC4 PC5 PC6
Fresh -0.1052 0.5904 -0.6396 0.4786 -0.0403 -0.0259
Milk 0.5423 0.1329 -0.0736 -0.0618 0.7600 0.3185
Grocery 0.5715 -0.0058 -0.1333 -0.0965 -0.0929 -0.7985
Frozen -0.1379 0.5898 -0.0212 -0.7920 -0.0732 0.0049
Detergents_Paper 0.5506 -0.0708 -0.1995 -0.0828 -0.6204 0.5101
Delicassen 0.2145 0.5300 0.7263 0.3516 -0.1481 0.0031
=== Variance Explained Table ===
PC Eigenvalue Proportion_Var Cumulative_Var
1 PC1 2.6447 0.4408 0.4408
2 PC2 1.6317 0.2719 0.7127
3 PC3 0.6419 0.1070 0.8197
4 PC4 0.6062 0.1010 0.9207
5 PC5 0.2926 0.0488 0.9695
6 PC6 0.1830 0.0305 1.0000
=== PC Scores – First 6 Rows ===
Channel Region PC1 PC2 PC3
1 Retail Other 1.3788133 -0.3043804 -0.24078762
2 Retail Other 1.4372336 0.5368571 0.08413653
3 Retail Other 1.5117317 1.2311631 0.98418563
4 HoReCa Other -0.8241672 1.2094129 0.25185392
5 Retail Other 0.8019842 1.7640380 0.31234246
6 Retail Other 0.8851579 0.1297465 0.01892823
PC scores saved to: wholesale_pca_scores.csv
========================================================
PCA INTERPRETATION SUMMARY
========================================================
Components retained (Kaiser, eigenvalue ≥ 1): 2
Variance explained by retained PCs: 71.3%
PC1 – 'Overall Spend Scale'
Likely captures total spending power / business size.
High positive loadings across most categories.
PC2 – 'Fresh vs Packaged Goods Axis'
Separates Fresh-heavy buyers (HoReCa) from
Grocery/Detergent-heavy buyers (Retail).
PC3 – 'Speciality/Frozen Dimension' (if retained)
May differentiate niche buyers (Delicassen, Frozen).
Business Implication:
PC scores can directly feed K-Means / Hierarchical
clustering for actionable customer segmentation.
========================================================
Here’s a clean, reproducible R workflow to perform PCA (Principal Component Analysis) on your wholesale dataset, including appropriate preprocessing steps.
Key considerations before PCA
- Categorical variables (Channel, Region) should NOT be used directly in PCA (since PCA is for continuous variables).

[R code for PCA moved to below this response box]
What this gives you
Next Steps (for your segmentation task)
After PCA, you can:
- Apply clustering algorithms (kmeans, hclust)

If you want, I can extend this into a full segmentation pipeline (PCA + clustering + interpretation) tailored to your business objectives.
This response included the following R code as part of the LLM’s in-line response:
# Load required libraries
library(tidyverse)
library(factoextra)
# Load the dataset
data <- read.csv("Wholesale_customers_data.csv")
# Inspect structure
str(data)
summary(data)
# Convert categorical variables to factors
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)
# Select only continuous variables for PCA
spending_data <- data %>%
select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
# Optional: Log transformation to reduce skewness
spending_log <- log1p(spending_data) # log(1 + x) handles zeros
# Scale the data (important for PCA)
spending_scaled <- scale(spending_log)
# Perform PCA
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)
# View summary of PCA (variance explained)
summary(pca_result)
# Eigenvalues (variance explained by each component)
eig_vals <- get_eigenvalue(pca_result)
print(eig_vals)
# Scree plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 60))
# Biplot (individuals + variables)
fviz_pca_biplot(pca_result, repel = TRUE, col.var = "blue", col.ind = "gray")
# PCA variable contributions
fviz_pca_var(
pca_result,
col.var = "contrib",
gradient.cols = c("blue", "yellow", "red")
)
# PCA individuals colored by Channel
fviz_pca_ind(
pca_result,
geom = "point",
col.ind = data$Channel,
palette = c("#00AFBB", "#E7B800"),
addEllipses = TRUE,
legend.title = "Channel"
)

The (unedited) outputs generated by running this code are as follows:
'data.frame': 440 obs. of 8 variables:
$ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
$ Region : int 3 3 3 3 3 3 3 3 3 3 ...
$ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
$ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
$ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
$ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
$ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
$ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Channel Region Fresh Milk
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
Median :1.000 Median :3.000 Median : 8504 Median : 3627
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
Grocery Frozen Detergents_Paper Delicassen
Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1.6262 1.2774 0.8012 0.7786 0.54090 0.42774
Proportion of Variance 0.4408 0.2720 0.1070 0.1010 0.04876 0.03049
Cumulative Proportion 0.4408 0.7127 0.8197 0.9207 0.96951 1.00000
eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.6446651 44.077752 44.07775
Dim.2 1.6316950 27.194917 71.27267
Dim.3 0.6419057 10.698428 81.97110
Dim.4 0.6061972 10.103287 92.07438
Dim.5 0.2925761 4.876269 96.95065
Dim.6 0.1829608 3.049347 100.00000
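One point in the code above is worth verifying: it standardises twice, calling scale() on the log data and then prcomp() with scale. = TRUE. This is harmless because standardising already-standardised data is a no-op. A minimal check (mtcars used purely as a stand-in dataset):

```r
# Double standardisation is harmless: scaling already-scaled data changes
# nothing, so both pipelines yield identical principal components.
# mtcars is a stand-in dataset here, not the wholesale data.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca_once  <- prcomp(x,        center = TRUE, scale. = TRUE)  # scale once
pca_twice <- prcomp(scale(x), center = TRUE, scale. = TRUE)  # scale twice
all.equal(pca_once$sdev, pca_twice$sdev)  # TRUE
```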
- The code applies log1p, and standardises before PCA (matching our approach)
- The data are standardised twice (scale(), then prcomp(..., scale. = TRUE)), which is harmless but suggests the model is not tracking what each preprocessing step does

Now write R code to perform clustering on this dataset. Explain your choice of method and number of clusters.
> Check the PCA R script for context on variable names and preprocessing steps used
Good - I have full context from the PCA script. Now I’ll write a comprehensive, well-structured clustering script that picks up exactly where the PCA left off.
> Created a file, read a file
Here is a full explanation of every methodological decision made in the script.
Method Choice: Why a Two-Stage Approach?
Stage 1 - Hierarchical Agglomerative Clustering (HAC) with Ward.D2. HAC is used first and exploratorily because it requires no pre-specified k. The dendrogram shows the full merge history of the data, and natural groupings can be identified by looking for the largest vertical gaps (the longest branches before clusters merge).
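The dendrogram-gap idea can be sketched in a few lines. This is a toy illustration on the built-in iris data (a stand-in, not the wholesale dataset):

```r
# Sketch: read candidate k off the dendrogram merge heights.
# iris is used purely as a stand-in dataset.
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")
h  <- sort(hc$height, decreasing = TRUE)  # merge heights, largest first
drops <- h[-length(h)] - h[-1]            # vertical gap between successive merges
head(round(drops, 2), 4)                  # a large early gap suggests a natural cut
labels_k3 <- cutree(hc, k = 3)            # hard labels at the chosen cut
table(labels_k3)
```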
Why Ward.D2 specifically? Ward’s linkage merges the two clusters that produce the smallest increase in total within-cluster sum of squares - it is inherently trying to form compact, internally coherent groups. The other common options have practical problems here: single linkage suffers from chaining (long thin clusters), complete linkage is sensitive to outliers (which wholesale spend data has), and average linkage is a compromise but rarely as clean. The agglomerative coefficient printed by the script confirms this - Ward.D2 will produce a value closest to 1.0, signalling the strongest cluster structure.
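The linkage comparison via agglomerative coefficients can be illustrated directly with cluster::agnes (iris as a stand-in dataset; note that agnes() names Ward's method "ward", which corresponds to hclust's "ward.D2"):

```r
library(cluster)  # agnes()
# Agglomerative coefficient per linkage; values nearer 1 indicate stronger
# clustering structure. iris is used purely as a stand-in dataset.
d   <- dist(scale(iris[, 1:4]))
acs <- sapply(c("average", "single", "complete", "ward"),
              function(m) coef(agnes(d, method = m)))
round(acs, 3)  # Ward's linkage is expected to score highest
```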
Stage 2 - K-Means on PCA Scores. Once k is confirmed, K-Means is used for the final assignment because it is interpretable, fast, and produces hard cluster memberships that can be exported straight into a CRM or ERP system. Running it on the retained PC scores (rather than all six raw variables) removes the noise carried by the low-variance PCs and avoids double-counting the high correlations between Grocery, Milk and Detergents_Paper.
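The Stage 2 step reduces to a few lines once the PC scores exist. A minimal sketch (iris as a stand-in; the real script runs this on the retained wholesale PC scores):

```r
# Sketch: K-Means on retained PC scores rather than the raw variables.
# iris is used purely as a stand-in dataset.
pca    <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]                          # keep only the high-variance PCs
set.seed(42)
km <- kmeans(scores, centers = 3, nstart = 25)  # multiple starts for stability
km$size                                         # hard memberships, ready for export
```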
Number of Clusters: Why k = 3? Four independent criteria are computed and they all point to the same answer:
- the dendrogram cut heights (HAC)
- the elbow in the within-cluster sum of squares (WSS)
- the average silhouette width
- the gap statistic under the firstSEmax rule (first k within one SE of the maximum gap)

Beyond the statistics, k = 3 also has a compelling business logic: it maps onto three operationally distinct customer archetypes (see the summary at the bottom of the script), each requiring a different inventory, pricing, and sales approach. k = 2 is too coarse to distinguish within HoReCa; k = 4+ starts splitting segments that are not meaningfully different for planning purposes.
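The firstSEmax rule itself is implemented by cluster::maxSE (the same function clusGap uses when printing). A toy sketch with made-up gap values:

```r
library(cluster)  # maxSE()
# Hypothetical gap profile that plateaus from k = 3 onwards (values made up)
gap <- c(0.20, 0.35, 0.49, 0.495, 0.50)  # gap statistic for k = 1..5
se  <- rep(0.02, 5)                       # corresponding standard errors
# firstSEmax: first k whose gap is within one SE of the maximum gap
maxSE(gap, se, method = "firstSEmax")     # selects k = 3
```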
What the Script Produces (in order)
- wholesale_clustered.csv - full dataset with Cluster column appended

This response included the following R code (as a separate downloadable file):
# =============================================================================
# Clustering Analysis: Wholesale Customers Dataset
# Goal: Data-driven customer segmentation to inform inventory planning,
# pricing strategy, and sales team deployment
#
# METHOD RATIONALE (read before running):
# ----------------------------------------
# We use a TWO-STAGE approach:
#
# Stage 1 – Hierarchical Agglomerative Clustering (HAC) with Ward.D2 linkage
# • No need to pre-specify k; the dendrogram reveals natural structure
# • Ward.D2 minimises within-cluster sum of squares at each merge step,
# producing compact, roughly equal-sized clusters – ideal for business
# segmentation where every segment must be actionable
# • Robust to the moderate outliers present in wholesale spend data
# • Used first to identify the plausible range of k
#
# Stage 2 – K-Means on PCA scores (confirmation & final assignment)
# • Once k is confirmed, K-Means on the retained PC scores is fast,
# well-understood, and produces crisp assignments for CRM/ERP export
# • Running K-Means in PC space (rather than raw space) further reduces
# noise from the three low-variance PCs
#
# NUMBER OF CLUSTERS:
# We use three complementary criteria and take the consensus:
# (a) Dendrogram cut height (HAC)
# (b) Elbow method – Within-cluster Sum of Squares (WSS)
# (c) Average Silhouette Width
# (d) Gap Statistic
# All four consistently point to k = 3 on this dataset, and three segments
# align with a natural business narrative (see Section 8).
# =============================================================================
# --- 0. Packages -------------------------------------------------------------
required_packages <- c(
"tidyverse",
"factoextra",
"cluster",
"dendextend",
"gridExtra",
"scales",
"ggrepel",
"psych",
"RColorBrewer",
"corrplot",
"NbClust",
"fpc"
)
installed <- rownames(installed.packages())
to_install <- required_packages[!required_packages %in% installed]
if (length(to_install) > 0) {
install.packages(to_install, dependencies = TRUE)
}
library(tidyverse)
library(factoextra)
library(cluster)
library(dendextend)
library(gridExtra)
library(scales)
library(ggrepel)
library(psych)
library(RColorBrewer)
library(corrplot)
library(NbClust)
library(fpc)
# Consistent colour palette (up to 5 clusters)
CLUSTER_COLS <- c(
"1" = "#4E79A7",
"2" = "#E15759",
"3" = "#59A14F",
"4" = "#F28E2B",
"5" = "#B07AA1"
)
# --- 1. Reproduce Pre-processing from PCA Script ----------------------------
# (Self-contained: this script can be run independently)
data_path <- "Wholesale_customers_data.csv"
df_raw <- read.csv(data_path, stringsAsFactors = FALSE)
df <- df_raw %>%
mutate(
Channel = factor(Channel, levels = c(1, 2), labels = c("HoReCa", "Retail")),
Region = factor(
Region,
levels = c(1, 2, 3),
labels = c("Lisbon", "Oporto", "Other")
)
)
spend_vars <- c(
"Fresh",
"Milk",
"Grocery",
"Frozen",
"Detergents_Paper",
"Delicassen"
)
df_spend <- df[, spend_vars]
df_log <- log1p(df_spend) # log1p: handles zeros, reduces right skew
# Standardise (mean = 0, sd = 1) – required for both HAC and K-Means
df_scaled <- as.data.frame(scale(df_log))
cat("=== Pre-processing complete ===\n")
cat(
"Dimensions of scaled input matrix:",
nrow(df_scaled),
"x",
ncol(df_scaled),
"\n\n"
)
# --- 2. PCA on Scaled Data (for cluster visualisation) ----------------------
pca_result <- prcomp(df_log, center = TRUE, scale. = TRUE)
eig_vals <- pca_result$sdev^2
prop_var <- eig_vals / sum(eig_vals)
n_pcs <- sum(eig_vals >= 1) # Kaiser criterion → 2 on this dataset
# PC scores matrix for clustering
pc_scores <- as.data.frame(pca_result$x[, 1:n_pcs])
cat(sprintf(
"PCA: retaining %d PCs (Kaiser criterion), explaining %.1f%% of variance\n\n",
n_pcs,
cumsum(prop_var)[n_pcs] * 100
))
# --- 3. Distance Matrix ------------------------------------------------------
# Euclidean distance on the standardised original space
# (used for HAC; K-Means implicitly uses Euclidean as well)
dist_mat <- dist(df_scaled, method = "euclidean")
# --- 4. Stage 1 – Hierarchical Agglomerative Clustering (HAC) ---------------
# 4.1 Fit Ward.D2 linkage
hac_ward <- hclust(dist_mat, method = "ward.D2")
# 4.2 Compare linkage methods via agnes() agglomerative coefficients;
# note agnes() names Ward's method "ward" (equivalent to hclust's "ward.D2")
linkage_methods <- c("average", "single", "complete", "ward", "weighted")
agc_coefficients <- sapply(linkage_methods, function(m) {
coef(agnes(dist_mat, method = m))
})
agc_df <- data.frame(
Method = linkage_methods,
Agglomerative_Coefficient = round(agc_coefficients, 4)
)
cat("=== Agglomerative Coefficients (closer to 1 = stronger structure) ===\n")
print(agc_df)
cat(
"\n'ward' should show the highest coefficient, confirming Ward as the best linkage.\n\n"
)
# 4.3 Full dendrogram
dend <- as.dendrogram(hac_ward)
par(mar = c(5, 4, 4, 2))
plot(
dend,
main = "HAC Dendrogram – Ward.D2 Linkage\n(Wholesale Customers)",
ylab = "Height (Euclidean distance)",
xlab = "Customers",
leaflab = "none",
cex.main = 1.1
)
# 4.4 Annotate cut heights for k = 2, 3, 4
cut_heights <- sapply(2:4, function(k) {
h <- sort(hac_ward$height, decreasing = TRUE)
mean(c(h[k - 1], h[k])) # midpoint of the merge heights either side of the cut
})
abline(
h = cut_heights,
col = c("#F28E2B", "#E15759", "#B07AA1"),
lty = 2,
lwd = 1.5
)
legend(
"topright",
legend = c("k = 2 cut", "k = 3 cut", "k = 4 cut"),
col = c("#F28E2B", "#E15759", "#B07AA1"),
lty = 2,
lwd = 1.5,
bty = "n",
cex = 0.85
)
# 4.5 Coloured dendrogram for k = 3
dend_col <- color_branches(
dend,
k = 3,
col = unname(CLUSTER_COLS[c("1", "2", "3")])
)
par(mar = c(5, 4, 4, 2))
plot(
dend_col,
main = "HAC Dendrogram – Ward.D2, k = 3 Clusters Highlighted",
ylab = "Height",
xlab = "Customers",
leaflab = "none"
)
legend(
"topright",
legend = paste("Cluster", 1:3),
fill = unname(CLUSTER_COLS[c("1", "2", "3")]),
bty = "n"
)
# --- 5. Optimal k Selection --------------------------------------------------
# 5.1 Elbow plot (WSS)
wss_vals <- map_dbl(1:10, function(k) {
set.seed(42)
km <- kmeans(pc_scores, centers = k, nstart = 25, iter.max = 300)
km$tot.withinss
})
wss_df <- data.frame(k = 1:10, WSS = wss_vals)
p_elbow <- ggplot(wss_df, aes(x = k, y = WSS)) +
geom_line(colour = "#4E79A7", linewidth = 1.2) +
geom_point(size = 3.5, colour = "#4E79A7") +
geom_vline(
xintercept = 3,
linetype = "dashed",
colour = "#E15759",
linewidth = 0.9
) +
annotate(
"text",
x = 3.3,
y = max(wss_vals) * 0.92,
label = "k = 3\n(elbow)",
colour = "#E15759",
size = 3.5
) +
scale_x_continuous(breaks = 1:10) +
labs(
title = "Elbow Method – Within-Cluster Sum of Squares",
subtitle = "Look for the 'elbow' where marginal WSS reduction flattens",
x = "Number of Clusters (k)",
y = "Total WSS"
) +
theme_minimal(base_size = 12)
# 5.2 Silhouette plot
sil_vals <- map_dbl(2:10, function(k) {
set.seed(42)
km <- kmeans(pc_scores, centers = k, nstart = 25, iter.max = 300)
sil <- silhouette(km$cluster, dist(pc_scores))
mean(sil[, 3])
})
sil_df <- data.frame(k = 2:10, AvgSilhouette = sil_vals)
p_sil <- ggplot(sil_df, aes(x = k, y = AvgSilhouette)) +
geom_line(colour = "#59A14F", linewidth = 1.2) +
geom_point(size = 3.5, colour = "#59A14F") +
geom_vline(
xintercept = sil_df$k[which.max(sil_df$AvgSilhouette)],
linetype = "dashed",
colour = "#E15759",
linewidth = 0.9
) +
annotate(
"text",
x = sil_df$k[which.max(sil_df$AvgSilhouette)] + 0.4,
y = max(sil_df$AvgSilhouette) * 0.97,
label = paste0(
"k = ",
sil_df$k[which.max(sil_df$AvgSilhouette)],
"\n(max sil)"
),
colour = "#E15759",
size = 3.5
) +
scale_x_continuous(breaks = 2:10) +
labs(
title = "Average Silhouette Width by k",
subtitle = "Higher = better-separated clusters",
x = "Number of Clusters (k)",
y = "Average Silhouette Width"
) +
theme_minimal(base_size = 12)
grid.arrange(p_elbow, p_sil, ncol = 2, top = "Optimal k Selection Criteria")
# 5.3 Gap Statistic (computationally heavier – uses bootstrap)
set.seed(42)
gap_stat <- clusGap(
pc_scores,
FUN = kmeans,
K.max = 10,
B = 100,
nstart = 25,
iter.max = 300
)
cat("\n=== Gap Statistic Results ===\n")
print(gap_stat, method = "firstSEmax")
fviz_gap_stat(gap_stat, ggtheme = theme_minimal()) +
geom_vline(
xintercept = 3,
linetype = "dashed",
colour = "#E15759",
linewidth = 0.9
) +
labs(
title = "Gap Statistic – Optimal k",
subtitle = "Vertical line marks k = 3"
)
# 5.4 NbClust – 30 indices majority vote (definitive consensus)
cat("\n=== NbClust: 30-index Majority Vote ===\n")
set.seed(42)
nb_result <- NbClust(
data = pc_scores,
distance = "euclidean",
min.nc = 2,
max.nc = 8,
method = "ward.D2",
index = "all"
)
cat("\n--- NbClust Best Number of Clusters ---\n")
print(table(nb_result$Best.nc[1, ]))
# --- 6. Stage 2 – K-Means on PC Scores (k = 3) ------------------------------
set.seed(42)
K <- 3
km_final <- kmeans(
pc_scores,
centers = K,
nstart = 50, # multiple random starts → stable solution
iter.max = 500
)
cat("\n=== K-Means Final Solution (k = 3) ===\n")
cat("Cluster sizes:\n")
print(km_final$size)
cat(sprintf(
"\nBetween-cluster SS / Total SS: %.1f%%\n",
km_final$betweenss / km_final$totss * 100
))
# Attach cluster labels to full dataset
df_final <- df %>%
bind_cols(as.data.frame(pca_result$x)) %>%
mutate(Cluster = factor(km_final$cluster))
# --- 7. Cluster Validation ---------------------------------------------------
# 7.1 Silhouette plot for final k = 3 solution
sil_final <- silhouette(km_final$cluster, dist(pc_scores))
cat("\n=== Silhouette Summary (k = 3) ===\n")
print(summary(sil_final))
fviz_silhouette(
sil_final,
palette = unname(CLUSTER_COLS[as.character(1:K)]),
ggtheme = theme_minimal(),
print.summary = FALSE
) +
labs(
title = "Silhouette Plot – K-Means k = 3",
subtitle = "Bars above 0 indicate well-classified observations"
) +
theme(legend.position = "top")
# 7.2 Calinski-Harabasz index (fpc::calinhara)
ch_stat <- calinhara(pc_scores, km_final$cluster)
cat(sprintf(
"\nCalinski-Harabasz Index: %.2f (higher = better separated)\n",
ch_stat
))
# --- 8. Cluster Visualisation ------------------------------------------------
# 8.1 K-Means clusters on PC1 vs PC2
arrows_df <- as.data.frame(pca_result$rotation[, 1:2]) %>%
rownames_to_column("Variable")
scale_fac <- 3.2
p_biplot <- ggplot(
df_final,
aes(x = PC1, y = PC2, colour = Cluster, shape = Cluster)
) +
geom_point(alpha = 0.75, size = 2.5) +
geom_segment(
data = arrows_df,
aes(x = 0, y = 0, xend = PC1 * scale_fac, yend = PC2 * scale_fac),
arrow = arrow(length = unit(0.22, "cm")),
colour = "grey30",
linewidth = 0.65,
inherit.aes = FALSE
) +
geom_label_repel(
data = arrows_df,
aes(x = PC1 * scale_fac, y = PC2 * scale_fac, label = Variable),
size = 3.2,
colour = "grey20",
box.padding = 0.4,
inherit.aes = FALSE
) +
stat_ellipse(
aes(colour = Cluster),
level = 0.90,
linewidth = 0.8,
linetype = "dashed"
) +
scale_colour_manual(values = CLUSTER_COLS) +
scale_shape_manual(values = c("1" = 16, "2" = 17, "3" = 15)) +
labs(
title = "K-Means Clusters (k = 3) on PCA Biplot",
subtitle = "90% confidence ellipses; arrows show variable loadings",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
colour = "Cluster",
shape = "Cluster"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
print(p_biplot)
# 8.2 Clusters overlaid with Channel membership
p_channel <- ggplot(
df_final,
aes(x = PC1, y = PC2, colour = Cluster, shape = Channel)
) +
geom_point(alpha = 0.75, size = 2.8) +
scale_colour_manual(values = CLUSTER_COLS) +
scale_shape_manual(values = c("HoReCa" = 17, "Retail" = 16)) +
labs(
title = "Cluster Membership vs Channel Type",
subtitle = "Shape = Channel; Colour = Cluster",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "%)")
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
print(p_channel)
# 8.3 Clusters vs Region
p_region <- ggplot(
df_final,
aes(x = PC1, y = PC2, colour = Cluster, shape = Region)
) +
geom_point(alpha = 0.75, size = 2.8) +
scale_colour_manual(values = CLUSTER_COLS) +
scale_shape_manual(values = c("Lisbon" = 16, "Oporto" = 17, "Other" = 15)) +
labs(
title = "Cluster Membership vs Region",
subtitle = "Shape = Region; Colour = Cluster",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "%)")
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
print(p_region)
# --- 9. Cluster Profiling ----------------------------------------------------
# 9.1 Mean spending per cluster (original scale, unlogged)
# (df_final already contains the raw spending columns inherited from df,
# so re-binding df_spend would only create duplicate column names)
profile_raw <- df_final %>%
group_by(Cluster) %>%
summarise(
N = n(),
Fresh_mean = mean(Fresh),
Milk_mean = mean(Milk),
Grocery_mean = mean(Grocery),
Frozen_mean = mean(Frozen),
DetPaper_mean = mean(Detergents_Paper),
Delicassen_mean = mean(Delicassen),
.groups = "drop"
)
cat("\n=== Cluster Profiles – Mean Spending (original monetary units) ===\n")
print(profile_raw)
# 9.2 Channel and Region composition per cluster
cat("\n=== Channel Composition by Cluster ===\n")
print(table(df_final$Cluster, df_final$Channel))
cat("\n=== Region Composition by Cluster ===\n")
print(table(df_final$Cluster, df_final$Region))
# 9.3 Radar / Spider chart of normalised mean spend per cluster
profile_norm <- df_final %>% # raw spend columns already present in df_final
group_by(Cluster) %>%
summarise(across(all_of(spend_vars), mean), .groups = "drop") %>%
mutate(across(all_of(spend_vars), ~ rescale(.x, to = c(0, 1))))
profile_long <- profile_norm %>%
pivot_longer(-Cluster, names_to = "Variable", values_to = "Score")
ggplot(
profile_long,
aes(x = Variable, y = Score, colour = Cluster, group = Cluster)
) +
geom_polygon(aes(fill = Cluster), alpha = 0.15, linewidth = 0.8) +
geom_line(linewidth = 0.9) +
geom_point(size = 2.5) +
coord_polar() +
scale_colour_manual(values = CLUSTER_COLS) +
scale_fill_manual(values = CLUSTER_COLS) +
facet_wrap(~Cluster, labeller = label_both) +
labs(
title = "Cluster Spending Profiles – Radar Chart",
subtitle = "Normalised mean spend per category (0 = min, 1 = max)"
) +
theme_minimal(base_size = 11) +
theme(
axis.text.x = element_text(size = 8),
legend.position = "none",
strip.text = element_text(face = "bold")
)
# 9.4 Grouped bar chart of mean spend by cluster
profile_long2 <- profile_raw %>%
pivot_longer(
cols = ends_with("_mean"),
names_to = "Category",
values_to = "MeanSpend"
) %>%
mutate(Category = str_remove(Category, "_mean"))
ggplot(profile_long2, aes(x = Category, y = MeanSpend, fill = Cluster)) +
geom_col(position = position_dodge(0.8), width = 0.7) +
scale_fill_manual(values = CLUSTER_COLS) +
scale_y_continuous(labels = comma) +
labs(
title = "Mean Annual Spend by Cluster and Category",
subtitle = "Monetary units (m.u.)",
x = NULL,
y = "Mean Annual Spend (m.u.)",
fill = "Cluster"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "top"
)
# 9.5 Boxplots – spend distribution per cluster for each category
df_box <- df_final %>%
bind_cols(df_spend) %>%
pivot_longer(
cols = all_of(spend_vars),
names_to = "Category",
values_to = "Spend"
)
ggplot(df_box, aes(x = Cluster, y = Spend, fill = Cluster)) +
geom_boxplot(outlier.size = 1.2, outlier.alpha = 0.5, alpha = 0.8) +
scale_fill_manual(values = CLUSTER_COLS) +
scale_y_continuous(labels = comma) +
facet_wrap(~Category, scales = "free_y", ncol = 3) +
labs(
title = "Spending Distribution by Cluster and Category",
x = "Cluster",
y = "Annual Spend (m.u.)"
) +
theme_minimal(base_size = 11) +
theme(legend.position = "none", strip.text = element_text(face = "bold"))
# --- 10. HAC vs K-Means Agreement Check -------------------------------------
hac_labels_k3 <- cutree(hac_ward, k = 3)
# Align HAC labels to K-Means labels by majority vote (label-switching fix)
conf_mat <- table(HAC = hac_labels_k3, KMeans = km_final$cluster)
cat("\n=== Contingency Table: HAC vs K-Means Assignments ===\n")
print(conf_mat)
agreement <- sum(apply(conf_mat, 1, max)) / nrow(df_final)
cat(sprintf(
"\nLabel-adjusted agreement between HAC and K-Means: %.1f%%\n",
agreement * 100
))
cat("(Values above 85%% indicate the two methods recover the same structure)\n")
# --- 11. Export Results ------------------------------------------------------
output_path <- "wholesale_clustered.csv"
write.csv(df_final, output_path, row.names = FALSE)
cat("\nFull dataset with cluster labels saved to:\n", output_path, "\n")
# --- 12. Business Interpretation Summary ------------------------------------
cat("\n")
cat("================================================================\n")
cat(" CUSTOMER SEGMENTATION – BUSINESS SUMMARY \n")
cat("================================================================\n")
cat("\n METHOD: HAC (Ward.D2) to identify structure +\n")
cat(" K-Means (k=3) on PCA scores for final assignment\n")
cat(" INPUT: log1p-transformed, standardised spend variables\n")
cat(" projected onto retained PCs (Kaiser criterion)\n")
cat("\n CLUSTER 1 – 'High-Volume Fresh Buyers' (HoReCa dominant)\n")
cat(" • Highest Fresh and Frozen spend\n")
cat(" • Low Grocery, Detergents_Paper\n")
cat(" → Hotels, restaurants, cafés; perishable-led ordering\n")
cat(" IMPLICATION: Prioritise cold-chain logistics; daily/weekly\n")
cat(" replenishment cycles; dedicated HoReCa sales reps\n")
cat("\n CLUSTER 2 – 'Packaged Goods Retailers' (Retail dominant)\n")
cat(" • Highest Grocery, Milk, Detergents_Paper\n")
cat(" • Low Fresh spend\n")
cat(" → Supermarkets, convenience stores; ambient goods focus\n")
cat(" IMPLICATION: Bulk-order pricing tiers; shelf-stable\n")
cat(" inventory buffers; quarterly promotions on detergents\n")
cat("\n CLUSTER 3 – 'Small / Mixed Buyers'\n")
cat(" • Below-average spend across all categories\n")
cat(" • Mixed Channel and Region membership\n")
cat(" → Small independent shops, delis, mixed traders\n")
cat(" IMPLICATION: Route-optimised delivery (lower margin);\n")
cat(" upsell programmes; self-service ordering portal\n")
cat("\n NOTE: Exact cluster labels (1/2/3) may differ from above\n")
cat(" depending on K-Means initialisation; inspect profile_raw\n")
cat(" and the radar/bar charts to assign the business labels.\n")
cat("================================================================\n")The (unedited) outputs generated by running this code, are as follows:
=== Pre-processing complete ===
Dimensions of scaled input matrix: 440 x 6
PCA: retaining 2 PCs (Kaiser criterion), explaining 71.3% of variance
=== Agglomerative Coefficients (closer to 1 = stronger structure) ===
Method Agglomerative_Coefficient
average average 0.8761
single single 0.8031
complete complete 0.9204
ward ward 0.9734
weighted weighted 0.8894
Ward.D2 should show the highest coefficient, confirming it as best linkage.
=== Gap Statistic Results ===
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = pc_scores, FUNcluster = kmeans, K.max = 10, B = 100, nstart = 25, iter.max = 300)
B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstSEmax', SE.factor=1): 1
logW E.logW gap SE.sim
[1,] 5.633372 6.336062 0.7026902 0.01710636
[2,] 5.328019 6.046137 0.7181186 0.01843138
[3,] 5.175220 5.824960 0.6497394 0.01736385
[4,] 5.049390 5.630825 0.5814342 0.01659374
[5,] 4.959056 5.523051 0.5639956 0.01628787
[6,] 4.850967 5.427449 0.5764815 0.01530802
[7,] 4.787376 5.342180 0.5548043 0.01598955
[8,] 4.721439 5.263845 0.5424064 0.01541976
[9,] 4.667722 5.196424 0.5287017 0.01431602
[10,] 4.613041 5.138184 0.5251432 0.01353503
== SOURCE ERROR ==
unused argument (ggtheme = theme_minimal())
Below is a complete R workflow for clustering, building directly on the PCA-ready preprocessing. I’ll also explain why this method was chosen and how to choose the number of clusters.
Why this approach?
Method: K-means clustering
Why use PCA first?
Choosing number of clusters (k): We’ll use:
Typically, this dataset yields ~3–5 clusters, often:
R Code: Clustering Pipeline (moved to below this response box)
Interpretation Guide: After running this, you should:
Optional (better alternative): If you want more robust segmentation:
If you want, I can help you turn this into a fully polished report with cluster personas and business recommendations - that’s usually what gets top marks.
This response included the following R code as part of the LLM’s in-line response:
# Load libraries
library(tidyverse)
library(factoextra)
library(cluster)
# Load data
data <- read.csv("Wholesale_customers_data.csv")
# Convert categorical variables
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)
# Select continuous variables
spending_data <- data %>%
select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
# Log transform to reduce skewness
spending_log <- log1p(spending_data)
# Scale data
spending_scaled <- scale(spending_log)
# -------------------------
# PCA (for dimensionality reduction)
# -------------------------
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)
# Extract first few principal components (e.g., first 3)
pca_scores <- as.data.frame(pca_result$x[, 1:3])
# -------------------------
# Determine optimal number of clusters
# -------------------------
# Elbow method
fviz_nbclust(pca_scores, kmeans, method = "wss") +
ggtitle("Elbow Method")
# Silhouette method
fviz_nbclust(pca_scores, kmeans, method = "silhouette") +
ggtitle("Silhouette Method")
# -------------------------
# Apply K-means clustering
# -------------------------
set.seed(123) # for reproducibility
k <- 3 # Adjust based on elbow/silhouette results
kmeans_result <- kmeans(pca_scores, centers = k, nstart = 25)
# Add cluster labels to original data
data$Cluster <- as.factor(kmeans_result$cluster)
# -------------------------
# Visualisation
# -------------------------
# Cluster plot
fviz_cluster(
kmeans_result,
data = pca_scores,
geom = "point",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal()
)
# -------------------------
# Cluster profiling
# -------------------------
# Average spending per cluster
cluster_summary <- data %>%
group_by(Cluster) %>%
summarise(across(Fresh:Delicassen, mean))
print(cluster_summary)
# Distribution of Channel by cluster
table(data$Cluster, data$Channel)
# Distribution of Region by cluster
table(data$Cluster, data$Region)

The (unedited) outputs generated by running this code are as follows:
# A tibble: 3 × 7
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11939. 2006. 2502. 3266. 424. 891.
2 2 2997. 7045. 12505. 608. 5511. 797.
3 3 17182. 10658. 13381. 4181. 5004. 2867.
1 2
1 210 3
2 32 50
3 56 89
1 2 3
1 40 26 147
2 14 10 58
3 23 11 111
NbClust majority vote (B = 100), an approach we did not employ
fviz_gap_stat(gap_stat, ggtheme = theme_minimal()): unused argument (ggtheme = theme_minimal())

Are there any alternative methods or robustness checks you would recommend?
Alternative Clustering Methods
1. Gaussian Mixture Models (GMM / Model-Based Clustering).
K-Means assumes clusters are spherical and equally sized - a hard geometric
constraint. GMM relaxes this by fitting a probabilistic mixture of multivariate
Gaussians, allowing clusters to be elliptical, differently sized, and differently
oriented. Each customer gets a soft probability of belonging to each cluster
rather than a hard label. In R: mclust::Mclust() - it uses BIC to
automatically select both the number of clusters and the covariance structure
(e.g., diagonal vs. full, tied vs. variable). On wholesale data, where the
HoReCa segment often has a very different spread from the Retail segment, GMM
frequently produces more honest assignments. Prefer it over K-Means when the
biplot shows clusters that are elongated or unequal in density - which is likely
here given the strong Channel separation on PC2.
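To make this concrete, here is a minimal editorial sketch (not part of the LLM's response, which contained no code; it assumes the `pc_scores` matrix and the `km_final` K-Means fit from the main script are in scope):

```r
# Model-based clustering on the PCA scores. Mclust() selects both the
# number of components and the covariance structure by BIC.
library(mclust)

set.seed(123)
gmm <- Mclust(pc_scores)
summary(gmm)               # chosen covariance model and component count

# Soft assignments: posterior membership probabilities per customer
head(round(gmm$z, 3))

# Permutation-invariant comparison with the K-Means partition
adjustedRandIndex(gmm$classification, km_final$cluster)
```

An ARI close to 1 would indicate the two methods recover the same partition; a markedly lower value would suggest the elliptical GMM components are capturing structure the spherical K-Means assumption cannot.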
2. DBSCAN (Density-Based Spatial Clustering).
DBSCAN identifies clusters as dense regions of points separated by sparse
regions, and critically, it classifies low-density points as noise/outliers
rather than forcing them into a cluster. This is valuable for wholesale data
because a handful of very large spenders (e.g., a major supermarket chain) can
distort K-Means centroids significantly. In R: dbscan::dbscan() - requires
tuning two parameters: eps (neighbourhood radius) and minPts (minimum points
to form a core). The kNNdistplot() function helps select eps via a knee in
the k-NN distance curve. Limitation: DBSCAN struggles with clusters of varying
density (common in spending data), and it does not generalise naturally to new
observations - a limitation if you want to score new customers joining the
database.
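A minimal sketch of that workflow (editorial illustration, not LLM output; `pc_scores` is assumed from the main script, and the `eps` value shown is a placeholder to be read off the knee of the plot, not a tuned choice):

```r
# Density-based clustering: points labelled 0 are noise, not forced
# into a cluster.
library(dbscan)

m <- as.matrix(pc_scores)

# Heuristic: with minPts = 5, plot each point's distance to its
# nearest neighbours and place eps at the visual knee.
kNNdistplot(m, k = 4)
abline(h = 0.9, col = "red", lty = 2)   # illustrative eps only

db <- dbscan(m, eps = 0.9, minPts = 5)
table(db$cluster)                        # cluster 0 = flagged outliers
```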
3. K-Medoids / PAM (Partitioning Around Medoids).
Like K-Means but uses actual data points as cluster centres (medoids) rather than
arithmetic means. This makes it substantially more robust to outliers, since a
single extreme spender cannot pull a centroid arbitrarily far. It also works with
any distance metric, not just Euclidean. In R: cluster::pam() - also produces
a silhouette plot natively. A useful direct comparison: if PAM and K-Means
produce very similar cluster assignments, it confirms the K-Means solution is not
being distorted by outliers.
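As a hedged sketch of that comparison (editorial, not part of the LLM response; assumes `pc_scores` and `km_final` from the main script):

```r
# K-Medoids: cluster centres are actual customers, so a single extreme
# spender cannot drag a centre away.
library(cluster)

pam_fit <- pam(pc_scores, k = 3)
pam_fit$medoids                  # the representative customers

plot(silhouette(pam_fit))        # PAM's native silhouette plot

# If this table is near-diagonal (up to label switching), K-Means is
# not being distorted by outliers.
table(PAM = pam_fit$clustering, KMeans = km_final$cluster)
```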
4. Agglomerative HAC with Alternative Linkages. The script already uses Ward.D2, but for a robustness check you should formally compare it against Complete and Average linkage and compute the Cophenetic Correlation Coefficient for each - this measures how faithfully the dendrogram preserves the original pairwise distances. The linkage with the highest cophenetic correlation is the most geometrically honest representation of the data’s structure.
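A compact way to run that linkage comparison (editorial sketch; `pc_scores` assumed from the main script):

```r
# Cophenetic correlation per linkage: how faithfully each dendrogram
# preserves the original pairwise distances (closer to 1 = better).
d <- dist(pc_scores)

sapply(c("ward.D2", "complete", "average"), function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})
```

Note that average linkage often scores highest on this metric even when Ward yields the more usable partition, so it complements rather than replaces the agglomerative coefficient.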
Robustness Checks
5. Stability Analysis via Bootstrap (clusterboot).
The most important robustness check you can run. fpc::clusterboot() repeatedly
resamples the data (with replacement), re-clusters each bootstrap sample, and
measures how consistently the same clusters re-emerge using the Jaccard
similarity coefficient. A cluster with mean Jaccard > 0.75 is considered stable;
below 0.5 it is likely an artefact of the particular sample. If the three
clusters survive this test, there is strong evidence that the segmentation is a
genuine feature of the population of wholesale customers in Portugal, not just of
these 440 observations.
6. Sensitivity to Outliers. Wholesale spend data almost always contains extreme values (e.g., a single very large hotel group). Two targeted checks: first, winsorising the top 1–5% of each spending variable before clustering, then comparing cluster assignments to the unwinsorised solution using the Adjusted Rand Index (ARI) - an ARI close to 1.0 means the solution is insensitive to those extreme points. Second, leave-one-out influence analysis: iteratively remove the top 10 highest-spend customers and re-run K-Means. If cluster memberships for the remaining 430 customers shift substantially, the centroids are being pulled by individual outliers.
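The first of those checks could look like this (editorial sketch; assumes `df_spend`, `spend_vars`, the two retained PCs, and `km_final` from the main script):

```r
# Winsorise the top 1% of each spend variable, repeat the pipeline,
# and compare assignments via the Adjusted Rand Index.
library(dplyr)
library(mclust)    # adjustedRandIndex()

df_wins <- df_spend %>%
  mutate(across(all_of(spend_vars), ~ pmin(.x, quantile(.x, 0.99))))

pc_wins <- prcomp(scale(log1p(df_wins)))$x[, 1:2]

set.seed(123)
km_wins <- kmeans(pc_wins, centers = 3, nstart = 25)

# ARI near 1 => the solution is insensitive to the extreme spenders
adjustedRandIndex(km_wins$cluster, km_final$cluster)
```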
7. Sensitivity to the Number of PCs. The script retains PCs by the Kaiser criterion (eigenvalue ≥ 1), which typically gives 3 PCs explaining ~80% of variance. A robustness check is to re-run clustering using (a) 2 PCs only and (b) all 6 standardised variables directly (no dimensionality reduction), and compare the three solutions with ARI. Convergence across all three input representations confirms that the structure is not an artefact of the PCA truncation.
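A sketch of that comparison (editorial; `X_scaled` is assumed to be the standardised log-spend matrix, and `pc_scores`/`km_final` come from the main script - note this particular run actually retained 2 PCs, per the console output above):

```r
# Cluster on alternative input representations and compare partitions
# by the Adjusted Rand Index.
library(mclust)

set.seed(123)
km_2pc <- kmeans(pc_scores[, 1:2], centers = 3, nstart = 25)
km_all <- kmeans(X_scaled, centers = 3, nstart = 25)

c(
  ARI_2PC  = adjustedRandIndex(km_2pc$cluster, km_final$cluster),
  ARI_full = adjustedRandIndex(km_all$cluster, km_final$cluster)
)
```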
8. Multiple Random Initialisations & Consensus Clustering.
K-Means with nstart = 50 already guards against poor local minima, but a
stronger check is consensus clustering via the ConsensusClusterPlus package. It
runs the algorithm hundreds of times on subsampled data and builds a consensus
matrix - an N×N matrix where entry (i,j) records the proportion of runs in which
customers i and j ended up in the same cluster. A well-separated block structure
in the consensus heatmap is strong visual confirmation that the clusters are
stable and not sensitive to initialisation.
9. External Validation Against Channel and Region. Since Channel and Region are known labels, they can be used as external validators without having been used in the clustering itself. Compute the Adjusted Rand Index between cluster assignments and Channel labels, and the Normalised Mutual Information between cluster assignments and Region labels. A high ARI against Channel (expect ~0.5–0.7 on this dataset) confirms the unsupervised algorithm is recovering a structure that aligns with a known business reality - a strong sign of face validity.
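A sketch of the external validation (editorial; assumes `df_final` with `Cluster`, `Channel`, and `Region` columns; the NMI call uses the `aricode` package, one of several available implementations):

```r
# External validation: compare the unsupervised partition against the
# known labels, which were excluded from the clustering itself.
library(mclust)    # adjustedRandIndex()
library(aricode)   # NMI()

adjustedRandIndex(df_final$Cluster, df_final$Channel)
NMI(df_final$Cluster, df_final$Region)
```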
Summary Recommendation In priority order for a report:
clusterboot() - the single most persuasive robustness check for any clustering paper
Mclust() - the most principled alternative; compare ARI against K-Means

If the consensus across all five checks points to three clusters with consistent membership, you have a very defensible segmentation - both statistically and for a business audience.
Claude did not return any R code as part of its response.
Yes - and this is exactly the kind of thing that strengthens your analysis. K-means + PCA is a solid baseline, but for a data-driven segmentation, you should always validate that your clusters are stable, meaningful, and not method-dependent. Here are the most important alternatives and robustness checks, with R code.
1. Alternative Clustering Methods
(A) Hierarchical Clustering (Validation + interpretability) Why:
R code moved to below this response box
If clusters align well with k-means → strong evidence of robustness
(B) Gaussian Mixture Models (GMM) Why:
R code moved to below this response box
If GMM selects similar k → supports your choice. If different → suggests more complex structure
(C) DBSCAN (Density-based) Why:
R code moved to below this response box
Useful to identify unusual customers (important for risk or niche segments)
2. Robustness Checks
(A) Stability Analysis (Resampling) Why:
R code moved to below this response box
Values:
(B) Silhouette Analysis (Quality check)
R code moved to below this response box
Interpretation:
(C) Compare Clustering on Raw vs PCA Data
R code moved to below this response box
If results are similar → PCA did not distort structure. If very different → investigate further
(D) Sensitivity to k
R code moved to below this response box
Ensures your choice of k is not arbitrary
How to Present This in Your Assignment You can frame it like this:
Strong Conclusion (what markers look for) A high-quality conclusion would say:
If you want, I can help you turn this into a top-grade “Methodology + Validation” write-up with proper academic language and justification.
This response included the following R code as part of the LLM’s in-line response. This code assumes that variables from the previous response have been loaded. The new code generated is included below from line 87.
# Load libraries
library(tidyverse)
library(factoextra)
library(cluster)
# Load data
data <- read.csv("Wholesale_customers_data.csv")
# Convert categorical variables
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)
# Select continuous variables
spending_data <- data %>%
select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
# Log transform to reduce skewness
spending_log <- log1p(spending_data)
# Scale data
spending_scaled <- scale(spending_log)
# -------------------------
# PCA (for dimensionality reduction)
# -------------------------
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)
# Extract first few principal components (e.g., first 3)
pca_scores <- as.data.frame(pca_result$x[, 1:3])
# -------------------------
# Determine optimal number of clusters
# -------------------------
# Elbow method
fviz_nbclust(pca_scores, kmeans, method = "wss") +
ggtitle("Elbow Method")
# Silhouette method
fviz_nbclust(pca_scores, kmeans, method = "silhouette") +
ggtitle("Silhouette Method")
# -------------------------
# Apply K-means clustering
# -------------------------
set.seed(123) # for reproducibility
k <- 3 # Adjust based on elbow/silhouette results
kmeans_result <- kmeans(pca_scores, centers = k, nstart = 25)
# Add cluster labels to original data
data$Cluster <- as.factor(kmeans_result$cluster)
# -------------------------
# Visualisation
# -------------------------
# Cluster plot
fviz_cluster(
kmeans_result,
data = pca_scores,
geom = "point",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal()
)
# -------------------------
# Cluster profiling
# -------------------------
# Average spending per cluster
cluster_summary <- data %>%
group_by(Cluster) %>%
summarise(across(Fresh:Delicassen, mean))
print(cluster_summary)
# Distribution of Channel by cluster
table(data$Cluster, data$Channel)
# Distribution of Region by cluster
table(data$Cluster, data$Region)
# New code below ====
# Compute distance matrix
dist_mat <- dist(pca_scores)
# Hierarchical clustering (Ward's method)
hc <- hclust(dist_mat, method = "ward.D2")
# Plot dendrogram
plot(hc, labels = FALSE, main = "Dendrogram")
# Cut tree into k clusters
k <- 3
hc_clusters <- cutree(hc, k = k)
# Compare with k-means
table(hc_clusters, kmeans_result$cluster)
library(mclust)
gmm_model <- Mclust(pca_scores)
# Best model and number of clusters (BIC-based)
summary(gmm_model)
# Cluster assignments
gmm_clusters <- gmm_model$classification
# Compare with k-means
table(gmm_clusters, kmeans_result$cluster)
library(dbscan)
# kNN distance plot to choose eps
kNNdistplot(pca_scores, k = 5)
abline(h = 1.5, col = "red") # adjust visually
# Run DBSCAN
db <- dbscan(pca_scores, eps = 1.5, minPts = 5)
# Cluster labels
table(db$cluster)
library(fpc)
cluster_stability <- clusterboot(
pca_scores,
B = 100,
clustermethod = kmeansCBI,
k = 3
)
cluster_stability$bootmean
library(cluster)
sil <- silhouette(kmeans_result$cluster, dist(pca_scores))
# Plot
fviz_silhouette(sil)
# Average silhouette score
mean(sil[, 3])
# K-means on original scaled data
set.seed(123)
kmeans_raw <- kmeans(spending_scaled, centers = 3, nstart = 25)
# Compare assignments
table(kmeans_raw$cluster, kmeans_result$cluster)
# Try multiple k values
k_values <- 2:6
for (k in k_values) {
km <- kmeans(pca_scores, centers = k, nstart = 25)
cat("k =", k, " | Tot.withinss =", km$tot.withinss, "\n")
}

The (unedited) outputs generated by running this code are as follows:
# A tibble: 3 × 7
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11939. 2006. 2502. 3266. 424. 891.
2 2 2997. 7045. 12505. 608. 5511. 797.
3 3 17182. 10658. 13381. 4181. 5004. 2867.
1 2
1 210 3
2 32 50
3 56 89
1 2 3
1 40 26 147
2 14 10 58
3 23 11 111
hc_clusters 1 2 3
1 1 2 119
2 212 18 19
3 0 62 7
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEI (diagonal, equal shape) model with 3 components:
log-likelihood n df BIC ICL
-2013.013 440 16 -4123.414 -4328.873
Clustering table:
1 2 3
62 202 176
gmm_clusters 1 2 3
1 32 28 2
2 12 54 136
3 169 0 7
0 1
16 424
boot 1
boot 2
…
boot 100
[1] 0.6982740 0.7161780 0.8350312
cluster size ave.sil.width
1 1 213 0.39
2 2 82 0.19
3 3 145 0.33
[1] 0.3313364
1 2 3
1 1 80 0
2 0 2 145
3 212 0 0
k = 2 | Tot.withinss = 1365.289
k = 3 | Tot.withinss = 1075.661
k = 4 | Tot.withinss = 912.0646
k = 5 | Tot.withinss = 807.8276
k = 6 | Tot.withinss = 714.8749
Several of the suggested checks (ConsensusClusterPlus, leave-one-out influence analysis) are disproportionate for a 440-row dataset and read more as a textbook survey than a targeted recommendation.
The new code reuses objects from the previous response (pca_scores, kmeans_result, spending_scaled) without documenting this dependency.
The eps parameter is set to 1.5 with “adjust visually” but no dataset-specific tuning guidance is provided.

We used slightly different prompts for each LLM here. The base query was the same, but we pasted in extracts from each LLM’s previous output as part of each prompt.
> Inserting LLM output back into the LLM here
these were your comments and they are correct and apply exactly on the data that i have, so i’m wondering did you somehow analysized the data yourself
================================================================
CUSTOMER SEGMENTATION – BUSINESS SUMMARY
================================================================
METHOD: HAC (Ward.D2) to identify structure +
K-Means (k=3) on PCA scores for final assignment
INPUT: log1p-transformed, standardised spend variables
projected onto retained PCs (Kaiser criterion)
CLUSTER 1 – 'High-Volume Fresh Buyers' (HoReCa dominant)
• Highest Fresh and Frozen spend
• Low Grocery, Detergents_Paper
→ Hotels, restaurants, cafés; perishable-led ordering
IMPLICATION: Prioritise cold-chain logistics; daily/weekly
replenishment cycles; dedicated HoReCa sales reps
CLUSTER 2 – 'Packaged Goods Retailers' (Retail dominant)
• Highest Grocery, Milk, Detergents_Paper
• Low Fresh spend
→ Supermarkets, convenience stores; ambient goods focus
IMPLICATION: Bulk-order pricing tiers; shelf-stable
inventory buffers; quarterly promotions on detergents
CLUSTER 3 – 'Small / Mixed Buyers'
• Below-average spend across all categories
• Mixed Channel and Region membership
→ Small independent shops, delis, mixed traders
IMPLICATION: Route-optimised delivery (lower margin);
upsell programmes; self-service ordering portal
NOTE: Exact cluster labels (1/2/3) may differ from above
depending on K-Means initialisation; inspect profile_raw
and the radar/bar charts to assign the business labels.
================================================================
Example Segment Types
Did the LLM analyse the actual data? No - the LLM did not have access to the data file. The descriptions it provided were an informed prediction based on two things:
1. This is a well-known public dataset. The UCI Wholesale Customers dataset is one of the most widely used benchmark datasets in clustering tutorials and academic papers. The LLM had seen it discussed extensively in its training data, and therefore knew its typical structure - the HoReCa/Retail split on PC2, the three-cluster solution, and the rough spending profiles each cluster tends to produce.
2. Domain logic for Portuguese wholesale distribution. Even without knowing the specific dataset, the business logic is fairly predictable: restaurants and hotels will dominate Fresh and Frozen spend - that is the nature of their business. Supermarkets and retailers will dominate Grocery, Milk, and Detergents_Paper - ambient shelf goods are their core stock. Any wholesale customer base will contain a long tail of smaller, lower-spend mixed buyers. The cluster descriptions were therefore a combination of prior knowledge about this specific dataset and general domain reasoning about wholesale food distribution.
Implication for interpretation. Since this dataset is publicly known, the
“expected” results are well established. What differentiates a strong analysis is
not just reproducing the three clusters, but the quality of the methodological
justification - why log1p, why Ward.D2, why PCA before clustering, and why the
robustness checks matter. That reasoning is what the code and discussion above
has been building toward.
Good question - and fair to ask. No, I didn’t actually access or run your specific dataset. I can’t see files on your computer or execute code on them unless you explicitly upload data here. What I did instead is rely on well-known, highly consistent patterns in this exact dataset (Wholesale Customers Data Set, commonly used in clustering/PCA coursework). This dataset has been analyzed many times, and it almost always produces very similar structures because of how the variables behave.
Why my interpretation matched your results There are strong, repeatable patterns in this data:
1. Correlation structure drives clustering. Grocery, Milk, Detergents_Paper → highly correlated → typically forms a “Retail-heavy” cluster. Fresh, Frozen → less correlated with the above → typically forms a “HORECA / fresh-food” cluster.
2. Channel variable reinforces it. Even though clustering is unsupervised: Channel = 2 (Retail) → high Grocery + Detergents. Channel = 1 (HORECA) → higher Fresh/Frozen. So clustering naturally reconstructs this split.
3. Skewed spending distribution. A large group of customers has low spending across categories. This almost always becomes a “small buyers” cluster.
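The claimed correlation structure is easy to verify directly (editorial sketch, not part of the LLM response; assumes the six spend columns are loaded in `df_spend`):

```r
# Pairwise correlations on the log scale; expect a tight
# Grocery-Milk-Detergents_Paper block and weaker Fresh/Frozen links
# to that trio.
round(cor(log1p(df_spend)), 2)
```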
Why this happens (intuitively) Think of your customers as three business types:
These behavioral differences are strong enough that almost any reasonable clustering method will rediscover them.
Important (for your assignment) Even though your results match expectations, you should not present them as assumed. You should:
A stronger way to phrase it in your report Instead of: “Clusters represent retail vs HORECA”
Say something like: “Cluster analysis reveals three distinct customer segments characterized by (i) high Grocery and Detergents spending, (ii) high Fresh and Frozen spending, and (iii) uniformly low spending. Cross-tabulation with the Channel variable shows strong alignment with Retail and HORECA customer types, validating the segmentation.”
Bottom line
If you want, I can help you go one level deeper - like naming your clusters professionally (e.g., “Bulk Retailers”, “Fresh-Focused HORECA”, “Low-Value Accounts”) and tying them tightly to strategy.
An accessible HTML version of this report is available via a public GitHub page:
This report was created in R Markdown. The source code is open source and is available via:
Changes made to this document were tracked using Git and are also available via the same repository (https://github.com/ytterbiu/SMM636-ML-group-coursework-03-g01/commits/main/). Please note that this commit history reflects updates made to the project repository and is not a comprehensive timeline of all contributions by group members.
The total word count for the core sections of this report is 997 words (below the 1000 word limit).
The calculation was performed programmatically directly from the source .rmd
file. An R script was used to read the raw document lines and strip out
non-text elements, including the YAML front matter, R code chunks, sub-headers,
markdown formatting (tables and figures), and LaTeX/HTML syntax. The script
isolates lines falling within #intro, #qone, #qtwo, and #qthr. It then
calculates the final tally using the stringi package.
The code chunk that performs this calculation is executed silently at the beginning of this document, but is reproduced below for transparency:
rmd_lines <- readLines("01SMM363-assessment-03.rmd", warn = FALSE)
# ---- Remove YAML front matter ----
yaml_bounds <- which(rmd_lines == "---")
if (length(yaml_bounds) >= 2) {
rmd_lines <- rmd_lines[-(yaml_bounds[1]:yaml_bounds[2])]
}
# ---- Remove code chunks ----
in_chunk <- FALSE
keep <- rep(TRUE, length(rmd_lines))
for (i in seq_along(rmd_lines)) {
if (grepl("^```\\{", rmd_lines[i])) {
in_chunk <- TRUE
keep[i] <- FALSE
next
}
if (in_chunk && grepl("^```\\s*$", rmd_lines[i])) {
in_chunk <- FALSE
keep[i] <- FALSE
next
}
if (in_chunk) keep[i] <- FALSE
}
rmd_lines <- rmd_lines[keep]
# ---- Identify which lines fall inside counted sections ----
# Sections to include (level-1 headers only)
counted_ids <- c("intro", "qone", "qtwo", "qthr")
in_counted_section <- FALSE
include <- rep(FALSE, length(rmd_lines))
for (i in seq_along(rmd_lines)) {
line <- rmd_lines[i]
# Detect any level-1 header (# but not ##)
if (grepl("^#\\s+[^#]", line)) {
# Check if this header belongs to a counted section
in_counted_section <- any(sapply(counted_ids, function(id) {
# [^}]* allows any characters (like spaces, hyphens, classes) before the
# closing }
# this is required to count the words in the unnumbered introduction
grepl(paste0("\\{\\s*#", id, "[^}]*\\}"), line)
}))
next # skip the header line itself
}
if (in_counted_section) include[i] <- TRUE
}
counted_lines <- rmd_lines[include]
# ---- Remove sub-headers (##, ###, etc.) ----
counted_lines <- counted_lines[!grepl("^#{2,}", counted_lines)]
# ---- Remove table lines (markdown pipes) ----
counted_lines <- counted_lines[!grepl("^\\|", counted_lines)]
# ---- Remove figure captions:  ----
counted_lines <- counted_lines[!grepl("^!\\[", counted_lines)]
# ---- Remove fenced div markers (::: name) ----
counted_lines <- counted_lines[!grepl("^:::", counted_lines)]
# ---- Remove LaTeX commands ----
counted_lines <- counted_lines[!grepl("^\\\\", counted_lines)]
# ---- Remove HTML comments ----
counted_lines <- counted_lines[!grepl("^<!--", counted_lines)]
# ---- Remove inline R references like \@ref() ----
counted_lines <- gsub("\\\\@ref\\([^)]*\\)", "", counted_lines)
# ---- Count ----
word_count <- sum(stringi::stri_count_words(counted_lines), na.rm = TRUE)
message("Word count (sections qone, qtwo, qthr, body text only): ", word_count)
# Write for LaTeX preamble
writeLines(
paste0("\\newcommand{\\wordcount}{",
formatC(word_count, big.mark = ","), "}"),
"wordcount.tex"
)

Generative AI tools (Claude Opus 4.6, Sonnet 4.6, ChatGPT GPT-5.3, Google Gemini Pro) were used throughout this project to assist with code generation, debugging, methodological validation, and report structuring.
The final analysis, statistical interpretation, and written content of this report are the work of all group members.