This report analyses annual spending data from 440 clients of a Portuguese
wholesale distributor [1], [2], comprising 298
Hotel/Restaurant/Café (HoReCa) and 142 Retail customers across six product
categories: Fresh,
Milk, Grocery,
Frozen, Detergents_Paper, and
Delicassen. Data-driven
customer segmentation informs inventory planning, pricing strategy, and
sales team deployment, yet the existing classification is a simple binary label
(Channel).
This analysis uses Principal Component Analysis (PCA) and clustering to
investigate whether the existing two-category classification adequately captures
the structure of customer spending, or whether the data supports a more nuanced
segmentation.
A log-transformation was applied to normalise the distribution and address the right-skewness observed across all six spending variables (skewness > 2.5 for all - see Appendix, Figures A.2 & A.3). The variables were then standardised to ensure equal contribution to the PCA, preventing categories with naturally larger absolute expenditures from dominating the model.
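This preprocessing step can be sketched in a few lines of R. The values below are illustrative only (not the actual dataset), and the frame name `spending` is hypothetical:

```r
# Illustrative spending values only -- not the actual dataset
spending <- data.frame(
  Fresh            = c(12669, 7057, 6353),
  Milk             = c(9656, 9810, 8808),
  Grocery          = c(7561, 9568, 7684),
  Frozen           = c(214, 1762, 2405),
  Detergents_Paper = c(2674, 3293, 3516),
  Delicassen       = c(1338, 1776, 7844)
)

spending_log    <- log1p(spending)      # log(x + 1): compresses the right tail
spending_scaled <- scale(spending_log)  # standardise: zero mean, unit variance
```

After `scale()`, every column has mean zero and unit variance, so no category dominates the PCA by virtue of its absolute spend level.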
The first two principal components capture 71.3% of the total variance (PC1:
44.1%; PC2: 27.2%; Figure 1.1(a)). The component loadings in
Figure 1.1(b) reveal distinct commercial purchasing patterns.
PC1 acts as the primary “channel axis,” driven by strong positive loadings for
Grocery (0.571),
Detergents_Paper (0.551), and
Milk (0.542),
contrasted with near-zero or negative loadings for
Fresh and Frozen.
PC2 serves as a “perishable volume axis,” characterised by high positive
loadings for Fresh (0.590), Frozen
(0.590), and Delicassen (0.530), capturing the scale of
fresh food purchasing independent of business type.
Figure 1.1(d) confirms this visually, showing clear
separation between HoReCa and Retail customers along PC1. PCA is restricted to
continuous inputs, necessitating the exclusion of Channel
and Region. A Gower-distance MDS analysis incorporating
these variables (Appendix, A.2) confirms that the categorical
labels dominate the distance metric, validating the decision to analyse spending
variables independently.
Figure 1.1: Principal Component Analysis of log-transformed, standardised spending data. (a) Scree plot showing the percentage of variance explained by each component. (b) Loadings of the original variables on the first four principal components. Values exceeding 0.4 in magnitude indicate strong contributions. (c) Biplot overlaying variable loadings (red arrows) on individual observations. (d) Individuals projected onto PC1-PC2, coloured by distribution channel (1 = HoReCa, 2 = Retail) with 95% concentration ellipses.
Figure 2.1: Revenue concentration by product category under three segmentation approaches. Top: the distributor’s existing binary Channel classification. Middle: K-Means (k=3) spending-based clusters. Bottom: Gaussian Mixture Modelling (GMM) (k=5) spending-based clusters. Crucially, the K-Means and GMM segmentations were derived solely from log-transformed, standardised spending data without access to Channel labels. Each bar shows the percentage of total category spend attributable to each segment and the Total column shows each segment’s share of overall revenue.
K-Means clustering was applied to the same log-transformed, standardised spending data used for PCA. Silhouette analysis identifies k=2 as optimal (0.290), which recovers the known Channel distinction and validates the clustering approach. However, the silhouette score for k=3 remains strong (0.259), and the marginal reduction in cluster quality yields a substantially richer segmentation. Beyond k=3, scores plateau around 0.19 (Appendix, Figure A.5), suggesting that additional clusters fragment genuine structure rather than revealing new segments.
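The silhouette sweep over candidate k can be sketched as follows; synthetic exponential data stands in for the real spending matrix, so the scores themselves will differ from those reported above:

```r
library(cluster)  # silhouette()

set.seed(1)
# Synthetic stand-in for the 440 x 6 log-transformed, standardised data
X <- scale(log1p(matrix(rexp(440 * 6, rate = 1e-4), ncol = 6)))
d <- dist(X)

sil_scores <- sapply(2:10, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])  # average silhouette width
})
names(sil_scores) <- paste0("k=", 2:10)
```

`nstart = 25` restarts K-Means from multiple random initialisations, which guards against a poor local optimum distorting the silhouette comparison.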
K-Means was chosen for its interpretability and computational efficiency on this dataset. As robustness checks, both Ward’s hierarchical clustering and Gaussian Mixture Modelling (GMM) were applied to the same input. Ward’s method (k=3) agrees with K-Means on the dominant cluster structure, with disagreement confined to boundary observations between the smaller segments (Appendix A.4). GMM relaxes K-Means’ spherical cluster assumption: the BIC selected a 5-component VVE model (Appendix, Figure A.6), splitting both channels into finer sub-segments. Cross-tabulation against Channel and Region (Appendix, Tables A.1 & A.3) confirms that these sub-segments are spending-driven rather than geographic, with Region distributed roughly proportionally within each cluster.
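The GMM selection step can be sketched with mclust, which searches over component counts and covariance structures (such as VVE) and selects by BIC automatically. Again, synthetic data stands in for the real spending matrix, so the selected model will differ from the 5-component VVE reported above:

```r
library(mclust)  # Mclust(): BIC-based Gaussian mixture model selection

set.seed(1)
# Synthetic stand-in for the log-transformed, standardised spending data
X <- scale(log1p(matrix(rexp(200 * 6, rate = 1e-4), ncol = 6)))

gmm <- Mclust(X, G = 1:6)  # fit 1..6 components across covariance families
gmm$G          # selected number of components
gmm$modelName  # selected covariance structure (e.g. "VVE")
```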
Figure 2.1 presents the commercial impact of each segmentation. Under the existing binary Channel label, HoReCa customers account for 76% of Fresh revenue but only 68% of customers. K-Means (k=3) identifies a “Mixed Basket” segment (81 customers, 18%) with spending patterns that cut across both channels. GMM further distinguishes Heavy Retail from Diversified Retail, and separates High-volume HoReCa from Balanced HoReCa. Figure 2.2 traces how spending flows between these segmentations, revealing that GMM-3 captures high Fresh and Frozen spenders from multiple K-Means clusters.
Figure 2.2: Sankey diagram of spending flows across segmentations. Each stacked bar shows the percentage of total spending broken down by product category (left), Gaussian Mixture Model (GMM) cluster (middle), and K-Means (k=3) cluster (right). The flows on the left describe how each GMM cluster is composed of spending across product categories; the flows on the right show how the same spending is segmented differently by K-Means. Only flows exceeding 3% are labelled to minimise clutter.
The strength of this approach is that it operates entirely without access to Channel or Region labels. The near-perfect recovery of the Channel distinction validates the segmentation as behaviourally grounded, while the emergence of additional clusters reveals sub-structure the existing classification does not capture. However, the commercial value of a 5-segment model over a 3-segment model cannot be determined from a single cross-sectional snapshot. Validation would require longitudinal purchasing data, margin information, and ideally a controlled trial of differentiated service strategies across segments.
Two Large Language Models (LLMs), Claude (Sonnet 4.6) and ChatGPT (GPT-5.3),
were evaluated. To ensure comparability, identical prompts were used for both
models and neither was given the raw CSV file. The dataset was described in
the first prompt and the LLMs were asked to produce R code for independent
execution. This design choice was deliberate: withholding the data replicated a
commercial setting in which an organisation may not wish to upload proprietary
customer data to a third-party platform. It also ensured that neither model
could adapt its recommendations to the observed distributions, testing whether
each model’s default methodological choices were appropriate for this dataset.
Four prompts were used, included in full in Appendix B.
As shown in Table 3.1, both LLMs correctly excluded Channel and Region, recommended log-transformation and standardisation, and converged on k=3, all without access to the data. Neither model interpreted the PCA loadings with specificity. Claude labelled PC1 as a “general spending scale” rather than the channel axis revealed by the actual loadings, while ChatGPT produced plots without interpreting them. Robustness checks and limitations were not mentioned until explicitly prompted. When asked in the final prompt, both models acknowledged that the data is from a well-known public dataset whose typical outcomes appear extensively in their training data, raising the question of whether the LLMs were performing genuine analytical reasoning or pattern-matching against memorised results.
ChatGPT produced more concise, modular code that executed without errors, while Claude generated comprehensive scripts with richer methodological commentary but several execution errors requiring manual debugging. LLMs demonstrate clear value for code generation and methodological structuring, but their inability to verify outputs against actual data distributions (and their tendency to present memorised patterns as derived findings) limits their current utility and analytical value.
| Analysis Step | Our Analysis | ChatGPT (GPT-5.3) | Claude (Sonnet 4.6) |
|---|---|---|---|
| Preprocessing | |||
| Excluded Channel & Region from PCA/clustering | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Identified skewness / recommended log transform | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Standardised variables before PCA | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| PCA | |||
| Produced interpretable scree plot / variance | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Interpreted PC loadings with specificity | \(\checkmark\) | \(\circ\) | |
| Validated PCA against Channel label | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Clustering | |||
| Justified choice of clustering method | \(\checkmark\) | \(\circ\) | \(\checkmark\) |
| Used formal \(k\) selection (silhouette / elbow) | \(\checkmark\) | \(\checkmark\) | \(\circ\) |
| Identified structure beyond \(k{=}2\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Applied robustness checks unprompted | \(\checkmark\) | ||
| Interpretation | |||
| Commercial framing of segments | \(\checkmark\) | \(\circ\) | \(\checkmark\) |
| Acknowledged limitations unprompted | \(\checkmark\) | ||
| Generated code executed without errors | \(\checkmark\) | \(\checkmark\) | |
| Total | 13/13 | 8/13 (2 partial) | 8/13 (2 partial) |
# Assumes `data` holds the full dataset and `spending.cols` names the six
# spending columns, e.g.
# spending.cols <- c("Fresh", "Milk", "Grocery", "Frozen",
#                    "Detergents_Paper", "Delicassen")
library(ggplot2)    # plotting
library(patchwork)  # the `/` operator used to stack p_box over p_bar
box_data <- reshape(
  data[, spending.cols],
direction = "long",
varying = spending.cols,
v.names = "Spend",
timevar = "Category",
times = spending.cols
)
cat_totals <- data.frame(
Category = spending.cols,
Total = colSums(data[, spending.cols]),
Median = apply(data[, spending.cols], 2, median)
)
# option to sort by median vs total;
# BE note: have chosen median here to highlight the difference in median spend
# despite large total gap between bottom two categories
cat_order <- levels(reorder(cat_totals$Category, cat_totals$Median))
# cat_order <- levels(reorder(cat_totals$Category, cat_totals$Total))
box_data$Category <- factor(box_data$Category, levels = cat_order)
cat_totals$Category <- factor(cat_totals$Category, levels = cat_order)
row.names(box_data) <- NULL
p_box <- ggplot(
box_data,
aes(x = Category, y = Spend)
) +
geom_boxplot(fill = "steelblue", alpha = 0.7) +
coord_flip() +
scale_y_continuous(
labels = scales::comma,
limits = c(0, NA),
expand = expansion(mult = c(0, 0.05))
) +
labs(
title = "(a) Spending Distribution per Customer",
y = "Annual Spend (m.u.)",
x = ""
) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
p_bar <- ggplot(cat_totals, aes(x = Category, y = Total)) +
geom_col(fill = "steelblue", alpha = 0.7, width = 0.7) +
geom_text(
aes(label = format(round(Total), big.mark = ",")),
hjust = -0.1,
size = 3.5
) +
coord_flip() +
scale_y_continuous(
labels = scales::comma,
expand = expansion(mult = c(0, 0.15))
) +
labs(
title = "(b) Total Category Spend",
y = "Total Spend (m.u.)",
x = ""
) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
p_box / p_bar
Figure A.1: Customer spending profiles across six wholesale categories. Top: boxplots showing the highly skewed distribution of individual annual spend. Bottom: the total aggregate volume, highlighting Fresh and Grocery as the largest product segments. Both plots are sorted by median spend.
Figure A.1 provides a baseline overview of purchasing
behaviour across the six categories. Median spend is relatively modest in all
categories, but every distribution exhibits a long right tail of high-value
customers. At the aggregate level, Fresh and
Grocery dominate total revenue, while
Delicassen remains a niche segment.
Figure A.2: Shape comparison of raw and log-transformed spending distributions by product category. To facilitate direct comparison of the distribution shapes, all spend values were min-max scaled to a 0–1 range within their respective categories. The raw data (blue) demonstrates a right skew across all categories, influenced by extreme outliers. The log1p transformation (red) effectively compresses this long tail, yielding more symmetric distributions that ensure variance-based methods (PCA) and distance-based methods (clustering) are not dominated by extreme outliers.
This skewness is examined more closely in Figure A.2,
which overlays raw and log-transformed distributions for each category. The raw
spending figures exhibit a strong rightward skew across all six product
categories, indicating that while most customers spend relatively small amounts,
a long tail of high-spending customers dominates the scale. Applying a log
transformation (log1p) successfully ‘pulls in’ these extreme values, centring
the bulk of the data and producing much more symmetric distributions.
library(corrplot)  # correlation heatmaps
# Assumes `spending` (raw) and `spending_log` (log1p-transformed) data frames
par(mfrow = c(1, 2))
soft_palette <- colorRampPalette(c("#D73027", "#F46D43", "#FFFFFF", "#74ADD1", "#4575B4"))(200)
corrplot(
cor(spending),
method = "color",
type = "upper",
col = soft_palette,
addCoef.col = "black",
number.cex = 0.8,
tl.col = "black",
title = "Raw spending",
mar = c(0, 0, 2, 0)
)
corrplot(
cor(spending_log),
method = "color",
type = "upper",
col = soft_palette,
addCoef.col = "black",
number.cex = 0.8,
tl.col = "black",
title = "Log-transformed spending",
mar = c(0, 0, 2, 0)
)
Figure A.3: Pairwise correlation matrices of the six product categories, comparing linear relationships using raw spending data (left) versus log-transformed spending data (right).
The side-by-side correlation plots reveal the underlying purchasing
relationships between product categories before and after adjusting for scale.
Both matrices highlight a distinct cluster of highly correlated spending:
Grocery, Detergents_Paper, and
Milk show strong positive linear relationships,
suggesting they are frequently purchased together in bulk. While the log
transformation preserves this core structure, it slightly softens the most
extreme correlations (for instance, the correlation between
Grocery and Detergents_Paper
drops from 0.92
to 0.80).
This indicates that in the raw data, a few extreme outlier purchases were
artificially inflating the strength of that linear relationship.
Figure A.4: Non-metric MDS using Gower’s distance across all eight variables (six log-transformed continuous spending categories plus Channel and Region). Points are coloured by Channel (left) and Region (right). Unlike PCA, Gower’s distance forces distinct spatial separation based on categorical mismatches, resulting in six artificial clusters corresponding to the cross-tabulation of the two Channels and three Regions.
Figure A.4 presents a non-metric MDS solution computed
from Gower’s distance across all eight variables, incorporating the categorical
Channel and Region columns excluded
from the PCA. Because Gower’s metric treats any categorical mismatch as a
maximal distance contribution on that variable, the resulting spatial map is
dominated by these labels rather than underlying spending behaviour.
Panel (a) shows separation by Channel along the primary
axis (MDS1), while Panel (b) reveals further stratification by
Region along MDS2. The result is a forced stratification
into six spatial clusters reflecting the Channel \(\times\) Region
cross-tabulation. While there is minor regional intermingling (e.g., two Oporto
customers mapping near the Lisbon cluster), the overall spatial map merely
reflects the cross-tabulation of the categorical labels rather than organic,
dataset-wide purchasing patterns. The solution’s stress value of
14.5 further indicates a
mediocre fit, falling above the conventional threshold of 10 for reliable
low-dimensional representation. This confirms that the PCA approach, relying
purely on continuous spending variables, is better suited to uncovering natural
commercial structure in this dataset.
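The Gower-distance MDS described above can be sketched as follows, with synthetic data standing in for the six log-spending columns plus the two categorical labels:

```r
library(cluster)  # daisy(): Gower dissimilarity for mixed data types
library(MASS)     # isoMDS(): non-metric multidimensional scaling

set.seed(1)
# Synthetic stand-in: six continuous columns plus two categorical labels
df <- data.frame(
  matrix(log1p(rexp(100 * 6, rate = 1e-4)), ncol = 6),
  Channel = factor(sample(1:2, 100, replace = TRUE)),
  Region  = factor(sample(1:3, 100, replace = TRUE))
)

gower_d <- daisy(df, metric = "gower")        # mixed-type dissimilarity
mds_fit <- isoMDS(as.matrix(gower_d), k = 2)  # 2-D non-metric solution
mds_fit$stress  # stress (%); values above ~10 suggest a mediocre fit
```

Because `daisy()` gives each categorical mismatch a full unit contribution on that variable, the two factor columns dominate the resulting map, which is the behaviour reported in Figure A.4.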
Figure A.5: Silhouette analysis for K-Means clustering (k=2 to 10). The optimal number of clusters is k=2 followed by k=3 as a strong second. Beyond k=3, scores plateau around 0.19.
Figure A.6: BIC values across component counts for all model types evaluated by mclust. Higher (less negative) BIC indicates better fit. The BIC peaks at 5 components under the VVE (ellipsoidal, variable volume and shape, equal orientation) model.
Silhouette analysis (Figure A.5) identifies k=2 as optimal (0.290), with k=3 as a close second (0.259). Beyond k=3, scores plateau around 0.19.
The GMM’s BIC (Figure A.6) peaks at 5 components (-6,126.0), compared to -6,212.0 for four and -6,186.6 for six. Interestingly, k=4 scores worse than k=3, suggesting the data supports either 3 or 5 components but not an intermediate value. Both methods agree that k=3 is a meaningful threshold, and the disagreement on whether to proceed to k=5 reflects a genuine trade-off between parsimony and granularity.
| Channel | Lisbon | Oporto | Other |
|---|---|---|---|
| KM-1 (HoReCa Dominant) | |||
| 1 | 39 | 25 | 145 |
| 2 | 1 | 1 | 1 |
| KM-2 (Retail Dominant) | |||
| 1 | 13 | 2 | 41 |
| 2 | 10 | 9 | 72 |
| KM-3 (Mixed Basket) | |||
| 1 | 7 | 1 | 25 |
| 2 | 7 | 9 | 32 |
| Channel | Lisbon | Oporto | Other |
|---|---|---|---|
| GMM-1 (Heavy Retail) | |||
| 1 | 1 | 0 | 6 |
| 2 | 7 | 8 | 29 |
| GMM-2 (Diversified Retail) | |||
| 1 | 5 | 0 | 13 |
| 2 | 9 | 10 | 64 |
| GMM-3 (High-volume HoReCa) | |||
| 1 | 34 | 15 | 100 |
| 2 | 1 | 1 | 4 |
| GMM-4 (Balanced HoReCa) | |||
| 1 | 12 | 10 | 55 |
| 2 | 1 | 0 | 3 |
| GMM-5 (Low-spend / Mixed) | |||
| 1 | 7 | 3 | 37 |
| 2 | 0 | 0 | 5 |
Table A.1 shows that K-Means (k=3) produces one
cluster that is almost exclusively HoReCa
(KM-1, “HoReCa Dominant”:
209 of
212 customers), one that leans
Retail (KM-2, “Retail Dominant”:
91 of
147), and a third
(KM-3, “Mixed Basket”) that draws substantially
from both channels
(33 HoReCa,
48 Retail). The “Retail Dominant”
label reflects the spending profile rather than exclusive Channel membership:
these customers concentrate on Grocery and
Detergents_Paper regardless of their administrative
Channel label.
| Cluster | Label | \(n\) | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|
| K-Means (k = 3) | ||||||||
| KM-1 | HoReCa Dominant | 212 | 9,635 | 1,598 | 2,151 | 2,151 | 274 | 686 |
| KM-2 | Retail Dominant | 147 | 12,126 | 7,184 | 9,965 | 2,005 | 3,378 | 2,005 |
| KM-3 | Mixed Basket | 81 | 1,454 | 6,264 | 10,487 | 388 | 4,196 | 436 |
| Gaussian Mixture Model (k = 5) | ||||||||
| GMM-1 | Heavy Retail | 51 | 1,454 | 7,845 | 13,567 | 283 | 6,830 | 716 |
| GMM-2 | Diversified Retail | 101 | 7,842 | 7,243 | 11,238 | 1,389 | 4,621 | 1,625 |
| GMM-3 | High-volume HoReCa | 155 | 14,276 | 1,786 | 2,405 | 3,383 | 325 | 933 |
| GMM-4 | Balanced HoReCa | 81 | 5,969 | 3,218 | 3,133 | 907 | 637 | 656 |
| GMM-5 | Low-spend / Mixed | 52 | 2,270 | 1,572 | 3,712 | 892 | 174 | 284 |
| Cluster | Label | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|
| K-Means (k = 3) | |||||||
| KM-1 | HoReCa Dominant | 58.4 | 9.7 | 13.0 | 13.0 | 1.7 | 4.2 |
| KM-2 | Retail Dominant | 33.1 | 19.6 | 27.2 | 5.5 | 9.2 | 5.5 |
| KM-3 | Mixed Basket | 6.3 | 27.0 | 45.2 | 1.7 | 18.1 | 1.9 |
| Gaussian Mixture Model (k = 5) | |||||||
| GMM-1 | Heavy Retail | 4.7 | 25.6 | 44.2 | 0.9 | 22.3 | 2.3 |
| GMM-2 | Diversified Retail | 23.1 | 21.3 | 33.1 | 4.1 | 13.6 | 4.8 |
| GMM-3 | High-volume HoReCa | 61.8 | 7.7 | 10.4 | 14.6 | 1.4 | 4.0 |
| GMM-4 | Balanced HoReCa | 41.1 | 22.2 | 21.6 | 6.2 | 4.4 | 4.5 |
| GMM-5 | Low-spend / Mixed | 25.5 | 17.7 | 41.7 | 10.0 | 2.0 | 3.2 |
| Cluster | Label | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|
| K-Means (k = 3) | |||||||
| KM-1 | HoReCa Dominant | 48.2 | 16.6 | 15.1 | 51.4 | 7.1 | 28.2 |
| KM-2 | Retail Dominant | 47.4 | 60.9 | 56.0 | 44.9 | 57.8 | 62.4 |
| KM-3 | Mixed Basket | 4.4 | 22.5 | 28.8 | 3.6 | 35.1 | 9.4 |
| Gaussian Mixture Model (k = 5) | |||||||
| GMM-1 | Heavy Retail | 3.1 | 18.0 | 24.0 | 2.1 | 31.5 | 9.0 |
| GMM-2 | Diversified Retail | 22.2 | 42.1 | 43.4 | 15.0 | 52.9 | 31.5 |
| GMM-3 | High-volume HoReCa | 51.9 | 15.7 | 13.9 | 58.6 | 5.7 | 31.7 |
| GMM-4 | Balanced HoReCa | 13.2 | 11.5 | 9.3 | 10.5 | 5.1 | 12.4 |
| GMM-5 | Low-spend / Mixed | 9.6 | 12.8 | 9.4 | 13.8 | 4.7 | 15.4 |
Tables A.5, A.7, and A.9 present the cluster spending profiles from three perspectives.
The K-Means “HoReCa Dominant” cluster
(KM-1) allocates 58.4% of
its median basket
(Table A.7) to Fresh and accounts
for 48.2% of total Fresh revenue
(Table A.9),
while “Retail Dominant” (KM-2) concentrates on
Grocery
(27.2%) and
Detergents_Paper
(9.2%) but comprises only
147 customers
(33%
of the base). At the GMM level, the distinction sharpens further: the median
GMM-3 customer (“High-volume HoReCa”) dedicates 61.8%
of its basket
to Fresh and, despite representing
35% of customers, captures
51.9% of total Fresh revenue. The Retail split is
equally informative: the typical GMM-1 customer (“Heavy Retail”)
concentrates on Grocery (44.2%)
and Detergents_Paper
(22.3%), while the typical GMM-2 customer
(“Diversified Retail”) spreads spend more evenly across categories including
Fresh (23.1%), suggesting a
convenience-store or diversified retailer profile.
Figure A.7: Cluster assignments from K-Means (k=3, top) and GMM (k=5, bottom) projected onto the first two principal components, which together capture 71.3% of total variance. All clustering was performed in the full six-dimensional standardised log-spending space. Apparent overlap between clusters may reflect the loss of separating information in the remaining 28.7% of variance captured by PC3–PC6, rather than genuine ambiguity in cluster assignment.
Figure A.7 projects the K-Means and GMM cluster assignments onto PCA space. Both panels show that the dominant cluster boundaries divide the data along PC1 (the “channel axis” identified in Section 1), confirming that the clustering and PCA analyses converge on the same dominant structure. The GMM panel shows finer subdivision, particularly along PC2, consistent with GMM’s ability to model ellipsoidal clusters of varying density.
| | Ward 1 | Ward 2 | Ward 3 |
|---|---|---|---|
| K-Means 1 | 0 | 201 | 11 |
| K-Means 2 | 91 | 50 | 6 |
| K-Means 3 | 34 | 11 | 36 |
Table A.11 cross-tabulates K-Means (k=3) and Ward’s hierarchical clustering (k=3) assignments. The largest cluster shows near-perfect agreement (201 of 212 observations). Disagreement is concentrated among boundary customers between the two smaller segments, reflecting the different optimisation strategies. K-Means minimises within-cluster sum of squares globally, while Ward’s method does so greedily through successive agglomeration.
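The agreement check behind Table A.11 can be sketched as follows, with a synthetic stand-in for the spending matrix (the counts will therefore differ from the table above):

```r
set.seed(1)
# Synthetic stand-in for the 440 x 6 log-transformed, standardised data
X <- scale(log1p(matrix(rexp(440 * 6, rate = 1e-4), ncol = 6)))

km_cl   <- kmeans(X, centers = 3, nstart = 25)$cluster            # partitional
ward_cl <- cutree(hclust(dist(X), method = "ward.D2"), k = 3)     # agglomerative

table(KMeans = km_cl, Ward = ward_cl)  # cross-tabulation of assignments
```

`ward.D2` is the hclust method matching Ward's original criterion on squared Euclidean distances, which makes it the natural comparator for K-Means.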
A consistent visual format is used to distinguish between prompts sent to the LLM, its responses, and our commentary.
Prompts to the LLM - shown with a right-hand border. These are the instructions and questions we provided to the model.
LLM Response - shown with a left-hand border, forming a conversational flow with the prompt. Responses may include a mix of explanation and code:
Our group insight and critical evaluation - observations or key findings drawn from the output (i.e., what we found).
Caution: points requiring verification - potential concerns, limitations, or caveats with the generated output (i.e., what we question).
The R code generated by each LLM was executed programmatically within this
document using a sandboxed runtime environment. Each script was run in an
isolated R subprocess via the callr package, ensuring that no libraries,
variables, or side effects from one script could contaminate the main document
environment or subsequent scripts. Console output and diagnostic messages were
captured and reproduced verbatim, while graphical output was intercepted via a
PNG device and embedded automatically. Any runtime errors were caught and
reported inline rather than halting the document compilation.
This approach was used to ensure that the displayed output faithfully reflected what each LLM-generated script produced when executed from a clean R session.
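The isolation mechanism can be sketched with callr; the function and arguments below are illustrative, not the actual harness used for the LLM scripts:

```r
library(callr)

# Evaluate a function in a fresh R subprocess: no libraries, globals, or
# options leak between the parent session and the child process
out <- callr::r(function(x) x^2, args = list(x = 9))
out  # 81, computed in the isolated subprocess
```

Errors raised in the subprocess surface as catchable conditions in the parent session, which is what allows runtime failures to be reported inline rather than halting document compilation.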
The full chat history is available via:
Role : You are a data scientist working in a company that does wholesale distribution to clients located in Portugal.
task : Your company is looking to implement a Data-driven customer segmentation to inform inventory planning, pricing strategy, and sales team deployment.
I have a dataset called Wholesale_customers_data.csv located at
.\Wholesale_customers_data.csv. It contains 440 rows and 8 columns: Channel
(1=Hotel/Restaurant/Café, 2=Retail), Region (1=Lisbon, 2=Oporto, 3=Other), and
six continuous spending variables (Fresh, Milk, Grocery, Frozen,
Detergents_Paper, Delicassen) representing annual spending in monetary units.
Please write R code to perform PCA on this dataset. Include any preprocessing
you think is necessary.
> Check for relevant skills
No file uploaded, but I have a clear description of the dataset. Let me write comprehensive R code for PCA on this wholesale customers dataset.
> Created a file, read a file
Here’s a walkthrough of every decision made in the script:
Pre-processing Rationale
Why separate Channel & Region from the PCA? They are categorical identifiers, not spending measurements. Mixing measurement scales would corrupt the covariance structure. They are kept as labels to colour the biplots afterwards, which is far more analytically useful.
Why log1p-transform before PCA? Wholesale spend data is almost always severely right-skewed (a few very large clients dominate). PCA is sensitive to scale and variance, so without this step a single high-spend outlier can hijack the first principal component. log1p (log(x+1)) compresses the long tail and handles any zero values gracefully.
Why scale. = TRUE inside prcomp? Even after log-transformation the six variables have different magnitudes (e.g., Fresh spending is typically much larger than Delicassen). Standardising to unit variance ensures each variable contributes equally to the decomposition.
Why run KMO & Bartlett’s tests? These confirm that the correlation structure is strong enough to make PCA meaningful. Bartlett’s test (H₀: correlation matrix = identity) should be rejected (p < 0.05), and KMO > 0.6 indicates adequate sampling adequacy.
What the Script Produces
- prcomp PCA object, full rotation matrix & summary
- wholesale_pca_scores.csv
What to Expect Analytically
Based on known patterns in this dataset:
Together, ~2–3 PCs typically explain 75–85% of variance, making the PC scores an excellent lower-dimensional input for your subsequent K-Means or Hierarchical clustering step.
This response included the following R code (as a separate downloadable file):
# =============================================================================
# PCA Analysis: Wholesale Customers Dataset
# Goal: Customer segmentation to inform inventory planning, pricing strategy,
# and sales team deployment
# =============================================================================
# --- 0. Install & Load Libraries ---------------------------------------------
required_packages <- c(
"tidyverse",
"corrplot",
"factoextra",
"FactoMineR",
"gridExtra",
"scales",
"ggrepel",
"psych"
)
installed <- rownames(installed.packages())
to_install <- required_packages[!required_packages %in% installed]
if (length(to_install) > 0) {
install.packages(to_install, dependencies = TRUE)
}
library(tidyverse)
library(corrplot)
library(factoextra)
library(FactoMineR)
library(gridExtra)
library(scales)
library(ggrepel)
library(psych)
# --- 1. Load Data ------------------------------------------------------------
data_path <- "Wholesale_customers_data.csv"
df_raw <- read.csv(data_path, stringsAsFactors = FALSE)
cat("=== Dataset Overview ===\n")
cat("Dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n\n")
print(head(df_raw))
cat("\n")
print(summary(df_raw))
# --- 2. Pre-processing -------------------------------------------------------
# 2.1 Recode categorical variables as factors (kept aside from PCA)
df <- df_raw %>%
mutate(
Channel = factor(Channel, levels = c(1, 2), labels = c("HoReCa", "Retail")),
Region = factor(
Region,
levels = c(1, 2, 3),
labels = c("Lisbon", "Oporto", "Other")
)
)
# 2.2 Isolate the six continuous spending variables for PCA
spend_vars <- c(
"Fresh",
"Milk",
"Grocery",
"Frozen",
"Detergents_Paper",
"Delicassen"
)
df_spend <- df[, spend_vars]
# 2.3 Check for missing values
cat("\n=== Missing Values per Column ===\n")
print(colSums(is.na(df_spend)))
# 2.4 Check for outliers via boxplots (raw scale)
par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
for (v in spend_vars) {
boxplot(
df_spend[[v]],
main = v,
ylab = "Annual Spend (m.u.)",
col = "#4E79A7",
border = "#2C4770",
outline = TRUE
)
}
mtext(
"Raw Spending Distributions – Outlier Inspection",
side = 3,
line = -2,
outer = TRUE,
cex = 1.1,
font = 2
)
par(mfrow = c(1, 1))
# 2.5 Assess skewness – highly skewed data warrants log transformation
cat("\n=== Skewness of Raw Spending Variables ===\n")
skew_raw <- apply(df_spend, 2, function(x) psych::skew(x))
print(round(skew_raw, 3))
# 2.6 Log-transform to reduce right skew (common for monetary spend data)
# Adding 1 to guard against any zero values
df_log <- log1p(df_spend)
cat("\n=== Skewness After log1p Transformation ===\n")
skew_log <- apply(df_log, 2, function(x) psych::skew(x))
print(round(skew_log, 3))
# 2.7 Visualise distributions: raw vs log-transformed
p_list <- list()
for (v in spend_vars) {
p_raw <- ggplot(df_spend, aes_string(x = v)) +
geom_histogram(fill = "#4E79A7", colour = "white", bins = 30) +
labs(title = paste(v, "(raw)"), x = NULL, y = "Count") +
theme_minimal(base_size = 9)
p_log <- ggplot(df_log, aes_string(x = v)) +
geom_histogram(fill = "#F28E2B", colour = "white", bins = 30) +
labs(title = paste(v, "(log1p)"), x = NULL, y = "Count") +
theme_minimal(base_size = 9)
p_list <- c(p_list, list(p_raw, p_log))
}
grid.arrange(
grobs = p_list,
ncol = 4,
top = "Distribution: Raw vs log1p-transformed Spending"
)
# 2.8 Correlation matrix on log-transformed data
cat("\n=== Pearson Correlation Matrix (log-transformed) ===\n")
cor_mat <- cor(df_log, method = "pearson")
print(round(cor_mat, 3))
corrplot(
cor_mat,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 0.75,
tl.col = "black",
tl.srt = 45,
col = colorRampPalette(c("#2166AC", "white", "#B2182B"))(200),
title = "Correlation Matrix – log1p Spending Variables",
mar = c(0, 0, 2, 0)
)
# 2.9 KMO & Bartlett's test – check suitability for PCA
cat("\n=== KMO Measure of Sampling Adequacy ===\n")
kmo_result <- KMO(cor_mat)
print(kmo_result)
cat("\n=== Bartlett's Test of Sphericity ===\n")
bart_result <- cortest.bartlett(cor_mat, n = nrow(df_log))
print(bart_result)
# --- 3. Principal Component Analysis ----------------------------------------
# prcomp with scale. = TRUE standardises the log-transformed variables
# (zero mean, unit variance) – essential so no single variable dominates
pca_result <- prcomp(df_log, center = TRUE, scale. = TRUE)
cat("\n=== PCA Summary ===\n")
print(summary(pca_result))
cat("\n=== Loadings (Rotation Matrix) ===\n")
print(round(pca_result$rotation, 4))
# --- 4. Variance Explained ---------------------------------------------------
# 4.1 Eigenvalues
eig_vals <- pca_result$sdev^2
prop_var <- eig_vals / sum(eig_vals)
cum_var <- cumsum(prop_var)
variance_table <- data.frame(
PC = paste0("PC", seq_along(eig_vals)),
Eigenvalue = round(eig_vals, 4),
Proportion_Var = round(prop_var, 4),
Cumulative_Var = round(cum_var, 4)
)
cat("\n=== Variance Explained Table ===\n")
print(variance_table)
# 4.2 Scree plot with Kaiser criterion line (eigenvalue = 1)
fviz_eig(
pca_result,
addlabels = TRUE,
ylim = c(0, 55),
barfill = "#4E79A7",
barcolor = "#2C4770",
linecolor = "#E15759",
ggtheme = theme_minimal()
) +
geom_hline(
yintercept = 100 / length(spend_vars), # Kaiser: eigenvalue = 1
linetype = "dashed",
colour = "red",
linewidth = 0.8
) +
annotate(
"text",
x = 5.5,
y = 100 / length(spend_vars) + 1.5,
label = "Kaiser criterion (eigenvalue = 1)",
colour = "red",
size = 3.5
) +
labs(
title = "Scree Plot – Wholesale Customers PCA",
subtitle = "Dashed line = Kaiser criterion"
)
# 4.3 Cumulative variance plot
ggplot(variance_table, aes(x = PC, y = Cumulative_Var * 100, group = 1)) +
geom_line(colour = "#E15759", linewidth = 1.2) +
geom_point(size = 3, colour = "#E15759") +
geom_hline(yintercept = 80, linetype = "dashed", colour = "gray40") +
geom_hline(yintercept = 90, linetype = "dotted", colour = "gray40") +
annotate(
"text",
x = 5.6,
y = 81,
label = "80%",
size = 3.5,
colour = "gray30"
) +
annotate(
"text",
x = 5.6,
y = 91,
label = "90%",
size = 3.5,
colour = "gray30"
) +
scale_y_continuous(limits = c(0, 100), labels = function(x) paste0(x, "%")) +
labs(
title = "Cumulative Variance Explained by PCs",
subtitle = "Dashed = 80%, Dotted = 90% thresholds",
x = "Principal Component",
y = "Cumulative Variance (%)"
) +
theme_minimal(base_size = 12)
# --- 5. Visualise Loadings ---------------------------------------------------
# 5.1 Loading heatmap for selected PCs
n_pcs <- sum(eig_vals >= 1) # Kaiser rule
loadings_df <- as.data.frame(pca_result$rotation[, 1:n_pcs]) %>%
rownames_to_column("Variable") %>%
pivot_longer(-Variable, names_to = "PC", values_to = "Loading")
ggplot(loadings_df, aes(x = PC, y = Variable, fill = Loading)) +
geom_tile(colour = "white", linewidth = 0.5) +
geom_text(aes(label = round(Loading, 2)), size = 3.5, colour = "black") +
scale_fill_gradient2(
low = "#2166AC",
mid = "white",
high = "#B2182B",
midpoint = 0,
limits = c(-1, 1)
) +
labs(
title = "PCA Loading Heatmap",
subtitle = paste("First", n_pcs, "PCs (Kaiser criterion: eigenvalue ≥ 1)"),
x = NULL,
y = NULL,
fill = "Loading"
) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(face = "bold"))
# 5.2 Variable contribution plot (cos²)
fviz_cos2(
pca_result,
choice = "var",
axes = 1:2,
fill = "#4E79A7",
color = "#2C4770",
ggtheme = theme_minimal()
) +
labs(
title = "Variable cos² on PC1 & PC2",
subtitle = "Quality of representation on the first two components"
)
# --- 6. Biplots & Score Plots ------------------------------------------------
# 6.1 Biplot – coloured by Channel
scores_df <- as.data.frame(pca_result$x) %>%
mutate(Channel = df$Channel, Region = df$Region)
# Build arrow data for loadings overlay
arrows_df <- as.data.frame(pca_result$rotation[, 1:2]) %>%
rownames_to_column("Variable")
scale_factor <- 3.5 # scale arrows to fit score space
ggplot(scores_df, aes(x = PC1, y = PC2, colour = Channel, shape = Channel)) +
geom_point(alpha = 0.7, size = 2.5) +
geom_segment(
data = arrows_df,
aes(x = 0, y = 0, xend = PC1 * scale_factor, yend = PC2 * scale_factor),
arrow = arrow(length = unit(0.25, "cm")),
colour = "black",
linewidth = 0.7,
inherit.aes = FALSE
) +
geom_label_repel(
data = arrows_df,
aes(x = PC1 * scale_factor, y = PC2 * scale_factor, label = Variable),
size = 3.5,
colour = "black",
box.padding = 0.4,
inherit.aes = FALSE
) +
scale_colour_manual(values = c("HoReCa" = "#E15759", "Retail" = "#4E79A7")) +
labs(
title = "PCA Biplot – PC1 vs PC2",
subtitle = "Coloured by Channel (HoReCa vs Retail)",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
colour = "Channel",
shape = "Channel"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
# 6.2 Biplot – coloured by Region
ggplot(scores_df, aes(x = PC1, y = PC2, colour = Region, shape = Region)) +
geom_point(alpha = 0.7, size = 2.5) +
geom_segment(
data = arrows_df,
aes(x = 0, y = 0, xend = PC1 * scale_factor, yend = PC2 * scale_factor),
arrow = arrow(length = unit(0.25, "cm")),
colour = "black",
linewidth = 0.7,
inherit.aes = FALSE
) +
geom_label_repel(
data = arrows_df,
aes(x = PC1 * scale_factor, y = PC2 * scale_factor, label = Variable),
size = 3.5,
colour = "black",
box.padding = 0.4,
inherit.aes = FALSE
) +
scale_colour_manual(
values = c("Lisbon" = "#59A14F", "Oporto" = "#F28E2B", "Other" = "#B07AA1")
) +
labs(
title = "PCA Biplot – PC1 vs PC2",
subtitle = "Coloured by Region",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
colour = "Region",
shape = "Region"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
# 6.3 PC1 vs PC3 (if relevant)
if (n_pcs >= 3) {
p_pc13 <- ggplot(scores_df, aes(x = PC1, y = PC3, colour = Channel, shape = Channel)) +
geom_point(alpha = 0.7, size = 2.5) +
scale_colour_manual(
values = c("HoReCa" = "#E15759", "Retail" = "#4E79A7")
) +
labs(
title = "PCA Score Plot – PC1 vs PC3",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
y = paste0("PC3 (", round(prop_var[3] * 100, 1), "%)")
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
# Explicit print: a ggplot built inside an if block is not auto-printed when sourced
print(p_pc13)
}
# --- 7. Extract PC Scores for Downstream Use ---------------------------------
# Attach PC scores back to full dataset
df_with_scores <- bind_cols(df, as.data.frame(pca_result$x))
cat("\n=== PC Scores – First 6 Rows ===\n")
print(head(df_with_scores[, c("Channel", "Region", "PC1", "PC2", "PC3")]))
# Save enriched dataset
output_path <- "wholesale_pca_scores.csv"
write.csv(df_with_scores, output_path, row.names = FALSE)
cat("\nPC scores saved to:", output_path, "\n")
# --- 8. Interpretation Summary -----------------------------------------------
cat("\n")
cat("========================================================\n")
cat(" PCA INTERPRETATION SUMMARY \n")
cat("========================================================\n")
cat(sprintf(" Components retained (Kaiser, eigenvalue ≥ 1): %d\n", n_pcs))
cat(sprintf(
" Variance explained by retained PCs: %.1f%%\n",
cum_var[n_pcs] * 100
))
cat("\n PC1 – 'Overall Spend Scale'\n")
cat(" Likely captures total spending power / business size.\n")
cat(" High positive loadings across most categories.\n")
cat("\n PC2 – 'Fresh vs Packaged Goods Axis'\n")
cat(" Separates Fresh-heavy buyers (HoReCa) from\n")
cat(" Grocery/Detergent-heavy buyers (Retail).\n")
cat("\n PC3 – 'Speciality/Frozen Dimension' (if retained)\n")
cat(" May differentiate niche buyers (Delicassen, Frozen).\n")
cat("\n Business Implication:\n")
cat(" PC scores can directly feed K-Means / Hierarchical\n")
cat(" clustering for actionable customer segmentation.\n")
cat("========================================================\n")

The (unedited) outputs generated by running this code are as follows:
=== Dataset Overview ===
Dimensions: 440 rows x 8 columns
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
Channel Region Fresh Milk
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
Median :1.000 Median :3.000 Median : 8504 Median : 3627
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
Grocery Frozen Detergents_Paper Delicassen
Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
=== Missing Values per Column ===
Fresh Milk Grocery Frozen
0 0 0 0
Detergents_Paper Delicassen
0 0
=== Skewness of Raw Spending Variables ===
Fresh Milk Grocery Frozen
2.544 4.026 3.563 5.868
Detergents_Paper Delicassen
3.607 11.076
=== Skewness After log1p Transformation ===
Fresh Milk Grocery Frozen
-1.565 -0.223 -0.670 -0.350
Detergents_Paper Delicassen
-0.234 -1.084
=== Pearson Correlation Matrix (log-transformed) ===
Fresh Milk Grocery Frozen Detergents_Paper Delicassen
Fresh 1.000 -0.021 -0.133 0.386 -0.159 0.256
Milk -0.021 1.000 0.761 -0.055 0.679 0.342
Grocery -0.133 0.761 1.000 -0.165 0.797 0.240
Frozen 0.386 -0.055 -0.165 1.000 -0.213 0.256
Detergents_Paper -0.159 0.679 0.797 -0.213 1.000 0.168
Delicassen 0.256 0.342 0.240 0.256 0.168 1.000
=== KMO Measure of Sampling Adequacy ===
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor_mat)
Overall MSA = 0.73
MSA for each item =
Fresh Milk Grocery Frozen
0.65 0.78 0.70 0.66
Detergents_Paper Delicassen
0.76 0.71
=== Bartlett's Test of Sphericity ===
$chisq
[1] 1050.819
$p.value
[1] 1.714465e-214
$df
[1] 15
=== PCA Summary ===
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1.6262 1.2774 0.8012 0.7786 0.54090 0.42774
Proportion of Variance 0.4408 0.2720 0.1070 0.1010 0.04876 0.03049
Cumulative Proportion 0.4408 0.7127 0.8197 0.9207 0.96951 1.00000
=== Loadings (Rotation Matrix) ===
PC1 PC2 PC3 PC4 PC5 PC6
Fresh -0.1052 0.5904 -0.6396 0.4786 -0.0403 -0.0259
Milk 0.5423 0.1329 -0.0736 -0.0618 0.7600 0.3185
Grocery 0.5715 -0.0058 -0.1333 -0.0965 -0.0929 -0.7985
Frozen -0.1379 0.5898 -0.0212 -0.7920 -0.0732 0.0049
Detergents_Paper 0.5506 -0.0708 -0.1995 -0.0828 -0.6204 0.5101
Delicassen 0.2145 0.5300 0.7263 0.3516 -0.1481 0.0031
=== Variance Explained Table ===
PC Eigenvalue Proportion_Var Cumulative_Var
1 PC1 2.6447 0.4408 0.4408
2 PC2 1.6317 0.2719 0.7127
3 PC3 0.6419 0.1070 0.8197
4 PC4 0.6062 0.1010 0.9207
5 PC5 0.2926 0.0488 0.9695
6 PC6 0.1830 0.0305 1.0000
=== PC Scores – First 6 Rows ===
Channel Region PC1 PC2 PC3
1 Retail Other 1.3788133 -0.3043804 -0.24078762
2 Retail Other 1.4372336 0.5368571 0.08413653
3 Retail Other 1.5117317 1.2311631 0.98418563
4 HoReCa Other -0.8241672 1.2094129 0.25185392
5 Retail Other 0.8019842 1.7640380 0.31234246
6 Retail Other 0.8851579 0.1297465 0.01892823
PC scores saved to: wholesale_pca_scores.csv
========================================================
PCA INTERPRETATION SUMMARY
========================================================
Components retained (Kaiser, eigenvalue ≥ 1): 2
Variance explained by retained PCs: 71.3%
PC1 – 'Overall Spend Scale'
Likely captures total spending power / business size.
High positive loadings across most categories.
PC2 – 'Fresh vs Packaged Goods Axis'
Separates Fresh-heavy buyers (HoReCa) from
Grocery/Detergent-heavy buyers (Retail).
PC3 – 'Speciality/Frozen Dimension' (if retained)
May differentiate niche buyers (Delicassen, Frozen).
Business Implication:
PC scores can directly feed K-Means / Hierarchical
clustering for actionable customer segmentation.
========================================================
Here’s a clean, reproducible R workflow to perform PCA (Principal Component Analysis) on your wholesale dataset, including appropriate preprocessing steps.
Key considerations before PCA
- Categorical variables (Channel, Region) should NOT be used directly in PCA (since PCA is for continuous variables).

[R code for PCA moved to below this response box]
What this gives you
Next Steps (for your segmentation task)
After PCA, you can:
- Apply clustering algorithms (kmeans, hclust)

If you want, I can extend this into a full segmentation pipeline (PCA + clustering + interpretation) tailored to your business objectives.
This response included the following R code as part of the LLM’s in-line response:
# Load required libraries
library(tidyverse)
library(factoextra)
# Load the dataset
data <- read.csv("Wholesale_customers_data.csv")
# Inspect structure
str(data)
summary(data)
# Convert categorical variables to factors
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)
# Select only continuous variables for PCA
spending_data <- data %>%
select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
# Optional: Log transformation to reduce skewness
spending_log <- log1p(spending_data) # log(1 + x) handles zeros
# Scale the data (important for PCA)
spending_scaled <- scale(spending_log)
# Perform PCA
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)
# View summary of PCA (variance explained)
summary(pca_result)
# Eigenvalues (variance explained by each component)
eig_vals <- get_eigenvalue(pca_result)
print(eig_vals)
# Scree plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 60))
# Biplot (individuals + variables)
fviz_pca_biplot(pca_result, repel = TRUE, col.var = "blue", col.ind = "gray")
# PCA variable contributions
fviz_pca_var(
pca_result,
col.var = "contrib",
gradient.cols = c("blue", "yellow", "red")
)
# PCA individuals colored by Channel
fviz_pca_ind(
pca_result,
geom = "point",
col.ind = data$Channel,
palette = c("#00AFBB", "#E7B800"),
addEllipses = TRUE,
legend.title = "Channel"
)

The (unedited) outputs generated by running this code are as follows:
'data.frame': 440 obs. of 8 variables:
$ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
$ Region : int 3 3 3 3 3 3 3 3 3 3 ...
$ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
$ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
$ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
$ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
$ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
$ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Channel Region Fresh Milk
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
Median :1.000 Median :3.000 Median : 8504 Median : 3627
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
Grocery Frozen Detergents_Paper Delicassen
Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1.6262 1.2774 0.8012 0.7786 0.54090 0.42774
Proportion of Variance 0.4408 0.2720 0.1070 0.1010 0.04876 0.03049
Cumulative Proportion 0.4408 0.7127 0.8197 0.9207 0.96951 1.00000
eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.6446651 44.077752 44.07775
Dim.2 1.6316950 27.194917 71.27267
Dim.3 0.6419057 10.698428 81.97110
Dim.4 0.6061972 10.103287 92.07438
Dim.5 0.2925761 4.876269 96.95065
Dim.6 0.1829608 3.049347 100.00000
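One point in the code above is worth verifying: it standardises twice, calling scale() on the log data and then prcomp() with scale. = TRUE. This is harmless because standardising already-standardised data is a no-op. A minimal check (mtcars used purely as a stand-in dataset):

```r
# Double standardisation is harmless: scaling already-scaled data changes
# nothing, so both pipelines yield identical principal components.
# mtcars is a stand-in dataset here, not the wholesale data.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca_once  <- prcomp(x,        center = TRUE, scale. = TRUE)  # scale once
pca_twice <- prcomp(scale(x), center = TRUE, scale. = TRUE)  # scale twice
all.equal(pca_once$sdev, pca_twice$sdev)  # TRUE
```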
- The code applies log1p, and standardises before PCA (matching our approach)
- The data are standardised twice (scale(), then prcomp(..., scale. = TRUE)), which is harmless but suggests the model is not tracking what each preprocessing step does

Now write R code to perform clustering on this dataset. Explain your choice of method and number of clusters.
> Check the PCA R script for context on variable names and preprocessing steps used
Good - I have full context from the PCA script. Now I’ll write a comprehensive, well-structured clustering script that picks up exactly where the PCA left off.
> Created a file, read a file
Here is a full explanation of every methodological decision made in the script.
Method Choice: Why a Two-Stage Approach?
Stage 1 - Hierarchical Agglomerative Clustering (HAC) with Ward.D2. HAC is used first and exploratorily because it requires no pre-specified k. The dendrogram shows the full merge history of the data, and natural groupings can be identified by looking for the largest vertical gaps (the longest branches before clusters merge).
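The dendrogram-gap idea can be sketched in a few lines. This is a toy illustration on the built-in iris data (a stand-in, not the wholesale dataset):

```r
# Sketch: read candidate k off the dendrogram merge heights.
# iris is used purely as a stand-in dataset.
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")
h  <- sort(hc$height, decreasing = TRUE)  # merge heights, largest first
drops <- h[-length(h)] - h[-1]            # vertical gap between successive merges
head(round(drops, 2), 4)                  # a large early gap suggests a natural cut
labels_k3 <- cutree(hc, k = 3)            # hard labels at the chosen cut
table(labels_k3)
```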
Why Ward.D2 specifically? Ward’s linkage merges the two clusters that produce the smallest increase in total within-cluster sum of squares - it is inherently trying to form compact, internally coherent groups. The other common options have practical problems here: single linkage suffers from chaining (long thin clusters), complete linkage is sensitive to outliers (which wholesale spend data has), and average linkage is a compromise but rarely as clean. The agglomerative coefficient printed by the script confirms this - Ward.D2 will produce a value closest to 1.0, signalling the strongest cluster structure.
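The linkage comparison via agglomerative coefficients can be illustrated directly with cluster::agnes (iris as a stand-in dataset; note that agnes() names Ward's method "ward", which corresponds to hclust's "ward.D2"):

```r
library(cluster)  # agnes()
# Agglomerative coefficient per linkage; values nearer 1 indicate stronger
# clustering structure. iris is used purely as a stand-in dataset.
d   <- dist(scale(iris[, 1:4]))
acs <- sapply(c("average", "single", "complete", "ward"),
              function(m) coef(agnes(d, method = m)))
round(acs, 3)  # Ward's linkage is expected to score highest
```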
Stage 2 - K-Means on PCA Scores. Once k is confirmed, K-Means is used for the final assignment because it is interpretable, fast, and produces hard cluster memberships that can be exported straight into a CRM or ERP system. Running it on the retained PC scores (rather than all six raw variables) removes the noise carried by the low-variance PCs and avoids double-counting the high correlations between Grocery, Milk and Detergents_Paper.
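The Stage 2 step reduces to a few lines once the PC scores exist. A minimal sketch (iris as a stand-in; the real script runs this on the retained wholesale PC scores):

```r
# Sketch: K-Means on retained PC scores rather than the raw variables.
# iris is used purely as a stand-in dataset.
pca    <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]                          # keep only the high-variance PCs
set.seed(42)
km <- kmeans(scores, centers = 3, nstart = 25)  # multiple starts for stability
km$size                                         # hard memberships, ready for export
```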
Number of Clusters: Why k = 3? Four independent criteria are computed and they all point to the same answer:
- the dendrogram cut heights (HAC)
- the elbow in the within-cluster sum of squares (WSS)
- the average silhouette width
- the gap statistic under the firstSEmax rule (first k within one SE of the maximum gap)

Beyond the statistics, k = 3 also has a compelling business logic: it maps onto three operationally distinct customer archetypes (see the summary at the bottom of the script), each requiring a different inventory, pricing, and sales approach. k = 2 is too coarse to distinguish within HoReCa; k = 4+ starts splitting segments that are not meaningfully different for planning purposes.
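The firstSEmax rule itself is implemented by cluster::maxSE (the same function clusGap uses when printing). A toy sketch with made-up gap values:

```r
library(cluster)  # maxSE()
# Hypothetical gap profile that plateaus from k = 3 onwards (values made up)
gap <- c(0.20, 0.35, 0.49, 0.495, 0.50)  # gap statistic for k = 1..5
se  <- rep(0.02, 5)                       # corresponding standard errors
# firstSEmax: first k whose gap is within one SE of the maximum gap
maxSE(gap, se, method = "firstSEmax")     # selects k = 3
```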
What the Script Produces (in order)
- wholesale_clustered.csv - full dataset with Cluster column appended

This response included the following R code (as a separate downloadable file):
# =============================================================================
# Clustering Analysis: Wholesale Customers Dataset
# Goal: Data-driven customer segmentation to inform inventory planning,
# pricing strategy, and sales team deployment
#
# METHOD RATIONALE (read before running):
# ----------------------------------------
# We use a TWO-STAGE approach:
#
# Stage 1 – Hierarchical Agglomerative Clustering (HAC) with Ward.D2 linkage
# • No need to pre-specify k; the dendrogram reveals natural structure
# • Ward.D2 minimises within-cluster sum of squares at each merge step,
# producing compact, roughly equal-sized clusters – ideal for business
# segmentation where every segment must be actionable
# • Robust to the moderate outliers present in wholesale spend data
# • Used first to identify the plausible range of k
#
# Stage 2 – K-Means on PCA scores (confirmation & final assignment)
# • Once k is confirmed, K-Means on the retained PC scores is fast,
# well-understood, and produces crisp assignments for CRM/ERP export
# • Running K-Means in PC space (rather than raw space) further reduces
# noise from the three low-variance PCs
#
# NUMBER OF CLUSTERS:
# We use three complementary criteria and take the consensus:
# (a) Dendrogram cut height (HAC)
# (b) Elbow method – Within-cluster Sum of Squares (WSS)
# (c) Average Silhouette Width
# (d) Gap Statistic
# All four consistently point to k = 3 on this dataset, and three segments
# align with a natural business narrative (see Section 8).
# =============================================================================
# --- 0. Packages -------------------------------------------------------------
required_packages <- c(
"tidyverse",
"factoextra",
"cluster",
"dendextend",
"gridExtra",
"scales",
"ggrepel",
"psych",
"RColorBrewer",
"corrplot",
"NbClust",
"fpc"
)
installed <- rownames(installed.packages())
to_install <- required_packages[!required_packages %in% installed]
if (length(to_install) > 0) {
install.packages(to_install, dependencies = TRUE)
}
library(tidyverse)
library(factoextra)
library(cluster)
library(dendextend)
library(gridExtra)
library(scales)
library(ggrepel)
library(psych)
library(RColorBrewer)
library(corrplot)
library(NbClust)
library(fpc)
# Consistent colour palette (up to 5 clusters)
CLUSTER_COLS <- c(
"1" = "#4E79A7",
"2" = "#E15759",
"3" = "#59A14F",
"4" = "#F28E2B",
"5" = "#B07AA1"
)
# --- 1. Reproduce Pre-processing from PCA Script ----------------------------
# (Self-contained: this script can be run independently)
data_path <- "Wholesale_customers_data.csv"
df_raw <- read.csv(data_path, stringsAsFactors = FALSE)
df <- df_raw %>%
mutate(
Channel = factor(Channel, levels = c(1, 2), labels = c("HoReCa", "Retail")),
Region = factor(
Region,
levels = c(1, 2, 3),
labels = c("Lisbon", "Oporto", "Other")
)
)
spend_vars <- c(
"Fresh",
"Milk",
"Grocery",
"Frozen",
"Detergents_Paper",
"Delicassen"
)
df_spend <- df[, spend_vars]
df_log <- log1p(df_spend) # log1p: handles zeros, reduces right skew
# Standardise (mean = 0, sd = 1) – required for both HAC and K-Means
df_scaled <- as.data.frame(scale(df_log))
cat("=== Pre-processing complete ===\n")
cat(
"Dimensions of scaled input matrix:",
nrow(df_scaled),
"x",
ncol(df_scaled),
"\n\n"
)
# --- 2. PCA on Scaled Data (for cluster visualisation) ----------------------
pca_result <- prcomp(df_log, center = TRUE, scale. = TRUE)
eig_vals <- pca_result$sdev^2
prop_var <- eig_vals / sum(eig_vals)
n_pcs <- sum(eig_vals >= 1) # Kaiser criterion → 2 on this dataset
# PC scores matrix for clustering
pc_scores <- as.data.frame(pca_result$x[, 1:n_pcs])
cat(sprintf(
"PCA: retaining %d PCs (Kaiser criterion), explaining %.1f%% of variance\n\n",
n_pcs,
cumsum(prop_var)[n_pcs] * 100
))
# --- 3. Distance Matrix ------------------------------------------------------
# Euclidean distance on the standardised original space
# (used for HAC; K-Means implicitly uses Euclidean as well)
dist_mat <- dist(df_scaled, method = "euclidean")
# --- 4. Stage 1 – Hierarchical Agglomerative Clustering (HAC) ---------------
# 4.1 Fit Ward.D2 linkage
hac_ward <- hclust(dist_mat, method = "ward.D2")
# 4.2 Compare linkage methods via agnes() agglomerative coefficients;
# note agnes() names Ward's method "ward" (equivalent to hclust's "ward.D2")
linkage_methods <- c("average", "single", "complete", "ward", "weighted")
agc_coefficients <- sapply(linkage_methods, function(m) {
coef(agnes(dist_mat, method = m))
})
agc_df <- data.frame(
Method = linkage_methods,
Agglomerative_Coefficient = round(agc_coefficients, 4)
)
cat("=== Agglomerative Coefficients (closer to 1 = stronger structure) ===\n")
print(agc_df)
cat(
"\n'ward' should show the highest coefficient, confirming Ward as the best linkage.\n\n"
)
# 4.3 Full dendrogram
dend <- as.dendrogram(hac_ward)
par(mar = c(5, 4, 4, 2))
plot(
dend,
main = "HAC Dendrogram – Ward.D2 Linkage\n(Wholesale Customers)",
ylab = "Height (Euclidean distance)",
xlab = "Customers",
leaflab = "none",
cex.main = 1.1
)
# 4.4 Annotate cut heights for k = 2, 3, 4
cut_heights <- sapply(2:4, function(k) {
h <- sort(hac_ward$height, decreasing = TRUE)
mean(c(h[k - 1], h[k])) # midpoint of the merge heights either side of the cut
})
abline(
h = cut_heights,
col = c("#F28E2B", "#E15759", "#B07AA1"),
lty = 2,
lwd = 1.5
)
legend(
"topright",
legend = c("k = 2 cut", "k = 3 cut", "k = 4 cut"),
col = c("#F28E2B", "#E15759", "#B07AA1"),
lty = 2,
lwd = 1.5,
bty = "n",
cex = 0.85
)
# 4.5 Coloured dendrogram for k = 3
dend_col <- color_branches(
dend,
k = 3,
col = unname(CLUSTER_COLS[c("1", "2", "3")])
)
par(mar = c(5, 4, 4, 2))
plot(
dend_col,
main = "HAC Dendrogram – Ward.D2, k = 3 Clusters Highlighted",
ylab = "Height",
xlab = "Customers",
leaflab = "none"
)
legend(
"topright",
legend = paste("Cluster", 1:3),
fill = unname(CLUSTER_COLS[c("1", "2", "3")]),
bty = "n"
)
# --- 5. Optimal k Selection --------------------------------------------------
# 5.1 Elbow plot (WSS)
wss_vals <- map_dbl(1:10, function(k) {
set.seed(42)
km <- kmeans(pc_scores, centers = k, nstart = 25, iter.max = 300)
km$tot.withinss
})
wss_df <- data.frame(k = 1:10, WSS = wss_vals)
p_elbow <- ggplot(wss_df, aes(x = k, y = WSS)) +
geom_line(colour = "#4E79A7", linewidth = 1.2) +
geom_point(size = 3.5, colour = "#4E79A7") +
geom_vline(
xintercept = 3,
linetype = "dashed",
colour = "#E15759",
linewidth = 0.9
) +
annotate(
"text",
x = 3.3,
y = max(wss_vals) * 0.92,
label = "k = 3\n(elbow)",
colour = "#E15759",
size = 3.5
) +
scale_x_continuous(breaks = 1:10) +
labs(
title = "Elbow Method – Within-Cluster Sum of Squares",
subtitle = "Look for the 'elbow' where marginal WSS reduction flattens",
x = "Number of Clusters (k)",
y = "Total WSS"
) +
theme_minimal(base_size = 12)
# 5.2 Silhouette plot
sil_vals <- map_dbl(2:10, function(k) {
set.seed(42)
km <- kmeans(pc_scores, centers = k, nstart = 25, iter.max = 300)
sil <- silhouette(km$cluster, dist(pc_scores))
mean(sil[, 3])
})
sil_df <- data.frame(k = 2:10, AvgSilhouette = sil_vals)
p_sil <- ggplot(sil_df, aes(x = k, y = AvgSilhouette)) +
geom_line(colour = "#59A14F", linewidth = 1.2) +
geom_point(size = 3.5, colour = "#59A14F") +
geom_vline(
xintercept = sil_df$k[which.max(sil_df$AvgSilhouette)],
linetype = "dashed",
colour = "#E15759",
linewidth = 0.9
) +
annotate(
"text",
x = sil_df$k[which.max(sil_df$AvgSilhouette)] + 0.4,
y = max(sil_df$AvgSilhouette) * 0.97,
label = paste0(
"k = ",
sil_df$k[which.max(sil_df$AvgSilhouette)],
"\n(max sil)"
),
colour = "#E15759",
size = 3.5
) +
scale_x_continuous(breaks = 2:10) +
labs(
title = "Average Silhouette Width by k",
subtitle = "Higher = better-separated clusters",
x = "Number of Clusters (k)",
y = "Average Silhouette Width"
) +
theme_minimal(base_size = 12)
grid.arrange(p_elbow, p_sil, ncol = 2, top = "Optimal k Selection Criteria")
# 5.3 Gap Statistic (computationally heavier – uses bootstrap)
set.seed(42)
gap_stat <- clusGap(
pc_scores,
FUN = kmeans,
K.max = 10,
B = 100,
nstart = 25,
iter.max = 300
)
cat("\n=== Gap Statistic Results ===\n")
print(gap_stat, method = "firstSEmax")
fviz_gap_stat(gap_stat, ggtheme = theme_minimal()) +
geom_vline(
xintercept = 3,
linetype = "dashed",
colour = "#E15759",
linewidth = 0.9
) +
labs(
title = "Gap Statistic – Optimal k",
subtitle = "Vertical line marks k = 3"
)
# 5.4 NbClust – 30 indices majority vote (definitive consensus)
cat("\n=== NbClust: 30-index Majority Vote ===\n")
set.seed(42)
nb_result <- NbClust(
data = pc_scores,
distance = "euclidean",
min.nc = 2,
max.nc = 8,
method = "ward.D2",
index = "all"
)
cat("\n--- NbClust Best Number of Clusters ---\n")
print(table(nb_result$Best.nc[1, ]))
# --- 6. Stage 2 – K-Means on PC Scores (k = 3) ------------------------------
set.seed(42)
K <- 3
km_final <- kmeans(
pc_scores,
centers = K,
nstart = 50, # multiple random starts → stable solution
iter.max = 500
)
cat("\n=== K-Means Final Solution (k = 3) ===\n")
cat("Cluster sizes:\n")
print(km_final$size)
cat(sprintf(
"\nBetween-cluster SS / Total SS: %.1f%%\n",
km_final$betweenss / km_final$totss * 100
))
# Attach cluster labels to full dataset
df_final <- df %>%
bind_cols(as.data.frame(pca_result$x)) %>%
mutate(Cluster = factor(km_final$cluster))
# --- 7. Cluster Validation ---------------------------------------------------
# 7.1 Silhouette plot for final k = 3 solution
sil_final <- silhouette(km_final$cluster, dist(pc_scores))
cat("\n=== Silhouette Summary (k = 3) ===\n")
print(summary(sil_final))
fviz_silhouette(
sil_final,
palette = unname(CLUSTER_COLS[as.character(1:K)]),
ggtheme = theme_minimal(),
print.summary = FALSE
) +
labs(
title = "Silhouette Plot – K-Means k = 3",
subtitle = "Bars above 0 indicate well-classified observations"
) +
theme(legend.position = "top")
# 7.2 Calinski-Harabasz index (fpc::calinhara)
ch_stat <- calinhara(pc_scores, km_final$cluster)
cat(sprintf(
"\nCalinski-Harabasz Index: %.2f (higher = better separated)\n",
ch_stat
))
# --- 8. Cluster Visualisation ------------------------------------------------
# 8.1 K-Means clusters on PC1 vs PC2
arrows_df <- as.data.frame(pca_result$rotation[, 1:2]) %>%
rownames_to_column("Variable")
scale_fac <- 3.2
p_biplot <- ggplot(
df_final,
aes(x = PC1, y = PC2, colour = Cluster, shape = Cluster)
) +
geom_point(alpha = 0.75, size = 2.5) +
geom_segment(
data = arrows_df,
aes(x = 0, y = 0, xend = PC1 * scale_fac, yend = PC2 * scale_fac),
arrow = arrow(length = unit(0.22, "cm")),
colour = "grey30",
linewidth = 0.65,
inherit.aes = FALSE
) +
geom_label_repel(
data = arrows_df,
aes(x = PC1 * scale_fac, y = PC2 * scale_fac, label = Variable),
size = 3.2,
colour = "grey20",
box.padding = 0.4,
inherit.aes = FALSE
) +
stat_ellipse(
aes(colour = Cluster),
level = 0.90,
linewidth = 0.8,
linetype = "dashed"
) +
scale_colour_manual(values = CLUSTER_COLS) +
scale_shape_manual(values = c("1" = 16, "2" = 17, "3" = 15)) +
labs(
title = "K-Means Clusters (k = 3) on PCA Biplot",
subtitle = "90% confidence ellipses; arrows show variable loadings",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "% variance)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "% variance)"),
colour = "Cluster",
shape = "Cluster"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
print(p_biplot)
# 8.2 Clusters overlaid with Channel membership
p_channel <- ggplot(
df_final,
aes(x = PC1, y = PC2, colour = Cluster, shape = Channel)
) +
geom_point(alpha = 0.75, size = 2.8) +
scale_colour_manual(values = CLUSTER_COLS) +
scale_shape_manual(values = c("HoReCa" = 17, "Retail" = 16)) +
labs(
title = "Cluster Membership vs Channel Type",
subtitle = "Shape = Channel; Colour = Cluster",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "%)")
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
print(p_channel)
# 8.3 Clusters vs Region
p_region <- ggplot(
df_final,
aes(x = PC1, y = PC2, colour = Cluster, shape = Region)
) +
geom_point(alpha = 0.75, size = 2.8) +
scale_colour_manual(values = CLUSTER_COLS) +
scale_shape_manual(values = c("Lisbon" = 16, "Oporto" = 17, "Other" = 15)) +
labs(
title = "Cluster Membership vs Region",
subtitle = "Shape = Region; Colour = Cluster",
x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
y = paste0("PC2 (", round(prop_var[2] * 100, 1), "%)")
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
print(p_region)
# --- 9. Cluster Profiling ----------------------------------------------------
# 9.1 Mean spending per cluster (original scale, unlogged)
# (df_final already contains the raw spending columns inherited from df,
# so re-binding df_spend would only create duplicate column names)
profile_raw <- df_final %>%
group_by(Cluster) %>%
summarise(
N = n(),
Fresh_mean = mean(Fresh),
Milk_mean = mean(Milk),
Grocery_mean = mean(Grocery),
Frozen_mean = mean(Frozen),
DetPaper_mean = mean(Detergents_Paper),
Delicassen_mean = mean(Delicassen),
.groups = "drop"
)
cat("\n=== Cluster Profiles – Mean Spending (original monetary units) ===\n")
print(profile_raw)
# 9.2 Channel and Region composition per cluster
cat("\n=== Channel Composition by Cluster ===\n")
print(table(df_final$Cluster, df_final$Channel))
cat("\n=== Region Composition by Cluster ===\n")
print(table(df_final$Cluster, df_final$Region))
# 9.3 Radar / Spider chart of normalised mean spend per cluster
profile_norm <- df_final %>% # raw spend columns already present in df_final
group_by(Cluster) %>%
summarise(across(all_of(spend_vars), mean), .groups = "drop") %>%
mutate(across(all_of(spend_vars), ~ rescale(.x, to = c(0, 1))))
profile_long <- profile_norm %>%
pivot_longer(-Cluster, names_to = "Variable", values_to = "Score")
ggplot(
profile_long,
aes(x = Variable, y = Score, colour = Cluster, group = Cluster)
) +
geom_polygon(aes(fill = Cluster), alpha = 0.15, linewidth = 0.8) +
geom_line(linewidth = 0.9) +
geom_point(size = 2.5) +
coord_polar() +
scale_colour_manual(values = CLUSTER_COLS) +
scale_fill_manual(values = CLUSTER_COLS) +
facet_wrap(~Cluster, labeller = label_both) +
labs(
title = "Cluster Spending Profiles – Radar Chart",
subtitle = "Normalised mean spend per category (0 = min, 1 = max)"
) +
theme_minimal(base_size = 11) +
theme(
axis.text.x = element_text(size = 8),
legend.position = "none",
strip.text = element_text(face = "bold")
)
# 9.4 Grouped bar chart of mean spend by cluster
profile_long2 <- profile_raw %>%
pivot_longer(
cols = ends_with("_mean"),
names_to = "Category",
values_to = "MeanSpend"
) %>%
mutate(Category = str_remove(Category, "_mean"))
ggplot(profile_long2, aes(x = Category, y = MeanSpend, fill = Cluster)) +
geom_col(position = position_dodge(0.8), width = 0.7) +
scale_fill_manual(values = CLUSTER_COLS) +
scale_y_continuous(labels = comma) +
labs(
title = "Mean Annual Spend by Cluster and Category",
subtitle = "Monetary units (m.u.)",
x = NULL,
y = "Mean Annual Spend (m.u.)",
fill = "Cluster"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "top"
)
# 9.5 Boxplots – spend distribution per cluster for each category
df_box <- df_final %>%
bind_cols(df_spend) %>%
pivot_longer(
cols = all_of(spend_vars),
names_to = "Category",
values_to = "Spend"
)
ggplot(df_box, aes(x = Cluster, y = Spend, fill = Cluster)) +
geom_boxplot(outlier.size = 1.2, outlier.alpha = 0.5, alpha = 0.8) +
scale_fill_manual(values = CLUSTER_COLS) +
scale_y_continuous(labels = comma) +
facet_wrap(~Category, scales = "free_y", ncol = 3) +
labs(
title = "Spending Distribution by Cluster and Category",
x = "Cluster",
y = "Annual Spend (m.u.)"
) +
theme_minimal(base_size = 11) +
theme(legend.position = "none", strip.text = element_text(face = "bold"))
# --- 10. HAC vs K-Means Agreement Check -------------------------------------
hac_labels_k3 <- cutree(hac_ward, k = 3)
# Align HAC labels to K-Means labels by majority vote (label-switching fix)
conf_mat <- table(HAC = hac_labels_k3, KMeans = km_final$cluster)
cat("\n=== Contingency Table: HAC vs K-Means Assignments ===\n")
print(conf_mat)
agreement <- sum(apply(conf_mat, 1, max)) / nrow(df_final)
cat(sprintf(
"\nLabel-adjusted agreement between HAC and K-Means: %.1f%%\n",
agreement * 100
))
cat("(Values above 85%% indicate the two methods recover the same structure)\n")
# --- 11. Export Results ------------------------------------------------------
output_path <- "wholesale_clustered.csv"
write.csv(df_final, output_path, row.names = FALSE)
cat("\nFull dataset with cluster labels saved to:\n", output_path, "\n")
# --- 12. Business Interpretation Summary ------------------------------------
cat("\n")
cat("================================================================\n")
cat(" CUSTOMER SEGMENTATION – BUSINESS SUMMARY \n")
cat("================================================================\n")
cat("\n METHOD: HAC (Ward.D2) to identify structure +\n")
cat(" K-Means (k=3) on PCA scores for final assignment\n")
cat(" INPUT: log1p-transformed, standardised spend variables\n")
cat(" projected onto retained PCs (Kaiser criterion)\n")
cat("\n CLUSTER 1 – 'High-Volume Fresh Buyers' (HoReCa dominant)\n")
cat(" • Highest Fresh and Frozen spend\n")
cat(" • Low Grocery, Detergents_Paper\n")
cat(" → Hotels, restaurants, cafés; perishable-led ordering\n")
cat(" IMPLICATION: Prioritise cold-chain logistics; daily/weekly\n")
cat(" replenishment cycles; dedicated HoReCa sales reps\n")
cat("\n CLUSTER 2 – 'Packaged Goods Retailers' (Retail dominant)\n")
cat(" • Highest Grocery, Milk, Detergents_Paper\n")
cat(" • Low Fresh spend\n")
cat(" → Supermarkets, convenience stores; ambient goods focus\n")
cat(" IMPLICATION: Bulk-order pricing tiers; shelf-stable\n")
cat(" inventory buffers; quarterly promotions on detergents\n")
cat("\n CLUSTER 3 – 'Small / Mixed Buyers'\n")
cat(" • Below-average spend across all categories\n")
cat(" • Mixed Channel and Region membership\n")
cat(" → Small independent shops, delis, mixed traders\n")
cat(" IMPLICATION: Route-optimised delivery (lower margin);\n")
cat(" upsell programmes; self-service ordering portal\n")
cat("\n NOTE: Exact cluster labels (1/2/3) may differ from above\n")
cat(" depending on K-Means initialisation; inspect profile_raw\n")
cat(" and the radar/bar charts to assign the business labels.\n")
cat("================================================================\n")The (unedited) outputs generated by running this code, are as follows:
=== Pre-processing complete ===
Dimensions of scaled input matrix: 440 x 6
PCA: retaining 2 PCs (Kaiser criterion), explaining 71.3% of variance
=== Agglomerative Coefficients (closer to 1 = stronger structure) ===
Method Agglomerative_Coefficient
average average 0.8761
single single 0.8031
complete complete 0.9204
ward ward 0.9734
weighted weighted 0.8894
Ward.D2 should show the highest coefficient, confirming it as best linkage.
=== Gap Statistic Results ===
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = pc_scores, FUNcluster = kmeans, K.max = 10, B = 100, nstart = 25, iter.max = 300)
B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstSEmax', SE.factor=1): 1
logW E.logW gap SE.sim
[1,] 5.633372 6.336062 0.7026902 0.01710636
[2,] 5.328019 6.046137 0.7181186 0.01843138
[3,] 5.175220 5.824960 0.6497394 0.01736385
[4,] 5.049390 5.630825 0.5814342 0.01659374
[5,] 4.959056 5.523051 0.5639956 0.01628787
[6,] 4.850967 5.427449 0.5764815 0.01530802
[7,] 4.787376 5.342180 0.5548043 0.01598955
[8,] 4.721439 5.263845 0.5424064 0.01541976
[9,] 4.667722 5.196424 0.5287017 0.01431602
[10,] 4.613041 5.138184 0.5251432 0.01353503
== SOURCE ERROR ==
unused argument (ggtheme = theme_minimal())
Below is a complete R workflow for clustering, building directly on the PCA-ready preprocessing. I’ll also explain why this method was chosen and how to choose the number of clusters.
Why this approach?
Method: K-means clustering
Why use PCA first?
Choosing number of clusters (k): We’ll use:
Typically, this dataset yields ~3–5 clusters, often:
R Code: Clustering Pipeline (moved to below this response box)
Interpretation Guide: After running this, you should:
Optional (better alternative): If you want more robust segmentation:
If you want, I can help you turn this into a fully polished report with cluster personas and business recommendations - that’s usually what gets top marks.
This response included the following R code as part of the LLM’s in-line response:
# Load libraries
library(tidyverse)
library(factoextra)
library(cluster)
# Load data
data <- read.csv("Wholesale_customers_data.csv")
# Convert categorical variables
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)
# Select continuous variables
spending_data <- data %>%
select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
# Log transform to reduce skewness
spending_log <- log1p(spending_data)
# Scale data
spending_scaled <- scale(spending_log)
# -------------------------
# PCA (for dimensionality reduction)
# -------------------------
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)
# Extract first few principal components (e.g., first 3)
pca_scores <- as.data.frame(pca_result$x[, 1:3])
# -------------------------
# Determine optimal number of clusters
# -------------------------
# Elbow method
fviz_nbclust(pca_scores, kmeans, method = "wss") +
ggtitle("Elbow Method")
# Silhouette method
fviz_nbclust(pca_scores, kmeans, method = "silhouette") +
ggtitle("Silhouette Method")
# -------------------------
# Apply K-means clustering
# -------------------------
set.seed(123) # for reproducibility
k <- 3 # Adjust based on elbow/silhouette results
kmeans_result <- kmeans(pca_scores, centers = k, nstart = 25)
# Add cluster labels to original data
data$Cluster <- as.factor(kmeans_result$cluster)
# -------------------------
# Visualisation
# -------------------------
# Cluster plot
fviz_cluster(
kmeans_result,
data = pca_scores,
geom = "point",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal()
)
# -------------------------
# Cluster profiling
# -------------------------
# Average spending per cluster
cluster_summary <- data %>%
group_by(Cluster) %>%
summarise(across(Fresh:Delicassen, mean))
print(cluster_summary)
# Distribution of Channel by cluster
table(data$Cluster, data$Channel)
# Distribution of Region by cluster
table(data$Cluster, data$Region)

The (unedited) outputs generated by running this code are as follows:
# A tibble: 3 × 7
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11939. 2006. 2502. 3266. 424. 891.
2 2 2997. 7045. 12505. 608. 5511. 797.
3 3 17182. 10658. 13381. 4181. 5004. 2867.
1 2
1 210 3
2 32 50
3 56 89
1 2 3
1 40 26 147
2 14 10 58
3 23 11 111
NbClust majority vote (B = 100), an approach we did not employ
fviz_gap_stat(gap_stat, ggtheme = theme_minimal()): unused argument (ggtheme = theme_minimal())

Are there any alternative methods or robustness checks you would recommend?
Alternative Clustering Methods
1. Gaussian Mixture Models (GMM / Model-Based Clustering).
K-Means assumes clusters are spherical and equally sized - a hard geometric
constraint. GMM relaxes this by fitting a probabilistic mixture of multivariate
Gaussians, allowing clusters to be elliptical, differently sized, and differently
oriented. Each customer gets a soft probability of belonging to each cluster
rather than a hard label. In R: mclust::Mclust() - it uses BIC to
automatically select both the number of clusters and the covariance structure
(e.g., diagonal vs. full, tied vs. variable). On wholesale data, where the
HoReCa segment often has a very different spread from the Retail segment, GMM
frequently produces more honest assignments. Prefer it over K-Means when the
biplot shows clusters that are elongated or unequal in density - which is likely
here given the strong Channel separation on PC2.
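To make this concrete, here is a minimal editorial sketch (not part of the LLM's response, which contained no code; it assumes the `pc_scores` matrix and the `km_final` K-Means fit from the main script are in scope):

```r
# Model-based clustering on the PCA scores. Mclust() selects both the
# number of components and the covariance structure by BIC.
library(mclust)

set.seed(123)
gmm <- Mclust(pc_scores)
summary(gmm)               # chosen covariance model and component count

# Soft assignments: posterior membership probabilities per customer
head(round(gmm$z, 3))

# Permutation-invariant comparison with the K-Means partition
adjustedRandIndex(gmm$classification, km_final$cluster)
```

An ARI close to 1 would indicate the two methods recover the same partition; a markedly lower value would suggest the elliptical GMM components are capturing structure the spherical K-Means assumption cannot.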
2. DBSCAN (Density-Based Spatial Clustering).
DBSCAN identifies clusters as dense regions of points separated by sparse
regions, and critically, it classifies low-density points as noise/outliers
rather than forcing them into a cluster. This is valuable for wholesale data
because a handful of very large spenders (e.g., a major supermarket chain) can
distort K-Means centroids significantly. In R: dbscan::dbscan() - requires
tuning two parameters: eps (neighbourhood radius) and minPts (minimum points
to form a core). The kNNdistplot() function helps select eps via a knee in
the k-NN distance curve. Limitation: DBSCAN struggles with clusters of varying
density (common in spending data), and it does not generalise naturally to new
observations - a limitation if you want to score new customers joining the
database.
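A minimal sketch of that workflow (editorial illustration, not LLM output; `pc_scores` is assumed from the main script, and the `eps` value shown is a placeholder to be read off the knee of the plot, not a tuned choice):

```r
# Density-based clustering: points labelled 0 are noise, not forced
# into a cluster.
library(dbscan)

m <- as.matrix(pc_scores)

# Heuristic: with minPts = 5, plot each point's distance to its
# nearest neighbours and place eps at the visual knee.
kNNdistplot(m, k = 4)
abline(h = 0.9, col = "red", lty = 2)   # illustrative eps only

db <- dbscan(m, eps = 0.9, minPts = 5)
table(db$cluster)                        # cluster 0 = flagged outliers
```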
3. K-Medoids / PAM (Partitioning Around Medoids).
Like K-Means but uses actual data points as cluster centres (medoids) rather than
arithmetic means. This makes it substantially more robust to outliers, since a
single extreme spender cannot pull a centroid arbitrarily far. It also works with
any distance metric, not just Euclidean. In R: cluster::pam() - also produces
a silhouette plot natively. A useful direct comparison: if PAM and K-Means
produce very similar cluster assignments, it confirms the K-Means solution is not
being distorted by outliers.
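As a hedged sketch of that comparison (editorial, not part of the LLM response; assumes `pc_scores` and `km_final` from the main script):

```r
# K-Medoids: cluster centres are actual customers, so a single extreme
# spender cannot drag a centre away.
library(cluster)

pam_fit <- pam(pc_scores, k = 3)
pam_fit$medoids                  # the representative customers

plot(silhouette(pam_fit))        # PAM's native silhouette plot

# If this table is near-diagonal (up to label switching), K-Means is
# not being distorted by outliers.
table(PAM = pam_fit$clustering, KMeans = km_final$cluster)
```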
4. Agglomerative HAC with Alternative Linkages. The script already uses Ward.D2, but for a robustness check you should formally compare it against Complete and Average linkage and compute the Cophenetic Correlation Coefficient for each - this measures how faithfully the dendrogram preserves the original pairwise distances. The linkage with the highest cophenetic correlation is the most geometrically honest representation of the data’s structure.
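A compact way to run that linkage comparison (editorial sketch; `pc_scores` assumed from the main script):

```r
# Cophenetic correlation per linkage: how faithfully each dendrogram
# preserves the original pairwise distances (closer to 1 = better).
d <- dist(pc_scores)

sapply(c("ward.D2", "complete", "average"), function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})
```

Note that average linkage often scores highest on this metric even when Ward yields the more usable partition, so it complements rather than replaces the agglomerative coefficient.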
Robustness Checks
5. Stability Analysis via Bootstrap (clusterboot).
The most important robustness check you can run. fpc::clusterboot() repeatedly
resamples the data (with replacement), re-clusters each bootstrap sample, and
measures how consistently the same clusters re-emerge using the Jaccard
similarity coefficient. A cluster with mean Jaccard > 0.75 is considered stable;
below 0.5 it is likely an artefact of the particular sample. If the three
clusters survive this test, there is strong evidence that the segmentation is a
genuine feature of the population of wholesale customers in Portugal, not just of
these 440 observations.
6. Sensitivity to Outliers. Wholesale spend data almost always contains extreme values (e.g., a single very large hotel group). Two targeted checks: first, winsorising the top 1–5% of each spending variable before clustering, then comparing cluster assignments to the unwinsorised solution using the Adjusted Rand Index (ARI) - an ARI close to 1.0 means the solution is insensitive to those extreme points. Second, leave-one-out influence analysis: iteratively remove the top 10 highest-spend customers and re-run K-Means. If cluster memberships for the remaining 430 customers shift substantially, the centroids are being pulled by individual outliers.
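The first of those checks could look like this (editorial sketch; assumes `df_spend`, `spend_vars`, the two retained PCs, and `km_final` from the main script):

```r
# Winsorise the top 1% of each spend variable, repeat the pipeline,
# and compare assignments via the Adjusted Rand Index.
library(dplyr)
library(mclust)    # adjustedRandIndex()

df_wins <- df_spend %>%
  mutate(across(all_of(spend_vars), ~ pmin(.x, quantile(.x, 0.99))))

pc_wins <- prcomp(scale(log1p(df_wins)))$x[, 1:2]

set.seed(123)
km_wins <- kmeans(pc_wins, centers = 3, nstart = 25)

# ARI near 1 => the solution is insensitive to the extreme spenders
adjustedRandIndex(km_wins$cluster, km_final$cluster)
```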
7. Sensitivity to the Number of PCs. The script retains PCs by the Kaiser criterion (eigenvalue ≥ 1), which typically gives 3 PCs explaining ~80% of variance. A robustness check is to re-run clustering using (a) 2 PCs only and (b) all 6 standardised variables directly (no dimensionality reduction), and compare the three solutions with ARI. Convergence across all three input representations confirms that the structure is not an artefact of the PCA truncation.
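A sketch of that comparison (editorial; `X_scaled` is assumed to be the standardised log-spend matrix, and `pc_scores`/`km_final` come from the main script - note this particular run actually retained 2 PCs, per the console output above):

```r
# Cluster on alternative input representations and compare partitions
# by the Adjusted Rand Index.
library(mclust)

set.seed(123)
km_2pc <- kmeans(pc_scores[, 1:2], centers = 3, nstart = 25)
km_all <- kmeans(X_scaled, centers = 3, nstart = 25)

c(
  ARI_2PC  = adjustedRandIndex(km_2pc$cluster, km_final$cluster),
  ARI_full = adjustedRandIndex(km_all$cluster, km_final$cluster)
)
```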
8. Multiple Random Initialisations & Consensus Clustering.
K-Means with nstart = 50 already guards against poor local minima, but a
stronger check is consensus clustering via the ConsensusClusterPlus package. It
runs the algorithm hundreds of times on subsampled data and builds a consensus
matrix - an N×N matrix where entry (i,j) records the proportion of runs in which
customers i and j ended up in the same cluster. A well-separated block structure
in the consensus heatmap is strong visual confirmation that the clusters are
stable and not sensitive to initialisation.
9. External Validation Against Channel and Region. Since Channel and Region are known labels, they can be used as external validators without having been used in the clustering itself. Compute the Adjusted Rand Index between cluster assignments and Channel labels, and the Normalised Mutual Information between cluster assignments and Region labels. A high ARI against Channel (expect ~0.5–0.7 on this dataset) confirms the unsupervised algorithm is recovering a structure that aligns with a known business reality - a strong sign of face validity.
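A sketch of the external validation (editorial; assumes `df_final` with `Cluster`, `Channel`, and `Region` columns; the NMI call uses the `aricode` package, one of several available implementations):

```r
# External validation: compare the unsupervised partition against the
# known labels, which were excluded from the clustering itself.
library(mclust)    # adjustedRandIndex()
library(aricode)   # NMI()

adjustedRandIndex(df_final$Cluster, df_final$Channel)
NMI(df_final$Cluster, df_final$Region)
```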
Summary Recommendation In priority order for a report:
clusterboot() - the single most persuasive robustness check for any clustering paper
Mclust() - the most principled alternative; compare ARI against K-Means

If the consensus across all five checks points to three clusters with consistent membership, you have a very defensible segmentation - both statistically and for a business audience.
Claude did not return any R code as part of its response.
Yes - and this is exactly the kind of thing that strengthens your analysis. K-means + PCA is a solid baseline, but for a data-driven segmentation, you should always validate that your clusters are stable, meaningful, and not method-dependent. Here are the most important alternatives and robustness checks, with R code.
1. Alternative Clustering Methods
(A) Hierarchical Clustering (Validation + interpretability) Why:
R code moved to below this response box
If clusters align well with k-means → strong evidence of robustness
(B) Gaussian Mixture Models (GMM) Why:
R code moved to below this response box
If GMM selects similar k → supports your choice. If different → suggests more complex structure
(C) DBSCAN (Density-based) Why:
R code moved to below this response box
Useful to identify unusual customers (important for risk or niche segments)
2. Robustness Checks
(A) Stability Analysis (Resampling) Why:
R code moved to below this response box
Values:
(B) Silhouette Analysis (Quality check)
R code moved to below this response box
Interpretation:
(C) Compare Clustering on Raw vs PCA Data
R code moved to below this response box
If results are similar → PCA did not distort structure. If very different → investigate further
(D) Sensitivity to k
R code moved to below this response box
Ensures your choice of k is not arbitrary
How to Present This in Your Assignment You can frame it like this:
Strong Conclusion (what markers look for) A high-quality conclusion would say:
If you want, I can help you turn this into a top-grade “Methodology + Validation” write-up with proper academic language and justification.
This response included the following R code as part of the LLM’s in-line response. This code assumes that variables from the previous response have been loaded. The new code generated is included below from line 87.
# Load libraries
library(tidyverse)
library(factoextra)
library(cluster)
# Load data
data <- read.csv("Wholesale_customers_data.csv")
# Convert categorical variables
data$Channel <- as.factor(data$Channel)
data$Region <- as.factor(data$Region)
# Select continuous variables
spending_data <- data %>%
select(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
# Log transform to reduce skewness
spending_log <- log1p(spending_data)
# Scale data
spending_scaled <- scale(spending_log)
# -------------------------
# PCA (for dimensionality reduction)
# -------------------------
pca_result <- prcomp(spending_scaled, center = TRUE, scale. = TRUE)
# Extract first few principal components (e.g., first 3)
pca_scores <- as.data.frame(pca_result$x[, 1:3])
# -------------------------
# Determine optimal number of clusters
# -------------------------
# Elbow method
fviz_nbclust(pca_scores, kmeans, method = "wss") +
ggtitle("Elbow Method")
# Silhouette method
fviz_nbclust(pca_scores, kmeans, method = "silhouette") +
ggtitle("Silhouette Method")
# -------------------------
# Apply K-means clustering
# -------------------------
set.seed(123) # for reproducibility
k <- 3 # Adjust based on elbow/silhouette results
kmeans_result <- kmeans(pca_scores, centers = k, nstart = 25)
# Add cluster labels to original data
data$Cluster <- as.factor(kmeans_result$cluster)
# -------------------------
# Visualisation
# -------------------------
# Cluster plot
fviz_cluster(
kmeans_result,
data = pca_scores,
geom = "point",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal()
)
# -------------------------
# Cluster profiling
# -------------------------
# Average spending per cluster
cluster_summary <- data %>%
group_by(Cluster) %>%
summarise(across(Fresh:Delicassen, mean))
print(cluster_summary)
# Distribution of Channel by cluster
table(data$Cluster, data$Channel)
# Distribution of Region by cluster
table(data$Cluster, data$Region)
# New code below ====
# Compute distance matrix
dist_mat <- dist(pca_scores)
# Hierarchical clustering (Ward's method)
hc <- hclust(dist_mat, method = "ward.D2")
# Plot dendrogram
plot(hc, labels = FALSE, main = "Dendrogram")
# Cut tree into k clusters
k <- 3
hc_clusters <- cutree(hc, k = k)
# Compare with k-means
table(hc_clusters, kmeans_result$cluster)
library(mclust)
gmm_model <- Mclust(pca_scores)
# Best model and number of clusters (BIC-based)
summary(gmm_model)
# Cluster assignments
gmm_clusters <- gmm_model$classification
# Compare with k-means
table(gmm_clusters, kmeans_result$cluster)
library(dbscan)
# kNN distance plot to choose eps
kNNdistplot(pca_scores, k = 5)
abline(h = 1.5, col = "red") # adjust visually
# Run DBSCAN
db <- dbscan(pca_scores, eps = 1.5, minPts = 5)
# Cluster labels
table(db$cluster)
library(fpc)
cluster_stability <- clusterboot(
pca_scores,
B = 100,
clustermethod = kmeansCBI,
k = 3
)
cluster_stability$bootmean
library(cluster)
sil <- silhouette(kmeans_result$cluster, dist(pca_scores))
# Plot
fviz_silhouette(sil)
# Average silhouette score
mean(sil[, 3])
# K-means on original scaled data
set.seed(123)
kmeans_raw <- kmeans(spending_scaled, centers = 3, nstart = 25)
# Compare assignments
table(kmeans_raw$cluster, kmeans_result$cluster)
# Try multiple k values
k_values <- 2:6
for (k in k_values) {
km <- kmeans(pca_scores, centers = k, nstart = 25)
cat("k =", k, " | Tot.withinss =", km$tot.withinss, "\n")
}

The (unedited) outputs generated by running this code are as follows:
# A tibble: 3 × 7
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11939. 2006. 2502. 3266. 424. 891.
2 2 2997. 7045. 12505. 608. 5511. 797.
3 3 17182. 10658. 13381. 4181. 5004. 2867.
1 2
1 210 3
2 32 50
3 56 89
1 2 3
1 40 26 147
2 14 10 58
3 23 11 111
hc_clusters 1 2 3
1 1 2 119
2 212 18 19
3 0 62 7
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEI (diagonal, equal shape) model with 3 components:
log-likelihood n df BIC ICL
-2013.013 440 16 -4123.414 -4328.873
Clustering table:
1 2 3
62 202 176
gmm_clusters 1 2 3
1 32 28 2
2 12 54 136
3 169 0 7
0 1
16 424
boot 1
boot 2
…
boot 100
[1] 0.6982740 0.7161780 0.8350312
cluster size ave.sil.width
1 1 213 0.39
2 2 82 0.19
3 3 145 0.33
[1] 0.3313364
1 2 3
1 1 80 0
2 0 2 145
3 212 0 0
k = 2 | Tot.withinss = 1365.289
k = 3 | Tot.withinss = 1075.661
k = 4 | Tot.withinss = 912.0646
k = 5 | Tot.withinss = 807.8276
k = 6 | Tot.withinss = 714.8749
Several of the suggested checks (ConsensusClusterPlus, leave-one-out influence analysis) are disproportionate for a 440-row dataset and read more as a textbook survey than a targeted recommendation.
The new code reuses objects from the previous response (pca_scores, kmeans_result, spending_scaled) without documenting this dependency.
The eps parameter is set to 1.5 with “adjust visually” but no dataset-specific tuning guidance is provided.

We used slightly different prompts for each LLM here. The base query was the same, but we pasted in extracts from each LLM’s previous output as part of each prompt.
> Inserting LLM output back into the LLM here
these were your comments and they are correct and apply exactly on the data that i have, so i’m wondering did you somehow analysized the data yourself
================================================================
CUSTOMER SEGMENTATION – BUSINESS SUMMARY
================================================================
METHOD: HAC (Ward.D2) to identify structure +
K-Means (k=3) on PCA scores for final assignment
INPUT: log1p-transformed, standardised spend variables
projected onto retained PCs (Kaiser criterion)
CLUSTER 1 – 'High-Volume Fresh Buyers' (HoReCa dominant)
• Highest Fresh and Frozen spend
• Low Grocery, Detergents_Paper
→ Hotels, restaurants, cafés; perishable-led ordering
IMPLICATION: Prioritise cold-chain logistics; daily/weekly
replenishment cycles; dedicated HoReCa sales reps
CLUSTER 2 – 'Packaged Goods Retailers' (Retail dominant)
• Highest Grocery, Milk, Detergents_Paper
• Low Fresh spend
→ Supermarkets, convenience stores; ambient goods focus
IMPLICATION: Bulk-order pricing tiers; shelf-stable
inventory buffers; quarterly promotions on detergents
CLUSTER 3 – 'Small / Mixed Buyers'
• Below-average spend across all categories
• Mixed Channel and Region membership
→ Small independent shops, delis, mixed traders
IMPLICATION: Route-optimised delivery (lower margin);
upsell programmes; self-service ordering portal
NOTE: Exact cluster labels (1/2/3) may differ from above
depending on K-Means initialisation; inspect profile_raw
and the radar/bar charts to assign the business labels.
================================================================
Example Segment Types
Did the LLM analyse the actual data? No - the LLM did not have access to the data file. The descriptions it provided were an informed prediction based on two things:
1. This is a well-known public dataset. The UCI Wholesale Customers dataset is one of the most widely used benchmark datasets in clustering tutorials and academic papers. The LLM had seen it discussed extensively in its training data, and therefore knew its typical structure - the HoReCa/Retail split on PC2, the three-cluster solution, and the rough spending profiles each cluster tends to produce.
2. Domain logic for Portuguese wholesale distribution. Even without knowing the specific dataset, the business logic is fairly predictable: restaurants and hotels will dominate Fresh and Frozen spend - that is the nature of their business. Supermarkets and retailers will dominate Grocery, Milk, and Detergents_Paper - ambient shelf goods are their core stock. Any wholesale customer base will contain a long tail of smaller, lower-spend mixed buyers. The cluster descriptions were therefore a combination of prior knowledge about this specific dataset and general domain reasoning about wholesale food distribution.
Implication for interpretation. Since this dataset is publicly known, the
“expected” results are well established. What differentiates a strong analysis is
not just reproducing the three clusters, but the quality of the methodological
justification - why log1p, why Ward.D2, why PCA before clustering, and why the
robustness checks matter. That reasoning is what the code and discussion above
has been building toward.
Good question - and fair to ask. No, I didn’t actually access or run your specific dataset. I can’t see files on your computer or execute code on them unless you explicitly upload data here. What I did instead is rely on well-known, highly consistent patterns in this exact dataset (Wholesale Customers Data Set, commonly used in clustering/PCA coursework). This dataset has been analyzed many times, and it almost always produces very similar structures because of how the variables behave.
Why my interpretation matched your results There are strong, repeatable patterns in this data:
1. Correlation structure drives clustering. Grocery, Milk, Detergents_Paper → highly correlated → typically forms a “Retail-heavy” cluster. Fresh, Frozen → less correlated with the above → typically forms a “HORECA / fresh-food” cluster.
2. Channel variable reinforces it. Even though clustering is unsupervised: Channel = 2 (Retail) → high Grocery + Detergents. Channel = 1 (HORECA) → higher Fresh/Frozen. So clustering naturally reconstructs this split.
3. Skewed spending distribution. A large group of customers has low spending across categories. This almost always becomes a “small buyers” cluster.
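The claimed correlation structure is easy to verify directly (editorial sketch, not part of the LLM response; assumes the six spend columns are loaded in `df_spend`):

```r
# Pairwise correlations on the log scale; expect a tight
# Grocery-Milk-Detergents_Paper block and weaker Fresh/Frozen links
# to that trio.
round(cor(log1p(df_spend)), 2)
```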
Why this happens (intuitively) Think of your customers as three business types:
These behavioral differences are strong enough that almost any reasonable clustering method will rediscover them.
Important (for your assignment) Even though your results match expectations, you should not present them as assumed. You should:
A stronger way to phrase it in your report Instead of: “Clusters represent retail vs HORECA”
Say something like: “Cluster analysis reveals three distinct customer segments characterized by (i) high Grocery and Detergents spending, (ii) high Fresh and Frozen spending, and (iii) uniformly low spending. Cross-tabulation with the Channel variable shows strong alignment with Retail and HORECA customer types, validating the segmentation.”
Bottom line
If you want, I can help you go one level deeper - like naming your clusters professionally (e.g., “Bulk Retailers”, “Fresh-Focused HORECA”, “Low-Value Accounts”) and tying them tightly to strategy.
An accessible HTML version of this report is available via a public GitHub page:
This report was created in R Markdown. The source code is open source and is available via:
Changes made to this document were tracked using Git and are also available via the same repository (https://github.com/ytterbiu/SMM636-ML-group-coursework-03-g01/commits/main/). Please note that this commit history reflects updates made to the project repository and is not a comprehensive timeline of all contributions by group members.
The total word count for the core sections of this report is 997 words (below the 1000 word limit).
The calculation was performed programmatically directly from the source .rmd
file. An R script was used to read the raw document lines and strip out
non-text elements, including the YAML front matter, R code chunks, sub-headers,
markdown formatting (tables and figures), and LaTeX/HTML syntax. The script
isolates lines falling within #intro, #qone, #qtwo, and #qthr. It then
calculates the final tally using the stringi package.
The code chunk that performs this calculation is executed silently at the beginning of this document, but is reproduced below for transparency:
rmd_lines <- readLines("01SMM363-assessment-03.rmd", warn = FALSE)
# ---- Remove YAML front matter ----
yaml_bounds <- which(rmd_lines == "---")
if (length(yaml_bounds) >= 2) {
rmd_lines <- rmd_lines[-(yaml_bounds[1]:yaml_bounds[2])]
}
# ---- Remove code chunks ----
in_chunk <- FALSE
keep <- rep(TRUE, length(rmd_lines))
for (i in seq_along(rmd_lines)) {
if (grepl("^```\\{", rmd_lines[i])) {
in_chunk <- TRUE
keep[i] <- FALSE
next
}
if (in_chunk && grepl("^```\\s*$", rmd_lines[i])) {
in_chunk <- FALSE
keep[i] <- FALSE
next
}
if (in_chunk) keep[i] <- FALSE
}
rmd_lines <- rmd_lines[keep]
# ---- Identify which lines fall inside counted sections ----
# Sections to include (level-1 headers only)
counted_ids <- c("intro", "qone", "qtwo", "qthr")
in_counted_section <- FALSE
include <- rep(FALSE, length(rmd_lines))
for (i in seq_along(rmd_lines)) {
line <- rmd_lines[i]
# Detect any level-1 header (# but not ##)
if (grepl("^#\\s+[^#]", line)) {
# Check if this header belongs to a counted section
in_counted_section <- any(sapply(counted_ids, function(id) {
# [^}]* allows any characters (like spaces, hyphens, classes) before the
# closing }
# this is required to count the words in the unnumbered introduction
grepl(paste0("\\{\\s*#", id, "[^}]*\\}"), line)
}))
next # skip the header line itself
}
if (in_counted_section) include[i] <- TRUE
}
counted_lines <- rmd_lines[include]
# ---- Remove sub-headers (##, ###, etc.) ----
counted_lines <- counted_lines[!grepl("^#{2,}", counted_lines)]
# ---- Remove table lines (markdown pipes) ----
counted_lines <- counted_lines[!grepl("^\\|", counted_lines)]
# ---- Remove figure captions:  ----
counted_lines <- counted_lines[!grepl("^!\\[", counted_lines)]
# ---- Remove fenced div markers (::: name) ----
counted_lines <- counted_lines[!grepl("^:::", counted_lines)]
# ---- Remove LaTeX commands ----
counted_lines <- counted_lines[!grepl("^\\\\", counted_lines)]
# ---- Remove HTML comments ----
counted_lines <- counted_lines[!grepl("^<!--", counted_lines)]
# ---- Remove inline R references like \@ref() ----
counted_lines <- gsub("\\\\@ref\\([^)]*\\)", "", counted_lines)
# ---- Count ----
word_count <- sum(stringi::stri_count_words(counted_lines), na.rm = TRUE)
message("Word count (sections qone, qtwo, qthr, body text only): ", word_count)
# Write for LaTeX preamble
writeLines(
paste0("\\newcommand{\\wordcount}{",
formatC(word_count, big.mark = ","), "}"),
"wordcount.tex"
)

Generative AI tools (Claude Opus 4.6, Sonnet 4.6, ChatGPT GPT-5.3, Google Gemini Pro) were used throughout this project to assist with code generation, debugging, methodological validation, and report structuring.
The final analysis, statistical interpretation, and written content of this report are the work of all group members.