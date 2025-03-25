In my pursuit to understand why certain themes consistently outperformed others, Google Analytics alone fell flat: it could not explain why some articles soared while others stumbled. So I downloaded decades of data from Google Analytics and WordPress, loaded it into a Jupyter notebook, and applied semantic clustering, a type of unsupervised machine learning that groups articles based purely on linguistic and thematic similarities. After waiting some time for the data to crunch, I perked up when I found a cluster whose content seemed to perform 20.1× better than average. To be fair, while it captured some of the strongest recent performers, the 83 members it turned up were a hodgepodge: one was a 17-year-old Ebola story from 2008, and some were so old they had no traffic data at all.

But there is clearly a signal here. Based on the items that did have data, this cluster (Cluster 4) significantly outperforms all other clusters across several metrics:

Unique visitors: Cluster 4 draws approximately 17.7 times more visitors per item than the next best cluster (Cluster 1).

Total pageviews: Cluster 4 generates about 17.9 times more pageviews per item than Cluster 1.

Engagement time: Users spend about 2.6 times longer on Cluster 4 items than on items in the next highest cluster.
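To make the "times more per item" comparisons concrete, here is a minimal sketch of how such a ratio could be computed. The rows and numbers below are hypothetical placeholders, not the actual R&D World data, and the function names are my own; the point is only that items without analytics data are skipped before averaging, and the best cluster is then compared with the runner-up.

```python
from statistics import mean

# Hypothetical rows: (cluster_id, pageviews), with None when an item
# has no analytics data. These numbers are illustrative only.
rows = [
    (4, 9000), (4, 12000), (4, None),
    (1, 500), (1, 700),
    (2, 300), (2, None), (2, 100),
]

def per_item_mean(rows):
    """Average the metric within each cluster, skipping items with no data."""
    by_cluster = {}
    for cluster_id, value in rows:
        if value is not None:
            by_cluster.setdefault(cluster_id, []).append(value)
    return {cluster_id: mean(values) for cluster_id, values in by_cluster.items()}

def ratio_to_runner_up(means):
    """Return (best cluster, runner-up cluster, best mean / runner-up mean)."""
    ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_mean), (second, second_mean) = ranked[0], ranked[1]
    return best, second, best_mean / second_mean

means = per_item_mean(rows)
best, second, ratio = ratio_to_runner_up(means)
print(f"Cluster {best} averages {ratio:.1f}x the per-item value of Cluster {second}")
```

The same two functions would apply to unique visitors or engagement time by swapping the metric column.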

In terms of themes, Cluster 4 included content on a range of subjects (full list below), including R&D, AI, science, solar, and computing. Below is a view of the clusters. Because the articles were clustered using ModernBERT embeddings, viewing them required projecting the high-dimensional semantic embeddings into a two-dimensional space. I used t-SNE for this. Claude 3.7 Sonnet helped with coding and analyzing the data.
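The projection step can be sketched roughly as follows, assuming scikit-learn and NumPy are available. The embeddings here are random placeholders standing in for real ModernBERT vectors, and the array size, perplexity, and seed are arbitrary choices for illustration rather than the settings used for the chart.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder embeddings: 60 items, 768 dimensions each. In a real run,
# these rows would come from the embedding model, one per headline.
embeddings = rng.normal(size=(60, 768))

# perplexity must be smaller than the number of items; 15 is an arbitrary pick.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)

# coords has one (x, y) pair per item, ready for a scatter plot
# colored by cluster label.
print(coords.shape)
```

Note that t-SNE only produces the 2D layout for viewing; the cluster labels themselves come from the clustering step on the original embeddings.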

Why did the cluster turn up items with and without Google Analytics data? Because the clustering method doesn’t care about publication dates or pageviews; it looks strictly at the thematic signals in the headlines. It surfaced an Ebola article, for instance, because the headline asked a big, forward-looking question, mentioned potential global health crises, and used a cautionary tone. Those traits were similar to some of our newer high performers.
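A toy illustration of that point: similarity is computed from the embedding vectors alone, so an old item with no traffic data can still land next to a recent hit. The three-dimensional vectors, titles, and pageview counts below are invented for the example, and cosine similarity is an assumed measure, not necessarily the one used in the analysis.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical items: (title, year, pageviews or None, toy embedding).
items = [
    ("old-health-warning", 2008, None, [0.9, 0.1, 0.2]),
    ("recent-ai-forecast", 2024, 15000, [0.8, 0.2, 0.3]),
    ("recent-product-note", 2024, 400, [0.1, 0.9, 0.1]),
]

def nearest(target, items):
    """Find the most similar other item, using only the embedding."""
    others = [item for item in items if item is not target]
    return max(others, key=lambda item: cosine(target[3], item[3]))

# Year and pageviews are never consulted when matching.
match = nearest(items[0], items)
print(match[0])
```

Traffic metrics only enter afterward, when per-cluster averages are computed over the items that have data.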

Clustering techniques, paired with visualizations like the t-SNE projection shown, could be valuable in a range of scientific research areas. In cancer research, for instance, clustering can identify patterns in gene expression data that differentiate cancer cells from normal cells.

Here are the items in R&D World's Cluster 4, organized by theme.

Cluster 4: AI and computing

Cluster 4: Space and astronomy

Cluster 4: Energy and sustainability

Cluster 4: Biology and medicine

Cluster 4: Materials science and physics

Cluster 4: Research and industry trends

Cluster 4: Earth and environmental sciences