Creator of galaxy-inspired algorithm torque clustering envisions a path toward AI beyond human blind spots

Infrared view of the M81 galaxy [NASA]

On a clear night in early 2019, in his hometown in China, Jie Yang, Ph.D., was thinking about how galaxies merge. In essence, they combine through mass, distance and gravity, with nothing else involved. Yang, now a postdoctoral research fellow at the University of Sydney, asked whether a clustering method could group itself the same way. “I also believe in Occam’s razor,” Yang said in a recent interview. “I think a good model should be as simple as it can be. There is a saying in Chinese, Da Dao Zhi Jian (大道至简), meaning the deepest principles are the simplest ones. The galaxy picture and that idea came together, and that is where torque clustering began.”

The core intuition of torque clustering is that a cluster joins its nearest higher-mass neighbor unless that connection is both heavy and long. In other words, two large clusters separated by a long distance create a high-torque link. “That one rule, inspired by how galaxies behave, is enough to recover a surprisingly broad range of structures in data,” Yang said. The idea has gained traction steadily after it was announced with a corresponding GitHub repo. Yang and his doctoral advisor at the University of Technology Sydney (UTS), Distinguished Professor Chin-Teng Lin, published it in IEEE Transactions on Pattern Analysis and Machine Intelligence in January 2025 (DOI: 10.1109/TPAMI.2025.3535743; an earlier preprint is on arXiv), reporting an average adjusted mutual information score of 97.7% across 1,000 datasets, against roughly 80% for other leading methods. R&D World covered the launch at the time.

The torque clustering data have continued to show high accuracy in many settings. A 2025 study by David Hofmeyr, Ph.D., senior lecturer in statistics at Lancaster University, pitted torque clustering against his own automatic method and standard algorithms such as k-means, spectral clustering and HDBSCAN, and across all 48 clustering tasks torque posted the best average rank on Adjusted Mutual Information (AMI), one of the field’s key accuracy metrics. Yang said Hofmeyr reached out only to ask how to run the open-source code. The paper notes that torque had no chance to be tuned on those datasets, which Yang considers the point. “The study happened without my knowledge,” he said. “That was a clean independent test, and a published result like that means much more than anything I could say about my own method.”

The algorithm has also traveled into fields Yang never targeted, turning up in published work on software fairness, environmental forecasting, hydraulic engineering and signal processing. The example that surprised him most came from 6G, where a team that included two IEEE fellows used torque, which their paper calls a state-of-the-art algorithm, to separate real objects, such as vehicles and pedestrians, from noise in LiDAR point clouds across near and far ranges while building a digital-twin dataset for next-generation sensing (SynthSoM-Twin).

On the benefits of clustering autonomy

Several widely used clustering methods make you choose key structure up front. K-means, the standard workhorse, is a clear example: you fix the number of groups before you start, leaning on heuristics like the elbow method or silhouette analysis to guess at it. Torque can remove that step in automatic mode. “Torque clustering autonomously detects the number of clusters and identifies noise,” Yang said.
Yang sees torque clustering as complementary to foundation models, which have grown steadily more expensive to train. Public estimates from Stanford’s AI Index and Epoch AI put GPT-4 (2023) at roughly $78 million in compute and Google’s Gemini Ultra at about $191 million. The largest publicly estimated final training run in Epoch AI’s data is xAI’s Grok 4, which Epoch estimated in September 2025 cost about $490 million for the final training run alone, before staff, experiments or the compute to serve it.

Single training runs are now a minority of what the leading labs spend. Epoch puts OpenAI’s R&D compute at around $5 billion in 2024, of which only about $500 million, roughly 10%, went to the final runs that produced released models like GPT-4.5; the rest went to experiments, synthetic-data generation and unreleased models. That total is projected to have risen to roughly $9 billion in 2025, and Epoch’s cost model projects the largest single runs will cross $1 billion by 2027.

The 2020 scaling-laws paper that Dario Amodei co-authored at OpenAI, before he founded Anthropic, found that a model’s loss falls predictably as a power law in three quantities: model size, dataset size and training compute. That lever gets harder to pull as high-quality public text runs short. Epoch AI projects that, on current trends, models will use up the stock of human-generated public text between 2026 and 2032.

Synthetic data is one answer to that squeeze. Yang points to a different one, and he starts by noting that the squeeze sits on top of a paradigm that is already unsupervised at its core. “The training behind every large language model is itself a form of unsupervised learning,” Yang said. That shift is much of what set off the current AI wave. By training models to predict the next token in ordinary text rather than learning from human-labeled examples, labs could draw on enormous unlabeled corpora and largely escape the labeling bottleneck that had constrained earlier machine learning. The transformer architecture, introduced in 2017, made the approach practical at the scale modern models require. “So the real question is which unsupervised problems we will solve and which we have not.”

Yang sees two roles for a method like torque clustering. One is upstream, organizing and structuring the data that feeds large models. The second he considers more important, because it lands on a constraint the field is hitting now. “Larger models are starting to run low on high-quality data,” he said. “Some are already training partly on synthetic data.”

Yang reaches for a familiar metaphor with a specific number behind it, citing IDC’s Global DataSphere forecast, which projects annual data generation to reach roughly 394 zettabytes by 2028, with more than 80% of it unstructured by the standard analyst consensus, most of it unlabeled and sitting unused. “I think of unstructured data like crude oil: it is hugely valuable, but until it is refined,” Yang said. “Tools like torque clustering are a way to refine it, to pull out the patterns and the structures and the relationships hidden inside, and turn them into something a model can actually learn from.”

Human, all too human

At Sydney, Yang is applying Torque to multimodal physiological signals. “I am trying to apply torque clustering to real applications, such as sleep staging and sleep-related disorders,” Yang said. Sleep is scored into a few fixed stages that humans defined decades ago, and Yang wants to know whether an autonomous method can surface sub-stages or patient subtypes those categories would hide.

While large language models have been trained on massive corpora, future AI may peer beyond the limits of human knowledge. The clearest precedents come from games, where a machine handed nothing but the rules has repeatedly surfaced ideas that centuries of human play had missed. AlphaZero taught itself chess from self-play in 2017 and settled on a style that DeepMind’s Demis Hassabis called “alien.” It won with counterintuitive sacrifices for long-term position, all of it inside the rules of the game.

Yang reaches for the Go version of that moment. “In 2016, AlphaGo made moves that no human player would make,” Yang said. Unlike AlphaZero, the original AlphaGo first learned from a database of expert human games before refining its play through self-play. Even so, it produced moves human players almost never would. DeepMind estimated that Move 37 in its 2016 match against legendary player Lee Sedol had about a one-in-10,000 chance of being played by a human, and the move helped AlphaGo win that game.

Today’s models train on human-generated data and tend to inherit its limits. “If we use unsupervised methods like torque clustering to find the structure that humans have not noticed, and feed that into how we train AI models, we might get models that are genuinely more creative, models that can see past our blind spots,” Yang said. “I want to be careful to call that a direction, but it is a direction I find promising.”

The torque clustering method is itself Creative Commons (CC BY-NC-SA) licensed while use for financial gain is restricted. “I wanted the method to remain fully open for academic and research use because reproducibility matters,” he said. “I chose to restrict commercial use because I want a clear path for bringing torque clustering into industry.” He notes that there has been “real commercial interest, including financial-sector use cases around customer-behavior clustering.”

On the benefits of clustering autonomy

Human, all too human

Related Articles Read More >

As AI floods drug discovery with designs, Twist uses DNA chips to tackle the wet-lab bottleneck

Inside AutoLabs: PNNL’s self-correcting AI still needs an expert in the loop

Claude Opus 5 outscores Fable 5 on 8 of 13 benchmarks at half the token price

As U.S. power demand surges, Brookhaven taps AWS in aim to drive 20-100x faster power grid decision-making with GridSearch AI

Search R&D World