Researchers develop Bayesian inference for hidden dependence structures in multi-group high-dimensional data

Professor Kyoungjae Lee and his team at Sungkyunkwan University have developed a method for a problem that arises wherever data gets complicated: finding hidden connections in massive datasets without oversimplifying reality. The method, called j-LANCE, reveals the dependence structures that connect variables in genomic data, climate records, financial markets, and sensor networks, and it does so while allowing different groups of data to follow different rules.

High-dimensional data is everywhere now. Climate scientists observe thousands of measurements across regions. Genomicists sequence millions of genetic markers simultaneously. Financial analysts track countless market variables in real time. In each case, the raw numbers alone tell an incomplete story. What matters is understanding how the pieces relate to one another. In climate data, temperatures in nearby regions influence one another because of wind and weather patterns. In genomic data, genes located next to each other on a chromosome often work together. Traditional statistical methods either analyze these variables in isolation — losing the relational information entirely — or assume all groups operate identically, which is rarely true.

The j-LANCE framework, developed in collaboration with Professor Won Chang of Seoul National University and Professor Xuan Cao of the University of Cincinnati, takes a different approach. The method exploits a key insight: in real-world datasets, variables typically have a natural ordering and are most strongly connected to their immediate neighbors. Instead of trying to learn every possible relationship simultaneously, j-LANCE learns which variables are most tightly linked across space or sequence, while remaining flexible about how these patterns differ between groups. The research team used a Markov random field prior — a mathematical structure that allows similarities and differences across groups to emerge naturally from the data itself, rather than being imposed as an assumption.
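To make the neighbor-based idea concrete, here is a minimal Python sketch. It is not the j-LANCE procedure itself: it swaps the Bayesian machinery for ordinary least squares, handles a single group, and assumes a fixed, user-chosen bandwidth. The function names (`local_cholesky_fit`, `precision_from_fit`), the simulated data, and all parameter values are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def local_cholesky_fit(X, bandwidth):
    """Regress each ordered variable on its `bandwidth` nearest predecessors.

    Returns (A, d): A is strictly lower triangular and holds the regression
    coefficients; d holds the residual variances.
    """
    Xc = X - X.mean(axis=0)              # center each column
    n, p = Xc.shape
    A = np.zeros((p, p))
    d = np.empty(p)
    d[0] = np.mean(Xc[:, 0] ** 2)        # first variable has no predecessors
    for j in range(1, p):
        lo = max(0, j - bandwidth)       # only the nearest predecessors
        Z = Xc[:, lo:j]
        coef, *_ = np.linalg.lstsq(Z, Xc[:, j], rcond=None)
        A[j, lo:j] = coef
        d[j] = np.mean((Xc[:, j] - Z @ coef) ** 2)
    return A, d

def precision_from_fit(A, d):
    """Implied precision matrix: (I - A)^T diag(1/d) (I - A)."""
    I = np.eye(len(d))
    return (I - A).T @ np.diag(1.0 / d) @ (I - A)

# Simulated data (an assumption): each variable depends only on the one
# immediately before it in the ordering, with coefficient phi.
rng = np.random.default_rng(0)
n, p, phi = 2000, 8, 0.6
X = np.empty((n, p))
X[:, 0] = rng.normal(size=n)
for j in range(1, p):
    X[:, j] = phi * X[:, j - 1] + rng.normal(size=n)

A, d = local_cholesky_fit(X, bandwidth=2)
omega = precision_from_fit(A, d)
print(np.round(A, 2))                    # coefficients concentrate next to the diagonal
```

Because each variable is regressed only on a short window of predecessors, the implied precision matrix is banded, which is what keeps this kind of model tractable when the number of variables is large.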

The technical achievement is substantial. The researchers proved mathematically that j-LANCE can accurately estimate these dependence structures even when data is extremely high-dimensional, and that the rate at which the estimates converge to the true values is nearly optimal. Equally important for practitioners, the method avoids Markov chain Monte Carlo (MCMC), a computationally expensive iterative sampling technique that slows analysis to a crawl on large datasets. By sidestepping that bottleneck, j-LANCE enables fast analysis even when variables number in the thousands.

The practical test came from climate data. The team analyzed temperatures from 30 locations across the Pacific Northwest of the United States between 2019 and 2021, using ERA5 reanalysis data and a spatial ordering that reflects wind flow patterns. The results supported the approach: j-LANCE captured similar temperature dependence patterns across all three years while also detecting distinctive structures that appeared only in individual years. Methods that force a choice between fully separate and fully identical groups would miss this kind of nuanced finding.
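The balance between shared and year-specific structure can be illustrated with a toy Markov random field prior. The Python sketch below is a generic Ising-style construction, assumed here for illustration and not the prior specified in the paper. Each of three groups (for example, three years) has a binary indicator for whether a given dependence is present; `a` controls overall sparsity and `b` rewards pairs of groups that both include it. The function name `mrf_prior` and the parameter values are hypothetical.

```python
from itertools import product
import math

def mrf_prior(n_groups, a, b):
    """Return {configuration: probability} for one binary indicator per group.

    p(gamma) is proportional to
    exp(a * (number of groups 'on') + b * (number of group pairs both 'on')).
    """
    weights = {}
    for gamma in product((0, 1), repeat=n_groups):
        on = sum(gamma)
        pairs_on = on * (on - 1) // 2    # pairs of groups that are both 'on'
        weights[gamma] = math.exp(a * on + b * pairs_on)
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

independent = mrf_prior(3, a=-1.0, b=0.0)   # no sharing between groups
linked = mrf_prior(3, a=-1.0, b=1.0)        # groups encouraged to agree

for gamma in [(1, 1, 1), (1, 0, 0), (0, 0, 0)]:
    print(gamma, round(independent[gamma], 3), round(linked[gamma], 3))
```

With `b = 0` the groups are independent. A positive `b` shifts prior mass toward configurations in which several groups include the dependence, while configurations in which only one group includes it keep nonzero probability, so the data can still support structure unique to a single year.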

The implications ripple outward. Climate modelers can now better understand regional temperature networks. Genomicists can identify gene interactions more reliably. Financial risk managers can model correlations more accurately across market regimes. Sensor networks deployed across industrial or environmental systems can reveal how measurements cascade and influence one another. The method addresses a genuine gap in statistical science: the need for flexible, computationally efficient tools that learn dependence structures while respecting both the real ordering in data and the reality that different groups behave differently.

The research appears in Bayesian Analysis, a peer-reviewed journal. As high-dimensional data continues to accumulate across every scientific and industrial field, methods like j-LANCE that extract meaningful structure without sacrificing speed or accuracy will become increasingly essential infrastructure for understanding our complex world.
