Probabilistic Merging?

Patrick Durusau
Sat Jun 30 15:15:03 CEST 2012


Ran across an article that will appear in my blog soon but wanted to 
share it ahead of time.

Probabilistic merging across databases.

Well, the actual title is:

> SkyQuery: An Implementation of a Parallel Probabilistic Join Engine 
> for Cross-Identification of Multiple Astronomical Databases

The abstract:

> Multi-wavelength astronomical studies require cross-identification of 
> detections of the same celestial objects in multiple catalogs based on 
> spherical coordinates and other properties. Because of the large data 
> volumes and spherical geometry, the symmetric N-way association of 
> astronomical detections is a computationally intensive problem, even 
> when sophisticated indexing schemes are used to exclude obviously 
> false candidates. Legacy astronomical catalogs already contain 
> detections of more than a hundred million objects while the ongoing 
> and future surveys will produce catalogs of billions of objects with 
> multiple detections of each at different times. The varying 
> statistical error of position measurements, moving and extended 
> objects, and other physical properties make it necessary to perform 
> the cross-identification using a mathematically correct, proper 
> Bayesian probabilistic algorithm, capable of including various priors. 
> One time, pair-wise cross-identification of these large catalogs is 
> not sufficient for many astronomical scenarios. Consequently, a novel 
> system is necessary that can cross-identify multiple catalogs 
> on-demand, efficiently and reliably. In this paper, we present our 
> solution based on a cluster of commodity servers and ordinary 
> relational databases. The cross-identification problems are formulated 
> in a language based on SQL, but extended with special clauses. These 
> special queries are partitioned spatially by coordinate ranges and 
> compiled into a complex workflow of ordinary SQL queries. Workflows 
> are then executed in a parallel framework using a cluster of servers 
> hosting identical mirrors of the same data sets. 


Key sentence: "One time, pair-wise cross-identification of these large 
catalogs is not sufficient for many astronomical scenarios."

I suspect that to be the case for many scenarios, not just those in 
astronomy.

But how would I reliably interchange the parameters for such queries?

Standards anyone?
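
To make the question concrete, here is a minimal Python sketch of what 
"the parameters" of a probabilistic match might be and how they could be 
written out for exchange. Everything in it is an assumption for 
illustration, not SkyQuery's actual interface: the `MatchParams` fields, 
the JSON encoding, and the two-catalog Bayes factor for Gaussian 
positional errors (a form found in the astronomical cross-matching 
literature; the abstract does not say which formula SkyQuery uses).

```python
import json
import math
from dataclasses import dataclass, asdict

ARCSEC = math.pi / (180.0 * 3600.0)  # one arcsecond in radians


@dataclass(frozen=True)
class MatchParams:
    """Hypothetical parameter set two parties would have to agree on."""
    sigma1_arcsec: float  # 1-sigma positional error, catalog 1
    sigma2_arcsec: float  # 1-sigma positional error, catalog 2
    prior: float          # prior probability that a candidate pair matches
    threshold: float      # posterior cutoff for accepting a match


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(h))


def bayes_factor(psi, params):
    """Assumed Gaussian-error Bayes factor for two detections psi radians apart."""
    s2 = (params.sigma1_arcsec * ARCSEC) ** 2 + (params.sigma2_arcsec * ARCSEC) ** 2
    return (2.0 / s2) * math.exp(-psi ** 2 / (2.0 * s2))


def posterior(bf, prior):
    """Posterior match probability from a Bayes factor and a prior in (0, 1)."""
    return bf * prior / (bf * prior + (1.0 - prior))


def is_match(ra1, dec1, ra2, dec2, params):
    """Accept the pair if its posterior reaches the agreed threshold."""
    psi = angular_separation(ra1, dec1, ra2, dec2)
    return posterior(bayes_factor(psi, params), params.prior) >= params.threshold


# The same parameters, serialized for exchange and read back unchanged.
params = MatchParams(sigma1_arcsec=0.1, sigma2_arcsec=0.1,
                     prior=1e-9, threshold=0.9)
wire = json.dumps(asdict(params), sort_keys=True)
restored = MatchParams(**json.loads(wire))
```

Even this toy version shows the problem: two parties who disagree on 
the error model, the prior, or the cutoff will "merge" different sets of 
objects from the same catalogs, so those values would have to travel with 
the query.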

Hope everyone is having a great weekend!


Patrick Durusau
patrick at
Former Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog):
Twitter: patrickDurusau

