Probabilistic Merging?

Patrick Durusau patrick at durusau.net
Sat Jun 30 15:15:03 CEST 2012


Greetings!

Ran across an article that I will be blogging about soon, but wanted to 
share it ahead of time.

Probabilistic merging across databases.

Well, the actual title is:

> SkyQuery: An Implementation of a Parallel Probabilistic Join Engine 
> for Cross-Identification of Multiple Astronomical Databases

The abstract:

> Multi-wavelength astronomical studies require cross-identification of 
> detections of the same celestial objects in multiple catalogs based on 
> spherical coordinates and other properties. Because of the large data 
> volumes and spherical geometry, the symmetric N-way association of 
> astronomical detections is a computationally intensive problem, even 
> when sophisticated indexing schemes are used to exclude obviously 
> false candidates. Legacy astronomical catalogs already contain 
> detections of more than a hundred million objects while the ongoing 
> and future surveys will produce catalogs of billions of objects with 
> multiple detections of each at different times. The varying 
> statistical error of position measurements, moving and extended 
> objects, and other physical properties make it necessary to perform 
> the cross-identification using a mathematically correct, proper 
> Bayesian probabilistic algorithm, capable of including various priors. 
> One time, pair-wise cross-identification of these large catalogs is 
> not sufficient for many astronomical scenarios. Consequently, a novel 
> system is necessary that can cross-identify multiple catalogs 
> on-demand, efficiently and reliably. In this paper, we present our 
> solution based on a cluster of commodity servers and ordinary 
> relational databases. The cross-identification problems are formulated 
> in a language based on SQL, but extended with special clauses. These 
> special queries are partitioned spatially by coordinate ranges and 
> compiled into a complex workflow of ordinary SQL queries. Workflows 
> are then executed in a parallel framework using a cluster of servers 
> hosting identical mirrors of the same data sets. 

Source: http://arxiv.org/abs/1206.5021
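
To make "probabilistic" concrete, here is a rough sketch of a two-catalog 
match test. It is not taken from the paper. I am assuming the flat-sky 
Gaussian Bayes factor from Budavári and Szalay (2008), and the function 
names and the prior parameter are my own, for illustration only.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # one arcsecond in radians


def bayes_factor(sep_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Bayes factor for "same object" versus "two unrelated objects".

    Assumes circular Gaussian position errors and a small separation
    (flat-sky approximation). Inputs are in arcseconds and are
    converted to radians before use.
    """
    psi = sep_arcsec * ARCSEC
    var = (sigma1_arcsec * ARCSEC) ** 2 + (sigma2_arcsec * ARCSEC) ** 2
    return 2.0 / var * math.exp(-psi ** 2 / (2.0 * var))


def posterior(bayes, prior):
    """Posterior probability of a match, given a Bayes factor and a prior."""
    return 1.0 / (1.0 + (1.0 - prior) / (bayes * prior))
```

A Bayes factor above 1 favors a match and one below 1 counts against it. 
The prior is where knowledge such as source density would enter, which is 
one of the parameters two systems would have to agree on.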

Key sentence: "One time, pair-wise cross-identification of these large 
catalogs is not sufficient for many astronomical scenarios."

I suspect that to be the case for many scenarios, not just those in 
astronomy.

But how would I reliably interchange the parameters for such queries?

Standards anyone?
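
Absent a standard, the least one could do is write the parameters down in 
a form that survives a round trip. A minimal sketch, assuming JSON as the 
carrier. Every field name and catalog name here is hypothetical and comes 
from no existing specification or from the paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MatchParameters:
    """Hypothetical parameter set for one cross-identification run."""
    catalogs: tuple        # catalog identifiers (made-up names below)
    sigma_arcsec: tuple    # assumed position error per catalog
    prior: float           # prior probability of a match
    min_posterior: float   # threshold for accepting a match


def dump(params):
    """Serialize to JSON with sorted keys for a stable text form."""
    return json.dumps(asdict(params), sort_keys=True)


def load(text):
    """Rebuild the parameters; JSON arrays become tuples again."""
    raw = json.loads(text)
    return MatchParameters(
        catalogs=tuple(raw["catalogs"]),
        sigma_arcsec=tuple(raw["sigma_arcsec"]),
        prior=raw["prior"],
        min_posterior=raw["min_posterior"],
    )


params = MatchParameters(("catalog_a", "catalog_b"), (0.1, 0.5), 1e-9, 0.95)
restored = load(dump(params))
```

That only fixes the syntax. What the fields mean, and whether two systems 
compute the same thing from them, is the part a standard would have to 
settle.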

Hope everyone is having a great weekend!

Patrick




-- 
Patrick Durusau
patrick at durusau.net
Former Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau


