PaNdata Open Data Infrastructure

The PaN-data collaboration brings together eleven large multidisciplinary Research Infrastructures which operate hundreds of instruments used by over 30,000 scientists each year. They support fields as varied as physics, chemistry, biology, material sciences, energy technology, environmental science, medical technology and cultural heritage. Applications are numerous: for example, crystallography can reveal the structures of viruses and proteins important for the development of new drugs; neutron scattering can identify stresses within engineering components such as turbine blades; and tomography can image microscopic details of the 3D structure of the brain. Commercial users include the pharmaceutical, petrochemical and microelectronic industries.

PaNdata-ODI will develop, deploy and operate an Open Data Infrastructure across the participating facilities with user and data services which support the tracing of provenance of data, preservation, and scalability through parallel access. It will be instantiated through three virtual laboratories supporting powder diffraction, small angle scattering and tomography.

PaN-data ODI in brief

PaNdata Open Data Infrastructure is a proposal to construct and operate a sustainable data infrastructure for European Photon and Neutron laboratories. This will enhance all research done in the neutron and photon communities by making scientific data accessible, allowing experiments to be carried out jointly in several laboratories.

Formed in 2008, the PaNdata collaboration currently brings together eleven major world-class European Research Infrastructures to construct and operate a common data infrastructure for the European Neutron and Photon large facilities (see www.pandata.eu). In 2010, the consortium began a Support Action which focuses on standardisation activities in the areas of data policy, user information exchange, scientific data formats, interoperation of data analysis software, and integration and cross-linking of research outputs. These standards form the baseline for PaNdata ODI and will ensure that the research and development activities deliver outputs that can readily be deployed into common services. These services will integrate data across the consortium to create a fully integrated, pan-European research data infrastructure supporting numerous scientific communities across Europe.

Scientifically, neutron and photon laboratories are complementary research facilities, often focussing on different aspects of the wide research spectrum covered by these facilities. They support experiments in scientific fields as varied as physics, chemistry, biology, material sciences, energy technology, environmental science, medical technology and even cultural heritage investigations. Industrial applications are growing, notably in the fields of pharmaceuticals, petrochemicals and microelectronics. A variety of experimental techniques are deployed in these facilities, including photoemission and spectromicroscopy, macromolecular crystallography, low-angle scattering, dichroic absorption spectroscopy, and neutron and x-ray imaging. Applications are numerous and varied. For example, crystallography reveals the structures of viruses and proteins which are important for the development of new drugs to fight everything from flu to HIV and cancer. Penetration deep inside materials such as steel can identify stresses and strains within engineering components such as turbine blades. Tomography investigations reveal microscopic details of the 3D structure of the brain. Observation under changing conditions can help improve processes for the manufacture of plastics and foods and develop ever smaller magnetic recording materials important for data storage in computers.

The digital revolution has enabled rapid advances and opened up huge opportunities for all these research fields while at the same time bringing some significant challenges. The research community has begun to address unresolved challenges in long-term preservation and access to information by setting up repositories, some focusing on documents, some on data, others on both, with many serving specific disciplines, and by devising sound policies to encourage the sharing of the data. Whilst the more general aspects of European data infrastructure are being coordinated by various initiatives and projects such as the Alliance for Permanent Access, e-IRG and ESFRI, many of which involve the PaNdata partners, the PaNdata ODI project addresses the specific, urgent, and pragmatic needs for a data infrastructure serving the Photon and Neutron science communities in Europe.

The participating facilities serve an expanding user community of well in excess of 30,000 visiting scientists each year across Europe and are major producers of scientific data. Three new light sources became operational relatively recently (SOLEIL, DIAMOND, PETRA-III) and several other facilities are being planned, constructed or upgraded (ALBA, EU-XFEL, FERMI, ESRF, ILL, ISIS, SwissFEL). Taken together these facilities will soon produce enormous quantities of scientific data, more, for example, than is planned for the Large Hadron Collider (LHC) at CERN. This upcoming "data avalanche", a result of the increased capability of modern electronic detectors and high-throughput automated experiments, makes it essential that forces are joined to implement and deploy a framework for efficient and sustainable data management and analysis.

The facilities are at the centre of the scientific activity of this community, providing a focus for activities and producing the data which are the raw materials for science. The experiments in these facilities are of increasing complexity, and are increasingly performed in more than one laboratory by collaborations between international research groups. The resulting raw and processed data need to be accessible over the Internet across facilities and user institutions. They should remain on-line at least until the results are published, and in many cases much longer, to allow re-processing and the preservation of knowledge.

Historically, the situation at many of the facilities, and in particular at the photon sources, has left data management largely up to the individual users, who often literally carried data away on portable media. These media are notoriously unsuitable for guaranteeing the longevity and availability of precious and costly experimental data. Not only is this becoming unfeasible considering the dramatic increase in size of some of the data sets, it is also counterproductive for the scientific workflow and for the verifiability of the data analysis, and ultimately constitutes a dramatic loss for the scientific community. Presently, access to instruments, data, software and e-infrastructure is being standardised between the facilities through the PaNdata Support Action. This will tremendously simplify the landscape for multi-disciplinary exploitation of the instruments and lay the groundwork for common implementation of data management infrastructure across these participating facilities and beyond.

Once data standards for European synchrotrons and neutron sources have been agreed and implemented through open networked interfaces, industry will be able to utilise publicly available data, processing or reordering it so that it can be presented with added value to commercial market segments such as life science, engineering or material science.

The potential and progress of the project will be readily disseminated to the scientific community through other relevant Integrated Infrastructure Initiatives (I3), specifically, NMI3 for neutrons and ELISA for synchrotrons and FELs. NMI3 and ELISA are each coordinated by one of the partners of PaNdata Europe. Links to other relevant types of multidisciplinary RIs, such as lasers or NMR, will be made through the I3 Network which is also coordinated by one of the partners. These will also enable rapid roll-out to other neutron and photon RIs. Cooperative knowledge exchange between PaNdata and e-infrastructure providers like EGI and PRACE will strongly benefit from the standardisation efforts and significantly enhance the research opportunities of photon and neutron user communities.

The clear benefit of an EU-funded collaborative project will be the strong incentive and timescale for initiating and completing actions. Considering the demonstrated success of collaborative ventures within the NMI3 and ELISA projects and their successful routine operation, we expect the same to evolve from this work. This project also provides an opportunity for wider collaborations between similar relevant European initiatives and will ensure integration into the wider data infrastructure supporting multi-disciplinary science. Last but not least, PaNdata ODI will stimulate discussions and possibly collaborations with North American neutron and photon laboratories, which currently lack similar initiatives.