Earth science communities need to rely on access to large and growing amounts of curated data to make progress in research and provide adequate education. Existing infrastructures pose significant barriers to this access, especially for small to mid-size research groups and primarily undergraduate institutions: cloud services disappear when funding runs out to pay for them and therefore do not provide the long-term availability required for curated data. Similarly, in-house IT infrastructure is maintenance-intensive and requires dedicated resources for which long-term funding is often unavailable. The goal of the Big Weather Web is to make in-house IT infrastructure affordable by combining the application of three recent technologies: virtualization, federated smart storage, and big data management. Virtualization allows push-button deployment and maintenance of complex systems, smart storage provides automatic, community-wide data availability guarantees, and big data management allows for easy curation of data and its products. The combination of these three technologies allows communities to create a standard community-specific computational environment and efficiently refine it with minimal repetition of work, introducing a high degree of reproducibility in research and education. This reproducibility accelerates learning and amplifies everyone?s contribution. Due to virtualization, it can easily take advantage of cloud services whenever they become available, and it can run on in-house IT infrastructure using significantly reduced maintenance resources. The Big Weather Web will be developed in the context of numerical weather prediction with the expectation that the resulting infrastructure can be easily adapted to other data-intensive scientific communities.
The volume, variety, and velocity of scientific data generated is growing exponentially. Small to mid-size research groups and especially primarily undergraduate institutions (PUIs) do not have the resources to manage large amounts of data locally and share their data products globally at high availability. This lack of resources has a number of consequences in education and research that have been well-documented in recent EarthCube workshops: (1) data-intensive scientific results are not easily reproducible, whether in the context of research or education, (2) limited or non-existent availability of intermediate results causes a lot of unnecessary duplication of work and makes learning curves unnecessarily steep, and consequently (3) scientific communities of practice are falling behind technological innovations. This Big Weather Web project focuses on the numerical weather prediction community. Numerical weather models produce terabytes of output per day, comprising a wealth of information that can be used for research and education, but this amount of data is difficult to transfer, store, or analyze for most universities. The Big Weather Web addresses this situation with the design, implementation, and deployment of 'nuclei,' which are shared artifacts that enable reliable and efficient access and sharing of data, encode best practices, and are sustainably maintained and improved by the community. These nuclei use existing and well-established technologies, but the integration of these technologies will significantly reduce the resource burden mentioned above. Nucleus 1 is a large ensemble distributed over seven universities. Nucleus 2 is a common storage, linking, and cataloging methodology implemented as an appliance-like Data Investigation and Sharing Environment (DISE) that is extremely easy to maintain and that automatically ensures data availability and safety. Nucleus 3 is a versioned virtualization and container technology for easy deployment and reproducibility of computational environments. Together, these nuclei will advance discovery and understanding through sharing of data products and methods to replicate scientific results while promoting teaching, training, and learning by creating a shared environment for scientific communities of practice. These shared environments are particularly important for underrepresented groups who otherwise have limited access to knowledge that is primarily propagated by social means. Our approach is a significant step towards improving reproducibility in the complex computational environments found in many scientific communities.
|Effective start/end date
|8/1/15 → 7/31/18
- National Science Foundation: $99,953.00