Cockrell School of Engineering
The University of Texas at Austin

UT PGE assistant professor Maša Prodanović and a team of leading UT Austin scientists are looking to change the way researchers distribute data.

On Sept. 1, 2015, Prodanović, Dr. Maria Esteva (Texas Advanced Computing Center) and Dr. Richard Ketcham (Jackson School of Geological Sciences) received a two-year, $600,000 National Science Foundation (NSF) grant to build a Digital Rocks Portal utilizing the latest technologies in data storage.

Recent advances in high-resolution imaging techniques have provided a wealth of 3D datasets that reveal the microstructure of rocks and soil, which in turn serve as the basis for sophisticated computer modeling of fluids moving through pore networks. This emerging research can inform important decisions in petroleum, environmental and civil engineering while addressing key geological questions. However, there is not yet a large, highly organized platform for sharing and downloading this valuable data.

“This grant is a part of a larger NSF project, the EarthCube initiative, which aims to create a strong infrastructure for pulling all available earth system data together to make it more easily accessible and useable,” said Ketcham.

Modern 3D datasets of pore networks are typically several gigabytes in size, leading to significant challenges for researchers seeking to store and share them. There is also a lack of standardization for characterizing image types and associated information. Even when data sets are made available, they typically live online for only a matter of months before they are cleared due to space issues. This impedes scientific cross-validation of the simulation approaches and limits the development of studies that span length scales from a micrometer (a millionth of a meter, the size of individual pores and grains making up a rock) to a kilometer (the level of a petroleum reservoir, geological basin or aquifer).

“The current setup for data management has a lot of friction,” said Prodanović. “Friction being, ‘OK, I have to figure out how to post 25 gigabytes of data’ – it’s not something you can email someone.”

Photo: Dr. Maša Prodanović, seated, with a stack of books.

Downloading and uploading large data sets is just one piece of the portal. Another goal is developing a social aspect by visually presenting the information, encouraging communication and interaction among scientists. Prodanović says she thinks of the interface as a “Dropbox meets Facebook.”

Each of the three groups involved brings a different set of skills, ensuring expert knowledge is applied to all aspects of the portal.

“I’m interested in the data and modeling it,” said Prodanović. “Richard is on the production end because his lab outputs large amounts of data and his research is analyzing the data. Maria is interested in the information science aspect of development of the web-based portal: organizing a large datasets platform so that it is easy to search and researchers are inclined to use it.”

The Texas Advanced Computing Center is a natural home for the large data sets since it already has the infrastructure in place, and the CT lab in the Jackson School, which Ketcham directs, is one of the largest academic imaging facilities in the nation.

Another benefit of the portal is that it will let researchers track how their data is downloaded and used. When researchers want to know how many times their papers have been viewed or cited they can use Google Scholar, but there is not a reliable tool for tracking data. Because the portal will assign a digital object identifier (DOI) to each data set, researchers will have a means of seeing where their data has been reused.

The official project launched Sept. 1, but Prodanović and Esteva have been working periodically on a portal for about two years.

“The first mistakes have already been made, so we have a good prototype,” said Prodanović. “Within two years we will have the ability to make it operational at a high level, particularly in terms of speed.”

Once the project is complete, the hope is that federal rules will evolve to ensure that data sets are shared for the benefit of the entire scientific community. Agencies such as the U.S. Department of Energy (DOE) and NSF currently require data management plans for all projects, but have not mandated distribution of data due to the lack of infrastructure. “This is paving the way for a formalized management plan for this type of data,” said Ketcham. “The next big thing is people demanding public repositories.”