“We realized this is a huge challenge,” Jia says. “It’s an important issue to address because if people like data journalists cannot get data that they want, they cannot create quality news stories. And that’s really a hurdle for a well-informed society.”
The team is taking an approach that combines technical improvements and user perspectives to enhance dataset search. While Davison and Heflin’s work focuses on schema label generation and user interface design, Jia is studying different use cases and interviewing data journalists, scholars, professors, graduate students and librarians to compile as many scenarios as possible. That information will shape the design of a search tool prototype. Once the prototype is complete, the team will test it to see whether it improves on the current process.
“We have to really think about the characteristics of datasets and see what factors are most important when we try to decide the relevance of the search results for users,” Jia says. “How can we make a better ranking of the results?”
Most importantly, Jia says, they have to address the issues of indexing search results. Dataset searches tend to be very specific. She uses the example of a data journalist trying to find how Bethlehem, Pa., residents of a specific gender, in a specific part of the city, aged 25 to 35, voted in an election. The journalist might be able to find the voting data divided by ward, she says, but to find out whether the data includes the voters’ actual ages, they would have to open the dataset and look at each cell.
“When you are putting all these justifications together, it’s hard to determine if the dataset actually contains what you’re looking for by just looking at the description or the title itself,” Jia says. “It’s a ton of work to find that dataset, download it to clean it up and then decide if it’s even usable. So can we, maybe at the search level, give users a sneak peek and understand whether the dataset is actually something useful for their project or their research?”
Or, she says, users may be looking for a “Facebook for datasets.” Beyond getting highly relevant search results, the team found in its latest survey study, users were interested in communicating with data creators and data providers, as well as other users, to learn more about the datasets and what stories or insights have been generated using them.