Text Mining for Scholarly Communications and Repositories

12:50 pm in Uncategorized by Josh Brown

On the 28th and 29th of October, I attended a joint workshop, organised by the National Centre for Text Mining (NaCTeM) and UKOLN in Manchester, on Text Mining for Scholarly Communications and Repositories.

Text mining achieves something which I think is quite unusual. It manages to be both fiendishly technical (operating somewhere between computer science, linguistics and information retrieval) and absolutely fascinating to the non-specialist (in this case, me). When you see what it can do, and get a sense of what the future could have in store for the technology, it gets very interesting and exciting indeed.

The aim of this two-day workshop was to showcase a few of the clever ways that text mining is already being used, and to sketch a possible future in which text mining tools are deployed in all sorts of ways across scholarly communications. The talks were, taken individually, occasionally baffling but in combination gave some fascinating insights into what text mining can do for researchers, librarians and publishers.

There were demonstrations of the work being done at the European Bioinformatics Institute/UKPMC and the Royal Society of Chemistry that showed the potential of text mining applications to find useful scientific information buried in the literature. For example, some of the most useful information for chemists is about what didn’t work: if you can be sure that a given set of reactions won’t work, you can save a lot of time and money by not bothering to repeat them. This information is often very difficult to find in the literature, but a computer can sift through the text of many thousands of reports and datasets for you and surface it in a fraction of the time.

A repository-specific set of applications, UKOLN’s FixRep project and Intute Repository Search, demonstrated that text mining is set to change the way we handle our metadata. FixRep scans the full text and uses the information in it to complete metadata fields, which is an obvious benefit. Intute Repository Search complements metadata by extracting keywords from the full text content of repositories and exposing them for retrieval alongside more formal metadata. Our own MERLIN project takes the idea a step further: it extracts keywords from full text and uses them to automatically create a subject tree, opening up new relations between items for searchers. Taken together, these projects give a vision of just how sophisticated text mining-based automatic metadata generation, usage and mapping could become, and how powerful the new tools it offers will be.
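
To make the keyword idea a little more concrete, here is a minimal Python sketch of filling an empty metadata field from full text. It is purely illustrative and is not how FixRep or Intute Repository Search actually work: the function names, the `keywords` field, the tiny stopword list and the simple word-frequency ranking are all my own assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use far larger ones).
STOPWORDS = {
    "the", "of", "and", "a", "in", "to", "is", "for", "on", "by",
    "with", "that", "this", "are", "as", "be", "from",
}


def extract_keywords(full_text, top_n=5):
    """Return the top_n most frequent non-stopword terms in full_text."""
    words = re.findall(r"[a-z]+", full_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(top_n)]


def complete_metadata(record, full_text, top_n=5):
    """Fill an empty 'keywords' field from the full text.

    Hypothetical record layout: a plain dict. Existing keywords are left alone.
    """
    completed = dict(record)
    if not completed.get("keywords"):
        completed["keywords"] = extract_keywords(full_text, top_n)
    return completed
```

Calling `complete_metadata({"title": "Untitled", "keywords": []}, full_text)` would return a copy of the record with the most frequent terms filled in. A real system would presumably use something much smarter than raw counts, but the shape of the task is the same.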

Examples of text mining applications being used by real-life researchers provided a fascinating insight into the ways in which the technologies can help researchers cope efficiently with what is now literature overload. There were demonstrations from the Institute of Education of how text mining can make systematic reviews quicker and more effective by pre-digesting a huge number of papers and drawing out the relevant ones for reviewers to use, and from Cambridge of how text mining can expand citation analysis by enabling researchers to separate out positive and negative citations.
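
As a rough illustration of what separating positive from negative citations might involve, here is a toy Python sketch that labels a citing sentence by counting cue phrases. This is not the Cambridge approach, which was not described in that level of detail; the cue lists and the `classify_citation` function are hypothetical and far cruder than anything a real system would use.

```python
# Hypothetical cue phrases (assumptions chosen for illustration only).
POSITIVE_CUES = ("confirms", "supports", "consistent with", "builds on")
NEGATIVE_CUES = ("contradicts", "fails to", "in contrast to", "could not reproduce")


def classify_citation(sentence):
    """Label a citing sentence as 'positive', 'negative' or 'neutral'."""
    text = sentence.lower()
    # Count how many cue phrases of each kind appear in the sentence.
    positive_hits = sum(cue in text for cue in POSITIVE_CUES)
    negative_hits = sum(cue in text for cue in NEGATIVE_CUES)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"
```

Run over every sentence that cites a given paper, a classifier like this would let a citation count be split into approving and critical mentions rather than reported as a single number.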

Tony Hey from Microsoft External Research and Sidi Rafael from Elsevier gave us some ideas about how current technology could allow the tools we had been shown to take off in a way that will change the landscape of scholarly communication dramatically. Tony Hey emphasised the raw computing power available via cloud computing services, enabling vastly power-hungry calculations and processes to be undertaken much more cheaply at a huge data farm. He argued that today’s research has so much data at its disposal that we are entering a new paradigm which is computationally intensive and involves sifting and combining huge data sets as a core activity.

Sidi Rafael emphasised the way in which more and more big companies are opening up their development process to create new products using the ingenuity of their users. Elsevier will be opening up platforms for opportunistic developers to create new tools, a development that promises rapid evolution of the services available. Taken together, this vision of huge computational power and rapidly evolving services suggests that the power and potential of text mining tools are about to explode in all sorts of ways.

The workshop was rounded off with what is usually politely termed a ‘lively discussion’, in which some of the legal issues that remain to be addressed (open access, copyright, re-use and so on) were clearly named as barriers to the effective exploitation of text mining tools. Those issues aside, the tone was overwhelmingly optimistic about the future of text mining as a new and emerging engine for scholarly communication.