An introduction to text mining with AntLab and Voyant Tools

By MmIT Committee Member Antony Groves


Increasingly you may hear researchers, librarians and other information professionals talk about “text mining”. Although this process is closely aligned with information retrieval, it is not always clear how we can support and engage with it. The following post brings together a number of resources that show the value and benefits of text mining, and introduces two free tools to help you start exploring this growing area of work.

The introduction to the PLOS Text Mining Collection, a useful selection of open access research and essays relating to text mining, explains that:

“The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and revolutionize how scientists access and interpret data that might otherwise remain buried in the literature”.

An example of this is Yale University’s Robots Reading Vogue project where a huge volume of text and data (over 6 TB) has been analysed to show, amongst other things, how the use of particular words has risen and fallen over the publication’s history (the n-gram Search). At the University of Sussex there are numerous projects coming from the Text Analysis Group and the Sussex Humanities Lab exploring large corpora (collections of written text by particular authors or about particular subjects) through text mining. We have even started to run workshops in the Library introducing tools to help students who are interested in this area of research. I would like to share two of these resources here: AntLab and Voyant Tools (you can find even more in the TAPoR collection).

AntLab contains a number of freely available tools (although donations and patronage are welcome) built by Dr Laurence Anthony, which can be found in the Software section of his website. For the purpose of this post, I would like to highlight AntFileConverter, a tool for converting PDF and Word files into plain text for analysis – something that can also be helpful for improving accessibility. To use AntFileConverter, download and open the version appropriate for your operating system, drag the file you wish to convert into the ‘Input Files’ box, and click ‘Start’. For this demonstration I have used the PDF of the first Open Access volume of the MmIT Journal:

[Screenshot: AntFileConverter with the MmIT Journal PDF added to the ‘Input Files’ box]
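If you prefer to script this conversion step, the extraction can be approximated in a few lines of Python using the open-source pypdf library. This is only a sketch under my own assumptions – the file name is hypothetical, and AntFileConverter’s own extraction may handle tricky PDFs differently:

```python
# A minimal sketch of PDF-to-text conversion using pypdf
# (pip install pypdf). The file name is hypothetical, and
# AntFileConverter's own extraction may behave differently.
from pypdf import PdfReader

reader = PdfReader("mmit_journal_vol43.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Mirror AntFileConverter's behaviour: save alongside the original
# with a ".txt" extension added.
with open("mmit_journal_vol43.pdf.txt", "w", encoding="utf-8") as out:
    out.write(text)
```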

As explained in the user support, “the converted files will be saved in the same directory as the original files with the same name but with the “.txt” extension added”. This .txt file can then be used with other AntLab software, although here it will be analysed with Voyant Tools, a free “web-based reading and analysis environment for digital texts”. To do this, upload the .txt file created with AntFileConverter into the Voyant Tools box:

[Screenshot: the Voyant Tools home page with the .txt file uploaded]

Click on ‘Reveal’ to run the analysis and view the results:

[Screenshot: the default Voyant Tools analysis view]

The default tools include Cirrus, Reader, Trends, Summary and Contexts, which you can learn more about in the Getting Started Guide. There are also a number of additional tools, including the TermsBerry. To use this particular tool, click on TermsBerry next to Reader above the second panel:

[Screenshot: the TermsBerry tool]

The TermsBerry shows how often particular terms occur and how frequently they appear next to other terms. The TermsBerry I have shared above shows that in Volume 43 of the MmIT Journal, the words ‘library’ and ‘information’ are two of the most common (they appear in the larger bubbles). If you hover over one of the terms, for example ‘digital’, you will see that this word appears 121 times in the text, most commonly co-occurring with ‘literacy’ (29 times), followed by ‘skills’, ‘media’ and ‘information’ – topics that should interest MmIT readers!
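For readers curious about what the TermsBerry is counting, the Python sketch below approximates it: term frequencies plus co-occurrences within a small sliding window. The simple tokenisation and five-word window are my own assumptions rather than Voyant’s actual algorithm, and the file name is the hypothetical one from the conversion step above:

```python
# A rough approximation of the TermsBerry's counts: term frequency
# and co-occurrence within a sliding window. Lowercase alphabetic
# tokens and a five-word window are assumptions; Voyant's own
# tokenisation and windowing will differ.
import re
from collections import Counter
from itertools import combinations

with open("mmit_journal_vol43.pdf.txt", encoding="utf-8") as fh:
    tokens = re.findall(r"[a-z]+", fh.read().lower())

frequencies = Counter(tokens)
cooccurrences = Counter()
WINDOW = 5
for i in range(len(tokens) - WINDOW + 1):
    window = set(tokens[i:i + WINDOW])  # distinct terms in this window
    cooccurrences.update(combinations(sorted(window), 2))

print(frequencies.most_common(10))  # cf. the largest TermsBerry bubbles
# Pairs involving 'digital', roughly analogous to hovering over it:
print([(pair, n) for pair, n in cooccurrences.most_common(200)
       if "digital" in pair][:5])
```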

Reforms to UK copyright legislation enable this mining and sharing: copies of a work can now be made for the purposes of text and data analysis, provided you have lawful access to the original work (which in this case is open access). Additionally, as explained in the ‘Sharing outputs’ section of this Jisc guide, the results of the analysis can usually be shared with anyone (although there are exceptions when the analysis goes beyond counts and ‘facts’ about the work and includes large amounts of the original copyright material). So armed with a few tools, and with copyright law on our side, it’s time to make text mining yours.


Discovery AND disorder


Committee member Antony Groves from the University of Sussex writes about the issue of Discovery, and how sometimes a curveball can be thrown at you when you least expect it.

Discovery is not a straightforward process; if it were, some of us would be out of a job. However, this should not excuse unpredictable tools and searches: some obstacles are reasonable to expect and some are not. How would a 110m hurdler feel if an extra barrier were added, or if the first was moved 10ft forward? We would only know how they felt if we asked them, or perhaps observed their next race. The focus of this post is not UX, though, but teaching: specifically, how we talk to our users about fallible discovery services.

The anomaly that has prompted this post is the re-ordering of results when inserting AND between search terms in Ex Libris Primo (as of March 8th this appears to be happening at 15 Russell Group libraries). You can test this by typing the search terms academic integrity into your discovery tool, then academic AND integrity, and comparing the two. Although the number of results stays the same, some of you will see that the order of the items changes. Predominantly this appears to be a Primo issue (although it is not happening everywhere with Primo), but Summon has its own mysteries. If you compare the above two searches in Summon, at several Russell Group libraries you will get a different number of results (although admittedly only a very slight difference).
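If you want to check this at your own institution, one quick approach is to paste the top few records from each search into a small script and see which positions change. A minimal Python sketch, with hypothetical record IDs standing in for real results:

```python
# Compare two ranked result lists and report items whose position
# changed. The record IDs are hypothetical placeholders for the top
# results of 'academic integrity' vs 'academic AND integrity'.
def ranking_changes(without_and, with_and):
    positions = {item: i for i, item in enumerate(with_and)}
    return [(item, i, positions[item])
            for i, item in enumerate(without_and)
            if item in positions and positions[item] != i]

results_plain = ["rec_a", "rec_b", "rec_c", "rec_d"]  # academic integrity
results_and = ["rec_a", "rec_c", "rec_b", "rec_d"]    # academic AND integrity

for item, before, after in ranking_changes(results_plain, results_and):
    print(f"{item}: position {before} -> {after}")
```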

The Association of College & Research Libraries (ACRL) Framework for Information Literacy for Higher Education establishes “Searching as Strategic Exploration” as one of its six concepts, explaining that “searching for information is often nonlinear” (ACRL, 2015). However, is this intended to excuse tools giving inconsistent results, or to explain that searching is an iterative process, or both? Yes, if we’re teaching our users to search for resources in a strategic and systematic way we should also be showing them the other databases we subscribe to, not solely relying on our discovery tools, but shouldn’t those tools provide a solid foundation on which to build? If our discovery services are not as good as they can possibly be, students will very quickly turn to Google instead.

When we have noticed anomalies we have reported them to Ex Libris, who have worked to resolve them or provided an answer as to why certain things are happening. The answer to a previous irregularity was that “the results of different searches aren’t necessarily comparable in a linear relation” (Ex Libris Knowledge Center, 2017). Is this a satisfactory response, though? Within the Library we continue to user test our discovery tool (as do Ex Libris), and during our next round of testing we may find that students don’t mind these minor aberrations, or are perhaps already used to shifting results from using Google. It could be that they haven’t asked, or even noticed, but as information professionals we should be ready to help those looking for the answer. Evidently, including or excluding AND between search terms does make a difference: perhaps not to the number of results, but certainly to the way they are ordered. I cannot currently explain to users why this is happening, or which set of results really is more relevant. What I can do is show them other ways of sorting and narrowing their searches. Like that first 110m hurdle, it is an obstacle that can still be cleared; I just feel I would be a better coach if I could explain why it’s moved 10ft forward.

References

ACRL (2015) Framework for Information Literacy for Higher Education. Available at: http://www.ala.org/acrl/standards/ilframework#exploration (Accessed: 5 March 2017).


Ex Libris Knowledge Center (2017) Boolean searches in Primo don’t work as expected. Available at: https://knowledge.exlibrisgroup.com/Primo/Knowledge_Articles/Boolean_searches_in_Primo_doesn’t_work_as_expected (Accessed: 5 March 2017).

A note from the chair – Cloudbusting

A note from the chair on ‘Cloud busting: demystifying ‘the Cloud’ and its impact on libraries’

With just under two weeks to go until our national ‘Cloudbusting’ conference, it’s safe to say MmIT is getting quite excited. With such a rich programme and so many excellent speakers, it looks set to be a truly great conference. We hope you are able to join us in Sheffield on April 5th, as there are still a few places left.

The concept of ‘the Cloud’ has been around for several years. Over that time the term has become ubiquitous, with a general acceptance that ‘the Cloud’ has a definite impact on the way in which we use computers and information technology, and on how individuals interact with information. Cloud computing is widely regarded as a way for organisations to simplify processes and save money, and as a result many of the benefits associated with the ‘Cloud’ have centred on the efficiency and effectiveness of services.

Many services that libraries have traditionally offered have been migrated to cloud solutions. For example, OpenURL providers and federated and pre-indexed search engines allow users to search all of a library’s collections through a single search box. Discovery layers such as Serials Solutions’ Summon, EBSCO’s EDS or Ex Libris’s Primo Central provide access to all of a library’s collections, not simply those found on the library catalogue. Such discovery layers can provide enhanced services such as access to special collections, digital collections and institutional repositories.
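As a concrete illustration of the OpenURL piece of this picture, the sketch below builds a standard OpenURL 1.0 (Z39.88-2004) link in Python. The resolver base URL and the citation details are hypothetical:

```python
# A minimal sketch of building an OpenURL 1.0 (Z39.88-2004) link.
# The resolver base URL and citation details are hypothetical.
from urllib.parse import urlencode

RESOLVER = "https://resolver.example.ac.uk/openurl"  # hypothetical
params = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.jtitle": "Multimedia Information and Technology",
    "rft.volume": "43",
    "rft.date": "2017",
}
print(f"{RESOLVER}?{urlencode(params)}")
```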

Similarly, the ‘Cloud’ allows libraries to share data about their collections and the bibliographic management activities that they are engaged in. This includes licensing data, common vendor files, serials publications patterns and MARC records.
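To give a flavour of what working with such shared records looks like programmatically, here is a short Python sketch using the pymarc library; the file name is hypothetical and would be a batch of records downloaded from a shared service:

```python
# A minimal sketch of reading a batch of shared MARC records with
# pymarc (pip install pymarc). The file name is hypothetical.
from pymarc import MARCReader

with open("shared_records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        # MARC field 245 is the title statement; subfield $a is the title
        for field in record.get_fields("245"):
            for title in field.get_subfields("a"):
                print(title)
```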

Add to this entire systems hosted in the Cloud, such as the Koha and Ex Libris Alma library management systems, or reference and citation management systems such as Mendeley, and it is easy to see the impact that the Cloud has on libraries, and indeed vice versa.

Even simple initiatives such as collaborative working through Google Docs, enabling a library community through Facebook or storing photographic collections in Flickr are all examples of how the Cloud has become part of the day to day computing and technology activity of the library.

MmIT strives to raise awareness amongst library and information professionals of current trends and topics in library and information technology. ‘Cloud’ initiatives and innovations, and how they are currently being used within the sector, will be of interest to many librarians and information professionals who may not even realise the wealth of ‘Cloud’ activities and solutions available to them. The conference includes a series of workshops, each one focusing on a particular ‘Cloud’ initiative, covering topics such as implementing open-source library management systems; how libraries can make the most of mobile devices to access cloud resources; creating media-rich e-book resources; implications for research data management; copyright and licensing issues associated with the ‘Cloud’; and much more. The keynote presentation will be given by Karen Blakeman and will focus on search and discovery within the ‘Cloud’. The conference will also include a series of rapid-fire technical innovation presentations and a panel question-and-answer session.

For further information please see the MmIT Events pages:
http://www.cilip.org.uk/get-involved/special-interest-groups/multimedia/events/pages/default.aspx

Search cheatsheets and guides – the 2012 list (so far)

Search tools are constantly changing and there are, let’s face it, a million ways to search for information online. There’s also a healthy debate around library search’s reliance on Boolean operators and other specialist (and often legacy) techniques. We still have a way to go until we’ve found the perfect balance between simplicity and advanced techniques in web searching (and, incidentally, if you’re interested in this area I recommend Dave Pattern’s posts about the University of Huddersfield’s experiences with Summon).

I used to use various search cheatsheets in training, but lost track after Google’s umpteenth search update, so I was happy to stumble upon a bunch of new guides to search engines. Like all lists on this blog, this is a work in progress and suggestions are welcome.

Daniel Russell is a research scientist at Google who recently gave a talk to a group of investigative journalists about smart Google search techniques. John Tedesco, an investigative reporter, has written these up in a handy summary on his blog.

Possibly as a result of the large amount of interest Tedesco’s post generated, Google have now announced a series of online search classes. 

And if Google is not your data-mining bag of chips, there’s also this handy guide to Duck Duck Go’s search shortcuts on Ghacks.net.

Wolfram Alpha remains a specialist search tool and I haven’t really seen a comprehensive starter guide to it in my travels. The knowledge base has a *lot* of helpful examples to refer to though. In my experience, the dictionary search results are far superior to the ones returned on Google, and I’m sure there are plenty of other reasons to use it for non-statistical searches, so I’m on the lookout for an introductory guide to add to the list.

www.wolframalpha.com/examples

As I mentioned above, this is a list in progress, so any new guides discovered will be added (there are plenty of search tools not yet covered). If you’ve found any guides that you’d like to share, feel free to add them in the comments.

SEO, Google and search algorithms: a quick look under the hood

There are reams (or the digital equivalent) of advice about Search Engine Optimisation (SEO) available online, but a lot of it relies on popular SEO myths and ill-advised attempts to game the system in order to boost search engine rankings. For those of us who are simply interested in improving the discoverability of our websites, it’s harder to find straightforward advice without all the bogus tips and SEO myth perpetuation.

If you are delivering services online, it is helpful to keep up to date with how search engines index and present search results. The biggest factor, which no amount of trickery can get around, is that content is king: providing regular new content with descriptive titles is the simplest and best way to improve your search engine ranking. Another step in the right direction is to make sure that you use clean and descriptive URLs rather than the non-descriptive dynamic URLs produced by some Content Management Systems.
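To make the clean-URL point concrete, the small Python sketch below turns a page title into a descriptive slug. This is a generic approach, not tied to any particular Content Management System:

```python
# Turn a page title into a clean, descriptive URL slug, rather than a
# dynamic URL such as /index.php?id=4982&cat=17 (a made-up example).
import re

def slugify(title: str) -> str:
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # non-alphanumerics become hyphens
    return slug.strip("-")

print(slugify("SEO, Google and search algorithms: a quick look under the hood"))
# -> seo-google-and-search-algorithms-a-quick-look-under-the-hood
```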

Link referrals are an important but contentious area, if only because they are open to abuse. While manipulating them by creating link farms or through other dubious means will rightly get your hand slapped by the search engine, there is undeniable value in participating in the ecosystem of the web by having people link to your site.

Google occasionally rolls out updates to its search algorithm with vague names such as Caffeine (which introduced real-time search) and Panda. The Panda update, in 2011, was aimed at reducing the rankings of link and content farming sites. And apparently there are still more changes afoot.

To find out what’s happening under the hood of the Google search engine, the best place to start is right at the source with the Google Technology overview and the Webmaster Guidelines (taking these with the required grains of salt, of course).

This post has focused on Google search, but if you’re more generally interested in the technology behind search engines, have a look at the tech running other open search platforms such as Duck Duck Go or YaCy. Or you can go back and have a look at where it all began.


The story so far: Google+

June 28th 2011 was a big day in search engine and social media land, seeing the launch of Google+ (pronounced ‘Googleplus’ or ‘Googleplussed’?). Well, ‘launch’ is perhaps the wrong word, with only a small number of early testers having access to it; the general launch date apparently “won’t be long” (https://plus.google.com/). Essentially, Google+ may be seen as the introduction of social networking elements into the ubiquitous Google search interface… why use a search engine and a social networking site when you can do both at the same time? Google+ allows users to log into the Google environment and personalise it as usual, with the addition of a live and customisable newsfeed stream called ‘Sparks’ and a way of putting contacts into groups for social networking known as ‘Circles’. ‘Hangouts’ allow a small group of contacts (up to 10) to link up for a webcast session, and a ‘Mobile’ element most notably allows group instant messaging chats. For a fuller description of features, check out the official Google blog at http://googleblog.blogspot.com/2011/06/introducing-google-project-real-life.html.

Overall, the jury is currently split. Clearly Google is trying to take on Facebook with this venture, with the aim of drawing all users into one information finding and sharing tool. This is not lost on a great many commentators (cf. xkcd’s rather amusing strip), and it’s true that most people are focusing on the looming assault on Facebook. With high-profile failures in the form of Google Buzz and Google Wave, Google really needs to do well with this product, though the project is not an off-the-cuff venture and has been in development for some time (cf. this very positive review from Wired). But it seems to be trying to do more… certainly one can see the appeal of a tool which makes it easy to search and share, and the addition of web-conferencing and mobile tools is a powerful incentive to try it. There are downsides with the current version (read Phil Bradley’s blog post, which highlights the confusion about the ‘+1’ function for web links, which doesn’t seem to act like a ‘like’ button on Facebook), but it’s too early to tell what will happen. Perhaps the big question many will keep asking is ‘Would it replace Facebook?’, though, speaking personally, the author of this post would be tempted to try it in a workplace setting before deciding whether or not to shift lock, stock and barrel to Google+. Certainly this has the potential to be far more than ‘just another social networking tool’.

Will you be planning to use Google+? Join the debate below!


Duck Duck Go teams up with WolframAlpha

Everyone’s favourite underdog search engine, DuckDuckGo, has officially teamed up with Wolfram|Alpha, it has just been announced. DuckDuckGo already utilises the Wolfram|Alpha API, but this will mean further integration and other neat developments in the near future.
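For the curious, the Wolfram|Alpha API that DuckDuckGo draws on can also be called directly. Below is a minimal Python sketch using the v2 query endpoint, assuming the requests library is installed and with a placeholder app ID (you would register for your own):

```python
# A minimal sketch of calling the Wolfram|Alpha v2 query API with the
# requests library (pip install requests). The app ID is a placeholder.
import requests

response = requests.get(
    "https://api.wolframalpha.com/v2/query",
    params={
        "input": "population of Sheffield",
        "appid": "YOUR-APP-ID",  # placeholder; use your own app ID
        "format": "plaintext",
    },
)
print(response.text)  # results are returned as XML by default
```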