An introduction to text mining with AntLab and Voyant Tools

By MmIT Committee Member Antony Groves


Increasingly you may hear researchers, librarians and other information professionals talking about “text mining”. Although text mining is closely aligned with information retrieval, it is not always clear how we can support and engage with this work. The following post brings together a number of resources that show the value and benefits of text mining, and introduces two free tools to help you start exploring this growing area of work.

The introduction to the PLOS Text Mining Collection, a useful selection of open access research and essays relating to text mining, explains that:

“The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and revolutionize how scientists access and interpret data that might otherwise remain buried in the literature”.

An example of this is Yale University’s Robots Reading Vogue project where a huge volume of text and data (over 6 TB) has been analysed to show, amongst other things, how the use of particular words has risen and fallen over the publication’s history (the n-gram Search). At the University of Sussex there are numerous projects coming from the Text Analysis Group and the Sussex Humanities Lab exploring large corpora (collections of written text by particular authors or about particular subjects) through text mining. We have even started to run workshops in the Library introducing tools to help students who are interested in this area of research. I would like to share two of these resources here: AntLab and Voyant Tools (you can find even more in the TAPoR collection).
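
To make the idea behind an n-gram search concrete, here is a minimal, hypothetical Python sketch: given a folder of plain-text issues named by year (the file layout and example word are my own assumptions, not the Robots Reading Vogue pipeline), it tallies how often a word appears in each year's text.

    # Toy n-gram trend counter: tally a word's relative frequency across
    # plain-text files named by year, e.g. "1950.txt", "1951.txt", ...
    # (the file layout and example word are illustrative assumptions).
    import re
    from pathlib import Path

    def yearly_frequency(folder, word):
        trend = {}
        for path in sorted(Path(folder).glob("*.txt")):
            tokens = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
            trend[path.stem] = tokens.count(word.lower()) / max(len(tokens), 1)
        return trend

    # yearly_frequency("issues", "pattern") -> {"1950": 0.0012, "1951": 0.0009, ...}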

AntLab contains a number of freely available tools (although donations and patronage are welcome) built by Dr Laurence Anthony, which can be found in the Software section of his website. For the purposes of this post, I would like to highlight AntFileConverter, a tool for converting PDF and Word files into plain text for analysis, something that can also be helpful for improving accessibility. To use AntFileConverter, download and open the appropriate version for your computer, drag the file you wish to convert into the ‘Input Files’ box, and click ‘Start’. For this demonstration I have used the PDF of the first Open Access volume of the MmIT Journal:

[Screenshot: the MmIT Journal PDF in AntFileConverter’s ‘Input Files’ box]
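
If you would rather script this conversion step, something similar can be approximated in Python with the pypdf library; this is a sketch of the general approach, not how AntFileConverter works internally:

    # Convert a PDF to plain text, saving the result alongside the original
    # with ".txt" added to the name (approximating the AntFileConverter
    # workflow; this is not Dr Anthony's actual code).
    from pathlib import Path
    from pypdf import PdfReader  # pip install pypdf

    def pdf_to_txt(pdf_path):
        pdf = Path(pdf_path)
        text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
        out = pdf.with_name(pdf.name + ".txt")  # journal.pdf -> journal.pdf.txt
        out.write_text(text, encoding="utf-8")
        return out

    # pdf_to_txt("mmit_journal_vol43.pdf")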

As explained in the user support, “the converted files will be saved in the same directory as the original files with the same name but with the ‘.txt’ extension added”. This .txt file can then be used with other AntLab software, although here it will be analysed with Voyant Tools, a free “web-based reading and analysis environment for digital texts”. To do this, upload the .txt file created with AntFileConverter into the Voyant Tools box:

[Screenshot: uploading the .txt file to Voyant Tools]

Click on ‘Reveal’ to run the analysis and view the results:

[Screenshot: the default Voyant Tools view of the analysed text]

The default tools include Cirrus, Reader, Trends, Summary and Contexts, which you can learn more about in the Getting Started Guide. There are also a number of additional tools, including the TermsBerry. To use this particular tool, click on TermsBerry next to Reader above the second panel:

[Screenshot: the TermsBerry view for Volume 43 of the MmIT Journal]

The TermsBerry shows how often particular terms occur and how frequently they appear next to other terms. The TermsBerry I have shared above shows that in Volume 43 of the MmIT Journal, the words ‘library’ and ‘information’ are two of the most common (they are in larger bubbles). If you hover over one of the terms, for example ‘digital’, you will see that this word appears 121 times in the text, most commonly co-occurring with ‘literacy’ (29 times), followed by ‘skills’, ‘media’ and ‘information’: topics that should interest MmIT readers!
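
Under the hood, a TermsBerry-style view comes down to two counts: how often each term occurs, and how often pairs of terms fall within the same small window of text. Here is a simplified Python sketch of that counting (my own illustration of the idea, not Voyant's implementation):

    # Count term frequencies and within-window co-occurrences: the raw
    # numbers behind a TermsBerry-style display (a simplification, not
    # Voyant's actual algorithm).
    import re
    from collections import Counter
    from itertools import combinations

    def term_counts(text, window=5):
        tokens = re.findall(r"[a-z']+", text.lower())
        freq = Counter(tokens)
        pairs = Counter()
        for i in range(max(len(tokens) - window + 1, 1)):
            for a, b in combinations(sorted(set(tokens[i:i + window])), 2):
                pairs[(a, b)] += 1
        return freq, pairs

    # freq["digital"] -> occurrences of 'digital'
    # pairs[("digital", "literacy")] -> windows in which both appear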

To enable this mining and sharing, reforms to copyright legislation mean that copies of a work can be made for the purposes of text and data analysis (provided you have lawful access to the original work, which in this case is open access). Additionally, as explained in the ‘Sharing outputs’ section of this Jisc guide, the results of the analysis can usually be shared with anyone (although there are exceptions when the analysis goes beyond counts and ‘facts’ about the work and includes large amounts of the original copyright material). So, armed with a few tools and copyright law on our side, it’s time to make text mining yours.


Discovery AND disorder


Committee member Antony Groves from The University of Sussex writes about the issue of Discovery and how sometimes a curve ball can be thrown at you when you least expect it.

Discovery is not a straightforward process; if it were, some of us would be out of a job. However, this should not excuse unpredictable tools and searches: some obstacles are reasonable to expect and some are not. How would a 110m hurdler feel if an extra barrier were added, or if the first was moved 10ft forward? The answer is that we would only know how they felt if we asked them, or perhaps observed their next race. The focus of this post is not UX, though, but teaching: specifically, how we talk to our users about fallible discovery services.

The anomaly that has prompted this post is the re-ordering of results when inserting AND between search terms in Ex Libris Primo (as of March 8th this appears to be happening at 15 Russell Group libraries). You can test this by typing the search terms academic integrity into your discovery tool, then academic AND integrity, and comparing the two. Although the number of results stays the same, some of you will see that the order of the items changes. Predominantly this appears to be a Primo issue (although it is not happening everywhere with Primo), but Summon has its own mysteries. If you compare the above two searches in Summon, at several Russell Group libraries you will get a different number of results (although admittedly only a very slight difference).
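
One way to document the anomaly is to capture the ordered results from each query and diff them. The Python sketch below compares two ranked lists by hand; the record IDs are placeholders you would copy out of your own discovery interface, and it deliberately calls no Primo or Summon API:

    # Compare two ranked result lists, e.g. 'academic integrity' versus
    # 'academic AND integrity', and report where the ordering diverges.
    # The record IDs below are placeholders, not real Primo results.
    def compare_rankings(results_a, results_b):
        print(f"{len(results_a)} vs {len(results_b)} results")
        for rank, (a, b) in enumerate(zip(results_a, results_b), start=1):
            if a != b:
                print(f"rank {rank}: {a!r} vs {b!r}")

    compare_rankings(
        ["record-1", "record-2", "record-3"],  # academic integrity
        ["record-1", "record-3", "record-2"],  # academic AND integrity
    )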

The Association of College & Research Libraries (ACRL) Framework for Information Literacy for Higher Education establishes “Searching as Strategic Exploration” as one of its six concepts, explaining that “searching for information is often nonlinear” (ACRL 2015). However, is this intended to excuse tools that give inconsistent results, to explain that searching is an iterative process, or both? Yes, if we are teaching our users to search for resources in a strategic and systematic way, we should also be showing them the other databases we subscribe to rather than relying solely on our discovery tools, but shouldn’t those tools provide a solid foundation on which to build? If our discovery services are not as good as they can possibly be, students will very quickly turn to Google instead.

When we have noticed anomalies we have reported them to Ex Libris, who have worked to resolve them or provided an answer as to why certain things are happening. The answer to a previous irregularity was that “the results of different searches aren’t necessarily comparable in a linear relation” (Ex Libris Knowledge Center, 2017). Is this a satisfactory response, though? Within the Library we continue to user test our discovery tool (as do Ex Libris), and during our next round of testing we may find that students don’t mind these minor aberrations, or are perhaps already used to shifting results from using Google. It could be that they haven’t asked, or even noticed, but as information professionals we should be ready to help those looking for the answer. Evidently, including or excluding AND between search terms does make a difference: perhaps not to the number of results, but certainly to the way they are ordered. I cannot currently explain to users why this is happening or which set of results really is more relevant. What I can do is show them other ways of sorting and narrowing their searches. Like that first 110m hurdle, it is an obstacle that can still be cleared; I just feel I would be a better coach if I could explain why it has moved 10ft forward.

References

ACRL (2015) Framework for Information Literacy for Higher Education. Available at: http://www.ala.org/acrl/standards/ilframework#exploration (Accessed: 5 March 2017).


Ex Libris Knowledge Center (2017) Boolean searches in Primo don’t work as expected. Available at: https://knowledge.exlibrisgroup.com/Primo/Knowledge_Articles/Boolean_searches_in_Primo_doesn’t_work_as_expected (Accessed: 5 March 2017).

Search cheatsheets and guides – the 2012 list (so far)

Search tools are constantly changing and there are, let’s face it, a million ways to search for information online. There’s also a healthy debate around library search’s reliance on Boolean operators and other specialist (and often legacy) techniques. We still have a way to go before we find the right balance between simplicity and advanced techniques in web searching (and, incidentally, if you’re interested in this area I recommend Dave Pattern’s posts about the University of Huddersfield’s experiences with Summon).

I used to use various search cheatsheets in training but lost track after Google’s umpteenth search update, so I was happy to stumble upon a bunch of new guides to search engines. Like all lists on this blog, this is a work in progress and suggestions are welcome.

Daniel Russell is a research scientist at Google and recently gave a talk to a group of investigative journalists about smart Google search techniques. John Tedesco, an investigative reporter, has written these up in a handy summary on his blog.

Possibly as a result of the large amount of interest Tedesco’s post generated, Google have now announced a series of online search classes. 

And if Google is not your data-mining bag of chips, there’s also this handy guide to Duck Duck Go’s search shortcuts on Ghacks.net.

Wolfram Alpha remains a specialist search tool and I haven’t really seen a comprehensive starter guide to it in my travels. The knowledge base has a *lot* of helpful examples to refer to though. In my experience, the dictionary search results are far superior to the ones returned on Google and I’m sure there are plenty of other reasons to use it for non-statistical searches so I’m on the lookout for an introductory guide to add to the list.

www.wolframalpha.com/examples

As I mentioned above, this is a list in progress, so any new guides discovered will be added (there are plenty of search tools not yet covered). If you’ve found any guides that you’d like to share, feel free to add them in the comments.

SEO, Google and search algorithms: a quick look under the hood

There are reams (or the digital equivalent) of advice about Search Engine Optimisation (SEO) available online, but a lot of it relies on popular SEO myths and ill-advised attempts to game the system in order to boost search engine rankings. For those of us who are simply interested in improving the discoverability of websites, it’s harder to find straightforward advice without all the bogus tips and SEO myth perpetuation.

If you are delivering services online, it is helpful to keep up to date with how search engines index and present search results. The biggest factor, which no amount of trickery can avoid, is that content is king. Providing regular new content with descriptive titles is the simplest and best way to improve your search engine ranking. Another step in the right direction is to make sure that you use clean and descriptive URLs rather than the non-descriptive dynamic URLs produced by some Content Management Systems.
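
To illustrate the clean-URL point, here is a small Python sketch of the slug generation many CMSs perform, turning a post title into a descriptive path rather than an opaque query string (an illustration of the principle, not any particular CMS’s code):

    # Turn a post title into a clean, descriptive URL slug, so that a link
    # like "/?p=12345" can instead read "/seo-a-quick-look-under-the-hood".
    import re

    def slugify(title):
        return "/" + re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

    print(slugify("SEO, Google and search algorithms: a quick look under the hood"))
    # -> /seo-google-and-search-algorithms-a-quick-look-under-the-hood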

Link referral is an important but contentious area, if only because it’s open to abuse. While manipulating this by creating link farms or other dubious means will rightly get your hand slapped by the search engine, there is undeniable value in participating in the ecosystem of the web by having people link to your site.

Google occasionally rolls out updates to its search algorithm with vague names such as Caffeine (which introduced real-time search) and Panda. The Panda update, in 2011, was aimed at reducing the rankings of link and content farming sites. And apparently there are still more changes afoot.

To find out what’s happening under the hood of the Google search engine, the best place to start is right at the source with the Google Technology overview and the Webmaster Guidelines (taking these with the required grains of salt, of course).

This post has focused on Google search, but if you’re more generally interested in the technology behind search engines, have a look at the tech running other, open search platforms such as Duck Duck Go or YaCy. Or you can go back and have a look at where it all began.


Duck Duck Go teams up with WolframAlpha

Everyone’s favourite underdog search engine DuckDuckGo has officially teamed up with Wolfram|Alpha, they have just announced. DuckDuckGo already utilises the Wolfram|Alpha API, but this will mean further integration and other neat developments in the near future.
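
For a sense of what building on the Wolfram|Alpha API can look like, here is a minimal Python sketch against the public v2 query endpoint (you would need your own AppID from the Wolfram|Alpha developer site; this illustrates API use generally, not DuckDuckGo’s actual integration):

    # Minimal call to the Wolfram|Alpha v2 query API (you need your own
    # AppID; this is a generic illustration, not DuckDuckGo's
    # integration code).
    import requests  # pip install requests

    def ask_wolfram(query, app_id):
        resp = requests.get(
            "http://api.wolframalpha.com/v2/query",
            params={"input": query, "appid": app_id, "format": "plaintext"},
        )
        resp.raise_for_status()
        return resp.text  # XML; parse out the plaintext pods as needed

    # print(ask_wolfram("population of Brighton", "YOUR-APP-ID"))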

Checking your social media ranking

There are plenty of reasons to keep an eye on social media rankings, from finding out what’s being said about your organisation (or anything else, for that matter) to measuring the impact of a particular promotional campaign.

Menae is a new tool that lets you check your website’s ranking across a number of avenues. It gives you a search engine score, social media score, traffic score, social bookmarking score and blog score. It is fun to play with, though it could certainly use an ‘About’ page, and it is a new entry into a pretty crowded field. SocialScan offers something similar by checking a URL against the main social sites, including Delicious, StumbleUpon, Digg and Twitter.

For a more general overview based on keywords, username or trends, Social Mention is hard to beat. You can set up an alert to receive regular updates.

And if you’re just looking at your Twitter usage, the Twitter Reality Check is a handy tool and TweetStats generates some great graphs (magic happening!).

And for making the case, the Search Engine Journal explains why social networking is important for SEO.

Social search with Aardvark (Q&A series)

[Screenshot: choosing topics in Aardvark]

Aardvark describes itself as ‘social search’, whereby questions are answered by a person (or people) rather than a webpage. More specifically, it proposes that seeking answers from people in your ‘networks’ can be more effective, and provide better information, than relying on the ‘documents’ that search engines return.

While the platform doesn’t seem to be exactly heaving with activity since it was acquired by Google early last year, it has its loyal fans and questions get answered in good time. The real strength of Aardvark, though, lies in its instant messaging and mobile access options; it is generally an impressive but neglected platform.

Aardvark is a more flexible Q&A service than some of the other main players. While the website vark.com is a good starting point, it soon becomes clear that the main emphasis of Aardvark is on mobile access (via the iPhone app) and integration with instant messaging.
