An introduction to text mining with AntLab and Voyant Tools

By MmIT Committee Member Antony Groves

Increasingly you may hear researchers, librarians and other information professionals talk about “text mining”. Although this is a process aligned with information retrieval, it is not always clear how we can support and engage with these related activities. The following post brings together a number of resources that show the value and benefits of text mining, and introduces two free tools to help you start exploring this growing area of work.

The introduction to the PLOS Text Mining Collection, a useful selection of open access research and essays relating to text mining, explains that:

“The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and revolutionize how scientists access and interpret data that might otherwise remain buried in the literature”.

An example of this is Yale University’s Robots Reading Vogue project where a huge volume of text and data (over 6 TB) has been analysed to show, amongst other things, how the use of particular words has risen and fallen over the publication’s history (the n-gram Search). At the University of Sussex there are numerous projects coming from the Text Analysis Group and the Sussex Humanities Lab exploring large corpora (collections of written text by particular authors or about particular subjects) through text mining. We have even started to run workshops in the Library introducing tools to help students who are interested in this area of research. I would like to share two of these resources here: AntLab and Voyant Tools (you can find even more in the TAPoR collection).

AntLab contains a number of freely available tools (although donations and patronage are welcome) built by Dr Laurence Anthony, which can be found in the Software section of his website. For the purpose of this post, I would like to highlight AntFileConverter, a tool for converting PDF and Word files into plain text for analysis – something that can also be helpful for improving accessibility. To use AntFileConverter, download and open the appropriate software version for your computer, drag the file you wish to convert into the ‘Input Files’ box, and click ‘Start’. For this demonstration I have used the PDF of the first Open Access volume of the MmIT Journal:

[Image 1]

As explained in the user support, “the converted files will be saved in the same directory as the original files with the same name but with the “.txt” extension added”. This .txt file can then be used with other AntLab software, although here it will be analysed with Voyant Tools, a free “web-based reading and analysis environment for digital texts”. To do this, upload the .txt file created with AntFileConverter into the Voyant Tools box:

[Image 2]

Click on ‘Reveal’ to run the analysis and view the results:

[Image 3]

The default tools include Cirrus, Reader, Trends, Summary and Contexts, which you can learn more about in the Getting Started Guide. There are also a number of additional tools, including the TermsBerry. To use this particular tool, click on TermsBerry next to Reader above the second panel:

[Image 4]

The TermsBerry shows how often particular terms occur and how frequently they appear next to other terms. The TermsBerry I have shared above shows that in Volume 43 of the MmIT Journal, the words ‘library’ and ‘information’ are two of the most common (they are in larger bubbles). If you hover over one of the terms, for example ‘digital’, you will see that this word appears 121 times in the text, most commonly co-occurring with ‘literacy’ (29 times), followed by ‘skills’, ‘media’ and ‘information’; topics that should interest MmIT readers!
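For anyone curious about what sits behind a view like this, the two counts described above (how often a term occurs, and which words appear near it) can be sketched in a few lines of Python. This is a minimal illustration and not how Voyant Tools itself is implemented: the function names, the sample sentence and the two-word window are all assumptions made for the example, and no stopwords are removed.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


def cooccurrences(tokens, term, window=2):
    """Count the words found within `window` tokens either side of `term`.

    The window size is an assumption for this sketch, not a Voyant setting.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != term:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Made-up sample text, not taken from the MmIT Journal.
sample = (
    "Digital literacy matters. Digital skills and digital literacy "
    "support information work."
)
tokens = tokenize(sample)

print(term_counts(tokens).most_common(3))
print(cooccurrences(tokens, "digital").most_common(3))
```

To try this on a real document, the sample string could be replaced with the contents of a .txt file such as the one produced by AntFileConverter.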

To enable this mining and sharing, reforms to copyright legislation mean that copies of a work can be made for the purposes of text and data analysis (providing you have lawful access to the original work, which in this case is open access). Additionally, as explained in the ‘Sharing outputs’ section of this Jisc guide, the results of the analysis can usually be shared with anyone (although there are exceptions to this when the analysis goes beyond counts and ‘facts’ about the work, and includes large amounts of the original copyright material). So, armed with a few tools, and with copyright law on our side, it’s time to make text mining yours.

Preparing for #GDPR

The new General Data Protection Regulation (GDPR) legislation, which replaces the Data Protection Act, is due to come into effect in May 2018.  With only a month to go until GDPR is introduced, your employers will almost certainly have taken steps to ensure compliance and (we trust) briefed their employees.  However, if you’re looking to extend your awareness of what is involved, we’ve rounded up some resources that may help:

We need to talk about #Storify…

It would not be exaggerating to say that there were groans of anguish across the library and information community when Storify (now owned by Adobe) announced that the service will close in May 2018.  While the lengthy notice period was appreciated, with only a month until Storify closes, how can we ensure that we preserve existing stories, and what can we use as an alternative?

Archiving existing Storify stories

Wakelet very quickly rose to the challenge with a two-step process to Import your Stories to Wakelet.  Storify users can create Stories until the end of April 2018, and have until May 16 to move their Stories across.

Alternatives to Storify

Wakelet, obviously.  However, it has taken the recent introduction of the Import from Twitter feature to make it more of a Storify experience: see the brief Twitter Import video.

If you are primarily interested in curating content, there are still many alternative social bookmarking sites that can fill the void, e.g. Scoop.It or Pocket.  The excellent C4LPT website has a list of Curation & Social Bookmarking Tools.

Why is Storify closing? 

If you are interested in why social media service Storify is coming to an end, it is due to a sequence of acquisitions plus the growth in chronology and curation tools.  In a blog post, Ian Milligan reminds us how vulnerable user-generated content can be online, and that we need to steward our data responsibly.

Discovery AND disorder

Committee member Antony Groves from the University of Sussex writes about the issue of Discovery and how sometimes a curve ball can be thrown at you when you least expect it.

Discovery is not a straightforward process; if it were, some of us would be out of a job. However, this should not excuse unpredictable tools and searches; some obstacles are reasonable to expect and some are not. How would a 110m hurdler feel if an extra barrier were added or if the first were moved 10ft forward? The answer is that we’d only know how they felt if we asked them or maybe observed their next race. This post is not intended to focus on UX, though, but on teaching: specifically, how we talk to our users about fallible discovery services.

The anomaly that has prompted this post is the re-ordering of results when inserting AND between search terms in Ex Libris Primo (as of March 8th this appears to be happening at 15 Russell Group Libraries). This can be tested by typing the search terms ‘academic integrity’ into your discovery tool, then ‘academic AND integrity’, and comparing the two. Although the number of results stays the same, some of you will see that the order of the items changes. Predominantly this appears to be a Primo issue (although it is not happening everywhere with Primo), but Summon has its own mysteries. If you compare the above two searches in Summon, at several Russell Group libraries you will get a different number of results (although admittedly only a very slight difference).
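If you want to record this comparison rather than eyeball it, one option is to note down the first few results of each search by hand and compare the two lists. The sketch below is only an illustration under stated assumptions: the record identifiers are invented, the function name is hypothetical, nothing here calls Primo or Summon, and each list is assumed to contain no duplicates.

```python
def compare_results(first, second):
    """Compare two ordered lists of result identifiers.

    Reports whether the same items appear, whether they appear in the
    same order, and which items changed position.
    """
    first_set, second_set = set(first), set(second)
    moved = [
        item
        for item in first
        if item in second_set and first.index(item) != second.index(item)
    ]
    return {
        "same_items": first_set == second_set,
        "same_order": first == second,
        "only_in_first": sorted(first_set - second_set),
        "only_in_second": sorted(second_set - first_set),
        "moved": moved,
    }


# Invented identifiers standing in for the top results of each search.
without_and = ["record-a", "record-b", "record-c", "record-d"]
with_and = ["record-a", "record-c", "record-b", "record-d"]

report = compare_results(without_and, with_and)
for key, value in report.items():
    print(f"{key}: {value}")
```

A report where the items match but the order does not would correspond to the Primo behaviour described above, while entries that appear in only one list would correspond to the differing result counts seen in Summon.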

The Association of College & Research Libraries (ACRL) Framework for Information Literacy for Higher Education establishes “Searching as Strategic Exploration” as one of its six concepts, further explaining that “searching for information is often nonlinear” (ACRL 2015). However, is this intended to excuse tools giving inconsistent results, or instead to explain that searching is an iterative process, or both? Yes, if we’re teaching our users to search for resources in a strategic and systematic way, we should also be showing them the other databases we subscribe to and not solely relying on our discovery tools, but shouldn’t the discovery tool be providing a solid foundation on which to build? If our discovery services are not as good as they can possibly be, students will very quickly turn to Google instead.

When we have noticed anomalies we have reported them to Ex Libris, who have worked to resolve them or provided an answer as to why certain things are happening. The answer to a previous irregularity was that “the results of different searches aren’t necessarily comparable in a linear relation” (Ex Libris Knowledge Center, 2017). Is this a satisfactory response though? Within the Library we continue to user test our discovery tool (as do Ex Libris), and during our next round of testing we may find that students don’t mind these minor aberrations or perhaps are already used to shifting results from using Google. It could be that they haven’t asked, or even noticed, but as information professionals we should be ready to help those looking for the answer. Evidently, including or excluding AND between search terms does make a difference, perhaps not to the number of results but certainly to the way they are ordered. I cannot currently explain to users why this is happening or which set of results really is more relevant. What I can do is show them other ways of sorting and narrowing their searches. Like that first 110m hurdle, it is an obstacle that can still be cleared; I just feel I would be a better coach if I could explain why it has moved 10ft forward.

References

ACRL (2015) Framework for Information Literacy for Higher Education. Available at: http://www.ala.org/acrl/standards/ilframework#exploration (Accessed: 5 March 2017).

Ex Libris Knowledge Center (2017) Boolean searches in Primo don’t work as expected. Available at: https://knowledge.exlibrisgroup.com/Primo/Knowledge_Articles/Boolean_searches_in_Primo_doesn’t_work_as_expected (Accessed: 5 March 2017).