My brother Sean is working on post-doctoral research in linguistics, especially the use of language in Shakespeare’s plays. That may seem like a domain far removed from the interests of the technologists who read this blog, but stick with me. It connects in unexpected ways to analytics that matter to us techies, and ultimately to a topic that should concern every reasonable person worldwide.
Let me start with Sean’s research. His goal has been to understand how the use of language, pronouns for example, differs between soliloquies in the comedies, the history plays and the tragedies. I won’t tax the patience of SemiWiki readers by going into the details – if you want to know more, there’s a link at the end of this blog. His approach is based on something called Corpus Linguistics – analysis of a body of writing to find trends and correlations.
Since Shakespeare’s works, prolific though he was, fit comfortably into one large, small-print volume, analysis of an electronic version can be performed easily with desktop software. Think of a statistical analysis package applied to language rather than numbers, looking at frequencies of word usage, or words used in close proximity. There are multiple software packages (from small and probably mostly academic vendors) for this type of analysis.
Automated analysis of language depends on recognition, and recognition at a basic word level can be very straightforward; even recognizing inflected words as variants of the base word is not complex in English. Going further than word recognition requires tagging the text (“this is the subject in this sentence” for example) or some level of natural language recognition, which gets you into the domain of Google’s SyntaxNet and deep-learning technologies.
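To give a rough feel for that word-level analysis, here’s a minimal Python sketch. It isn’t any of the actual corpus-linguistics packages Sean uses; the function names and the crude suffix-stripping are my own placeholders for illustration, and real tools do proper lemmatization and part-of-speech tagging.

```python
# Minimal sketch of word-frequency analysis with crude folding of inflected forms.
# Illustrative only: real corpus tools use proper lemmatization and tagging.
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def crude_stem(word: str) -> str:
    """Fold a few common English inflections onto a base form (very rough)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def word_frequencies(text: str) -> Counter:
    """Frequency table of (roughly) base words in a text."""
    return Counter(crude_stem(w) for w in tokens(text))

if __name__ == "__main__":
    sample = "To be, or not to be, that is the question."
    print(word_frequencies(sample).most_common(5))
```

Run over a whole electronic edition of the plays, a table like this is exactly the kind of raw material a statistical package can then slice by genre, character or speech type.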
Corpus Linguistics methods are not limited to published works. Domains within the Internet are obvious candidates for analysis, where Big Data analytics and deep learning methods can be valuable. But to what purpose? There are no doubt plenty of interesting market analyses that could be done this way, but one much more compelling application is detecting impending terrorist attacks.
Sean’s own department (at Lancaster University in the UK) is active in research in this area, as are a number of other universities. These groups work predominantly with social media posts from identified terrorists. The Lancaster group studies word “collocation”, measuring the closeness of connection between significant words and the name of a person or place. “Attack” and “crowded” would be an obvious example. This can be used to establish positive or negative associations; an increasing frequency of such connections then potentially indicates an upcoming attack.
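To make collocation a little more concrete, here’s a toy sketch of window-based collocation counting. This is not the Lancaster group’s method (serious collocation studies apply statistical association measures over large corpora); it simply counts how often other words turn up within a few tokens of a target word, and the sample text is invented.

```python
# Toy sketch of window-based collocation counting: which words appear near a target?
# Real studies score these counts with measures such as mutual information.
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def collocates(text: str, target: str, window: int = 5) -> Counter:
    """Count words occurring within +/- `window` tokens of `target`."""
    toks = tokens(text)
    counts: Counter = Counter()
    for i, tok in enumerate(toks):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(toks), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[toks[j]] += 1
    return counts

if __name__ == "__main__":
    posts = "the attack on the crowded market and planning another attack near a crowded station"
    print(collocates(posts, "attack", window=4).most_common(3))
```

Tracking how such counts for word pairs like “attack” and “crowded” change over time, across a stream of posts, is the kind of signal the researchers are after.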
While approaches like this are clearly not foolproof, they can provide valuable supporting evidence when combined with other indicators. For me, this general domain also illustrates the opportunities we often miss by sticking to our own silos of expertise. Technologies that we do understand are often used in domains far from those we might expect. And bigger pictures, combining needs and techniques from widely differing domains, can often suggest solutions that silo experts might miss.
You can learn more about Sean’s research HERE and the work on terrorist post analysis HERE.