“Text mining” refers to software that can find patterns in text and extract meaning from them. It offers plenty of useful applications. For instance, law departments can benefit from text mining any time they collect a fair number of comments from surveys. I have used it myself for client projects.
Here’s how it might be useful for you. Let’s assume that a large number of your internal clients completed a satisfaction survey. One question was open-ended: “Overall, what would you like to say to the law department?” As a complement to coding those remarks by hand, text-mining software can spot commonly used words, classify each comment as favorable or unfavorable, and even tease out thematic topics.
Here’s how it works. You collect all of the comments and the commenters’ job levels, which together constitute your text-mining corpus (body of texts). The text needs to be in one or more files with the suffix “.txt” or without any suffix (just a plain-vanilla file with only ASCII characters).
If you use open-source R and its powerful tidytext package, it will extract each word in the comments into what’s called a data frame. Think of the data frame as a spreadsheet in which each word of each comment sits in its own row (so the number of rows equals the number of words in all of the comments). Another column shows which survey response the term comes from, and yet another column might hold the job level of the commenter (VP, manager, director, etc.). Once you count the words, a new column shows the frequency of each word in each comment. While quickly extracting and rearranging all of the words, the software converts every word to lowercase and removes all punctuation. If you want, the software can stem or lemmatize the words (shorten them to their roots), but we will pass over that step.
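To make the one-word-per-row idea concrete, here is a minimal sketch in plain Python rather than R and tidytext. The sample responses, the tokenize helper and the column order are all assumptions made for illustration, not the author's project code.

```python
import re

# Hypothetical survey responses: (response id, job level, comment text).
responses = [
    (1, "VP", "The law department responds quickly. Great work!"),
    (2, "Manager", "Contract reviews are slow; please prioritize them."),
]

def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

# One row per word: (response id, job level, word).
rows = [
    (response_id, job_level, word)
    for response_id, job_level, text in responses
    for word in tokenize(text)
]

for row in rows[:5]:
    print(row)
```

Each tuple plays the part of a spreadsheet row, so counting, filtering or grouping by job level becomes a matter of looping over rows.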
The data frame allows many kinds of manipulations and statistical analyses, which is fundamentally what text mining does. But first you have to make some decisions.
1. What stop-words should you remove?
Once you have prepared the text, a fundamental decision awaits you. Many words add no informational value, such as “the” and “a.” So one of the decisions in text mining is which trivial words, known as stop-words, to take out. People have compiled dictionaries of stop-words. In one of my projects for a client, I combined three of those dictionaries and kept the unique words in them as my consolidated stop-word list.
Even a large list, however, will leave words that from your background knowledge you know are unimportant. You can add those stop-words to your list and clean up your text so that what’s left are words that represent the more important terms in the comments. Fortunately, when you work with similar texts later, you can reuse your enhanced stop-words list and proceed more quickly.
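The consolidation and clean-up described above can be sketched in a few lines of plain Python. The three small word lists and the domain-specific additions below are invented placeholders; real stop-word dictionaries are far longer.

```python
# Three small, hypothetical stop-word lists standing in for published ones.
list_a = {"the", "a", "and", "of"}
list_b = {"the", "is", "are", "to"}
list_c = {"a", "to", "them", "please"}

# The set union keeps each stop-word exactly once.
stop_words = list_a | list_b | list_c

# Domain-specific additions chosen from background knowledge (assumed here).
stop_words |= {"law", "department"}

words = ["the", "law", "department", "reviews", "contracts",
         "and", "answers", "are", "slow"]

# Keep only the words that are not on the consolidated list.
kept = [word for word in words if word not in stop_words]
print(kept)
```

Saving the enhanced set to a file would let you reload it for the next batch of similar comments.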
2. What paired terms or synonyms should you identify?
English is splattered with combinations of words that have special meaning together: “general counsel,” “external counsel,” and so forth. Linguistic analysts call these bigrams (there are also trigrams, such as “year over year”). You can choose and mark bigrams to improve your insights.
The software can pick out every one, count their frequency, and thereby tell you which ones are used most often. You can then store a list of them and keep the bigrams together rather than treat them as independent words.
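As a rough illustration of counting bigrams and then keeping the chosen ones together, consider this plain-Python sketch. The sample comments, the threshold of two occurrences and the underscore used to join a pair are assumptions for the example.

```python
from collections import Counter

# Hypothetical comments, already lowercased and stripped of punctuation.
comments = [
    "the general counsel met external counsel",
    "our general counsel answers fast",
    "external counsel costs rose year over year",
]

def bigrams(words):
    """Return every pair of adjacent words."""
    return list(zip(words, words[1:]))

# Count each adjacent pair across all comments.
bigram_counts = Counter(
    pair for text in comments for pair in bigrams(text.split())
)

# Keep the pairs that occur at least twice (an arbitrary cutoff).
frequent = {pair for pair, n in bigram_counts.items() if n >= 2}

def merge_bigrams(words, keep):
    """Join any adjacent pair found in `keep` into a single term."""
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in keep:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged

print(merge_bigrams(comments[0].split(), frequent))
```

In practice you would review the frequent pairs by eye before storing them, since a raw count also surfaces pairs with no special meaning.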
A related step helps the text mining process improve. You can decide to tell the software to treat words with similar meanings as synonyms; for instance, “GC” equals “general counsel” or “litigation” equals “lawsuit.” To do this well, it helps to have domain-specific knowledge, because there seems to be no thesaurus resource for text mining.
In combination, stripping out stop-words and treating some terms as merged or synonymous will help you fathom your clients’ attitudes. Since the software can count the remaining terms and list them in descending frequency, you can quickly grasp the most important words in the comments. You can also create word clouds to highlight the most frequent words in font sizes that are scaled to their frequency.
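Here is a small plain-Python sketch of synonym mapping, frequency ranking and the font-size scaling behind a word cloud. The synonym table, the word list and the 10-to-20-point size range are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical synonym table: each variant maps to one preferred term.
synonyms = {"gc": "general_counsel", "lawsuit": "litigation"}

# Hypothetical words left after stop-word removal and bigram merging.
words = ["gc", "litigation", "lawsuit", "general_counsel",
         "budget", "litigation"]

# Replace each variant with its preferred term; leave other words alone.
normalized = [synonyms.get(word, word) for word in words]

# Count the terms and list them in descending frequency.
counts = Counter(normalized)
ranked = counts.most_common()
print(ranked)

# Scale a font size between 10 and 20 points by relative frequency.
top = ranked[0][1]
font_size = {word: 10 + (10 * n) // top for word, n in ranked}
print(font_size)
```

A word-cloud library would handle layout and color, but the sizing rests on the same kind of proportional scaling.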
3. How should you classify words grammatically and use that information?
Another set of decisions concerns parts of speech. Stated broadly, the nouns and verbs in your text typically constitute the core terms. The adjectives and adverbs provide nuance and coloring to them. Therefore, as part of text mining, you may decide to tag each word as a part of speech so that you can do analysis at that level. You might want to concentrate on just the nouns or just the verbs. (This step adds another column in the virtual spreadsheet.)
Tagging sounds simple, but in fact quite a few English words are hard to pin down with certainty. If you are a computer and can’t understand context, is the word “contract” a noun, a verb or an adjective?
As with stop-words, people have compiled massive lists of words and identified them by their parts of speech. Those lists, however, have holes. You can review the comments and decide the most appropriate classification and thereby get closer and closer to a useful classification. But, in the end, you will realize that some words can fall into more than one category.
You could analyze the survey comments for just the verbs. As a group, they may give you a good sense of what clients want the department to do, and how strongly they advocate those actions (how many times are “report” or “prioritize” or “standardize” used?).
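The sketch below illustrates dictionary-based tagging in plain Python, including its holes and its ambiguous words. The tiny lexicon is invented for the example; real taggers use far larger word lists and often the surrounding context as well.

```python
from collections import Counter

# Hypothetical lexicon: each word maps to every part of speech it can take.
pos_lexicon = {
    "contract": {"noun", "verb", "adjective"},
    "report": {"noun", "verb"},
    "prioritize": {"verb"},
    "standardize": {"verb"},
    "department": {"noun"},
}

words = ["please", "prioritize", "contract", "reviews", "and",
         "standardize", "each", "report", "prioritize"]

# Tag each word; a word missing from the lexicon gets an empty set.
tagged = [(word, pos_lexicon.get(word, set())) for word in words]

# Holes in the lexicon, and words that fit more than one category.
unknown = sorted({word for word, tags in tagged if not tags})
ambiguous = sorted({word for word, tags in tagged if len(tags) > 1})

# Count only the words that are unambiguously verbs.
verb_counts = Counter(word for word, tags in tagged if tags == {"verb"})

print("unknown:", unknown)
print("ambiguous:", ambiguous)
print("verbs:", verb_counts.most_common())
```

The unknown and ambiguous lists are the ones a reviewer would work through by hand to improve the classification.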
4. Should you rate how positive the text is?
For years, marketers have looked at customers’ comments and classified them as positive, negative or neutral. This is sentiment analysis. The words in the comments should reflect some degree of appreciation (positive) for the department or a leave-us-alone view (negative).
Sentiment analysis software relies on a database of words rated as positive or negative. You can revise that database or add to it. Again, the more someone knows about the domain, the more accurately texts can be characterized as positive or negative. Does “timely” imply a good thing or a bad thing? You, the human, are likely to know the answer to that question. The software does the best that it can, but our slippery, polysemic language leaves ample room for improvement.
A sentiment analysis should give a law department at least a clue as to the overall attitude of survey respondents. If the department tracks sentiment over time, it can also identify trends.
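A bare-bones version of lexicon-based sentiment scoring might look like the following plain-Python sketch. The word scores, the added rating for “timely” and the sample comments are assumptions, and real sentiment lexicons contain thousands of entries.

```python
# Hypothetical sentiment lexicon: +1 for positive words, -1 for negative.
sentiment_lexicon = {"great": 1, "helpful": 1, "slow": -1, "delay": -1}

# Domain-specific addition (assumed): here "timely" counts as praise.
sentiment_lexicon["timely"] = 1

def score(comment):
    """Sum the scores of the words in a comment; unknown words count as 0."""
    return sum(sentiment_lexicon.get(word, 0)
               for word in comment.lower().split())

def label(total):
    """Turn a numeric score into positive, negative or neutral."""
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

comments = [
    "Timely and helpful advice",
    "Slow reviews cause delay",
    "We met on Tuesday",
]

for comment in comments:
    print(label(score(comment)), "-", comment)
```

Averaging the scores for each survey year would give a simple trend line.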
Text mining should not stop with tables of output, or listings of words and topics. We mentioned word clouds, but many additional capabilities are available beyond the scope of this introductory article. Here is a sampling:
Plots and graphs. You also should produce plots that display the findings. Plots make visual any oddities or strong patterns. They also communicate findings effectively. A scatterplot of the 30 most frequent verbs, with one axis showing levels of respondents and the other axis showing the number of times the verb was used, can be eye-catching and insightful.
Correlations of words. Text mining software can tell you the associations of words with other words: that is, how likely one word is to appear with another. For example, “delay” might be highly correlated with “slow.”
Topic modeling. Text mining supports topic modeling, such as through a latent Dirichlet allocation (LDA) analysis. This powerful tool extracts groups of words that the software has identified as constituting a topic. It could be very useful to learn that one topic has to do with international work, if you wouldn’t have picked up on that theme just from reading the comments.
Social network graphs. These graphs show linkages between words and make the links thicker or thinner depending on the frequency of the linkage. You might see in one of these a word such as “risk” and several terms arrayed around it, such as “cautious” or “averse.”
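Word correlations and the link weights in a network graph both rest on how often two words appear in the same comment. The plain-Python sketch below counts co-occurring pairs and computes the phi coefficient, one common measure of correlation between two yes/no variables. The sample comments are invented, and the choice of phi is an assumption rather than something the article specifies.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical comments, each reduced to its set of distinct key words.
comments = [
    {"delay", "slow", "contract"},
    {"delay", "slow"},
    {"risk", "averse"},
    {"risk", "cautious", "contract"},
]

# Count how many comments contain each pair of words; these counts
# could set the thickness of the links in a network graph.
pair_counts = Counter(
    pair for words in comments for pair in combinations(sorted(words), 2)
)

def phi(word_a, word_b):
    """Phi coefficient between the presence of two words across comments."""
    n = len(comments)
    n_a = sum(word_a in c for c in comments)
    n_b = sum(word_b in c for c in comments)
    n_ab = sum(word_a in c and word_b in c for c in comments)
    denominator = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    return (n * n_ab - n_a * n_b) / denominator if denominator else 0.0

print(pair_counts.most_common(3))
print(phi("delay", "slow"))
```

A value near 1 means the two words tend to appear in the same comments, while a value near -1 means they tend to appear in different ones.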
Text mining has many applications for law departments, but you need to be aware of the choices available. Curating and organizing text several times over the course of a project can multiply the benefits. With each of the four decisions above, resources are at hand. You can readily tailor each of them to the terminology of your particular text and the needs of your particular analysis.