Posted April 16, 2019 by Philip Burns
This is the second of four articles about Natural Language Processing. The first article introduced basic concepts and fundamental methods.
In the previous article in this series, we listed six basic natural language processing methods upon which we can build more advanced techniques.
- Language detection
- Tokenization
- Sentence splitting
- Part of speech tagging
- Lemmatization
- Parsing
What can we do once we add basic structure to a text using these methods?
- Keyword and named entity extraction identifies persons, places, organizations, dates, times, and important phrases in a text. The discussion analytics tool developed at Northwestern incorporates both entity and keyword extraction. A short code sketch of entity extraction appears after this list.
- Automatic translation allows a computer to quickly translate a complex piece of text from one language into another. Because different languages are highly nuanced and idiosyncratic, this is an area where machine learning techniques are extremely useful. This is the technology that allows Google to translate pages from French or Japanese into English. By looking at the way language is actually used across millions of web pages, the computer is able to offer more accurate and idiomatic translations than a dictionary-based process alone.
- Automatic summarization creates a short summary of a longer piece of text that captures the most relevant information. Think of the abstracts or executive summaries found at the beginnings of research papers and longer reports. A summarizer either extracts key sentences and combines them into a concise paragraph, or generates an original summary from keywords and phrases. At Northwestern, this has been used to summarize forum discussions in a class as well as free-form answers in surveys. A toy extractive summarizer is sketched after this list.
- Geographic location assignment determines the location of place names (gathered using named entity extraction) in terms of latitude and longitude. At Northwestern this has been used to generate Google Maps of the locations of participants in online courses and MOOCs. A geocoding sketch follows the list.
- Natural Language Generation combines data analysis and text generation to convert raw data into prose. For example, this is often used to take weather data from various instruments and prepare a weather report. At Northwestern this could be used to generate articles about “Northwestern in the News” by scanning current news-wire articles for references to people and places at Northwestern. Northwestern computer science professors Larry Birnbaum and Kris Hammond founded a company called Narrative Science to investigate commercial applications of this technology. A small template-based sketch appears after the list.
- Readability analysis provides measures of the education level required to comprehend a text. This allows you to target an article to the expected readership. At Northwestern this has been used to measure the increase in use of sophisticated vocabulary in course peer reviews. Some of these measures are provided by Northwestern’s discussion analytics tool for Canvas-hosted courses. One standard readability formula is sketched after the list.
- Speech processing allows virtual assistants to translate verbal commands into discrete actions for the computer to perform. This technology allows Amazon Echo to answer your request for the current weather, or Siri to turn your question about local hot spots into a Yelp search for dinner recommendations. It can also be used to create “chatbots” that provide a virtual help system; Northwestern’s Canvas Chatbot exemplifies the back end for such a system.
- Topic extraction and segmentation divides text into meaningful units. This can be used to extract topic hierarchies from free-text survey responses, a set of blog postings, forum discussion messages, and research articles. A topic-modeling sketch follows the list.
- Sentiment and mood analysis assigns numbers to emotions expressed in text. Marketers use sentiment analysis to inform brand strategies. Customer service and product departments use it to identify bugs, product enhancements, and possible new features. At Northwestern, sentiment analysis is used to gauge the reaction of respondents to various entities and topics in free-form text responses to survey questions. Sentiment analysis also features in Northwestern’s discussion analytics tool for Canvas courses. A sentiment-scoring sketch appears after the list.
- Relation extraction determines how two or more entities are related and how they affect one another. For example, from a phrase such as “Philip bought stock in Apple”, we can use relation extraction to determine that the person Philip and the company Apple are related through the action of buying stock. A toy dependency-based sketch of this example follows the list.
- Coreference resolution locates mentions of the same entity in a text, even when the names are expressed differently. Consider the text: “Philip Burns is a developer in Research Computing Services. He has worked with natural language processing technology extensively in his career. Phil also has experience with statistical analysis.” Here “Philip Burns”, “He”, and “Phil” all refer to the same person. These connections are necessary when applying sentiment and mood analysis so that the sentiment values can be attributed to an entity regardless of which form the entity takes. Coreference resolution is also important in determining which characters interact in literary works. A sketch using an off-the-shelf coreference library closes out the examples after this list.
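To make a few of these techniques more concrete, the sketches below show roughly what they look like in code. They use common open-source Python libraries chosen purely for illustration, and they are minimal examples rather than the implementations behind the Northwestern tools described above. First, named entity extraction with spaCy, assuming the small English model (`en_core_web_sm`) has been installed; the sample sentence is invented.

```python
# Minimal named entity extraction sketch using spaCy (an illustrative choice,
# not necessarily the library behind Northwestern's discussion analytics tool).
# Setup assumption: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Philip Burns met with colleagues at Northwestern University in "
        "Evanston, Illinois on April 16, 2019 to discuss discussion analytics.")

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is a category such as PERSON, ORG, GPE (place), or DATE
    print(ent.text, ent.label_)
```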
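Next, a toy extractive summarizer: it scores each sentence by the frequency of the words it contains and keeps the top few. This is only a caricature of real summarizers, but it illustrates the extractive approach described above; the stop-word list and sample text are invented for the example.

```python
# Toy extractive summarizer: rank sentences by the frequency of their words
# and keep the highest-scoring ones, preserving the original sentence order.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "for", "on", "with", "that", "this", "it", "as", "be"}

def summarize(text, n_sentences=2):
    # Naive sentence split; a real pipeline would use a proper sentence splitter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in STOPWORDS)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)

sample = ("Natural language processing adds structure to raw text. "
          "Tokenization, tagging, and parsing are the basic building blocks. "
          "Summarization builds on those blocks to pick out the key sentences. "
          "Long forum discussions can then be reduced to a short paragraph.")
print(summarize(sample))
```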
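Geographic location assignment can be sketched with the geopy library and the free OpenStreetMap Nominatim service, an arbitrary choice of geocoder; note that each lookup below is a live network request.

```python
# Minimal geocoding sketch: turn place names into latitude/longitude pairs
# using geopy's interface to the OpenStreetMap Nominatim service.
# Setup assumption: pip install geopy
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="nlp-article-example")  # identify your application
place_names = ["Evanston, Illinois", "Chicago", "Paris, France"]

for name in place_names:
    location = geolocator.geocode(name)  # network call; may return None
    if location is not None:
        print(f"{name}: {location.latitude:.4f}, {location.longitude:.4f}")
```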
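The simplest form of natural language generation is template filling. The sketch below turns a dictionary of invented weather readings into a sentence of prose; real NLG systems add content selection and much richer sentence planning.

```python
# Template-based natural language generation: convert structured (and here,
# invented) weather data into a readable sentence.
observation = {
    "city": "Evanston",
    "sky": "partly cloudy",
    "high_f": 58,
    "low_f": 41,
    "wind_mph": 12,
}

def weather_report(obs):
    return (f"{obs['city']} will be {obs['sky']} today, with a high near "
            f"{obs['high_f']}°F, a low around {obs['low_f']}°F, and winds "
            f"near {obs['wind_mph']} mph.")

print(weather_report(observation))
```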
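One widely used readability measure is the Flesch-Kincaid grade level, which combines average sentence length with average syllables per word. The sketch below implements the standard formula with a crude vowel-group syllable counter, so its output is approximate; dedicated libraries (textstat, for example) do this more carefully.

```python
# Approximate Flesch-Kincaid grade level:
#   0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
import re

def count_syllables(word):
    # Count groups of consecutive vowels as syllables (a rough heuristic).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(flesch_kincaid_grade(
    "The cat sat on the mat. The quick brown fox jumps over the lazy dog."), 1))
```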
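Topic extraction is often approached with latent Dirichlet allocation (LDA). The sketch below fits a two-topic LDA model from scikit-learn to four invented documents and prints each topic's most heavily weighted words; the documents and topic count are purely illustrative, and a recent scikit-learn release is assumed.

```python
# Minimal topic modeling sketch with scikit-learn's LDA implementation.
# Setup assumption: pip install scikit-learn (version 1.0 or later)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "The syllabus covers grading, exams, and homework deadlines.",
    "Homework three asks about exam preparation and the grading policy.",
    "The field trip visits museums and galleries in downtown Chicago.",
    "Students posted photos from the Chicago museum visit.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most heavily weighted words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top_terms = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top_terms)}")
```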
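For sentiment analysis, a quick starting point is NLTK's rule-based VADER analyzer, chosen here only for illustration. Each text receives negative, neutral, positive, and compound scores; the compound score runs from -1 (most negative) to +1 (most positive).

```python
# Minimal sentiment scoring sketch with NLTK's VADER analyzer.
# Setup assumption: pip install nltk (the lexicon download below is one-time).
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
responses = [
    "The discussion forums were fantastic and really helped me learn.",
    "The assignment instructions were confusing and frustrating.",
]
for text in responses:
    scores = analyzer.polarity_scores(text)  # neg, neu, pos, compound
    print(f"{scores['compound']:+.2f}  {text}")
```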
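The relation extraction example above (“Philip bought stock in Apple”) can be approximated by walking spaCy's dependency parse and pulling out subject-verb-object triples. This single pattern is only a toy, and it ignores the prepositional link to “Apple”, which a fuller extractor would follow through the parse.

```python
# Toy relation extraction: pull (subject, verb, object) triples out of the
# dependency parse. Assumes the en_core_web_sm model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Philip bought stock in Apple.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [w for w in token.children if w.dep_ == "nsubj"]
        objects = [w for w in token.children if w.dep_ in ("dobj", "obj")]
        for subj in subjects:
            for obj in objects:
                # Expected to yield something like: Philip --bought--> stock
                print(f"{subj.text} --{token.text}--> {obj.text}")
```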
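Finally, coreference resolution generally requires a trained model. The sketch below uses the neuralcoref extension for spaCy on the example text from the list; neuralcoref targets spaCy 2.x and may not install against current spaCy releases, so treat it as one possible approach rather than a recommendation.

```python
# Coreference resolution sketch with neuralcoref (a spaCy 2.x extension).
# Setup assumption: a spaCy 2.x environment with neuralcoref and an English model.
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("Philip Burns is a developer in Research Computing Services. "
          "He has worked with natural language processing technology "
          "extensively in his career. Phil also has experience with "
          "statistical analysis.")

# Each cluster groups mentions the model believes refer to the same entity,
# e.g. "Philip Burns", "He", "his", and "Phil".
for cluster in doc._.coref_clusters:
    print(cluster.main.text, "<-", [mention.text for mention in cluster.mentions])
```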
The third article in this series will offer more examples of how natural language processing has been used at Northwestern University.