CS 410 Text Information Systems

Goals and Objectives

Explain some basic concepts in natural language processing.
Explain different ways to represent text data.
Explain the two basic types of word associations and how to mine paradigmatic relations from text data.

Guiding Questions

What does a computer have to do in order to understand a natural language sentence?
What is ambiguity?
Why is natural language processing (NLP) difficult for computers?
What is the bag-of-words representation?
Why is this word-based representation more robust than representations derived from syntactic and semantic analysis of text?
What is a paradigmatic relation?
What is a syntagmatic relation?
What is the general idea for discovering paradigmatic relations from text?
What is the general idea for discovering syntagmatic relations from text?
Why do we want to apply a term frequency (TF) transformation when computing similarity of context?
How does the BM25 term frequency (TF) transformation work?
Why do we want to apply inverse document frequency (IDF) weighting when computing similarity of context?

Additional Readings and Resources

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, 2016. Chapters 1-4, Chapter 13.
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999. Chapter 5 on collocations.
ChengXiang Zhai, Exploiting context to identify lexical atoms: A statistical view of linguistic context. Proceedings of the International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), Rio de Janeiro, Brazil, Feb. 4-6, 1997, pp. 119-129.
Shan Jiang and ChengXiang Zhai, Random walks on adjacency graphs for mining lexical relations from big text data. Proceedings of IEEE BigData Conference 2014, pp. 549-554.

Key Phrases and Concepts

Part-of-speech tagging
Syntactic analysis
Semantic analysis
Ambiguity
Text representation, especially bag-of-words representation
Context of a word; context similarity
Paradigmatic relation
Syntagmatic relation