CS 410 Text Information Systems

Goals and Objectives

Explain the basic ideas of logistic regression and K-Nearest Neighbors (k-NN), including how k-NN works.
Explain how to evaluate categorization results.
Explain the tasks of opinion mining and sentiment analysis and why they are important tasks from an application perspective.
Explain how sentiment analysis can be done using text categorization techniques and why a straightforward application of regular text categorization techniques may not be adequate.
Give examples of both simple and complex features that are used for characterizing text data and explain how NLP can enable complex features to be generated from text.

Guiding Questions

What’s the general idea of the logistic regression classifier? How is it related to Naïve Bayes? Under what conditions would logistic regression cover Naïve Bayes as a special case for two-category categorization?
What’s the general idea of the k-Nearest Neighbor classifier? How does it work?
How do we evaluate categorization results?
How do we compute classification accuracy, precision, recall, and F score?
Why is the harmonic mean, as used in the F measure, better than the arithmetic mean of precision and recall?
What’s the difference between macro and micro averaging?
Why is it sometimes interesting to frame a categorization problem as a ranking problem?
What is an opinion? How is it different from a factual statement?
What’s an opinion holder? What’s an opinion target?
What’s the goal of opinion mining?
What is sentiment analysis? How is it similar to and different from a text categorization task such as topic categorization?
Why are unigram features generally insufficient for accurate sentiment classification?
What’s the concern with using too many complex features, such as frequent substructures of parse trees?
What are some commonly used features to represent text data?
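
As a concrete reference point for the evaluation questions above, the following minimal sketch computes accuracy, per-category precision, recall, and F1, and contrasts macro-averaging with micro-averaging. The toy labels and the helper names (per_class_counts, prf, macro_micro_f1, accuracy) are illustrative assumptions, not part of the course material.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category p, but it was wrong
            fn[t] += 1  # true category t was missed
    labels = sorted(set(y_true) | set(y_pred))
    return {c: (tp[c], fp[c], fn[c]) for c in labels}


def prf(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean of the two) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_micro_f1(y_true, y_pred):
    """Macro: average the per-category F1. Micro: pool the counts, then compute F1."""
    counts = per_class_counts(y_true, y_pred)
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    pooled = [sum(col) for col in zip(*counts.values())]
    micro = prf(*pooled)[2]
    return macro, micro


def accuracy(y_true, y_pred):
    """Fraction of documents assigned to the correct category."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, imbalanced data (hypothetical): 8 "sports" and 2 "politics" documents.
y_true = ["sports"] * 8 + ["politics"] * 2
y_pred = ["sports"] * 8 + ["sports", "politics"]

macro_f1, micro_f1 = macro_micro_f1(y_true, y_pred)
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1, 3), "micro F1:", round(micro_f1, 3))
```

On this toy data the micro-averaged F1 is 0.9, while the macro-averaged F1 is lower (about 0.80) because the small "politics" category counts as much as the large one. Calling prf(1, 0, 9) gives precision 1.0 and recall 0.1: the arithmetic mean would be 0.55, but F1 is only about 0.18, which shows how the harmonic mean penalizes a large gap between the two.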

Additional Readings and Resources

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, 2016. Chapters 15 & 18.
Yiming Yang, An Evaluation of Statistical Approaches to Text Categorization, Information Retrieval 1(1-2), pp. 69–90, 1999. doi: 10.1023/A:1009982220290
Bing Liu, Sentiment analysis and opinion mining. Morgan & Claypool Publishers, 2012.
Bo Pang and Lillian Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008.

Key Phrases and Concepts

Generative classifier vs. discriminative classifier
Training data
Logistic regression
K-Nearest Neighbor classifier
Classification accuracy, precision, recall, F measure, macro-averaging, and micro-averaging
Opinion holder, opinion target, sentiment, and opinion representation
Sentiment classification
Features, n-grams, frequent patterns, and overfitting
