CS 410 Text Information Systems


Goals and Objectives

  • Explain the basic ideas of Logistic Regression, K-Nearest Neighbors (k-NN), and how K-NN works.
  • Explain how to evaluate categorization results.
  • Explain the tasks of opinion mining and sentiment analysis and why they are important tasks from an application perspective.
  • Explain how sentiment analysis can be done using text categorization techniques and why a straightforward application of regular text categorization techniques may not be adequate.
  • Give examples of both simple and complex features that are used for characterizing text data and explain how NLP can enable complex features to be generated from text.

Guiding Questions

  • What’s the general idea of the logistic regression classifier? How is it related to Naïve Bayes? Under what conditions would logistic regression cover Naïve Bayes as a special case for two-category categorization?
  • What’s the general idea of the k-Nearest Neighbor classifier? How does it work?
  • How do we evaluate categorization results?
  • How do we compute classification accuracy, precision, recall, and F score?
  • Why is harmonic mean as used in F better than the arithmetic mean of precision and recall?
  • What’s the difference between macro and micro averaging?
  • Why is it sometimes interesting to frame a categorization problem as a ranking problem?
  • What is an opinion? How is it different from a factual statement?
  • What’s an opinion holder? What’s an opinion target?
  • What’s the goal of opinion mining?
  • What is sentiment analysis? How is it similar to and different from a text categorization task such as topic categorization?
  • Why are unigram features generally insufficient for accurate sentiment classification?
  • What’s the concern of using too many complex features such as frequent substructures of parse trees?
  • What are some commonly used features to represent text data?

Additional Readings and Resources

  • C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, 2016. Chapters 15 & 18.
  • Yang, Yiming. An Evaluation of Statistical Approaches to Text Categorization. Inf. Retr. 1, 1-2 (May 1999), 69-90. doi: 10.1023/A:1009982220290
  • Bing Liu, Sentiment analysis and opinion mining. Morgan & Claypool Publishers, 2012.
  • Bo Pang and Lillian Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008.

Key Phrases and Concepts

  • Generative classifier vs. discriminative classifier
  • Training data
  • Logistic regression
  • K-Nearest Neighbor classifier
  • Classification accuracy, precision, recall, F measure, macro-averaging, and micro-averaging
  • Opinion holder, opinion target, sentiment, and opinion representation
  • Sentiment classification
  • Features, n-grams, frequent patterns, and overfitting

Video Lecture Notes

11-1 Text Categorization

11-1-1 Discriminative Classifier Part 1

img
img
img
img
img
img
img
img

11-1-2 Discriminative Classifier Part 2

img
img
img
img
img
img
img
img
img
img
img
img
img

11-1-3 Evaluation Part 1

img
img
img
img
img
img
img
img
img
img
img

11-1-4 Evaluation Part 2

img
img
img
img
img
img

11-2 Opinion Mining and Sentiment Analysis

11-2-1 Motivation

img
img
img
img
img
img
img
img
img
img
img
img

11-2-2 Sentiment Classification

img
img
img
img
img
img
img
img
img

11-2-3 Ordinal Logistic Regression

img
img
img
img
img
img
img

CS 425 Distributed Systems


Goals

  • Apply classical scheduling algorithms.
  • Apply popular Hadoop scheduling algorithms.
  • Know the internals of Apache Storm, a stream processing engine.
  • Know how enormous graphs can be processed in clouds.

Key Concepts

  • Classical Scheduling algorithms, including FIFO, Shortest Task First, and Round Robin
  • Popular Hadoop schedulers including Capacity Scheduler and Fair Scheduler
  • Internals of Apache Storm, a stream processing engine
  • Internals of distributed graph processing engines, e.g., Google’s Pregel

Guiding Questions

  • Why is shortest-task-first optimal?
  • What is the difference between the Capacity and Fair schedulers in Hadoop?
  • What is a Storm topology?
  • What is Gather-Apply-Scatter paradigm in distributed graph processing?

Readings and Resources

Apache Storm
Optional: Spark Streaming
Pregel

Video Lecture Notes

Stream Processing

img
img
img
img
img
img
img
img
img
img
img
img
img
img
img
img
img

Distributed Graph Processing

img
img
img
img
img
img
img
img
img
img
img
img
img
img
img

Structure of Networks

img
img
img
img
img
img
img
img
img
img
img
img
img
img

Scheduling

Single-processor Scheduling

img
img
img
img
img
img
img
img
img
img

Hadoop Scheduling

img
img
img
img
img
img
img
img
img
img
img

Dominant-resource Fair Scheduling

img
img
img
img
img
img
img
img
img
img

CS 427 Software Engineering


Goals and Objectives

Video Lecture Notes