CS 410 Text Information Systems

Goals and Objectives

What is clustering? What are some applications of clustering in text mining and analysis?
How does hierarchical agglomerative clustering work? How do single-link, complete-link, and average-link work for computing group similarity? Which of these three ways of computing group similarity is least sensitive to outliers in the data?
How do we evaluate clustering results?
What is text categorization? What are some applications of text categorization?
What does the training data for categorization look like?
How does the Naïve Bayes classifier work?
Why do we often use logarithms in the scoring function for Naïve Bayes?
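The last two questions can be explored with a small sketch. This is not the course's reference implementation: the toy training data, the function names, and the add-one smoothing are all assumptions made for illustration. It scores a document as the log prior plus the sum of log word probabilities, and the last two lines show why the log form is convenient: a product of many small probabilities underflows to zero, while the sum of their logs stays representable.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs, smoothing=1.0):
    """Estimate log priors and smoothed log word probabilities.

    labeled_docs: list of (list_of_tokens, label) pairs (toy format, assumed).
    """
    vocab = {w for tokens, _ in labeled_docs for w in tokens}
    doc_counts = Counter(label for _, label in labeled_docs)
    word_counts = {label: Counter() for label in doc_counts}
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(labeled_docs)) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # Additive smoothing keeps every vocabulary word at a non-zero probability.
        total = sum(counts.values()) + smoothing * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + smoothing) / total) for w in vocab}
    return log_prior, log_likelihood

def log_score(tokens, label, log_prior, log_likelihood):
    """log p(label) + sum of log p(word | label); out-of-vocabulary words are skipped."""
    probs = log_likelihood[label]
    return log_prior[label] + sum(probs[w] for w in tokens if w in probs)

def classify(tokens, log_prior, log_likelihood):
    """Return the label with the highest log score."""
    return max(log_prior, key=lambda c: log_score(tokens, c, log_prior, log_likelihood))

# Hypothetical training set: two spam and two non-spam ("ham") messages.
training = [
    ("win cash prize now".split(), "spam"),
    ("cheap cash offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
log_prior, log_likelihood = train_naive_bayes(training)
print(classify("cash prize offer".split(), log_prior, log_likelihood))  # spam

# Multiplying many small probabilities underflows; adding their logs does not.
print(0.001 ** 200)           # 0.0 in double precision
print(200 * math.log(0.001))  # about -1381.55
```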

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, 2016. Chapters 14 & 15.
Manning, Chris D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2007. Chapters 13-16.
Yang, Yiming. An Evaluation of Statistical Approaches to Text Categorization. Inf. Retr. 1, 1-2 (May 1999), 69-90. doi: 10.1023/A:1009982220290

Clustering, document clustering, and term clustering
Clustering bias
Perspective of similarity
Hierarchical Agglomerative Clustering, and k-Means
Direct evaluation (of clustering), indirect evaluation (of clustering)
Text categorization, topic categorization, sentiment categorization, email routing
Spam filtering
Naïve Bayes classifier
Smoothing
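
For hierarchical agglomerative clustering, the three ways of computing group similarity differ only in how the pairwise values between two groups are combined. The sketch below is a rough illustration under stated assumptions: it uses Euclidean distance between 2-D points as a stand-in for (inverse) similarity, and the function names and sample points are made up. Single-link looks at the closest pair, complete-link at the farthest pair, and average-link at the mean over all pairs, which is why it depends least on any single unusual point.

```python
import math

def group_distance(a, b, link):
    """Distance between two clusters of points under the chosen linkage."""
    dists = [math.dist(p, q) for p in a for q in b]
    if link == "single":      # closest (most similar) pair
        return min(dists)
    if link == "complete":    # farthest (least similar) pair
        return max(dists)
    if link == "average":     # mean over all cross-cluster pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown link: {link}")

def hac(points, k, link="average"):
    """Start with one cluster per point and merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: group_distance(clusters[ij[0]], clusters[ij[1]], link))
        merged = clusters.pop(j)  # j > i, so index i is unaffected by the pop
        clusters[i].extend(merged)
    return clusters

# Hypothetical data: two tight groups and one far-away point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for link in ("single", "complete", "average"):
    print(link, hac(points, 3, link))
```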

Differences between a mixture model (document clustering) and a topic model:
- The choice of which word distribution to use is made once per document in a mixture model, but multiple times (once for each word) in a topic model.
- In document clustering, a single word distribution generates all the words in a document. In a topic model, one distribution does not have to generate all the words; multiple distributions may each contribute words to the same document.
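
The two generative processes can be contrasted in a short sketch. The topic names, word probabilities, and function names here are invented for illustration only. The first function picks a word distribution once and draws the whole document from it; the second picks a distribution again for every word.

```python
import random

def sample_topic(topics, topic_probs, rng):
    """Choose one word distribution (by name) according to topic_probs."""
    return rng.choices(list(topics), weights=topic_probs)[0]

def sample_word(word_dist, rng):
    """Draw one word from a {word: probability} distribution."""
    words, weights = zip(*word_dist.items())
    return rng.choices(words, weights=weights)[0]

def generate_mixture_doc(topics, topic_probs, length, rng):
    """Mixture model: the distribution is chosen ONCE, then generates every word."""
    topic = sample_topic(topics, topic_probs, rng)
    return [topic] * length, [sample_word(topics[topic], rng) for _ in range(length)]

def generate_topic_model_doc(topics, topic_probs, length, rng):
    """Topic model: the distribution is chosen again for EACH word."""
    chosen, doc = [], []
    for _ in range(length):
        topic = sample_topic(topics, topic_probs, rng)
        chosen.append(topic)
        doc.append(sample_word(topics[topic], rng))
    return chosen, doc

# Hypothetical word distributions.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"vote": 0.5, "senate": 0.4, "team": 0.1},
}
rng = random.Random(0)
print(generate_mixture_doc(topics, [0.5, 0.5], 6, rng))
print(generate_topic_model_doc(topics, [0.5, 0.5], 6, rng))
```

In the first output every word shares the same topic label; in the second the labels are free to vary from word to word.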

Know how Remote Procedure Calls (RPCs) work.
Check a run of transactions for correctness (serial equivalence).
Design systems that use optimistic or pessimistic approaches to ensure correctness in spite of many concurrent clients.
Detect and avoid deadlocks.
Calculate nines availability for a replicated system.
Know how to ensure correctness (consistency) in spite of multiple servers.
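
As a worked example for the nines-availability objective, the sketch below rests on two simplifying assumptions that are not stated in these notes: replicas fail independently, and the system counts as available whenever at least one replica is up. Under those assumptions the system availability is 1 - (1 - a)^n for n replicas that are each available a fraction a of the time, and the number of nines is -log10 of the unavailability. The function names are illustrative.

```python
import math

def replicated_availability(single, replicas):
    """Probability that at least one of `replicas` independent servers is up."""
    return 1 - (1 - single) ** replicas

def nines(availability):
    """Number of nines, e.g. 0.999 corresponds to about 3."""
    return -math.log10(1 - availability)

def downtime_hours_per_year(availability):
    """Expected unavailable hours in a 365-day year."""
    return (1 - availability) * 365 * 24

# Three replicas that are each up 90% of the time.
system = replicated_availability(0.9, 3)
print(round(system, 6))                           # 0.999
print(round(nines(system), 3))                    # 3.0
print(round(downtime_hours_per_year(system), 2))  # 8.76
```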

Does an RPC always cross machine boundaries?
Why is marshaling needed at all?
What are conflicting operations and how can you use them to detect serial equivalence among transactions?
Is locking a form of pessimistic or optimistic concurrency control?
Does Google Docs use pessimistic or optimistic concurrency control?
What is one way to prevent deadlocks among transactions?
What does “three nines availability” really mean?
Why is two-phase commit preferable to one-phase commit?
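
For the question on conflicting operations, one common way to check a run of transactions is sketched below. The tuple format and function names are assumptions made for illustration. Two operations conflict when they come from different transactions, touch the same object, and at least one of them is a write. The sketch records, for every conflicting pair, which transaction went first, and treats the run as serially equivalent when those orderings contain no cycle.

```python
def conflicts(op_a, op_b):
    """Different transactions, same object, and at least one write."""
    (t1, kind1, obj1), (t2, kind2, obj2) = op_a, op_b
    return t1 != t2 and obj1 == obj2 and "write" in (kind1, kind2)

def is_serially_equivalent(schedule):
    """schedule: list of (transaction, 'read' or 'write', object) in execution order."""
    # Edge t1 -> t2: some operation of t1 ran before a conflicting operation of t2.
    edges = {}
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if conflicts(a, b):
                edges.setdefault(a[0], set()).add(b[0])

    # Depth-first search for a cycle in the precedence graph.
    visiting, done = set(), set()

    def has_cycle(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(has_cycle(n) for n in edges.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return not any(has_cycle(t) for t in list(edges))

# Lost-update interleaving: both transactions read x before either writes it.
bad = [("T1", "read", "x"), ("T2", "read", "x"), ("T1", "write", "x"), ("T2", "write", "x")]
# Interleaved, but every conflicting pair is ordered T1 before T2.
ok = [("T1", "write", "x"), ("T2", "write", "x"), ("T1", "write", "y"), ("T2", "write", "y")]
print(is_serially_equivalent(bad))  # False
print(is_serially_equivalent(ok))   # True
```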