CS 410 Text Information Systems


Goals and Objectives

  • Explain the concept of text clustering and why it is useful.
  • Explain how Hierarchical Agglomerative Clustering and k-Means clustering work.
  • Explain how to evaluate text clustering.
  • Explain the concept of text categorization and why it is useful.
  • Explain how the Naïve Bayes classifier works.

Guiding Questions

  • What is clustering? What are some applications of clustering in text mining and analysis?
  • How does hierarchical agglomerative clustering work? How do single-link, complete-link, and average-link work for computing group similarity? Which of these three ways of computing group similarity is least sensitive to outliers in the data?
  • How do we evaluate clustering results?
  • What is text categorization? What are some applications of text categorization?
  • What does the training data for categorization look like?
  • How does the Naïve Bayes classifier work?
  • Why do we often use the logarithm in the scoring function for Naïve Bayes?
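
As a hint for the last question, here is a minimal sketch of the numerical issue. It assumes the usual Naïve Bayes score (a class prior multiplied by one probability per word) and uses made-up probabilities. The function names `product_score` and `log_score` are hypothetical, not from the lecture. All probabilities are assumed to be nonzero, which is what smoothing normally guarantees.

```python
import math

def product_score(prior, word_probs):
    """Multiply the prior by every word probability (prone to underflow)."""
    score = prior
    for p in word_probs:
        score *= p
    return score

def log_score(prior, word_probs):
    """Sum of logs: same ranking as the product, but numerically stable."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Hypothetical 100-word document scored against two categories.
probs_a = [1e-4] * 100   # each word is unlikely under category A
probs_b = [2e-4] * 100   # each word is twice as likely under category B

# Both products underflow to 0.0, so the categories cannot be compared.
print(product_score(0.5, probs_a), product_score(0.5, probs_b))

# The log scores stay finite and still show that B scores higher than A.
print(log_score(0.5, probs_a), log_score(0.5, probs_b))
```

Because the logarithm is monotonic, ranking categories by the log score gives the same decision the raw product would give if it could be computed exactly.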

Additional Readings and Resources

  • C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, 2016. Chapters 14 & 15.
  • Manning, Chris D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2007. Chapters 13-16.
  • Yang, Yiming. An Evaluation of Statistical Approaches to Text Categorization. Inf. Retr. 1, 1-2 (May 1999), 69-90. doi: 10.1023/A:1009982220290

Key Phrases and Concepts

  • Clustering, document clustering, and term clustering
  • Clustering bias
  • Perspective of similarity
  • Hierarchical Agglomerative Clustering, and k-Means
  • Direct evaluation (of clustering), indirect evaluation (of clustering)
  • Text categorization, topic categorization, sentiment categorization, email routing
  • Spam filtering
  • Naïve Bayes classifier
  • Smoothing

Video Lecture Notes

10-1 Text Clustering

10-1-1 Motivation

10-1-2 Generative Probabilistic Models Part 1

  • Differences between a mixture model (used for clustering) and a topic model:
    • The choice of which distribution to use is made once per document in a mixture model, but multiple times (once per word) in a topic model.
    • In document clustering, a single word distribution generates all the words in a document. In a topic model, one distribution does not have to generate all the words in the document; multiple distributions may have been used to generate them.
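
The difference can be sketched as two generative procedures. This is a minimal illustration under stated assumptions, not the lecture's implementation: the two toy word distributions in `TOPICS`, their probabilities, and all function names are hypothetical.

```python
import random

# Toy word distributions (hypothetical values; vocabularies are disjoint
# so it is easy to see which distribution produced each word).
TOPICS = [
    {"game": 0.6, "team": 0.4},   # distribution 0, e.g. sports
    {"vote": 0.7, "law": 0.3},    # distribution 1, e.g. politics
]

def draw_word(topic, rng):
    """Sample one word from a single word distribution."""
    words = list(topic)
    weights = [topic[w] for w in words]
    return rng.choices(words, weights=weights)[0]

def generate_mixture_doc(topics, priors, length, rng):
    """Mixture model: choose ONE distribution, then draw every word from it."""
    k = rng.choices(range(len(topics)), weights=priors)[0]  # choice made once
    return k, [draw_word(topics[k], rng) for _ in range(length)]

def generate_topic_model_doc(topics, coverage, length, rng):
    """Topic model: choose a distribution again for EACH word."""
    choices, words = [], []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=coverage)[0]  # per word
        choices.append(k)
        words.append(draw_word(topics[k], rng))
    return choices, words

rng = random.Random(0)
print(generate_mixture_doc(TOPICS, [0.5, 0.5], 8, rng))
print(generate_topic_model_doc(TOPICS, [0.5, 0.5], 8, rng))
```

In the first procedure every word comes from the same distribution, so the chosen index can serve as the document's cluster label. In the second, a single document can mix words from several distributions.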

10-1-3 Generative Probabilistic Models Part 2

10-1-4 Generative Probabilistic Models Part 3

10-1-5 Similarity-Based Approaches

10-1-6 Evaluation

10-2 Text Categorization

10-2-1 Motivation

10-2-2 Methods

10-2-3 Generative Probabilistic Models

CS 425 Distributed Systems


Goals

  • Know how Remote Procedure Calls (RPCs) work.
  • Check a run of transactions for correctness (serial equivalence).
  • Design systems that use optimistic or pessimistic approaches to ensure correctness in spite of many concurrent clients.
  • Detect and avoid deadlocks.
  • Calculate nines availability for a replicated system.
  • Know how to ensure correctness (consistency) in spite of multiple servers.

Key Concepts

  • LPCs vs RPCs
  • Marshaling
  • Serial Equivalence
  • Pessimistic Concurrency Control
  • Optimistic Concurrency Control
  • Deadlocks and their detection/avoidance/prevention
  • ACID Properties
  • Nines Availability
  • Active and Passive Replication
  • Two-phase commit

Guiding Questions

  • Does an RPC always cross machine boundaries?
  • Why is marshaling needed at all?
  • What are conflicting operations and how can you use them to detect serial equivalence among transactions?
  • Is locking a form of pessimistic or optimistic concurrency control?
  • Does Google Docs use pessimistic or optimistic concurrency control?
  • What is one way to prevent deadlocks among transactions?
  • What does “three nines availability” really mean?
  • Why is Two-phase commit preferable to One-phase commit?
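
For the "nines" question, the arithmetic can be sketched as follows. This is a rough illustration under two assumptions that are not stated in these notes: replicas fail independently, and the system counts as available when at least one replica is up. The function names are hypothetical.

```python
import math

def replicated_availability(single, replicas):
    """Probability that at least one of `replicas` independent copies is up."""
    return 1.0 - (1.0 - single) ** replicas

def nines(availability):
    """Number of nines, e.g. 0.999 corresponds to about 3."""
    return -math.log10(1.0 - availability)

def downtime_hours_per_year(availability):
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability) * 365 * 24

# "Three nines" means the system is available 99.9% of the time.
print(downtime_hours_per_year(0.999))

# Three replicas that are each available 90% of the time (one nine each).
combined = replicated_availability(0.9, 3)
print(combined, nines(combined))
```

Under these assumptions, three replicas with one nine each combine to roughly three nines, and three nines allows a little under nine hours of downtime per year.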

Readings and Resources

  • There are no readings required for this week.

Video Lecture Notes

Concurrency Control

RPCs

Transactions

Serial Equivalence

Pessimistic Concurrency

Optimistic Concurrency Control

Replication Control

Replication

Two-Phase Commit

CS 427 Software Engineering


Goals and Objectives

Video Lecture Notes

JUnit

Unit Testing

JUnit Overview

JUnit Assertions

JUnit Assumptions

Parameterized Tests

Basics

Complex Parameters

Writing Parameterized Tests

Test Generalization
