CS 410 Text Information Systems


(under constructior construction)

Goals and Objectives

  • Explain some basic concepts in natural language processing and text information access.
  • Explain why text retrieval is often defined as a ranking problem.
  • Explain the basic idea of the vector space retrieval model and how to instantiate it with the simplest bit-vector representation.

Guiding Questions

  • What is entropy? For what kind of random variables does the entropy function reach its minimum and maximum, respectively?
  • What is conditional entropy?
  • What is the relation between conditional entropy H(X Y) and entropy H(X)? Which is larger?
  • How can conditional entropy be used for discovering syntagmatic relations?
  • What is mutual information I(X;Y)? How is it related to entropy H(X) and conditional entropy H(X Y)?
  • What’s the minimum value of I(X;Y)? Is it symmetric?
  • For what kind of X and Y, does mutual information I(X;Y) reach its minimum? For a given X, for what Y does I(X;Y) reach its maximum?
  • Why is mutual information sometimes more useful for discovering syntagmatic relations than conditional entropy?
  • What is a topic?
  • How can we define the task of topic mining and analysis computationally? What’s the input? What’s the output?
  • How can we heuristically solve the problem of topic mining and analysis by treating a term as a topic? What are the main problems of such an approach?
  • What are the benefits of representing a topic by a word distribution?
  • What is a statistical language model? What is a unigram language model? How can we compute the probability of a sequence of words given a unigram language model?
  • What is Maximum Likelihood estimate of a unigram language model given a text article?
  • What is the basic idea of Bayesian estimation? What is a prior distribution? What is a posterior distribution? How are they related with each other? What is Bayes rule?

Additional Readings and Resources

  • C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016. Chapters 13, 17.

Key Phrases and Concepts

  • Entropy
  • Conditional entropy
  • Mutual information
  • Topic and coverage of topic
  • Language model
  • Generative model
  • Unigram language model
  • Word distribution
  • Background language model
  • Parameters of a probabilistic model
  • Likelihood
  • Bayes rule
  • Maximum likelihood estimation
  • Prior and posterior distributions
  • Bayesian estimation & inference
  • Maximum a posteriori (MAP) estimate
  • Prior model
  • Posterior mode

Video Lecture Notes

img

8-1 Syntagmatic Relation Discovery

8-1-1 Entropy

img
img
img
img
img
img

8-1-2 Conditional Entropy

  • Entropy of a word being present is minimum when given exact word is present (ie 0)
  • Entropy of a word being present is maximum when given unrelated word is present (ie, H(Xword))

img
img
img
img
img

8-1-3 Mutual Information Part 1

img
img
img
img
img
img
img

8-1-4 Mutual Information Part 2

img
img
img
img
img

8-2 Topic Mining and Analysis

8-2-1 Motivation and Task Definition

img
img
img
img

8-2-2 Term as Topic

img
img
img
img
img
img

8-2-3 Probabalistic Topic Models

img
img
img
img
img
img
img
img
img
img

8-2-3-1 Overview of Statistical Language Models Part 1

img
img
img
img

8-2-3-2 Overview of Statistical Language Models Part 2

img
img
img
img

8-2-3-3 Mining One Topic

img
img
img
img

CS 425 Distributed Systems


No new content this week, midterm.

CS 427 Software Engineering


Goals and Objectives

  • Design pattern
  • Observer pattern
  • Composite pattern
  • Interpreter Pattern
  • Visitor Pattern
  • Template Method Pattern
  • Iterator Pattern
  • Strategy Pattern

Video Lecture Notes

5-7 Design Pattern

  • A design pattern is the description of communicating objects and classes that are customized to solve a general design problem in a particular context, identifying:
    • The participating classes and objects
    • Their roles and collaborations
    • The distribution of responsibilities
  • Gang of Four Book: Design Patterns, elements of reusable object-oriented software
    • presents 23 object-oriented patterns:
      • creational patterns (5)
      • structural patterns (7)
      • behavioral patterns (11)

5-8 Observer Pattern

  • Using an observer pattern can reduce cycles of dependency (one to many dependency)
    • Cycles of dependence
      • If package A depends on package B, then you cannot run the tests for A unless you also have B
        • “A depends on B” means you cannot use A unless you have B
        • “Package A depends on packageB” means that something in A deponds on something in B
      • IF package A depends on package B, then package B should NOT depend on package A
        • If classes C and D both depend on each other, put them in the same package

5-9 Composite Pattern

  • Composite
    • problem:
      • Complex part-whole hierarchy has lots of similar classes
        • Example: book, chaptter, section, paragraph
    • Objectives
      • Simplicity - treat composite (ie, composition of parts) like a part
      • Power - create new kind of part by composing existing ones
      • Safety - treat parts and composites uniformly (no special cases)

img

  • Behavioral Patterns:
    • Observer: Define a one-to-many dependency between objects so that when one object changes state, all its dependents are notified and updated automatically.
    • Visitor: Represent an operation to be performed on the elements of an object structure. Visitor lets you define a new operation without changing the classes of the elements on which it operates.
    • Interpreter: Given a language, define a representation for its grammar along with an interpreter that uses the representation to interpret sentences in the language.
    • Template Method: Define the skeleton of an algorithm in an operation, deferring some steps to subclasses. Template Method lets subclasses redefine certain steps of an algorithm without changing the algorithm’s structure.
    • Iterator: Provide a way to access the elemetns of an aggregate object sequentially without exposing its underlying representation.
    • Strategy: Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary independently from clients that use it.
  • Structural Patterns
    • Composite: Compose objects into tree structures to represent part-whole hierarchies. Composite lets clients treat individual objects and compositions of objects uniformly.

img