CS 410 Text Information Systems


Goals and Objectives

  • Explain how we can extend a retrieval system to perform content-based information filtering (recommendation).
  • Explain how we can use a linear utility function to evaluate an information filtering system.
  • Explain the basic idea of collaborative filtering.
  • Explain how the memory-based collaborative filtering algorithm works.

Guiding Questions

  • What is content-based information filtering?
  • How can we use a linear utility function to evaluate a filtering system? How should we set the coefficients in such a linear utility function?
  • How can we extend a retrieval system to perform content-based information filtering?
  • What is the exploration-exploitation tradeoff?
  • How does the beta-gamma threshold learning algorithm work?
  • What is the basic idea of collaborative filtering?
  • How does the memory-based collaborative filtering algorithm work?
  • What is the “cold start” problem in collaborative filtering?

Additional Readings and Resources

  • C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 10 - Section 10.4,Chapters 11

Key Phrases and Concepts

  • Content-based filtering
  • Collaborative filtering
  • Beta-gamma threshold learning
  • Linear utility
  • User profile
  • Exploration-exploitation tradeoff
  • Memory-based collaborative filtering
  • Cold start

Video Lecture Notes

6-1 Learning to Rank Part 1

  • How can we combine many features? (learning to rank)
    • General Idea:
      • Given a query-doc pair (Q,D), define various kinds of features Xi(Q,D)
      • Examples of feature: the number of overlapping terms, BM25 score of Q and D, p(Q|D), PageRank of D, p(Q|Di), where Di may be anchor text or big font text, “does the URL contain ‘~’?”…
      • Hypothesize p(R=1|Q,D)=s(X1(Q,D)),…,Xn(Q,D),λ) where λ is a set of parameters
      • Learn λ by fitting functions with training data, ie, 3-tuples like (D,Q,1)(Dis relevant to Q) or (D,Q,0)(D is non-relevant to Q)

6-2 Learning to Rank Part 2

img

6-3 Learning to Rank Part 3

  • More Advanced Learning Algorithms
    • Attempt to directly optimize a retrieval measure (eg MAP, nDCG)
      • More difficult as an optimization problem
      • Many solutions were proposed
    • Can be applied to many other ranking problems beyond search
      • Recommender systems
      • Computational advertising
      • Summarization
  • Summary
    • Machine Learning has been applied to text retrieval since many decades ago (eg, Rocchio feedback)
    • Recent use of machine learning is driven by
      • large-scale training data available
      • need for combining many features
      • need for robust ranking (again spams)
    • Modern Web search engines all use some kind of ML technique to combine many features to optimize ranking
    • Learning to ranki is still an active research topic
  • Next generation search engines
    • more specialized/customized (vertical search engines)
      • special group of users (community engines, eg, Citeseer)
      • Personalized (better understanding of users)
      • Special genre/domain (better understanding of documents)
    • Learning over time (evolving)
    • Integration of search, navigation, and recommendation/filtering (full-fledged information management)
    • Beyond search to support tasks (eg shopping)
    • Many opportunities for innovations!

img
img

  • Above image shows what current search engines do (search, keyword queries, bag of words), and what people are working to expand it to do.

img

6-5 Recommender Systems-Content-Based Filtering Part 1

  • Two modes of text access: Pull vs Push
    • Pull mode (search engines)
      • users take initiative
      • Ad hoc information need
    • Push mode (recommender systems)
      • systems take initiative
      • stable information need or system has good knowledge about a user’s need
  • Recommender ≈ Filtering system
    • Stable & long term interest, dynamic info source
    • System must make a delivery decision immediately as a document “arrives”

img

  • Basic Filtering Question: Will User U Like Item X?
    • Two different ways of answering it
      • look at what items U likes, and then check if X is similar
        • Item similarity => content-based filtering
      • Look at who likes X, and then check if U is similar
        • User similarity => collaborative filtering
    • Can be combined

img

  • Three Basic Problems in content-based filtering
    • making filtering decision (binary classifier)
      • Doc text, profile text –> yes/no
    • Initialization
      • initialize the filter based on only the profile text or very few examples
    • Learning from
      • Limited relevance judegements (only on “yes” docs)
      • Accumulated documents
    • All trying to maximize the utility
  • Extend a retrieval system for information filtering
    • “Reuse” retrieval techniques to score documents
    • use a score threshold for filtering decision
    • learn to improve scoring with traditional feedback
    • new approaches to threshold setting and learning

img

6-6 Recommender Systems-Content-Based Filtering Part 2

img

  • Empirical utility optimization
    • Basic idea
      • compute the utility on the training data for each candidate score threshold
      • choose the threshold that gives the maximum utility on the training data set
    • difficulty: biased training sample!
      • we can only get an upper bound for the true optimal threshold
      • could a discarded item be possibly interesing to the user?
    • solution:
      • heuristic adjustment (lowering) of threshold

img

  • Beta-Gamma thereshold learning
    • Pros
      • explicitly addresses exploration-exploitation tradeoff (“safe” exploration)
      • Arbitrary utility (with appropriate lower bound)
      • empirically effective
    • Cons
      • Purely heuristic
      • Zero utility lower bound often too conservative
  • Summary
    • Two strategies for recommendation/filtering
      • content-based (item similarity)
      • collaborative filtering (user similarity)
    • Content-based recommender system can be built based on a search engine system by
      • adding threshold mechanism
      • adding adaptive learning algorithms

6-7 Recommender Systems-Collaborative Filtering Part 1

  • What is Collaborative filtering (CF) ?
    • Making filtering decisions for an individual user based on the judgements of other users
    • Inferring individual’s interest/preferences from that of other similar users
    • General idea
      • Given a user u, find similar users {u1, …, um}
      • Predict u’s preferences based on the preferences of u1, …, um
      • User similarity can be judged based on their similarity in preferences on a common set of items.
  • CF: Assumptions
    • Users with the same interest will have similar preferences
    • Users with similar preferences probably share the same interest
    • Examples
      • “interest is infomration retrieval” => “favor SIGIR papers”
      • “favor SIGIR papers” => “interest is information retrieval”
    • Sufficiently large number of user preferences are avaialable (if not, there will be a “cold start” problem)

img

6-8 Recommender Systems-Collaborative Filtering Part 2

img

img

  • Improving User similarity meeasures
    • Dealing with missing values: set to default ratings (eg, average ratings)
    • Inverse user frequency (IUF): similar to IDF

6-9 Recommender Systems-Collaborative Filtering Part 3

  • Summary of Recommender Systems
    • Filtering/Recommendations is “easy”
      • The user’s expectation is low
      • Any recommendation is better than none
    • Filtering is “hard”
      • Must make a binary decision, though ranking is also possible
      • Data sparseness (limited feedback information)
      • “cold start” (little information about users at the beginning)
    • Content-based vs Collaborative filtering vs Hybrid
    • Recommendation can be combined with search –> Push+Pull
    • Many advanced algorithms have been proposed to use more context information and advanced machine learning.

CS 425 Distributed Systems


Goals

  • Design an algorithm to calculate a distributed snapshot.
  • Assign FIFO/Causal/Total ordering to multicast messages.
  • Design a reliable multicast protocol.
  • Know the working of the industry-standard protocol called Paxos.
  • Know why consensus is hard to solve.

Key Concepts

  • Global Snapshots
  • Multicast Ordering
  • Multicast Reliability
  • Paxos
  • Impossibility of Consensus

Guiding Questions

  • What is the difference between a safety property and a liveness property?
  • How does the Chandy-Lamport algorithm work?
  • How do you assign FIFO/Causal timestamps to multicasts in a distributed system?
  • How does Paxos use quorums to ensure safety?
  • Why is consensus impossible to solve in asynchronous systems?

Readings and Resources

Video Lecture Notes

6-1-1 What is Global Snapshot

  • Distributed Snapshot
    • More often, each country’s representative is sitting in their respective capital, and sending messages to each other (say emails)
    • How do you calculate a “global snapshot” in that DS ?
    • what does a “global snapshot” even mean?
  • In the Cloud
    • each application or service is running on multiple servers
    • servers handling concurrent events and interacting with each other
    • the ability to obtain a “global photograph” of the system is important
    • some uses of having a global picture of the system
      • checkpointing: can restart distributed application on failure
      • Garbage collection of objects: objecst at servers that don’t have any other objects (at any servers) with pointers to them
      • Deadlock detection: useful in database transaction systems
      • Termination of computation: useful in batch computing systems like Folding@Home, SETI@Home
  • What’s a global snapshot?
    • Global snapshot = global state = individual state of each process in the DS + Individual state of each communication channel in the DS
    • Capture the instantaneous state of each process
    • And the instantaneous state of each communication channel, ie messages in transit on the channels
  • Obvious first solution
    • Synchronize clocks of all processes
    • ask all processes to record their states at known time t
    • Problems?
      • time synchronization always has error
        • your bank might inform you, “we lost the state of our distritbuted cluster due to a 1 ms clock skew in our snapshot algorithm”
      • also, does not record the state of messages in the channels
    • again: synchronization not reuired – causality is enough!
    • Why does time synchronization not work as a way to calculate a global snapshot in a distributed system? (select all correct answers)
      • time sync is inaccurate
      • it may not capture channel states
      • it may not capture process states
  • Moving from state to state
    • whenever an event happens anywhere in the system, the global state changes
      • process receives message
      • process sends message
      • process takes a step
    • state to state movement obeys causality
      • next: causal algorithm for global snapshot

6-1-2 Glboal snapshot algorithm

  • System model
    • Problem: record a global snapshot (state for each process, and state for each channel)
    • System model:
      • N processes in the system
      • there are two uni-directional communication channels between each ordered process pair: Pj –> Pi and Pi –> Pj
      • Communication channels are FIFO-ordered
      • No failure
      • All messages arrive intact, and are not duplicated
        • other papers later relaxed some of these assumptions
  • Requirements
    • snapshot should not interfere with normal application actions, and it should not require application to stop sending messages
    • Each process is able to record its own state
      • process state: applicaiton-defined state or, in the worst case:
      • its heap, registers, program counter, code, etc (essentially the coredump)
    • Global state is collected in a distributed manner
    • Any process may initiate the snapshot
      • We’ll assume just one snapshot run for now
  • Chandy-Lamport global snapshot algorithm
    • first, initiator Pi records its own state
    • initiator process creates special messages called “Marker” messages
      • not an application message, does not interfere with application messages
    • for j=1 to N except i
      • Pi sends out a Marker message on outgoing channel Cij
      • N-1 channels
    • Starts recording the incoming messages on each of the incoming channels at Pi: Cji (for j= 1 to N except i)
    • Whenever a process Pi receives a Marker message on an incoming channel Cki
      • if (this is the first Marker Pi is seeing)
        • Pi records its own state first
        • Marks the state of channel Cki as “empty”
        • for j = 1 to N except i
          • Pi sends out a Marker message on outgoing channel Cij
        • Starts recording the incoming messages on each of the incoming channels at Pi: Cji (for j = 1 to N except i and k)
      • else // already seen a Marker message
        • Mark the state of channel Cki as all the messages that have arrived on it since recording was turned on for Cki
    • The algorithm terminates when
      • All processes have received a Marker
        • to record their own state
      • all processes have received a Marker on all the (N-1) incoming channels at each
        • to record the state of all channels
    • Then, if needed, a central server collects all these partial state pieces to obtain the full global snapshot.
    • Which of the following does a process NOT do when it receives its first marker message?
      • Starts recording the state of some incoming channels
      • Starts recording the state of some outgoing channels
      • Records its own state
      • Marks the state of the incoming channel as empty

6-1-3 Consistent cuts

  • Cuts
    • cut = time frontier at each process and at each channel
    • events at the process/channel that happen before the cut are “in the cut”
      • and happening after the cut are “out of the cut”
  • Consistent cuts
    • a cut that obeys causality
    • a cut C is a consistent cut iff for (each pair of events e, f in the system)
    • such that event e is in the cut C, and if f –> e (f happens-before e)
      • then: event f is also in the cut C

img
img

  • Any run of the Chandy-Lamport Global snapshot algorithm creates a consistent cut

img
img

6-1-4 Safety and liveness

  • Correctness in DS
    • can be seen in two ways
    • liveness and safety
    • often confused – it’s important to distinguish from each other.
  • liveness
    • guarantee that something good will happen, eventually, meaning if you let the system run long enough, then
    • Examples in real world
      • guarantee that “at least one of the atheletes in the 100m final will win gold” is liveness
      • a criminal will eventually be jailed
    • Examples in a DS
      • distributed computation: Guarantee that it will terminate
      • “Completeness” in failure detectors: every failure is eventually detected by some non-faulty process
      • In consensus: all processes eventually decide on a value
  • Safety
    • guarantee that something bad will never happen
    • examples in real world
      • a peace treaty between two nations provides safety
        • war will never happen
      • an innocent person will never be jailed
    • examples in DS:
      • There is no deadlock in a distributed transaction system
      • No object is orphaned in a distributed object system
      • “Accuracy” in failure detectors
      • in consensus: no two processes decide on different values
  • Can we guarantee both?
    • Can be difficult to satisfy both liveness and safety in an asynchronous DS!
      • failure detector: completeness (liveness) and accuracy (safety) cannot both be guaranteed by a failure detector in an asynchronous DS
      • Consensus: Decisions (liveness) and correct decisions (Safety) cannot both be guaranteed by any consensus protocol in an asynchronous DS
      • Very difficult for legal systems (anywhere in the world) to guaranteed that all criminals are jailed (liveness) and no innocents are jailed (safety)
  • In the language of global states:
    • recall that a DS moves from one global state to another global state, via causal steps
    • liveness with respect to a property Pr in a given state S means
      • S satisfies Pr, or there is some causal path of global states from S to S’ where S’ satisfies Pr
    • Safety with respect to a property Pr in a given state S means
      • S satisfies Pr, and all global states S’ reachable from S also satisfy Pr
  • using global snapshot algorithm
    • chandy-lamport algorithm can be used to detect global properties that are stable
      • stable = once true, stays true forever afterwards
    • stable liveness examples
      • computation has terminated
    • stable non-safety examples
      • there is a deadlock
      • an object is orphaned (no pointers point to it)
    • all stable global properties can be detected using the Chandy-Lamport algorithm
      • due to its causal correctness
  • Summary
    • the ability to calculate global snapshots in a DS is very important
    • but don’t want to interrupt running DS application
    • chandy-lamport algorithm calculates global snapshot
    • obeys causality (creates a consistent cut)
    • can be used to detect stable global properties
    • safety vs liveness

6-2-1 Multicast ordering

img
img
img
img
img
img
img
img
img
img
img

6-2-2 Implementing Multicast Ordering 1

img
img
img
img
img
img

6-2-3 Implementing Multicast Ordering 2

img
img
img
img
img

6-2-4 Reliable Multicast

img
img
img
img
img

6-2-5 Virtual Synchrony

img
img
img
img
img
img
img
img
img
img
img
img
img
img
img

6-3-1 The Consensus Problem

img
img
img
img
img
img
img
img
img
img

6-3-2 Consensus in Synchronous Systems

img
img
img
img

6-3-3 Paxos put Simply

img
img
img
img
img
img
img
img
img
img
img
img
img

6-3-4 The FLP Proof

img
img
img
img
img
img
img
img
img
img
img
img
img
img
img
img
img
img
img
img

CS 427 Software Engineering


Video Lecture Notes

5-1 Object-Oriented Modeling

  • Abstractions: Software under development
    • Architecture
    • OO Design
  • Benefits of Modeling Notations
    • Communication
    • Documentation
    • Quality Assurance
    • Code Generation
  • Unified Modeling Language (UML)
    • Class Diagram
    • Sequence Diagram

img

5-2 Class Diagram-Overview

  • Purposes of class diagram
    • Analysis - conceptual
      • model problem, not software solution
      • can include actors outside system
    • Design - specification
      • the structure of how a software system will be written
    • Design - implementation
      • Actual classes of implementation
    • It DOES NOT capture how:
      • classes interact with each other
      • algorithm or behavior detail

img
img
img

5-3 Class Diagram-Relationships

img
img
img
img
img
img

5-4 Class Diagram-Miscellaneous

img
img

5-5 Requirements to Class Diagram

  • Nouns are good candidates for classes/attributes
  • Adjectives are good candidates for interfaces
  • Verbs are good candidates for methods or relationships between classes

5-6 Sequence Diagram

  • Purposes of UML Sequence Diagram
    • Used during requirements analysis
      • to refine use case descriptions
      • to find additional objects (“participating objects”)
    • Used during system design
      • to refine subsystem interfaces

img
img