CS 410 : Text Information Systems

Guiding Questions

What are some different ways to place a document as a vector in the vector space?
What is term frequency (TF)?
What is TF transformation?
What is document frequency (DF)?
What is inverse document frequency (IDF)?
What is TF-IDF weighting?
Why do we need to penalize long documents in text retrieval?
What is pivoted document length normalization?
What are the main ideas behind the retrieval function BM25?
What is the typical architecture of a text retrieval system?
What is an inverted index?
Why is it desirable to compress an inverted index?
How can we create an inverted index when the collection of documents does not fit into the memory?
How can we leverage an inverted index to score documents quickly?

Key Phrases and Concepts

Term frequency (TF)
Document frequency (DF) and inverse document frequency (IDF)
TF transformation
Pivoted length normalization
sBM25
Inverted index and postings
sBinary coding, unary coding, gamma-coding, and d-gap
Zipf’s law

Video Lecture Notes

2.1 Vector Space Model - Improved Instantiation

Slide 2 : Two problems with Simplest VSM

Matching a keyword more times deserves more credit towards a document
Matching some keywords may be more important than other keywords (not all keywords are created equal).

Slide 3 : Improved Vector Placement: Term Frequency (how many times it occurs in document) Vector

Use counts of each word in query/document instead of 1’s and 0’s
Will solve term frequency problem, but will not fix hierarchy of keywords

Slide 6 : Keyword Hierarchy

use Inverse Document Frequency of word to help differentiate between low and high priority words
IDF(W) = log((M+1)/k), M = Total number of docs in collection, k = total number of docs containing W (words that appear in many documents have lower priority than words that occur in only a relatively small number.

Slide 12 : VSM with TF-IDF effective?

works well for some cases, but introduces new problems in other situations
Still have problems with words that have high frequency in a document but missing other important keywords.

2.2 : TF Transformation

Slide 5 : TF Transformation: c(w, d) -> TF(w,d)

A way to lessen impact of c(w,d) of unimportant word when using TF-IDF weighting.
Best developed so far is BM25: y = (k+1)x/(x+k) (upper bound of k+1)

2.3 : Doc Length Normalization

Length Normalization can be a double edged sword, need to be careful to heavily penalize those with more words and lightly penalize those with more relevant content.
Have length normalizer depend on document length and average document length
normalizer = 1 - b + b( d )/avdl where b is between 0 and 1.
State of the Art VSM Ranking Functions:

Slide 7 : Further improvement of VSM?

Improved instantiation of dimension?
Improved instantiation of similarity function?

Slide 8 : Further improvements of BM25

BM25F (combine frequency counts of terms in all fields and then apply BM25 (instead of the other way)
BM25+ (add small constant to TF)

2.4 : Implementation of TR (Text Retrieval) Systems

rationale behind why inverted index is so fast compared to just sequential search through a document:

Two datastructures that can be used: Dictionary or Posting (inverted index)

2.5 : System Implementation - Inverted Index Construction

Most of this lecture spoke about different compression procedures:
- unary
- γ-encoding
- δ-encoding
The methods in the links above appear different to what Cheng describes in his slides:

2.6 : System Implementation - Fast Search

CS 425 : Distributed Systems

Goals

Know the key differences between cloud computing and previous generations of distributed systems.
Design MapReduce programs for a variety of problems.
Know how Hadoop schedules jobs.

Key Concepts

Clouds
MapReduce
Hadoop YARN
Sumeet Singh Interview

Guiding Questions

Why is cloud computing popular today?
What is different in cloud computing compared to previous generations of distributed systems?
How does one program in MapReduce?
How does the MapReduce system schedule jobs?

Video Lecture Notes

1.1 : Why Clouds?

Many different Cloud services (AWS, Google App Engine, etc)

Slide 5 : Two categories of clouds

Private Clouds (within company only)
Public Clouds for customers
- Amazon S3 (Simple Storage Service): store data
- Amazon EC2 (Elastic Compute Cloud): upload and run OS images, pay per hour
- Google App Engine/ Compute Engine

Slide 6 : Customers Save Time and Money

1.2 : What is a Cloud?

Slide 1 : What is it?

cluster!
supercomputer!
datastore!
superman!
none or all of the above!
Course Definition: Cloud = Lots of Storage + compute cycles nearby

Slide 2 : What is it continued

Single or multiple geographical data centers

Slide 3 : Sample Cloud topology

1.3 : Intro to Clouds: History

1.4 : Intro to Clouds: What’s New in Today’s Clouds

Massive Scale
On-demand access (pay as you go, anybody!)
Data-intensive Nature
New Cloud Programming Paradigms (MapReduce/Hadoop, NoSQL/Cassandra/MongoDB, etc)
WUE = Annual water usage / IT equipment energy (L/kWh)
PUE = Total Facility Power /IT Equipment Power
A datacenter consumes 2 MW of power, and out of this only 1.5 MW is used for the computing equipment. What is the PUE of this datacenter?
- 1.33

2.2 : What is a Distributed System?

Cloud is one form of a distributed system
Course Definition: A distributed system is a collection of entities(processes), each of which is autonomous, programmable, asynchronous (not working at same time, ie not parallel systems) and failure-prone, and which communicate through an unreliable communication medium.

3.1 : MapReduce Paradigm

comes from functional programming’s map and reduce
Use hashing to evenly distribute reduce tasks among workers
The primary difference between Map and Reduce is that Reduces process batches of values associated with the same key while Maps process keys independently.
A given Mapreduce program has the Map phase generate 100 key-value pairs with 10 unique keys. How many Reduce tasks can this program have when at least one Reduce task will certainly be assigned no keys when a hash partitioner is used (select all answers that are correct)?”
- Between 11 and 101.

3.2 : MapReduce Examples

There are three advantages of Maps and Reduces sorting their outputs:
- A Map can know which contiguous sets of its key-value pairs (sharing the same key) need to be sent to one Reduce
- A Reduce task can call the Reduce function on a contiguous set of key-value pairs (sharing the same key)
- Sorting as an application becomes very simple to write

3.3 : MapReduce Scheduling

Ensure no reduce tasks start before the map tasks all finish
Someone writes a variant of Mapreduce without a barrier, i.e., where Reduces can start before all Maps are done computing. Which of the following are TRUE statements?
- A Mapreduce run may be incorrect since some of the key-value pairs generated by Maps may never be processed by a Reduce if the reduce function is called exactly once per key
- All Mapreduce runs could be correct if Reduces maintain partial results for keys, and update these as new key-value pairs come in

FIFO task management.
In YARN, which of the following is dedicated to one job?
- Application Master

3.4 : MapReduce Fault-Tolerance

Speculative execution refers to (Replicative Execution is a better term, the point is to observe tasks, and if one is going super slow, replicate the task on other threads and whichever replica finishes first completes the task):
- Replicating slow tasks so that the first copy that finishes, completes the task

CS 427 : Software Engineering

Goals and Objectives

Explain differences between programming and software engineering.
Explain main goals of software engineering.
Explain main subareas of software engineering.

Video Lecture Notes

1.1 : Origins of Software Engineering

programming (single worker) vs software engineering (team of workers)

1.2 : Software

Malleable
Complex
Useful
Varies

1.3 : Software Project

Development Process (Productivity)
Software System (quality)
Software Users (Experience)

1.4 : Software Engineer

In real world, customer/user requirements for a software are ambiguous, constantly changing

MCS 1st Semester Week 2 Notes

CS 410 : Text Information Systems

Guiding Questions

Key Phrases and Concepts

Video Lecture Notes

2.1 Vector Space Model - Improved Instantiation

2.2 : TF Transformation

2.3 : Doc Length Normalization

2.4 : Implementation of TR (Text Retrieval) Systems

2.5 : System Implementation - Inverted Index Construction

2.6 : System Implementation - Fast Search

CS 425 : Distributed Systems

Goals

Key Concepts

Guiding Questions

Video Lecture Notes

1.1 : Why Clouds?

1.2 : What is a Cloud?

1.3 : Intro to Clouds: History

1.4 : Intro to Clouds: What’s New in Today’s Clouds

2.2 : What is a Distributed System?

3.1 : MapReduce Paradigm

3.2 : MapReduce Examples

3.3 : MapReduce Scheduling

3.4 : MapReduce Fault-Tolerance

CS 427 : Software Engineering

Goals and Objectives

Video Lecture Notes

1.1 : Origins of Software Engineering

1.2 : Software

1.3 : Software Project

1.4 : Software Engineer

Recent Posts

Tags