Module 5: The Role of Language in IR

Question #1:  Summarize the three sources discussed in this module:

“Modern Information Retrieval: A Brief Overview”

The core of “Modern Information Retrieval: A Brief Overview” revisits three key information retrieval models that allow for relevance ranking, as well as techniques for implementing the term weighting that is key to each of these models.  The three models, with brief descriptions, are:

  • The Vector Space Model: In this model, term weights are applied to index terms within queries and documents, and the degree of similarity between a query and a document is represented by the angle between the query vector and the document vector.  The smaller the angle, the more similar the document is to the query, and this graded similarity is what allows for relevance ranking.  Another advantage of the Vector Space Model over the Boolean model is partial matching (in Boolean, a document either matches the query or it does not).
  • Probabilistic Model: This model estimates the probability that a document is relevant to a user query, and documents are ranked accordingly.  The estimate is made by comparison to what might be considered an ideal answer set.  All of our readings suggested that there are different ways, and thus different models, for creating that estimation.
  • The Inference Network Model: In this model, a document is given a score that combines the individual weights of the terms in the document, and documents are then ranked according to that combined score.
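
To make the angle idea in the Vector Space Model more concrete, here is a minimal Python sketch.  It is only an illustration under my own assumptions: raw term counts stand in for the term weights, the three documents are made up, and `cosine_similarity` and `rank` are names I chose, not anything from the article.

```python
import math
from collections import Counter


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


def rank(query, docs):
    """Return (score, name) pairs, best match first.

    Assumption: raw term counts are used as weights for simplicity.
    """
    q = Counter(query.lower().split())
    scored = [
        (cosine_similarity(q, Counter(text.lower().split())), name)
        for name, text in docs.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical mini collection.
docs = {
    "d1": "information retrieval models rank documents",
    "d2": "boolean retrieval returns exact matches",
    "d3": "cooking recipes for dinner",
}
ranking = rank("information retrieval", docs)
```

Here d1 matches both query terms, d2 matches only one (a partial match that a strict Boolean AND would reject), and d3 matches none, so the documents come back in that order.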

Tf-idf Weighting:  This article then reviews the concepts of term frequency and inverse document frequency as methodologies for implementing the term weighting that drives the above models.  Term frequency, simply put, is the number of times a term appears within a given document.  Using term frequency alone has drawbacks: relevance can be skewed by the length of a document (longer documents are likely to contain more instances of a term), and it ignores how common the term is across the documents in the document set.  To address the first, term frequency can be normalized by document length.

Inverse document frequency looks at how rare a term is across the document set.  The fewer documents a term appears in, the more heavily it is weighted for relevance scoring purposes.

Tf-idf, then, uses a combination of both concepts to score documents.
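
As a rough sketch of how the two concepts combine, the Python below multiplies a length-normalized term frequency by a logarithmic inverse document frequency.  This is one common variant and not necessarily the exact formula in the article; the function name `tf_idf` and the three sample documents are my own assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights for every term in every document.

    Assumed variant: tf = count / document length,
    idf = log(N / number of documents containing the term).
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical mini collection.
docs = {
    "d1": "ranking models for retrieval",
    "d2": "retrieval of legal documents",
    "d3": "legal retrieval systems",
}
weights = tf_idf(docs)
```

In this toy collection "retrieval" appears in every document, so its weight is zero everywhere, while "ranking" appears in only one document and therefore gets the highest weight in d1.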

Chu’s Chapter 4, and Baeza-Yates/Ribeiro-Neto Chapter 6

With these IR models in mind, how are documents represented and retrieved?  Chu’s Chapter 4 discusses Natural Language and Controlled Vocabularies (thesauri, subject heading lists, and classification schemes).  Chu’s Chapter 4 also discusses, and Chapter 6 of Baeza-Yates and Ribeiro-Neto focuses on, document markup for digital content.

Natural Language:  Natural language is everyday language, without efforts to control vocabulary or syntax.  It consists of significant words and function words (also called “stop” or common words).  Stop words, such as articles, are not processed as index terms because they are not suitable for document representation purposes.
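
A small Python sketch of stop word removal is below.  The stop list is a tiny hypothetical one I made up for illustration (real systems use much longer lists), and `index_terms` is just my own name for the helper.

```python
# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is"}


def index_terms(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


terms = index_terms("The role of language in information retrieval.")
# terms == ["role", "language", "information", "retrieval"]
```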

Controlled Vocabularies:  Thesauri, Subject Headings, and Classification Schemes are all artificial languages that provide additional controls and syntax.  Their additional labels and synonyms provide context and cross-referencing that natural language cannot.  Unlike natural language, all of these controlled vocabularies require ongoing updating and thus carry maintenance costs.
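
To show what that cross-referencing might look like in practice, here is a Python sketch of a toy thesaurus lookup.  Everything in it is hypothetical: the `USE` and `RELATED` tables, their entries, and the `expand_query` function are my own illustration, not taken from any real thesaurus.

```python
# Hypothetical thesaurus: entry terms point to a preferred term ("USE"),
# and preferred terms may list related terms.
USE = {"cars": "automobiles", "autos": "automobiles", "films": "motion pictures"}
RELATED = {"automobiles": ["motor vehicles"]}


def expand_query(terms):
    """Map each term to its preferred form and add related terms."""
    expanded = []
    for term in terms:
        preferred = USE.get(term, term)
        for candidate in [preferred, *RELATED.get(preferred, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query(["cars", "safety"])
# expanded == ["automobiles", "motor vehicles", "safety"]
```

The tables themselves are the part that needs ongoing upkeep: every new synonym or related term has to be added by someone.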

Metadata and Markup Languages:  These tools provide document representations for digital documents.  They all fall under the definition of “metadata,” which is “data about data.”

  • Document Formats: Text, Image and Video Formats – so, PDFs, JPGs, MPEGs and more
  • Markup languages like HTML and XML, which annotate text with tags that structure the data in a document.  In HTML, these tags also provide instructions for how to render a web document on a browser page.
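
As a sketch of how markup makes “data about data” machine-readable, the Python below parses a small XML record with the standard library’s `xml.etree.ElementTree`.  The record and its element names (`title`, `subject`, `format`) are made up for illustration and do not follow any particular metadata standard; `extract_metadata` is my own helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; element names are illustrative only.
RECORD = """
<record>
  <title>Modern Information Retrieval: A Brief Overview</title>
  <subject>Information retrieval</subject>
  <subject>Term weighting</subject>
  <format>PDF</format>
</record>
"""


def extract_metadata(xml_text):
    """Collect the text of each child element, grouped by tag name."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for child in root:
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields


metadata = extract_metadata(RECORD)
# metadata["subject"] == ["Information retrieval", "Term weighting"]
```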

Baeza-Yates/Ribeiro-Neto also discuss text operation processes that improve the results of a user’s search query, specifically lexical analysis, stemming, index term selection, and thesaurus construction.  Text compression techniques to reduce storage space and improve system performance were also reviewed.
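
To give a feel for stemming, here is a deliberately naive Python sketch that strips a few common English suffixes.  It is not the Porter algorithm or any method from the chapter; the suffix list, the minimum stem length of three letters, and the name `naive_stem` are all my own assumptions.

```python
# Assumed suffix list, checked longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {naive_stem(w) for w in ["retrieving", "retrieved", "retrieves"]}
# stems == {"retriev"}
```

Because all three word forms reduce to the same stem, a query containing one form can match documents containing the others.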

Question #2:  Discuss the language (Natural language, Controlled Vocabulary, Metadata and markup languages) that works best for academic institutions and government organizations. 

This is a difficult question to answer, as each language has its benefits and drawbacks, as well as situations where it applies best.  Focusing first on natural language vs controlled vocabulary: as our reading points out, natural language is easy to use, since terms and phrases can be extracted from queries that users pose in the form of everyday questions.  However, cross-referencing between synonyms can be neglected, and term ambiguity can be a problem.  As in my previous entry, when “rico” is entered, the system needs context to know whether to process it as a name (Rico), as part of the name of the Commonwealth of Puerto Rico, or as RICO, the racketeering statute.

In an academic environment particularly, natural language has advantages in that it is easy to learn (there is little “learning,” actually), and as more and more IR systems adopt natural language processing they become more appealing to the “Google generation.”  The “Google generation,” however, may miss potentially relevant documents in their queries because they do not know to account for natural language drawbacks.

While controlled vocabularies can provide context as well as cross-referencing, they require upkeep and maintenance to remain relevant, and this means costs, both in monetary upkeep and in the time invested to learn and use them confidently.

I do believe that the inexperienced searcher can benefit from controlled vocabularies – if they are willing to take the time to learn them.  Classification schemes, for example, can lead users to add value to their queries that they would not have otherwise, particularly if they are unfamiliar with the topic they are researching.  In addition to the drawback of investment of time, there is also the drawback of investment of upkeep, and I would think this would be of particular concern to academic institutions and government institutions, where budget is often a big concern.

Separately, metadata and markup languages are used in structuring digital data and describing/rendering it in a certain way.  In today’s age I don’t think any academic or government collection would be complete without inclusion of digital data, so knowledge of these languages seems like table stakes.  Controlled vocabularies, in contrast, have their origins in describing and organizing books.
