Module 10: Digital Libraries Retrieval

This week’s assignment is to search and review three articles that describe the “good” and the “bad” of digital libraries as the new frontier for storing and retrieving data.  As the last article does not touch on it as much, I want to stress that no matter what route a library takes in deploying a “digital library,” careful planning regarding how that library’s bibliographic assets are going to be coded needs to be considered.  MARC records are available from vendors like OCLC and can be edited to suit cataloging purposes; the challenge with MARC is that the metadata is housed in a “silo” and partitioned off from the rest of the web where other valuable information might be found.  Dublin Core is also an option.   An exciting concept also under development is the Library of Congress’s BIBFRAME initiative, which leverages linked data concepts and expresses data in RDF “triples” so that information can be purposed and repurposed.  This will help users discover and retrieve library and web data in new ways that improve relevance.
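To make the “triples” idea concrete, here is a minimal sketch in Python.  It assumes nothing about the real BIBFRAME vocabulary: the identifiers and property names below (work:moby-dick, creator, instanceOf) are made up for illustration.  The point is only that each statement is a subject-predicate-object triple, and that the same small statements can be recombined to answer different questions.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers and property names, for illustration only;
# these are NOT actual BIBFRAME vocabulary terms.
TRIPLES = [
    Triple("work:moby-dick", "title", "Moby-Dick"),
    Triple("work:moby-dick", "creator", "person:melville"),
    Triple("person:melville", "name", "Herman Melville"),
    Triple("instance:1851-edition", "instanceOf", "work:moby-dick"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def creator_names(store, work):
    """Follow work -> creator -> name, reusing the same triples for a new question."""
    names = []
    for link in match(store, subject=work, predicate="creator"):
        for person in match(store, subject=link.obj, predicate="name"):
            names.append(person.obj)
    return names


print(creator_names(TRIPLES, "work:moby-dick"))
print(match(TRIPLES, obj="work:moby-dick"))
```

The second query asks a different question of the same data (what points at this work?), which is the “purposed and repurposed” quality in miniature.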

The “Good” – Cottrell, M.  (2013). Paperless Libraries. American Libraries.  Retrieved July 7, 2015 from:

Some libraries have made the move to a completely digital environment and are serving their patrons in ways that are not possible in print or other media. The first article I have chosen offers some real-life examples of libraries that have migrated to digital, and the benefits to their patron populations.  In “Paperless Libraries,” featured in American Libraries magazine (2013), Megan Cottrell describes the bookless facility opened in Bexar County, TX called “Bibliotech,” as well as the all-digital Applied Engineering and Technology Library at the University of Texas at San Antonio (UTSA), and the services and benefits that they provide to their respective audiences.

The Bibliotech library leverages the 3M Cloud Library, one of the more prominent eLending solutions libraries can choose to implement for digital services, bringing more than 10,000 titles to patrons for digital downloading.  The physical library itself hosts workstations, meeting spaces, and a coffee shop, but no books.

The advantages and benefits are many, according to the head librarian.  They include many of those summarized in the material assigned in the Chu textbook this week:

  • Cost savings for materials and physical space: The efficiencies found in these areas enabled the library to deliver a brand new library service to an area where there previously was none
  • Easy customization of library “holdings”: The licensed service allows the library to quickly acquire books based on patron demand
  • Budget predictability
  • Delivery of technology education to a traditionally underserved population area

Similarly, the UTSA facility allowed its library staff to use its funding efficiently to provide information services while providing workspace for students to study and meet.  Had a physical collection been required, the library knew it could not have served all of those needs.  Its librarian stated that because the materials were technical in nature and the audience was naturally tech savvy, the learning curve was not steep; many of the materials were already online anyway.

In both instances, the librarians felt that going digital enabled them to maximize the value received from their spend while continuing to provide the information services their patrons need – plus all of the other community services patrons are demanding from their libraries.

The “Not Necessarily Bad,” but Important Considerations – Breeding, M. (2012). Coping with complex collections: managing print and digital: the increasing richness and corresponding complexity of library collections are a reality that isn’t likely to abate, nor would we want it to. Computers in Libraries, (7). 23.

This article echoes some of the socio-economic and technical deployment considerations implicated in management of a digital library collection discussed in the Baeza-Yates and Ribeiro-Neto readings this week.

Acquisition Planning:  The author identifies two important considerations in this discussion, specifically budget planning and legal issues.  In addition to procuring physical materials through standing orders and selective purchases, curators of digital collections must also navigate demand-driven acquisitions (where a certain number of “clicks” on a title triggers a “purchase”), bundled ejournal subscription packages, up-front charges for open access resources, and other business models.  These purchase choices, together with investigation and curation of truly no-cost materials and maintenance of materials created in house, are essential to ensuring that library patrons receive access to the most high-quality information resources possible for the library budget spent.  Additionally, once digital materials join a library collection, the library must decide on policies regarding ownership versus licensing, how those policies are enforced, and how much of the collection should be permanently “owned.”

Patrons’ Openness to Digital Adoption:  The author acknowledges that patrons’ comfort with digital access must be considered, as practical constraints limit libraries’ ability to provide the same resources in multiple media to suit individual tastes.

Searching Multiple Media for Easier Resource Discovery – ILS Integration:  The author also points out that when a library sources multiple media types from multiple providers, patrons need a single access point to the library’s holdings in order to easily discover and retrieve relevant resources.  As such, the library must consider selecting an Integrated Library System that supports connections to disparate digital resources, in addition to print, through metadata such as MARC records.  For example, the popular platform OverDrive, used in over 22,000 libraries worldwide, supports integration with a myriad of ILS/Library Management Systems.  Depending upon the ILS provider and available APIs, integration can be supported such that not only search and display but also checkouts, holds, etc. can be performed from within the ILS interface, creating a unified patron view regardless of the resources integrated on the back end.

Additional Considerations in a Specialized Library:  O’Grady, J. (2015, January 5). 12 Building Blocks Of A Digital Law Library – Law360. Retrieved July 7, 2015, from

In this final article, Jean O’Grady, a popular blogger in the legal information space, discusses the trend for large law firms – traditional users of printed materials in the most traditional types of libraries – to adopt digital libraries.  O’Grady focuses on digital libraries comprised of electronic versions of print legal reference books, as opposed to the popular legal research databases or other media a legal library might curate for reference purposes.  I found this article interesting, even though it does not touch on search and retrieval, because it demonstrates the stance of a conservative profession towards digital adoption.  In full disclosure, this is also the business I work in professionally at this time.

O’Grady notes that while younger attorneys have a preference for electronic access, print might need to be retained for a certain period of time to serve partners whose preference is print.  Often, though, it is the need to reduce valuable office space, particularly in larger cities where real estate comes at a premium, that starts a firm on a quest to move to digital, and in these instances even traditional print researchers can be convinced to go digital.  Some libraries, for example, have converted almost entirely to digital, including Kaye Scholer, one of the AmLaw 50 firms.

O’Grady then notes twelve issues that must be considered in moving to digital in a law firm environment.  I will highlight a few of them as relevant to our studies/profession:

  • Licensing: Similar to the copyright/ownership issues raised in the Breeding article above and mentioned in the Baeza-Yates and Ribeiro-Neto readings, O’Grady points out that licensing agreements for eBooks should be carefully reviewed and considered so as to ensure the firm is comfortable with vendor policies.  O’Grady singles out newsletter licensing in particular, presumably because a resetting of expectations regarding permissible practices may be needed here.
  • Authentication: ID and Password management to ensure attorneys can always access their digital materials when and where they need them is important.  Digital library systems can offer integration with network authentication protocols, but the firm’s IT professionals will need to be involved to ensure compliance and functionality.
  • Return on investment: While implementation of a digital library can require significant technical investment, there are opportunities for cost savings including reduction in filing services, reduction in lost updates, time saved in check-in/check-out/routing, as well as improved analytics for learning what materials are used/unused so as to repurpose spending on less used resources in favor of materials that will be more highly valued.  (Again, in full disclosure, I’ve had librarians tell me that previously the only way they knew whether a book was being used or not was through tricks like cancelling publications and seeing whether anyone noticed, putting pennies on top of the books and seeing whether they’d fallen off, etc.)
  • Skills of a strategic information professional: O’Grady emphasizes that only a true information professional with legal research/practice skills as well as library/information management experience can truly assess how a digital library might integrate into a firm’s workflow and improve its attorneys’ ability to search and retrieve the research resources needed to support their practices, and work with the right technical sources within the firm to deliver those resources to the firm via desktop/mobile access.

Module # 9 Documents: languages and properties

Representation of documents through uncontrolled vocabularies is a topic that is of great interest to me, and is the subject I will be focusing on for my major paper in this class.    For the purposes of this assignment, I will be focusing in on the subject of hashtags for information representation, and reviewing three articles not covered in my paper – one academic, and two from general trade publications.  The first article provides an overview of hashtags, their origin, benefits and drawbacks; the second, an academic look at global use of hashtags on Twitter as compared to other methods of information organization, and the third, a very different angle – best practices for using hashtags to achieve broad reach in social media marketing campaigns.

The first article I read, “Twitter Tips: How And Why To Use Hashtags (#)” by Lynch in CIO Magazine, provided an overview that I have been looking for regarding the origins of hashtag use.  It referenced a community for early Twitter adopters that proposed creating “channels” for organizing topic discussions, and in that discussion, the pound symbol was offered as a way to denote a user-contributed vocabulary category.  Since its first use in 2007, the hashtag (or some form of tagging) has become available across social media platforms including Instagram, Pinterest, Facebook, Tumblr, Google+ and other user communities.

The author points out selective deficiencies or cautions in using hashtags for information dissemination and retrieval.  First, as we know, hashtags are largely an uncontrolled vocabulary – there are no strict guidelines, hierarchies that must be followed, etc.  What I learned from this article is that there are (were?) some sites/communities that have attempted to provide some controls through “dictionaries” or hashtag definitions; or registries that provide usage instructions.  However, no attempt was made to reconcile these sites/communities.  Note that I said “were (?)” – as the sites the author referenced are no longer in use! (Tagalus, HashtagNation).

Second, hashtags can be overused, or misused, creating extra noise in search results.  The author offers that a consequence of overuse/misuse on Twitter is that newcomers to the community feel like exactly that – newcomers.  This, the author believes, makes recent participants feel less welcome and more likely to abandon Twitter as an information resource.  The impact on relevance is not discussed by the author, but one can surmise that if salient information is available on Twitter and a user does not feel welcome/comfortable with using the site, that information would be missed.

The second article I read, “How People use Twitter in Different Languages,” by Weerkamp, Carter and Tsagkias, discussed how users organize information on Twitter in different languages, including through hashtags, links, mentions, and conversations, and the variance in usage of each tool from language to language.  Eight languages used on Twitter were analyzed.

This article really made me “think.”  Before reading it, I admit I assumed that methods of organizing information on Twitter would be similar from language to language, culture to culture.  However, in analyzing hashtag usage, for example, the authors discovered that while the average number of hashtags used in each language analyzed ranged between one and two, German language users are far more likely to use hashtags to organize their tweets (one in four tweets) than any other language examined.  Conversely, Japanese language users rarely use hashtags, with only one in 25 Japanese tweets containing hashtags.

To demonstrate the nuances uncovered, I will offer the findings regarding use of “conversations” in comparison.  The authors categorized “conversations” as direct responses to another tweet.  Here, German and Japanese language use were reversed: one in four Japanese language tweets analyzed was part of a conversation, while 14% of German language tweets could be categorized this way, the second lowest percentage of the eight languages analyzed.

While the authors did not discuss the impact on information retrieval or relevance, the findings emphasize what we have learned about controlled vocabularies like LCSH – that cultural nuances must be acknowledged and accommodated for users to be able to retrieve all potentially relevant information.  From this article we learn, for example, that a user researching in the Japanese language would likely miss relevant information if over-reliant on hashtags as a vocabulary for discovery.

The final article, “16 Tips for Using #hashtags HINT: You’re doing it wrong 😛 #socialmedia,” written by Laurel Papworth, a noted social media consultant, is targeted to social media marketers who are looking to exploit hashtags as a means of achieving wider information distribution.

The intended audience’s likely objective in reading this article is to learn how to improve social media marketing ROI.  As such, this article is written from the angle of “how to best achieve the broadest reach” when using hashtags.  However, it does highlight certain interesting and key qualities of hashtags and their use that can impact precision/recall for information seekers, and thus relevance.  I will highlight a few of the points here.

First of all, the author points out that hashtag usage is so prevalent in web content creation at this point that creating content and not tagging it severely hampers one’s reach.  Untagged content may be consumed only by those to whom it is directly distributed – there is a diminished chance of discoverability, as the information seeker must rely on search engine indexing or some other, less direct classification scheme.

Also, very specific hashtags are transitory – the author states they rarely “make it into a second or third week.”  If an information seeker is looking for all discussions on a topic via hashtags, it is likely that relevant information will be omitted from results as a tag that was contemporary at one point may no longer be.

Conversely, use of general hashtags can inhibit precision, as a multitude of users who choose the same hashtag can bury relevant information for a researcher.  The author offers the tag #AusPol as an example, stating it “has more than 50 tweets per minute, non stop, day and night.”

The final point that I will highlight here is that the author advises careful selection of hashtags to avoid generic words.  As example, the author states “#blacklist” can mean many things, but “#Blacklist” is an attempt to reference a particular TV show.  From an information retrieval standpoint, this means a user’s search could retrieve a broad range of very irrelevant content if the vocabulary chosen to categorize relevant information is too vague.
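The precision and recall effects described in the last few paragraphs can be put into numbers.  The sketch below is my own illustration, not something from the article: the tweet IDs are made up, and I am assuming that a hashtag search simply returns every tweet carrying the tag.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: a generic tag pulls in ten tweets (IDs 1-10), but only
# two of them are on topic, and two other on-topic tweets (11, 12) used a
# short-lived, more specific tag and are never retrieved.
retrieved_by_generic_tag = range(1, 11)
truly_relevant = {2, 5, 11, 12}

precision, recall = precision_recall(retrieved_by_generic_tag, truly_relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up numbers the generic tag buries two relevant tweets among eight irrelevant ones (low precision), while the transitory tag hides half of the relevant tweets altogether (reduced recall).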

Module # 7 User Retrieval Evaluation

In this module, we were asked to locate and review three articles that measure and discuss user retrieval evaluation. The article topics could discuss the subject of the user in search engines, library catalogs or even social media. We were also asked to discuss if the author addresses the subject of relevance, and to what extent.

In looking for articles for this module, I discovered three articles on usability tests of search, retrieval and display in prominent eCommerce websites from the Baymard Institute.  Together they address user search objectives, eCommerce query characteristics, search engine performance for top eCommerce sites, and UI strengths/weaknesses for these sites as found through their user observations.

Deconstructing eCommerce Search:  The Twelve Query Types:

In this article, the Baymard Institute assessed 19 major eCommerce sites for support for query types that they consider to be essential for transactional inquiries.  To do this, a usability test was conducted with subjects who articulated their search goals, search process, and subjective assessments of the results they received.  Through this test, the Institute identified twelve query types and rated sites’ performance in delivering results that met (or didn’t meet) user expectations.  These query types were divided into four groups, and the Institute noted that users’ searches frequently included combinations across groups.  While I will not repeat all of their findings on all of the query types, I will summarize some of their more interesting (to me) key findings, which came from the first functional group, which they called “Query Spectrum” searches and which defines the domain of products/information of interest:

  • Support for phonetic misspellings and alternate names/titles is deficient: In eCommerce, users are often searching for an exact manufacturer or product number.   In fact, they often copy/paste from other sites into the one they are searching.  Not only were localized variations ignored (AT&T Wireless vs Cingular Wireless) but even exact match searching was an issue in 18% of the test searches.    These search engine deficiencies greatly impacted results relevance.  The Institute suggested partnering with industry databases to improve Exact Keyword search results.
  • Product category searches require a robust database of synonyms, e.g. “hair dryer” vs “blow dryer,” to retrieve results that are perceived as relevant. Additionally, from a UI perspective, once the broad category results are retrieved (recall), filters are necessary for the user to effectively explore.
  • Symptom based searches: These are often a last resort for the user and are a way for the user to find a product-based solution to a problem.  The examples given were “stained rug” or “dry cough.”  These are difficult to support, and the Institute suggests that Help references and links be offered to aid the user in identifying what type of product would be of assistance.  Presumably these help references would redirect the user to refine his/her search to an Exact Keyword or Product Category search.
  • Non-Product searches on eCommerce sites are generally responsive: The Institute placed searches for company information, return policies, shipping information and other ancillary queries related to a Product search into this category, and rated the assessed sites’ performance well.  They found that when searching on these types of terms, 86% of the sites returned the results users expected either as the top page, or one of the top pages.
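The synonym point in the product category bullet above can be sketched in a few lines.  This is a toy illustration under my own assumptions, not how any of the tested sites work: the synonym groups and product titles are invented, and matching is a plain substring test.

```python
# Hypothetical synonym groups; a real site would maintain a far larger table.
SYNONYM_GROUPS = [
    {"hair dryer", "blow dryer", "hairdryer"},
    {"couch", "sofa"},
]

# Map every term to its whole group so lookup works from any synonym.
LOOKUP = {term: group for group in SYNONYM_GROUPS for term in group}


def expand(query):
    """Return the query plus any known synonyms (just the query if none)."""
    key = " ".join(query.lower().split())
    return LOOKUP.get(key, {key})


def search(query, catalog):
    """Return titles containing the query or any of its synonyms."""
    terms = expand(query)
    return [title for title in catalog
            if any(term in title.lower() for term in terms)]


CATALOG = ["Ionic Blow Dryer 1875W", "Travel Hair Dryer", "Stainless Tea Kettle"]

print(search("hair dryer", CATALOG))
print(search("tea kettle", CATALOG))
```

Without the expansion step, a search for “hair dryer” would miss the product titled “Blow Dryer,” which is exactly the kind of recall failure the Institute observed.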

eCommerce Sites Should Include Contextual Search Snippets (96% Get It Wrong)

In this next article, which was derived from the same study, the Institute discussed the value that what they called “search snippets,” and what we have called “key words in context,” bring to internet search engine results, and how eCommerce sites have failed to follow the search engines’ lead.  The Institute emphasized that these “snippets” perform the valuable task of connecting the user’s query to the retrieved results, explaining why they were delivered as responsive/relevant.
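As a rough illustration of what a “key words in context” snippet involves, here is a minimal sketch.  It is my own simplification, not the Institute’s or any site’s implementation: it finds the first query term that occurs in a product description, keeps a window of surrounding characters, and brackets the matched term.  The sample description is invented.

```python
import re


def snippet(text, query, width=30):
    """Return a short window of text around the first query term found,
    with the term bracketed, or None if no term occurs in the text."""
    for term in query.split():
        found = re.search(re.escape(term), text, flags=re.IGNORECASE)
        if found is None:
            continue
        start = max(0, found.start() - width)
        end = min(len(text), found.end() + width)
        window = re.sub(re.escape(term),
                        lambda m: "[" + m.group(0) + "]",
                        text[start:end],
                        flags=re.IGNORECASE)
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        return prefix + window + suffix
    return None


# Hypothetical product description.
description = "Lincoln (2012), directed by Steven Spielberg, stars Daniel Day-Lewis."
print(snippet(description, "spielberg", width=12))
```

Even a snippet this crude would have answered the test user’s question about why a given movie appeared for a director’s name.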

Specifically, the Institute stated that users want to know the following information about any search results in order to feel confident about their queries and the search engine that delivers the responses to them:

  • Why was this search result included?
  • How does it relate to my search query? How is it relevant to me?
  • How does it differ from the next result?

Surprisingly, it was not eCommerce giant Amazon that performed the best, but rather Walmart.  In a demonstration of Amazon’s weaknesses, the article recounted a user’s search for “Steven Spielberg” on the site, which delivered the movies Lincoln, Schindler’s List, and Empire of the Sun, among others.  The user did not know why they were retrieved and wondered aloud, “Did he direct these movies?”

In contrast, a screen shot was shown for results from Walmart’s site for the search “high quality tea kettle.”  It was noted that the context sensitive snippets in the results affirmed the user’s understanding of why the results retrieved were selected, and created confidence that the results delivered were based on the user’s search, not Walmart’s preconceptions of what the user would think is a high quality tea kettle.


This article closed with the recommendation that all eCommerce sites should explore showing users’ search terms in context within their search results to minimize site abandonment.

eCommerce Search Field Design and Its Implications

This final article is more focused on the aspects of user experience akin to those discussed in the Chu chapter that was assigned this week on Modes of User-System Interaction.  Specifically, this article discussed best practices for designing menu selection, graphical interaction, forms, hyperlinks, and other display dimensions that aid users in creating queries that will deliver more relevant results.

The Institute recommends thinking about the approaches users take to locating products on a specific eCommerce site, and adjusting two important design features accordingly.  Specifically, it is recommended that search fields be more prominently featured when users are more likely to know the specifics regarding what they want – a particular model, brand, etc.  Further, the Institute recommends leveraging placement, size, color contrast, and other design tools to “nudge” the user towards the search field if keyword searching is expected.

In contrast, if users are more likely to browse categories for what they want, the Institute advocates that a hierarchy navigation (like a left nav consisting of words that denote categories and subcategories) be used.  In analyzing eCommerce sites, the Institute suggested that this technique works best on apparel sites and home furnishing sites where the visual appeal of the products among those similarly categorized is beneficial.  The Urban Outfitters site is highlighted, as the hierarchy navigation is so prominent that it is difficult to find the “search” function.

Building a Database! My Partial Discography for the Rolling Stones

This assignment was a lot of fun.  I got to start to learn Microsoft Access as well as spend time thinking about my favorite band.  I created a database to index information on over 250 songs by the Rolling Stones.  I threw in a couple of solo albums and some additional data so that I could experiment with relationships between my data tables.

Some of the things I learned:

-I needed to modify the entry standard for album release date as Access does not include “month, year” as a standard.  So I learned to do that.  I thought about separating out month from year, in keeping with the principles outlined in the Knaupf article to break down data strings into their smallest meaningful parts but ultimately decided I did not need “month” to be able to move independently from “year.”

-I normalized the entry options for vocalists, as in various publications “Keith Richards” will be referred to as “Keith Richard” and “Ron Wood” will be referred to as “Ronnie Wood” and I did not want to lose data.
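I did this inside Access, but the idea is easy to show in a few lines of Python.  The sketch below is only an illustration: the variant table is hypothetical, and I am assuming “Keith Richards” and “Ron Wood” as the preferred forms.

```python
# Hypothetical variant table: every known spelling maps to one preferred form.
PREFERRED = {
    "keith richard": "Keith Richards",
    "keith richards": "Keith Richards",
    "ronnie wood": "Ron Wood",
    "ron wood": "Ron Wood",
}


def normalize_vocalist(name):
    """Return the preferred form of a name, ignoring case and extra spaces.
    Names not in the table are returned unchanged except for trimmed whitespace."""
    key = " ".join(name.lower().split())
    return PREFERRED.get(key, " ".join(name.split()))


entries = ["Keith Richard", "keith  richards", "Ronnie Wood", "Mick Jagger"]
print([normalize_vocalist(n) for n in entries])
```

Once every variant collapses to one form, a query on “Keith Richards” finds all of his vocals, which is the same job an authority file does in a library catalog.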

-I created a couple of relationships and queries based on them, leveraging data from multiple tables.

-I was able to review my database design and eliminate a couple of redundancies.  I left a couple as I am not completely comfortable with Microsoft Access and setting an index number that doesn’t mean anything to me as the primary key.

While this database does not even begin to leverage the power of Microsoft Access I have begun to see the possibilities of how I can leverage it in my professional work and will be taking time to learn more.  Well worth the effort.

Module #6: Queries

In this module, three questions were posed:

What is the difference between structural queries and query protocols? 

Baeza-Yates and Ribeiro-Neto define a “structural query” as one that leverages organizational elements like parts of, or divisions within, a document.  These types of queries allow the user to create more specific, directive queries that deliver more precise results, as retrieved documents must not only satisfy the query within basic text but also do so within the required constraints.  An example of a “structural query” would be a “fixed structure” that allows for use of certain fields.  For example, in Lexis Academic one can search the news databases in fields like headline, written by, or publication name.

A “structural query” is a type of query that an end user would enter.  A “query protocol,” in contrast, is one that is not meant to be used by humans.  In fact, Baeza-Yates and Ribeiro-Neto use the term “protocol” specifically, in contrast to “language,” to emphasize that point.  Z39.50, which is offered as an example, is a query protocol that NISO’s primer describes as “a common language that all Z39.50 systems can understand” and that “enables two computer systems on a network to communicate for the purpose of information retrieval.”  In other words, it provides a common way for two systems to talk to and understand each other, rather than an end user and a system.

What type of query do you use most often? 

The type of query I use most often depends upon the type of research I am doing.  If I am doing legal research and I have already carefully thought through my query and what types of documents might best satisfy it, I will employ Boolean search logic within a structured query.  For example, on Lexis Advance, if I were to search for caselaw in the state of Florida involving a party named “Smith” and heard by a judge named “Jones,” my query might look something like  name(smith) AND judge(jones) AND court(florida).  This would enable me to retrieve all of the documents that I want and definitely exclude those that do not meet my information need.
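To show what a fielded Boolean query is doing underneath, here is a toy sketch.  It is emphatically not how Lexis Advance works: the parser only understands field(term) clauses joined by AND, the case records are invented, and matching is a simple case-insensitive substring test within each named field.

```python
import re


def parse_and_query(query):
    """Split 'field(term) AND field(term)' into (field, term) pairs."""
    pairs = []
    for clause in re.split(r"\s+AND\s+", query.strip()):
        m = re.fullmatch(r"(\w+)\((.+)\)", clause.strip())
        if m is None:
            raise ValueError(f"cannot parse clause: {clause!r}")
        pairs.append((m.group(1).lower(), m.group(2).lower()))
    return pairs


def matches(record, pairs):
    """A record matches only if EVERY clause is satisfied in its own field."""
    return all(term in record.get(field, "").lower() for field, term in pairs)


# Hypothetical case records.
CASES = [
    {"name": "Smith v. Acme Corp.", "judge": "Jones", "court": "Florida Supreme Court"},
    {"name": "Smith v. Baker", "judge": "Lee", "court": "Florida Supreme Court"},
    {"name": "Jones v. Smithson", "judge": "Jones", "court": "Georgia Court of Appeals"},
]

pairs = parse_and_query("name(smith) AND judge(jones) AND court(florida)")
print([c["name"] for c in CASES if matches(c, pairs)])
```

The second record is excluded because the judge field fails, and the third because the court field fails, even though “Jones” and “Smith” both appear somewhere in it.  That is the precision a structured query buys over searching plain text.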

If I am searching for general information on Google though, and I’m not sure what document might satisfy my information need – or what my information need is to begin with – I might just start with a single term search.  Akin to “berry picking,” I will find myself following a hypertext link trail, collecting alternate terms, and meandering around the web rather than taking a methodical approach.  In fact, sometimes I will start my legal research this way, and once I am comfortable with the terminology, switch over to a legal database to perform a structured search.

Why are we still facing so many problems finding the right documents, images, and video when we search?

I believe the answer here is that researchers are not patient and not willing to take the time to think about the following things:

(1) What they are looking for,

(2) What types of resources would be the most appropriate places to find that information,

(3) Once they figure out those resources, how to construct a valid query within them, and with what terms, and

(4) How the system they are searching processed that query and retrieved the documents that it did, in the order that it did.

I think people too often just jump right in and start searching, which leads to “garbage in, garbage out.”

Follow Up Questions on Module 5

In follow up to my submission for Module 5, two questions have been posed:

(1) Do you think that in the future we will still use controlled vocabulary as it is used in the library to organize its catalog?

I believe that controlled vocabularies such as the LCSH, thesauri, etc. will still have their places in organizing library catalogs in the future, because of the advantages they will continue to impart, including connections to broader/narrower/similar topics, improved recall, etc.  However, I think the expense of learning and maintaining controlled vocabularies, their user un-friendliness, and their lack of currency are difficult to overcome when we, and libraries, are constantly inundated with new information that is difficult to assess.  What I think would be interesting is a way to augment controlled vocabularies with user-supplied tagging so as to add new ways for patrons to identify and access additional, relevant information.  The social aspect of patron-supplied tagging might have the added benefit of engaging the library’s patrons more fully, and of reminding them that libraries can provide added value for their research in comparison to the open web.

This is a topic, coincidentally, that I am contemplating exploring for my paper for this class – whether folksonomies can be effectively used in public libraries.  Some of the reading I’ve done already indicates that ILS such as the SirsiDynix Horizon Portal and a system from EOS International allow end user-specific customization such as personal lists, and I am interested in learning more.

Second question: your opinion on the Singhal (2001) “Modern Information Retrieval” article?

With the explanations of each IR model covered in this article already summarized in my prior post, I’ll focus my assessment of the article on the techniques and applications for evaluating search effectiveness.  What has struck me to this point is that our reading has focused heavily on “what documents should be relevant,” but the fact of the matter is that “what documents are relevant” is very much in the eye of the researcher.  So, for example, with tf-idf, the frequency of a term in a particular document, weighed against how rare that term is across the document set, should make that document more relevant to the researcher and thus place it higher in the answer set rankings, but at the end of the day that researcher’s opinion is what matters.  That is why the section on Query Modification resonated with me.  With relevance feedback, that subjective element is taken into consideration in the user’s next search.  I feel like this has been incorporated into Google’s search results since this article was published in 2001, because it seems that once I do one or two searches on Google and click on a few documents, a second or third search eliminates some of the documents that interpret my search terms in ways that are contrary to my intent.  It’s almost as if Google’s search engine has taken into account in my next search that “OK, she wants results that use the term ‘record’ as a noun meaning ‘the plastic album,’ rather than as a verb meaning to make a copy of something.”  This is a little scary, because when I am logged into my Gmail account, those user preferences are attributed to me as a named person, but user privacy is another subject for another day.
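To picture how relevance feedback could shift a query toward “record” the album, here is a minimal sketch of one classic formulation, Rocchio’s.  I am assuming that method for illustration; I have no knowledge of what Google actually does.  The term weights, the sample documents, and the alpha/beta/gamma values are all made up.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward documents judged relevant and away from those
    judged non-relevant.  Vectors are dicts of term -> weight; terms whose
    new weight is not positive are dropped."""
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    def mean(docs, term):
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * mean(relevant, term)
                  - gamma * mean(nonrelevant, term))
        if weight > 0:
            new_query[term] = weight
    return new_query


# Hypothetical feedback: the user clicked a result about vinyl albums and
# skipped one about recording video.
original = {"record": 1.0}
clicked = [{"record": 1.0, "vinyl": 1.0, "album": 1.0}]
skipped = [{"record": 1.0, "video": 1.0, "copy": 1.0}]

print(sorted(rocchio(original, clicked, skipped).items()))
```

The modified query now carries “vinyl” and “album” alongside “record,” so the next search leans toward the noun sense without the user having to type anything new.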

Module 5: The Role of Language in IR

Question #1:  Summarize the three sources discussed in this module:

“Modern Information Retrieval: A Brief Overview”

The core of “Modern Information Retrieval: A Brief Overview” revisits three key information retrieval models that allow for relevance ranking, as well as techniques for implementing the term weighting that is key to each of these models.  The three models, with brief descriptions, are:

  • The Vector Space Model: In this model, term weights are assigned to the index terms in queries and documents, and the degree of similarity between a query and a document is represented by the angle between the query vector and the document vector.  The smaller the angle, the more similar the two are, and this graded similarity allows for relevance ranking.  Another advantage of the Vector Space Model over Boolean is partial matching (in Boolean, a document either matches the query or it does not).
  • Probabilistic Model: This model estimates the probability that a document is relevant to a user query and ranks documents accordingly.  The estimate is made by comparison with what might be considered an ideal answer set.  All of our readings suggested that there are different ways, and thus different models, for creating that estimate.
  • The Inference Network Model: In this model, a document is given a score compiled from the individual scores of the weighted terms it contains, and documents are then ranked according to that combined score.
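
To make the Vector Space Model's angle idea concrete for myself, I put together a minimal sketch.  It is my own illustration rather than anything taken from the article: the function names and the three tiny "documents" are made up, and raw term counts stand in for real term weights.

```python
import math
from collections import Counter


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # an empty vector has no direction, so treat it as no match
    return dot / (q_norm * d_norm)


def rank(query, docs):
    """Rank documents by cosine similarity to the query, best match first."""
    # Assumption: raw term counts are used as the weights to keep this short.
    q = Counter(query.lower().split())
    scored = [
        (name, cosine_similarity(q, Counter(text.lower().split())))
        for name, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection
docs = {
    "d1": "vinyl record album",
    "d2": "record a video",
    "d3": "library catalog",
}
ranking = rank("vinyl record", docs)
for name, score in ranking:
    print(name, round(score, 3))
```

Here d2 shares only one of the two query terms, so it still gets a score and lands between d1 and d3.  A strict Boolean AND on "vinyl" and "record" would have dropped it entirely, which is the partial matching advantage described above.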

Tf-idf Weighting:  The article then reviews the concepts of term frequency and inverse document frequency as methods for implementing the term weighting that drives the above models.  Term frequency, simply put, is the number of times a term appears within a given document.  Using term frequency alone has drawbacks, two of them being that relevance can be skewed by the length of a document (a longer document is likely to contain more instances of a term) and by how common the term is throughout the document set.  To address the first, term frequency can be normalized by document length.

Inverse document frequency addresses the second by looking at how rare the term is throughout the document set.  The fewer documents a term appears in, the more heavily it is weighted for relevance scoring purposes.

Tf-idf, then, uses a combination of both concepts to score documents.
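
As a rough sketch of how the two pieces combine, the snippet below uses one common textbook variant: term frequency divided by document length, multiplied by log(N / document frequency).  The article discusses several weighting formulas, so this should be read as an assumption on my part rather than the article's own scheme, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns a dict mapping each document name to {term: tf-idf weight}.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        length = len(tokens)
        weights[name] = {
            # Length-normalized term frequency times inverse document frequency
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical three-document collection
docs = {
    "d1": "digital library digital collection".split(),
    "d2": "public library branch".split(),
    "d3": "library budget".split(),
}
weights = tf_idf(docs)
for term, weight in sorted(weights["d1"].items()):
    print(term, round(weight, 3))
```

In this toy collection "library" appears in every document, so its idf is log(3/3) = 0 and it contributes nothing to ranking, while "digital" is both frequent within d1 and absent elsewhere, so it receives the highest weight in d1.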

Chu’s Chapter 4, and Baeza-Yates/Ribeiro-Neto Chapter 6

With these IR models in mind, how are documents represented and retrieved?  Chu’s Chapter 4 discusses natural language and controlled vocabularies (thesauri, subject heading lists, and classification schemes).  Chu’s Chapter 4 also discusses, and Chapter 6 of Baeza-Yates and Ribeiro-Neto focuses on, document markup for digital content.

Natural Language:  Natural language is everyday language, without efforts to control vocabulary or syntax.  It consists of significant words and function words, also called “stop” or common words.  Stop words such as articles are not processed as index terms because they are not suitable for document representation purposes.

Controlled Vocabularies:  Thesauri, subject headings, and classification schemes are all artificial languages that provide additional controls and syntax.  Their labels and synonyms provide context and cross-referencing that natural language cannot.  Unlike natural language, all of these controlled vocabularies require ongoing updating and thus carry maintenance costs.

Metadata and Markup Languages:  These tools provide document representations for digital documents.  They all fall under the definition of “metadata,” which is “data about data.”

  • Document Formats: Text, Image and Video Formats – so, PDFs, JPGs, MPEGs and more
  • Markup languages like HTML and XML, which annotate text with tags that structure the data in a document.  In HTML, these tags also provide instructions for how to render a web document on a browser page.

Baeza-Yates/Ribeiro-Neto also discuss text operation processes that improve the results of a user’s search query, specifically lexical analysis, stemming, index term selection, and thesaurus construction.  Text compression techniques to reduce storage space and improve system performance were also reviewed.
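
The first few of these text operations can be sketched in a handful of lines.  This is only an illustration under my own assumptions: the stop list is a tiny made-up sample, and the suffix stripper is a deliberately naive stand-in for a real stemming algorithm such as Porter's, not an implementation of it.

```python
import re

# Assumption: a tiny illustrative stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "are", "for"}

# Assumption: crude suffix stripping, tried in this order.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


terms = index_terms("The library is indexing records and recording indexes.")
print(terms)
```

Stemming folds "records" and "recording" into the same index term, which improves recall but also erases the noun/verb distinction, so a search for a vinyl record and a search about recording something end up matching the same term.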

Question #2:  Discuss the language (Natural language, Controlled Vocabulary, Metadata and markup languages) that works best for academic institutions and government organizations. 

This is a difficult question to answer, as each language has its benefits and drawbacks, as well as situations where it applies best.  Focusing first on natural language vs controlled vocabulary: as our reading points out, natural language is easy to use, since terms and phrases can be extracted from queries that users pose in the form of everyday questions.  However, cross-referencing between synonyms can be neglected, and term ambiguity can be a problem.  As in my previous entry, when “rico” is entered, knowing whether to process it as a name (Rico), as part of the name of the Commonwealth of Puerto Rico, or as RICO, the racketeering statute, requires context.

In an academic environment particularly, natural language has advantages in that it is easy to learn – there is little “learning,” actually – and as more and more IR systems adopt natural language processing they become more appealing to the “Google generation.”  The “Google generation,” however, may miss potentially relevant documents in their queries because they do not know to account for the drawbacks of natural language.

While controlled vocabularies can provide context as well as cross referencing, they require upkeep and maintenance to remain relevant, and this means costs, both in terms of monetary upkeep as well as an investment of time to learn and use them confidently.

I do believe that the inexperienced searcher can benefit from controlled vocabularies – if they are willing to take the time to learn them.  Classification schemes, for example, can lead users to add value to their queries that they would not have otherwise, particularly if they are unfamiliar with the topic they are researching.  In addition to the drawback of investment of time, there is also the drawback of investment of upkeep, and I would think this would be of particular concern to academic institutions and government institutions, where budget is often a big concern.

Separately, metadata and markup languages are used in structuring digital data and describing or rendering it in a certain way.  In today’s age I don’t think any academic or government collection would be complete without digital data, so knowledge of these languages seems like table stakes.  Controlled vocabularies, in contrast, have their origins in describing and organizing books.