Module #4: IR Models

In this week’s assignment, we are asked to choose an IR model (the Boolean Model, Term Weighting, the Vector Model, or the Probabilistic Model) and:

(1) Provide a detailed explanation of it,
(2) Use that model in a search on Google and in the USF online library catalog, and

(3) Report where the model works best.

It may not come as a surprise given my post last week, but I am going to choose the Boolean search model. I love my Boolean search logic!

Boolean Search Logic: Explanation, Benefits, Drawbacks

In early retrieval systems, Boolean was often the only search model available. According to Baeza-Yates and Ribeiro-Neto, these earliest systems used three operators: AND, NOT, and OR. The biggest advantage of the Boolean system is its simplicity (to some users): a document either satisfies the query or it does not, period. For those who like Venn diagrams, relevance and non-relevance are also easily explained with these visual tools. Our textbook offers some examples, and I am including one here from the New York Public Library’s website: http://www.nypl.org/blog/2011/02/22/what-boolean-search, and another from the UC Berkeley Library website: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Boolean.pdf.

There are drawbacks to Boolean search logic, both in ease of search expression and in quality of results. Our textbook states that some users find it difficult to express a search query in Boolean logic. Based on my teaching experience, I would also agree with the conclusion that because some researchers struggle to create Boolean search expressions, they tend to fall back on very basic searches, many times only one term. Then, because classic Boolean has no ranking component or term weighting, it is easy to retrieve too few or too many results, with no tool or indicator to help decide which documents may be most relevant.
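The model’s simplicity is easy to see in code. Below is a minimal sketch (my own toy illustration, not any particular system’s implementation) of classic Boolean retrieval as set operations over an inverted index:

```python
# Toy corpus: document ID -> text.
docs = {
    1: "racketeer influenced and corrupt organizations act",
    2: "travel guide to puerto rico",
    3: "rico charges filed under the act",
}

# Build an inverted index: term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

# AND = intersection, OR = union, NOT = set difference.
print(postings("rico") & postings("act"))     # rico AND act -> {3}
print(postings("rico") | postings("puerto"))  # rico OR puerto -> {2, 3}
print(postings("rico") - postings("puerto"))  # rico NOT puerto -> {3}
```

AND maps to set intersection, OR to union, and NOT to set difference, which is also why Venn diagrams explain the model so well.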

By the time I learned Boolean in law school to search the commercial legal databases, enhancements had been added, including proximity connectors (W/n, /s, /p) as well as attempts at term weighting (the ATLEAST command, for example, where ATLEAST(n) would return documents that contained a term at least n times). However, for purposes of this exercise I will stick with the classic AND, NOT, OR.
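For illustration, an ATLEAST(n)-style filter could be sketched like this (the operator name comes from the legal databases; the code itself is my own hypothetical illustration, not their implementation):

```python
# Toy docs for testing an ATLEAST(n)-style term-frequency filter.
docs = {
    1: "rico rico rico act",
    2: "rico act",
}

def atleast(term, n, docs):
    """Return IDs of documents containing `term` at least n times."""
    return {doc_id for doc_id, text in docs.items()
            if text.split().count(term) >= n}

print(atleast("rico", 3, docs))  # -> {1}
print(atleast("rico", 1, docs))  # -> {1, 2}
```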

This will be interesting, as I can’t recall having tried to use Boolean search logic on Google. Per my last post, I usually search Google by casting a broad net, treating it more like a Magic Eight Ball: shaking it and seeing whatever pops out, without too much thought as to why.

Searching Google and the USF Library Catalog Using Boolean and Comparing Results

(1) Google

The first challenge I had was figuring out how to structure a Boolean query on Google. Google does not appear to offer a guide to typing in a Boolean query, or at least not one that I could find. I was able to find some guides that others have created, for example: http://www.slideshare.net/charlotteg/google-boolean-searching.

I am going to try several searches on RICO, the Racketeer Influenced and Corrupt Organizations Act, and compare them to searches on “rico.”  I’m going to do this because I want to see if my results contain information on the Racketeer Influenced and Corrupt Organizations Act and/or Puerto Rico.

Not surprisingly, Racketeer Influenced and Corrupt Organizations Act (no quotes) produces lots of results on the act, over 266,000. If I put quotes around it, the number of results is reduced to ~166,000. Google offers no explanation as to why, but I suspect the quotes eliminated documents that contained each of the words racketeer, influenced, corrupt, organizations, and act somewhere in the text, but not as an exact phrase.
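My guess about the quotes can be illustrated in code: an unquoted query plausibly requires all terms to appear somewhere in the document, while a quoted query requires the exact phrase. (This is an assumption about Google’s behavior, sketched on a toy corpus.)

```python
# Toy corpus: doc 2 contains every query word, but not the phrase.
docs = {
    1: "the racketeer influenced and corrupt organizations act",
    2: "a corrupt act influenced the racketeer and organizations",
}

query = "racketeer influenced and corrupt organizations act"

def matches_all_terms(text, query):
    """Unquoted-style match: every query term appears somewhere."""
    return all(term in text.split() for term in query.split())

def matches_phrase(text, query):
    """Quoted-style match: the exact phrase appears."""
    return query in text

print([d for d, t in docs.items() if matches_all_terms(t, query)])  # [1, 2]
print([d for d, t in docs.items() if matches_phrase(t, query)])     # [1]
```

Under this model, quoting the act’s name drops documents like doc 2, which would explain the shrinking result count.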

If I search rico (no quotes), I get over 798,000,000 results. Surprisingly, many of the top-ranked (how are they ranked?) results are on the statute, but there are some top results on a smart-home security device called Rico that are not on point.

I next entered rico -puerto and got 208,000,000 results, so clearly I’ve eliminated results that would include discussions of Puerto Rico.

I’m not sure how I would return results guaranteed to include both Puerto Rico and RICO. I tried “puerto rico” AND (rico -puerto). With the search structured that way I did get 5 results that contained rico separate and apart from Puerto Rico, but they were not on the RICO statute. Conversely, when I searched “puerto rico” AND “Racketeer Influenced and Corrupt Organizations Act” I got many documents that contained Puerto Rico and RICO, see e.g., http://en.wikipedia.org/wiki/Racketeer_Influenced_and_Corrupt_Organizations_Act . So I’m not sure why my first attempt to retrieve results that included Puerto Rico and RICO did not work.

(2) The USF Library Catalog

I used the Advanced Search form and jumped right in, searching rico NOT puerto.  I retrieved over 1.6M results, and while a few of the first results discussed the statute, most discussed people named “Rico,” first or last name.

A search of the statute name produced 48,517 results.

Finally, I searched “puerto rico” AND (rico NOT puerto) and got zero results.  Not sure whether that was because my search logic was off, or because there truly aren’t documents that satisfy that query in the USF catalog.

(3) Where Does the Boolean Model Work Best?

If I had to choose which database delivered the results that looked most on point, it would be the USF catalog.  However, I’m guessing this is not so much due to the search syntax, but more due to the content of the database itself.  I also would have had the option on the USF catalog of using the limiters in the left navigation bar to further refine my results after a first search, unlike Google.

Mostly though, in both environments, analyzing my search queries exposed the deficiencies of Boolean discussed in our readings — the results were either too broad or too narrow, and I felt as though I was somehow “missing something” through faulty search logic in my last query in each database, where I tried to come up with discussions of Puerto Rico and RICO.

Module #3: User Interface Search – Research Process and Interface Preferences

This week’s homework asks us to consider whether we perceive ourselves as dynamic vs. classic searchers. I realize that my response that I am probably a “classic” searcher makes me a bit of an outlier, but I can explain!

In my experience as a law student, and then in one of my jobs teaching law students how to perform online legal research, it was very important to me (and to the students I taught) to carefully consider the first two parts of the classic notion of the information seeking process: (1) identify the issues we want to research, and (2) determine what types of sources we think would best provide the answers we are looking for. In fact, as students, before we signed on to any of the legal databases we had access to, we were told to script out on paper both (1) and (2), and then (3) our query, as at that time the only IR model available to us was Boolean search logic. As our reading points out, Boolean can be difficult to learn, and if your query does not accurately capture the information you want, you may miss relevant information. (I used to tell my students, “If you are going to add AND NOT or NOT W/n to your query, add it at the very end.” So many would add AND NOT (term) AND (term) and then wonder why their results got smaller.)

I like the classic approach even if the challenges of Boolean search logic are set aside. I think forcing us through the classic search model made us better researchers and better critical thinkers. The drawbacks of “diving into” legal research are many: without carefully considering the issues you want to research, you could ultimately focus on an argument that is not on point. Time is also at a premium for a practitioner – not everyone has the luxury of “berry picking.” And when I learned how to do legal research, there were charges associated with each search conducted in the legal databases, so you wanted to be very careful about how many searches you ran!

All of that said, I do find myself doing a “first search” on Google to get the lay of the land and identify potentially relevant terminology before going to the legal databases now.  Usually these are one or two-term searches, very much “orienteering.”  So maybe my approach is evolving.

The second question we are asked to consider is what our interface preferences are. Given my background, it won’t come as a surprise that I like systems that allow for Boolean IR. Open fields for typing in searches together with a field chooser are always nice – the Advanced Search option on the USF Library site is a good example. I really like databases that incorporate query visualization, or at least a list of which documents contain what terms: “X number of documents contain (term) but not (term),” for example. I think that goes back to my preference for Boolean, where a zero-response answer set is in fact an answer. (If the term “summary judgment” does not appear in a database containing all of the caselaw in a particular jurisdiction, then you know for a fact that “summary judgment” has never been considered in that jurisdiction. Unless, of course, whoever wrote the opinion spelled “summary judgment” as “summary judgement,” but that is a drawback of Boolean!)

Bells and whistles like point and click maps are nice — many retailers offer clickable maps on the “Find a Store” sections of their sites.  That said, if there is an option (and there usually is) to input a City, State or Zip Code I will choose that option rather than clicking.

I would like to add here an example of an interface I do not like, and why. I do not like Delta’s website, even though I use it quite frequently, as I regularly travel for business. I think there are a lot of bells and whistles that are supposed to add visual appeal but seem to slow down one’s search. Here is what annoys me most of all, though: when I search for a flight, my default results list is sorted by “Best Match.” “Best Match” according to whom? What criteria?

[Screenshot: Delta flight search results sorted by “Best Match”]

To circle back to the assigned reading, I do believe my preferences are shaped by familiarity – despite my frustration with Delta.  The assigned slides highlight that “often the preferred choice is the familiar one,” and this has been my personal experience as well as feedback I’ve received from my students.  I’ve asked them about their database preferences, and often they will say they like one service versus another simply because they learned the first one, first.

The assignment also asked us to add what we have learned about information retrieval and representation. The answer is “a lot.” I was particularly interested in the different approaches to design iteration, particularly the discussion of “longitudinal studies.” In my various jobs I have commissioned and conducted website usability studies; I’m not a market researcher, but I have made decisions based on the results of usability testing. I’ve often wondered whether those results were biased or forced by the controlled, one-hour, one-on-one interview style, and I find the idea of testing a new interface on end users without their knowing they are part of a test very interesting.

Information Representation in Libraries and Social Media

This week’s lecture focused on information representations via controlled vocabularies, including indexing, categorization, summarizations, and citations.

Where does this take place in libraries and social media?  In libraries, the catalog is the primary place where information representation takes place.  Dewey Decimal Classifications, Library of Congress Subject Headings, LC Call Numbers, abstracts, MARC Records, and other bibliographic formats are all examples of information representation.

In social media, citations are a common form of information representation. Bitly links are shortened forms of hypertext links, and both are forms of indexing, as they identify resources on the web. On Twitter and in other social media, the pound sign (#) is an important form of indexing, as it turns a word or group of words into a searchable link. Hashtags are supported not only on Twitter and Facebook but also on Instagram, Google+, and more. Hashtags organize users’ content, characterize their thoughts, and enable those thoughts (information) to be found more easily.
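The indexing role of hashtags can be sketched in a few lines: extract the tags from each post and invert them into a tag-to-posts index (a toy illustration with made-up posts, not any platform’s actual implementation):

```python
import re

# Made-up posts for illustration.
posts = {
    1: "Studying for my #IR exam! #boolean",
    2: "I love my #boolean search logic",
}

# Invert: hashtag -> set of post IDs that used it.
tag_index = {}
for post_id, text in posts.items():
    for tag in re.findall(r"#(\w+)", text):
        tag_index.setdefault(tag.lower(), set()).add(post_id)

print(tag_index["boolean"])  # -> {1, 2}
print(tag_index["ir"])       # -> {1}
```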

For purposes of this assignment, I’d like to focus on hashtags and the challenges of using them as information representations vs. what you might find in a library catalog. While hashtags allow social media users to represent or summarize their content in their own vocabularies, the lack of vocabulary control, hierarchy, quality assurance, and other issues makes them difficult to use as a reliable or accurate information representation mechanism. Misspellings are also common. Hashtags have also taken on a dual purpose of conveying humor or irony in some circumstances, becoming less about a succinct representation of a longer thought and more about relaying extra layers of meaning. All of these issues can lead to over-inclusion of retrieved information that is less relevant, or unintentional exclusion of potentially relevant information.

In contrast, the controlled vocabularies used in library catalogs increase the likelihood of a better match between information desired and information retrieved. Authority controls can be used to harmonize different names for the same subject, increasing a researcher’s efficiency. They can also provide an organized structure or hierarchy among information resources.
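A minimal sketch of authority control is a mapping from name variants to one authorized heading (the heading below follows the Library of Congress style for Mark Twain, but the mapping itself is just my illustration):

```python
# Hypothetical authority file: variant name -> authorized heading.
authority = {
    "mark twain": "Twain, Mark, 1835-1910",
    "samuel clemens": "Twain, Mark, 1835-1910",
    "samuel langhorne clemens": "Twain, Mark, 1835-1910",
}

def normalize(name):
    """Map a name variant to its authorized heading, if known."""
    return authority.get(name.lower(), name)

print(normalize("Samuel Clemens"))  # -> Twain, Mark, 1835-1910
```

Every variant retrieves the same set of records, which is exactly the efficiency gain authority control is meant to provide.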

Beginning “Pioneers of IR” Paper Research

I decided to get a head start on the “Pioneers of IR” paper assignment this weekend, and chose Karen Sparck Jones as my subject.  In reading our textbook about her work, I realized that I had prior experience with the concepts of term weighting and inverse document frequency as part of a project I worked on at my job many years ago, and decided it would be interesting to learn the theory behind it.
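As a preview of the theory, Sparck Jones’s inverse document frequency weights a term by its rarity across the collection. One common formulation (there are several variants) is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A toy sketch:

```python
import math

# Toy collection of three "documents".
docs = [
    "information retrieval systems",
    "boolean retrieval model",
    "term weighting in information retrieval",
]

N = len(docs)

def df(term):
    """Document frequency: number of docs containing the term."""
    return sum(1 for d in docs if term in d.split())

def idf(term):
    """Inverse document frequency: rare terms get high weight."""
    return math.log(N / df(term))

def tf_idf(term, doc):
    """Weight of a term in one document: raw count times idf."""
    return doc.split().count(term) * idf(term)

print(idf("retrieval"))  # in all 3 docs -> log(1) = 0.0
print(idf("boolean"))    # in 1 of 3 docs -> log(3)
```

A term that appears everywhere (like “retrieval” here) carries no discriminating weight, while a rare term scores high, which is the insight behind her term-weighting work.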

I went to the USF Library home page and for starters did a broad keyword search on Karen Sparck Jones with no limitations, no quotes, and retrieved 949 items.  I noticed that the list was retrieved in “relevance” order as a default and wondered what constituted “relevance,” but as 949 items was too many to think about anyway, I decided first to use quotes (708 items, still too many) and then no quotes, but the “Subject” field limiter (13 items – too few).

I have noticed that I am questioning “Why did I get what I got?” as part of this research project but also in everyday internet searching much more, and am more uneasy about what I am probably missing than I may have been previously.  Great. I thought I had good research skills, but am starting to question that.

Information Retrieval

According to Baeza-Yates and Ribeiro-Neto, information retrieval is defined as “providing the users with easy access to information of their interest,” and more specifically “deals with the representation, storage, organization, and access to information items.”  To examine this concept more fully, it’s important to carefully consider each term within the phrase “information retrieval.”

What is information?

Both the Baeza-Yates/Ribeiro-Neto and Chu textbooks talk about the concept of “information,” but I prefer to think about it as discussed in a textbook we are using in another course, Taylor, A. G., & Joudrey, D. N. (2004). The organization of information (p. 3). Westport, CT: Libraries Unlimited. In it, the term “information” is examined along the continuum, “data, information, knowledge, understanding, wisdom,” and the authors relay that “information” has a value-add component to it that “data” does not – a meaning or context has been added to the material, and in order for that to happen an organization process has to take place, transforming “data” into “information.”

What is retrieval?

Chu further considers “retrieval,” distinguishing “information access” from “information seeking” or “information searching,” stating that “access” focuses on the action of “getting,” while “seeking” centers on the user’s pursuit of information and “searching” is focused on the process of that pursuit.

The concept of “matching” as an aspect of “retrieval,” as discussed in the Chu textbook, also resonated with me. In order for an item of information to “match” a user’s query, a number of steps must precede its delivery, including representation, indexing, and ranking. Each of these is a subject for further explanation in itself!
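Those three steps can be sketched end to end on a toy collection (my own illustration of the pipeline, not any real system):

```python
# Sketch of the representation -> indexing -> ranking pipeline.
docs = {1: "boolean retrieval model", 2: "probabilistic retrieval"}

# 1. Representation: reduce each document to a bag of terms.
rep = {d: set(t.lower() for t in text.split()) for d, text in docs.items()}

# 2. Indexing: invert the representation (term -> documents).
index = {}
for d, terms in rep.items():
    for t in terms:
        index.setdefault(t, set()).add(d)

# 3. Ranking: score documents by query-term overlap.
def rank(query):
    scores = {}
    for t in query.lower().split():
        for d in index.get(t, set()):
            scores[d] = scores.get(d, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(rank("boolean retrieval"))  # doc 1 matches both terms -> [1, 2]
```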