In this week’s assignment, we are asked to choose an IR model (the Boolean Model, Term Weighting, the Vector Model, or the Probabilistic Model) and:
(1) Provide a detailed explanation of it, and
(2) Use that model in a search on Google and in the USF online library catalog, and
(3) Report where the model works best.
It may not come as a surprise given my post last week, but I am going to choose the Boolean search model. I love my Boolean search logic!
Boolean Search Logic: Explanation, Benefits, Drawbacks
In early retrieval systems, Boolean was often the only search model available. According to Baeza-Yates and Ribeiro-Neto, these earliest systems used three operators: AND, NOT, and OR. The biggest advantage of the Boolean system is its simplicity (to some users): a document either satisfies the query or it does not, period. For those who like Venn diagrams, relevance and non-relevance are also easily explained with these visual tools. Our textbook offers some examples and I am including one here from the New York Public Library’s website: http://www.nypl.org/blog/2011/02/22/what-boolean-search , and another from the UC Berkeley Library website: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Boolean.pdf .
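For readers who prefer code to Venn diagrams, the three classic operators map directly onto set operations. The sketch below is only an illustration: the three tiny documents, the index, and the postings helper are made up for this example and do not come from the textbook or from any real search system.

```python
# Toy collection: document id -> text (invented for illustration).
docs = {
    1: "rico statute racketeer organizations",
    2: "puerto rico travel guide",
    3: "racketeer trial in puerto rico",
}

# Inverted index: term -> set of ids of the documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def postings(term):
    """Return the set of document ids containing the term (empty if none)."""
    return index.get(term, set())


# AND is set intersection, OR is set union, NOT is set difference.
rico_and_racketeer = postings("rico") & postings("racketeer")
rico_or_racketeer = postings("rico") | postings("racketeer")
rico_not_puerto = postings("rico") - postings("puerto")

print(sorted(rico_and_racketeer))
print(sorted(rico_or_racketeer))
print(sorted(rico_not_puerto))
```

Every document is either in the result set or out of it, which is exactly the "satisfies the query or it does not" behavior described above. Nothing in the result says which match is better than another.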
There are drawbacks to Boolean search logic, both in ease of search expression and in quality of results. Our textbook states that some users find it difficult to express a search query in Boolean logic. Based on my teaching experience, I would also agree with the conclusion that researchers who struggle to create queries using Boolean search expressions tend to fall back on very basic searches, many times with only one term. And because classic Boolean has no ranking component or term weighting, it’s easy to retrieve too few or too many results, with no tool or indicator to help decide which documents may be most relevant.
By the time I learned Boolean in law school to search the commercial legal databases, enhancements including proximity connectors (W/n, /s, /p) had been added, as well as attempts at term weighting (the ATLEAST command, for example, which would return documents that contained a term at least n times). However, for purposes of this exercise I will stick with the classic AND, NOT, OR.
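To make those two enhancements concrete, here is a rough sketch of what a proximity connector like W/n and a frequency threshold like ATLEAST might look like in code. This reflects only my reading of the operators as described above. The function names within and at_least, the simple whitespace tokenizer, and the sample sentence are all assumptions for illustration, not how any commercial legal database actually implements them.

```python
def tokens(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def within(text, a, b, n):
    """True if terms a and b occur within n words of each other, in either order."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)


def at_least(text, term, n):
    """True if the term occurs at least n times in the text."""
    return tokens(text).count(term) >= n


sample = "the rico statute was applied because the rico claim alleged racketeering"

print(within(sample, "rico", "racketeering", 3))     # second "rico" is 3 words away
print(within(sample, "statute", "racketeering", 3))  # these two are 8 words apart
print(at_least(sample, "rico", 2))
print(at_least(sample, "rico", 3))
```

Both functions still give a yes-or-no answer, so they narrow a Boolean result set without actually ranking it.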
This will be interesting as I can’t recall having tried to use Boolean search logic on Google. Per my last post, I usually search Google by casting a broad net, and treat Google more like a Magic Eight Ball — by shaking it and seeing whatever pops out, without too much thought regarding why.
Searching Google and the USF Library Catalog Using Boolean and Comparing Results
(1) Google
The first challenge I had was figuring out how to structure a Boolean query on Google. Google does not appear to offer a guide to typing in a Boolean query, or at least not one that I could find. I was able to find some guides that others have created, for example: http://www.slideshare.net/charlotteg/google-boolean-searching .
I am going to try several searches on RICO, the Racketeer Influenced and Corrupt Organizations Act, and compare them to searches on “rico.” I’m going to do this because I want to see if my results contain information on the Racketeer Influenced and Corrupt Organizations Act and/or Puerto Rico.
Not surprisingly, Racketeer Influenced and Corrupt Organizations Act (no quotes) produces lots of results on the act, over 266,000. If I put quotes around it, the number of results is reduced to ~166,000. There is no explanation as to “why,” but I suspect the quotes eliminated documents that contained racketeer and influenced and corrupt and organizations and act as separate words but not as the exact phrase.
If I search rico (no quotes), I get over 798,000,000 results. Surprisingly, many of the top-ranked (how are they ranked?) ones are on the statute, but there are some top results on a smarthome security device called Rico that are not on point.
I next entered rico -puerto and got 208,000,000 results, so clearly I’ve eliminated results that would include discussions of Puerto Rico.
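The difference between an unquoted and a quoted search can be sketched as the difference between requiring every word somewhere in a document and requiring the words side by side, in order. This is how a strict Boolean engine with phrase support would behave. I am not claiming Google works this way internally, and the two sample sentences and function names below are invented for illustration.

```python
def contains_all_terms(text, query):
    """Unquoted style: every query word appears somewhere in the text."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())


def contains_phrase(text, query):
    """Quoted style: the query words appear adjacent and in the same order."""
    words = text.lower().split()
    phrase = query.lower().split()
    k = len(phrase)
    return any(words[i:i + k] == phrase for i in range(len(words) - k + 1))


query = "corrupt organizations act"
exact = "charged under the corrupt organizations act"
scattered = "an act by corrupt officials in several organizations"

print(contains_all_terms(exact, query), contains_phrase(exact, query))
print(contains_all_terms(scattered, query), contains_phrase(scattered, query))
```

The scattered sentence passes the all-terms test but fails the phrase test, which is the kind of document I suspect dropped out when I added quotes.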
I’m not sure how I would return results that would be guaranteed to include both Puerto Rico and RICO. I tried “puerto rico” AND (rico -puerto) . With the search structured that way I did get 5 results that contained rico separate and apart from Puerto Rico, but they were not on the RICO statute. Conversely, when I searched “puerto rico” AND “Racketeer Influenced and Corrupt Organizations Act” I got many documents that contained Puerto Rico and RICO, see e.g., http://en.wikipedia.org/wiki/Racketeer_Influenced_and_Corrupt_Organizations_Act . So I’m not sure why my first attempt to retrieve results that included Puerto Rico and RICO did not work.
(2) The USF Library Catalog
I used the Advanced Search form and jumped right in, searching rico NOT puerto. I retrieved over 1.6M results, and while a few of the first results discussed the statute, most discussed people named “Rico,” first or last name.
A search of the statute name produced 48,517 results.
Finally, I searched “puerto rico” AND (rico NOT puerto) and got zero results. Not sure whether that was because my search logic was off, or because there truly aren’t documents that satisfy that query in the USF catalog.
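One possible explanation, at least under classic Boolean semantics, is that the query contradicts itself: any document that contains the phrase “puerto rico” necessarily contains the word puerto, so the NOT puerto clause throws it out. The sketch below shows this on three made-up documents. The helper names are hypothetical, and I have no way of knowing whether the USF catalog evaluates queries this literally.

```python
# Toy collection (invented for illustration).
docs = {
    1: "the rico statute targets racketeering",
    2: "puerto rico tourism statistics",
    3: "a racketeering prosecution under rico in puerto rico",
}


def words(text):
    return text.lower().split()


def has_term(text, term):
    return term in words(text)


def has_phrase(text, phrase):
    w, p = words(text), phrase.lower().split()
    return any(w[i:i + len(p)] == p for i in range(len(w) - len(p) + 1))


# "puerto rico" AND (rico NOT puerto)
contradictory = [
    doc_id for doc_id, text in docs.items()
    if has_phrase(text, "puerto rico")
    and (has_term(text, "rico") and not has_term(text, "puerto"))
]

# "puerto rico" AND racketeering: pair the phrase with a second, distinct term.
workable = [
    doc_id for doc_id, text in docs.items()
    if has_phrase(text, "puerto rico") and has_term(text, "racketeering")
]

print(contradictory)
print(workable)
```

Document 3 is the one I was actually hoping to find, and the first query can never return it because the word puerto appears in it. Pairing the phrase with a distinct term for the statute, as in the second query, avoids the contradiction.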
(3) Where Does the Boolean Model Work Best?
If I had to choose which database delivered the results that looked most on point, it would be the USF catalog. However, I’m guessing this is not so much due to the search syntax, but more due to the content of the database itself. I also would have had the option on the USF catalog of using the limiters in the left navigation bar to further refine my results after a first search, unlike Google.
Mostly though, in both environments, analyzing my search queries exposed the deficiencies of Boolean discussed in our readings — the results were either too broad or too narrow, and I felt as though I was somehow “missing something” through faulty search logic in my last query in each database, where I tried to come up with discussions of Puerto Rico and RICO.