Academic search engines
The challenge for librarians
We live in an era of instant information. Suppose, for example, that I want to find out about Siamese cats: a Google search will instantly put me in touch with useful websites where I can read all about the breed, and even purchase a cat or rehome one I can no longer look after (see Figure 1, below).
Figure 1. Google search results for "Siamese cat"
There has been much concern in academic library circles that students are infected by this sort of instant information gratification, which has set them against the more structured world of libraries. Many libraries have considerable resources in the form of databases to which they subscribe, recommended websites, e-journals, gateways, etc. For an example see Figure 2, below, of a screenshot depicting the University of Bedfordshire's "digital library".
Figure 2. Screenshot showing the University of Bedfordshire’s "digital library"
But for many, checking through all these resources is just too much trouble: users want the "one-stop-shop" approach of Google.
OCLC (the Online Computer Library Center) carried out a survey of search behaviour in 2005, and found that 84 per cent of respondents used web search engines to look for information, compared with 1 per cent who used the library web page. Libraries were regarded as places for books rather than electronic information, and librarians were rarely consulted (Myhill, 2007).
The problem with this approach, however, is that whereas search engines are fine for general requirements, they are rarely suitable for academic information. Search engines do not have access to the important subscription-only databases or the "invisible web" – where many academic documents are stored – their retrieval process is too random and unstructured, and the results are open to manipulation (a famous example being the "Googlebomb", whereby entering the words "miserable failure" into Google brought up an entry on George Bush [Chen, 2006]).
Studies have shown (see the library management viewpoint on User studies) that users want to be able to search library resources, where they can be sure of the quality, but without having to learn complex search strategies or search across multiple databases (understandable, as many libraries subscribe to hundreds). Ideally they want a single interface which draws all these databases together into one search operation.
How then should the librarian advise, and how can the library itself create this quality controlled one-stop-shop search environment?
There are a number of search options for more academic material, which range from the freely provided services such as Google Scholar through to the complex federated search systems which enable the library to create a gateway to its resources. Before considering these options, we need to look in a bit more detail at the nature of the academic invisible web, and why it is inaccessible to "ordinary" search engines such as Google and Yahoo.
The invisible web
Much academic and scholarly content is now available only on the Web, which is why the Web is so important to scholarship, and why search engines need to crawl it thoroughly.
Unlike the general Web, which consists mainly of individual sites, the "academic web" is much more structured. In many disciplines, the most recognized form of publication is in a peer reviewed journal; even if the journal still exists in print, it will be most commonly accessed via a full-text database.
Then there are collections of digitized material, for example the British Library's newspaper collection, the OPAC catalogues of major libraries with their bibliographic records, and the repositories of societies and corporations. Finally, the Open Access movement, in which scholarly work is published without readers or their libraries having to pay for licences to large holdings, is becoming more and more significant, and is creating its own vast databases of material.
The problem, however, is that much of the above content is "invisible" to most search engines, for a variety of reasons (Ford and Mansourian, 2006):
- The pages are not static HTML, the format favoured by most search engines. Such pages include text documents, survey data in relational databases, and dynamic HTML pages. (Although the ability of the main search engines to search different formats is increasing.)
- Pages may be on a proprietary or private site, and require authentication or a password. This includes the majority of publishers’ databases.
- The pages are buried too deep – for example, if a website has a complex hierarchy, information at the lower levels may simply not be picked up by search engines.
- Information is constantly changing, at a faster rate than most search engines index.
- Other reasons – for example, broken links, or pages which simply are not well linked and which have not been submitted to search engines.
The main providers of such content, according to Lewandowski and Mayr (2006), are
- database vendors,
- commercial publishers of full-text material,
- libraries, and
- repositories, including open access ones.
Whatever the source, if the search engine is to provide a full picture it must be able to gain access to this content. See the diagram, Figure 3, below which shows the elements of the invisible web.
Figure 3. Diagram of the invisible web (Ford and Mansourian, 2006)
Freely available academic search engines
There are a number of search engines which aim to provide access to academic content, some of which are developed by commercial companies, others arising out of academic initiatives. The best known is Google Scholar.
Google Scholar was launched by Google in November 2004 to much publicity, with the aim of making academic content easily available. It claims:
"From one place, you can search across many disciplines and sources: peer-reviewed papers, theses, books, abstracts and articles, from academic publishers, professional societies, preprint repositories, universities and other scholarly organizations. Google Scholar helps you identify the most relevant research across the world of scholarly research".
Google Scholar searches publishers' databases and open access repositories, thereby mimicking a metasearch engine, which searches across particular resources and via other search engines, rather than crawling the Web at large.
Google Scholar was the first academic search engine not to be limited to science, although science remains its bias: of the seven subject area options, five are from the sciences, one covers business, administration, finance and economics, and one the social sciences, arts and humanities.
One of the weaknesses of Google Scholar is that although the advanced search option offers the opportunity of searching for particular authors or journals, the criteria are limited compared with, for example, the Emerald database. The main options are title or full text (see below), whereas the Emerald database can also be searched by keyword or abstract. The latter are particularly useful when you want to ensure that your returns have more than a marginal relevance.
Figure 4. Google Scholar’s advanced search
Whereas a "normal" search engine only searches bibliographic metadata, Google Scholar also searches full text. (Note that here it has an advantage over many federated search engines, which can only search metadata.) Useful features are the ability to link quickly into a full search engine (e.g. in Google Scholar, web search is an option in the results, see Figure 5), and Library Link, which enables users to select a library of their choice and have links to the catalogue highlighted in their search results.
Despite the limited nature of its advanced search, Google Scholar is fast – it can retrieve a large number of items in less than a second – and easy to use. One study compared Uppsala University students' experience of Google Scholar and MetaLib, and found that the former was generally more appreciated (Haya et al., 2007). They thought it more usable than MetaLib, liked its simple interface, and found that it produced a higher number of quality articles.
One of the most time-consuming aspects of searching involves going through search results, deciding which are relevant. Ranking is therefore very important. Google Scholar’s ranking takes account of the full text, the author, the publication and the number of times it has been cited. Sometimes, particularly if a search is for a full article title, it will pick up citations, but it clearly labels these as such ([CITATION]).
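Google Scholar's actual ranking formula is proprietary. Purely as an illustration of the idea that a text match can be combined with a citation boost, a toy score might look like the following (the function, weights and log-damped citation term are invented for this sketch, not Google's algorithm):

```python
import math

def score(term_frequency, in_title, citation_count,
          w_text=1.0, w_title=2.0, w_cite=0.5):
    """Toy relevance score: text match plus a citation boost.

    The weights and the log-damped citation term are invented for
    illustration; Google Scholar's real formula is not public.
    """
    text_score = w_text * term_frequency + (w_title if in_title else 0.0)
    citation_boost = w_cite * math.log1p(citation_count)
    return text_score + citation_boost

# A heavily cited paper outranks an uncited one with the same text match
assert score(3, True, 500) > score(3, True, 0)
```

The logarithm damps the citation term so that a handful of very highly cited papers cannot swamp textual relevance entirely.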
Whereas most abstracting and indexing services group results according to document or media type (book, journal article, etc.), and occasionally by refereed or non-refereed journal, Google Scholar to date has no clustering. This can make results confusing:
Figure 5. Results of a search in Google Scholar
The main criticisms of Google Scholar are that not only is its coverage incomplete, but also there is no way of knowing what the gaps are, as it is very secretive about its sources (Chen, 2006: p. 425). Whereas many other academic search engines list their sources, Google Scholar does not.
There have been a large number of evaluations and research studies of Google Scholar. One of the most comprehensive is that by Mayr and Walter (2007), who in August 2006 carried out a study based on queries against different journal lists (three from Thomson Scientific, one open access journal list, and the SOLIS social sciences literature database), looking at a total of around 9,500 journals. The study showed that although the majority of journals did appear in the results, many did so as citations rather than direct links. Moreover, open access sources were under-represented, and the service was not updated with sufficient regularity. The authors conclude:
"In comparison with many abstracting and indexing databases, Google Scholar does not offer the transparency and completeness to be expected from a scientific information resource. Google Scholar can be helpful as a supplement to retrieval in abstracting and indexing databases mainly because of its coverage of freely accessible materials."
(Mayr and Walter, 2007: p. 828.)
One of the most up-to-date studies of Google Scholar is that by Péter Jacsó (2008). He applauds the considerable growth in content, and in particular the impact of the Google Books project, but complains about the quality of the software, which does a poor job of retrieving highly structured and tagged documents, and provides inadequate search facilities and no sorting of results. The whole article is well worth reading for any serious Google Scholar user.
Scirus was launched by Elsevier in 2001 as a scientific search engine. According to Robinson and Wusteman (2007), it searches the academic surface web as well as Elsevier collections. It currently claims to cover over 450 million scientific items including (in addition to journal content):
- scientists' home pages
- pre-print server material
- institutional repository material
- website information.
Unlike Google, it is open about its sources (see the list on http://www.scirus.com/srsapp/aboutus/), although Mayr and Walter (2007) are critical of the coverage. Its advanced search facility has more criteria, and it is possible to search within one or more broad subject areas. It updates monthly. McKiernan (2005) provides a comprehensive review.
Figure 6. Advanced search options in Scirus
Other search engines
There are a number of more specialist academic search engines. Bielefeld Academic Search Engine (BASE) is a German product which searches open access collections; in 2006, it contained 2.7 million documents in 189 collections, including 500,000 digitized pages of historical journals and review organs of the German enlightenment (Pieper and Summann, 2006). It searches both metadata and full text, is open about its sources, and offers various options for searching.
Blogs are important sources of data for social scientists, in that they reveal public opinion. Some blog search engines – for example www.technorati.com, www.blogpulse.com and www.icerocket.com – produce time series graphs showing the results for a search over a six-month period (Thelwall, 2007).
Federated search engines
Federated search engines work on similar principles to metasearch engines in that they do not index the Web themselves, but rather search other search engines, web directories, databases and other parts of the invisible web. There are currently a number of federated search options (also called metasearch) for libraries, often linked in with other systems supporting electronic products and services, for example MetaLib (developed by Ex Libris as part of their suite of information management tools), WebFeat (used by more than half of the top US public libraries), MetaFind, ENCompass and CentralSearch.
This option is popular with libraries because of the ability to search multiple databases through one interface; the user can plug in a word or a phrase into a single search box and end up with a number of relevant results in a merged list.
Another advantage is that libraries can retain control of content, how the search is organized for the user, and to some extent the display of the results. Federated search tools are customizable so the library can incorporate their own catalogues and databases, creating a portal or gateway through which the user can access the full range of electronic holdings.
Bristol University Library describes how the library portal/gateway is constructed by a combination of MetaLib and SFX, which together enable the user to:
- find and use information resources to which they have access, such as databases, library catalogues, subject-based web gateways, e-journals, e-books, and selected Internet resources;
- use a common interface to simultaneously cross-search these resources (up to ten at one time, where enabled by the supplier), then view and save results;
- locate journal titles available in print or online and link to full text where available, via Get it! (context-sensitive linking).
The library will need to define the targets to be searched, what categories these fit into (usually subject related), and the "search" groupings, as well as design the interface. Metasearch tools may bring their own collections – MetaLib for example brings a selection of resources and databases which it claims are regularly updated. Vendors will therefore need to be notified of any subscription changes.
A major problem with databases is that to many students they are just names with little attached meaning, so an inexperienced user might end up looking for Jane Austen in Medline, for example. Many libraries provide subject guides to their collections, and federated search tools allow for databases to be grouped according to subject categories, and described in the way the library chooses.
Federated search tools are powered so that users can search over multiple databases, and either be linked directly to the search interface of the database, or receive results in a merged list.
Standards for search and retrieval in databases have been developed – for example, Z39.50 or the NISO Metasearch XML Gateway – which make searching easier. MetaLib for example uses Z39.50, which enables it to bypass the database's native search interface and use its own search system to retrieve and display results. However, it is only possible to search those databases which have also implemented Z39.50, and while more and more publishers are seeing the need for standards, not all are compliant. A federated search engine is only as good as the databases it searches: it may translate the search into something the native search engine understands, but it cannot improve on the latter's search interface.
Additionally, federated search engines can search only citations, not full text or abstracts, so citations will be the basis of their relevancy ranking.
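The translate-and-merge flow described above can be sketched in outline. In this illustrative snippet the backends, their query syntaxes and the result format are all invented stand-ins for real targets (an SRU/CQL database, a keyword-only OPAC, and so on); a real system would issue Z39.50 or HTTP requests instead:

```python
# Federated search in outline: translate one user query into each
# backend's native syntax, search each target, and merge the hits.

def to_cql(query):               # stand-in for an SRU/CQL-style backend
    return f'title any "{query}"'

def to_keyword(query):           # stand-in for a keyword-only backend
    return query.lower()

# Stub databases standing in for real targets such as Medline or an OPAC
BACKENDS = {
    "Database A": (to_cql, lambda q: [f"A hit for {q}"]),
    "Database B": (to_keyword, lambda q: [f"B hit for {q}"]),
}

def federated_search(query):
    merged = []
    for name, (translate, search) in BACKENDS.items():
        native_query = translate(query)   # the federated layer adapts to
        for hit in search(native_query):  # each target's own capabilities
            merged.append((name, hit))
    return merged

results = federated_search("model")
```

Note that the `translate` step is exactly where the federated engine is limited by the target: it can reformulate the query, but it cannot add capabilities the native interface lacks.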
Another very useful feature is the ability to personalize searching by creating one's own lists of resources, saving and retrieving searches, and setting preferences for the display of results: MetaLib's My Space, for example.
Most federated search tools offer both a simple search option across the total range of databases (cross-searching), and also the opportunity to narrow the search by subject, database, or resource type. How these options work, and how they are described, varies according not only to the system but also its implementation, but here is an example from the Pennsylvania State University’s MetaLib (Figure 7), which offers:
- A "quick search" where the user can do a single box search or a more advanced one with Boolean operators and the option of searching by title, subject, author, ISBN, ISSN or year.
- A "multisearch" option where the user can select databases according to category.
Figure 7. Multisearch option in Pennsylvania State University’s version of MetaLib
Results from the "quick search" option are presented according to database, as shown in Figure 8 below:
Figure 8. Results from the "quick search" option in Pennsylvania State University’s MetaLib
Exeter University Library uses MetaFind, which it is possible to access from an icon on the library catalogue page:
Figure 9. MetaFind on the University of Exeter Library’s catalogue page
Users can search the library’s electronic databases and other resources via a simple search with options of keyword, title, author, or subject, or they can go to an advanced search option which allows them to search individual databases and use Boolean operators.
CentralSearch has a series of check boxes underneath the search box to help indicate the subject area, and it is possible to cluster results by date (the default option), author, article title, and database.
There are drawbacks to federated search: despite the unified search interface, they cannot compete with Google for speed, and they are also complex, which puts some students off. Haya et al. (2007), in their study of students using MetaLib and Google Scholar, found that benefits of metasearching were offset by the complexity of the tool. They also criticized the lack of standard search rules, and the fact that the back button did not work.
Another study of MetaLib, this time at Carnegie Mellon University and involving usability testing with students using thinkaloud protocols (George, 2008), also found students bemused by the complexity of the search processes. There was confusion about authentication, the SFX linking symbol, the cross-search (choosing databases) option, and navigation. The authors are particularly critical of the text-based navigation, proposing a more graphic approach for future versions.
Myhill (2007) reports on a large survey of the various library search and discovery tools at the University of Exeter Library, concluding that although their chosen system (MetaFind) received relatively high ratings for ease of use, the library still needed to do more to ensure that students could retrieve information easily and quickly, and without being deluged.
Another problem is that although the chore of individual search is avoided, results are ranked by database and therefore students need to be familiar with these. (Chen, 2006: p. 417, observes that in effect the trawl through databases is postponed from before to after the search.) Oberhelman (2006) describes how he conducted an experiment with his library's newly implemented central search system. Mimicking the behaviour of a naïve user, he typed the term "model", which has different connotations in different academic disciplines, into the search box, only to get a total of 678,664 hits! It is possible to refine that search by database – however this is only useful if one has some knowledge of what databases are relevant in one's own area.
There are other issues related to databases – for example, will the library's authentication system be recognized when the user goes in via the federated search engine? Inevitably, too, different databases will return the same results, and duplication will become a problem.
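One common mitigation of the duplication problem is to de-duplicate the merged list on a normalized key. The sketch below assumes each hit carries a title and, optionally, a DOI; real systems use much fuzzier matching than this:

```python
def dedupe(results):
    """Collapse duplicate hits returned by different databases.

    Matching on a DOI when present, else a normalized title, is a
    simplification; production systems use fuzzier record matching.
    """
    seen = set()
    unique = []
    for hit in results:
        key = hit.get("doi") or hit["title"].casefold().strip()
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique

hits = [
    {"title": "Exploring the academic invisible web", "doi": None},
    {"title": "Exploring the Academic Invisible Web ", "doi": None},
]
# The two spellings collapse to a single result
assert len(dedupe(hits)) == 1
```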
Linking search with discovery
Users of search engines expect to be able to go straight to the object searched. Libraries are recognizing that it is much easier to discover and use resources when they are linked directly from the federated search, rather than the user having to go to another database. Link resolvers (for example, SFX, produced by the vendor of MetaLib, Discovery Resolver, Article Linker, and WebBridge) are tools that enable users to navigate directly to the resource. They are able to do this provided the link complies with OpenURL, which is a syntax for identifying content and creating web-transportable packages of metadata. OpenURL can link from OPACs and both bibliographic and full-text databases.
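To make the OpenURL syntax concrete, the snippet below builds an OpenURL 1.0 (Z39.88-2004) key/encoded-value link for a journal article. The resolver base URL is a placeholder for an institution's own link resolver endpoint; the `rft.*` keys are the standard journal metadata fields:

```python
from urllib.parse import urlencode

# Placeholder for an institution's own link resolver (e.g. its SFX
# or WebBridge endpoint) - not a real address.
RESOLVER = "https://resolver.example.edu/openurl"

def article_openurl(atitle, jtitle, issn, date, volume, issue, spage):
    """Build an OpenURL 1.0 (Z39.88-2004) key/encoded-value link
    describing a journal article."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
        "rft.issn": issn,
        "rft.date": date,
        "rft.volume": volume,
        "rft.issue": issue,
        "rft.spage": spage,
    }
    return RESOLVER + "?" + urlencode(params)

link = article_openurl(
    "Exploring the academic invisible web", "Library Hi Tech",
    "0737-8831", "2006", "24", "4", "529")
```

Because the metadata travels in the URL itself, any OpenURL-aware resolver can parse the package and redirect the user to whichever copy of the article the institution licenses.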
Conclusion and references
Clearly, federated or metasearch systems are powerful tools which can create a single gateway linking to all the library’s resources and have great advantages for the library and for the user. The former can ensure that all its resources, together with knowledge as to which is most suitable for which category, are at the user’s disposal. The user can be easily directed to the most useful resources and be assured of their quality.
Chen (2006) concludes that although federated search engines cannot compete with the speed of Google, they are an improvement on carrying out multiple searches, and a good way of ensuring the quality of the content and the objectivity of the retrieval and display of results.
For federated search engines to work well, however, certain things are necessary. First, standards for transferring metadata, as well as for search and retrieval in databases, should be adhered to. Publishers who fail to adopt Z39.50, for example, will inhibit a global search. Second, vendors of federated search systems need to ensure that such systems are as simple as possible for the user, and return results rapidly with appropriate ranking.
And yet, no system can ever be so simple that it bypasses basic information literacy. To gain reasonable results will require some human manipulation: it may be possible to incorporate all human knowledge in one system, but the user will still need to be able to devise strategies to find it.
Chen, X. (2006), "MetaLib, WebFeat, and Google: The strengths and weaknesses of federated search engines compared with Google", Online Information Review, Vol. 30 No. 4, pp. 413-427.
Ford, N. and Mansourian, Y. (2006), "The invisible web: an empirical study of 'cognitive invisibility'", Journal of Documentation, Vol. 62 No. 5, pp. 584-596.
George, C. (2008), "Lessons learned: usability testing a federated search product", The Electronic Library, Vol. 26 No. 1, pp. 5-20.
Haya, G., Nygren, E. and Widmark, W. (2007), "Metalib and Google Scholar: a user study", Online Information Review, Vol. 31 No. 3, pp. 365-375.
Jacsó, P. (2008), "Google Scholar revisited", Online Information Review, Vol. 32 No. 1, pp. 102-114.
Lewandowski, D. and Mayr, P. (2006), "Exploring the academic invisible web", Library Hi Tech, Vol. 24 No. 4, pp. 529-539.
Mayr, P. and Walter, A.-K. (2007), "An exploratory study of Google Scholar", Online Information Review, Vol. 31 No. 6, pp. 814-830.
McKiernan, G. (2005), "E-profile: Scirus: for scientific information only", Library Hi-tech News, Vol. 22 No. 3, pp. 18-25.
Myhill, M. (2007), "Canute rules the waves: Hope for e-library tools facing the challenge of the Google generation", Program: electronic library and information systems, Vol. 41 No. 1, pp. 5-19.
Oberhelman, D. (2006), "The time machine: federated searching today and tomorrow", Reference Reviews, Vol. 20 No. 3, pp. 6-8.
Pieper, D. and Summann, F. (2006), "Bielefeld Academic Search Engine (BASE): An end-user oriented institutional repository search service", Library Hi Tech, Vol. 24 No. 4, pp. 614-619.
Robinson, M. and Wusteman, J. (2007), "Putting Google Scholar to the test: a preliminary study", Program: electronic library and information systems, Vol. 41 No. 1, pp. 71-80.
Thelwall, M. (2007), "Blog searching: The first general-purpose source of retrospective public opinion in the social sciences?", Online Information Review, Vol. 31 No. 3, pp. 277-289.