The Semantic Web – a new tool for libraries?

By Margaret Adolphus

The Web is enormous – there are currently (as of December 2009) more than 20 billion pages, 230 million web servers and 681 million hosts (Hall and Shadbolt, 2009). It pervades every area of life: we use it for keeping up with old friends, buying presents, booking holidays, browsing library catalogues, reading academic journals, and much, much more.

And it keeps evolving, to the point where we can do more and more with it.

Initially, it was a collection of documents, so we used it to look up information and make purchases. Then along came Web 2.0 and we could upload our own content in the form of blogs, social networking sites, etc.

Now there is potential for the Web to be an even greater source of information. Imagine, for example, that I want to find out the names and locations of special schools near my home. This is what I can find by going to the portal, EduBase:

Figure 1. Map of schools linked to postcode © Crown copyright all rights reserved (Ordnance Survey Licence number 1000384332009)

Data from EduBase have been merged with that from the Ordnance Survey so I can tell exactly where the schools are in relation to my home; back on EduBase I can consult data on individual schools.

The Web has now evolved to the point where it is possible to extract information in a meaningful way to meet our immediate needs. This is what is meant by the Semantic Web – semantics being the science of meaning.

This is what Tim Berners-Lee said about the Semantic Web, at the turn of the millennium:

"The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The first steps in weaving the Semantic Web into the structure of the existing Web are already under way. In the near future, these developments will usher in significant new functionality as machines become much better able to process and 'understand' the data that they merely display at present" (Berners-Lee et al., 2001; quoted in Macgregor, 2008).

The basic goal of the Semantic Web is to tighten the Web's structure and so create an even more powerful fabric of information. For example, if you are planning a trip to Paris, you will probably need to search several different sites for airlines, hotels, museums and other tourist attractions. Imagine if an intelligent software agent could do that for you – wouldn't that make the task simpler?

And the Semantic Web application EduBase helps you to consult nearby institutions in one operation, not two.

What makes all this possible is that the Web has evolved from being a collection of documents to one of data, so one can search over several databases, and combine datasets.

There is nothing new in combining different data: in 1854, London surgeon John Snow mapped data on cholera to that of water sources and discovered that the disease was waterborne, not airborne. What is new is the immediacy with which the Web is able to retrieve a large amount of information.

Data can be collected from anywhere on the Web, from any type of resource (publications, multimedia, databases or scientific study workflow, for example) and in any format (text, html, xml, Excel, etc.). A standardized structured syntax renders documents in different formats machine readable.

Content can be extracted because it is possible to retrieve not just pages, but objects, whether these be restaurants, schools, people, books, etc. In other words, immediate access to the "thing" itself, rather than to the page where the "thing" occurs.

All this has the potential for a greatly improved search experience. At the moment, search engines such as Google reach only the tip of the Web's iceberg; much lies hidden behind authentication walls in publishers' databases, at low levels in deep hierarchies, or in difficult to search formats such as text documents. (Read more about the invisible Web.)

Traditional information retrieval is also haphazard: the ranking of sites depends not on their real relevance, but on their popularity, i.e. the number of hits and links. Moreover, search is based on keywords which may mean different things. For example, a search on "libraries in Brazil" may bring up a public library in a US town called Brazil rather than libraries in the South American country.

The Semantic Web attempts to take some of the drudgery out of human search and hand it over to machines. Code enables pages to be read by machines, and software agents search over multiple databases to extract information relevant to a very specific query.

Some of the principles of the Semantic Web are similar to those of Web 2.0; the latter uses informal "folksonomies" to tag objects, and you can create your own mashups of data. The difference, however, is that the Semantic Web uses standard language and vocabulary, whereas mashups may group different sites with different types of data. Personally selected tags are also replaced by formal ontologies, ensuring consistency of use.

Semantic search is also less haphazard than keyword search, in that if terms with similar meanings are grouped together under an ontology, the results are more likely to be relevant.

Take for example the search query "telecom company" Europe director: a semantic search engine would search not only the actual words of the search string, but also other related terms, for example, different sorts of telecom companies, cities that were in Europe, and different sorts of director. So pages describing the appointment of a chief technical officer of a mobile company in London would be retrieved (Davies, 2009).
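The expansion step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ONTOLOGY` dictionary, its terms and the helper names `expand_query` and `matches` are hypothetical, and a real semantic search engine would draw on a formal ontology rather than a hand-written lookup table.

```python
# Hypothetical mini-ontology: each query concept maps to related or narrower terms.
ONTOLOGY = {
    "telecom company": ["mobile company", "broadband provider"],
    "Europe": ["London", "Paris", "Berlin"],
    "director": ["chief technical officer", "managing director"],
}


def expand_query(terms, ontology):
    """Return each query term together with the related terms the ontology lists for it."""
    return {term: [term] + ontology.get(term, []) for term in terms}


def matches(page_text, expanded):
    """A page matches when every query concept is covered by at least one variant."""
    text = page_text.lower()
    return all(
        any(variant.lower() in text for variant in variants)
        for variants in expanded.values()
    )


expanded = expand_query(["telecom company", "Europe", "director"], ONTOLOGY)
page = "A mobile company in London has appointed a new chief technical officer."
print(matches(page, expanded))
```

The sample page contains none of the literal words of the query, yet it is retrieved because each concept is matched through one of its related terms.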

The building blocks of the Semantic Web

The World Wide Web Consortium (W3C), the international community that develops common protocols to ensure the long-term growth of the Web, is working on standards to build a "technology stack" to support the Semantic Web. These standards have a common aim: to create a uniform way of accessing heterogeneous data sources.

The main principles have been summarized by Burke (2009) as follows:

Metadata, in other words, Resource Description Framework (RDF) technologies which identify and exploit relationships between items.
Ontologies, which provide vocabularies for the description of properties and classes.


Figure 2. The Semantic Web Layer Cake (Berners-Lee, 1999; Swartz-Hendler, 2001, from Hall and Shadbolt, 2009)

Uniform resource identifiers

Whereas URLs refer to location, uniform resource identifiers (URIs) refer to objects. That object may be an information resource, a real world entity, a person, a term in a vocabulary, or even a phrase denoting a relationship, for example "is a". This URI needs to be capable of being "de-referenced", in other words, we should be able to get something back.

Linked data and RDF

Linked data are the basis of the Semantic Web. However, these data are available in many different formats – for example, relational, XML, HTML. For these data to be searchable and manageable they need to be in a standard format – which is what RDF provides.

Based on ideas from artificial intelligence, RDF provides additional metainformation; it is also a way of decomposing knowledge into its constituent parts.

There are three standard concepts in RDF: resources, properties and statements. All RDF statements are represented as triples, with a subject, predicate and object, and each part of a triple can be identified by a URI (an object may also be a plain literal value).

The following example in Table I is based on the statement: "An apple is a fruit" (Krötzsch, 2008):

Table I. Example of the RDF statement: "An apple is a fruit"
Construct	RDF-Type	Part of the sentence
Resource	rdf:subject	an apple
Property	rdf:predicate	is a
Resource	rdf:object	fruit

The code would look like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Statement>
<rdf:subject rdf:resource="Apple" />
<rdf:predicate rdf:resource="onto;is a" />
<rdf:object rdf:resource="Fruit" />
</rdf:Statement>
</rdf:RDF>
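
To show how a program might read such a statement back as a triple, here is a minimal sketch using only Python's standard library. It works on a well-formed variant of the markup above; the predicate name `is_a` and the function name `read_triples` are assumptions made for the illustration, and a production system would normally use a dedicated RDF library rather than a generic XML parser.

```python
import xml.etree.ElementTree as ET

# The standard namespace for the rdf: prefix.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Well-formed variant of the "An apple is a fruit" statement;
# the predicate name "is_a" is a placeholder chosen for this sketch.
RDF_XML = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}">
  <rdf:Statement>
    <rdf:subject rdf:resource="Apple" />
    <rdf:predicate rdf:resource="is_a" />
    <rdf:object rdf:resource="Fruit" />
  </rdf:Statement>
</rdf:RDF>
"""


def read_triples(rdf_xml):
    """Return a (subject, predicate, object) tuple for every rdf:Statement found."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for statement in root.iter(f"{{{RDF_NS}}}Statement"):
        parts = []
        for role in ("subject", "predicate", "object"):
            node = statement.find(f"{{{RDF_NS}}}{role}")
            parts.append(node.get(f"{{{RDF_NS}}}resource"))
        triples.append(tuple(parts))
    return triples


print(read_triples(RDF_XML))
```

Once statements are held as simple three-part tuples, they can be stored, merged with triples from other sources, and queried.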

As of March 2009, there were 4.5 million triples on the Web (Hall and Shadbolt, 2009).


Figure 3. Datasets on the Web as of March 2009 (Hall and Shadbolt, 2009)

W3C has developed a new syntax, RDFa, which is simpler and – herein lies its beauty – can be embedded in XHTML documents. Thus Web pages can be transformed, by a simple piece of script, into items that can be semantically searched and retrieved, without changing the way they are viewed in a web browser.
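
As a rough illustration of what such a script might do, the sketch below collects the values of `property` attributes from a small XHTML fragment using Python's standard `html.parser` module. The fragment, the class name `PropertyCollector` and the choice of Dublin Core terms are assumptions for the example, and the code handles only the simplest case, where the value is the element's own text; it is not a full RDFa processor.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, text) pairs from elements carrying a property attribute."""

    def __init__(self):
        super().__init__()
        self.found = []       # (property name, text) pairs in document order
        self._current = None  # property of the element whose text is expected next

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop

    def handle_data(self, data):
        # Pair the first non-blank text after a property-bearing tag with that property.
        if self._current and data.strip():
            self.found.append((self._current, data.strip()))
            self._current = None


# Hypothetical page fragment; it renders as ordinary text in a browser.
XHTML = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h1 property="dc:title">The Semantic Web</h1>
  <p>By <span property="dc:creator">Margaret Adolphus</span></p>
</div>
"""

collector = PropertyCollector()
collector.feed(XHTML)
print(collector.found)
```

The reader of the page sees only a heading and a byline, while the script recovers a title and a creator as labelled data.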

Whitehouse.gov is incorporating RDFa into its site, with property, rel and xmlns attributes to provide better structure (Peterson, 2008).

Ontologies

Ontologies, or vocabularies, are a form of taxonomy. An ontology has been described as:

"a schema that formally defines the hierarchies and relationships between different resources. Semantic Web ontologies consist of a taxonomy and a set of inference rules from which machines can make logical conclusions" (Altova, 2009).

In other words, they are a domain-specific shared vocabulary. They provide additional meaning to the data, and so make it more flexible. They can be used for integrating data, for example when new relationships may give rise to new knowledge.

In the field of health care, medical and pharmaceutical knowledge could be combined with patient data for epidemiological research, and information about treatment efficacy (W3C, 2009).

They can also help reduce ambiguity. For example, anyone seeking information on the British prime minister would have to use those terms, as well as his name: ontologies can be developed which link the name with the function. Another example would be a bookseller or library trying to build a database from lots of different publishers' datasets. The latter may use different terms for author, for example, creator or editor, and the ontology can clarify that these are variants.
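
The bookseller case can be sketched as a simple renaming step. The mapping `EQUIVALENT_TERMS`, the function `normalise` and the two sample records are hypothetical; a real ontology would state such equivalences formally rather than in a hand-written dictionary.

```python
# Hypothetical statement of which publisher field names are variants of "author".
EQUIVALENT_TERMS = {"author": "author", "creator": "author", "editor": "author"}


def normalise(record, equivalents):
    """Rename each field to its shared term; fields not listed are kept unchanged."""
    return {equivalents.get(field, field): value for field, value in record.items()}


# Two publishers describing the same kind of information with different labels.
publisher_a = {"title": "Foundations of Semantic Web Technologies", "creator": "Hitzler, P."}
publisher_b = {"title": "Knitting the Semantic Web", "editor": "Greenberg, J."}

catalogue = [normalise(record, EQUIVALENT_TERMS) for record in (publisher_a, publisher_b)]
print([record["author"] for record in catalogue])
```

After normalisation both records can be searched through a single "author" field, whatever label the publisher originally used.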

Two of the main techniques to describe vocabulary terms in standard form are Web Ontology Language (OWL), which can add more vocabulary for describing properties and classes, and Simple Knowledge Organization System (SKOS). The latter is used to design knowledge organization systems and has clear applications to libraries.

Query languages

The Semantic Web needs its own query language, which relates to RDF much as SQL relates to relational databases. This language is known as SPARQL.

SPARQL, like RDF, is based on triples, with the difference that one or more parts of a triple in the query may be a variable; the results returned are the RDF triples that match this pattern.
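
The idea of matching a pattern that contains variables can be imitated in plain Python. This is not SPARQL itself, only a sketch of the matching principle; the sample triples, the `?x` naming convention and the function `match` are assumptions for the illustration. The pattern used below corresponds roughly to a SPARQL query of the form SELECT ?x WHERE { ?x is_a Fruit }.

```python
# Hypothetical data: a handful of triples held as (subject, predicate, object).
TRIPLES = [
    ("Apple", "is_a", "Fruit"),
    ("Pear", "is_a", "Fruit"),
    ("Carrot", "is_a", "Vegetable"),
]


def match(pattern, triples):
    """Yield one {variable: value} binding per triple that fits the pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    A variable used twice in one pattern must take the same value both times.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


print(list(match(("?x", "is_a", "Fruit"), TRIPLES)))
```

Here the variable `?x` stands for "anything that is a fruit", so the apple and the pear are returned and the carrot is not.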

Semantic Web applications

While the Semantic Web has not yet achieved its full potential, there are a number of quite well known projects, some of which include:

DBpedia applies the principles of the Semantic Web to Wikipedia. It comprises a database of facts about various things: people, places, music albums, video games, organizations, species, diseases, for example. It also incorporates geographical datasets from GeoNames, making it another example of two very specific datasets being combined.
Friend of a Friend (FOAF, www.foaf-project.org) describes people, their activities and how they are related to other people. This can be used for site personalization.
MusicBrainz (http://musicbrainz.org) is a musical database with metadata for albums, discographies, biographies, etc.
Semantically-Interlinked Online Communities (SIOC, http://sioc-project.org) links online communities (blogs, mailing lists, etc.)

It will come as no surprise that the Semantic Web has huge implications for science, society, government and business.

In the field of conventional search, search engines such as Google and Yahoo are adopting semantic technology.

For example, Yahoo's SearchMonkey is an open search platform which leverages structured data to build more effective and detailed search results (e.g. a search for a restaurant might return reviews as well as basic information).

Comparison websites such as comparethemarket.com have become popular as a way of comparing different services, and companies are using semantics to enable faster discovery of a suite of services and respond more quickly to changing customer need.

Scientific applications include computer modelling to aid seismic hazard research in the Southern California Earthquake Center.

Towards more open government

One of the most exciting applications of the Semantic Web lies in the area of government. In the US, Data.gov's aim is to:

"increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government".

There are also links to sites built by individual states, where these have published their own data.

The award for the most enterprising use of the Semantic Web in government should, however, surely go to Recovery.gov, described by David Peterson as a move

"of radical transparency the likes of which has not been seen in the US or any other country for that matter" (Peterson, 2009).

Recovery.gov is the brainchild of President Obama's administration and shows how the $800 billion stimulus package will be spent. Powered by the open source software Drupal, it shows you exactly where the money is going, and how many jobs have been created/saved.

The UK is not being left behind: data.gov.uk is based on similar principles to data.gov, and aims to have all government datasets available to the general public. It's a major project, with Tim Berners-Lee and Nigel Shadbolt acting as advisers, and over 1,000 registered developers. The code is in the public domain, at code.google.com, and the data are provided by different agencies using common standards.

There are currently (December 2009) over 1,100 datasets, and a beta version is expected in January 2010. There is a schedule for release of data from all government departments, starting with the Ordnance Survey, which provides mapping and geographical data for the UK.

Transparency and the need to be more accountable are key strategic drivers, as is a desire to improve public service, and a recognition of the value of data both in its own right and re-used in other contexts. It has necessitated a change in mindset: civil servants need to be open, believing that information belongs to people and should be held in the public domain.

The Semantic Web and the library

The Semantic Web might be expected to be highly relevant to the library and information science (LIS) community. Not only does it affect the silos of information which librarians service, but knowledge organization is also librarians' traditional expertise. In recent years this has extended out from traditional resources (books, journals, databases, etc.) to the "visible web", through provision of "approved" sites lists.

There are also obvious benefits. Digital libraries structure electronic information and aid search; the Semantic Web potentially offers tools to enhance resource discovery. The former offer a degree of structured data relatively unusual on the Web, and are therefore prime candidates for semantic enhancement (Prasad and Madalli, 2008).

Ontologies, which are by their nature domain specific, are similar to taxonomies but make the content easier to reuse, so are therefore of benefit to subject portals.

The skill set which librarians draw on also maps neatly onto semantic techniques: classification, cataloguing, ontologies, the creation of metadata – all skills on which the Semantic Web draws.

The overlap extends to digital library technologies. RDF can work well with the Dublin Core standard for metadata (Joint, 2008), widely used by digital libraries. And many libraries use XML for their websites – and thus have already gone some way towards a disciplined web structure (Joint, 2008).

RDF is highly compatible with XML, so adding semantic elements to an XML based website is not a major undertaking.

Bygstad et al. (2009) describe how adding semantic content to the National Library of Norway was a relatively simple matter from a technical point of view, because the library was considered one of the world's top ten of its kind.

The reluctant librarian

Yet despite all these synergies, there has been no rush among librarians to embrace Semantic Web principles. Writing in 2007, Joint (2008) proclaimed:

"At the everyday level of library practice, just now the 'Semantic Web' looks like a great idea which is still awaiting its big opportunity for a wide-ranging relevant application."

Two years later, there are few signs of that "wide-ranging relevant application", and librarians are more likely to be interested in Library 2.0 than in the Semantic Web.

Various explanations have been given for this lack of enthusiasm. Krause (2008) compares ontologies with thesauri, and believes that the latter are preferred by the LIS community. This is because, although the greater precision of ontologies does away with the need for the human element, the effort required to develop them far outweighs the small extra search effort needed by thesaurus-driven products.

The reality is, libraries have put considerable work into developing their catalogues and systems, and may not consider the extra effort needed to become "semantic" worthwhile. RDF compliant metadata still require a high overhead of time to create, according to Burke (2009), who also points out that libraries may want to stay loyal to existing (non-semantic) library management systems.

There is also the question of licence restrictions: certain databases may be confined to library members only and there is therefore little incentive to expose the collection more widely, according to Joint (2008).

Library semantic applications

Notwithstanding this reluctance, there are a few interesting applications:

1. vascoda and Sowiport

vascoda is a German scholarly portal for scientific information, which combines different discipline-based portals into an overarching one. As each portal uses different terms, a considerable mapping exercise was required. Sowiport is a social science portal based on the same principles; both have been developed by the GESIS – Leibniz Institute for the Social Sciences in Bonn, Germany (Krause, 2008).

2. JeromeDL

JeromeDL describes itself as a "social semantic digital library", and was developed by Semantic Web researchers at the Digital Enterprise Research Institute, based in Ireland. It includes a high degree of personalization, for example the ability to annotate items, create a user profile, and have personal bookshelves.

Each resource uses three types of metadata:

bibliographic,
structural, and
community.

The structural metadata deliver information about content (for example chapters of a book), recorded in a structured ontology. Librarians can describe resources using a range of controlled vocabularies. New concepts can be suggested, and there is a facility for user tagging.

An interesting feature of JeromeDL is the way it combines social and semantic technologies: users can share bookmarks and comments.

3. Talis Platform

Talis, whose library systems are widely used, has released Talis Platform, which has semantic features and can store and search both content and RDF metadata.

The catalogue is at the heart of the library, and Burke (2009) sees it as benefiting from a more semantic approach: for example, there could be more genre, data and author information.

Papadakis et al. (2009) describe how librarians attempted to overcome keyword ambiguities through developments in their catalogue at Ionian University. The ontology is based on Library of Congress Subject Headings, and each term includes the opportunity to broaden or narrow the search.

The options appear as the user scrolls from left to right along the screen, and the idea is to bridge the information gap between the terms which the user employs and those used in the catalogue.

A neater manifestation of the same principle is seen in the application yufind, shown below in use by Yale University Library. This adopts filtering techniques to narrow searches. Thus, a search for John F. Kennedy received the following help with sifting through the 1,460 results:


Figure 4. Screenshot of Yale University Library's yufind

One of the technologies with the most direct relevance to the library community is Simple Knowledge Organization System (SKOS) which is an RDF-compliant language to represent knowledge-organization systems such as thesauri, classifications, and folksonomies. The Library of Congress has its own SKOS project, which involved adding metadata to all its subject headings so that these could be machine readable, and manipulated in different ways.

At the ZBW German National Library of Economics, which is the world's largest economics library, librarians have used SKOS to improve their thesaurus, in terms both of its usability in house and its external re-usability. This task involved mapping a very elaborate thesaurus with 500 subject categories and 6,000 descriptors against SKOS, as a way of helping people through the search process (Borst and Neubert, 2009).
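
The broader and narrower relations that SKOS records between concepts can be sketched with a simple lookup table. The thesaurus fragment below is invented for the illustration and is not taken from the ZBW thesaurus or the Library of Congress Subject Headings; the function names are likewise hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
# Hypothetical thesaurus fragment: each concept points to its broader concept,
# in the spirit of the skos:broader relation.
BROADER = {
    "Monetary policy": "Economic policy",
    "Fiscal policy": "Economic policy",
    "Economic policy": "Economics",
}


def broader_chain(concept, broader):
    """Follow broader links upwards and return the concepts met along the way."""
    chain = []
    while concept in broader:
        concept = broader[concept]
        chain.append(concept)
    return chain


def narrower(concept, broader):
    """Return the concepts whose broader concept is the one given."""
    return sorted(child for child, parent in broader.items() if parent == concept)


print(broader_chain("Monetary policy", BROADER))
print(narrower("Economic policy", BROADER))
```

A catalogue interface could use the first function to offer ways of widening a search and the second to offer ways of narrowing it.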

The potential of the Semantic Web

These examples, however, are fairly scattered and on the whole small scale. Major semantic applications for the library remain at the level of potential.

Joint (2008) suggests that the ideal use of the Semantic Web lies in digital repositories, and heritage library websites. In contrast to licensed databases, which are inevitably restricted, libraries have a clear mandate to expose the content of their repositories as widely as possible, especially as more frequently cited research can improve research metrics. Semantic techniques can inevitably increase search and discovery and allow more content to surface.

We saw above how one application, JeromeDL, combines elements of the social and Semantic Web. The two are also linked by Mary Burke (2009), who sees the library of the future as combining features of the Semantic Web, with its standardization, with those of Library 2.0, with its informality and its interactivity. In other words, the best of both worlds: a Web 2.0 interface, and a semantic back end.

Library consultant Ellyssa Kroski ended her presentation at Online Information 2009 on "Next-generation libraries" by referring to semantic technologies: she believes that librarians are eminently suited to the task of organizing and sifting through the immense information overflow currently available to us on the Web (Kroski, 2009).

And next-generation librarians and libraries are always on the lookout for the next big thing in order to be ahead of the game. The Semantic Web may not be the only thing, but it is maturing and the number of applications is increasing. It has plenty of potential for the LIS community: improving the user's search experience by providing an intermediary between user and library terminology, richer bibliographic description, greater exposure of content, and ability to link to external applications.

Initially, many librarians fought shy of the Web, seeing it as an enemy, with some observers predicting it would contribute to their demise. That didn't happen, but the lesson is, you can't fight the Web, so embrace it. Embracing it means not merely providing lists of reliable websites, but following its trends. Librarians are often enthusiastic users of 2.0 and they need to be equally enthusiastic users of semantic techniques.

What many librarians feared about the Web was the fact that it was so haphazard. Now the Web has bounced the ball straight into their court. It's an opportunity not to be missed.

Resources

Work on the Semantic Web has given rise to the new discipline of Web Science, which itself draws in diverse disciplines such as mathematics, physics, engineering, psychology, sociology, biology, computer science, web engineering, artificial intelligence, law, economics and politics.

Organizations

There are centres for studying the Semantic Web at:

Web Science Research Initiative (WSRI)
School of Electronics and Computer Science
University of Southampton
http://www.ecs.soton.ac.uk

Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway
http://sw.deri.ie

Publications

The following publications are recommended as good introductions to the Semantic Web:

Greenberg, J. and Méndez, E. (Eds) (2007), Knitting the Semantic Web, Haworth Information Press, New York, NY (simultaneously published as Cataloging & Classification Quarterly, Vol. 43 No. 3-4).

Hitzler, P., Krötzsch, M. and Rudolph, S. (2009), Foundations of Semantic Web Technologies, Chapman & Hall/CRC.

Websites

Semanticweb.org is a portal created by Markus Krötzsch for information on research and development in the Semantic Web: http://semanticweb.org/wiki. It provides detailed introductions to a number of areas.

The W3C pages on the Semantic Web can be found at:
http://www.w3.org/standards/semanticweb

O'Reilly's XML.com provides a very clear introduction to RDFa:
http://www.xml.com/pub/a/2007/02/14/introducing-rdfa.html

References

Altova (2009), "What is the Semantic Web?", available at: http://www.altova.com/semantic_web.html [accessed December 21 2009].

Borst, T. and Neubert, J. (2009), "Case study: publishing STW thesaurus for economics as linked open data", W3C Semantic Web Use Cases and Case Studies, available at: http://www.w3.org/2001/sw/sweo/public/UseCases/ZBW [accessed December 18 2009].

Burke, M. (2009), "The Semantic Web and the digital library", Aslib Proceedings, Vol. 61 No. 3, pp. 316-322.

Bygstad, B., Ghinea, G. and Klæboe, G.T. (2009), "Organisational challenges of the Semantic Web in digital libraries: a Norwegian case study", Online Information Review, Vol. 33 No. 5, pp. 973-985.

Davies, J. (2009), "Industrial applications of semantic technology", Proceedings of the Online Information Conference 2009, Olympia, London.

Hall, W. and Shadbolt, N. (2009), "The Semantic Web revolution – unleashing the world's most valuable information", opening keynote address to Online Information Conference, Olympia, London, December 1 2009.

Joint, N. (2008), "The practitioner librarian and the Semantic Web: ANTAEUS", Library Review, Vol. 57 No. 3, pp. 178-186.

Krause, J. (2008), "Semantic heterogeneity: comparing new Semantic Web approaches with those of digital libraries", Library Review, Vol. 57 No. 3, pp. 235-248.

Kroski, E. (2009), "Next-generation libraries", presentation to Online Information Conference, Olympia, London, December 2 2009.

Krötzsch, M. (2008), "RDF", Semantic Web, available at: http://semanticweb.org/wiki/RDF [accessed December 21 2009].

Macgregor, G. (2008), "Introduction to a special issue on digital libraries and the Semantic Web: context, applications and research", Library Review, Vol. 57 No. 3, pp. 173-177.

Papadakis, I., Stefanidakis, M. and Tzali, A. (2009), "Semantic navigating an OPAC by subject headings meta-information", The Electronic Library, Vol. 27 No. 5, pp. 779-791.

Peterson, D. (2008), "President Obama uses RDFa", Sitepoint, available at: http://www.sitepoint.com/blogs/2009/01/29/president-obama-uses-rdfa [accessed December 17 2009].

Peterson, D. (2009), "Obama's Groundbreaking use of the Semantic Web", Sitepoint, available at: http://www.sitepoint.com/blogs/2009/03/19/obama-groundbreaking-use-semantic-web [accessed December 17 2009].

Prasad, A. and Madalli, D. (2008), "Faceted infrastructure for semantic digital libraries", Library Review, Vol. 57 No. 3, pp. 225-234.

W3C (2009), "Vocabularies", available at: http://www.w3.org/standards/semanticweb/ontology [accessed December 17 2009].