Website structure mining: increasing website visibility
The Web is an enormous set of documents connected through hypertext links created by designers of websites. Publishing on the Web, however, is more than just setting up a page on a site; it usually also involves linking to other pages on the Web.
The growing volume of data on the Web offers a wealth of information that can be processed to discover useful knowledge. This trend has led to the emergence of “web mining” as a discipline.
Broadly speaking, web mining can be defined as the discovery and analysis of useful information from the World Wide Web. It is a very active research field that involves the application of data mining techniques to the content, structure and usage of web resources.
Although it derives from data mining, web mining has many unique characteristics. For instance, the sources of web mining are web documents, which can be represented as a directed graph consisting of document nodes and hyperlinks. Whereas the source of data mining is confined to structured data in databases, web mining can identify different patterns by considering the content of documents, the structure given by hyperlinks, or the way in which web pages are browsed.
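The directed-graph view described above can be sketched in a few lines of Python. This is only an illustration: the page names and links are hypothetical, and the adjacency mapping is one of several reasonable representations.

```python
from collections import defaultdict

# Hypothetical pages and hyperlinks; not taken from any real site.
links = [
    ("index.html", "products.html"),
    ("index.html", "about.html"),
    ("products.html", "index.html"),
    ("about.html", "contact.html"),
]

def build_graph(edges):
    """Return an adjacency mapping: page -> set of pages it links to."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target]  # touch the key so pages without outgoing links are nodes too
    return dict(graph)

site_graph = build_graph(links)
```

Each key is a document node and each member of its set is a hyperlink pointing away from it, so the direction of every link is preserved.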
There are three common areas of web mining: content mining, structure mining and usage mining.
- Web content mining (WCM) deals with knowledge discovery in web content, including text, hypertext, images, audio and video. Recent advances in multimedia data mining promise to extend its reach to images, sound and video.
- Web structure mining (WSM) usually operates on the hyperlink structure of web pages. WSM focuses on sets of pages, ranging from a single website to the Web as a whole. WSM exploits the additional information that is contained in hypertext.
- Web usage mining (WUM) focuses on records of the requests made by visitors to a website, most often collected in a web server log. The content and structure of web pages, and in particular those of one website, reflect the intentions of the authors and designers of the pages, and the underlying information architecture.
Web structure mining
The challenge for WSM is to deal with the structure of the hyperlinks within the Web itself. The growing attention paid to web mining has renewed interest in link analysis, which draws on hypertext and web mining, relational learning and inductive logic programming, and graph mining. Evaluating and improving link structure is crucial to understanding the overall structure of a website and discovering where information is concentrated or missing.
Several lines of research can be distinguished within link analysis. Link-based classification deals with predicting the category of a web page, based on the words that occur on the page, the links between pages, anchor text, HTML tags and other attributes found on the page. Google's PageRank algorithm is one existing metric for evaluating web pages, and it is an important component of Google's search engine.
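To make the idea of a link-based metric concrete, here is a minimal sketch of the textbook PageRank iteration. It is a simplified illustration, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links and the example graph are all assumptions made for this sketch.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `graph` maps each page to the set of pages it links to; every link
    target is assumed to appear as a key as well.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # A page splits its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical four-page site: every page links back to "home".
example = {"home": {"a", "b"}, "a": {"home"}, "b": {"home"}, "c": {"home"}}
ranks = pagerank(example)
top_page = max(ranks, key=ranks.get)
```

In this example "home" ends up with the highest score because every other page links to it, while "c", which no page links to, keeps only the baseline share.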
Social network analysis
Social network analysis (SNA) has frequently been used for link analysis. SNA is a set of research procedures for identifying structures in social systems based on the relations among the system components (also referred to as nodes).
SNA arose from using mathematical models of graphs applied in the analysis of social relationships between actors. In sociology, actors typically model individuals, groups, and occasionally autonomous devices. A social network consists of a finite set or sets of actors and the relation or relations defined on them. It is a complex system that is characterized by a high number of dynamically interconnected entities, and connects entities in any type of link that implies a peer-to-peer relationship. SNA may be viewed as a broadening or generalization of standard data analytic techniques and applied statistics that usually focus on observational items and their characteristics.
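As an illustration of how SNA treats pages as actors and hyperlinks as relations, the sketch below computes normalized in-degree and out-degree centrality. Degree centrality is a standard SNA measure chosen here for illustration; the text above does not name the specific measures used, and the example graph is hypothetical.

```python
def degree_centrality(graph):
    """Return normalized in- and out-degree centrality for each node.

    `graph` maps each page to the set of pages it links to; every link
    target is assumed to appear as a key as well.
    """
    n = len(graph)
    in_degree = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Normalize by the largest possible degree, n - 1.
    scale = 1.0 / (n - 1) if n > 1 else 0.0
    return {
        page: {"in": in_degree[page] * scale, "out": len(graph[page]) * scale}
        for page in graph
    }

# Hypothetical four-page site.
network = {"home": {"a", "b"}, "a": {"home"}, "b": {"home"}, "c": {"home"}}
centrality = degree_centrality(network)
```

A high in-degree value marks a page that many others point to, while a high out-degree value marks a page that acts as a hub of outgoing links.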
Although alternatives such as content-based analysis exist, the relative advantage of hyperlink analysis over other web methods is that it can examine the way in which websites form certain kinds of relations with others via hyperlinks. Combined with other web analyses, this information can contribute to understanding why and how certain types of content come to appear on websites.
Improving visibility, usability and accessibility
A website’s internal structure is strongly related to issues like accessibility and navigability through the site. Navigation features give the visitor easy access to information of interest, both internal and external to the site. Navigation is included as one of the design features of corporate websites, along with presentation, security, speed and tracking. The quality of a website also increases if the site is easily identifiable and accessible to users. In fact, accessibility is part of web assessment indexes.
The organization of a website also has important implications for usability, which refers to “how well and how easily a user, without formal training, can interact with an information system or website”. A better structure of web links enables visitors to navigate through websites more easily and gets them to the right place sooner. Although website usability can be evaluated with different approaches, such as cognitive walkthrough, Markov chains or survey methods, all of them are conditioned by the way the website is structured. Consequently, website designers should consider the identified structure patterns in order to improve usability.
Another important implication concerns search engine optimization (SEO), as a significant majority of online searches use a search engine as the initial point of entry. The three key methods for increasing visibility of websites are SEO, search engine advertising, and paid inclusion.
SEO involves adopting methods that improve the ranking of a website when a user types relevant keywords into a search engine; search engine advertising refers to buying display positions in the paid listing area of a search engine; and paid inclusion refers to paying search engine companies for the inclusion of the site in their organic listings. SEO is generally recognized as the most effective of the three, as searchers pay less attention to commercial content than they do to organic listings.
The structure of a site has a direct bearing on how well it can be perceived by search engines. Search engines send programs, known as “robots”, “spiders” or “crawlers”, to investigate the Internet and find out what is on sites. They also use algorithms to process the data returned by crawlers and to determine the relevance and popularity of the sites to be listed on their results pages. Site structure is important because it affects the way in which crawlers see the site and its content.
SEO services agree that if there are three or more links to each and every page from others on the site, then it can be said to be well structured for search engine optimization. If the structure of a site inhibits the number of internal links, then it is not well structured. Crawlers should be able to read as many pages as possible, and these links are the paths they will have to take. A site needs to be structured in such a way as to allow easy and readily available navigation to and from a site map. A user-friendly site structure generally works well for crawlers too, which means that ranking on a search engine results page can be improved with good site structure. The results page gives users their first glimpse of content after they have typed text into the query box and hit enter. Website structure can therefore be seen as one of the SEO tools at web designers' disposal for making websites visible on a search engine results page.
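The rule of thumb above can be checked mechanically. The following sketch, which assumes the site's internal links are already available as a graph, flags pages that receive fewer than three internal links and pages that a crawler following links from the start page would never reach. The function name, the threshold parameter and the example site are hypothetical.

```python
from collections import deque

def audit_structure(graph, start, min_inbound=3):
    """Flag weakly linked and unreachable pages.

    `graph` maps each page to the set of internal pages it links to;
    every link target is assumed to appear as a key as well.
    """
    # Count how many other pages link to each page.
    inbound = {page: 0 for page in graph}
    for source, targets in graph.items():
        for target in targets:
            if target != source:
                inbound[target] += 1
    # Breadth-first traversal from the start page, mimicking a crawler.
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph[page]:
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    weak = sorted(page for page in graph if inbound[page] < min_inbound)
    unreachable = sorted(page for page in graph if page not in depth)
    return weak, unreachable, depth

# Hypothetical site: four well-connected pages and one page nobody links to.
site = {
    "home": {"a", "b", "c"},
    "a": {"home", "b", "c"},
    "b": {"home", "a", "c"},
    "c": {"home", "a", "b"},
    "orphan": {"home"},
}
weak_pages, unreachable_pages, click_depth = audit_structure(site, "home")
```

Here "orphan" is reported both as weakly linked and as unreachable, since no other page links to it. The `click_depth` mapping records how many links a crawler must follow from the start page to reach each page.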
By using SNA, website designers can identify ways to improve website usability, which in turn benefits website visitors. They can also check whether the structure of their website is as they intended it to be. Finally, websites should be structured so that search engines can browse their contents easily.
This is a shortened version of “Website structure mining using social network analysis”, which originally appeared in Internet Research, Volume 21 Number 2, 2011.
The authors are M.R. Martínez-Torres, Sergio L. Toral, Beatriz Palacios and Federico Barrero.