
Understanding the Web and Evaluating Web Resources

By Kevin Gilbertson and Rebecca Caudle

Evolution of the Web

Once upon a time, the Internet did not exist and there was no World Wide Web.

It is difficult, perhaps, to imagine now that the Web is so ubiquitous, so central to our lives, but there was a time before smartphones, before laptops, before personal computers, before computers, before the Web.

So, how did the Web become the Web?

The idea dates to 1945, when Vannevar Bush envisioned an electromechanical machine called the Memex, a combination of “memory” and “index”, which could make and follow links between documents. This hypothetical system, built on a compressed library collection (which at the time meant microfilm), would provide access to the records of the world.

The next step in the evolution of the Web came in 1989 when Tim Berners-Lee proposed an information management system that would be used to share research among university scholars. At the end of 1990, the system became a reality as Berners-Lee sent the first message via HTTP (Hypertext Transfer Protocol). HTTP would become the foundation of the Web.

In the years following 1990, the Web and related technology exploded. The first text-based Web browsers appeared. These first browsers, the first interfaces to the Web, did not support multimedia – no images, no videos, only text. The introduction of multimedia Web browsers was a significant step in the advance of the Web. These browsers featured embedded images, buttons, and more visual interactions and encouraged access to the new information network. In 1996, with reliable multimedia browsers and growing numbers of Internet connections, the commercialization of the Web began.

During this time, in 1994, the W3C, the World Wide Web Consortium, was formed. The W3C, composed of business, university, and international members, convened to develop standards for the Web and to ensure compatibility among the browser platforms of different vendors.

Innovations in technology continued to advance the Web. In the early 2000s, the popularity of Web 2.0 technologies brought new dimensions to the Web. Moving from the ‘read-only’ nature of Web 1.0, where an average reader could only read what was on a website, to the ‘read-write’ platform of Web 2.0, the Web shifted radically to encourage user-generated content in the form of blogs, social networks, and product reviews. From the introduction of social networks and sophisticated Web browsers to high speed Internet access and mobile smartphones, the Web came to play an increasingly central role in our daily activities.

Looking to the next stages in the evolution of the Web, we see increasingly sophisticated data formats and the increasing importance of metadata – structural and descriptive information about included content – in the development of the Semantic Web. The main idea behind the Semantic Web is to provide more meaning to the information of the Web to enable machines to play a greater role in finding, sharing, and digesting this information.

Over the years, the Web has become a powerful and sophisticated medium to share ideas and information, and it continues to evolve with the creation of new technologies.

Search Engines

In the early days of the Web, there were no search engines. Initially, there was a central directory listing of Web sites that were online. Other directories soon came online. At some point, however, maintaining such lists became unsustainable and a better solution was needed. Enter search engines.

It is important to note at the outset that, while search engines appear to search everything on the Web, large pockets of un-indexed content – for example, library databases where content is available only to subscribers – remain hidden and cannot be found using a general search engine. Specialized search engines – Google Scholar, for instance – can, in some cases, offer better access to this restricted content.

A search engine works by crawling the Web, indexing the content, and providing an interface to this index via a search algorithm. An algorithm is simply a set of rules governing how data is processed and how relevance is calculated. First, an automated program – a Web spider, Web crawler, or, more generally, Web robot – follows links from page to page, retrieving and storing the content of each page it visits. Next, this content is indexed. Indexing analyzes the content and extracts various elements from the data, such as titles and names. Finally, this index, which allows more efficient access to the stored data, is searched and results are returned. A search engine, in its simplest form, performs basic text retrieval: a full text search attempts to match the words given by the user to the words stored in the database index, and the results are organized by various ranking parameters.
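To make this crawl, index, and search cycle concrete, the following Python sketch walks through a miniature version of each step. The pages, URLs, and text are hypothetical and live in memory rather than on the real network, so the example is self-contained; an actual search engine performs the same steps at enormous scale and with far more sophistication.

    # A minimal sketch of the crawl -> index -> search cycle described above.
    # The "web" here is a hypothetical in-memory set of pages (no real network
    # requests), so the example stays self-contained and runnable.
    from collections import defaultdict

    # Hypothetical pages: URL -> (page text, outgoing links)
    PAGES = {
        "http://example.com/a": ("the memex linked documents by association",
                                 ["http://example.com/b"]),
        "http://example.com/b": ("web crawlers follow links and store page content",
                                 ["http://example.com/c"]),
        "http://example.com/c": ("an index allows efficient full text search", []),
    }

    def crawl(start_url):
        """Follow links from a start page; return {url: text} for every page visited."""
        seen, queue, store = set(), [start_url], {}
        while queue:
            url = queue.pop()
            if url in seen or url not in PAGES:
                continue
            seen.add(url)
            text, links = PAGES[url]
            store[url] = text          # the crawler retrieves and stores page content
            queue.extend(links)        # ...then follows the links it finds
        return store

    def build_index(store):
        """Inverted index: each word maps to the set of pages that contain it."""
        index = defaultdict(set)
        for url, text in store.items():
            for word in text.lower().split():
                index[word].add(url)
        return index

    def search(index, query):
        """Basic full-text retrieval: rank pages by how many query words they contain."""
        scores = defaultdict(int)
        for word in query.lower().split():
            for url in index.get(word, set()):
                scores[url] += 1
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    pages = crawl("http://example.com/a")
    index = build_index(pages)
    print(search(index, "full text search"))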

Most search engines rank results based on popularity and relevance. How a search engine determines these rankings – its search algorithm – is often proprietary and can be quite complicated. Popularity is typically gauged by evaluating link relationships among sites. For example, a result may be judged more popular by a search engine based on the number of incoming links, links pointing to that page. Link analysis of this sort also depends on authority and hub values, where links from certain sites, judged to be of high importance, are given greater weight. More advanced algorithms, which are increasing in complexity with the expansion of data on the Web, involve hundreds of other factors, including speed and trustworthiness, and attempt to refine a particular page’s importance within a particular set of results.
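As a rough illustration of this kind of popularity scoring, the sketch below counts incoming links and gives links from sites treated as more authoritative a larger weight. The link graph and the authority values are invented for the example; real ranking algorithms combine such link signals with many other factors.

    # A rough sketch of popularity scoring by counting incoming links, with links
    # from sites judged more authoritative counting for more. The link graph and
    # the authority weights are invented for illustration only.

    # Hypothetical link graph: page -> pages it links to
    LINKS = {
        "site-a": ["site-b", "site-c"],
        "site-b": ["site-c"],
        "site-c": ["site-a"],
        "site-d": ["site-c"],
    }

    # Hypothetical authority weights (a link from a trusted site counts for more)
    AUTHORITY = {"site-a": 3.0, "site-b": 1.0, "site-c": 1.0, "site-d": 0.5}

    def popularity_scores(links, authority):
        """Score each page by the weighted number of links pointing to it."""
        scores = {page: 0.0 for page in links}
        for source, targets in links.items():
            for target in targets:
                scores[target] = scores.get(target, 0.0) + authority[source]
        return scores

    for page, score in sorted(popularity_scores(LINKS, AUTHORITY).items(),
                              key=lambda item: item[1], reverse=True):
        print(page, score)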

With these advances in search algorithms and the increasing role of social networks, search analysis and ranking are often performed within the context of the current user (including, among other things, that user’s past behavior and search history). It has been argued that such results form a bubble with the user at the center, where differing viewpoints and challenging information are excluded. This effective isolation of a search user inside his or her existing ideological frame is called a filter bubble, and it can mean that different users no longer share the same information environment. Some search engines – DuckDuckGo, for example – do not collect this user profile data or exclude it from their algorithms.
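The toy example below suggests how this personalization can tilt a result list. The pages, topic labels, and weighting are entirely hypothetical: results whose topics match the user’s past clicks are boosted, so two people issuing the same query can see noticeably different orderings.

    # A toy illustration of personalized ranking and how a filter bubble can form.
    # The results, topic labels, and weighting below are hypothetical.

    results = [
        {"url": "news-site/left-analysis", "topic": "politics-left", "base_score": 0.8},
        {"url": "news-site/right-analysis", "topic": "politics-right", "base_score": 0.8},
        {"url": "reference-site/overview", "topic": "neutral", "base_score": 0.7},
    ]

    # Topics this particular user has clicked on before
    user_history = {"politics-left": 5, "neutral": 1}

    def personalized_rank(results, history, weight=0.1):
        """Boost each result by how often the user has engaged with its topic."""
        def score(result):
            return result["base_score"] + weight * history.get(result["topic"], 0)
        return sorted(results, key=score, reverse=True)

    for result in personalized_rank(results, user_history):
        print(result["url"])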

In this way, relying on search engines to deliver the best results becomes problematic and, while the Semantic Web holds promise, the current state of machine intelligence limits how search engines search. The human mind excels at matching concepts on a level beyond simple full text searching, while computers struggle with this kind of abstraction.

In addition to filter bubbles and the inherent limitations of full text indexing, the various practices of search engine optimization (SEO) can affect the quality of search results. SEO is the process of trying to increase the visibility of a site or page within a set of search results. In its most legitimate form, SEO is a set of guidelines for writing relevant content and following best practices of document coding to encourage keyword relevance and indexing activity. In its nefarious forms, often called spamdexing (see also Google bomb or Googlewashing), SEO employs a number of different tactics, including keyword stuffing, link spam, and cloaking, to manipulate a site’s ranking.
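To see why naive ranking invites this kind of manipulation, consider the small sketch below, which scores pages simply by counting occurrences of a query term. The page texts are invented; the keyword-stuffed page outranks the honest one, which is exactly the behavior modern algorithms try to detect and penalize.

    # A sketch of why keyword stuffing can fool naive ranking. A scorer that
    # simply counts how often a query term appears rewards a page that repeats
    # the term. The page texts below are invented for illustration.

    honest_page = "our shop sells handmade leather shoes and repairs shoes locally"
    stuffed_page = "shoes shoes shoes cheap shoes best shoes buy shoes shoes shoes"

    def naive_score(page_text, query_term):
        """Count raw occurrences of the query term (easily manipulated)."""
        return page_text.lower().split().count(query_term.lower())

    print(naive_score(honest_page, "shoes"))    # 2
    print(naive_score(stuffed_page, "shoes"))   # 8 -- the stuffed page "wins"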

As the Web continues to develop, our ability to evaluate websites and the information they present will only become more important.

How to Evaluate Websites

There is a lot of useful information on the Web, but since it is a public place and anyone with Internet access can create and publish information, it is important to critically evaluate what you find there. The Web is largely unregulated and unchecked, which places the burden on you to evaluate websites for quality. When you are researching on the Web, question the information you are viewing and take nothing at face value. Be a skeptic!

Using reliable and accurate websites will strengthen the argument of your research, but using information that is inaccurate or biased will weaken your paper.

Look at the following guidelines to help you evaluate websites:

Accuracy

The accuracy, or verifiability, of details is an important part of the evaluation process, especially when you are unfamiliar with the topic. The credibility of most research is established through the documentation of other sources; this is how an author shows that the information is more than just personal opinion or point of view.

  • Does the author mention other sources of information?
  • What type of other sites does the website link to? Are they reputable sites?
  • Can the background information used be verified?
  • Is the site a parody or satire of a real event? (For example, news stories found on The Onion).
  • Is the site advertising, persuading, or stating information objectively?
  • Does the page have a lot of grammatical errors?
  • What evidence is the author using to support their ideas? Does the site list citations or links to other resources?

Authority/Authorship

Look for the author or group responsible for creating the page, and try to find information about their background. This information allows you to judge whether the author’s credentials make them qualified to write on the topic. When the author of a website is an expert in the field, the site is a better source for your research because the author has more knowledge of and authority on the subject.

It’s also important to keep in mind that some websites may have an agenda or motivation for creating the page. They may be trying to sell a product or persuade you on a social or political issue.

  • Who created the page? What are their credentials/background? Are they an expert on the topic?
  • Has the author listed a contact email or phone number?
  • Why did the author create the page? Is the purpose to inform, sell, entertain, or persuade?
  • What domain is the website published under? (For example, .edu, .gov, or .com.)
  • Does the author stand to gain anything by convincing others of their point?
  • Are there a lot of advertisements on the site?

Date/Currency

Look to see when the page was created or last updated. This will give you an idea of whether the author has maintained an interest in the page or has abandoned it.

  • How current is the information? When was the page created or last updated?
  • Are there any dead links on the page that no longer work?

Reliable Websites

Since there is so much information on the Internet, with a wide range of quality, it is necessary to develop skills to evaluate what you find online. One way to judge a website’s reliability is to look at the domain where it is hosted. For example, websites that have .edu in the URL are hosted by educational institutions (see Table 1).

Domain   Example             Institution
.gov     www.census.gov      government
.edu     www.wfu.edu         education
.org     www.amnesty.org     non-profit organization
.com     www.amazon.com      commercial

Table 1: Domain Names

Often, websites ending in .edu and .gov are more authoritative because certain qualifications must be met in order to use those domains, whereas anyone can use the others. However, care still needs to be taken when evaluating the content on these domains. A governmental or educational domain only increases the likelihood that the content is reliable; the domain alone is not proof of reliability.
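As a small practical aside, the following Python sketch pulls the top-level domain out of a URL using only the standard library. The URLs are examples, and, as noted above, the domain is only a cue, never proof of reliability.

    # A small sketch of checking a site's top-level domain, one of the cues in
    # Table 1. The URLs are examples; the domain only suggests reliability.
    from urllib.parse import urlparse

    def top_level_domain(url):
        """Return the last label of the host name, e.g. 'edu' or 'gov'."""
        host = urlparse(url).hostname or ""
        return host.rsplit(".", 1)[-1]

    for url in ["https://www.census.gov/topics", "https://www.wfu.edu/library",
                "https://www.example.com/article"]:
        print(url, "->", top_level_domain(url))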

When you are viewing information on a page, it is important to think about the potential motivations or biases associated with the site. Some websites are motivated to make money, so they may present information as a sales pitch rather than presenting it objectively. You may come across other websites that are biased for political reasons. The authors of such websites may emphasize certain information to influence your opinion about the topic.

Finding Really Current Information

In some fields, knowledge and research are constantly changing and being updated, so it is important to make sure that the site you are viewing is either up to date or was published at a time relevant to the topic of your research. How current the information needs to be depends on your topic. For example, information about a healthcare or technology topic from a website that is more than 10 years old is likely to be out of date because of advances and changes in the field, but information on a history or literature topic from a website that is 10 years old will likely still be accurate today.

When you need current information on a topic, you can easily limit your search by date using the advanced search features. In Google, you can limit results to: anytime, past 24 hours, past week, past month, or past year.

Google Alerts is a great tool for staying current on particular topics. You create a Google Alert with your search terms, and Google monitors the Web for you, sending you an email when there are new results for those terms. You can manage your alerts by deciding how often you would like to receive them: as it happens, once a day, or once a week.

RSS (Really Simple Syndication) feeds are another way to stay current on topics. You use them by first subscribing to an RSS aggregator, such as Google Reader. Then click the RSS icon or link on a Web page to add that page to your aggregator. The aggregator automatically checks the pages you have added for new information and downloads the new content into your reader. This allows you to see new content from many Web pages without having to visit each individual site.
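Behind the scenes, an aggregator simply downloads each feed and extracts its newest entries. The sketch below does this for a single RSS 2.0 feed using only the Python standard library; the feed address is a placeholder, so you would substitute the URL of a real feed.

    # A minimal sketch of what an RSS aggregator does behind the scenes: fetch a
    # feed and pull out the newest item titles and links. The feed URL below is
    # a placeholder; any RSS 2.0 feed should work.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.com/feed.rss"   # hypothetical feed address

    def latest_items(feed_url, limit=5):
        """Download an RSS feed and return (title, link) pairs for its newest items."""
        with urllib.request.urlopen(feed_url) as response:
            root = ET.fromstring(response.read())
        items = []
        for item in root.findall("./channel/item")[:limit]:
            title = item.findtext("title", default="(no title)")
            link = item.findtext("link", default="")
            items.append((title, link))
        return items

    for title, link in latest_items(FEED_URL):
        print(title, "-", link)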
