How Searching the Web Works

Search engines are searchable databases built by software "robots", or spiders, that roam the World Wide Web looking for pages. The records these spiders return are compiled into a database that attempts to give a picture of what material is available on the web. When you search one of these databases, it returns the sites that most closely match your search query.

Building the Search Engine

  1. The search engine sends a spider (also called an agent) to a web page so that the page can be indexed and included in the search engine's database.
  2. The spider analyzes the page based on the ranking criteria described below and sends back a record for the page it has analyzed.
  3. The spider then follows the URLs it found on the page to more pages and creates records for those pages as well.
  4. And so on ... 
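The crawl loop described in these steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: FAKE_WEB is a made-up, in-memory stand-in for the web (a real spider would fetch pages over the network), and the names PageParser and crawl are hypothetical, not taken from any actual search engine.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical stand-in for the web: URL -> HTML source of that page.
FAKE_WEB = {
    "http://a.example/": '<title>Page A</title><a href="http://b.example/">B</a> '
                         '<a href="http://c.example/">C</a>',
    "http://b.example/": '<title>Page B</title><a href="http://a.example/">A</a>',
    "http://c.example/": '<title>Page C</title>cats and catalogs',
}


class PageParser(HTMLParser):
    """Collects the title, the text, and the outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.text.append(data)


def crawl(start_url, web=FAKE_WEB):
    """Visit start_url, record it, then follow its links (steps 1 to 4)."""
    database = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in database or url not in web:
            continue  # already recorded, or the page does not exist
        parser = PageParser()
        parser.feed(web[url])
        parser.close()
        # Step 2: the record the spider sends back for this page.
        database[url] = {
            "title": parser.title,
            "text": " ".join(parser.text).strip(),
            "links": parser.links,
        }
        # Step 3: follow the URLs found on the page.
        queue.extend(parser.links)
    return database
```

Calling crawl("http://a.example/") records all three pages, because page A links to B and C. A page that nothing links to would never be found this way, which is one reason no search engine covers the whole web.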


Using the Search Engine

  1. The user goes to the search engine and types in a query.
  2. The search engine analyzes the query and searches its database (its picture of what the web looks like) for matching pages. It then ranks the matches based on its own criteria and sends the user a ranked list of results.
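The key point is that the query is answered from the search engine's database, not from the live web. A minimal sketch of that lookup follows, assuming a tiny hypothetical database (PAGES), an assumed default of requiring every query word, and a deliberately crude ranking by word count. The function names are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical records, standing in for what the spiders sent back.
PAGES = {
    "page1": "heart of darkness by joseph conrad",
    "page2": "the heart is a muscle",
    "page3": "darkness at noon",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, pages, query):
    """Return pages containing every query word, best match first."""
    words = tokenize(query)
    if not words:
        return []
    # Assumption: all query words are required (an implicit AND).
    matches = set.intersection(*(index.get(w, set()) for w in words))

    def score(url):
        tokens = tokenize(pages[url])
        return sum(tokens.count(w) for w in words)

    # Highest score first; ties are broken alphabetically by URL.
    return sorted(matches, key=lambda url: (-score(url), url))
```

With index = build_index(PAGES), the call search(index, PAGES, "heart darkness") finds only page1, since it is the only record containing both words.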

Ranking Criteria

These are the criteria that search engines use to determine the order in which results are displayed. Every search engine is different: each uses different criteria and covers different sites. Each of the aspects below is given a level of importance, which helps the search engine return the sites most relevant to the search.

  • Domain Name
  • Meta Tags - keywords and description
  • Title of the page
  • Text, words on the pages
  • Proximity of searched words to each other
  • How often the page is referred to (linked to) by other indexed pages
  • Alt tags on images

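One way to picture "a level of importance" is as a weight attached to each criterion. The sketch below is purely illustrative: the weights are invented (search engines do not publish theirs), the field names are hypothetical, and proximity of the searched words is left out to keep the example short.

```python
import re

# Invented weights for illustration only; real engines keep theirs private.
WEIGHTS = {
    "domain": 3.0,    # domain name
    "title": 5.0,     # title of the page
    "meta": 2.0,      # meta keywords and description
    "text": 1.0,      # words on the page
    "alt": 0.5,       # alt tags on images
    "inlinks": 0.25,  # per link from other indexed pages
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def score_page(page, query_words):
    """Add up weighted hits for each criterion in the list above."""
    score = 0.0
    for field in ("domain", "title", "meta", "text", "alt"):
        words = tokenize(page.get(field, ""))
        hits = sum(words.count(w) for w in query_words)
        score += WEIGHTS[field] * hits
    score += WEIGHTS["inlinks"] * page.get("inlinks", 0)
    return score


def rank(pages, query):
    """Order pages from most to least relevant for the query."""
    query_words = tokenize(query)
    return sorted(pages, key=lambda p: score_page(p, query_words), reverse=True)


page_a = {"url": "a", "domain": "cats.example", "title": "Cats",
          "text": "all about cats", "inlinks": 2}
page_b = {"url": "b", "domain": "pets.example", "title": "Pets",
          "text": "cats and dogs and cats", "inlinks": 10}
```

Here rank([page_b, page_a], "cats") puts page_a first: page_b mentions cats more often and has more incoming links, but page_a has the word in its domain name and title, which these made-up weights count more heavily. A search engine with different weights could order the same two pages the other way around.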
Search Engine Issues:

What they don't Search

Interesting Info about Search Engines:

Things you need to know when using a particular Search Engine
  1. Size: this represents the number of web pages indexed.
  2. Updated: Web spiders take time to crawl the web and to send back their findings. This shows how current the search engine's database is.
  3. Phrase Searching: Phrase searching is used to find words together as a phrase. For instance, a search on "the heart of darkness" should return only pages that have those words next to each other in that order. However, "the" and "of" are often stop words, which are not searched. On a search engine that drops them, this will really be a search for "heart darkness".
  4. Requires word in search: The required words must be somewhere in the retrieved documents.
  5. Exclude word in search: Any retrieved document cannot have this word in it.
  6. Default multiword operator: When you just type a couple of words in with no Boolean operators the default operator will be used to connect these words together. Knowing the default operator can make a HUGE difference in the quality of your search.
  7. Look for either word: This means one or the other or both words must be in the document.
  8. Truncation: Truncation is used to match any ending on a word. For instance, the truncated word cat (on many search engines entered with a wildcard symbol, such as cat*) will find cats, category, catalog, etc. As this example shows, you should be careful when using truncation. Truncation is most useful for catching plurals.
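Several of the features above (phrase searching with stop words, required and excluded words, either word, the default operator, and truncation) can be illustrated with a short Python sketch. It is not how any particular search engine is implemented: the stop word list and all function names are assumptions made for the example.

```python
import re

# Hypothetical stop word list; each search engine has its own.
STOP_WORDS = {"the", "of", "a", "and"}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def has_phrase(doc, phrase):
    """Phrase search after stop words are dropped from both sides."""
    words = [w for w in tokenize(doc) if w not in STOP_WORDS]
    target = [w for w in tokenize(phrase) if w not in STOP_WORDS]
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def matches(doc, require=(), exclude=(), either=()):
    """Require word, exclude word, and look for either word."""
    words = set(tokenize(doc))
    if any(w not in words for w in require):
        return False
    if any(w in words for w in exclude):
        return False
    if either and not any(w in words for w in either):
        return False
    return True


def default_search(doc, query, operator="AND"):
    """Join plain query words with the engine's default operator."""
    words = set(tokenize(doc))
    present = [w in words for w in tokenize(query)]
    return all(present) if operator == "AND" else any(present)


def has_truncated(doc, stem):
    """Truncation: match any word that begins with the stem."""
    return any(w.startswith(stem.lower()) for w in tokenize(doc))
```

Because the stop words are dropped, has_phrase("heart and darkness", "the heart of darkness") is True even though the page never contains the phrase as typed. Likewise, default_search("cats and birds", "cats dogs") is False with the default operator AND but True with OR, which is why knowing the default operator matters so much.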

 

EXTRA FUN INFO!!
How the Web is like a Bowtie

A study of the web's structure, five times larger than any attempted previously, reveals that it isn't the fully interconnected network that we've been led to believe. The study suggests that the chance of being able to surf between two randomly chosen pages is less than one in four. Researchers from three Californian groups at IBM's Almaden Research Center in San Jose, the Altavista search engine in San Mateo and Compaq Systems Research Center in Palo Alto have analysed 200 million web pages and 1.5 billion hyperlinks. Their results, which will be presented next week at the World Wide Web 9 Conference in Amsterdam, indicate that the web is made up of four distinct components. A central core contains pages between which users can surf easily. Another large cluster, labeled 'in', contains pages that link to the core but cannot be reached from it. These are often new pages that have not yet been linked to. A separate 'out' cluster consists of pages that can be reached from the core but do not link to it, such as corporate websites containing only internal links. Other groups of pages, called 'tendrils' and 'tubes', connect to either the in or out clusters, or both, but not to the core, whereas some pages are completely unconnected. To illustrate this structure, the researchers picture the web as a plot shaped like a bow tie with finger-like projections. (from Nature 405, 113 (2000) © Macmillan Publishers Ltd. )
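The components described above can be worked out for a small link graph by asking which pages can be reached from the core and which pages can reach it. The sketch below is an illustration only and is not the researchers' method: the graph is invented, the function names are hypothetical, and it assumes you already know one page (core_page) that lies in the core. Tendrils, tubes, and unconnected pages are lumped together as "other".

```python
def reachable(graph, start):
    """Return every page that can be reached from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        page = stack.pop()
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def bow_tie(graph, core_page):
    """Split a link graph into core, in, out, and everything else."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    # Build the reversed graph: who links TO each page.
    reverse = {}
    for source, targets in graph.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    forward = reachable(graph, core_page)     # pages the core can reach
    backward = reachable(reverse, core_page)  # pages that can reach the core
    core = forward & backward
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": pages - forward - backward,
    }


# Invented example: page -> pages it links to.
LINKS = {
    "new": ["a"],        # links into the core, but nothing links to it
    "a": ["b"],
    "b": ["a", "corp"],
    "corp": ["corp2"],   # corporate site with only internal links
    "corp2": ["corp"],
    "island": [],        # completely unconnected
}
```

For this graph, bow_tie(LINKS, "a") places a and b in the core, new in the 'in' cluster, corp and corp2 in the 'out' cluster, and island among the rest. A user starting at corp can never surf to a, which is the kind of one-way structure the study describes.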


©1999 Southern Connecticut State University