Daria Goetsch
Search Innovation
April 11, 2003
Automated search engine robots, sometimes called "spiders"
or "crawlers", are the seekers of web pages. How do they work?
What is it they really do? Why are they important?
With all the fuss about indexing web pages for search engine
databases, you'd think robots would be great and powerful
beings. Wrong. Search engine robots have only basic functionality,
comparable to that of early browsers, in terms of what they can
understand in a web page. Like early browsers, robots just can't
do certain things. Robots don't understand frames, Flash movies,
images or JavaScript. They can't enter password-protected areas
and they can't click all those buttons you have on your website.
They can be stopped cold while indexing a dynamically generated
URL and slowed to a stop with JavaScript navigation.
How Do Search Engine Robots Work?
Think of search engine robots as automated data retrieval programs,
traveling the web to find information and links.
When you submit a web page to a search engine at the "Submit
a URL" page, the new URL is added to the robot's queue
of websites to visit on its next foray out onto the web. Even
if you don't directly submit a page, many robots will find your
site because of links from other sites that point back to yours.
This is one of the reasons why it is important to build your
link popularity and to get links from other topical sites back
to yours.
When arriving at your website, the automated robots first check
to see if you have a robots.txt file. This file is used to tell
robots which areas of your site are off-limits to them. Typically
these may be directories containing only binaries or other files
the robot doesn't need to concern itself with.
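The robots.txt check described above can be sketched with Python's
standard urllib.robotparser module. This is only an illustration,
not how any particular search engine implements it: the robots.txt
contents, the directory names, the "ExampleBot" user agent and the
example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the directory names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /downloads/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved from a site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed robot name, not a real search engine robot.
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))
print(parser.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/search"))
```

With these rules, the first URL is allowed and the second is
off-limits, so a well-behaved robot would skip it.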
Robots collect links from each page they visit, and later follow
those links through to other pages. The entire World Wide Web
is made up of links, the original idea being that you could
follow them from one place to another. This is how robots
get around.
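As a rough sketch of this link-following behavior, the Python below
keeps a queue of URLs to visit, collects the links on each page,
and adds any new ones to the queue. To stay self-contained it
"crawls" a small in-memory dictionary instead of the real web; the
PAGES data, the example.com URLs and the function names are all
assumptions made for illustration, and real robots are far more
elaborate.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical three-page website standing in for the web.
PAGES = {
    "http://www.example.com/":
        '<a href="/about.html">About</a> <a href="/products.html">Products</a>',
    "http://www.example.com/about.html":
        '<a href="/">Home</a>',
    "http://www.example.com/products.html":
        '<a href="/about.html">About</a> <a href="/missing.html">Old page</a>',
}


def crawl(start_url, pages):
    """Visit pages in the order discovered, following collected links."""
    queue = deque([start_url])   # URLs waiting to be visited
    seen = {start_url}           # URLs already queued, to avoid loops
    visited = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue             # page could not be retrieved; skip it
        visited.append(url)
        collector = LinkCollector(url)
        collector.feed(html)
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("http://www.example.com/", PAGES))
```

Starting from the home page alone, the sketch reaches the other
two pages purely through links, which is why links pointing to
your site matter.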
The "smarts" about indexing pages online comes from
the search engine engineers, who devise the methods used to
evaluate the information the search engine robots retrieve.
Once introduced into the search engine database, the information
is available to searchers querying the search engine. When
a user enters a query, the search engine performs a number of
quick calculations so that it presents the set of results
most relevant to that query.
You can see which pages on your site the search engine robots
have visited by looking at your server logs or the results from
your log statistics program. Identifying the robots will show
you when they visited your website, which pages they visited
and how often they visit. Some robots are readily identifiable
by their user agent names, like Google's "Googlebot";
others are a bit more obscure, like Inktomi's "Slurp".
Still other robots may be listed in your logs that you cannot
readily identify; some of them may even appear to be human-powered
browsers.
Along with identifying individual robots and counting the number
of their visits, the statistics can also show you aggressive
bandwidth-grabbing robots or robots you may not want visiting
your website. In the resources section at the end of this article,
you will find sites that list names and IP addresses of search
engine robots to help you identify them.
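If you would rather look at the raw logs than a statistics program,
a short script can count robot visits by user agent name. The
Python sketch below assumes your server writes the common
"combined" log format; the sample log lines, IP addresses and user
agent strings are invented for illustration, and the list of robot
signatures is deliberately tiny.

```python
import re
from collections import Counter

# Substrings to look for in the user agent field, mapped to the engine.
# Only the two robots named in this article are listed here.
ROBOT_SIGNATURES = {"Googlebot": "Google", "Slurp": "Inktomi"}

# Assumes the "combined" access log format.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Made-up sample log lines; addresses and user agents are illustrative.
SAMPLE_LOG = [
    '192.0.2.10 - - [11/Apr/2003:10:00:00 -0700] "GET /robots.txt HTTP/1.0" '
    '200 120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [11/Apr/2003:10:00:02 -0700] "GET /index.html HTTP/1.0" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.20 - - [11/Apr/2003:11:30:00 -0700] "GET /index.html HTTP/1.0" '
    '200 5120 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com)"',
    '192.0.2.30 - - [11/Apr/2003:12:00:00 -0700] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]


def count_robot_visits(lines):
    """Count requests per search engine based on the user agent field."""
    visits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # line is not in the expected format
        agent = match.group("agent")
        for signature, engine in ROBOT_SIGNATURES.items():
            if signature in agent:
                visits[engine] += 1
                break
    return visits


print(count_robot_visits(SAMPLE_LOG))
```

Matching on the user agent alone will miss robots that disguise
themselves as browsers, which is where lists of known robot IP
addresses become useful.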
How Do They Read The Pages On Your Website?
When the search engine robot visits your page, it looks at the
visible text on the page, the content of the various tags in your
page's source code (title tag, meta tags, etc.), and the hyperlinks
on your page. From the words and the links that the robot finds,
the search engine decides what your page is about. Many factors
go into deciding what "matters", and each search engine has its
own algorithm for evaluating and processing the information.
Depending on how the robot is set up through
the search engine, the information is indexed and then delivered
to the search engine's database.
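To make this concrete, the Python sketch below pulls out the same
pieces of a page mentioned above: the title tag, the meta tags,
the visible text and the hyperlinks. It uses the standard
html.parser module. The sample page and the PageReader class are
hypothetical, and the sketch says nothing about how a real search
engine weighs what it finds.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Gathers the title, meta tags, visible text and links of a page."""

    SKIPPED = {"script", "style"}  # contents are not visible text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head><title>Acme Widgets</title>
<meta name="description" content="Hand-made widgets.">
<meta name="keywords" content="widgets, acme">
<script>document.write("menu");</script>
</head><body><h1>Welcome</h1><p>We sell widgets.</p>
<a href="/catalog.html">Catalog</a></body></html>"""

reader = PageReader()
reader.feed(SAMPLE_PAGE)
reader.close()

print(reader.title)
print(reader.meta)
print(reader.text_parts)
print(reader.links)
```

Note that the text written by the script never shows up in the
results, which mirrors the point that robots do not understand
JavaScript.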
The information delivered to the databases then becomes part
of the search engine and directory ranking process. When the
search engine visitor submits their query, the search engine
digs through its database to give the final listing that is
displayed on the results page.
The search engine databases update at varying times. Once you
are in the search engine databases, the robots keep visiting
you periodically, to pick up any changes to your pages, and
to make sure they have the latest info. How often you are
visited depends on how each search engine schedules its
visits.
Sometimes visiting robots are unable to access the website
they are visiting. If your site is down, or you are experiencing
huge amounts of traffic, the robot may not be able to access
your site. When this happens, the website may not be re-indexed,
depending on the frequency of the robot visits to your website.
In most cases, robots that cannot access your pages will try
again later, hoping that your site will be accessible then.
Resources
# # #
Daria Goetsch is the founder and Search Engine Marketing Consultant
for Search Innovation Marketing (www.searchinnovation.com),
a Search Engine Promotion company serving small businesses.
Besides running her own company, Daria is an associate of WebMama.com,
an Internet web marketing strategies company. She has specialized
in search engine optimization since 1998, including three years
as the Search Engine Specialist for O'Reilly & Associates,
a technical book publishing company.
Copyright © 2003 Search Innovation Marketing. All Rights
Reserved.
Permission to reprint this article is granted as long as
all text above this line is included in its entirety. We would
also appreciate your notifying us when you reprint it: please
send a note to reprint@searchinnovation.com.
Search Engine Robots
- How they work, what they do (Part 2)