The SygolBot Spider
Information about the SygolBot spider is also listed at http://www.robotstxt.org/wc/active/html/sygol.html, but that page is out of date. The correct information is:
| Name | SygolBot |
|---|---|
| Cover Page | http://www.sygol.com |
| Details Page | http://www.sygol.com/who.asp |
| Operational Status | active |
| Description | Very standard robot: it gets all words and links from a page, then indexes the former and stores the latter for further crawling. |
| Robot Purpose | indexing: gather pages for the Sygol search engine |
| Software Type | standalone |
| Software Platform | All Windows from 95 to latest. |
| Software Language | Visual Basic |
| Availability | none |
| Owner's Name | Giorgio Galeotti |
| Owner's Home Page | http://www.sygol.com |
| Owner's Email Address | i n f o @ s y g o l . c o m |
| Exclusion Protocol | yes |
| Exclusion Tag | SygolBot |
| Supports NOINDEX | Yes |
| Robot Host | http://www.sygol.com |
| HTTP From | No |
| HTTP User-Agent 1 | SygolBot http://www.sygol.com |
| HTTP User-Agent 2 | SygolBot http://www.sygol.it |
| History | It all started in 1999 as a hobby: an attempt to crawl the web and put together a good search engine with very few hardware resources. |
| Environment | Hobby |
| Identifier | sygol |
| Updated | Mon, 07 Jun 2004 14:50:01 GMT |
| Updated By | Giorgio Galeotti |
SYGOL respects the Robots Exclusion standard!
Before downloading any page from a domain, the spiders look for exclusions for that domain in the local cache (entries are at most 3 weeks old). If the cache holds nothing for the domain, the spiders look for a robots.txt file in the domain. If a robots.txt file is found and it disallows something, the spiders will:
1. Delete the old exclusions, if any, from the cache to make room for the newly found ones.
2. Store the new exclusions in the cache to minimize bandwidth.
3. Mark all URLs in the domain that were inadvertently spidered in the past as 'to be deleted from the index'.
4. Within minutes, delete from the index all the excluded pages marked in step 3.
If a robots.txt file is not found, or it is found but does not disallow anything, nothing is cached, so the spiders will look for the robots.txt file again before downloading each and every page in the domain. There are too many domains on the Internet to store this "allows everything" information for all of them in a database.
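The lookup order described above can be illustrated with a minimal Python sketch. This is not SygolBot's code (the robot is written in Visual Basic); the names `ExclusionCache`, `parse_disallows`, `exclusions_for`, and the `fetch_robots_txt` callback are hypothetical, and the parser is deliberately simplified: it merges the rules of every group that names `*` or SygolBot and treats each Disallow value as a plain path prefix.

```python
import time

CACHE_TTL_SECONDS = 21 * 24 * 3600  # cache entries are "at most 3 weeks old"


class ExclusionCache:
    """Hypothetical in-memory cache: domain -> (stored_at, disallowed prefixes)."""

    def __init__(self):
        self._entries = {}

    def get(self, domain, now):
        entry = self._entries.get(domain)
        if entry is None:
            return None
        stored_at, rules = entry
        if now - stored_at > CACHE_TTL_SECONDS:
            del self._entries[domain]  # expired: behave as if nothing was cached
            return None
        return rules

    def replace(self, domain, rules, now):
        # Assigning the key drops the old exclusions and stores the new ones.
        self._entries[domain] = (now, list(rules))


def parse_disallows(robots_txt, agent="sygolbot"):
    """Collect non-empty Disallow paths from groups naming '*' or the agent."""
    rules, applies, in_agent_lines = [], False, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agent_lines:  # a new group starts here
                applies = False
            in_agent_lines = True
            if value == "*" or agent in value.lower():
                applies = True
        else:
            in_agent_lines = False
            if field == "disallow" and applies and value:
                rules.append(value)
    return rules


def exclusions_for(domain, cache, fetch_robots_txt, indexed_paths, now=None):
    """Return (rules, paths_to_delete) for a domain, following the steps above.

    fetch_robots_txt(domain) is an assumed callback that returns the file's
    text, or None when no robots.txt exists.
    """
    now = time.time() if now is None else now
    cached = cache.get(domain, now)
    if cached is not None:
        return cached, []
    text = fetch_robots_txt(domain)
    rules = parse_disallows(text) if text else []
    if not rules:
        return [], []  # nothing is cached, so the next page triggers a new fetch
    cache.replace(domain, rules, now)
    stale = [p for p in indexed_paths if any(p.startswith(r) for r in rules)]
    return rules, stale
```

A caller would invoke `exclusions_for` once per page download. Only domains that actually disallow something occupy cache space, at the cost of re-fetching robots.txt for every page of the domains that do not.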
Notes
The exclusion metatags NOINDEX and NOFOLLOW are implemented as well.
"Disallow: /" means the entire site including the homepage.
To forbid SygolBot from spidering your entire site, put the following two lines in your robots.txt:
User-agent: SygolBot
Disallow: /
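To check the effect of these two lines before publishing them, one option is Python's standard `urllib.robotparser` module. The sketch below is only a verification aid on the site owner's side and says nothing about how SygolBot itself parses the file; the `is_allowed` helper and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two lines recommended above.
ROBOTS_TXT = """\
User-agent: SygolBot
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # SygolBot is excluded from the whole site, homepage included.
    print(is_allowed("SygolBot", "http://www.example.com/"))           # False
    print(is_allowed("SygolBot", "http://www.example.com/page.html"))  # False
    # Robots not named in the file are unaffected.
    print(is_allowed("OtherBot", "http://www.example.com/"))           # True
```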