I noticed that all the bots in the bots section were written in perl. I like perl a lot, but when I just want to get something done I usually open up my rebol interpreter. It's more portable, and generally needs less code to get stuff done. So here's a rebol-bot for a change of pace. One large drawback is that, rebol being a new language, there are few resources for learning it. The tutorials on the rebol website are helpful, but don't go into enough detail for me. Another place to look is rebol.org, a new script repository. That should be enough to get you started, so I won't be going over all the basics here. The complete source code to the bot is at the end.
Oh, I know my programming style sucks; I probably should have just stuck the bosch function in the while loop, but it's clearer to me this way. Apologies to the confused ones.
Some people have shied away from rebol because of its lack of regular expressions, but everything that regexes can do can be accomplished with its parse rules. These are just another format for expression matching. There is a great tutorial on the rebol site that shows how to use them. So you can follow what goes on in this essay, here is a summary of the rules:
    |        - specify alternate rule
    [block]  - sub-rule grouping
    none     - match nothing (catch on no match)
    12       - repeat pattern 12 times
    1 12     - repeat pattern 1 to 12 times
    0 12     - repeat pattern 0 to 12 times
    some     - repeat one or more times
    any      - repeat zero or more times
    skip     - skip any character (or chars if repeat given)
    to       - advance input to the given string (or char)
    thru     - advance input thru the given string (or char)
    copy     - copy the next match sequence to a variable
    (paren)  - evaluate a REBOL expression
    word     - look-up value of a word
    word:    - mark the current input series position
    :word    - set the current input series position
    'word    - (reserved for dialecting support)
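To see a few of these in action, here is a tiny throwaway rule (my own example, not part of the bot) that uses thru, copy, and to in combination to pull the host out of a url string:

```rebol
url-string: "http://www.rebol.com/index.html"
parse url-string [thru "//" copy host to "/" to end]
print host  ;should print: www.rebol.com
```

thru "//" skips past the scheme, copy grabs everything up to the next "/", and to end lets the rest of the input match so the parse succeeds.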
;these are the url(s) you want the spider to start the search on...
urls: [
    http://www.foo.com
    http://www.ceb.org/bar.htm
]
;how many levels deep you want the spider to go...the # of links increases
;almost exponentially, so watch out!
deep: 4
;a block of words to search for.
keywords: ["fravia" "REBOL"]
summary_size: 100
urls is a block of urls that you want to start your search on. deep is how many levels you want to search. For instance, if bar.htm and foo.com had links to 7 pages altogether, then those 7 pages would be the next level, and all their links would be searched next.
level: 0
format: reduce [<html> <head> <title> "Search results from " now/date
    {</title></head><body bgcolor=#C0C0C0 text=#001010 vlink=#405040><center>
    <h1>spider_search</h1>by sonofsamiam<table border=1>
    <th bgcolor=#ff0000>url<th bgcolor=#ff0000>keyword<th bgcolor=#ff0000>count<th bgcolor=#ff0000>summary}
]
links: [] ;block to hold the links.
db: []    ;block to store the database for sorting.
out: []   ;block to store the outputted html
;this is the sample html parser from the rebol home page. works well for me!
html-code: [
    copy tag ["<" thru ">"] (append tags tag) |
    copy txt to "<" (append text txt)
]
html-code is a parse rule. It searches for all the < characters and puts all the tags in one block and all the text in another. It doesn't handle weird html, with tags inside tags and such. It will still work fine, but the summary might be lacking.
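Here is how the rule behaves on a scrap of html (my own sketch; as bosch does below, you have to set up the tags and text series first):

```rebol
tags: make block! 10
text: make string! 100
page: {<html><b>hello</b> world</html>}
parse page [to "<" some html-code]
;tags should now hold the four tags ("<html>" "<b>" "</b>" "</html>")
;and text should hold "hello world"
```

The parse itself fails at the end (there is no final "<" for the text branch to advance to), but the side effects in the parens have already collected everything we care about.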
;...Hieronymous Bosch...
;this function slurps the data from the page
bosch: func [page url][
    tags: make block! 100
    text: make string! 8000
    parse page [to "<" some html-code]
    ;get the links
    foreach tag tags [
        if parse tag ["<a" thru "href="
            [{"} copy link to {"} | copy link to ">"]
            to end
        ][append links link]
    ]
    foreach keyword keywords [
        c: 0
        a: text
        while [a: find/tail a keyword][c: c + 1]
        either (c = 0) [
            links: copy [] ;copy, so the same literal block isn't reused
        ][
            insert/only db reduce [c url keyword copy/part text summary_size]
        ]
    ]
]
Here is the function (bosch) that is called with the contents of each page. It grabs all the links, searches for the keywords, and then sticks the info into a database, stored in db.
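The counting trick inside bosch is worth noting on its own: find/tail in a while condition walks a through the text, bumping c once per hit. find on strings is case-insensitive by default, so "rebol" also counts "REBOL". A standalone sketch:

```rebol
c: 0
a: "rebol is small, REBOL is fun"
while [a: find/tail a "rebol"][c: c + 1]
print c  ;should print: 2
```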
;!_!_!_this_is_where_it_starts_!_!_!
while [level <= deep][
    foreach url urls [bosch read url url]
    urls: links
    links: copy [] ;copy, or the same literal block gets reused next pass
    level: level + 1
]
;sort & format
db: sort db
foreach x db [
    foreach [c u k t] x [
        insert out reduce [<tr> <td> u <td> k <td> c <td> t newline]
    ]
]
insert out format
append out [</table> </center> "thanx for using sonofsamiam's spider!"
    </body> </html>
]
;write the html file
write %spider.htm out
q
I think all this is pretty self-explanatory. The while loop controls what pages it searches, and the rest formats the data into an html page. Then out is written to a file and the interpreter exits. It's speedy as hell, and it's helped my searching a huge amount. I put in what I'm looking for in Altavista, and then search those results with my spider. You find info much quicker and easier this way.
Here is a sample output page:
| url | keyword | count | summary | 
|---|---|---|---|
| rt_bot1.htm | fravia | 9 | rt_bot1.htm The HCUbot: a simple Web Retrieval Bot in Perl The HCUbot: a simple Web Retriev | 
| hunt_01a.htm | fravia | 8 | Hunting Lesson I _____________________________________________________________________ ®—>>> | 
Now, this robot is very simple. If it comes across a bad url, it can die. Also, the search is limited to single words, no boolean operators or anything. What you should do is make it 'smart.' This isn't hard, if you know a little rebol. Consider using parse rules for a search string. Anyway, I hope you enjoy this little spider. I've had a lot of fun with it, and I'd be interested in any comments you have on it.
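As a starting point for that 'smarter' search string, a parse rule can already express a crude ordered-AND query. This hypothetical rule (my own, not in the bot) only succeeds on text that mentions "fravia" somewhere before "REBOL":

```rebol
match-rule: [thru "fravia" thru "REBOL" to end]
if parse text match-rule [print "page matches"]
```

Build the rule block up from the user's search string and you have the beginnings of a query language.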
.~the full code~.
Here is the full source code. It's very small; as you can see, most of the space is taken up with html-formatting. REBOL is pretty efficient.
REBOL[
    Title:  "spider.r"
    Author: "sonofsamiam"
    Home:   http://sonofsamiam.tsx.org/
    Date:   19-Sep-1999
    Purpose: {
        A helpful little web-indexing search bot.
        Outputs sorted & html-formatted.
    }
    Comment: {
        I curbed my usual programming style of cramming the entire
        script on 5 lines :p I figure most of the readers won't be
        especially familiar w/ rebol, so i went for clarity.
    }
]
secure none
urls: [
    http://www.rebol.com
    %rt_bot1.htm
    %hunt_01a.htm
]
deep: 4
keywords: ["fravia" "REBOL"]
summary_size: 100
level: 0
format: reduce [<html> <head> <title> "Search results from " now/date
    {</title></head><body bgcolor=#C0C0C0 text=#001010 vlink=#405040><center>
    <h1>spider_search</h1>by sonofsamiam<table width="100%" border=1>
    <th bgcolor=#ff0000>url<th bgcolor=#ff0000>keyword<th bgcolor=#ff0000>count<th bgcolor=#ff0000>summary}
]
links: []
db: []
out: []
html-code: [
    copy tag ["<" thru ">"] (append tags tag) |
    copy txt to "<" (append text txt)
]
bosch: func [page url][
    tags: make block! 100
    text: make string! 8000
    parse page [to "<" some html-code]
    foreach tag tags [
        if parse tag ["<a" thru "href="
            [{"} copy link to {"} | copy link to ">"]
            to end
        ][append links link]
    ]
    foreach keyword keywords [
        c: 0
        a: text
        while [a: find/tail a keyword][c: c + 1]
        either (c = 0) [
            links: copy [] ;copy, so the same literal block isn't reused
        ][
            insert/only db reduce [c url keyword copy/part text summary_size]
        ]
    ]
]
while [level <= deep][
    foreach url urls [bosch read url url]
    urls: links
    links: copy [] ;copy, or the same literal block gets reused next pass
    level: level + 1
]
db: sort db
foreach x db [
    foreach [c u k t] x [
        insert out reduce [<tr> <td> u <td> k <td> c <td> t newline]
    ]
]
insert out format
append out [</table> </center> "thanx for using sonofsamiam's spider!"
    </body> </html>
]
write %spider.htm out
q
keep it real...