The HCUbot: a simple Web Retrieval Bot in Perl

Bots section
20 July 1999	by deep (with corrections by [blue])
	Courtesy of Fravia's searchlores.org
A great essay for all those who want to begin their travels in the beautiful land of the bots. The "HCUbot" described here is a fully working automated bot that you'll be able (once you have learned some Perl, that is) to improve or modify at leisure. It's like Lego blocks: this is the first platform, and you'll build on it in all the colours you want.
	There is a crack, a crack in everything That's how the light gets in
Rating	(x)Beginner ( )Intermediate ( )Advanced ( )Expert

Simple Web Retrieval Bot in Perl

Written by deep

Introduction

I like Perl, I've been learning it for a while. It's a good language to learn - fairly straightforward, quick, very powerful and ideal for bots, cgi and the net generally! I hope that +fravia will publish this as part of the botstart section and that the bot section will start boting - it's my very first bot. Most of the source code is included - it's yours for a little work.

Tools required

Perl (standard on Linux and freely available) and various Perl modules (small, free downloads),
net access,
a text editor,
Linux (not absolutely necessary, but it's far superior, free and a real operating system).

Essay

What can I say about Perl? It's a good language to learn. Virtually all CGI is done in Perl, but it's good for virtually anything that you'd care to do and it's possible to develop applications very quickly. I'm not yet that experienced at Perl - this is my first 'real' app and I'm certain that this bot is not written at all well, but it is written. Perhaps that's the best thing about Perl - it enables you to do things that would not otherwise be possible. The CPAN Perl code repository on the net holds vast quantities of free code to do almost anything you could ever wish - but you have to be able to use Perl. You will need to download at least the LWP (it stands for libwww-perl) modules from CPAN for HCUbot or any Perl bot to work.

There are many Perl bots available on the net, but I'm fairly certain that you will not find one that does exactly what you want. There's also a convention among bot writers not to give bots to people who do not understand them - it's considered irresponsible. Of course, once you've learned how to build bots, you can be as irresponsible as you like. What all this means is that you have to learn to appreciate them and Perl or you don't deserve them. Don't worry, it's easy enough - just a little effort.

Please note that this is not good Perl code and I am not a programmer. Rather it shows that to start using Perl you only need to understand scalars, arrays, hashes and regexes. I hope that HCUbot is replaced soon with a better HCUbot2 that shows me how it should be done.

What I've done is provide most of the source to 'HCUbot' - a very simple web retrieval bot that retrieves many web pages from a single site. What's missing is the code to call subroutines and pass arguments to them. The idea is that by assembling the bot you will earn the right to use it. Those familiar with C, C++ or Java will have very little difficulty. This bot is fairly limited in what it can achieve (and bots can do far more than download web pages), but you are free to add any functionality you like - just write the code.
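For readers coming from C or Java, here is a minimal sketch of how a Perl subroutine receives arguments and hands back a list. It is not part of HCUbot: the sub name describe_page, the example.com url and the 10000-byte threshold are all made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# describe_page is a hypothetical helper, not part of HCUbot.
# It takes two arguments and returns a list of two values.
sub describe_page {
    my ($url, $size) = @_;          # arguments arrive in the array @_
    my $kind = ($url =~ /\.html?$/) ? "html" : "other";
    return ($kind, $size > 10000);  # a sub can return a list
}

# Call the sub and unpack the returned list into two scalars.
my ($kind, $is_big) = describe_page("http://www.example.com/index.html", 12723);
print "kind=$kind big=", ($is_big ? "yes" : "no"), "\n";
```

The same pattern (a list in through @_, a list out through return) is all that the HCUbot subroutines further down rely on.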

To get a taste of Perl, take a look at this. It's a very simple script I knocked together to convert Perl's 'pod' documentation to plain text. It's not actually necessary as there is already a pod-text conversion utility available, but I was practicing my regexes.

#!/usr/bin/perl -w
##### Strips formatting characters from pod documents
##### so that plain(er) text is achieved

use diagnostics;
$infile  = $ARGV[0];
$outfile = "output.txt";        # written to the current directory
open(INFILE,  "<$infile")  or die "$0: Can't open $infile: $!\n";
open(OUTFILE, ">$outfile") or die "$0: Can't open $outfile: $!\n";
while (<INFILE>) {
        # needs to handle all these: '=item *', '=over 4', '=head1 DESCRIPTION',
        # '=cut', '=item $ua = LWP::Robot ...', etc.
        s/^=\w+\s*[\d\s*]*//;   # removes things starting with =
        s/\w*<(.+?)>/$1/g;      # removes < and > around terms and the preceding letters

        # These are 'regular expressions' or 'regex(es)' - they may
        # look scary, but they're actually very simple when you
        # understand how, and they're very powerful.
        print OUTFILE;
}
close INFILE;
close OUTFILE;
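
To see what those two substitutions actually do, here is a small sketch that applies them to a couple of sample lines. The sub name strip_pod and the sample strings are made up for illustration, and the character class is written as [\d\s*] rather than [\d|\s|\*], which behaves the same on these inputs.

```perl
#!/usr/bin/perl -w
use strict;

# strip_pod is a hypothetical helper: it applies the two substitutions
# from the script above to one line and returns the result.
sub strip_pod {
    my ($line) = @_;
    $line =~ s/^=\w+\s*[\d\s*]*//;   # drop a leading pod directive such as '=head1 '
    $line =~ s/\w*<(.+?)>/$1/g;      # turn B<text> or L<text> into plain text
    return $line;
}

for my $sample ("=head1 DESCRIPTION\n", "See L<LWP::RobotUA> for I<details>.\n") {
    print strip_pod($sample);
}
```

Note that a directive with nothing useful after it, such as '=over 4' or '=cut', is removed together with its newline, because \s also matches the end of the line.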

Now take a look at this. All Perl tutorials say that there are many ways to achieve what you want. I wanted to process all the files in a directory with a certain file extension.


sub process_files {
        my ($dir) = @_;
        opendir(DIR, $dir) or die "$0: Can't open $dir: $!\n";
        @files = readdir DIR;   # @files contains every file in the directory

        foreach $file (@files) {
                if ($file !~ /^\d+\.ext$/) {    # regex filters @files so that
                        next;
                }
                push(@extfiles, $file);         # @extfiles contains only .ext files
                #       print "dev - pushed $file\n";
        }

        # open each file for processing
        # (note: "> $file" opens for writing and empties the file first)
        foreach $file (@extfiles) {
                open(FILE, "> $file") or die "$0: Unable to open file $file - $!\n";
                # do something to file
                close FILE;
        }
        closedir DIR;
}

Or, the second attempt.


sub process_files {

        my ($dir) = @_;
        opendir(DIR, $dir) or die "$0: Can't open $dir: $!\n";

        # don't need @files array at all

        while ($file = <*.ext>) {
                open(FILE, "> $file") or die "$0: Unable to open file $file - $!\n";
                # do something to file
                close FILE;
        }
        closedir DIR;
}

But the best way is like this,

sub process_files {

        my ($dir) = @_;
        opendir(DIR, $dir) or die "$0: Can't open $dir: $!\n";

        @files = glob("*.ext");         # Easy when you know how, eh?

        foreach $file (@files) {
                open(FILE, "> $file") or die "$0: Unable to open file $file - $!\n";
                # do something to file
                close FILE;
        }
        closedir DIR;
}

Very soon after Fravia published this essay, the following comments and corrections by [blue] were posted to his messageboard. I am very pleased to include these corrections and welcome others. Four ways to achieve the same thing.

       1. It's always better to directly parse the directory list:

       from perlop

       chmod 0644, <*.c>;

       Because globbing invokes a shell, it's often faster to call readdir() yourself
       and do your own grep() on the filenames. Furthermore, due to its current
       implementation of using a shell, the glob() routine may get ``Arg list too long''
       errors (unless you've installed tcsh(1L) as /bin/csh).


       I think the best way to parse the directory is
       something like this:

       opendir(DIR, $path) || die "Can't open $path: $!";

       # Avoid "." and ".." (and any other name starting with a dot)
       @files=grep( !/^\./, readdir(DIR) );

       closedir(DIR);


       Any decent operating system implementing a file system cache will read
       the entire directory anyway.


       2. Speaking about Win32: ActivePerl is absolutely compatible and BTW there is Perl
       on almost any OS you can think of.




       [blue]
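
Putting [blue]'s suggestion together with the extension filter from the earlier attempts might look something like the sketch below. The sub name files_with_ext is made up, and the lexical directory handle and the -f test are my additions rather than anything from the original message. Since Perl 5.6 glob() no longer starts a shell, so the speed argument matters less today, but readdir() plus grep() still gives you finer control over what is kept.

```perl
use strict;
use warnings;

# files_with_ext is a hypothetical helper: it returns the sorted names of
# the plain files in $dir that end in ".$ext", using readdir() and grep()
# rather than a glob.
sub files_with_ext {
    my ($dir, $ext) = @_;
    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    my @names = sort grep { /\.\Q$ext\E$/ && -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return @names;
}

# List the .ext files in the directory given on the command line
# (or the current directory if none is given).
my $dir = shift(@ARGV) || ".";
print "$_\n" for files_with_ext($dir, "ext");
```

The names returned are relative to $dir, so prepend the directory (or chdir into it first, as HCUbot does) before opening them.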

OK, here's HCUbot. HCUbot is written as a Linux application - it will need work to work on windoze (I've not used Perl over windoze - I think that it needs explicit sockets programming). Correction! [blue] states above that Perl for windoze (ActivePerl) is absolutely compatible. I've downloaded ActivePerl (1.5 meg) and I'm going to give it a go.

HCUbot produces many messages for help in development. You will see the headers sent to the server and the response headers back. I redirect the output to a file like this 'perl HCUbot www.orasomename.com > /tmp/BOTtestoutput' or the messages to the screen are overwhelming.

Perl helps you all the way with excellent error messages. You can write it cryptically or you can write it simply. I'm going to write it simply until I learn more - this code is quite clear to me. Use 'use diagnostics' and the -w switch only while developing - they can cause strange messages to be sent to servers. If something doesn't work, try it a slightly different way. I tend to use print statements to identify where perl fails (you may have noticed ;) and it seems to work well but there's also a very good debugger built in.

There are notes after the source to explain what's happening.


#!/usr/bin/perl -w
# remove -w switch after sorting
use diagnostics;  # for development, remove after sorting
    # use strict; hmm
use HTTP::Status;
use HTTP::Response;
use LWP::RobotUA;       # haha! did it
use URI::URL;
use HTML::Parse;
use vars qw($opt_h);    # needs work
use Getopt::Std;

my $url;

print "dev - $0 started - initialising variables.\n";

 my $arg = (shift @ARGV);
 my $domain_name = "http://".$arg."/";

 print "dev - \$domain_name is $domain_name\n";
 local @get_list = $domain_name;                # is this ok???  # Yes
 print "dev - \@get_list is @get_list\n";
 local %hcuing = ();
 print "dev - \%hcuing is initialised as ()\n";
  # referer section
 local %referer = ();
 print "dev - \%referer is initialised as ()\n";
 local $counter = 0;   # for naming locally-stored files
 print "dev - \$counter is $counter\n";

 local $maxcount = 15;
 my $mirror = 0;




         ########################################################

         ###     N.B.  SUBROUTINES CALLED FROM THIS BLOCK      ###

&change_dir($arg);

        while (($url = shift @get_list) && ($counter < $maxcount))      {

                #####  INSERT   #####

                 ###    CODE    ###

                ####    HERE    ####


        }  ##  while there are URLs to fetch


&shut_down; #not strictly necessary (helps development, or helped me)

        ###     N.B.  SUBROUTINES CALLED FROM THIS BLOCK      ###




                ##  print_help() er, prints help  ###########

sub print_help {
  print << "HELP";

usage: $0 [-h] domain-name

 -h help

Example:  $0 www.ora.com

HELP
}



                ## change_dir, change to user's home directory  ###


sub change_dir {


        my ($dirname) = @_;

        $dirname =~ s/^http:\/\///;     # strip a leading http:// if one was given
        print "dev - \$dirname to be created is $dirname\n";
        # change to user's home directory
        chdir() or die "Unable to change to home directory  $!\n";
        my $pwd = `pwd`;
        print "dev - changed to user's home directory. Directory is $pwd\n";

        # makedir beneath user's home directory with appropriate permissions
        # (a directory needs its execute bit set or we can't chdir into it)
        if (! ( -d $dirname))   {
                mkdir($dirname,0755) or die "Unable to create directory $dirname  $!\n";
                print "dev - created directory $dirname\n";
        }

        # move into that directory - will be creating/renaming files
        chdir($dirname) or die "Unable to change to directory $dirname  $!\n";
        $pwd = `pwd`;
        print "dev - changed to directory $pwd";

return 0;
}






                ##   get_html() retrieves html pages  ######

sub get_html {


  my($url) = @_;

  print "dev - in sub get_html()\n";

# Create a User Agent object

                # your email address here ~ be responsible ~
$ua = new LWP::RobotUA 'HCUbot','jclinton@whitehouse.gov';
$ua->delay(0.01);     # delay is in minutes: short, but probably enough

# Ask the User Agent object to request a URL.
# Results go into the response object (HTTP::Response).

  my $request = new HTTP::Request('GET', $url);
  print "dev - \$url is $url\n";

        if (defined $referer{$url}) {           # referer implementation, works
                $ref = $referer{$url};
                $request->referer($ref);

                }

   my $response = $ua->request($request);

  #####  for development/debugging purposes #######
  print "\ndev - \$request->as_string is \n";
  print $request->as_string;
  print "\ndev - \$response->as_string is \n";
  print $response->as_string;
  #####  for development/debugging purposes #######

  return ($response->code, $response->content_type, $response->content);

}


                ##  not_good()  ############
                ## checks that page was received ok and that it is html   #####

# returns 1 if the request was not OK or not HTML, else 0

sub not_good {


    my ($code, $type) = @_;

    print "dev - in sub not_good \n";

    if ($code != RC_OK) {
      print "$url had response code of $code\n";
    return 1;
    }

    if ($type !~ /text\/html/) {
      warn("$url is not HTML.");
    return 1;
   }
return 0;   # return false (0) if document is ok
}




                ##   save_html()   #########

sub save_html {


my ($url,$data) = @_;

print "dev - in sub save_html \n";
$counter++;

        open(SAVEFILE,">$counter.ext")
                or die "unable to save file $url as $counter.ext \n";
                print SAVEFILE $data;
        close SAVEFILE;

        # save %hcuing hash entry for $url and local($counter) filename
        # Hash entry now defined as well as existing
        $hcuing{$url} = "$counter\.ext";

        print "dev - \%hcuing key $url given value $counter\.ext\n";

return 0;
}





                ##    extract_hyperlinks()   #######
  ##   extracts relative urls, calls absolutise_url()


sub extract_hyperlinks {

  my ($data, $url) = @_;

  print "dev - in sub extract_hyperlinks \n";


my $parsed_html=HTML::Parse::parse_html($data);

  for (@{ $parsed_html->extract_links(qw (a)) }) {
    my ($link) = @$_;
    my ($absolute_link) = absolutise_url($link, $url);

      #   only interested in     i. same-domain
     ##                      and
    ####    ii. non-queued or fetched hyperlinks
   #####  This is the second filter for documents to retrieve

        # \Q...\E stops the dots in the domain name matching any character
        if (($absolute_link =~ /\Q$domain_name\E/o)
                &&       (! exists $hcuing{$absolute_link}))    {

                # queue for retrieval
                push (@get_list, "$absolute_link");
                # create but not define hash entry so that url is only queued once
                $hcuing{$absolute_link} = "";
                print "dev - \%hcuing key $absolute_link created. \n";

            # referer hash
                $referer{$absolute_link} = "$url";
                print "dev - \%referer key $absolute_link with value $url created. \n";
                        }

                }
  $parsed_html->delete(); # manual garbage collection

return 0;

}




                ##   converts relative to absolute urls   ######

sub absolutise_url {

        my ($partial, $model) = @_;

        print "dev - in sub absolutise_url()\n";

    my $url = new URI::URL($partial, $model);
    my $absolutised = $url->abs->as_string;

    ## URI::URL returns duplicated urls - filter further #!###

        #!~    THIS REGEX IS IMPORTANT    ~!#

         #!~   - first filter for queuing docs for retrieval    ~!#
        # must have extension htm(l)
        # tried /html*#{0}/ and /html*[^#]/

        if  ( $absolutised =~ /htm[^#]*$/ )     {
                print "dev - absolutise_url() returning: $absolutised. \n";
        return $absolutised;

        } else {
                print "dev - absolutise_url() returning null: (not $absolutised). \n";
        # want to return null - will this work? yes
        return "";

        }
}


                ##  shut_down  ##########
                ##  for development use

sub shut_down {         ## there's probably a name for this by convention
                        ## yeah,  maybe shut_down
 print "dev - in END section\n";


open(SAVEHASH,">hcuing") or die "unable to open hcuing hash file for saving.\n";
print SAVEHASH %hcuing or die "unable to print hcuing hash file to disk. \n";
close SAVEHASH;

open(SAVEGETLIST,">getlist") or die "unable to open getlist file for saving.\n";
print SAVEGETLIST @get_list or die "unable to print \@get_list to disk. \n";
close SAVEGETLIST;

open(SAVEGETLIST,">referer") or die "unable to open referer file for saving.\n";
print SAVEGETLIST %referer or die "unable to print \%referer to disk. \n";
close SAVEGETLIST;

        # print each %hcuing key-value pair
        foreach $k (sort keys %hcuing)  {
                print "dev - \%hcuing $k => $hcuing{$k}\n";
                }

        # print each %referer key-value pair
        foreach $k (sort keys %referer)         {
                print "dev - \%referer $k => $referer{$k}\n";
                }

}

# possible enhancements
# edit documents so links point to local copies
# scope properly
# enhance that regex
# mirroring facility

HCUbot is replacing a browser, sending requests for web pages and receiving responses. HCUbot can even pretend to be a browser - any browser you like. This line

$ua = new LWP::RobotUA 'HCUbot','jclinton@whitehouse.gov';

identifies HCUbot as HCUbot, while the jclinton... is the email address the server administrator should contact if your bot screws up her server - she'll send you an awfully polite email. So to pretend to be a particular browser, you would replace HCUbot with something like "Mozilla/3". You'll have to check the exact string that the browser sends.

HCUbot sends a GET request to the server. It says that it wants a particular web page by saying GET followed by the url of the document that you're after. There are other request methods - HEAD, POST and a few others - and LWP also provides a mirror facility (did you notice that my $mirror = 0; variable at the initialising variables section?). Mirror compares the document on the server with your local copy. If the server's document is newer, that document is retrieved. One way to do this is to send a HEAD request, which retrieves only the headers for the document. The headers contain the size of the document and the date that it was last amended, so your machine can decide whether to fetch it. LWP's own mirror() method instead sends a GET with an If-Modified-Since header, and the server replies 304 (not modified) if there is nothing new to send. That my $mirror = 0; variable initialisation is for HCUbot to mirror documents (not yet implemented).

Let's take a look at some headers that HCUbot works with.

dev - in sub get_html()                      ## messages starting "dev - ..." are produced by
dev - $url is http://www.oracle.com/         ## HCUbot so that you know what's happening.


dev - $request->as_string is
GET http://www.oracle.com/                   # Here's the request header
From: jclinton@whitehouse.gov
User-Agent: hcuBOT



dev - $response->as_string is
HTTP/1.1 200 OK                              # Here's the response header that we
Cache-Control: public                        # get back from the server
Date: Thu, 20 Jul 1999 20:18:19 GMT
Accept-Ranges: bytes
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, HEAD
Content-Length: 12723
Content-Type: text/html
ETag: "8ef7c2d83beac682e5b0bb90ecc3791a"
Last-Modified: Thu, 20 Jul 1999 16:31:27 GMT
Client-Date: Thu, 20 Jul 1999 23:28:07 GMT
Client-Peer: 205.207.44.16:80
Title: Oracle Corporation - Home
X-Meta-Description: Oracle Corp. (Nasdaq: ORCL) is the world's leading
 supplier of software for enterprise information management.
X-Meta-Keywords: database,software,Oracle,Oracle8i,relational server,
 server,application,tools,decision support tools,internet,internet computing,
 CRM,customer relationship management,e-business,PL/SQL,XML,Year 2000,Euro, Java, technology


<html>            # and the html document requested with a GET starts here.

Quite a whopper that response header, they're not normally that big. The request is simple on this one, it's jclinton@whitehouse.gov saying GET http://www.oracle.com/ using User-Agent: hcuBOT.

The important part of the response is the first line "HTTP/1.1 200 OK".

Hypertext Transfer Protocol (HTTP) will be either 1.1 or 1.0. Version 0.9 only supports the GET method and is not used now as far as I'm aware. 1.0 supports GET, HEAD, POST, PUT, DELETE, LINK and UNLINK. 1.1 supports a few extra methods. The Allow line in this response says that the server will accept HEAD and GET requests for this document.

An important part is the response code. We want response code 200 as shown here, which is the server replying "OK, here's the document you asked for". Response codes 100 to 199 are informational and you will rarely see them. 200-299 mean the request was successful, but only 200 really means that you'll get the document. 300-399 are redirections, which can cause a bit of trouble. 400-499 are client errors, which you don't want: 400 is bad request (syntax error in the request), 404 is document not found - just like when you click on a stale link. Server errors are the 500 range, which you don't want either. 500 is internal server error, one that you don't want but will get often. I implemented the referer in HCUbot to try to avoid RC500s and made some other changes. The referer is the page that gave us the link. You'll sometimes get the document even with a RC500.
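
The ranges above can be sketched as a small lookup. The sub name status_class and the class labels are made up for illustration; in a real bot you would more likely use the helpers that HTTP::Status already exports (is_success, is_redirect, is_error and friends) than roll your own.

```perl
use strict;
use warnings;

# status_class is a hypothetical helper: it maps a numeric HTTP response
# code to the name of the range it falls in.
sub status_class {
    my ($code) = @_;
    return "informational" if $code >= 100 && $code < 200;
    return "success"       if $code >= 200 && $code < 300;
    return "redirection"   if $code >= 300 && $code < 400;
    return "client error"  if $code >= 400 && $code < 500;
    return "server error"  if $code >= 500 && $code < 600;
    return "unknown";
}

for my $code (200, 302, 404, 500) {
    printf "%d is a %s\n", $code, status_class($code);
}
```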

Here's a request header with a referer. It's saying "I want http://www.oracle.com/html/custcom.html, I got this url from http://www.oracle.com/".

dev - $request->as_string is
GET http://www.oracle.com/html/custcom.html
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
User-Agent: hcuBOT



dev - $response->as_string is
HTTP/1.1 200 OK
Date: Thu, 20 Jul 1999 20:18:23 GMT
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, HEAD
Content-Type: text/html
Client-Date: Thu, 20 Jul 1999 23:28:11 GMT
Client-Peer: 205.207.44.16:80
Title: Oracle Corporation - Customers.com

<html>            # HTML document follows

In HCUbot there's this code to test if the document was received OK (response code 200) and that it's html


sub not_good {
    my ($code, $type) = @_;

    if ($code != RC_OK) {
      print "$url had response code of $code\n";
      return 1;
    }

    if ($type !~ /text\/html/) {
      warn("$url is not HTML.");
      return 1;
    }
    return 0;   # return false (0) if document is ok
}

Back to HCUbot. HCUbot uses the LWP (it stands for libwww-perl) Perl module, which is a predefined library of code that deals with net protocols. To write a bot in C++, for example, you'd want a networking library to include, just as iostream.h and math.h are included. What happens is that your program calls on functions in these stored libraries. LWP relieves the programmer (that's me or you) of sockets programming. A socket is how you program the net - you read and write to a socket like you would read or write to a file, except that it's more complex. Socket programming allows more control.

Specifically, HCUbot uses LWP::RobotUA (robot user agent), which is an appropriate module for web robots. RobotUA is often called 'polite' because it's careful not to aggravate servers: it obeys a site's robots.txt file and it delays successive requests to the same server. The default delay, however, is one minute, which I think is far too long for today's servers.

This is how HCUbot works.

You feed HCUbot a url to start at. In the first request header above, the starting url was www.oracle.com.
HCUbot requested this page as GET http://www.oracle.com/ and it was retrieved successfully, RC 200 OK.
HCUbot tests that the retrieved document is OK.
HCUbot saves the document to disk.
HCUbot extracts from that document links to other documents - these are the links that you would click on in your browser.
HCUbot makes these links absolute - HTML pages can have abbreviated (relative) hyperlinks.
HCUbot filters - only want HTML. The regex marked #!~ THIS REGEX IS IMPORTANT ~!# decides which documents to queue. If you wanted jpgs or zips, you would change this regex to match jpgs or zips.
HCUbot decides which documents to queue for retrieval. It chooses documents within the same domain that are not already queued or retrieved.
HCUbot repeats until the queue is empty or $maxcount documents have been saved.
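
The queue-and-filter logic in the steps above can be tried out without touching the net. The sketch below is not HCUbot's missing main loop: it walks a made-up, in-memory "site" (the %site hash, the example.com urls and the sub name crawl are all invented for illustration), and it checks for the same site with a simple prefix test rather than HCUbot's regex. What it does share with HCUbot is the @get_list queue, a hash that makes sure each url is queued only once, and the /htm[^#]*$/ filter.

```perl
use strict;
use warnings;

# Toy "site": each page maps to the absolute links found on it (made-up data).
my %site = (
    "http://www.example.com/"       => ["http://www.example.com/a.html",
                                        "http://www.example.com/logo.gif",
                                        "http://other.example.org/x.html"],
    "http://www.example.com/a.html" => ["http://www.example.com/b.html",
                                        "http://www.example.com/"],
    "http://www.example.com/b.html" => ["http://www.example.com/a.html#top"],
);

# crawl is a hypothetical sub: it returns the urls in the order they
# would be fetched, starting at $start and stopping after $maxcount pages.
sub crawl {
    my ($start, $maxcount, $site) = @_;
    my @get_list = ($start);       # queue of urls still to fetch
    my %seen     = ($start => 1);  # urls already queued or fetched
    my @fetched;

    while (@get_list && @fetched < $maxcount) {
        my $url = shift @get_list;
        push @fetched, $url;                       # stands in for fetch + save
        for my $link (@{ $site->{$url} || [] }) {
            next unless $link =~ /htm[^#]*$/;      # first filter: HTML only
            next unless index($link, $start) == 0; # second filter: same site
            next if $seen{$link}++;                # queue each url only once
            push @get_list, $link;
        }
    }
    return @fetched;
}

print "$_\n" for crawl("http://www.example.com/", 15, \%site);
```

To collect jpgs or zips instead, the first filter is the line to change, for example to /\.(jpg|zip)$/.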

And that's about the size of it. To finish, here are some of the messages HCUbot produces.



dev - in 'MAIN'calling get_html section
dev - in sub get_html()
dev - $url is http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907191000.18885.html&mode=corp

dev - $request->as_string is
GET http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907191000.18885.html&mode=corp
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
User-Agent: hcuBOT



dev - $response->as_string is
HTTP/1.1 200 OK
Date: Thu, 20 Jul 1999 20:18:44 GMT
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, POST
Content-Type: text/html
Client-Date: Thu, 20 Jul 1999 23:28:32 GMT
Client-Peer: 205.207.44.16:80
Title: Press Release

<html>
<head><title>Press
Release</title></head>
<body bgcolor="#ffffff">
<!--header-->
<table width=600 cellpadding=0 cellspacing=0
border=0>
<tr><td colspan=2 align=right>
<map name="top">
<area shape="rect" coords="0,0,140,25" href="/"
target="_top">
<area shape="rect" coords="343,1,385,23"
href="/"target="_top">
<area shape="rect" coords="386,1,441,23"
href="/html/sitemap.html" target="_top">
<area shape="rect" coords="442,1,503,23"
href="/html/siteidx_frame.html" target="_top">
</map>
<img width=528 height=28
src="/templatesimages/hdr_top.gif" usemap="#top"
border=0
alt="home,site map,site index"></td>
<td valign=top rowspan=2>
<a href="/ebusiness/" target="_top">
<img width=72 height=56
src="/templatesimages/hdr_eb.gif" border=0 alt="#1
e-business"></a>
</td></tr>

<tr><td valign=top width=203>
<div class="search">
<FORM method=GET
action="http://orasearch.oracle.com/cgi-bin/query">
<INPUT TYPE=hidden NAME=mss VALUE=simple>
<INPUT TYPE=hidden  NAME=pg VALUE=q>
<INPUT TYPE=hidden NAME=fmt VALUE=.>
<INPUT TYPE=hidden NAME=what VALUE=web>
<INPUT NAME=q size=10 maxlength=800
VALUE=""><INPUT
TYPE="image" src="/templatesimages/search_btn.gif"
width=36 height=18 value="go" border=0>
</FORM>
</div>
</td>

<td valign=top align=right width=397>
<map name="tabs">
<area shape="rect" coords="5,0,84,16"
href="http://oraclestore.oracle.com" target="_top">
<area shape="rect" coords="85,0,168,16"
href="/download/" target="_top">
<area shape="rect" coords="169,0,219,16"
href="/support/" target="_top">
<area shape="rect" coords="200,0,259,16"
href="/cgi-bin/press/pr.cgi" target="_top">
<area shape="rect" coords="260,0,309,16"
href="/corporate/seminars_and_events/"
target="_top">
<area shape="rect" coords="310,0,392,16"
href="/siteadmin/html/contactus.html"
target="_top">
</map>
<img width=397 height=28
src="/templatesimages/hdr_tab.gif" usemap="#tabs"
border=0
alt="Main Navigation Bar"></td></tr>
</table>
<table width=560><tr><td>
<img ALIGN=center WIDTH=246 HEIGHT=40
SRC="/corporate/pressimages/pr_ban.jpg"
ALT=""><br>
<form action="pr.cgi" method="post">
<INPUT TYPE="HIDDEN" NAME="status"
VALUE="Search">
<div align=right><INPUT TYPE="SUBMIT"
VALUE="Return to Corporate Press Release Index">
</div>
</form>

<h2>Oracle Capitalizes on Enterprise Demand for
Linux Offerings with Announcement of Oracle 8i on
Linux</h2>
(July 19, 1999)<p>
<P><B>Contact(s):</B><TABLE
WIDTH=100%><TR><TD VALIGN=TOP
ALIGN=LEFT><FONT SIZE=-1>Reema
Bahnasy<BR>Oracle
Corp.<BR>650/506-3397<BR><A
HREF="mailto:rbahnasy@us.oracle.com">rbahnasy@us.oracle.com</A></FONT></TD><TD
VALIGN=TOP ALIGN=LEFT><FONT SIZE=-1>Karesha
McGee<BR>Applied
Communications<BR>415/365-0202<BR><A
HREF="mailto:kmcgee@appliedcom.com">kmcgee@appliedcom.com</A></FONT></TD></TR></TABLE><P>

<P>
Early Adopters Programs Draws Nearly 20,000 Developers

<P>
REDWOOD SHORES, Calif., July 19, 1999-<A
HREF="http://www.oracle.com/">Oracle
Corporation</A>, the number one
choice for e-business, today announced dramatic growth
and demand for Oracle
on Linux with strong adoption in both enterprise and
general business markets.
 Oracle also announced the general availability of
Oracle8i(TM) on Linux, after a
successful early adopter's program.
<P>
   Since <A HREF="http://www.oracle.com/">Oracle
Corp.</A> announced Oracle8 on Linux, there have
been over
50,000 downloads from Oracle(R) Technology Network
(<A
HREF="http://technet.oracle.com/">http://technet.oracle.com/
).  Now, after the announcement of Oracle8i , there
have been nearly 20,000
registrants for early access in the first few weeks. 
Outside the development
community, Oracle has also seen overwhelming customer
adoption with an excess
of 800 paying customers today-over half of these
orders from enterprise
accounts and the remainder from small to mid-sized
businesses and
organizations.
<P>
   "Until the availability of Oracle database on
Linux, we either had to
rely on NT or use one of the shareware database
servers available for Linux,"
says Jonathan August, President and CEO of
Internection, Inc., a company
providing customized Internet services solutions to
businesses, including web
hosting and e-commerce solutions.  "Neither solution
provided us the security,
performance, manageability or reliability required by
our customers.  Oracle
brings enterprise credibility and robustness to our
products.  As a result,
we've gained access to customers ranging from small
businesses to Fortune 100
enterprises like Prudential and Pfizer.  Our total
revenue since the
additional of Oracle on Linux has increased by 250
percent."
<P>
   "Oracle on Linux combines enterprise level
reliability, scalability
and performance with a free, robust and well supported
operating system," says
Nick Marden, technical director of e-commerce,
Xoom.com, and e-commerce
service provider.  "It enables Xoom.com to better
understand our members'
needs and respond to them quickly.  Oracle on Linux
represents an
extraordinary value and it gets the job done."
<P>
   "Oracle is committed to bringing superior
technology to the Linux
community," says Chuck Rozwat, senior vice president
of Server Technologies at
Oracle.  "Oracle8i on Linux comes with both Java and
XML built right in.
Together they offer the most cost-effective way to
deploy scalable Internet
applications."
<P>
   Oracle8i is the first and only database
specifically designed for the
Internet.  Oracle8i extends Oracle's long-standing
technology leadership in
the areas of data management, transaction processing
and data warehousing to
the new medium of the Internet.  Oracle8i is the
centerpiece of Oracle's
Internet Platform, which also includes Oracle
Application Server and Oracle's
Internet development tools.
<P>
   Oracle Corporation is the world's leading supplier
of software for
information management, and the world's second largest
software company.  With
annual revenues of more than $8.8 billion, the company
offers its database,
application server, tools and application products,
along with related
consulting, education and support services, in more
than 145 countries around
the world.
<P>
   For more information about Oracle, please call
650/506-7000.  Oracle's
World Wide Web address is (URL) <A
HREF="http://www.oracle.com/.">http://www.oracle.com/.

<P>
<P><CENTER><STRONG># #
#</CENTER></STRONG><P>

<P>
<B>Trademarks</B><BR>
Oracle is a registered trademark and Oracle8i is a
trademark or registered
trademark of Oracle corporation.  Other names may be
trademarks of their
respective owners.

</td></tr></table>
<html>

<body bgcolor="#ffffff" link="000000">

<img src="images/line.gif" width=600 height=1>
<br clear=all>
<table width=600 cellpadding=0 cellspacing=0
border=0>

<tr>
<td align="right" width="100">
<div class="FOOTER">
<a href="/appserver/">
<font FACE="Arial, Helvetica" SIZE="1">
Powered by Oracle Application Server
</a></font>
</div></td>

<td align="left" width="50">
<div class="FOOTER">
<img src="images/clear_dot.gif" width=50
height=1>
</div></td>

&lt;td width=450&gt;
&lt;div class="FOOTER"&gt;
&lt;center&gt;
&lt;font FACE="Arial, Helvetica" SIZE="1"&gt;

        &lt;a href="/" target="_top"&gt;Home&lt;/a&gt;
        | &lt;a
href="/html/sitemap.html"target="_top"&gt;Site
Map&lt;/a&gt;
        | &lt;a href="/html/siteidx_frame.html"
target="_top"&gt;Site Index&lt;/a&gt;
        | &lt;a HREF="http://orasearch.oracle.com"
target="_top"&gt;Search&lt;/a&gt;
        &lt;br&gt;
        &lt;a HREF="http://oraclestore.oracle.com/"
target="_top"&gt;Oracle Store&lt;/a&gt;
        | &lt;a href="/download/"
target="_top"&gt;Free Download&lt;/a&gt;
        | &lt;a HREF="/support/"
target="_top"&gt;Support&lt;/a&gt;
        | &lt;a HREF="/cgi-bin/press/pr.cgi"
target="_top"&gt;News&lt;/a&gt;
        | &lt;a HREF="/corporate/seminars_and_events/"
target="_top"&gt;Events&lt;/a&gt;
        | &lt;a HREF="/siteadmin/html/contactus.html"
target="_top"&gt;Contact Oracle&lt;/a&gt;
        &lt;br&gt;
        &lt;a href="/products/index.htm"
target="_top"&gt;Products&lt;/a&gt;
        | &lt;a href="/services/index.htm"
target="_top"&gt;Services&lt;/a&gt;
        | &lt;a href="/solutions/index.htm"
target="_top"&gt;Business Solutions&lt;/a&gt;
        | &lt;a href="/corporate/oracle_at_work/"
target="_top"&gt;Customer Successes&lt;/a&gt;
        | &lt;a href="/partners/index.htm"
target="_top"&gt;Partners&lt;/a&gt;
        &lt;br&gt;
        &lt;a href="http://technet.oracle.com"
target="_top"&gt;Developers/IT&lt;/a&gt;
        | &lt;a href="/corporate/index.htm"
target="_top"&gt;About Oracle&lt;/a&gt;
        | &lt;a href="/international/html/"
target="_top"&gt;International&lt;/a&gt;
        | &lt;a HREF="/html/employ.html"
target="_top"&gt;Employment&lt;/a&gt;
        | &lt;a HREF="http://cnn.com/customnews"
target="_top"&gt;cnn custom news&lt;/a&gt;

&lt;br&gt;&lt;p&gt;
&lt;b&gt;Copyright &copy; 1995,1999 Oracle
Corporation.  All Rights Reserved.&lt;br&gt;&lt;/b&gt;

&lt;A HREF="/html/copyright.html"&gt;Legal Notices and
Terms of Use&lt;/a&gt;
&nbsp; | &nbsp;&lt;a
href="/html/privacy.html"&gt;PRIVACY
STATEMENT&lt;/a&gt;&lt;/font&gt;&lt;/center&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;

&lt;br clear=all&gt;

&lt;table width=600 cellpadding=0 cellspacing=0
border=0&gt;
&lt;tr&gt;&lt;td align=right&gt;
&lt;a
href="http://ad.doubleclick.net/jump/www.oracle.com/products/trial/html/trial.html"&gt;
&lt;img
src="http://ad.doubleclick.net/ad/www.oracle.com/products/trial/html/trial.html"
 width=468 height=60 border=0 ismap&gt;&lt;/a&gt;

&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/body&gt;
&lt;/html&gt;
&lt;!--end footer--&gt;
&lt;/body&gt;
&lt;/html&gt;

dev - in sub not_good
dev - in sub save_html
dev - %tebotize key
http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907191000.18885.html&mode=corp
given value 5.tbt
dev - in sub extract_hyperlinks
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/ebusiness/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
mailto:rbahnasy@us.oracle.com).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
mailto:kmcgee@appliedcom.com).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://technet.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/.).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/appserver/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/html/sitemap.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/html/siteidx_frame.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://orasearch.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://oraclestore.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/download/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/support/).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/cgi-bin/press/pr.cgi).
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/corporate/seminars_and_events/).
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/siteadmin/html/contactus.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/products/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/services/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/solutions/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://www.oracle.com/corporate/oracle_at_work/).
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/partners/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://technet.oracle.com/).
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/corporate/index.htm.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/international/html/.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/html/employ.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning null: (rejected
http://cnn.com/customnews).
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/html/copyright.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://www.oracle.com/html/privacy.html.
dev - in sub absolutise_url()
dev - absolutise_url() returning:
http://ad.doubleclick.net/jump/www.oracle.com/products/trial/html/trial.html.
dev - in 'MAIN'calling get_html section
dev - in sub get_html()
dev - $url is
http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907130500.13306.html&mode=corp
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
User-Agent: hcuBOT



dev - $response->as_string is
HTTP/1.1 200 OK
Date: Thu, 20 Jul 1999 20:18:48 GMT
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, POST
Content-Type: text/html
Client-Date: Thu, 20 Jul 1999 23:28:36 GMT
Client-Peer: 205.207.44.16:80
Title: Press Release

<html>
<head><title>Press
Release</title></head>
<body bgcolor="#ffffff">
<!--header-->
<table width=600 cellpadding=0 cellspacing=0
border=0>
<tr><td colspan=2 align=right>
dev - $request->as_string is
GET
http://www.oracle.com/cgi-bin/press/printpr.cgi?file=199907130500.13306.html&mode=corp
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
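
The trace above shows absolutise_url() turning relative links such as /html/sitemap.html into absolute URLs and rejecting others. What follows is only a minimal sketch of that idea, not the HCUbot's actual sub: the name absolutise_url_sketch is hypothetical, and the filter (keep http links that mention www.oracle.com and contain "htm") is a guess that happens to fit the trace. It relies on the URI module from CPAN, which is installed alongside LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;    # CPAN module, a prerequisite of LWP

# Hypothetical stand-in for the bot's absolutise_url(): resolve $href
# against $base and return the absolute URL, or undef if it is rejected.
sub absolutise_url_sketch {
    my ($href, $base) = @_;
    my $abs = URI->new_abs($href, $base);

    # Assumed filter, guessed from the trace: http only, must mention
    # www.oracle.com, and must contain "htm" somewhere in the URL.
    return unless defined $abs->scheme && $abs->scheme eq 'http';
    my $str = $abs->canonical->as_string;
    return unless $str =~ /www\.oracle\.com/ && $str =~ /htm/;
    return $str;
}

my $base = 'http://www.oracle.com/cgi-bin/press/printpr.cgi'
         . '?file=199907191000.18885.html&mode=corp';

for my $href ('/html/sitemap.html', '/download/',
              'mailto:rbahnasy@us.oracle.com', 'http://technet.oracle.com') {
    my $abs = absolutise_url_sketch($href, $base);
    print defined $abs
        ? "returning: $abs\n"
        : "returning null: (rejected $href)\n";
}
```

With these four sample links, only /html/sitemap.html passes the assumed filter; the other three are rejected, as in the trace.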


Final Notes

Perl is not the only language for writing bots.
You can install Linux on your Windoze machine - you know you want to.
For tutorials, try a query like '+Perl +tutorial' or '+Perl +robot +tutorial' at AltaVista.
I expect to update this page fairly soon with an improved HCUbot.

BOTS ARE THE FUTURE