Unix Power Tools

41.13. Perl and the Internet

Because Perl supports Berkeley sockets, all kinds of networking tasks can be automated with Perl. Below are some common idioms to show you what is possible with Perl and a little elbow grease.

41.13.1. Be Your Own Web Browser with LWP

The suite of classes that handles all aspects of HTTP is collectively known as LWP (for the libwww-perl library). If your Perl installation doesn't currently have LWP, you can easily install it with the CPAN module (Section 41.11) like this:

# perl -MCPAN -e 'install Bundle::LWP'

If you also included an X widget library such as Tk, you could create a graphic web browser in Perl (an example of this comes with the Perl Tk library). However, you don't need all of that if you simply want to grab a file from a web server:

use LWP::Simple;
my $url = "http://slashdot.org/slashdot.rdf";
getstore($url, "s.rdf");

This example grabs the Rich Site Summary file from the popular tech news portal, Slashdot, and saves it to a local file called s.rdf. In fact, you don't even need to bother with a full-fledged script:

$ perl -MLWP::Simple -e 'getstore("http://slashdot.org/slashdot.rdf", "s.rdf")'
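Neither version checks whether the download actually worked. getstore returns the HTTP status code of the response, and LWP::Simple also exports is_success for testing it. Here is a minimal sketch of that check; the helper name save_url is made up for illustration and is not part of LWP:

```perl
use strict;
use warnings;
use LWP::Simple;    # exports getstore and is_success

# save_url is a hypothetical helper, not part of LWP.
# It returns true if the URL was fetched and written to $file.
sub save_url {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);    # numeric HTTP status code
    warn "Fetching $url failed with status $status\n"
        unless is_success($status);
    return is_success($status);
}

# Only touch the network when a URL and a filename are given
# on the command line.
if (@ARGV == 2) {
    save_url(@ARGV) or exit 1;
}
```

Saved under a name of your choosing, the script would be run with the URL and the target filename as its two arguments, and it exits with a nonzero status if the fetch fails.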

Sometimes you want to process a web page to extract information from it. Here, the script extracts and reports the title of the page whose URL is given on the command line:

use LWP::Simple;
use HTML::TokeParser;

$url = $ARGV[0] || 'http://www.oreilly.com';
$content = get($url);
die "Can't fetch page: halting\n" unless $content;

$parser = HTML::TokeParser->new(\$content);
$parser->get_tag("title");
$title = $parser->get_token;
print $title->[1], "\n" if $title;

After bringing in the library to fetch the web page (LWP::Simple) and the one that can parse HTML (HTML::TokeParser), the command line is inspected for a user-supplied URL. If one isn't there, a default URL is used. The get function, imported implicitly from LWP::Simple, attempts to fetch the URL. If it succeeds, the whole page is kept in memory in the scalar $content. If the fetch fails, $content will be undefined, and the script halts.

If there's something to parse, a reference to the content is passed into the HTML::TokeParser object constructor. HTML::TokeParser deconstructs a page into individual HTML elements. Although this isn't the way most people think of HTML, it does make it easier for both computers and programmers to process web pages. Since nearly every web page has only one <title> tag, the parser is instructed to ignore all tokens until it finds the opening <title> tag. The actual title is a text token, and fetching it requires getting the next token. The method get_token returns an array reference whose size depends on the kind of token returned (see the HTML::TokeParser manpage for details). In this case, the desired element is the second one.
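The same token-by-token approach works for other tags. As a rough sketch, the following pulls every link out of a page held in memory. The helper name extract_links and the sample markup are invented for this example. It relies on get_tag returning an array reference whose second element is a hash of the tag's attributes, and on get_trimmed_text collecting the text up to a given end tag:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# extract_links is a hypothetical helper: it returns a list of
# [link text, href] pairs for every <a href="..."> in the given HTML.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't create parser\n";
    my @links;
    # For a start tag, get_tag returns [tag name, attribute hash, ...]
    while (my $tag = $parser->get_tag("a")) {
        my $href = $tag->[1]{href};
        next unless defined $href;    # skip anchors without an href
        # Collect the text up to the closing </a>
        my $text = $parser->get_trimmed_text("/a");
        push @links, [$text, $href];
    }
    return @links;
}

# Sample markup made up for this example
my $html = <<'END_HTML';
<html><head><title>Sample Page</title></head>
<body>
<p>See <a href="http://www.oreilly.com/">O'Reilly</a> and
<a href="http://www.perl.org/">the Perl home page</a>.</p>
</body></html>
END_HTML

printf "%s -> %s\n", @$_ for extract_links($html);
```

To run this against a live page, pass the string returned by get($url) to extract_links instead of the sample markup.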

One important word of caution: these scripts are very simple web crawlers, and if you plan to be grabbing a lot of pages from a web server you don't own, you should do more research into how to build polite web robots. See O'Reilly's Perl & LWP.
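One starting point is LWP::RobotUA, a user agent class that ships with libwww-perl. It consults each site's robots.txt file and enforces a pause between requests to the same host. The sketch below is only a beginning, not a complete robot. The agent name and contact address are placeholders you would replace with your own:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# The agent name and contact address are placeholders; use your own.
my $ua = LWP::RobotUA->new('my-robot/0.1', 'webmaster@example.com');
$ua->delay(1);    # wait at least 1 minute between requests to the same host

# Fetch each URL named on the command line, one polite request at a time.
for my $url (@ARGV) {
    my $response = $ua->get($url);
    if ($response->is_success) {
        printf "%s: %d bytes\n", $url, length $response->content;
    } else {
        warn "$url: ", $response->status_line, "\n";
    }
}
```

Because of the enforced delay, fetching several pages from one host this way is deliberately slow; that is the point.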

41.13.2. Sending Mail with Mail::Sendmail

Often, you may find it necessary to send an email reminder from a Perl script. You could do this with sockets only, handling the whole SMTP protocol in your code, but why bother? Someone has already done this for you. In fact, there are several SMTP modules on CPAN, but the easiest one to use for simple text messages is Mail::Sendmail. Here's an example:

use Mail::Sendmail;

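# The addresses below are placeholders; substitute your own.
# Single quotes keep Perl from interpolating the @ signs.
my %mail = (
                Subject => "About your disk quota",
                To      => 'user1@example.com, user2@example.com',
                From    => 'admin@example.com',
                Message => "You've exceeded your disk quotas",
                smtp    => "smtp-mailhost.hostname.com",
           );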

sendmail(%mail) or die "error: $Mail::Sendmail::error";
print "done\a\n";

Since most readers will be familiar with the way email works, this module should be fairly easy to adapt to your own use. The one field that may not be immediately clear is smtp. This field should be set to the hostname or IP address of a machine that will accept SMTP relay requests from the machine on which your script is running. With the proliferation of email viruses of mass destruction, mail administrators don't usually allow their machines to be used by unknown parties. Talk to your local system administrator to find a suitable SMTP host for your needs.
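When the recipient list comes from elsewhere in your script, it can help to assemble the message hash in one place. This sketch assumes two made-up helpers, build_notice and send_notice; only sendmail and $Mail::Sendmail::error come from the module. Passing addresses in as variables also avoids Perl trying to interpolate the @ in a double-quoted string:

```perl
use strict;
use warnings;
use Mail::Sendmail;

# build_notice is a hypothetical helper that assembles the hash
# that sendmail expects from an SMTP host, a sender, and recipients.
sub build_notice {
    my ($smtp_host, $from, @recipients) = @_;
    return (
        Subject => "About your disk quota",
        To      => join(", ", @recipients),
        From    => $from,
        Message => "You've exceeded your disk quota",
        smtp    => $smtp_host,
    );
}

# send_notice (also hypothetical) returns 1 on success; on failure it
# reports the module's own error text and returns 0 instead of dying.
sub send_notice {
    my %mail = build_notice(@_);
    sendmail(%mail) and return 1;
    warn "mail not sent: $Mail::Sendmail::error\n";
    return 0;
}
```

A call such as send_notice('smtp-mailhost.hostname.com', 'admin@example.com', 'user1@example.com', 'user2@example.com') would then send one message to both recipients, with all addresses here being placeholders.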

41.13.3. CGI Teaser

What Perl chapter would be complete without some mention of CGI? The Common Gateway Interface is a standard by which web servers, like Apache, allow external programs to interact with web clients. The details of CGI can be found in O'Reilly's CGI Programming with Perl, but the code below uses the venerable CGI module to create a simple form and display the results after the user has hit the submit button. You will need to look through your local web server's configuration files to see where such a script needs to be in order for it to work. Unfortunately, that information is very system-dependent.

use CGI;

$cgi  = CGI->new;
$name = $cgi->param("usrname");

print
  $cgi->header, $cgi->start_html,
  $cgi->h1("My First CGI Program");

if( $name ){
  # escape the user's input before echoing it back into the page
  print $cgi->p("Hello, " . $cgi->escapeHTML($name));
}

print
  $cgi->start_form,
  $cgi->p("What's your name: "), $cgi->textfield(-name => "usrname"),
  $cgi->submit, $cgi->end_form,
  $cgi->end_html;

CGI scripts are unlike other scripts with which you are probably more familiar, because these programs have a notion of programming state. In other words, when the user first accesses this page, $name will be empty and a blank form with a text box will be displayed. When the user enters something into that text box and presses the form's submit button, the values of the form become available through the CGI method param. Here, the desired value is stored under the key usrname. If this value is populated, a simple message is displayed before showing the form again.
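You can watch both states without a web server, because CGI->new accepts a query string as its argument. The sketch below simulates the first visit (an empty query) and a submission. The helper name greeting and the sample name are invented for the example. The call to escapeHTML is an extra precaution for echoing user input and is not required by param itself:

```perl
use strict;
use warnings;
use CGI;

# greeting is a hypothetical helper that mirrors the script's logic:
# it returns the greeting for a query string, or undef if no name was sent.
sub greeting {
    my ($query_string) = @_;
    my $cgi  = CGI->new($query_string);
    my $name = $cgi->param("usrname");
    return unless defined $name && length $name;
    return "Hello, " . $cgi->escapeHTML($name);
}

# First visit (no parameters), then a submitted form
for my $query ("", "usrname=Alice") {
    my $message = greeting($query);
    print defined $message ? "$message\n" : "(blank form only)\n";
}
```

The first pass through the loop corresponds to the blank form; the second corresponds to the page the user sees after pressing the submit button.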

Now you have nearly all the tools necessary to create your own Internet search engine. I leave the details of creating a massive data storage and retrieval system needed to catalog millions of web pages as an exercise for the reader.

-- JJ




Copyright © 2003 O'Reilly & Associates. All rights reserved.