NAME

webpluck - pluck information from web pages


SYNOPSIS

webpluck

webpluck [-c FILE] [-s DIRECTORY] [-t FILE] [-x URL] [-b|-g] [-1|-2] [-u USER -p PASSWORD] [-d LEVEL] [targets]

webpluck [--config FILE] [--store DIRECTORY] [--template FILE] [--proxy URL] [--bad|--good] [--stage1|--stage2] [--user USER --password PASSWORD] [--debug LEVEL] [targets]


DESCRIPTION

webpluck is a tool that automatically fetches bits of information from your favorite web sites and presents them in a way that saves you time and keeps you from missing information you would like to see.

I have a number of sources of information that I try to look at frequently, ranging from web-based newspapers like C|Net and CNN, to online magazines like Web Review and Object Online, to less ``formal'' sources of information like the current world population (counters are stupid - I like to let people know how many people might be looking at my web page instead 8-).

These pages and others like them are great sources of information - my problem is that I don't have time to check each one every day to see what is there or whether the page has been updated.

There are a couple of different technologies that attempt to solve my problem. First there is the ``smart'' agent - the little gremlin that wanders out, roams the net, and tries to guess what you want to see. Never mind that I have never seen one that actually works well; besides, I already know what I want to see - I have the URLs in hand, I just don't have the time to go and check all the pages every day.

The second type is the ``custom newspaper'', which comes in two basic flavors. There is the kind like CRAYON, which is little more than a page full of links to other pages that change every day. This doesn't solve my problem - I'm still stuck clicking through lists of links to see what is on all the different pages. Then there are sites like My Yahoo, which is a single page whose content changes every day. This is closer to what I want: a single page that I need to check for the latest news, etc. The only problem with My Yahoo is that it is restricted to a small set of content categories. I want to see information from sources other than what Yahoo provides - specifically, from the web pages I mentioned above.

Webpluck is a tool that will allow you to create your own ``My Yahoo'' like web pages from an unlimited number of sources. You can view my daily page (at http://strobe.weeg.uiowa.edu/~edhill/daily.html ) for an example of what webpluck can do.

To use webpluck you need a fair to moderate understanding of Perl regular expressions. The more you understand about regular expressions, the more you will be able to make webpluck do.


USAGE

The simplest way to use webpluck is just to type webpluck with no arguments.

   $ webpluck

This causes webpluck to fetch information from all the targets defined in your standard configuration file; it then combines that information with your standard template file and writes a finished HTML document to STDOUT.

Typically you would at least want to redirect the output to a file. For example:

    $ webpluck > daily.html

This will create a daily.html file that is the result of combining the template you have defined with the data that has been collected.

You can also tell webpluck to fetch one or more individual targets defined in your configuration file. So if you know that a certain web page is updated more often than the rest, you can do the following:

    $ webpluck cnn-us > daily.html

This will go to the CNN web site and fetch the latest data, leave the rest of the previously collected data alone, and regenerate your finished HTML document.

Last there are a number of options that are defined in more detail below that can be used to change the behavior of webpluck in many different ways. For example:

    $ webpluck --debug 3 --template test.html --good \
               --proxy http://proxy.company.com \
               cnn-us dilbert cnet-news > daily-test.html

This tells webpluck to show a lot of debugging information, to use the test.html template instead of the built-in default, to obey any robot rules found on remote web servers, to use the company's proxy server, and to fetch only the cnn-us, dilbert, and cnet-news targets.


OPTIONS

--config, -c FILE
This option specifies what configuration file to use. The configuration file defines the ``targets'' that contain data you want to get at, and describes how to extract the data you want from the remote site.

--store, -s DIRECTORY
This option specifies the storage directory where output is saved once it is collected from the remote web pages.

--template, -t FILE
An HTML template file that is used to format the output fetched from the various web pages you have defined. This file contains ``webpluck tags'' that will be replaced with the data you have collected.

--proxy, -x URL
If you are behind a firewall and don't have direct access to web pages, you can use this flag to specify your HTTP proxy server. If you use this option, all web page requests will go through your proxy server (see the sketch after this list of options).

--bad, -b
This flag tells webpluck to ignore the robot rules found in the /robots.txt file on the remote host. By default, webpluck reads and obeys robot rules, since by some definitions it is a type of web robot. But you could also consider webpluck to be just a special type of web client; this option lets you decide. To be honest, webpluck is not that useful when obeying the robot rules: most sites with changing content disallow web robots, because they don't want those pages placed in search engine indexes. (The sketch after this list of options shows the robot-rules check.)

I believe that webpluck is different from the type of robots that robot rules are set up to protect against - which is why I have included this option.

--good, -g
This flag tells webpluck to act as a good web robot and to look for and obey rules found in /robots.txt files on remote servers. See the --bad option for reasons you might not want to use this flag.

--stage1, -1
Causes webpluck to just fetch data from remote web pages and then exit. It will not use that data along with the template file that you have specified to create a finished HTML document.

--stage2, -2
Causes webpluck to just generate an HTML document based on the template that you have specified and the information that already exists on your local machine. This prevents webpluck from going out and retrieving information from remote web pages.

--user, -u USERNAME
Specifies a user name used to connect to sites that require authentication. This value can be overridden with target-specific settings defined in your webpluck configuration file.

--password, -p PASSWORD
Specifies a password used to connect to sites that require authentication. This value can be overridden with target-specific settings defined in your webpluck configuration file.

--debug, -d LEVEL
This flag causes webpluck to output additional information about what it is doing (sent to STDERR, since normal output goes to STDOUT). LEVEL is an integer between 1 and 5 that specifies how much debugging output you want to see.
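
To make the --proxy and --good/--bad behaviors concrete, here is a minimal sketch - not webpluck's actual source - of how they map onto the LWP library that webpluck is built on. The proxy and target URLs are placeholders:

    use LWP::UserAgent;
    use HTTP::Request;
    use WWW::RobotRules;

    my $ua = LWP::UserAgent->new;
    $ua->agent('webpluck');
    $ua->proxy('http', 'http://proxy.company.com/');   # what --proxy arranges

    # What --good arranges: fetch /robots.txt and honor it.
    my $rules = WWW::RobotRules->new('webpluck');
    my $robots_url = 'http://www.cnn.com/robots.txt';
    my $res = $ua->request(HTTP::Request->new(GET => $robots_url));
    $rules->parse($robots_url, $res->content) if $res->is_success;

    my $url = 'http://www.cnn.com/US/';
    if ($rules->allowed($url)) {
        my $page = $ua->request(HTTP::Request->new(GET => $url));
        print $page->content if $page->is_success;
    }
    else {
        warn "robots.txt disallows $url\n";   # --bad skips this check entirely
    }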


CONFIGURATION FILE

Webpluck requires a configuration file to specify what web pages it is supposed to pull information from, and how it is supposed to get at that information. The configuration file is a text file that is made up of stanzas that look like the following.

   name     cnn-us
   url      http://www.cnn.com/US/
   regex    <H2>([^\<]+)<\/H2>.*?<A HREF=\"([^\"]+)\"
   fields   title:url

Each line in a stanza is a key/value pair separated by tabs or spaces. Each stanza must start with the line defining the target's name. Blank lines and lines starting with the '#' symbol are ignored; '#' acts as the comment character in the configuration file.

Valid keys are: name, url, regex, fields, user, and pass.
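
For illustration, here is a minimal sketch - not webpluck's actual parser - of how such stanzas can be read into per-target hashes:

    my (%targets, $name);
    open CONF, 'webpluck.conf' or die "webpluck.conf: $!\n";
    while (<CONF>) {
        next if /^\s*#/ || /^\s*$/;        # skip comments and blank lines
        chomp;
        my ($key, $value) = split ' ', $_, 2;
        $name = $value if $key eq 'name';  # a 'name' line starts a new stanza
        next unless defined $name;         # ignore junk before the first stanza
        $targets{$name}{$key} = $value;
    }
    close CONF;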

The name line allows you to specify the name of the target. This name should be made up of letters, numbers, and the '-' or '_' symbols. It should not contain spaces or other special characters that would interfere with HTML processing (quotes, '<', '>', etc.). This name is used in three ways. First, you can specify this name on the command line when invoking webpluck to have webpluck fetch just that specific target. Second, this is the name that you use in your template file when defining a <pluck> tag. Third, this is the name of the file in your cache directory where the information collected from a remote web page is saved.

The url line specifies the remote web page that you want to retrieve. This can be any non-SSL web page (sorry, LWP doesn't handle SSL yet). It can even be another form of URL, such as an ftp, gopher, or news URL. However, currently only HTTP-based URLs are supported when you are using a proxy server. You should also make sure that the URL you specify is the one that really contains the data you are looking for - it should not be a redirect or a frames-based reference.

The regex is the meat of a webpluck definition. It defines what information is going to be extracted from the remote web page.

The expression you define is placed inside the following loop within webpluck:

   while( $content =~ /$regex/isg ) { 
      ... webpluck does its thing ...
   }

$content is a variable containing the contents of the remote web page, and $regex is the expression that you define. Put simply: for each match of your regular expression in the page, the information captured by that match is saved locally, and then webpluck moves on to the next match. For a more detailed and complete explanation of what is going on, I suggest buying ``Mastering Regular Expressions'' by Jeffrey Friedl, published by O'Reilly and Associates, ISBN 1-56592-257-3.
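
As a worked example (against a made-up fragment of HTML, since the real CNN page has surely changed), here is the cnn-us regex from above capturing one title/url pair:

    my $content = '<H2>Storm hits coast</H2> ... <A HREF="/US/9706/storm/">';
    my $regex   = '<H2>([^\<]+)<\/H2>.*?<A HREF=\"([^\"]+)\"';
    while ($content =~ /$regex/isg) {
        my ($title, $url) = ($1, $2);   # $1 and $2 line up with 'fields title:url'
        print "$title => $url\n";       # prints: Storm hits coast => /US/9706/storm/
    }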

The fields line lists a number of field names that are associated with the data that is returned from your regular expression. The field names should contain just letters and numbers, and are separated by ':'. Most field names are just arbitrary strings that you can define (they are used later as place holder variables in your template file). However, there is one ``special'' field. If you are retrieving URL information from a remote site - you should call the field ``url''. If the URL that you retrieve is a relative URL, it is converted to an absolute URL as it is retrieved. This is important if you want to use that information from your own pages. This is the only field that is changed before it is saved locally.
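
For example, the conversion of a relative url field can be done with URI::URL, which ships with the LWP library that webpluck already requires; the values below are made up:

    use URI::URL;
    my $base     = 'http://www.unitedmedia.com/comics/dilbert/';
    my $relative = 'archive/dilbert0603.gif';   # hypothetical matched value
    my $absolute = url($relative, $base)->abs;
    print "$absolute\n";   # http://www.unitedmedia.com/comics/dilbert/archive/dilbert0603.gif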

The user and pass lines are used to define a username and password that should be used to connect to the URL you have specified. These values can also be supplied via the command line, but if they are specified in the configuration file, they override the command line values.


TEMPLATE FILE

The configuration file describes how to collect the data you are after; the template file describes how to display that data.

Typically your template file will be an HTML document that contains special ``tags'' indicating where certain data is supposed to go. The tags (and the information between the start of the tag and the end of the tag) are replaced with the information that has been retrieved from remote web pages; the rest of the document is left untouched. Although your template is typically an HTML document, it does not have to be - it can be any text-based file that you can edit and insert the tags indicating where the data goes (text files, RTF files, etc.).

The tag that defines where data goes is an HTML-like tag called ``pluck''. An example of the pluck tag is as follows:

    <pluck name=cnn-us limit=5 size=40>
    <li><a href="url">title</a>
    </pluck>

The <pluck ...> line specifies options that are used by webpluck when placing data on the page (in the above example the options indicate that we want data collected from the cnn-us target, that we only want the first 5 sets of results, and that the data being displayed should be truncated to 40 characters or less). The </pluck> tag signifies the end of a specific pluck template. Everything between <pluck ...> and </pluck> is the template that is used to display the data retrieved from the remote host. In the example above, for each headline retrieved from CNN, the line:

    <li><a href="url">title</a>

is written out, except that the strings ``url'' and ``title'' are replaced with the corresponding information that was retrieved from the CNN site. There is nothing special about the text between the <pluck ...> and </pluck> tags, except that keywords corresponding to the fields you have defined (and a few special keywords) are replaced with the information you have collected.
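
To illustrate, here is a minimal sketch - not webpluck's actual code - of expanding the block above, given data already collected for the cnn-us target (the records shown are made up):

    my @records = (
        { title => 'Storm hits coast', url => 'http://www.cnn.com/US/9706/storm/' },
        { title => 'Budget passes',    url => 'http://www.cnn.com/US/9706/budget/' },
    );
    my $template = qq{<li><a href="url">title</a>\n};  # the text between the tags
    my $limit    = 5;                                  # from the limit= option

    my $count = 0;
    for my $rec (@records) {
        last if ++$count > $limit;
        my $line = $template;
        $line =~ s/\b(url|title)\b/$rec->{$1}/g;       # swap field names for values
        print $line;
    }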

The following are valid options inside the <pluck> tag definition.

name=[[string]]
This is the only option that is required. It is the name of the data source that you want to insert into this template.

limit=[[int]]
This limits the amount of information that is shown. Some web pages cycle through their news stories, pushing them down as new ones come in until the old stories fall off the page; others just keep adding stories to the page. This option allows you to view only the top 10 stories, say, instead of printing all 1000 sets of data that were collected from the remote page.

item=[[int]]
This allows you to pick out one specific item that you retrieved. For example, if your expression matches columns in a table where one item is always the stock price and the fourth item is the percentage change in the stock, you can pull out just those two numbers for inclusion in your web page and ignore the rest of the information that was retrieved. See the Netscape stock price example below for more details on how this can be used.

size=[[field:int]] ...
This allows you to limit the amount of information that is printed for each field that you retrieve. Since you are pulling data from other people's pages, you have no control over the format or the amount of information that is retrieved. This option lets you truncate specific fields to a certain width before they are used in your template.

This option is different from the rest in that you can have more than one. You can specify size restrictions for each field that you have defined in your target.

if_changed=[[int]]
This is a conditional option. If you include this option, then the information between the <pluck></pluck> tags will only be shown if the information for this target has changed in the last [[int]] hours.

if_retrieved=[[int]]
A conditional statement like the one above, except that the information between the <pluck></pluck> tags will only be shown if the information for this target has been successfully retrieved in the last [[int]] hours.
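
Here is one plausible way such a test can work - an assumption about the mechanism, not webpluck's actual code - using the modification time of the target's cache file:

    my $cache_file = './cache/web-review';   # hypothetical per-target cache file
    my $hours      = 72;                     # the if_changed=72 value
    my $mtime      = (stat $cache_file)[9];  # when the file last changed
    my $show_block = defined $mtime
                     && (time - $mtime) / 3600 <= $hours;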

There are a few special keywords that all reflect times when data was collected or checked, or when a web page was updated. The special keywords are:

    _checked_          The time that a target was checked
    _retrieved_        The time information was successfully retrieved
                       from a target (typically the same as _checked_)
    _updated_          The time that webpluck last noticed that the
                       contents of a page had changed.

Currently these keywords apply only to the target as a whole, not to the individual sets of data retrieved from a remote web page. So if you use these special keywords, you will want to reduce the number of times they are printed by limiting the output to 1. Although this is somewhat annoying now, it leaves room for a future extension where each set of retrieved data carries its own times. For example:

    <pluck name=cnn-us limit=1>_retrieved_</pluck>
                       
An example below shows how these special keywords are typically used.


EXAMPLES

All of these examples were created on Jun 3rd, 1997. It is possible (likely) that the web pages I have listed as targets have changed their format so that the regular expressions provided here are no longer valid - but hopefully these will give you an idea of what is possible.

Fetch headlines from the CNN US web page
 name     cnn-us
 url      http://www.cnn.com/US/
 regex    <H2>([^\<]+)<\/H2>.*?<A HREF=\"([^\"]+)\"
 fields   title:url

Find the URL of today's Dilbert
 name     dilbert
 url      http://www.unitedmedia.com/comics/dilbert/
 regex    SRC=\"?([^>]?\/comics\/dilbert\/archive.*?\.gif)\"?\s+
 fields   url

Find out what Netscape stock is selling at
 name     netscape-stock
 url      http://quote.yahoo.com/quotes?symbols=NSCP&detailed=f
 regex    <td nowrap align=right>(.*?)</td>
 fields   value

Find out the last time that WebReview was updated
 name     web-review
 url      http://www.webreview.com/

The above examples show the target definitions used to fetch information from remote web sites. The examples below show html segments that you can include in your HTML template to use that data to generate a finished page.

Show the first 5 CNN US headlines (and the time it was updated)
 <b>CNN US News</b> 
 <i><font size=-1>
 <pluck name=cnn-us limit=1>_updated_</pluck> 
 </font></i>
 <hr>
 <pluck name=cnn-us limit=5>
 <li><a href="url">title</a>
 </pluck>

Show the Netscape stock value and its change percentage
 <b>NSCP:</b>
 <pluck name=netscape-stock item=1>value</pluck>, 
 <pluck name=netscape-stock item=3>value</pluck><br>

Provide the link to WebReview if it has changed in the past 3 days
 <pluck name=web-review if_changed=72>
 <a href="http://www.webreview.com/">Web Review</a>
 </pluck>


FILES

webpluck cache
A local directory where information that has been extracted from web pages is stored. For each web site that webpluck visits, the data that is retrieved from that site is stored in a separate file in this directory.

webpluck.conf
Webpluck configuration file. This file defines where webpluck goes looking for information (the URL), what information it is looking for (the fields) and how it goes about extracting that information (the regular expression).

template.html
Once you have information collected from various web sites, this file is used as a template to describe how to present that information. ``Tags'' in this file are replaced with the information that you have collected, and the finished document is sent to STDOUT.


DEPENDENCIES

Perl v5.003 or greater and the LWP library are required to run webpluck. You need to have them both installed on your system before webpluck will work. You can retrieve both from the CPAN archive at:

    http://www.perl.com/CPAN/
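
If you have the CPAN module configured, one common way to install LWP and its prerequisites is:

    $ perl -MCPAN -e 'install Bundle::LWP'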

A fair to moderate understanding of Perl regular expressions on your part is also required to make use of this tool in any useful way.


BUGS, MISSING FEATURES

Having regular expressions be an integral part of the program's design (and interface) limits its audience significantly. This is more a tool for webmasters, Perl gurus, and other folks with long beards and suspenders than it is a tool for mom and dad.

I would like to add some type of filtering process, so that not only can you grab information from a remote site, but you can also run it through a filter to decide whether it is something that you *really* want to see (either via some keyword list or some function that you define).


SEE ALSO

The perl manpage, The perlre manpage, ``Mastering Regular Expressions'' by Jeffrey Friedl published by O'Reilly and Associates, ISBN 1-56592-257-3.


AUTHOR

 Ed Hill (ed-hill@uiowa.edu)
 http://strobe.weeg.uiowa.edu/~edhill/
 Unix Systems Administrator, The University of Iowa