webpluck [-c FILE] [-s DIRECTORY] [-t FILE] [-p URL] [-b|-g] [-1|-2] [-u USER -p PASSWORD] [-d LEVEL] [targets]
webpluck [--config FILE] [--store DIRECTORY] [--template FILE] [--proxy URL] [--bad|--good] [--stage1|--stage2] [--user USER --password PASSWORD] [--debug LEVEL] [targets]
I have a number of sources of information that I try to look at frequently, ranging from web-based newspapers like C|Net and CNN, to online magazines like Web Review and Object Online, to less ``formal'' sources of information like the current world population (counters are stupid - I like to let people know how many people might be looking at my web page instead 8-).
These pages and others like them are great sources of information - my problem is that I don't have time to check each one every day to see what is there or whether the page has been updated.
There are a couple of different technologies that attempt to solve my problem. First there is the ``smart'' agent - the little gremlin that wanders out, roams the net, and tries to guess what you want to see. Never mind that I have never seen one that actually works well; I already know what I want to see - I have the URLs in hand, I just don't have the time to go and check all the pages every day.
The second type is the ``custom newspaper'', which comes in two basic forms. There is the kind like CRAYON, which is little more than a page full of links to other pages that change every day. This doesn't solve my problem - I'm still stuck clicking through lists of links to see what is on all the different pages. Then there are sites like My Yahoo, a single page whose content changes every day. This is closer to what I want: a single site that I need to check for the latest news, etc. The only problem with My Yahoo is that it is restricted to a small set of content categories. I want to see information from resources other than what Yahoo provides - specifically, from the web pages I mentioned above.
Webpluck is a tool that lets you create your own ``My Yahoo''-like web pages from an unlimited number of sources. You can view my daily page (at http://strobe.weeg.uiowa.edu/~edhill/daily.html ) for an example of what webpluck can do.
To use webpluck you need a fair to moderate understanding of Perl regular expressions. The more you understand about regular expressions, the more you will be able to make webpluck do.
$ webpluck
This causes webpluck to fetch information from all of the targets you have defined in your standard configuration file, combine that information with your standard template file, and write a finished HTML document to STDOUT.
Typically you would at least want to redirect the output to a file. For example:
$ webpluck > daily.html
This creates a daily.html file that is the result of combining the template you have defined with the data that has been collected.
You can also tell webpluck to fetch an individual target, or a list of targets, defined in your configuration file. So if you know that a certain web page is updated more often than the rest, you can do the following:
$ webpluck cnn-us > daily.html
This goes to the CNN web site and fetches the latest data, but leaves the rest of the collected data alone and regenerates your finished HTML document.
Last, there are a number of options, described in more detail below, that can be used to change the behavior of webpluck in many different ways. For example:
$ webpluck --debug 3 --template test.html --good \
      --proxy http://proxy.company.com \
      cnn-us dilbert cnet-news > daily-test.html
This tells webpluck to show a lot of debugging information, use the test.html template instead of the built-in default, obey any robot rules found on remote web servers, use the company's proxy server, and only load the information from the cnn-us, dilbert, and cnet-news targets.
I believe that webpluck is different from the type of robot that robot rules are set up to protect against - which is why I have included this option.
name    cnn-us
url     http://www.cnn.com/US/
regex   <H2>([^\<]+)<\/H2>.*?<A HREF=\"([^\"]+)\"
fields  title:url
Each line in the stanza is a key/value pair, separated by tabs or spaces. Each stanza must start with the line defining the target's name. Blank lines and lines starting with the '#' symbol are ignored; the '#' symbol acts as a comment character in the configuration file.
Valid keys are: name, url, regex, fields, user, and pass.
The name line specifies the name of the target. This name should be made up of letters, numbers, and the '-' or '_' symbols. It should not contain spaces or other special characters that would interfere with HTML processing ('"', '>', '<', etc.). This name is used in three ways. First, you can specify it on the command line when invoking webpluck to have webpluck fetch just that target. Second, it is the name that you use in your template file to define a <pluck> tag. Third, it is the name of the file in your cache directory where the information collected from the remote web page is saved.
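For example, a target named cnn-us shows up in all three places; the cache file name shown here is an assumption about how your cache (--store) directory is laid out, with one file per target named after the target:

$ webpluck cnn-us                          (fetch just this target)
<pluck name=cnn-us limit=5>...</pluck>     (reference it in your template)
STORE_DIRECTORY/cnn-us                     (its cache file in your --store directory)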
The url line specifies the remote web page that you want to retrieve. This can be any non-SSL web page (sorry, LWP doesn't handle SSL yet). It can even be another form of URL, such as an ftp, gopher, or news URL; however, only HTTP-based URLs are currently supported when you are using a proxy server. You should also make sure that the URL you specify is the one that really contains the data you are looking for - it should not be a redirect or a frames-based reference.
The regex is the meat of a webpluck definition. It defines what information is going to be extracted from the remote web page.
The expression you define is placed inside the following loop within webpluck:
while ( $content =~ /$regex/isg ) {
    # ... webpluck does its thing ...
}
$content is a variable containing the contents of the remote web page, and $regex is the expression that you define. A simple explanation is that for every possible match of your regular expression in the page, the information pulled out by that match is saved locally, and then webpluck goes on to the next match. For a more detailed and complete explanation of what is going on, I suggest buying ``Mastering Regular Expressions'' by Jeffrey Friedl, published by O'Reilly and Associates, ISBN 1-56592-257-3.
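To make the mechanics concrete, here is a small standalone Perl sketch (not webpluck's actual code) that runs the cnn-us regular expression against a made-up fragment of HTML and saves each match as a record:

#!/usr/bin/perl -w
use strict;

# A fragment standing in for the fetched page contents.
my $content = '<H2>Flood waters recede</H2> ... <A HREF="/US/9610/flood/">'
            . '<H2>Election day nears</H2> ... <A HREF="/US/9610/election/">';

my $regex  = '<H2>([^\<]+)<\/H2>.*?<A HREF=\"([^\"]+)\"';
my @fields = qw(title url);

# Each time the pattern matches, build one record from the captured
# groups, then continue scanning where the last match left off.
while ( $content =~ /$regex/isg ) {
    my %record;
    @record{@fields} = ( $1, $2 );
    print "$record{title} => $record{url}\n";
}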
The fields line lists the field names that are associated with the data returned by your regular expression. The field names should contain just letters and numbers, and are separated by ':'. Most field names are arbitrary strings that you define (they are used later as placeholder variables in your template file). However, there is one ``special'' field: if you are retrieving URL information from a remote site, you should call the field ``url''. If the URL that you retrieve is a relative URL, it is converted to an absolute URL as it is retrieved. This is important if you want to use that information from your own pages. This is the only field that is changed before it is saved locally.
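The relative-to-absolute conversion can be done with the URI module that comes with libwww-perl; the following is only a sketch of that idea with made-up values (webpluck's internals may differ):

use URI;

# $base is the target's url line; $link is a captured relative URL.
my $base = 'http://www.cnn.com/US/';
my $link = '/US/9610/flood/index.html';

# Resolve the relative reference against the page it came from.
my $abs = URI->new_abs( $link, $base )->as_string;
print "$abs\n";   # http://www.cnn.com/US/9610/flood/index.html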
The user and pass lines define a username and password that should be used to connect to the URL you have specified. These values can also be supplied on the command line, but if they are specified in the configuration file, they override the command-line values.
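With LWP, supplying such credentials usually means attaching a Basic Authorization header to the request; the following sketch (not webpluck's actual code, and with a made-up URL, username, and password) shows the general idea:

use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => 'http://www.example.com/private/page.html' );

# $user and $pass would come from the target definition (or the command line).
my ( $user, $pass ) = ( 'myname', 'secret' );
$req->authorization_basic( $user, $pass );

my $res = $ua->request($req);
print $res->is_success ? $res->content : "failed: " . $res->status_line, "\n";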
Typically your template file is an HTML document that contains special ``tags'' indicating where certain data should go. The tags (and the information between the start of the tag and the end of the tag) are replaced with the information that has been retrieved from the remote web pages, and the rest of the document is left untouched. Although your template is typically an HTML document, it does not have to be - it can be any text-based file into which you can insert the tags indicating where the data goes (text files, RTF files, etc.).
The tag that defines where data goes is an HTML-like tag called ``pluck''. An example of the pluck tag is as follows:
<pluck name=cnn-us limit=5 size=40>
<li><a href="url">title</a>
</pluck>
The <pluck ...> line specifies options that are used by webpluck when placing data on the page (in the above example the options indicate that we want data collected from the cnn-us target, that we only want the first 5 sets of results, and that the data being displayed should be truncated to 40 characters or less). The </pluck> tag signifies the end of a specific pluck template. Everything between the <pluck ...> and </pluck> tags is the template that is used to display the data retrieved from the remote host. In the example above, for each headline retrieved from CNN, the line:
<li><a href="url">title</a>
is written out, except that the strings ``url'' and ``title'' are replaced with the corresponding information retrieved from the CNN site. There is nothing special about the information between the <pluck ...> and </pluck> tags, except that keywords corresponding to fields that you have defined (and a few special keywords) are replaced with the information that you have collected.
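Conceptually, that substitution is nothing more than replacing each field name in the per-item template with the value saved for it; a rough Perl sketch of the idea (not the actual implementation, and naive about keywords appearing inside values) looks like this:

# One record plucked from the cnn-us target (made-up values).
my %record = (
    title => 'Flood waters recede in Midwest',
    url   => 'http://www.cnn.com/US/9610/flood/index.html',
);

# The body of the <pluck> ... </pluck> block.
my $template = qq{<li><a href="url">title</a>\n};

# Replace each field keyword with the corresponding value.
my $line = $template;
foreach my $field ( keys %record ) {
    $line =~ s/$field/$record{$field}/g;
}
print $line;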
The following are valid options inside the <pluck> tag definition.
This option is different from the rest because you can have more than one: you can specify a size restriction for each field that you have defined in your target.
_checked_     The time that a target was checked.
_retrieved_   The time that information was successfully retrieved from a target (typically the same as _checked_).
_updated_     The time that webpluck last noticed that the contents of a page had changed.
Currently these keywords apply only to the target as a whole, not to the individual sets of data retrieved from a remote web page. So if you use these special keywords, you will want to reduce the number of times they are printed by limiting the output to 1. Although this is somewhat annoying now, it allows the feature to be easily extended in the future so that specific times can be associated with each set of data that is retrieved. For example:
<pluck name=cnn-us limit=1>_retrieved_</pluck>
An example below shows how these special keywords are typically used.
name    cnn-us
url     http://www.cnn.com/US/
regex   <H2>([^\<]+)<\/H2>.*?<A HREF=\"([^\"]+)\"
fields  title:url

name    dilbert
url     http://www.unitedmedia.com/comics/dilbert/
regex   SRC=\"?([^>]?\/comics\/dilbert\/archive.*?\.gif)\"?\s+
fields  url

name    netscape-stock
url     http://quote.yahoo.com/quotes?symbols=NSCP&detailed=f
regex   <td nowrap align=right>(.*?)</td>
fields  value

name    web-review
url     http://www.webreview.com/
<b>CNN US News</b>
<font size=-1><i>
<pluck name=cnn-us limit=1>_updated_</pluck>
</i></font>
<hr>
<pluck name=cnn-us limit=5>
<li><a href="url">title</a>
</pluck>
<b>NSCP:</b> <pluck name=netscape-stock item=1>value</pluck>,
<pluck name=netscape-stock item=3>value</pluck><br>
<pluck name=web-review if_changed=72>
<a href="http://www.webreview.com/">Web Review</a>
</pluck>
http://www.perl.com/CPAN/
A fair to moderate understanding of Perl regular expressions on your part is also required to make use of this tool in any useful way.
I would like to add some type of filtering process, so that not only can you grab information from a remote site, you can also run it through a filter to see if it is something that you *really* want to see (either via some keyword list, or some function that you define).
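Purely as a sketch of what I have in mind (none of this exists in webpluck today), such a filter might be a small Perl subroutine that is handed one plucked record and returns true if it should be kept:

# Hypothetical keyword filter - nothing like this exists in webpluck yet.
my @keywords = qw(iowa perl linux);

sub keep_record {
    my ($record) = @_;    # one plucked record, e.g. { title => ..., url => ... }
    foreach my $word (@keywords) {
        return 1 if $record->{title} =~ /\Q$word\E/i;
    }
    return 0;             # drop anything that doesn't mention a keyword
}

# Hypothetically applied after extraction, before the records are cached:
# @records = grep { keep_record($_) } @records;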
Ed Hill (ed-hill@uiowa.edu) http://strobe.weeg.uiowa.edu/~edhill/ Unix Systems Administrator, The University of Iowa