Converting relative links to absolute ones
The URI class, which we have just looked at, provides all sorts of methods for working with the parts of a URL (for example, asking what kind of URL it is with $url->scheme, asking which host it refers to with $url->host, and so on, as described in the documentation for the URI class). The most interesting methods, however, are the query_form method seen earlier and now the new_abs method, which turns a relative link ("../foo.html") into an absolute one ("http://www.perl.com/stuff/foo.html"):
use URI;
$abs = URI->new_abs($maybe_relative, $base);
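As a minimal offline sketch of new_abs and the scheme and host methods mentioned above, the following resolves a few links against a base address. The base URL and the list of links are made-up values chosen for illustration, not taken from a real page:

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL: the address of the page on which the links were found.
my $base = 'http://www.perl.com/stuff/bar/index.html';

# Links as they might appear in that page (illustrative values).
my @links = ('../foo.html', 'baz.html', '/top.html', 'http://www.cpan.org/');

my @absolute;
for my $link (@links) {
    # new_abs returns a URI object; a link that is already absolute is left unchanged.
    my $abs = URI->new_abs($link, $base);
    push @absolute, "$abs";
    print "$link => $abs (scheme ", $abs->scheme, ", host ", $abs->host, ")\n";
}
```

With this base, "../foo.html" resolves to http://www.perl.com/stuff/foo.html, the same absolute address used as the example in the text.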
For example, consider this small program, which extracts the links from the HTML page listing new modules on CPAN:
use strict;
use warnings;
use LWP 5.64;
my $browser = LWP::UserAgent->new;
my $url = 'http://www.cpan.org/RECENT.html';
my $response = $browser->get($url);
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;
my $html = $response->content;
while ($html =~ m/<A HREF="(.*?)"/g) {
    print "$1\n";
}
When run, it prints something like this:
MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...
But if you want a list of absolute links, you can use the new_abs method by changing the while loop as follows:
while ($html =~ m/<A HREF="(.*?)"/g) {
    print URI->new_abs($1, $response->base), "\n";
}
(The $response->base method, from HTTP::Message, returns the base address to use for converting relative links to absolute ones.)
Now our program prints what we need:
http://www.cpan.org/MIRRORING.FROM
http://www.cpan.org/RECENT
http://www.cpan.org/RECENT.html
http://www.cpan.org/authors/00whois.html
http://www.cpan.org/authors/01mailrc.txt.gz
http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
...
See Chapter 4, "URLs", of the book Perl and LWP for more information on URI objects.
Of course, using a regexp to extract addresses is too primitive a method, so more serious programs should use an HTML parsing module such as HTML::LinkExtor or HTML::TokeParser, or perhaps even HTML::TreeBuilder.
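As a rough sketch of the parser-based approach, here is how HTML::LinkExtor might be used instead of the regexp. The HTML fragment and the base address below are illustrative assumptions rather than the real CPAN page, so the example runs without a network connection:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Illustrative HTML fragment and base URL (assumptions, not the real page).
my $html = <<'END_HTML';
<a href="RECENT.html">Recent</a>
<A HREF="authors/00whois.html">Authors</A>
<img src="/misc/logo.png">
END_HTML
my $base = 'http://www.cpan.org/';

# No callback (undef): the links are collected and returned by links().
# Passing a base URL makes the parser return absolute URLs.
my $parser = HTML::LinkExtor->new(undef, $base);
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ($parser->links) {
    # Each entry is [ $tag, $attribute => $url, ... ] with lowercased names.
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};   # keep only <a href=...>
    push @hrefs, "$attr{href}";
}
print "$_\n" for @hrefs;
```

Unlike the regexp, the parser does not depend on the case of the tag or on the exact spacing and quoting of the attribute, and it also reports links in other elements (here the img tag, which the loop skips).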
Other browser attributes
LWP::UserAgent objects have a number of attributes for controlling their own behavior. Some of them:
*
$browser->timeout(15): This method sets the maximum time to wait for a response from the server. If no response is received after 15 seconds (in this case), the browser gives up on the request.
*
$browser->protocols_allowed(['http', 'gopher']): This sets the types of links the browser will "speak", in this case only HTTP and gopher. If an attempt is made to access a document through any other protocol (for example "ftp:", "mailto:", "news:"), no connection is even attempted, and we get a 500 error with a message like "Access to ftp URIs has been disabled".
*
use LWP::ConnCache;
$browser->conn_cache(LWP::ConnCache->new()): After this is set, the browser object tries to use HTTP/1.1 "Keep-Alive", which speeds up requests by reusing one connection for several requests to the same server.
*
$browser->agent('SomeName/1.23 (more info here maybe)'): This determines how our browser identifies itself in the "User-Agent" line of its HTTP requests. By default it sends "libwww-perl/versionnumber", for example "libwww-perl/5.65". You can change it to something more informative:
$browser->agent('SomeName/3.14 (contact@robotplexus.int)');
Or, if necessary, you can pretend to be a real browser:
$browser->agent(
    'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)');
*
push @{ $browser->requests_redirectable }, 'POST': This tells our browser to follow redirects on POST requests as well (as most modern browsers do: IE, NN, Opera), even though the HTTP RFC says that this should not normally be done.
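Taken together, the attributes above might be set as in the following sketch. It reads some of the values back and then makes one request with a scheme that is not allowed, which should be refused without any connection being opened. The host ftp.example.org is a placeholder, not a real server used by the tutorial:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new;

$browser->timeout(15);                               # give up after 15 seconds
$browser->protocols_allowed(['http', 'gopher']);     # refuse every other scheme
$browser->conn_cache(LWP::ConnCache->new());         # reuse connections (Keep-Alive)
$browser->agent('SomeName/3.14 (contact@robotplexus.int)');
push @{ $browser->requests_redirectable }, 'POST';   # also follow redirects on POST

# Read the settings back to confirm that they took effect.
print "timeout:      ", $browser->timeout, "\n";
print "agent:        ", $browser->agent, "\n";
print "redirectable: @{ $browser->requests_redirectable }\n";

# A request with a disallowed scheme is rejected by the browser object itself.
# ftp.example.org is a placeholder host.
my $response = $browser->get('ftp://ftp.example.org/');
print "ftp request:  ", $response->status_line, "\n";
```

Each of these methods also returns the current value when called without arguments, which is what the print statements rely on.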
