October 13th, 2009
Writing Polite Robots
If you want to make sure that your LWP-based program respects robots.txt files and does not make too many requests in a short period of time, you can use LWP::RobotUA instead of LWP::UserAgent.
LWP::RobotUA is almost the same as LWP::UserAgent, and you can use it the same way:
use LWP::RobotUA;
my $browser = LWP::RobotUA->new(
  'YourSuperBot/1.34', 'you@yoursite.com');
  # Your bot's name and your email address
my $response = $browser->get($url);
But LWP::RobotUA adds the following features:
* If the robots.txt on the server that $url refers to forbids you from accessing $url, then the $browser object (remember that it belongs to the class LWP::RobotUA) will not request it, and we will get back in $response a 403 error containing the line "Forbidden by robots.txt". So, if you have the following line:
die "$url -- ", $response->status_line, "\nAborted"
  unless $response->is_success;
Then the program would end with the message:
http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
Aborted at whateverprogram.pl line 1234
* If $browser sees that it talked to this server not long ago, it will pause (much like sleep) to avoid making a large number of requests in a short time. How long will the delay be? By default it is 1 minute, but you can control it by changing the $browser->delay(minutes) attribute.
For example:
$browser->delay(7/60);
This means that the browser will pause, when necessary, until 7 seconds have passed since the previous request.
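Both features can be tried without touching the network. The sketch below is only an illustration under stated assumptions: the robots.txt content is made up for the example host, and the rules are checked directly with WWW::RobotRules, the module that LWP::RobotUA relies on for its rule handling, rather than through a real request.

```perl
use strict;
use warnings;
use WWW::RobotRules;
use LWP::RobotUA;

# Hypothetical robots.txt content for the example host (an assumption).
my $robots_txt = <<'END';
User-agent: *
Disallow: /pith/
END

# Parse the rules as if they had been fetched from the server.
my $rules = WWW::RobotRules->new('YourSuperBot/1.34');
$rules->parse('http://whatever.site.int/robots.txt', $robots_txt);

for my $url ('http://whatever.site.int/pith/x.html',
             'http://whatever.site.int/index.html') {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'forbidden by robots.txt';
    print "$url: $verdict\n";
}

# The delay attribute is in minutes, so 7 seconds is written as 7/60.
my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
$browser->delay(7/60);
printf "delay between requests: %.0f seconds\n", $browser->delay * 60;
```

Nothing here sends a request; it only shows the decision that the robot makes before each request and how the delay value is expressed.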
For more information, read the full documentation for LWP::RobotUA.
Using Proxies
In some cases you will want, or need, to use a proxy to access certain sites or to use a certain protocol. Most often this need arises when your LWP program runs on a machine that is behind a firewall.
To make the browser use a proxy that is defined in environment variables (such as HTTP_PROXY), call env_proxy before making any requests. Specifically:
use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
# And before the first request:
$browser->env_proxy;
For more information on proxy settings, read the documentation for LWP::UserAgent, in particular the proxy, env_proxy and no_proxy methods.
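If the proxy is not defined in the environment, the proxy and no_proxy methods mentioned above let you set it explicitly. This is a minimal sketch; the proxy address and the host names are hypothetical and stand in for whatever your network actually uses.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# proxy.example.com:8080 is a hypothetical proxy address (an assumption).
$browser->proxy(['http', 'ftp'], 'http://proxy.example.com:8080/');

# Hypothetical hosts that should be contacted directly, bypassing the proxy.
$browser->no_proxy('localhost', 'intranet.example.com');

# Called with only a scheme, proxy() returns the current setting.
print "http requests go through: ", $browser->proxy('http'), "\n";
```

As with env_proxy, these calls belong before the first request.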
HTTP Authentication
Many sites restrict access to their pages using "HTTP Authentication". This is not simply a form where you enter a password to get at the information; it is a special mechanism in which the HTTP server sends the browser a message that says: "That document is part of a protected 'realm', and you can access it only if you re-request it and add some special authorization headers to your request".
For example, the administrators of Unicode.org keep email-harvesting programs out of their mailing-list archives by protecting them with HTTP Authentication. There is a shared username and password for access (given at http://www.unicode.org/mail-arch/): the username is "unicode-ml" and the password is "unicode".
Consider this URL, which is part of the protected area of the website:
http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
If you try to load this page in a browser, you will get a prompt: "Enter username and password for 'Unicode-MailList-Archives' at server 'www.unicode.org'", or in a graphical browser something like this:
[Screenshot of site with Basic Auth required]
In LWP, if you run the following:
use LWP 5.64;
my $browser = LWP::UserAgent->new;
my $url =
  'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $response = $browser->get($url);
die "Error: ", $response->header('WWW-Authenticate') ||
  'Error accessing',
  # ('WWW-Authenticate' is the realm-name)
  "\n ", $response->status_line, "\n at $url\n Aborting"
  unless $response->is_success;
Then you will get this error:
Error: Basic realm="Unicode-MailList-Archives"
 401 Authorization Required
 at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
 Aborting at auth1.pl line 9. [or wherever]
This happens because $browser does not know the username and password for the realm ("Unicode-MailList-Archives") on the host ("www.unicode.org"). The simplest way to tell the browser the username and password is to use the credentials method. The syntax is:
$browser->credentials(
  'servername:portnumber',
  'realm-name',
  'username' => 'password'
);
In most cases the port number is 80, the default TCP/IP port for HTTP; and you can call the credentials method before making any requests. For example:
$browser->credentials(
  'reports.mybazouki.com:80',
  'web_server_usage_reports',
  'plinky' => 'banjo123'
);
So, if we add the following right after the line $browser = LWP::UserAgent->new; :
$browser->credentials(  # add this to our $browser's "key ring"
  'www.unicode.org:80',
  'Unicode-MailList-Archives',
  'unicode-ml' => 'unicode'
);
and then run the program, the request will succeed.
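To see what the "special authorization headers" look like for Basic authentication, you can build one by hand. This is only a sketch of the header format, assuming the Basic scheme shown in the error above; it uses HTTP::Request's authorization_basic method and sends nothing over the network. In a real program the credentials method does this work for you.

```perl
use strict;
use warnings;
use HTTP::Request;
use MIME::Base64 qw(encode_base64);

my $url = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $request = HTTP::Request->new(GET => $url);

# Attach Basic credentials by hand; this only sets a header on the request object.
$request->authorization_basic('unicode-ml', 'unicode');

my $header = $request->header('Authorization');
print "Authorization: $header\n";

# The value is the word "Basic" followed by base64("username:password").
my $expected = 'Basic ' . encode_base64('unicode-ml:unicode', '');
print "matches the manual encoding\n" if $header eq $expected;
```

Because the username and password are only base64-encoded, not encrypted, Basic authentication over plain HTTP does not keep them secret.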