October 13th, 2009

Writing polite robots

If you want to make sure that your LWP-based program respects robots.txt files and does not make too many requests in a short period of time, you can use LWP::RobotUA instead of LWP::UserAgent.

LWP::RobotUA is almost the same as LWP::UserAgent, and you can use it the same way:

use LWP::RobotUA;
my $browser = LWP::RobotUA->new(
  'YourSuperBot/1.34', 'you@yoursite.com');
  # Your bot's name and your email address

my $response = $browser->get($url);

But LWP::RobotUA adds the following features:

* If the robots.txt on the server that $url refers to forbids you from accessing $url, then the $browser object (assuming it is of class LWP::RobotUA) will not request it. Instead, the response ($response) will be a 403 error containing the line "Forbidden by robots.txt". So, if you have the following line:

die "$url -- ", $response->status_line, "\nAborted"
  unless $response->is_success;

then the program will die with the message:

http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
Aborted at whateverprogram.pl line 1234

* If $browser sees that it talked to this server recently, it will pause (much like sleep) to avoid making many requests in a short time. How long is the delay? By default it is 1 minute, but you can control it by setting the $browser->delay(minutes) attribute.

For example:

$browser->delay(7/60);

This means that the browser will pause, when necessary, until 7 seconds have passed since its previous request.
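As a rough sketch of the delay setting described above, the following converts a delay given in seconds into the minutes that delay() expects. The bot name and address are the same placeholders used earlier. The use_sleep(0) call is an optional extra that the text above does not rely on: it asks the agent to hand back a 503 response with a Retry-After header instead of sleeping.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder bot name and contact address; substitute your own.
my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');

# delay() is expressed in minutes, so convert from seconds.
my $seconds = 7;
$browser->delay($seconds / 60);

# Optional: do not block when a request comes too soon. The agent then
# returns a synthetic 503 response carrying a Retry-After header.
$browser->use_sleep(0);

printf "Delay between requests: %.0f seconds\n", $browser->delay * 60;
```

No request is made here; the block only configures the agent, so it can be run safely before pointing the bot at a real site.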

For more information, read the full documentation for LWP::RobotUA.

Using proxies

In some cases you will want, or need, to use a proxy to access certain sites or to use a certain protocol. This is most often the case when your LWP program runs on a machine that is behind a firewall.

To make a browser use the proxies defined in the environment variables (such as HTTP_PROXY), call env_proxy before making any requests. In particular:

use LWP::UserAgent;
my $browser = LWP::UserAgent->new;

# And before making the first request:
$browser->env_proxy;

For more information on proxy parameters, read the documentation for LWP::UserAgent, in particular the proxy, env_proxy, and no_proxy methods.
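When no suitable environment variables are set, the proxy and no_proxy methods mentioned above can be called directly. The following is a minimal sketch; the proxy address and the host names are made up for illustration and need to be replaced with your own.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy address: route http and ftp requests through it.
$browser->proxy(['http', 'ftp'], 'http://proxy.example.int:8080/');

# Hypothetical hosts that should be contacted directly, bypassing the proxy.
$browser->no_proxy('localhost', 'intranet.example.int');

# Calling proxy() with only a scheme returns the proxy currently set for it.
print "HTTP proxy: ", $browser->proxy('http'), "\n";
```

As with env_proxy, these calls belong before the first request.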

HTTP Authentication

Many sites restrict access to their pages using "HTTP Authentication". This is not just a form where you enter a password to get at the information; it is a specific mechanism in which the HTTP server sends the browser a message that says: "That document is part of a protected 'realm', and you can access it only if you re-request it and add some special authorization headers to your request".

For example, the administrators of Unicode.org keep email-harvesting programs out of their mailing list archives by protecting them with HTTP Authentication. There is a publicly stated username and password for access (at http://www.unicode.org/mail-arch/): the username is "unicode-ml" and the password is "unicode".

For example, consider this URL, which is part of the protected area of the website:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

If you try to load this page with a browser, you will get a prompt such as "Enter username and password for 'Unicode-MailList-Archives' at server 'www.unicode.org'", or in a graphical browser something like this:

[Screenshot of site with Basic Auth required]

In LWP, if you run the following:

use LWP 5.64;
my $browser = LWP::UserAgent->new;

my $url =
  'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $response = $browser->get($url);

die "Error: ", $response->header('WWW-Authenticate') ||
  'Error accessing',
  # ('WWW-Authenticate' is the realm-name)
  "\n ", $response->status_line, "\n at $url\n Aborting"
  unless $response->is_success;

then you will get this error:

Error: Basic realm="Unicode-MailList-Archives"
 401 Authorization Required
 at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
 Aborting at auth1.pl line 9.  [or wherever]

This happens because $browser does not know the username and password for the realm ("Unicode-MailList-Archives") on the host ("www.unicode.org"). The simplest way to tell the browser the username and password is to use the credentials method. The syntax is:

$browser->credentials(
  'servername:portnumber',
  'realm-name',
  'username' => 'password'
);

In most cases the port number is 80, the default TCP/IP port for HTTP, and you call the credentials method before making any requests. For example:

$browser->credentials(
  'reports.mybazouki.com:80',
  'web_server_usage_reports',
  'plinky' => 'banjo123'
);
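If you would rather not type the 'servername:portnumber' string by hand, it can be derived from the URL itself. This is only a sketch, assuming the URI module (which LWP itself depends on) is available; the variable name $netloc is arbitrary.

```perl
use strict;
use warnings;
use URI;

my $url = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';

# host_port() always includes the port, filling in the scheme's default
# (80 for http) when the URL does not name one explicitly.
my $netloc = URI->new($url)->host_port;

print "$netloc\n";   # prints www.unicode.org:80
```

The resulting string can be passed as the first argument to credentials.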

So, if we add the following right after the line $browser = LWP::UserAgent->new;

$browser->credentials(  # add this to our $browser's "key ring"
  'www.unicode.org:80',
  'Unicode-MailList-Archives',
  'unicode-ml' => 'unicode'
);

and then run the program, the request will succeed.
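To get a feel for what those "special authorization headers" look like for Basic authentication, the sketch below builds a request by hand and inspects the header, without contacting the server. It uses authorization_basic from HTTP::Request rather than credentials, so it illustrates the header format and is not a trace of what $browser does internally.

```perl
use strict;
use warnings;
use HTTP::Request;
use MIME::Base64 qw(decode_base64);

my $url = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $request = HTTP::Request->new(GET => $url);

# Same username and password as in the credentials() call above.
$request->authorization_basic('unicode-ml', 'unicode');

# The header value is "Basic " followed by Base64 of "username:password".
my $header = $request->header('Authorization');
my ($encoded) = $header =~ /^Basic (\S+)$/;

print "Authorization: $header\n";
print "Decodes to: ", decode_base64($encoded), "\n";
```

Because Base64 is an encoding and not encryption, anyone who can see the request can recover the password, which is worth keeping in mind when using Basic authentication over plain HTTP.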
