Web Basics with LWP

For more information, read the full documentation for LWP::UserAgent.

Writing Polite Robots


If you want to make sure that your LWP-based program respects robots.txt files and does not make too many requests in a short period of time, you can use LWP::RobotUA instead of LWP::UserAgent.


LWP::RobotUA is just like LWP::UserAgent, and you can use it the same way:



use LWP::RobotUA;
my $browser = LWP::RobotUA->new(
  'YourSuperBot/1.34', 'you@yoursite.com');
  # Your bot's name and your email address

my $response = $browser->get($url);


But LWP::RobotUA adds the following features:


* If the robots.txt on the server that $url refers to forbids you from accessing $url, then the $browser object (assuming it is of class LWP::RobotUA) will not request it. Instead, the response ($response) will be a 403 error containing the line "Forbidden by robots.txt". So, if you have the following line:



die "$url - ", $response->status_line, "\nAborted"
  unless $response->is_success;


then the program would die with the message:



http://whatever.site.int/pith/x.html - 403 Forbidden by robots.txt
Aborted at whateverprogram.pl line 1234


* If $browser sees that you have talked to this server recently, it will pause (much like sleep) to avoid making many requests in a short time. How long will the delay be? By default it is one minute, but you can control it by setting the $browser->delay(minutes) attribute.


For example:



$browser->delay(7/60);


This means that the browser will pause, when necessary, until 7 seconds have passed since the previous request.
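As a minimal sketch of the delay setting described above, the following script builds a robot user agent, sets a 7-second delay, and reads the value back. It makes no network requests. The bot name and email address are placeholders carried over from the earlier example, and the variable $seconds is introduced here only for illustration.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder bot name and contact address; substitute your own.
my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');

# delay() takes minutes, so 7/60 asks for at least 7 seconds
# between requests to the same server.
$browser->delay(7/60);

# Called without arguments, delay() returns the current value in minutes.
my $seconds = $browser->delay * 60;
printf "Minimum gap between requests: %.0f seconds\n", $seconds;
```

Because the value is expressed in minutes, writing short delays as a fraction such as 7/60 keeps the intended number of seconds visible in the code.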


For more information, read the full documentation for LWP::RobotUA.

Using Proxies


In some cases you want, or need, to use a proxy to access certain sites or to use a certain protocol. This is most often the case when your LWP program runs on a machine that is behind a firewall.


To make a browser use the proxies defined in environment variables (such as HTTP_PROXY), call env_proxy on the browser object before making any requests. Specifically:



use LWP::UserAgent;
my $browser = LWP::UserAgent->new;

# And before you make any requests:
$browser->env_proxy;


For more information on proxy parameters, read the LWP::UserAgent documentation, in particular the proxy, env_proxy, and no_proxy methods.
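When the environment variables are not set, the proxy and no_proxy methods mentioned above can be called directly. The following is a sketch only: the proxy URL and the host names are hypothetical and need to be replaced with values from your own network. It makes no network requests.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy address; replace with your own proxy's URL.
my $proxy_url = 'http://proxy.example.com:8080/';

# Route http and ftp requests through the proxy...
$browser->proxy(['http', 'ftp'], $proxy_url);

# ...but contact these (hypothetical) hosts directly.
$browser->no_proxy('localhost', 'intranet.example.com');

# Called with only a scheme, proxy() returns the current setting.
print "http proxy: ", $browser->proxy('http'), "\n";
```

As with env_proxy, these calls should come before the first request so that every request goes through the intended route.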

HTTP Authentication


Many sites restrict access to their pages using "HTTP Authentication". This is not just a form where you enter a password to get at the information. It is a special mechanism in which the HTTP server sends the browser a message that says: "That document is part of a protected 'realm', and you can access it only if you re-request it and add some special authorization headers to your request".


For example, the administrators of the Unicode.org site keep email-harvesting programs out of their mailing list archives by protecting them with HTTP Authentication. There is a public login and password for access (given at http://www.unicode.org/mail-arch/): the login is "unicode-ml" and the password is "unicode".


Consider this URL, which is part of the protected area of the website:



http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html


If you try to load this page in a browser, you will get a prompt: "Enter username and password for 'Unicode-MailList-Archives' at server 'www.unicode.org'", or in a graphical browser something like this:

[Screenshot of site with Basic Auth required]


In LWP, if you just run the following:



use LWP 5.64;
my $browser = LWP::UserAgent->new;

my $url =
  'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $response = $browser->get($url);

die "Error: ", ($response->header('WWW-Authenticate') ||
    'Error accessing'),
  # ('WWW-Authenticate' is the realm-name)
  "\n ", $response->status_line, "\n at $url\n Aborting"
  unless $response->is_success;


then you will get this error:



Error: Basic realm="Unicode-MailList-Archives"

401 Authorization Required

at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

Aborting at auth1.pl line 9. [or wherever]


This is because $browser does not know the login and password for that realm ("Unicode-MailList-Archives") on that host ("www.unicode.org"). The simplest way to give the browser the login and password is to use the credentials method. The syntax is:



$browser->credentials(
  'servername:portnumber',
  'realm-name',
  'username' => 'password'
);


In most cases the port number is 80, the default TCP/IP port for HTTP, and you can usually call the credentials method before making any requests. For example:




$browser->credentials(
  'reports.mybazouki.com:80',
  'web_server_usage_reports',
  'plinky' => 'banjo123'
);


So, if we add the following right after the line $browser = LWP::UserAgent->new; :



$browser->credentials(  # add this to our $browser's "key ring"
  'www.unicode.org:80',
  'Unicode-MailList-Archives',
  'unicode-ml' => 'unicode'
);


and then run it, the request succeeds.
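The "key ring" is looked up by the exact host:port string and realm name, so a mismatch in either means no login is sent. The following offline sketch illustrates this by storing the Unicode.org login and reading it back. It assumes a reasonably recent LWP in which credentials, called without a username and password, returns the stored pair. Port 8080 is used only as a deliberately wrong value.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Store the login for this host:port and realm.
$browser->credentials(
  'www.unicode.org:80',
  'Unicode-MailList-Archives',
  'unicode-ml' => 'unicode'
);

# With no username/password arguments, credentials() returns what is stored.
my ($user, $pass) = $browser->credentials(
  'www.unicode.org:80', 'Unicode-MailList-Archives');
print "Stored login: $user\n";

# A different port (or realm) is a different key, so nothing comes back.
my @none = $browser->credentials(
  'www.unicode.org:8080', 'Unicode-MailList-Archives');
print "Entries for port 8080: ", scalar(@none), "\n";
```

If a request still fails with a 401 after calling credentials, comparing the realm in the WWW-Authenticate header with the realm string passed to credentials is a reasonable first check.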