PHP SCRAPER

Posted: October 23, 2011 in Uncategorized

Sometimes we want to extract the HTML content of a remote web page; this technique is called HTML scraping. This article discusses how we can extract the HTML content of a remote web page with PHP.

We can scrape a page in two steps:

  • Call the remote web page and fetch its HTML content.
  • Match the HTML tags using regular expressions.

Calling the remote web page using PHP:
In PHP there are various ways to call a remote web page. Here we will use cURL.

$ch = curl_init();
$timeout = 5; // connection timeout in seconds; set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch); // false if the request failed
curl_close($ch);

$url holds the remote URL you want to connect to, and $file_contents holds the HTML content of the remote web page that was fetched. If the request fails, curl_exec returns false instead.
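The fetch step can be wrapped in a small function so that failures are handled in one place. This is only a sketch: the name fetch_html, its parameters, and the choice to follow redirects are assumptions for illustration and not part of the original snippet.

```php
<?php
// Hypothetical helper (not from the original post):
// fetch a page and return its HTML, or null if the request fails.
function fetch_html(string $url, int $timeout = 5): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // assumption: we want to follow redirects

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, ...)
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }
    return $body;  // the handle is released when $ch goes out of scope
}
```

With this in place, $file_contents = fetch_html($url); gives either the page source or null, so the matching code that follows can check for null before running any regular expression.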

Matching the HTML tags using regular expressions in PHP:
Here we will use preg_match and preg_match_all to read the HTML tags from the HTML source. Below are a few regular expressions that extract the content inside HTML tags.

 

Extracting data from HTML tags

    preg_match_all('/<span>[\/\(\)-:<>\w\s]+<\/span>/', $file_contents, $htmlContent);

Assume that $file_contents holds the HTML source code. After executing the above preg_match_all, $htmlContent[0] will hold the span tags found in the HTML source code (those written as a bare <span> with no attributes). To get data from any other tag, just replace span in the pattern with that tag name.
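Swapping the tag name by hand gets repetitive, so the idea can be sketched as a function that takes the tag as a parameter. The helper name extract_tag_contents and the sample HTML are made up for illustration. The pattern is also a deliberate variation: it uses a lazy (.*?) group so that each tag pair is matched separately, and [^>]* so that opening tags with attributes are accepted too.

```php
<?php
// Hypothetical helper: return the inner content of every <$tag>...</$tag> pair.
function extract_tag_contents(string $html, string $tag): array
{
    $t = preg_quote($tag, '#');
    // \b stops "span" from matching "spanner", [^>]* allows attributes,
    // (.*?) is lazy, i = case-insensitive, s = dot also matches newlines
    $pattern = '#<' . $t . '\b[^>]*>(.*?)</' . $t . '>#is';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];  // group 1 holds only the text between the tags
}

// Made-up sample markup standing in for $file_contents
$sample = '<p>Intro</p><span>first</span> <span class="x">second</span>';
$spans = extract_tag_contents($sample, 'span');
$paragraphs = extract_tag_contents($sample, 'p');
```

Here $spans holds first and second, and $paragraphs holds Intro. A lazy pattern like this still breaks when the same tag is nested inside itself, so it suits simple, flat markup.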

    preg_match_all('/<span class="test">[\/\(\)-:<>\w\s]+<\/span>/', $file_contents, $htmlContent);

Assume that $file_contents holds the HTML source code. After executing the above preg_match_all, $htmlContent[0] will hold the span tags having class="test". This ensures that we extract only those span tags whose opening tag is exactly <span class="test">.
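Real pages often put other attributes next to class, or list several class names in it. A slightly more tolerant pattern can be sketched as below. The function name extract_spans_by_class and the sample markup are hypothetical, and the pattern assumes the class attribute is written with double quotes.

```php
<?php
// Hypothetical helper: inner content of <span> tags whose class attribute contains $class.
function extract_spans_by_class(string $html, string $class): array
{
    $c = preg_quote($class, '#');
    // class="..." may sit anywhere in the opening tag and may list several class names;
    // the optional groups make sure $class is a whole name, not part of a longer one
    $pattern = '#<span\b[^>]*\bclass="(?:[^"]*\s)?' . $c . '(?:\s[^"]*)?"[^>]*>(.*?)</span>#is';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Made-up sample markup standing in for $file_contents
$sample = '<span>plain</span><span class="test">one</span>'
        . '<span id="a" class="big test">two</span><span class="testing">no</span>';
$hits = extract_spans_by_class($sample, 'test');
```

In this sample $hits holds one and two. The bare span is skipped because it has no class, and class="testing" is skipped because test is only part of that name.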

    preg_match('%<table[^>]*>.*?</table>%is', $file_contents, $htmlContent);

Assume that $file_contents holds the HTML source code. After executing the above preg_match, $htmlContent[0] will hold the table markup found in the HTML source code (preg_match stops at the first match). To extract only a table that has class="test", add that attribute to the opening tag in the pattern, as in the span example. Now that we have the table tag content, we will extract the data inside the td tags.

    preg_match_all('#<td[^>]*>(.*?)</td>#is', $htmlContent[0], $td_matches);

Here we pass the extracted table markup to preg_match_all, which reads all the data that resides inside the td tags. The cell contents end up in $td_matches[1].
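Putting the table and td steps together, we can sketch a function that turns a table into an array of rows. The name table_to_rows, the extra pass over tr tags, and the sample table are assumptions added for illustration; the original snippets collect all td cells in one flat list.

```php
<?php
// Hypothetical helper: turn the first <table> in $html into an array of rows of cell text.
function table_to_rows(string $html): array
{
    if (!preg_match('#<table\b[^>]*>(.*?)</table>#is', $html, $table)) {
        return [];  // no table found
    }
    // Split the table body into rows first, so cells stay grouped per row
    preg_match_all('#<tr\b[^>]*>(.*?)</tr>#is', $table[1], $tr_matches);

    $rows = [];
    foreach ($tr_matches[1] as $tr) {
        preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $tr, $td_matches);
        // Strip nested markup and surrounding whitespace from each cell
        $rows[] = array_map(fn($cell) => trim(strip_tags($cell)), $td_matches[1]);
    }
    return $rows;
}

// Made-up sample markup standing in for $file_contents
$sample = "<table class=\"test\">\n<tr><td>apple</td><td> 3 </td></tr>\n"
        . "<tr><td><b>pear</b></td><td>5</td></tr>\n</table>";
$rows = table_to_rows($sample);
```

For this sample $rows holds two rows, one with apple and 3 and one with pear and 5. Header cells written as th are not picked up by this sketch, and nested tables would confuse the lazy patterns.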

Ref: http://www.hiteshagrawal.com

 
