I’m my new venture, I need to fetch some webpages – public webpages -…
On a domain, the index page is parsed 100%, without any problem, but all the other pages weren’t returning me the HTML, in fact event CURL wasn’t returning me any error.
First it was protected against any fetch without a user agent defined.
After I’v work it out, I wasn’t getting any HTML source..
I’v decided to use file_get_contents to see if I got any error, since CURL wasn’t returning me any…
And I got it.
failed to open stream: Redirection limit reached, aborting
2015/07/29 14:28:22 [error] 25586#0: *6458641 FastCGI sent in stderr: "PHP message: PHP Warning: file_get_contents(http://www.domain.com/page/): failed to open stream: Redirection limit reached, aborting in /home/webroot/worker/testes.php on line 5" while reading response header from upstream, client: 84.91.69.69, server: www.gipsy.digitalwhores.net, request: "GET /worker/testes.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "gipsy.digitalwhores.net"
After a few searches I was able to solve it.
This is my PHP CURL.
function getPage ($url) { $useragent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'; $timeout= 120; $dir = dirname(__FILE__); $cookie_file = $dir . '/cookies/' . md5($_SERVER['REMOTE_ADDR']) . '.txt'; $ch = curl_init($url); curl_setopt($ch, CURLOPT_FAILONERROR, true); curl_setopt($ch, CURLOPT_HEADER, 0); curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true ); curl_setopt($ch, CURLOPT_ENCODING, "" ); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true ); curl_setopt($ch, CURLOPT_AUTOREFERER, true ); curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout ); curl_setopt($ch, CURLOPT_TIMEOUT, $timeout ); curl_setopt($ch, CURLOPT_MAXREDIRS, 10 ); curl_setopt($ch, CURLOPT_USERAGENT, $useragent); curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/'); $content = curl_exec($ch); if(curl_errno($ch)) { echo 'error:' . curl_error($ch); } else { return $content; } curl_close($ch); }
The solution was in allowing cookies.