PHP Curl 405 Not Allowed
Final Update: It appears that the target website had blocked DO IPs, which caused the problems I had been trying to resolve for days. I spun up an EC2 instance and managed to get the code working, together with caching to reduce the hits on the website while still allowing my users to share links to it.
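The caching mentioned in the final update could look roughly like the sketch below. This is not the code that was actually deployed: the function name `cached_fetch`, the cache directory, and the one-hour lifetime are all assumptions for illustration, and the fetcher is passed in as a callable so that any cURL wrapper can be plugged in.

```php
<?php
/**
 * Hypothetical file-based cache around an arbitrary fetcher.
 * $fetch is any callable that takes a URL and returns a string.
 */
function cached_fetch(string $url, callable $fetch, string $cacheDir, int $ttl = 3600): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0700, true);
    }
    $file = $cacheDir . '/' . sha1($url) . '.cache';

    // Serve the cached copy while it is younger than $ttl seconds.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return (string) file_get_contents($file);
    }

    $data = (string) $fetch($url);
    // Only cache non-empty responses so that a failed fetch is retried next time.
    if ($data !== '') {
        file_put_contents($file, $data, LOCK_EX);
    }
    return $data;
}
```

With something like this in place, repeated shares of the same URL within the lifetime hit the local file instead of the remote site.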
UPDATE: I managed to get the HTML by turning CURLOPT_FAILONERROR off. However, besides returning a 405 error, the website is also not setting some cookies that are required for its content to load.
curl_setopt($ch, CURLOPT_FAILONERROR, false);
I am using the code below (AJAX to PHP) to extract og: meta tags from websites. However, there are one or two specific sites that return an error and for which no information is retrieved, with the errors shown below. The code works without problems for most websites.
Warning: DOMDocument::loadHTML(): Empty string supplied as input in /my/home/path/getUrlMeta.php on line 58
From curl_error in my error_log:
The requested URL returned error: 405 Not Allowed
and
Failed to connect to www.something.com port 443: Connection refused
I have no problem getting the website's HTML when I use curl from my server console, and no problem extracting the required information for most websites using the code below.
function file_get_contents_curl($url)
{
    $ch = curl_init();
    $header[0] = "Accept: text/html, text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: no-cache";
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    //curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 ");
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    //The following 2 set up lines work with sites like www.nytimes.com
    //Update: Added option for cookie jar since some websites recommended it. cookies.txt is set to permission 777. Still doesn't work.
    $cookiefile = '/home/my/folder/cookies.txt';
    curl_setopt($ch, CURLOPT_COOKIESESSION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile);
    $data = curl_exec($ch);
    if (curl_error($ch))
    {
        error_log(curl_error($ch));
    }
    curl_close($ch);
    return $data;
}
$html = file_get_contents_curl($url);
libxml_use_internal_errors(true); // Yeah if you are so worried about using @ with warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    // Strip the leading "og:" so that og:title becomes "title", and so on.
    $property = substr($meta->getAttribute('property'), 3);
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}
/* The code below picks the first image wider than 600px when og:image is empty. */
if (empty($rmetas['image'])) {
    //$src = $xpath->evaluate("string(//img/@src)");
    //echo "src=" . $src . "\n";
    $query = '//*/img';
    $srcs = $xpath->query($query);
    foreach ($srcs as $src) {
        $property = $src->getAttribute('src');
        // Only consider absolute URLs ending in jpg, png or (j)peg.
        if (substr($property, 0, 4) == 'http' && in_array(substr($property, -3), array('jpg', 'png', 'peg'), true)) {
            if (list($width, $height) = getimagesize($property)) {
                if ($width > 600) {
                    $rmetas['image'] = $property;
                    break; // stop at the first sufficiently wide image
                }
            }
        }
    }
}
echo json_encode($rmetas);
die();
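The "Empty string supplied as input" warning appears because the fetch returned nothing and the result was passed straight to loadHTML(). A minimal sketch of the same og: extraction with a guard for that case follows; the function name `extract_og_meta` is hypothetical and not part of the code above.

```php
<?php
/**
 * Hypothetical helper: extract og:* meta tags from an HTML string.
 * Returns an empty array instead of raising a warning when the input is empty.
 */
function extract_og_meta(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing was fetched, so skip DOMDocument::loadHTML()
    }
    $previous = libxml_use_internal_errors(true); // collect markup warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $metas = [];
    foreach ($xpath->query("//meta[starts-with(@property, 'og:')]") as $meta) {
        // Drop the "og:" prefix so the keys match the ones used above.
        $metas[substr($meta->getAttribute('property'), 3)] = $meta->getAttribute('content');
    }
    return $metas;
}
```

The caller can then treat an empty array as "nothing found" and report that to the AJAX side, rather than emitting a PHP warning into the JSON response.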
UPDATE: My mistake: the website is not HTTPS-enabled, so I am still left with the 405 Not Allowed error.
curl info:
{
"url": "http://www.example.com/",
"content_type": null,
"http_code": 405,
"header_size": 0,
"request_size": 458,
"filetime": -1,
"ssl_verify_result": 0,
"redirect_count": 0,
"total_time": 0.326782,
"namelookup_time": 0.004364,
"connect_time": 0.007725,
"pretransfer_time": 0.007867,
"size_upload": 0,
"size_download": 0,
"speed_download": 0,
"speed_upload": 0,
"download_content_length": -1,
"upload_content_length": -1,
"starttransfer_time": 0.326634,
"redirect_time": 0,
"redirect_url": "",
"primary_ip": "SOME IP",
"certinfo": [],
"primary_port": 80,
"local_ip": "SOME IP",
"local_port": 52966
}
Update: If I run curl -i from the console I get the following response: a 405 error, but it is followed by all the HTML that I need.
Home> curl -i http://www.domain.com
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Wed, 22 Feb 2017 17:57:03 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Vary: Accept-Encoding
Vary: Accept-Encoding
Set-Cookie: PHPSESSID2=ko67tfga36gpvrkk0rtqga4g94; path=/; domain=.domain.com
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: __PAGE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com
Set-Cookie: __PAGE_SITE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com
X-Repository: legacy
X-App-Server: production-web23:8018
X-App-Server: distil2-kvm:80
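To check from PHP whether the 405 response still carries the cookies seen above, the raw header block can be inspected. The sketch below only parses a header string; the function name `summarize_headers` is hypothetical, and it assumes the headers were captured separately (for example by enabling CURLOPT_HEADER and splitting the output at the offset given by CURLINFO_HEADER_SIZE).

```php
<?php
/**
 * Hypothetical helper: reduce a raw header block (as printed by `curl -i`)
 * to the status code and the names of the cookies it sets.
 */
function summarize_headers(string $rawHeaders): array
{
    $lines = preg_split('/\r\n|\n|\r/', trim($rawHeaders));
    $status = 0;
    $cookies = [];
    foreach ($lines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // With redirects there may be several status lines; the last one wins.
            $status = (int) $m[1];
        } elseif (preg_match('/^Set-Cookie:\s*([^=;\s]+)=/i', $line, $m)) {
            $cookies[] = $m[1];
        }
    }
    return ['status' => $status, 'cookies' => $cookies];
}
```

Run against the response shown above, this would report status 405 together with cookie names such as PHPSESSID2, which makes it easier to compare what the console and the PHP script each receive.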
If it only fails on some sites, it's a server-side problem. There's nothing we can do about it. – miken32
@miken32 but the URL is accessible from a web browser. Doesn't curl emulate a browser? It's a public website that doesn't require a login, has no SSL, etc. –
Remove 'CURLOPT_FAILONERROR' and you will get the full content for a 405, just like the command-line equivalent you show. –
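A minimal sketch of that suggestion, assuming the goal is to keep the body of a 4xx/5xx reply and decide afterwards whether to parse it. Both function names are hypothetical, and the rule "any non-empty body is worth parsing" is an assumption rather than something the thread establishes.

```php
<?php
/**
 * Hypothetical helper: fetch a URL and keep the body even for 4xx/5xx replies.
 * Returns ['status' => int, 'body' => string]; status 0 means a transport error.
 */
function fetch_keep_error_body(string $url, int $timeout = 30): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_FAILONERROR    => false, // the default: do not discard 4xx/5xx bodies
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0',
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        error_log(curl_error($ch)); // e.g. connection refused or timeout
        $body = '';
    }
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ['status' => $status, 'body' => $body];
}

/** Assumption: a response is worth parsing if it arrived and has a non-empty body. */
function has_parsable_body(array $response): bool
{
    return $response['status'] > 0 && trim($response['body']) !== '';
}
```

Logging the status alongside the URL is still useful here, since a 405 that comes with a full page usually points to bot filtering on the remote side rather than a bug in the request.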