PHP data scraping: the three main stages of PHP data scraping

An approach to scraping large volumes of data with PHP

1. I suggest separating reading and writing data from downloading images, and handling each in its own process.

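For example, use get-data.php to fetch the data and get-image.php to download the images.

2. For multiple processes, PHP can simply use pcntl_fork(), which lets you run several child processes concurrently.

However, I don't recommend fork. I suggest installing Gearman and running your scripts as Gearman workers instead. For N-way concurrency you just start N workers. The code stays simple, and you don't have to think about threads or processes in your code at all.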

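3. Putting this together, the solution looks like this:

(1) Install Gearman (the gearmand job server plus the PHP gearman extension).

(2) Write get-data.php and schedule it in crontab to run every 5 minutes. It only reads data, then pushes the rows it has read, one at a time, onto a Gearman queue.

Then write a data-processing script to act as a worker, for example process-data.php, which stays resident in memory. As a worker it pulls rows off the Gearman queue one at a time, compares them with the existing data in your database, and applies your business logic. If you want 10-way concurrency, just start 10 copies of process-data.php. After processing, if an image URL has changed and the image needs to be downloaded, push the image URL onto a second Gearman queue.

(3) Write download-data.php as the image-download worker. Again, start 10 or 20 concurrent copies, whatever you like. This process also stays resident in memory, takes entries from the Gearman image queue, and downloads the images.

4. For a resident process, just write a while(true) infinite loop in the code and let it keep running. If you are worried about memory leaks and the like, have it exit after every 100,000 iterations. Then set up crontab to check every minute whether the process is running; flock -xn makes each entry a no-op while the previous run still holds its lock file. For example, this starts 3 process-data worker processes: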

* * * * * flock -xn /tmp/process-data.1.lock -c '/usr/bin/php /process-data.php > /dev/null 2>&1'

* * * * * flock -xn /tmp/process-data.2.lock -c '/usr/bin/php /process-data.php > /dev/null 2>&1'

* * * * * flock -xn /tmp/process-data.3.lock -c '/usr/bin/php /process-data.php > /dev/null 2>&1'

I hope that makes sense.
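The queue-and-worker pipeline described in this answer could be sketched roughly as follows. This is a minimal sketch, assuming the PECL gearman extension is installed and gearmand listens on 127.0.0.1:4730. The queue name process_data and the helper names run_bounded_loop, enqueue_rows and start_worker are made up for illustration and do not come from the original answer.

```php
<?php
// Sketch only: function names and the queue name are illustrative assumptions.

/**
 * Call $workOnce repeatedly and stop after $maxIterations successful calls,
 * so that a supervisor (such as the crontab + flock entries) can start a
 * fresh process. Returns the number of successful iterations.
 */
function run_bounded_loop(callable $workOnce, int $maxIterations = 100000): int
{
    $done = 0;
    while ($done < $maxIterations) {
        // Stop early if the callback reports failure.
        if ($workOnce() === false) {
            break;
        }
        $done++;
    }
    return $done;
}

/** Producer side (get-data.php): push each row onto a Gearman queue. */
function enqueue_rows(array $rows, string $queue = 'process_data'): void
{
    $client = new GearmanClient();          // needs the PECL gearman extension
    $client->addServer('127.0.0.1', 4730);  // assumed local gearmand, default port
    foreach ($rows as $row) {
        $client->doBackground($queue, json_encode($row));
    }
}

/** Worker side (process-data.php): handle jobs until the iteration cap is hit. */
function start_worker(callable $handleRow, string $queue = 'process_data'): int
{
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction($queue, function (GearmanJob $job) use ($handleRow) {
        $handleRow(json_decode($job->workload(), true));
    });
    // work() blocks until one job has been processed.
    return run_bounded_loop(fn() => $worker->work());
}
```

Under these assumptions, get-data.php would call enqueue_rows() with the rows it has just read, and each copy of process-data.php would call start_worker() with a callback containing the business logic. When the cap is reached the script exits and the next crontab run restarts it.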

Scraping Baidu Zhidao data with PHP

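The problem isn't actually hard, and you can write it yourself. Here are a few pointers:

1. On Baidu Zhidao, search for linux and a result list appears. Copy the contents of the browser's address bar.

Then go to the next page and copy the address bar again. Compare the two: the part that differs is the value i you need to loop over for pagination.

Of course, this is the crude way to do it.

2. Use PHP's file() or file_get_contents() function to fetch the content at the URL.

3. Use PHP regular expressions to extract the 3 fields you need.

4. Write them to the database.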
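Steps 2 and 3 might look something like the sketch below. The original answer does not say which three fields are wanted or what the page markup looks like, so the fields (url, title, summary), the HTML pattern and the function names here are all assumptions for illustration; a real pattern has to be written against the actual page source.

```php
<?php
// Sketch only: the markup pattern and field names are assumptions.

/** Fetch a page with file_get_contents(); returns null when the request fails. */
function fetch_html(string $url): ?string
{
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}

/** Extract url, title and summary from each (hypothetical) result entry. */
function extract_entries(string $html): array
{
    $pattern = '~<a\s+class="title"\s+href="([^"]+)"[^>]*>(.*?)</a>\s*'
             . '<p\s+class="summary">(.*?)</p>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $entries = [];
    foreach ($matches as $m) {
        $entries[] = [
            'url'     => $m[1],
            'title'   => trim(strip_tags($m[2])),
            'summary' => trim(strip_tags($m[3])),
        ];
    }
    return $entries;
}

// Demonstration on an inline sample instead of a live page.
$sample = '<a class="title" href="/question/1.html">Install <em>linux</em></a>'
        . '<p class="summary">Use a live USB stick.</p>';
print_r(extract_entries($sample));
```

Each entry returned by extract_entries() is an associative array, which maps directly onto the columns of a database row for step 4.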

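Note that Baidu Zhidao may have anti-scraping measures in place, so you may get blocked after fetching only a few pages.

I suggest scraping no more than about 10 pages.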
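The pagination loop with a page cap could be sketched like this. The base URL, the pn parameter name and the step of 10 are assumptions for illustration only; the real parameter and step are whatever differs between two consecutive result-page URLs, as described in point 1.

```php
<?php
// Sketch only: the base URL, parameter name and step are assumptions.

/** Build the list of result-page URLs, never more than $maxPages of them. */
function build_page_urls(
    string $baseUrl,
    string $param,
    int $pages,
    int $step = 10,
    int $maxPages = 10
): array {
    $pages = max(0, min($pages, $maxPages));
    // Append with "?" or "&" depending on whether a query string already exists.
    $separator = strpos($baseUrl, '?') === false ? '?' : '&';
    $urls = [];
    for ($i = 0; $i < $pages; $i++) {
        $urls[] = $baseUrl . $separator . $param . '=' . ($i * $step);
    }
    return $urls;
}

$urls = build_page_urls('https://example.com/search?word=linux', 'pn', 25);
foreach ($urls as $url) {
    echo $url, PHP_EOL;
    // A real crawler would fetch $url here and pause between requests,
    // for example with sleep(2), to lower the chance of being blocked.
}
```

Asking for 25 pages still yields only 10 URLs here, which matches the advice to keep the crawl small.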

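It really isn't hard, and you can certainly write it. There should also be plenty of ready-made scraping tools online. Look around, then analyze the data they capture and write it to the database.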

Commonly used functions in PHP scraping programs

The code is as follows:

// Get the URL of the current script (path plus query string)
function get_php_url()
{
    if (!empty($_SERVER["REQUEST_URI"]))
    {
        $scriptName = $_SERVER["REQUEST_URI"];
        $nowurl = $scriptName;
    }
    else
    {
        // REQUEST_URI is not set on every server; rebuild it from PHP_SELF
        $scriptName = $_SERVER["PHP_SELF"];
        if (empty($_SERVER["QUERY_STRING"]))
            $nowurl = $scriptName;
        else
            $nowurl = $scriptName."?".$_SERVER["QUERY_STRING"];
    }
    return $nowurl;
}

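// Convert full-width digits to half-width digits
function GetAlabNum($fnum)
{
    $nums = array("０","１","２","３","４","５","６","７","８","９");
    $fnums = "0123456789";
    for ($i = 0; $i <= 9; $i++)
        $fnum = str_replace($nums[$i], $fnums[$i], $fnum);
    // Drop everything except digits and dots, and strip leading zeros.
    // preg_replace() is used because ereg_replace() was removed in PHP 7.
    $fnum = preg_replace("/[^0-9.]|^0+/", "", $fnum);
    if ($fnum == "")
        $fnum = 0;
    return $fnum;
}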

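// Convert plain text to HTML: escape angle brackets and turn line breaks into <br/>
function Text2Html($txt)
{
    // Two ordinary spaces become one full-width space
    $txt = str_replace("  ", "　", $txt);
    $txt = str_replace("<", "&lt;", $txt);
    $txt = str_replace(">", "&gt;", $txt);
    // One <br/> per run of line breaks
    $txt = preg_replace("/[\r\n]+/", "<br/>\r\n", $txt);
    return $txt;
}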

// Neutralize HTML tags by escaping angle brackets
function ClearHtml($str)
{
    $str = str_replace('<', '&lt;', $str);
    $str = str_replace('>', '&gt;', $str);
    return $str;
}

// Convert root-relative href and src paths to absolute URLs
function relative_to_absolute($content, $feed_url)
{
    // Capture the scheme, e.g. "http://"
    preg_match('/(http|https|ftp):\/\//', $feed_url, $protocol);
    // Strip the scheme and everything after the host name
    $server_url = preg_replace("/(http|https|ftp|news):\/\//", "", $feed_url);
    $server_url = preg_replace("/\/.*/", "", $server_url);
    if ($server_url == '')
    {
        return $content;
    }
    if (isset($protocol[0]))
    {
        $new_content = preg_replace('/href="\//', 'href="'.$protocol[0].$server_url.'/', $content);
        $new_content = preg_replace('/src="\//', 'src="'.$protocol[0].$server_url.'/', $new_content);
    }
    else
    {
        $new_content = $content;
    }
    return $new_content;
}

// Get all links: anchor texts in 'name', href values in 'url'
function get_all_url($code){
    preg_match_all('/<a\s+href=["\']?([^>"\' ]+)["\']?\s*[^>]*>([^>]+)<\/a>/i', $code, $arr);
    return array('name' => $arr[2], 'url' => $arr[1]);
}

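// Get the content between the given start and end markers
function get_tag_data($str, $start, $end)
{
    if ($start == '' || $end == '')
    {
        return;
    }
    $str = explode($start, $str);
    // Start marker not found: nothing to extract
    if (!isset($str[1]))
    {
        return;
    }
    $str = explode($end, $str[1]);
    return $str[0];
}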

// Convert each row of an HTML table into a CSV-formatted string; returns an array of rows
function get_tr_array($table)
{
    $table = preg_replace("'<td[^>]*?>'si", '"', $table);
    $table = str_replace("</td>", '",', $table);
    $table = str_replace("</tr>", "{tr}", $table);
    // Remove the remaining HTML tags
    $table = preg_replace("'<[\/\!]*?[^<>]*?>'si", "", $table);
    // Remove whitespace
    $table = preg_replace("'([\r\n])[\s]+'", "", $table);
    $table = str_replace("&nbsp;", "", $table);
    $table = str_replace(" ", "", $table);
    $table = explode(",{tr}", $table);
    array_pop($table);
    return $table;
}

// Convert every row and cell of an HTML table into a nested array (for scraping table data)
function get_td_array($table)
{
    $table = preg_replace("'<table[^>]*?>'si", "", $table);
    $table = preg_replace("'<tr[^>]*?>'si", "", $table);
    $table = preg_replace("'<td[^>]*?>'si", "", $table);
    $table = str_replace("</tr>", "{tr}", $table);
    $table = str_replace("</td>", "{td}", $table);
    // Remove the remaining HTML tags
    $table = preg_replace("'<[\/\!]*?[^<>]*?>'si", "", $table);
    // Remove whitespace
    $table = preg_replace("'([\r\n])[\s]+'", "", $table);
    $table = str_replace("&nbsp;", "", $table);
    $table = str_replace(" ", "", $table);
    $table = explode('{tr}', $table);
    array_pop($table);
    // Initialize so that an empty table returns an empty array
    $td_array = array();
    foreach ($table as $tr)
    {
        $td = explode('{td}', $tr);
        array_pop($td);
        $td_array[] = $td;
    }
    return $td_array;
}

// Return all English words in a string; $distinct=true removes duplicates
function split_en_str($str, $distinct = true)
{
    preg_match_all('/([a-zA-Z]+)/', $str, $match);
    if ($distinct == true)
    {
        $match[1] = array_unique($match[1]);
    }
    sort($match[1]);
    return $match[1];
}
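As a point of comparison with the regex-based table helpers, the same job can be sketched with PHP's bundled DOM extension. This is a swapped-in technique, not part of the original listing; the function name table_to_rows is made up for illustration, and the sketch assumes the dom extension is enabled.

```php
<?php
// Sketch only: a DOM-based alternative, not the original regex approach.

/** Return every table row as an array of trimmed cell texts. */
function table_to_rows(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them,
    // since scraped markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog hints that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $node) {
            if ($node instanceof DOMElement
                && in_array($node->nodeName, ['td', 'th'], true)) {
                $cells[] = trim($node->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}

$sample = '<table><tr><td>a</td><td> b </td></tr><tr><td>c</td><td>d</td></tr></table>';
print_r(table_to_rows($sample));
```

A parser-based version tends to cope better with nested tags and attributes inside cells, at the cost of depending on an extension.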

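How to scrape website data with PHP

Broadly, there are a few steps:

1. Decide what to scrape.

2. Fetch the content of the remote target page (curl, file_get_contents).

3. Analyze the page's HTML source and use regular expressions to match the content you need (preg_match, preg_match_all). This is the most important step, and the matching rules differ from page to page.

4. Store the results in the database.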
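Steps 2 to 4 could be sketched as three small functions. This is only an illustration under stated assumptions: the h2 pattern, the headings table and its title column, and the function names are all hypothetical, and the storage step assumes a PDO connection such as SQLite or MySQL.

```php
<?php
// Sketch only: table name, column name and the regex are assumptions.

/** Step 2: download a page with cURL; returns null on failure. */
function curl_get(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => $timeout,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

/** Step 3: pull the text of every <h2> element out of the HTML. */
function extract_headings(string $html): array
{
    preg_match_all('~<h2[^>]*>(.*?)</h2>~is', $html, $m);
    return array_map(fn($h) => trim(strip_tags($h)), $m[1]);
}

/** Step 4: store the values with a prepared statement; returns rows inserted. */
function save_headings(PDO $pdo, array $headings): int
{
    $pdo->exec('CREATE TABLE IF NOT EXISTS headings (title TEXT NOT NULL)');
    $stmt = $pdo->prepare('INSERT INTO headings (title) VALUES (:title)');
    $count = 0;
    foreach ($headings as $title) {
        $stmt->execute([':title' => $title]);
        $count++;
    }
    return $count;
}

// Demonstration on an inline sample instead of a live page.
$html = '<h2 class="t">First post</h2><p>x</p><h2>Second <b>post</b></h2>';
print_r(extract_headings($html));
```

In a real run the three calls would be chained, for example save_headings($pdo, extract_headings(curl_get($url) ?? '')), with $pdo created from whatever DSN your database uses.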

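Can PHP read JSON data sent via POST?

Not through $_POST. A "JSON" request is one whose HTTP body is a JSON string, and PHP only populates $_POST for form-encoded bodies (application/x-www-form-urlencoded or multipart/form-data), so $_POST stays empty. The raw body can still be read from the php://input stream and decoded with json_decode().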
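Reading such a body might look like the sketch below. The function names parse_json_body and read_json_request are made up for illustration, and returning null for anything that is not a JSON object or array is just one possible design choice.

```php
<?php
// Sketch only: function names and error handling are illustrative choices.

/** Decode a raw JSON body; returns null unless it is a JSON object or array. */
function parse_json_body(string $raw): ?array
{
    $data = json_decode($raw, true);
    return is_array($data) ? $data : null;
}

/** Read the raw request body from php://input and decode it. */
function read_json_request(): ?array
{
    $raw = file_get_contents('php://input');
    return parse_json_body($raw === false ? '' : $raw);
}

// Demonstration with a literal string standing in for a request body.
$payload = parse_json_body('{"name":"linux","tags":[1,2]}');
var_dump($payload);
```

Inside a web request, calling read_json_request() would take the place of reading $_POST.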

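Scraping large amounts of data with PHP cURL

This needs help from JavaScript. Open an HTML page and have JavaScript request the PHP page via Ajax. Once the response for the first page confirms that processing has finished (Ajax can be forced to run synchronously), Ajax requests the second page. (Alternatively, depending on server load, you can submit several URLs at once and run several copies of the same page.)

The parameters can be generated by JavaScript and passed in the URL, and the PHP backend page fetches the target page based on that URL. Through PHP, Ajax then sets a marker in the database or elsewhere recording how far processing has got. Because the front-end HTML page can run for as long as it likes, this gets around PHP's memory limit and execution time limit.

This way you no longer burn a lot of resources on a single page running a for loop 500 times in one go. (Your 500-iteration for loop probably died because it fetched too much data and exceeded PHP's memory limit.)

As I recall, cURL also has a synchronous mode that waits for one fetch to finish before moving on (curl_exec() does block until the transfer completes). But all 500 iterations would still be handled within a single page request, so it would certainly run far longer than the default 30-second execution time limit.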
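On the PHP side, the idea of handling a small slice per request and recording how far processing has got could be sketched as follows. The checkpoint file, the batch size and the function names are assumptions for illustration; the answer itself only says the marker lives "in the database or elsewhere". A fake fetcher stands in for the real cURL call.

```php
<?php
// Sketch only: checkpoint storage, batch size and names are assumptions.

/**
 * Process at most $batchSize URLs starting at $offset and return the next
 * offset. Each request handles one small batch, so no single run has to
 * loop over every URL.
 */
function process_batch(array $urls, int $offset, int $batchSize, callable $fetch): int
{
    $slice = array_slice($urls, $offset, $batchSize);
    foreach ($slice as $url) {
        $fetch($url);
    }
    return $offset + count($slice);
}

/** Read the saved position; 0 when nothing has been saved yet. */
function load_checkpoint(string $file): int
{
    return is_file($file) ? (int) file_get_contents($file) : 0;
}

/** Persist the position for the next request. */
function save_checkpoint(string $file, int $offset): void
{
    file_put_contents($file, (string) $offset, LOCK_EX);
}

// Demonstration: two "requests" of two URLs each.
$urls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3'];
$file = tempnam(sys_get_temp_dir(), 'scrape');
$seen = [];
$fetch = function (string $url) use (&$seen) { $seen[] = $url; };

$next = process_batch($urls, load_checkpoint($file), 2, $fetch);
save_checkpoint($file, $next);
$next = process_batch($urls, load_checkpoint($file), 2, $fetch);
save_checkpoint($file, $next);

// The Ajax caller can use this response to decide whether to request again.
echo json_encode(['done' => $next >= count($urls), 'next' => $next]), PHP_EOL;
unlink($file);
```

Because each call does a bounded amount of work, memory use and run time per request stay small no matter how long the full URL list is.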

Source: http://weahome.cn/article/doceoih.html
