This article walks through how Nutch parses HTML documents, quoting the relevant code at each step.
Parsing HTML documents: the MapReduce job
1. Call from the main program
ParseSegment parseSegment = new ParseSegment(getConf());
if (!Fetcher.isParsing(job)) {
parseSegment.parse(segs[0]); // parse it, if needed
}
(1) Implementation of the isParsing method
public static boolean isParsing(Configuration conf) {
return conf.getBoolean("fetcher.parse", true);
}
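The check above can be tried out without Hadoop on the classpath. The following is only a sketch under stated assumptions: java.util.Properties stands in for Hadoop's Configuration, and the class FetcherFlagSketch and its helpers getBoolean and needsSeparateParse are hypothetical names, not Nutch code. It shows the same pattern: read fetcher.parse with a default of true, and run the separate parse step only when the fetcher did not already parse.

```java
import java.util.Properties;

// Hypothetical stand-in for the Fetcher.isParsing check. Properties replaces
// Hadoop's Configuration so the sketch runs with the JDK alone.
class FetcherFlagSketch {

    // Simplified imitation of conf.getBoolean(name, defaultValue):
    // a missing key yields the default value.
    static boolean getBoolean(Properties conf, String name, boolean defaultValue) {
        String value = conf.getProperty(name);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    // Same key and same default as the isParsing method quoted in the article.
    static boolean isParsing(Properties conf) {
        return getBoolean(conf, "fetcher.parse", true);
    }

    // A separate parse pass is needed only when the fetcher did not parse.
    static boolean needsSeparateParse(Properties conf) {
        return !isParsing(conf);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("default: " + needsSeparateParse(conf));
        conf.setProperty("fetcher.parse", "false");
        System.out.println("fetcher.parse=false: " + needsSeparateParse(conf));
    }
}
```

With the default in this snippet (true), the separate parse step is skipped; it runs only when fetcher.parse is explicitly set to false.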
(2) The segs[0] argument
Path[] segs = generator.generate(
crawlDb,
segments,
-1,
topN,
System.currentTimeMillis());
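How generatedSegments is built inside the generate method.
Several points here are still unclear, so they are set aside for now.
// read the subdirectories generated in the temp
// output and turn them into segments
// 1. Create the list that collects the segment paths
List<Path> generatedSegments = new ArrayList<Path>();
// 2. List the contents of tempDir
FileStatus[] status = fs.listStatus(tempDir); // read the fetchlist segments generated above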
try {
for (FileStatus stat : status) {
Path subfetchlist = stat.getPath();
if (!subfetchlist.getName().startsWith("fetchlist-"))
continue; // skip entries whose names do not start with "fetchlist-"
// start a new partition job for this segment
Path newSeg = partitionSegment(fs, segments, subfetchlist,
numLists);
// partition the segment, which produces a new directory
generatedSegments.add(newSeg);
}
} catch (Exception e) {
LOG.warn("Generator: exception while partitioning segments, exiting ...");
fs.delete(tempDir, true);
return null;
}
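The filtering step in the loop above can be isolated as a small example. This is a sketch only: plain strings stand in for Hadoop Path names, and the class FetchlistFilterSketch and the method selectFetchlists are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the "fetchlist-" prefix filter; directory
// names are plain strings here instead of Hadoop Path objects.
class FetchlistFilterSketch {

    // Keeps only the names that start with "fetchlist-", in their original order.
    static List<String> selectFetchlists(List<String> names) {
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (!name.startsWith("fetchlist-")) {
                continue; // not a fetchlist directory, skip it
            }
            selected.add(name);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("fetchlist-1", "_logs", "fetchlist-2");
        System.out.println(selectFetchlists(names));
    }
}
```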
2. Job configuration
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(ParseSegment.class);
job.setReducerClass(ParseSegment.class);
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(ParseOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ParseImpl.class);
JobClient.runJob(job);
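3. Input and output of the map and reduce tasks
Input and output of the map task
Input: WritableComparable / Content
Output: Text / ParseImpl
public void map(WritableComparable<?> key, Content content,
OutputCollector<Text, ParseImpl> output, Reporter reporter)
Input and output of the reduce task
Input: Text / Iterator<Writable>
Output: Text / Writable
public void reduce(Text key, Iterator<Writable> values,
OutputCollector<Text, Writable> output, Reporter reporter)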
4. The job's input class, SequenceFileInputFormat
protected FileStatus[] listStatus(JobConf job) throws IOException {
FileStatus[] files = super.listStatus(job);
for (int i = 0; i < files.length; i++) {
FileStatus file = files[i];
if (file.isDir()) { // it's a MapFile
Path dataFile = new Path(file.getPath(), MapFile.DATA_FILE_NAME);
FileSystem fs = file.getPath().getFileSystem(job);
// use the data file
files[i] = fs.getFileStatus(dataFile);
}
}
return files;
}
public RecordReader<K, V> getRecordReader(InputSplit split,
JobConf job, Reporter reporter)
throws IOException {
reporter.setStatus(split.toString());
return new SequenceFileRecordReader<K, V>(job, (FileSplit) split);
}
5. Implementation of map() and reduce()
(1) The map task
org.apache.nutch.parse.ParseSegment
public void map(WritableComparable<?> key, Content content,
OutputCollector<Text, ParseImpl> output, Reporter reporter)
throws IOException {
// convert on the fly from old UTF8 keys
if (key instanceof Text) {
newKey.set(key.toString());
key = newKey;
}
// 2. Read the fetch status
// Nutch.FETCH_STATUS_KEY -> _fst_
int status =
Integer.parseInt(content.getMetadata().get(Nutch.FETCH_STATUS_KEY));
// 3. If the fetch did not succeed, skip this record
if (status != CrawlDatum.STATUS_FETCH_SUCCESS) {
// content not fetched successfully, skip document
LOG.debug("Skipping " + key + " as content is not fetched successfully");
return;
}
// 4. If truncated documents are to be skipped and this document was truncated, skip this record as well
if (skipTruncated && isTruncated(content)) {
return;
}
ParseResult parseResult = null;
try {
// 5. Create a ParseUtil object, call its parse method, and get back a ParseResult
parseResult = new ParseUtil(getConf()).parse(content);
} catch (Exception e) {
LOG.warn("Error parsing: " + key + ": " +StringUtils.stringifyException(e));
return;
}
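// The code above does the parsing; the code below processes the parse results
// ----------------------------------------------------------------
// 6. Iterate over the parse results from the previous step
for (Entry<Text, Parse> entry : parseResult) {
// 7. Get the key and the value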
Text url = entry.getKey();
Parse parse = entry.getValue();
// 8. Get the parse status
ParseStatus parseStatus = parse.getData().getStatus();
long start = System.currentTimeMillis();
// 9. Increment the counter for this status code
reporter.incrCounter("ParserStatus", ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);
// 10. If parsing failed, replace parse with the empty Parse returned by getEmptyParse
if (!parseStatus.isSuccess()) {
LOG.warn("Error parsing: " + key + ": " + parseStatus);
parse = parseStatus.getEmptyParse(getConf());
}
// 11. Pass the segment name to the parse data
parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
getConf().get(Nutch.SEGMENT_NAME_KEY));
// 12. Compute the new signature
byte[] signature =
SignatureFactory.getSignature(getConf()).calculate(content, parse);
// 13. Store the signature as a hex string under Nutch.SIGNATURE_KEY
parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
StringUtil.toHexString(signature));
try {
scfilters.passScoreAfterParsing(url, content, parse);
} catch (ScoringFilterException e) {
if (LOG.isWarnEnabled()) {
LOG.warn("Error passing score: "+ url +": "+e.getMessage());
}
}
long end = System.currentTimeMillis();
LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
output.collect(url, new ParseImpl(new ParseText(parse.getText()),
parse.getData(), parse.isCanonical()));
}
}
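Characteristics of the ParseResult object
a. It implements the Iterable interface, so it can be iterated. Each element is an Entry whose key is a Text and whose value is a Parse, as follows:
public class ParseResult implements Iterable<Map.Entry<Text, Parse>>
The iterator method is as follows:
public Iterator<Entry<Text, Parse>> iterator() {
return parseMap.entrySet().iterator();
}
As you can see, it returns the iterator of the map's entry set.
b. It holds a HashMap that stores the parse results, and it also holds the current URL. The code is as follows:
private Map<Text, Parse> parseMap;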
private String originalUrl;
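The two properties of ParseResult described above (a map of results plus Iterable over its entries) can be reproduced in a few lines. This is a simplified sketch, not the Nutch class: String replaces both Text and Parse, and the class name ParseResultSketch and its put and size methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical simplification of ParseResult: String stands in for Text and Parse.
class ParseResultSketch implements Iterable<Map.Entry<String, String>> {

    // URL -> parse output, mirroring the parseMap field.
    private final Map<String, String> parseMap = new HashMap<>();
    // The URL the content was originally fetched from.
    private final String originalUrl;

    ParseResultSketch(String originalUrl) {
        this.originalUrl = originalUrl;
    }

    void put(String url, String parsedText) {
        parseMap.put(url, parsedText);
    }

    String getOriginalUrl() {
        return originalUrl;
    }

    int size() {
        return parseMap.size();
    }

    // Iteration is delegated to the entry set of the underlying map.
    @Override
    public Iterator<Map.Entry<String, String>> iterator() {
        return parseMap.entrySet().iterator();
    }

    public static void main(String[] args) {
        ParseResultSketch result = new ParseResultSketch("http://example.com/");
        result.put("http://example.com/", "page text");
        result.put("http://example.com/#item1", "item text");
        // Because the class is Iterable, it works directly in a for-each loop.
        for (Map.Entry<String, String> entry : result) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Implementing Iterable is what lets the map task write a for-each loop directly over the parse result. Since the backing map is a HashMap, the iteration order is not guaranteed.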
Notes on Parse and ParseImpl
Parse is an interface; its three methods are as follows:
/** The textual content of the page. This is indexed, searched, and used when generating snippets. */
String getText();
/** Other data extracted from the page. */
ParseData getData();
/** Indicates if the parse is coming from a url or a sub-url */
boolean isCanonical();
ParseImpl implements the Parse and Writable interfaces.
ParseImpl has three fields, shown below. isCanonical is passed in at construction time and defaults to true.
private ParseText text;
private ParseData data;
private boolean isCanonical;
Computing the digest used for deduplication
// 12. Compute the new signature
byte[] signature =
SignatureFactory.getSignature(getConf()).calculate(content, parse);
// 13. Store the signature as a hex string under Nutch.SIGNATURE_KEY
parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
StringUtil.toHexString(signature));
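The getSignature method of the SignatureFactory class
This method checks whether the ObjectCache already holds a Signature implementation. If not, it creates one through reflection, caches it, and returns it.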
/** Return the default Signature implementation. */
public static Signature getSignature(Configuration conf) {
String clazz = conf.get("db.signature.class", MD5Signature.class.getName());
ObjectCache objectCache = ObjectCache.get(conf);
Signature impl = (Signature)objectCache.getObject(clazz);
if (impl == null) {
try {
if (LOG.isInfoEnabled()) {
LOG.info("Using Signature impl: " + clazz);
}
Class<?> implClass = Class.forName(clazz);
impl = (Signature)implClass.newInstance();
impl.setConf(conf);
objectCache.setObject(clazz, impl);
} catch (Exception e) {
throw new RuntimeException("Couldn't create " + clazz, e);
}
}
return impl;
}
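The lookup-then-create pattern in getSignature can be shown in isolation. The sketch below is an assumption-laden stand-in rather than Nutch code: a ConcurrentHashMap replaces ObjectCache, the class InstanceCacheSketch and the method getInstance are hypothetical names, and getDeclaredConstructor().newInstance() is used in place of Class.newInstance(), which is deprecated in newer JDKs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the ObjectCache lookup in SignatureFactory.
class InstanceCacheSketch {

    // Class name -> shared instance; plays the role of ObjectCache.
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    // Returns the cached instance for className, creating it via reflection
    // on first use. As in the original, the check and the creation are not
    // atomic, so two threads could briefly create separate instances.
    static Object getInstance(String className) {
        Object impl = CACHE.get(className);
        if (impl == null) {
            try {
                Class<?> implClass = Class.forName(className);
                impl = implClass.getDeclaredConstructor().newInstance();
                CACHE.put(className, impl);
            } catch (Exception e) {
                throw new RuntimeException("Couldn't create " + className, e);
            }
        }
        return impl;
    }

    public static void main(String[] args) {
        Object first = getInstance("java.util.ArrayList");
        Object second = getInstance("java.util.ArrayList");
        // The second call is served from the cache, so both references match.
        System.out.println(first == second);
    }
}
```

Caching by class name means the reflection cost is paid once per configuration rather than once per parsed document.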
***** Important
Computing the page signature ultimately calls the calculate method of Signature. Below is the code of MD5Signature, an implementation of Signature.
/**
* Default implementation of a page signature.
* It calculates an MD5 hash of the raw binary content of a page.
* In case there is no content,
* it calculates a hash from the page's URL.
*
* @author Andrzej Bialecki <ab@getopt.org>
*/
public class MD5Signature extends Signature {
public byte[] calculate(Content content, Parse parse) {
byte[] data = content.getContent();
if (data == null) data = content.getUrl().getBytes();
return MD5Hash.digest(data).getDigest();
}
}
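The behaviour of MD5Signature can be approximated with the JDK alone. The following is a sketch under assumptions, not the Nutch implementation: java.security.MessageDigest replaces Hadoop's MD5Hash, the URL is encoded as UTF-8 (the original calls getBytes() with the platform default charset), and the names Md5SignatureSketch and toHexString are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical JDK-only approximation of MD5Signature.calculate.
class Md5SignatureSketch {

    // Hashes the raw content; falls back to the URL when there is no content.
    static byte[] calculate(byte[] content, String url) {
        // Assumption: UTF-8 is used for the URL bytes.
        byte[] data = (content != null) ? content : url.getBytes(StandardCharsets.UTF_8);
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Renders the digest as lowercase hexadecimal, two characters per byte.
    static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHexString(calculate(page, "http://example.com/")));
        // No content: the signature is derived from the URL instead.
        System.out.println(toHexString(calculate(null, "http://example.com/")));
    }
}
```

Because the hash covers the raw bytes, two pages that differ by a single byte get different signatures, so this kind of signature detects only exact duplicates.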
(2) The reduce task
To be continued.
Multithreaded parsing
// The parser runs on a separate thread
private ParseResult runParser(Parser p, Content content) {
ParseCallable pc = new ParseCallable(p, content);
Future<ParseResult> task = executorService.submit(pc);
ParseResult res = null;
try {
res = task.get(maxParseTime, TimeUnit.SECONDS);
} catch (Exception e) {
LOG.warn("Error parsing " + content.getUrl() + " with " + p, e);
task.cancel(true);
} finally {
pc = null;
}
return res;
}
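The time-limited wait in runParser relies on Future.get with a timeout. The sketch below isolates that pattern with the JDK only; it is not the Nutch code. The class TimedTaskSketch and the method runWithTimeout are hypothetical names, the timeout is given in milliseconds instead of seconds, and a cached thread pool is assumed for the executor.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of running a task with an upper time bound.
class TimedTaskSketch {

    // Submits the work and waits at most maxMillis for its result.
    // Returns null on timeout or failure, which is how runParser behaves.
    static <T> T runWithTimeout(ExecutorService executor, Callable<T> work, long maxMillis) {
        Future<T> task = executor.submit(work);
        T result = null;
        try {
            result = task.get(maxMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // Timeout, interruption, or an exception thrown by the task:
            // cancel it, interrupting the worker thread if it is still running.
            task.cancel(true);
        }
        return result;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            String fast = runWithTimeout(executor, () -> "parsed", 1000);
            String slow = runWithTimeout(executor, () -> {
                Thread.sleep(5000); // simulates a parser that hangs
                return "late";
            }, 100);
            System.out.println(fast + " / " + slow);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that cancel(true) only interrupts the worker thread. A parser that ignores interruption would keep running in the background even though the caller has already moved on.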
That concludes this walkthrough of how Nutch parses HTML documents.