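Recently I noticed that Baidu's spider keeps crawling my site, yet hardly any pages get indexed. Baidu offers an active push method, so I wondered whether I could first check if the current page has already been indexed by Baidu and, if it has not, push it actively to increase the push volume. Baidu allows 3,000 active pushes per day, and it would be a shame to let them go to waste.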
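This post mainly explains how to check whether a page has been indexed by Baidu. I will write a follow-up on active push later; the two need to work together: check the page, and push it if it has not been indexed.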
Code implementation
package com.sammery.ops.tester.edge;

import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.URLEncoder;

public class CheckBaiduStatusUtil {

    public static void main(String[] args) throws IOException {
        String url = "http://www.sammery.com/11.html";
        System.out.println(checkBaiduStatus(url));
    }

    // Searches Baidu for the given page URL and returns the parsed result page.
    public static Document obtainDocument(String url) throws IOException {
        // URL-encode the page address so it is a valid value for the wd query parameter.
        String searchUrl = "http://www.baidu.com/s?wd=" + URLEncoder.encode(url, "UTF-8");
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet request = new HttpGet(searchUrl);
            try (CloseableHttpResponse response = httpClient.execute(request)) {
                int status = response.getStatusLine().getStatusCode();
                if (status == HttpStatus.SC_OK) {
                    return Jsoup.parse(EntityUtils.toString(response.getEntity(), "utf-8"));
                }
                throw new IOException("Unexpected HTTP status from Baidu: " + status);
            }
        }
    }

    // Returns true if the result-count hint is present and does not report zero results.
    public static boolean checkBaiduStatus(String url) throws IOException {
        Document doc = obtainDocument(url);
        // "hint_PIwZX" is the CSS class of the result-count hint on the result page.
        Elements elements = doc.getElementsByClass("hint_PIwZX");
        // "百度为您找到相关结果0个" is the hint text Baidu shows when there are no results.
        return !elements.isEmpty()
                && !elements.first().text().equalsIgnoreCase("百度为您找到相关结果0个");
    }
}
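To link the check with a later push step, the selection logic could look roughly like the sketch below. `PushPlanner`, `selectUrlsToPush` and the `remainingQuota` parameter are hypothetical names that are not part of the code above, and the actual push call is deliberately left out. The `isIndexed` predicate is assumed to wrap `CheckBaiduStatusUtil.checkBaiduStatus` (with its `IOException` handled by the caller); the demo uses a stub instead so that it runs without network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: picks the URLs that still need an active push.
class PushPlanner {

    // Returns the URLs for which isIndexed is false, in order, capped at remainingQuota.
    // Checking stops as soon as the quota is filled, so no extra search requests are made.
    static List<String> selectUrlsToPush(List<String> urls,
                                         Predicate<String> isIndexed,
                                         int remainingQuota) {
        List<String> toPush = new ArrayList<>();
        for (String url : urls) {
            if (toPush.size() >= remainingQuota) {
                break;
            }
            if (!isIndexed.test(url)) {
                toPush.add(url);
            }
        }
        return toPush;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.sammery.com/11.html",
                "http://www.sammery.com/12.html");
        // Stub predicate for the demo: pretends that only 11.html is indexed.
        Predicate<String> isIndexed = u -> u.endsWith("/11.html");
        // 3000 is the daily active-push limit mentioned at the top of this post.
        System.out.println(selectUrlsToPush(urls, isIndexed, 3000));
    }
}
```

Stopping once the quota is reached keeps the number of search requests down, which matters because each check is itself a request to Baidu.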
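Baidu has a verification mechanism: if requests are sent too quickly, they are redirected to an image verification (CAPTCHA) page, and the result count is only available after the verification is passed. I tried a few workarounds, but none of them worked, so treat checkBaiduStatus as just one suggested approach; I will look into how to handle this properly later.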
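Since the redirect is triggered by requests that come too quickly, one thing that might help is simply spacing the checks out. The sketch below is a minimal throttle under that assumption. `RequestThrottle` is a hypothetical helper, the interval is a guess rather than a documented Baidu limit, and this has not been verified to avoid the verification page.

```java
// Hypothetical helper: enforces a minimum interval between consecutive requests.
class RequestThrottle {
    private final long minIntervalMillis;
    private long lastRequestAt;
    private boolean used = false;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // How long the caller still has to wait at time nowMillis (0 if no wait is needed).
    synchronized long millisToWait(long nowMillis) {
        if (!used) {
            return 0L;
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    // Records that a request was sent at time nowMillis.
    synchronized void markRequest(long nowMillis) {
        lastRequestAt = nowMillis;
        used = true;
    }

    // Blocks until the minimum interval has passed, then records the new request time.
    synchronized void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }
}
```

A caller would create one shared instance, for example `new RequestThrottle(5000)`, and call `acquire()` right before each `checkBaiduStatus(url)`. The 5-second value is only a placeholder to tune.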