Chinese Search Engine Guide (中文搜索引擎指南网)

Title: Baiduspider FAQ

Author: sowang    Time: 2021-2-11 00:56
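1. What is Baiduspider?
Baiduspider is an automated program of the Baidu search engine. It visits web pages on the internet and builds an index database so that users can find your site's pages through Baidu search.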


2. What is Baiduspider's user-agent?
Each Baidu product uses a different user-agent:

Product            User-agent
Web search         Baiduspider
Mobile search      Baiduspider
Image search       Baiduspider-image
Video search       Baiduspider-video
News search        Baiduspider-news
Baidu Bookmarks    Baiduspider-favo
Baidu Union        Baiduspider-cpro
Business search    Baiduspider-ads


3. How can I tell the PC and mobile web search UAs apart?
Full PC search UA: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Full mobile search UA: Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

PC UA: the keyword Baiduspider/2.0 identifies the PC UA.
Mobile UA: the keywords android and mobile show that the crawl comes from the mobile side, and Baiduspider/2.0 identifies it as the Baidu crawler.
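The keyword rules above can be sketched in a few lines of Python. This is only an illustration: the function name classify_baidu_ua and its return labels are made up for this example, and a User-Agent header is trivial to forge, so a match here says nothing about whether the request really came from Baidu.

```python
def classify_baidu_ua(user_agent: str) -> str:
    """Classify a User-Agent string using the keyword rules quoted above.

    Returns "mobile", "pc", or "not-baidu". Only the string is inspected;
    the sender's identity is not verified in any way.
    """
    ua = user_agent.lower()
    if "baiduspider/2.0" not in ua:
        return "not-baidu"
    # Mobile crawls carry both the "android" and "mobile" keywords.
    if "android" in ua and "mobile" in ua:
        return "mobile"
    return "pc"


PC_UA = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
MOBILE_UA = (
    "Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 "
    "(KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 "
    "(compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
)

print(classify_baidu_ua(PC_UA))      # pc
print(classify_baidu_ua(MOBILE_UA))  # mobile
```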


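4. How much load does Baiduspider put on a website's server?

To achieve good retrieval results for target resources, Baiduspider needs to maintain a certain amount of crawling on your site. We try not to place an unreasonable burden on websites, and we adjust the crawl based on factors such as server capacity, site quality, and how often the site is updated. If you find anything unreasonable in Baiduspider's access behavior, you can report it to the feedback center.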

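5. Why does Baiduspider keep crawling my site?

Baiduspider continuously crawls pages on your site that are newly created or frequently updated. You can also check your access logs to see whether Baiduspider's visits look normal, in case someone is impersonating Baiduspider to crawl your site excessively. If you find Baiduspider crawling your site abnormally, please report it to us through the complaint platform and, where possible, include the access logs showing Baiduspider's visits so that we can follow up.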

6. How can I tell whether a crawl is impersonating Baiduspider?

We recommend using a reverse DNS lookup to check whether the crawling IP belongs to Baidu. The method differs by platform; the steps for Linux, Windows, and Mac OS are as follows:

6.1  On Linux, use the host ip command to reverse-resolve the IP and determine whether the crawl comes from Baiduspider. Baiduspider hostnames follow the pattern *.baidu.com or *.baidu.jp; any other hostname is an impersonator.
$ host 123.125.66.120
120.66.125.123.in-addr.arpa domain name pointer baiduspider-123-125-66-120.crawl.baidu.com.

$ host 119.63.195.254
254.195.63.119.in-addr.arpa domain name pointer BaiduMobaider-119-63-195-254.crawl.baidu.jp.

6.2  On Windows or IBM OS/2, use the nslookup ip command to reverse-resolve the IP. Open a command prompt and enter nslookup xxx.xxx.xxx.xxx (the IP address) to resolve it. Baiduspider hostnames follow the pattern *.baidu.com or *.baidu.jp; any other hostname is an impersonator.

6.3  On Mac OS, use the dig command to reverse-resolve the IP. Open a terminal and enter dig -x xxx.xxx.xxx.xxx (the IP address) to resolve it. Baiduspider hostnames follow the pattern *.baidu.com or *.baidu.jp; any other hostname is an impersonator.


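7. I don't want Baiduspider to visit my site. What should I do?

Baiduspider obeys the robots protocol. You can use a robots.txt file to block Baiduspider from your entire site or from some of its files. Note: blocking Baiduspider means your site's pages cannot be found in Baidu search or in any search engine whose search service is provided by Baidu. For how to write robots.txt, see our guide: How to write robots.txt

You can set different crawl rules for each product's user-agent. If you want to keep every Baidu product from including your site, simply disallow Baiduspider.

The following robots.txt blocks all crawling from Baidu:
User-agent: Baiduspider
Disallow: /

The following robots.txt blocks all crawling from Baidu but lets image search crawl the /image/ directory:
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/

Please note: pages crawled by Baiduspider-cpro are not added to the index. It only carries out operations agreed with customers, so it does not obey the robots protocol. If Baiduspider-cpro causes you trouble, please contact union1@baidu.com. Pages crawled by Baiduspider-ads are likewise not added to the index. It only carries out operations agreed with customers, so it does not obey the robots protocol. If Baiduspider-ads causes you trouble, please contact your customer service representative.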

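8. Why can my site still be found on Baidu after I added robots.txt?

Because updating the search engine's index database takes time. Although Baiduspider has stopped visiting your pages, index entries already built in Baidu's database may take several months to be cleared. Please also check that your robots configuration is correct.
If your need to be excluded is urgent, you can also submit a request through the complaint platform.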

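9. I want my content indexed by Baidu but without a cached snapshot. What should I do?

Baiduspider obeys the meta robots protocol. You can use a meta tag in the page so that Baidu indexes the page but does not show its cached snapshot in search results.
As with robots.txt updates, updating the index database takes time. Even after you have used the meta tag to stop Baidu from showing the snapshot, a page that is already in Baidu's index may take two to four weeks before the change takes effect online.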

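10. Can Baiduspider's crawling cause bandwidth congestion?

Normal crawling by Baiduspider does not congest your site's bandwidth. If congestion occurs, someone may be impersonating Baiduspider to crawl maliciously. If you find an agent named Baiduspider crawling and causing bandwidth congestion, please contact us as soon as possible. You can report it through the complaint platform; providing your site's access log for that period will help our analysis.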





Author: sowang    Time: 2021-2-11 00:57
FAQs of Baiduspider
1. What is Baiduspider?
Baiduspider is the Baidu search engine's crawler program. It visits pages on the internet and adds their information to the Baidu index, which enables users to find your site when they search.

2. What is Baiduspider’s user-agent?
Baidu uses different user-agents for different products:  
Product            User-agent
PC search          Baiduspider
Mobile search      Baiduspider
Image search       Baiduspider-image
Video search       Baiduspider-video
News search        Baiduspider-news
Baidu Bookmarks    Baiduspider-favo
Baidu Union        Baiduspider-cpro
Business search    Baiduspider-ads
Other search       Baiduspider

3. Will Baiduspider create additional load on my server?
To ensure that search results cover most of your pages, Baiduspider must keep crawling at a certain level. We try our best to avoid adding load to your servers, and we adjust the crawl frequency based on a combination of factors such as your server's capacity, your site's quality, and how often your site is updated. If you find any unreasonable access from Baiduspider, please inform us at http://webmaster.baidu.com/feedback/index

4. Why does Baiduspider crawl my site continuously?
To ensure that the latest information is presented, Baiduspider keeps crawling pages on your site that are new or frequently updated. Please check your access log to see whether the crawling from Baiduspider is reasonable.
Checking the log also helps you spot excessive crawling by spammers or other troublemakers who pretend to be Baiduspider. If you find any abnormal crawling, please inform us at http://webmaster.baidu.com/feedback/index and provide the log of Baiduspider's visits.
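As a rough illustration of such a log check, the Python sketch below counts requests per client IP whose User-Agent mentions Baiduspider. It assumes the common "combined" access-log layout (client IP first, User-Agent as the last quoted field); your server's format may differ. The function name and the sample lines are made up for this example, and the non-Baidu addresses come from reserved documentation ranges.

```python
import re
from collections import Counter

# Assumed layout: client IP is the first field, User-Agent the last quoted field.
LINE_RE = re.compile(r'^(?P<ip>\S+) .*"(?P<ua>[^"]*)"\s*$')

BAIDU_UA = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"


def count_baiduspider_hits(log_lines):
    """Count requests per client IP whose User-Agent mentions Baiduspider."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match and "baiduspider" in match.group("ua").lower():
            hits[match.group("ip")] += 1
    return hits


# Made-up sample lines, for illustration only.
sample_log = [
    f'123.125.66.120 - - [11/Feb/2021:00:56:01 +0800] "GET / HTTP/1.1" 200 512 "-" "{BAIDU_UA}"',
    f'203.0.113.9 - - [11/Feb/2021:00:56:02 +0800] "GET /a HTTP/1.1" 200 128 "-" "{BAIDU_UA}"',
    f'203.0.113.9 - - [11/Feb/2021:00:56:03 +0800] "GET /b HTTP/1.1" 200 128 "-" "{BAIDU_UA}"',
    '198.51.100.7 - - [11/Feb/2021:00:56:04 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(count_baiduspider_hits(sample_log).most_common())
# [('203.0.113.9', 2), ('123.125.66.120', 1)]
```

The count only shows which addresses claim to be Baiduspider. Whether a given address really belongs to Baidu still has to be confirmed separately, for example with a reverse DNS lookup.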

5. How can I know the crawling is from Baiduspider?
We recommend using a reverse DNS lookup to verify Baiduspider. The verification method differs between Linux, Windows, and Mac OS.
Instructions:
5.1  In Linux: run the "host IP" command. Examples:
$ host 123.125.66.120
120.66.125.123.in-addr.arpa domain name pointer Baiduspider-123-125-66-120.crawl.baidu.com.

$ host 119.63.195.254
254.195.63.119.in-addr.arpa domain name pointer BaiduMobaider-119-63-195-254.crawl.baidu.jp.
The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.

5.2  In Windows or IBM OS/2: run the "nslookup IP" command. Open the command processor and enter nslookup xxx.xxx.xxx.xxx (the IP address) to resolve the IP. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.

5.3  In Mac OS: run the "dig" command. Open the command processor and enter dig -x xxx.xxx.xxx.xxx (the IP address) to resolve the IP. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.
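The same check can be scripted. The Python sketch below is one possible way to do it, not an official tool: it reverse-resolves an IP with the standard library's socket.gethostbyaddr and applies the *.baidu.com / *.baidu.jp rule. The function names and the injectable resolve parameter are assumptions made for this example, and the canned resolver simply replays the two host results shown above so the demo runs without network access.

```python
import socket


def is_baidu_hostname(hostname: str) -> bool:
    """True if the hostname is under baidu.com or baidu.jp."""
    host = hostname.rstrip(".").lower()
    # The leading dot keeps look-alikes such as "evilbaidu.com" from matching.
    return host.endswith((".baidu.com", ".baidu.jp"))


def looks_like_baiduspider(ip: str, resolve=socket.gethostbyaddr) -> bool:
    """Reverse-resolve `ip` and apply the hostname rule.

    `resolve` must behave like socket.gethostbyaddr: it returns a tuple whose
    first item is the primary hostname and raises OSError on failure.
    """
    try:
        hostname = resolve(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure: treat as unverified
    return is_baidu_hostname(hostname)


def canned_resolve(ip):
    """Offline stand-in that replays the two host results quoted above."""
    table = {
        "123.125.66.120": "baiduspider-123-125-66-120.crawl.baidu.com",
        "119.63.195.254": "BaiduMobaider-119-63-195-254.crawl.baidu.jp",
    }
    if ip not in table:
        raise OSError("no PTR record")
    return (table[ip], [], [ip])


print(looks_like_baiduspider("123.125.66.120", resolve=canned_resolve))  # True
print(looks_like_baiduspider("203.0.113.9", resolve=canned_resolve))     # False
```

For a live check, call looks_like_baiduspider(ip) without the resolve argument so that the real DNS lookup is used.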

6. How can I prevent Baiduspider from crawling my site?
Baiduspider follows the robots.txt protocol. You can prevent Baiduspider from crawling your entire site or specific content by listing it in robots.txt. Please note that if you do this, the pages of your site will not be found in Baidu search results or in any other search results provided by Baidu. For details on setting up robots.txt, please see How to create a robots.txt

You can set different rules for different user-agents. (Please note that Baiduspider-video does not currently support these rules.) If you prefer to block all of Baidu's user-agents, you can simply block Baiduspider.

The following robots.txt rules block all crawling from Baidu:
User-agent: Baiduspider
Disallow: /
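
Before deploying a rule like this, it can be sanity-checked offline. The sketch below uses Python's standard urllib.robotparser as a stand-in parser; it shows how that library reads the two lines, not how Baidu's own crawler interprets them. The example.com URL and the name SomeOtherBot are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so nothing is fetched over the network.
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Baiduspider", "http://example.com/page.html"))   # False
print(parser.can_fetch("SomeOtherBot", "http://example.com/page.html"))  # True
```

Python's parser matches user-agent names by substring, so it also treats Baiduspider-image as covered by the Baiduspider group. Other parsers may match differently, so results for product-specific user-agents should be read with care.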

The following robots.txt rules block all crawling from Baidu except that Baiduspider-image may crawl the /image/ directory:
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/

Please note that pages crawled by Baiduspider-cpro are not added to the index. Baiduspider-cpro operates under agreements made with customers, so it does not follow the rules set in robots.txt. If Baiduspider-cpro causes you problems, please contact union1@baidu.com. Pages crawled by Baiduspider-ads are likewise not added to the index. Baiduspider-ads operates under agreements made with customers, so it does not follow the rules set in robots.txt. If Baiduspider-ads causes you problems, please contact your customer service representative.

7. I have added robots.txt to my site, so why does its content still appear in search results?
It takes time for Baidu to update its database. Baiduspider stops crawling your site once you have added robots.txt, but the index entries built earlier can take several months to be removed from the search engine's database. Please also make sure your robots.txt is written correctly.
If your request to remove your site from the search engine is urgent, please inform us at http://webmaster.baidu.com/feedback/index

8. How can I ask Baiduspider to index my pages but not show cached links in the search results?
Baiduspider follows the meta robots protocol. You can use a meta tag to ask Baiduspider to index your pages but not show their cached links in the search results.
As with updates to robots.txt, the change takes time to appear. Baidu stops showing the cached links after you have updated the meta tag, but it takes 2 to 4 weeks to refresh content that was already stored in Baidu's database.

9. Will crawling by Baidu lead to bandwidth congestion?
Generally speaking, crawling by Baiduspider does not lead to bandwidth congestion. If congestion happens, it is probably caused by unauthorized access that pretends to be Baiduspider. If you find an agent named Baiduspider that is congesting your network, please inform us at http://webmaster.baidu.com/feedback/index as soon as possible. Providing the log for that time frame will greatly help us investigate and analyze the problem.

10. How do I get my websites, including independent sites and blogs, indexed by Baidu?
Baidu indexes sites and pages that meet the requirements of the user search experience.
To help Baiduspider discover your site more quickly, you are also welcome to submit your website address at http://www.baidu.com/search/url_submit.htm
Submitting the homepage is enough; you do not need to submit individual content pages.
The value of a page is the only thing that determines whether Baidu indexes it. Indexing has nothing to do with commercial factors such as Baidu Promotion.

11. How can I tell whether my website has been indexed by Baidu? Is the result count from the site: syntax equal to the real number of indexed pages?
To check whether your site has been indexed by Baidu, use the site: syntax. Enter "site: (your domain name)" in the search box, for instance, http://www.baidu.com/s?wd=site%3Awww.baidu.com. Your site will appear in the results if it has been indexed.
The count returned by the site: syntax is only an estimate for reference.
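To build such a query URL for your own domain, the sketch below percent-encodes the site: query with Python's standard urllib.parse. The helper name baidu_site_query_url is made up for this example; the URL shape is copied from the example above.

```python
from urllib.parse import urlencode


def baidu_site_query_url(domain: str) -> str:
    """Build the search URL for a site: query on the given domain."""
    # urlencode percent-encodes the colon, giving "site%3A" as in the example.
    return "http://www.baidu.com/s?" + urlencode({"wd": f"site:{domain}"})


print(baidu_site_query_url("www.baidu.com"))
# http://www.baidu.com/s?wd=site%3Awww.baidu.com
```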

12. How do I prevent my website from being indexed by Baidu?
Baidu strictly complies with the robots.txt protocol. For detailed information, please visit http://www.robotstxt.org/.
You can use robots.txt to prevent all or some of your pages from being crawled and indexed by Baidu. For the specific method, please refer to How to Write Robots.txt.
If you set robots.txt to restrict crawling after Baidu has indexed your website, it usually takes 48 hours for the updated robots.txt to take effect, after which new pages will not be indexed. Note that content indexed by Baidu before the robots.txt restriction may take several months to disappear from search results.
If you urgently need to restrict crawling, you can report to http://webmaster.baidu.com/feedback/index and we will deal with it as soon as possible.
  
13. How do I report problems with Baiduspider?
If you have any problem with crawling, please contact http://webmaster.baidu.com/feedback/index for further help.
So that we can handle your feedback effectively and promptly, please report both the problem and your website's domain name. It is even better if you can provide your website's crawl log, which helps us find the cause and solve the problem in time.

14. I have set robots.txt to restrict crawling by Baidu, so why doesn't it take effect?
Baidu strictly complies with the robots.txt protocol. However, our DNS is updated periodically, so after you set robots.txt, Baidu needs some time before it stops crawling your site.
If you urgently need to restrict crawling, you can report to http://webmaster.baidu.com/feedback/index
Please also check that your robots.txt is correctly formatted.

15. Why are some pages, such as private pages without links or pages requiring access rights, also indexed by Baidu?
Baiduspider discovers pages by following the links between them.
Besides the internal links between a site's pages, there are also external links between different sites. Therefore, even if a page cannot be reached through internal links, it can still be indexed by a search engine if other websites link to it.
Baiduspider has the same access rights as ordinary users, so it cannot visit pages that ordinary users cannot access. There are 2 situations in which Baidu appears to index access-restricted pages:
o There was no access restriction when the spider crawled the content, and the access rights changed after the crawl.
o The content is protected by access rights, but because of security holes in the website, users can reach it through special paths. If those paths are published on the internet, the spider can follow them and crawl the content.
If you do not want private content to be indexed, you can restrict crawling with robots.txt. You are also welcome to report to http://webmaster.baidu.com/feedback/index for a solution.

16. Why does the number of indexed pages from my website tend to decrease?
Your server is unstable, so Baiduspider cannot crawl the pages when it checks for updates and changes, and those pages are deleted temporarily.
Your website does not meet the requirements of the user search experience.

17. Why does my page disappear from the search results of Baidu?
Baidu does not promise that every page can be found in search.
If your page has not been found by Baidu for a long time or suddenly disappears from the results, the possible reasons are as follows:
o Your website does not meet the requirements of the user search experience.
o The server hosting your website is unstable, so Baidu has deleted the page temporarily. The problem will be resolved once the server is stable.
o Some content on the page does not conform to laws and regulations.
o Other technical problems.
The following claims are incorrect and groundless:
o If a website discontinues paying after participating in Baidu Promotion, it will disappear from Ba









Welcome to 中文搜索引擎指南网 (http://sowang.com/bbs/) Powered by Discuz! X3.2