





Posted 2021-2-11 00:56:59
1. What is Baiduspider?

2. What is Baiduspider's user-agent?

Full UA for PC search: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Full UA for mobile search: Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

PC UA: the keyword Baiduspider/2.0 identifies the PC crawler.
Mobile UA: the keywords android and mobile indicate a crawl from the mobile crawler, and Baiduspider/2.0 confirms that it is the Baidu crawler.
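
The keyword rules above can be sketched in a few lines of Python. This is only an illustration: the function name classify_baidu_ua and its return labels are made up for this example, and a user-agent string is easy to forge, so a match here only tells you what the client claims to be.

```python
def classify_baidu_ua(user_agent: str) -> str:
    """Classify a UA string as 'mobile', 'pc', or 'not-baidu' by keyword."""
    ua = user_agent.lower()
    # Baiduspider/2.0 marks the request as (claiming to be) the Baidu crawler.
    if "baiduspider/2.0" not in ua:
        return "not-baidu"
    # android + mobile together indicate the mobile crawler.
    if "android" in ua and "mobile" in ua:
        return "mobile"
    return "pc"


PC_UA = (
    "Mozilla/5.0 (compatible; Baiduspider/2.0; "
    "+http://www.baidu.com/search/spider.html)"
)
MOBILE_UA = (
    "Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 "
    "(KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 "
    "(compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
)

if __name__ == "__main__":
    print(classify_baidu_ua(PC_UA))
    print(classify_baidu_ua(MOBILE_UA))
```

Because the check is purely textual, it is best combined with the reverse DNS verification described later in this FAQ.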



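5. Why does Baiduspider keep crawling my site?

Baiduspider keeps crawling pages on your site that are newly created or continuously updated. You can also check your access logs to confirm that the visits attributed to Baiduspider are normal, in case someone is impersonating Baiduspider to crawl your site heavily. If you find Baiduspider crawling your site abnormally, please report it through the complaint platform and, where possible, include the access logs of Baiduspider's visits to your site so that we can trace and handle the issue.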

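6. How can I tell whether a crawl is impersonating Baiduspider?


6.1  On Linux, you can run the host ip command to do a reverse lookup of the IP and determine whether the crawl comes from Baiduspider. Baiduspider hostnames follow the pattern *.baidu.com or *.baidu.jp; anything else is an impersonator.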
$ host domain name pointer

host domain name pointer

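6.2  On Windows or IBM OS/2, you can run the nslookup ip command to do a reverse lookup of the IP. Open a command prompt and enter nslookup xxx.xxx.xxx.xxx (the IP address) to resolve the IP and determine whether the crawl comes from Baiduspider. Baiduspider hostnames follow the pattern *.baidu.com or *.baidu.jp; anything else is an impersonator.

6.3  On macOS, you can run the dig command to do a reverse lookup of the IP. Open a terminal and enter dig -x xxx.xxx.xxx.xxx (the IP address) to resolve the IP and determine whether the crawl comes from Baiduspider. Baiduspider hostnames follow the pattern *.baidu.com or *.baidu.jp; anything else is an impersonator.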

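7. I don't want Baiduspider to visit my site. What should I do?

Baiduspider follows the robots protocol. You can use a robots.txt file to block Baiduspider from your entire site, or from some of the files on it. Note: blocking Baiduspider means the pages on your site can no longer be found in Baidu search or in any search engine whose search service is provided by Baidu. For how to write robots.txt, see our guide: How to write robots.txt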


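The following robots.txt blocks all crawling from Baidu:
User-agent: Baiduspider
Disallow: /

The following robots.txt blocks all crawling from Baidu but allows image search to crawl the /image/ directory:
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/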

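Please note: pages crawled by Baiduspider-cpro are not added to the index. The crawler only carries out operations agreed with customers, so it does not follow the robots protocol. If Baiduspider-cpro causes you trouble, please contact union1@baidu.com. Likewise, pages crawled by Baiduspider-ads are not added to the index. The crawler only carries out operations agreed with customers, so it does not follow the robots protocol. If Baiduspider-ads causes you trouble, please contact your customer service representative.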

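8. Why can my site still be found on Baidu after I added robots.txt?

If your need to be removed from the index is urgent, you can also submit a request through the complaint platform.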

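9. I want my content indexed by Baidu but not stored as a cached snapshot. What should I do?

Baiduspider follows the meta robots protocol. You can use a page's meta settings so that Baidu only indexes the page and does not show its cached snapshot in search results.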

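10. Does Baiduspider crawling cause bandwidth congestion?

Normal crawling by Baiduspider does not congest your site's bandwidth. If congestion occurs, it may be caused by someone impersonating Baiduspider and crawling maliciously. If you find an agent named Baiduspider crawling your site and congesting your bandwidth, please contact us as soon as possible. You can send the details to the complaint platform; providing your site's access logs for that period will help our analysis.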

Original poster | Posted 2021-2-11 00:57:23
FAQs of Baiduspider
1. What is Baiduspider?
Baiduspider is the Baidu search engine's crawler program. It visits pages on the internet and adds their information to the Baidu index, which lets users find your site when they search.

2. What is Baiduspider’s user-agent?
Baidu uses different user-agents for different products:
Product            User-agent
PC search          Baiduspider
Mobile search      Baiduspider
Image search       Baiduspider-image
Video search       Baiduspider-video
News search        Baiduspider-news
Baidu bookmark     Baiduspider-favo
Union baidu        Baiduspider-cpro
Business search    Baiduspider-ads
Other search       Baiduspider

3. Will Baiduspider put additional load on my servers?
To make sure search results cover most of your pages, Baiduspider must maintain a certain level of crawling. We try our best not to add load to your servers, and we adjust the crawl frequency based on a combination of factors such as your server's capacity, your site's quality, and how often your site is updated. If you notice any unreasonable access from Baiduspider, please inform us at http://webmaster.baidu.com/feedback/index

4. Why does Baiduspider crawl my site continuously?
To present the latest information, Baiduspider keeps crawling pages on your site that are new or frequently updated. Please check your access log to see whether the crawling from Baiduspider is reasonable.
Checking the log also helps you spot excessive crawling by spammers or other troublemakers who pretend to be Baiduspider. If you find any abnormal crawling, please inform us at http://webmaster.baidu.com/feedback/index and provide the log entries for Baiduspider.
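
As a rough sketch of the log check, the Python below counts requests per client IP for log lines whose user-agent mentions Baiduspider. It assumes the common "combined" access log layout, where the client IP comes first and the user-agent is the last quoted field; the sample lines and IP addresses are invented for illustration.

```python
import re
from collections import Counter

# Assumed layout: IP first, user-agent as the last double-quoted field.
LOG_LINE = re.compile(r'^(\S+) .*"([^"]*)"\s*$')


def baiduspider_hits_by_ip(lines):
    """Count requests per client IP whose user-agent mentions Baiduspider."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "baiduspider" in match.group(2).lower():
            hits[match.group(1)] += 1
    return hits


# Invented sample data using documentation-only IP ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [11/Feb/2021:00:56:59 +0800] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '192.0.2.10 - - [11/Feb/2021:00:57:03 +0800] "GET /a.html HTTP/1.1" 200 128 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '198.51.100.7 - - [11/Feb/2021:00:57:10 +0800] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

if __name__ == "__main__":
    for ip, count in baiduspider_hits_by_ip(SAMPLE_LOG).most_common():
        print(ip, count)
```

An IP with an unusually high count is a candidate for the reverse DNS check described in the next question.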

5. How can I tell whether the crawling really comes from Baiduspider?
We recommend verifying Baiduspider with a reverse DNS lookup. The command differs between Linux, Windows, and macOS.
5.1  On Linux: run the "host IP" command. Examples:
$ host domain name pointer

host domain name pointer
The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.

5.2  On Windows or IBM OS/2: run the "nslookup IP" command. Open a command prompt and enter nslookup xxx.xxx.xxx.xxx (the IP address) to resolve the IP. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.

5.3  On macOS: run the "dig" command. Open a terminal and enter dig -x xxx.xxx.xxx.xxx (the IP address) to resolve the IP. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.
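
The same reverse lookup can be scripted. The following Python is a minimal sketch using the standard library's socket.gethostbyaddr; the function names are made up for this example, and the hostnames in the usage lines are hypothetical, not real Baidu hosts.

```python
import socket

# Suffixes named in this FAQ for genuine Baiduspider hostnames.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")


def is_baidu_hostname(hostname: str) -> bool:
    """Return True if the hostname matches *.baidu.com or *.baidu.jp."""
    host = hostname.rstrip(".").lower()
    return host.endswith(BAIDU_SUFFIXES)


def reverse_lookup_is_baidu(ip: str) -> bool:
    """Reverse-resolve an IP and check the resulting hostname."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        # No PTR record or the lookup failed: treat as not verified.
        return False
    return is_baidu_hostname(hostname)


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    print(is_baidu_hostname("spider-host.crawl.baidu.com"))
    print(is_baidu_hostname("evilbaidu.com"))
```

The suffix check includes the leading dot on purpose, so a look-alike domain such as evilbaidu.com does not pass.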

6. How can I prevent Baiduspider from crawling my site?
Baiduspider follows the robots.txt protocol. You can prevent Baiduspider from crawling your entire site, or specific content on it, by listing it in robots.txt. Please note that if you do this, the pages of your site will not appear in Baidu search results or in any other search results provided by Baidu. For details on writing a robots.txt file, please see How to create a robots.txt

You can set different rules for different user-agents. (Please note that Baiduspider-video does not currently support these rules.) If you want to block all of Baidu's user-agents, you can simply block Baiduspider.

The following robots.txt rules block all crawling from Baidu.
User-agent: Baiduspider
Disallow: /
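
To sanity-check a rule set like the one above before publishing it, you can feed it to a robots.txt parser. The sketch below uses Python's standard urllib.robotparser; the helper name can_crawl and the example.com URL are placeholders, and Python's parser only approximates how Baidu itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# The "block all crawling from Baidu" rules from this FAQ.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /
"""


def can_crawl(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    print(can_crawl("Baiduspider", "http://example.com/index.html"))
    print(can_crawl("OtherBot", "http://example.com/index.html"))
```

Pass the crawler's name token (Baiduspider), not the full browser-style UA string, since the parser matches on the name token.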

The following robots.txt rules block crawling from Baidu in general but allow Baiduspider-image to crawl the /image/ directory.
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/

Please note that pages crawled by Baiduspider-cpro are not added to the index. Baiduspider-cpro works under agreements made with customers, so it does not follow the rules set in robots.txt. If Baiduspider-cpro causes you trouble, please contact union1@baidu.com.
Pages crawled by Baiduspider-ads are not added to the index either. Baiduspider-ads also works under agreements made with customers, so it does not follow the rules set in robots.txt. If Baiduspider-ads causes you trouble, please contact your customer service representative.

7. I've added robots.txt to my site, so why does my content still appear in the search results?
It takes time for Baidu to update its database. Baidu stops crawling your site once you have added robots.txt, but the index built earlier takes several months to be removed from the search engine's database. Please also make sure your robots.txt is written correctly.
If your request to remove your site from the search engine is urgent, please inform us at http://webmaster.baidu.com/feedback/index

8. How can I request Baiduspider to index my pages but not to show the cached links in the search results?
Baiduspider follows the meta robots protocol. You can use a meta tag to ask Baiduspider to index your pages but not show cached links in the search results.
This works the same way as robots.txt updates: Baidu stops showing the cached links after you update the meta robots settings, but it takes 2 to 4 weeks to refresh content already stored in Baidu's database.
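
The FAQ does not show the tag itself. As an assumption, the sketch below treats a meta tag named Baiduspider (or the generic robots) with a noarchive directive as the "index but do not cache" request, and uses Python's standard html.parser to check whether a page carries it. The class name, function name, and sample page are made up for this example.

```python
from html.parser import HTMLParser


class NoArchiveFinder(HTMLParser):
    """Record whether a page contains a noarchive meta robots directive."""

    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        # Assumed tag names: "Baiduspider" (Baidu only) or "robots" (all crawlers).
        if name in ("baiduspider", "robots") and "noarchive" in directives:
            self.noarchive = True


def requests_no_cache(html: str) -> bool:
    """Return True if the HTML asks crawlers not to show a cached copy."""
    finder = NoArchiveFinder()
    finder.feed(html)
    return finder.noarchive


# Hypothetical page used only to exercise the check.
PAGE = (
    '<html><head><meta name="Baiduspider" content="noarchive"></head>'
    "<body>hello</body></html>"
)

if __name__ == "__main__":
    print(requests_no_cache(PAGE))
```

Running a check like this against your published pages is a quick way to confirm the tag actually reached production before waiting for the cache to refresh.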

9. Will crawling by Baidu lead to bandwidth congestion?
Generally speaking, crawling by Baiduspider does not lead to bandwidth congestion. If congestion happens, it is probably caused by unauthorized access from someone pretending to be Baiduspider. If you find an agent named Baiduspider that is congesting your network, please inform us at http://webmaster.baidu.com/feedback/index as soon as possible. Providing the access log for that time frame will greatly help us investigate and analyze the problem.

10. How can I get my websites, including independent sites and blogs, indexed by Baidu?
Baidu indexes sites and pages that meet the requirements of user search experience.
To help Baiduspider discover your site more quickly, you are also welcome to submit your website address at http://www.baidu.com/search/url_submit.htm
Submitting the homepage is enough; you do not need to submit individual content pages.
The value of a page is the only thing that determines whether Baidu indexes it. Indexing has nothing to do with commercial factors such as Baidu Promotion.

11. How can I tell whether my website has been indexed by Baidu? Is the result of the site: syntax equal to the real number of indexed pages?
To check whether your site has been indexed by Baidu, use the site: syntax. Enter "site:(your domain name)" in the search box, for instance http://www.baidu.com/s?wd=site%3Awww.baidu.com. Your site appears in the results if it has been indexed.
The number returned by the site: syntax is only an estimate for reference.
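
The example URL above is just the search query with the colon percent-encoded. A small Python sketch, with a made-up helper name, shows how such a URL can be built for any domain:

```python
from urllib.parse import urlencode


def site_query_url(domain: str) -> str:
    """Build the Baidu search URL for a site: query on the given domain."""
    return "http://www.baidu.com/s?" + urlencode({"wd": f"site:{domain}"})


if __name__ == "__main__":
    print(site_query_url("www.baidu.com"))
    # http://www.baidu.com/s?wd=site%3Awww.baidu.com
```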

12. How can I prevent my website from being indexed by Baidu?
Baidu strictly complies with the robots.txt protocol. For detailed information, please visit http://www.robotstxt.org/.
You can use robots.txt to prevent all or some of your pages from being crawled and indexed by Baidu. For the specific method, please refer to How to Write Robots.txt.
If you set robots.txt to restrict crawling after Baidu has already indexed your website, it usually takes 48 hours for the updated robots.txt to take effect, after which new pages are no longer indexed. Note that content indexed before the robots.txt restriction may take several months to disappear from search results.
If you urgently need to restrict crawling, you can report to http://webmaster.baidu.com/feedback/index and we will deal with it as soon as possible.

13. How do I report a problem with Baiduspider?
If you have any problem with crawling, please contact http://webmaster.baidu.com/feedback/index for further help.
So that we can handle your feedback effectively and promptly, please make sure to report both the problem and the domain name of your website. It helps further if you can provide the crawl entries from your website log, which let us find the cause and solve the problem in time.

14. I have set robots.txt to restrict crawling by Baidu, so why doesn't it take effect?
Baidu strictly complies with the robots.txt protocol. However, our DNS updates periodically, so after you set robots.txt Baidu needs some time before it stops crawling your site.
If you urgently need to restrict crawling, you can report to http://webmaster.baidu.com/feedback/index
Please also check that your robots.txt is correctly formatted.

15. Why are some pages, such as private pages without links or pages requiring access rights, also indexed by Baidu?
Baiduspider discovers pages by following the links between them.
Besides the internal links between pages on one site, there are also external links between different sites. So even if a page cannot be reached through internal links, it can still be indexed by a search engine if another website links to it.
Baiduspider has the same access rights as any other user, so it cannot visit pages that ordinary users cannot access. There are two reasons why Baidu may appear to index access-restricted pages:
o There was no access restriction when the spider crawled the content, and the access rights changed after the crawl.
o The content is protected by access rights, but because of security holes in the website, users can reach it through some special paths. If those paths are published on the internet, the spider can follow them and crawl the content.
If you do not want private content to be indexed, you can restrict crawling with robots.txt. You are also welcome to report to http://webmaster.baidu.com/feedback/index for a solution.

16. Why does the number of indexed pages on my website tend to decrease?
Because of server instability, Baiduspider could not crawl some pages when it checked for updates and changes, so those pages were removed temporarily.
Your website does not meet the requirements of user search experience.

17. Why does my page disappear from the search results of Baidu?
Baidu does not promise that every page can be found in search.
If your page cannot be found on Baidu for a long time, or suddenly disappears from the results, the possible reasons are as follows:
o Your website does not meet the requirements of user search experience.
o The server hosting your website is unstable, so Baidu has removed the site temporarily. The problem resolves once the server is stable.
o Some content on the page does not comply with laws and regulations.
o Other technical problems.
The following claims are incorrect and groundless:
o If a website discontinues paying after participating in Baidu Promotion, it will disappear from Ba

