1. What is Baiduspider?
Baiduspider is the Baidu search engine's crawler program. It visits pages on the internet and adds their information to the Baidu index, which enables users to find your site when they perform a search.
2. What is Baiduspider’s user-agent?
Baidu uses different user-agents for different products:
3. Will Baiduspider create additional load on customer servers?
To ensure that search results cover most of your pages, Baiduspider must keep crawling at a certain level. We try our best to avoid increasing the load on your servers, and we adjust the crawl frequency based on a combination of factors, such as your server's capacity, your site’s quality, and how often your site is updated. If you notice any unreasonable access from Baiduspider, please inform us at http://webmaster.baidu.com/feedback/index (arab, thai, Português).
4. Why does Baiduspider crawl my site continuously?
To ensure that the latest information is presented, Baiduspider crawls new pages and frequently updated pages on your site. Please check your log to see whether the crawling from Baiduspider is reasonable.
Checking the log also helps you detect excessive crawling by spammers or other troublemakers who pretend to be Baiduspider. If you find any abnormal crawling, please inform us at http://webmaster.baidu.com/feedback/index (arab, thai, Português) and provide the log of the Baiduspider accesses.
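As a rough illustration of the log check described above, the following Python sketch counts requests per client IP whose user-agent string contains "Baiduspider". It assumes a combined-style access log in which the client IP is the first field and the user-agent is the last quoted field. The function name, the pattern, and the log layout are illustrative assumptions, so adapt them to your server's actual log format.

```python
import re
from collections import Counter

# Assumed layout: "<ip> ... "<user-agent>"" with the user-agent as the
# last quoted field on the line. This is an assumption, not a Baidu format.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')


def baiduspider_hits_by_ip(lines):
    """Count log lines per client IP whose user-agent mentions Baiduspider."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "baiduspider" in match.group("agent").lower():
            hits[match.group("ip")] += 1
    return hits
```

An IP with an unusually high count is a candidate for the reverse DNS check in question 5, since the user-agent string alone can be forged.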
5. How can I know the crawling is from Baiduspider?
We recommend using a reverse DNS lookup to verify Baiduspider. The verification method differs between Linux, Windows, and other operating systems.
Instructions:
5.1 In Linux: run the “host IP” command. Examples:
$ host 123.125.66.120
120.66.125.123.in-addr.arpa domain name pointer Baiduspider-123-125-66-120.crawl.baidu.com.
$ host 119.63.195.254
254.195.63.119.in-addr.arpa domain name pointer BaiduMobaider-119-63-195-254.crawl.baidu.jp.
A genuine Baiduspider hostname has the form *.baidu.com or *.baidu.jp. Any other hostname is fake.
5.2 In Windows or IBM OS/2: run the “nslookup IP” command. Open a command prompt and enter nslookup xxx.xxx.xxx.xxx (the IP address) to resolve the IP. A genuine Baiduspider hostname has the form *.baidu.com or *.baidu.jp. Any other hostname is fake.
5.3 In Mac OS: run the “dig” command. Open a terminal and enter dig -x xxx.xxx.xxx.xxx (the IP address) to resolve the IP. A genuine Baiduspider hostname has the form *.baidu.com or *.baidu.jp. Any other hostname is fake.
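The same reverse lookup can be scripted. The sketch below is a minimal Python version, assuming the standard library resolver is acceptable for your environment. The function names are hypothetical; the two hostname suffixes are the ones stated in this FAQ.

```python
import socket

# Suffixes taken from this FAQ: genuine hostnames are *.baidu.com or *.baidu.jp.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")


def is_baidu_hostname(hostname):
    """Return True if the hostname ends with one of the Baidu suffixes."""
    name = hostname.rstrip(".").lower()  # PTR answers often end with a dot
    return name.endswith(BAIDU_SUFFIXES)


def verify_baiduspider(ip):
    """Reverse-resolve the IP and check the hostname (requires working DNS)."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no PTR record or lookup failure: treat as unverified
    return is_baidu_hostname(hostname)
```

Calling verify_baiduspider("123.125.66.120") performs the same kind of lookup as the host command in 5.1. The result depends on live DNS data, so it may change over time.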
6. How can I prevent Baiduspider from crawling my site?
Baiduspider follows the robots.txt protocol. You can prevent Baiduspider from crawling your entire site or specific content by specifying rules in robots.txt. Please note that if you do this, the affected pages will not appear in Baidu search results or in any other search results provided by Baidu. For details on setting up robots.txt, please see How to create a robots.txt.
You can set different rules for different user-agents. (Please note that Baiduspider-video does not currently support these rules.) If you want to block all of Baidu's user-agents, you can simply block Baiduspider.
The following robots.txt rules block all crawling from Baidu.
User-agent: Baiduspider
Disallow: /
The following robots.txt rules allow only Baiduspider-image to crawl the /image/ directory.
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-image
Allow: /image/
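The first rule set above, which blocks all crawling from Baidu, can be sanity-checked locally before you deploy it. The sketch below uses Python's standard urllib.robotparser module; the helper name and the example.com URL are illustrative. Note that Python's matching rules are not guaranteed to be identical to Baiduspider's, so treat this as a syntax check rather than a statement of how Baidu interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The "block all crawling from Baidu" rules quoted from this FAQ.
BLOCK_ALL_BAIDU = """\
User-agent: Baiduspider
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if Python's parser says user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, is_allowed(BLOCK_ALL_BAIDU, "Baiduspider", "http://example.com/page.html") reports that the page is blocked, while an unrelated user-agent is still allowed.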
Please note that pages crawled by Baiduspider-cpro are not built into the index, and that Baiduspider-cpro works under an agreement made with customers. It therefore does not follow the rules set in robots.txt. If you are not comfortable with Baiduspider-cpro, please contact union1@baidu.com.
Likewise, pages crawled by Baiduspider-ads are not built into the index, and Baiduspider-ads works under an agreement made with customers. It therefore does not follow the rules set in robots.txt. If you are not comfortable with Baiduspider-ads, please contact your customer service representative.
7. I have set up robots.txt on my site, so why is my site's content still displayed in the search results?
It takes time for Baidu to update its database. Baidu stops crawling your site once you have added robots.txt, but the index that was built previously takes several months to be removed from the search engine's database. Please also make sure that your robots.txt is written correctly.
If you urgently need your site removed from the search engine, please inform us at http://webmaster.baidu.com/feedback/index (arab, thai, Português).
8. How can I ask Baiduspider to index my pages without showing cached links in the search results?
Baiduspider follows the robots meta tag protocol. You can use a meta tag to ask Baiduspider to index your pages without showing cached links in the search results.
As with changes made through robots.txt, the update is not immediate. Baidu stops showing the cached links after you have updated the meta tag, but it takes 2 to 4 weeks to refresh content that was previously stored in Baidu's database.
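This FAQ does not show the meta tag itself. As a sketch only, the Python code below checks whether a page carries a robots-style meta tag with the widely used "noarchive" directive. The meta names "robots" and "Baiduspider" and the "noarchive" value are assumptions based on the common meta robots convention, not something stated in this text, so confirm the exact tag with Baidu before relying on it. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaRobotsCollector(HTMLParser):
    """Collect directives from <meta name=... content=...> robots tags."""

    def __init__(self, names):
        super().__init__()
        self.names = {name.lower() for name in names}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in self.names:
            # The content attribute is a comma-separated directive list.
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noarchive(html, names=("robots", "baiduspider")):
    """Return True if any matching meta tag contains 'noarchive'."""
    collector = MetaRobotsCollector(names)
    collector.feed(html)
    return "noarchive" in collector.directives
```

Running has_noarchive on the HTML source of a page is a quick way to confirm that the tag you intended to add is actually present in the served document.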
9. Will crawling by Baidu lead to bandwidth congestion?
Generally speaking, crawling by Baiduspider does not lead to bandwidth congestion. If congestion does occur, it is probably caused by unauthorized access that pretends to be Baiduspider. If you find that an agent named Baiduspider is making your network busy, please inform us at http://webmaster.baidu.com/feedback/index (arab, thai, Português) as soon as possible. Providing the log for that particular time frame will greatly help us investigate and analyze the problem.
10. How can I get my websites, including independent sites and blogs, indexed by Baidu?
Baidu indexes sites and pages that meet the requirements of the user search experience.
To help Baiduspider discover your site more quickly, you are also welcome to submit your website address at
http://www.baidu.com/search/url_submit.htm
Submitting the homepage is enough; you do not need to submit individual content pages.
The value of a page is the only factor that determines whether Baidu indexes it. Indexing has nothing to do with commercial factors such as Baidu Promotion.
11. How can I tell whether my website has been indexed by Baidu? Is the result provided by the site: syntax equivalent to the actual number of indexed pages?
To check whether your site has been indexed by Baidu, use the site: syntax. Enter “site:(your domain name)” in the search box, for instance
http://www.baidu.com/s?wd=site%3Awww.baidu.com. Your site will be displayed as a result if it has been indexed.
The result count provided by the site: syntax is only an estimate for reference.
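The query URL in the example above can be built for any domain by URL-encoding the site: expression. The helper below is a small Python sketch; the function name is hypothetical, and the base URL and the wd parameter are taken from the example in this answer.

```python
from urllib.parse import urlencode


def baidu_site_query_url(domain):
    """Build the Baidu search URL for a site:<domain> query."""
    # urlencode percent-encodes the colon, giving wd=site%3A<domain>.
    return "http://www.baidu.com/s?" + urlencode({"wd": "site:" + domain})
```

For the domain www.baidu.com this returns http://www.baidu.com/s?wd=site%3Awww.baidu.com, matching the example URL.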
12. How can I prevent my website from being indexed by Baidu?
Baidu strictly complies with the robots.txt protocol. For detailed information, please visit
http://www.robotstxt.org/.
You can use robots.txt to prevent all or some of your pages from being crawled and indexed by Baidu. For the specific method, please refer to How to Write Robots.txt.
If you set up robots.txt to restrict crawling after Baidu has already indexed your website, it usually takes 48 hours for the updated robots.txt to take effect, after which new pages will not be indexed. Note that content indexed before the restriction may take several months to disappear from search results.
If you urgently need to restrict crawling, you can report to http://webmaster.baidu.com/feedback/index (arab, thai, Português) and we will deal with it as soon as possible.
13. How do I report problems with Baiduspider?
If you have any problem with crawling, please contact http://webmaster.baidu.com/feedback/index (arab, thai, Português) for further help.
So that we can handle your feedback effectively and promptly, please make sure to report both the problem and the domain name of your website. It is even better if you can provide the website's crawl log, which helps us find the cause and solve the problem quickly.
14. I have set up robots.txt to restrict crawling by Baidu, so why doesn’t it take effect?
Baidu strictly complies with the robots.txt protocol. However, our DNS updates periodically, so after you set up robots.txt, Baidu needs some time before it stops crawling your site.
If you urgently need to restrict crawling, you can report to http://webmaster.baidu.com/feedback/index (arab, thai, Português).
In addition, please check that your robots.txt is correctly formatted.
15. Why are some pages, such as private pages without links or pages requiring access rights, also indexed by Baidu?
Baiduspider discovers pages by following the links between pages.
Besides the internal links between pages of one site, there are also external links between different sites. Therefore, even if a page cannot be reached through internal links, it can still be indexed by a search engine if another website links to it.
Baiduspider has the same access rights as an ordinary user, so the spider cannot visit pages that ordinary users cannot access. There are two reasons why Baidu may appear to index pages that are protected by access rights:
o There was no access restriction when the spider crawled the content; the access rights changed after the crawl.
o The content is protected by access rights, but because of security holes in the website, users can reach it through certain special paths. If those paths are published on the internet, the spider can follow them and crawl the content.
If you do not want private content to be indexed, you can restrict crawling with robots.txt. You are also welcome to report to http://webmaster.baidu.com/feedback/index (arab, thai, Português) for a solution.
16. Why does the number of indexed pages of my website tend to decrease?
Because of server instability, Baiduspider cannot crawl some pages when it checks for updates and changes, so those pages are temporarily deleted.
Your website does not fit the user search experience.
17. Why has my page disappeared from Baidu's search results?
Baidu does not promise that every page can be found through search.
If your page cannot be found on Baidu for a long time or suddenly disappears from the results, the possible reasons are as follows:
o Your website does not fit the user search experience.
o Because the server hosting your website is unstable, Baidu has temporarily deleted it. The problem will be resolved once the server is stable.
o Some content on the page does not conform to laws and regulations.
o Other technical problems.
The following claims are incorrect and groundless:
o If a website discontinues paying after participating in Baidu Promotion, it will disappear from Ba