YQL for Content and API Providers

Blocking Data Access from YQL

YQL fetches content from URLs when requested by a developer. Because YQL is not a Web crawler, it does not follow the robots exclusion protocol for non-HTML data such as XML or CSV from a site.

To stop YQL from accessing any site content, block the YQL user-agent (Yahoo Pipes 2.0) on your Web server.

For example, on Apache servers, add this rule to your virtual host block in httpd.conf:

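          # Flag any request whose User-Agent contains "Yahoo Pipes" (case-insensitive)
          SetEnvIfNoCase User-Agent "Yahoo Pipes" noYQL
          # Order/Allow/Deny is Apache 2.2 syntax; Apache 2.4 needs mod_access_compat for it
          <Limit GET POST>
          Order Allow,Deny
          Allow from all
          Deny from env=noYQL
          </Limit>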
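One way to confirm that a rule like the one above is active is to send a request that carries the YQL user-agent and check for a 403 Forbidden response, which is what Apache returns for a denied request. The following is a minimal Python sketch, not an official tool: the function names build_probe and is_blocked are hypothetical, and the URL in the usage comment is a placeholder for a page on your own site.

```python
import urllib.error
import urllib.request

YQL_USER_AGENT = "Yahoo Pipes 2.0"  # user-agent string named in this document


def build_probe(url):
    """Build a GET request that identifies itself with the YQL user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": YQL_USER_AGENT})


def is_blocked(url, timeout=10):
    """Return True if the server answers the probe with 403 Forbidden."""
    try:
        with urllib.request.urlopen(build_probe(url), timeout=timeout):
            return False  # the server served the content, so the rule is not active
    except urllib.error.HTTPError as err:
        return err.code == 403


# Usage (placeholder URL; replace with a page on your own server):
# print(is_blocked("http://www.example.com/feed.xml"))
```

A result of False after adding the rule usually means the rule sits in a different virtual host than the one serving the URL, or that the server has not been reloaded.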

Blocking HTML Data Scraping from YQL

YQL uses the robots.txt file on your server to determine the Web pages accessible from your site. YQL uses the user-agent "Yahoo Pipes 2.0" when accessing the robots.txt file, and checks it for allows/disallows from this user agent.

If the robots.txt check does not prevent YQL from accessing your content, YQL then fetches the target page using a different user agent:

"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14"

Therefore, to deny YQL access to your content, add "Yahoo Pipes 2.0" to the relevant parts of your robots.txt, for example:

          User-agent: Yahoo Pipes 2.0
          Disallow: /
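The effect of these two lines can be checked locally before deploying them. The sketch below uses Python's standard urllib.robotparser module; it shows how that parser reads the rules, which is an assumption about, not a guarantee of, how YQL's own robots.txt check matches user agents. The helper name allowed and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules shown above.
ROBOTS_TXT = """\
User-agent: Yahoo Pipes 2.0
Disallow: /
"""

# The user agent YQL uses for the page fetch itself, as quoted in this document.
FETCH_USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) "
    "Gecko/20080404 Firefox/2.0.0.14"
)


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.example.com/page.html"  # placeholder URL
print(allowed("Yahoo Pipes 2.0", page))    # the robots.txt check is denied
print(allowed(FETCH_USER_AGENT, page))     # the browser-like agent is not named in the rules
```

The second call illustrates why the rule has to name "Yahoo Pipes 2.0": the browser-like agent used for the page fetch is not matched by this entry, so the denial has to take effect at the robots.txt check.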

Another approach is to block YQL on your Web server. For example, in Apache, add this to your virtual host block in httpd.conf:

          SetEnvIfNoCase User-Agent "Yahoo Pipes" noYQL
          <Limit GET POST>
          Order Allow,Deny
          Allow from all
          Deny from env=noYQL
          </Limit> 

IP-Based Rate Limiting

YQL lets API providers apply IP-based rate limits accurately: requests can be counted against the IP address of the YQL developer rather than the IP addresses of the shared proxy servers that YQL uses to access content on the Web.

YQL determines the last valid client IP address that connected to its Web service. On every outgoing request to external content and API providers, it places this address first in the X-FORWARDED-FOR HTTP header, as follows:

X-FORWARDED-FOR: 1.2.3.4, 5.6.7.8, 9.10.11.12

In the above example, the request arriving at YQL came from the 1.2.3.4 IP address. IP rate limiters should use this value rather than the IP addresses of YQL proxy servers.

YQL also sets the CLIENT-IP HTTP header to this IP address, for example:

CLIENT-IP: 1.2.3.4

Note: Because these headers are "unsigned," they can be spoofed. Therefore, providers should only use these headers if the proxy setting them is trusted. The IP addresses of the proxy hosts that should be trusted can be found at https://developer.yahoo.com/yql/proxy.txt. This file will be updated as our proxy hosts change.
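A provider-side rate limiter could combine these rules roughly as follows. This is a minimal Python sketch under stated assumptions: the function name rate_limit_key is hypothetical, the headers are assumed to arrive as a plain dictionary, and the set of trusted proxy addresses is assumed to have been loaded already (the format of proxy.txt is not described here, so parsing it is left out). The proxy address 203.0.113.10 in the usage lines is a documentation placeholder, not a real YQL proxy.

```python
import ipaddress


def rate_limit_key(peer_ip, headers, trusted_proxies):
    """Return the IP address a request should be counted against.

    peer_ip         -- address of the host that opened the connection
    headers         -- dict of request headers (names matched case-insensitively)
    trusted_proxies -- set of proxy IP strings the provider trusts (assumed input)
    """
    # The headers are unsigned, so ignore them unless a trusted proxy set them.
    if peer_ip not in trusted_proxies:
        return peer_ip

    normalized = {name.lower(): value for name, value in headers.items()}

    # The first X-FORWARDED-FOR entry is the client that connected to YQL.
    first = normalized.get("x-forwarded-for", "").split(",")[0].strip()
    candidate = first or normalized.get("client-ip", "").strip()

    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        # Missing or malformed value: fall back to the connecting address.
        return peer_ip


trusted = {"203.0.113.10"}  # placeholder for addresses taken from proxy.txt
headers = {
    "X-FORWARDED-FOR": "1.2.3.4, 5.6.7.8, 9.10.11.12",
    "CLIENT-IP": "1.2.3.4",
}
print(rate_limit_key("203.0.113.10", headers, trusted))  # 1.2.3.4
print(rate_limit_key("198.51.100.7", headers, trusted))  # 198.51.100.7
```

Falling back to the connecting address whenever the peer is untrusted or the header is malformed keeps a spoofed header from moving a request into someone else's rate-limit bucket.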