The Google Search Appliance (GSA) advertises, via the Accept-Encoding field of its HTTP request header, that it can handle gzip content. In practice this does not appear to hold, at least for gzip-encoded content coming from MediaWiki.
The HTTP request header looks like this:
GET
HOST: www.xyz.com
ACCEPT: text/html,text/plain,application/*
FROM:
USER-AGENT: gsa-crawler (Enterprise; ... ; ...)
ACCEPT-ENCODING: gzip
The solution is to remove the gzip option from Accept-Encoding, which can be done as follows:
- Go to the GSA admin interface.
- Open Crawl and Index->HTTP Headers.
- Set the field Additional HTTP Headers for Crawler to
Accept-Encoding:
The HTTP request header now looks like this:
GET
HOST: www.xyz.com
ACCEPT: text/html,text/plain,application/*
FROM:
USER-AGENT: gsa-crawler (Enterprise; ... ; ...)
ACCEPT-ENCODING:
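To check from outside the appliance how a server answers each of the two request headers above, a small script can send both variants and compare the responses. The following is only a sketch under stated assumptions: the function names (crawler_headers, looks_gzipped, probe) and the User-Agent string are made up for illustration and are not part of the GSA, and the script only approximates what the crawler actually sends. An empty Accept-Encoding value is generally understood by servers to mean that no content coding is wanted.

```python
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def crawler_headers(accept_encoding):
    """Build request headers resembling the ones shown above.

    Pass "gzip" for the original behaviour, or "" to mimic the bare
    "Accept-Encoding:" line set in the admin interface.
    """
    return {
        "Accept": "text/html,text/plain,application/*",
        "User-Agent": "gsa-crawler-check",  # hypothetical stand-in value
        "Accept-Encoding": accept_encoding,
    }


def looks_gzipped(body):
    """Return True if the raw response body starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


def probe(url, accept_encoding):
    """Fetch url and return (Content-Encoding header, body is gzip?, body size)."""
    request = urllib.request.Request(url, headers=crawler_headers(accept_encoding))
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read()  # urllib returns the body as sent, without decompressing
        encoding = response.headers.get("Content-Encoding", "")
    return encoding, looks_gzipped(body), len(body)
```

Calling probe(url, "gzip") and probe(url, "") against a page on the wiki should show whether the server stops compressing once the gzip option is removed; if the second call still reports gzip, the problem lies elsewhere.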
Solution source: a posting in the Google Search Appliance/Google Mini group. I found that simply setting the field to “Accept-Encoding:” worked just fine; there was no need to include “foo”.