HTTP Caching

DONG Yuxuan @ Feb 08, 2020 Asia/Shanghai

This article introduces the basic rules of HTTP caching and a practical configuration for Apache httpd.

When the browser is about to send an HTTP request with the GET method, it first checks whether it holds a fresh (not expired) cached copy of the resource. If it does, the browser uses that copy directly instead of sending the request. If the cached copy has expired, the browser sends a validation request. If the server finds that the cached copy is still up to date, it sends a 304 Not Modified response, and the browser uses the cache as the response and marks it fresh again. If the server finds that the cached copy is indeed out of date, it sends the new content, which the browser stores as the new cache and marks fresh.
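The flow above can be sketched in a few lines of Python. This is only an illustration, not how any real browser is implemented: `CacheEntry`, `get`, and the `send` callback (a stand-in for the network) are hypothetical names, and freshness is reduced to a single lifetime in seconds.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CacheEntry:
    body: bytes
    stored_at: float  # time the entry was stored or last validated, in seconds
    max_age: float    # freshness lifetime in seconds


def is_fresh(entry: CacheEntry, now: float) -> bool:
    return now - entry.stored_at < entry.max_age


def get(entry: Optional[CacheEntry], now: float,
        send: Callable[[Optional[CacheEntry]], Tuple[int, bytes, float]]) -> CacheEntry:
    """Return the cache entry to serve for one GET request.

    `send` is a hypothetical stand-in for the network: given the stale entry
    (or None when nothing is cached), it returns (status, body, max_age).
    """
    if entry is not None and is_fresh(entry, now):
        return entry                                 # fresh: no request is sent
    status, body, max_age = send(entry)              # normal or validation request
    if status == 304 and entry is not None:
        return CacheEntry(entry.body, now, max_age)  # still valid: mark fresh again
    return CacheEntry(body, now, max_age)            # new content replaces the cache
```

The sketch sends a request only when there is no cached copy or the copy has expired, and a 304 response renews the old body instead of replacing it.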

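The server uses the Cache-Control HTTP response header to control how the browser caches the response. The header's value is a comma-separated list of directives; the ones used in this article are public, max-age, no-cache, and no-store.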

For example, Cache-Control: public, max-age=600 means the response can be cached by the browser, proxies, and CDNs for 10 minutes. Within those 10 minutes, if the resource is requested again, the cache can be used directly. After 10 minutes, the cache must be validated before use.
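To make the max-age rule concrete, here is a minimal Python sketch that parses a Cache-Control value and decides whether a cached copy of a given age may be used without validation. The function names are made up for this example, and the parser is deliberately naive: it splits on commas, so it does not handle quoted arguments that contain commas, and it assumes max-age is a valid integer.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def usable_without_validation(value: str, age_seconds: float) -> bool:
    """True if a cached copy this old can be served without asking the server."""
    directives = parse_cache_control(value)
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    return max_age is not None and age_seconds < int(max_age)
```

With `public, max-age=600`, a copy that is 599 seconds old is served directly, while one that is 600 seconds old has to be validated first.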

There are two ways to validate a cache.

The first way is through the ETag and If-None-Match headers. An HTTP response can carry the ETag header to represent the current state of the resource. For example, the ETag can be a hash of the resource. The browser stores the ETag value along with the cache. When it needs the resource but the cache is considered expired, it sends a validation request for the resource with the stored ETag as the value of the If-None-Match header. If that ETag still matches the current state of the resource, the server tells the browser the cache is still valid with a 304 Not Modified response. If the resource has changed, the server sends the new content and a new ETag.
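On the server side, the ETag check might look like the following Python sketch. It assumes the ETag is a SHA-256 hash of the content (one possible choice, as mentioned above) and only handles a single strong ETag in If-None-Match; lists of ETags, the `*` wildcard, and weak validators are left out. `make_etag` and `respond_with_etag` are hypothetical names.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(content: bytes) -> str:
    # ETag values are quoted strings; here the hash of the content is used.
    return '"' + hashlib.sha256(content).hexdigest() + '"'


def respond_with_etag(content: bytes,
                      if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET on a resource with this content."""
    etag = make_etag(content)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""   # cached copy is still valid: empty body
    return 200, {"ETag": etag}, content   # first request or changed resource
```

The 304 response carries no body, which is why validation is cheap even though it still needs a round trip.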

The second way uses the Last-Modified and If-Modified-Since headers. A response can carry the Last-Modified header to specify the last modification time of the resource. When the browser caches the resource, it also stores the Last-Modified value. If the cache needs to be validated, the browser sends a request for the resource with an If-Modified-Since header whose value is the stored Last-Modified. The server compares that value with the actual last modification time of the resource to determine whether the cache is still valid.
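The time-based comparison can be sketched in Python with the standard `email.utils` helpers for HTTP dates. `respond_with_last_modified` is a hypothetical name, and `mtime` stands for the resource's modification time as a Unix timestamp. HTTP dates have one-second resolution, so the sketch truncates `mtime` to whole seconds, and it treats an unparsable If-Modified-Since as if the header were absent.

```python
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def respond_with_last_modified(content: bytes, mtime: float,
                               if_modified_since: Optional[str]
                               ) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a resource last modified at `mtime`."""
    headers = {"Last-Modified": formatdate(mtime, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None                    # unparsable date: ignore the header
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if int(mtime) <= since.timestamp():
                return 304, headers, b""    # not modified since the cached copy
    return 200, headers, content
```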

Caching can speed up your web site, but it can also cause mistakes. If it is not configured properly, the browser may serve the user a stale or wrong resource. Worse, the browser may serve wrong resources for only part of your site, and that mismatch can break the whole web app.

The simplest way to balance performance and usability is Cache-Control: no-cache. It asks the browser to validate the cache before every use. Every request still needs a network round trip, but if the cache is still valid, the server can send a very small response, which saves traffic.

For an SPA (Single Page Application) built with modern toolchains, .js files, .css files, and other assets like images are usually named by their hash values. Thus we can set a very long freshness lifetime like Cache-Control: public, max-age=315360000. This keeps the cache fresh for about 10 years. However, the cache config of index.html must be different: if the browser gets an outdated index.html, the URLs of the assets will also be wrong. Thus we could use Cache-Control: no-cache for index.html. Moreover, if there are dynamic resources like CGI scripts, we should use Cache-Control: no-store for them.
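The policy above can be written down as a small Python function, which also checks the arithmetic behind the 10-year value. The function name and the path rules (a `/cgi-bin/` prefix for dynamic resources, `index.html` as the entry file) are assumptions made for this sketch; a real site may lay out its files differently.

```python
# 10 years of 365 days each, expressed in seconds.
TEN_YEARS = 10 * 365 * 24 * 60 * 60


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a request path (hypothetical layout)."""
    if path.startswith("/cgi-bin/"):
        return "no-store"    # dynamic resources must not be cached
    if path == "/" or path.endswith("/index.html"):
        return "no-cache"    # the entry file is validated before every use
    return f"public, max-age={TEN_YEARS}"  # hash-named assets never change
```

Here `TEN_YEARS` evaluates to 315360000, the same number used in the max-age directive above.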

In Apache, we could implement the above ideas with the following snippet.

# Ensure `mod_headers` is enabled

# Assets named by hash can be cached forever
Header set Cache-Control "public, max-age=315360000"

# The entry cache must be validated
<Files index.html>
	Header set Cache-Control no-cache
</Files>

# Dynamic resources must **not** be cached
<Directory ${PREFIX}/cgi-bin>
	Header set Cache-Control no-store
</Directory>