HTTP Cache

DONG Yuxuan @ Feb 08, 2020

When the browser is about to send an HTTP request with the GET method, it first checks whether the resource is cached and whether the cache is fresh (not expired). If there’s a fresh cache, the browser uses it directly instead of sending the request. Otherwise the browser sends the request. If there’s an expired cache, the request carries some validation information, and the server uses it to validate the cache. If the server finds that the cache is actually still valid, it sends a 304 Not Modified response to tell the browser so. The browser then uses the cache as the response and marks it fresh again. If the server finds that the cache is indeed outdated, it sends the new content. The browser stores the new content as the new cache and marks it fresh.
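The flow above can be sketched in a few lines of Python. This is only an illustration, not how any real browser is implemented: the names CacheEntry, get, and fetch are hypothetical, and fetch stands in for the network call, returning a status, a body, a validator, and a freshness lifetime in seconds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class CacheEntry:
    body: bytes
    validator: str     # validation information, e.g. an ETag value
    expires_at: float  # moment after which the entry is no longer fresh


# Hypothetical network call: fetch(url, validator) ->
# (status, body, validator, max_age_seconds)
Fetch = Callable[[str, Optional[str]], Tuple[int, bytes, str, float]]


def get(url: str, cache: Dict[str, CacheEntry], fetch: Fetch, now: float) -> bytes:
    entry = cache.get(url)

    # Fresh cache: use it directly, no request is sent.
    if entry is not None and now < entry.expires_at:
        return entry.body

    # No cache, or an expired one: send the request. An expired
    # entry contributes its validator so the server can validate it.
    status, body, validator, max_age = fetch(url, entry.validator if entry else None)

    # 304 Not Modified: keep the cached body and mark it fresh again.
    if status == 304 and entry is not None:
        entry.expires_at = now + max_age
        return entry.body

    # Otherwise the new content replaces the cache.
    cache[url] = CacheEntry(body, validator, now + max_age)
    return body
```

Passing now explicitly keeps the sketch deterministic; a real client would read the clock itself.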

The server uses the Cache-Control HTTP header to control how the browser caches the response. The header carries a comma-separated list of directives; the ones used in this article are public, max-age, no-cache, and no-store.

For example, Cache-Control: public, max-age=600 means the response can be cached by the browser, proxies, and CDNs for 10 minutes. Within those 10 minutes, if the resource is requested again, the cache can be used directly. After 10 minutes, the cache must be validated before it is used.
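As a rough illustration of how such a header value can be read, here is a small Python sketch. The function names parse_cache_control and is_fresh are made up for this example, and the freshness check is simplified: it only looks at max-age, no-cache, and no-store, while real caches also consider things like the Expires header and heuristics.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def is_fresh(cache_control: str, age_seconds: float) -> bool:
    """Simplified check: may a cache of this age be used without validation?"""
    d = parse_cache_control(cache_control)
    # no-store forbids caching; no-cache demands validation every time.
    if "no-store" in d or "no-cache" in d:
        return False
    max_age = d.get("max-age")
    if max_age is None or not max_age.isdigit():
        return False
    return age_seconds < int(max_age)
```

With the header from the example, is_fresh("public, max-age=600", 300) is true, while an age of 600 seconds or more requires validation.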

There are two ways to validate a cache.

The first way is through the ETag and If-None-Match headers. An HTTP response can have the ETag header to represent the current state of the resource. For example, the ETag can be a hash of the resource. The browser stores the value of ETag along with the cache. When it needs the resource again but the cache is considered expired, it sends a request for the resource with the If-None-Match header set to the stored ETag. If the server finds that the ETag still matches the current state of the resource, the cache is still valid, and the server says so with a 304 Not Modified response. If the ETag no longer matches, the server sends the new content and the new ETag.
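The server side of this exchange might look like the following Python sketch. It assumes the ETag is a SHA-256 hash of the body, which is just one possible choice, and the names make_etag and respond are hypothetical. It also ignores details of the real protocol such as weak validators and lists of ETags in If-None-Match.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    # ETag values are quoted strings; here the hash of the content is used.
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET on a resource with this body."""
    etag = make_etag(body)
    # The stored ETag still matches: the cache is valid, send no content.
    if if_none_match is not None and if_none_match.strip() == etag:
        return 304, {"ETag": etag}, b""
    # No validator, or the resource changed: send new content and new ETag.
    return 200, {"ETag": etag}, body
```

The first request has no If-None-Match and gets a 200 with an ETag. Sending that ETag back yields a 304 as long as the body is unchanged.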

The second way is using the Last-Modified and If-Modified-Since headers. A response can have the Last-Modified header to specify the last modification time of the resource. When the browser caches the resource, it also stores Last-Modified. If the cache needs to be validated, the browser sends a request for the resource with the If-Modified-Since header, whose value is the stored Last-Modified. The server compares this value with the real last modification time of the resource to determine whether the cache is still valid.
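The comparison can be sketched with Python's standard email.utils module, which handles the date format used by these headers. The function names last_modified_header and not_modified are invented for this example, and the modification time is assumed to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_header(mtime: datetime) -> str:
    # mtime must be an aware datetime in UTC for usegmt=True.
    return format_datetime(mtime, usegmt=True)


def not_modified(if_modified_since: str, mtime: datetime) -> bool:
    """True if the resource has not changed since the date the browser sent."""
    since = parsedate_to_datetime(if_modified_since)
    # HTTP dates have one-second resolution, so drop microseconds first.
    return mtime.replace(microsecond=0) <= since
```

When not_modified returns true, the server can answer with 304 Not Modified; otherwise it sends the new content with a new Last-Modified value.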

Caching can speed up your web site, but it can also cause mistakes. If it’s not configured properly, the browser may serve the user an outdated resource. Worse, the browser may serve outdated resources for only a part of your site, and this mix of old and new files can break the whole web app.

The simplest way to balance performance and correctness is to use Cache-Control: no-cache. It asks the browser to validate the cache every time. If the cache is still valid, the server can answer with a very small 304 response, which saves traffic.

For an SPA (Single Page Application) built with modern toolchains, .js and .css files and other assets like images are usually named by their hash values. Thus we can set a very long fresh duration like Cache-Control: public, max-age=315360000. This keeps the cache fresh for about 10 years. However, the cache config of index.html must be different, because if the browser gets an outdated index.html, the asset URLs in it will be outdated too. Thus we could use Cache-Control: no-cache for index.html. Moreover, if there are dynamic resources like CGI scripts, we should use Cache-Control: no-store for them.
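The policy described above can be written down as a small Python function. This is a sketch under assumptions: the pattern for hash-named assets (a hex hash before the extension, as in app.3f9a2c1b.js), the list of extensions, and the /cgi-bin/ prefix are all illustrative and depend on the actual toolchain and server layout.

```python
import re

# Assumed naming scheme: name.<hex hash>.<extension>
HASHED_ASSET = re.compile(r"\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$")


def cache_control_for(path: str) -> str:
    # Dynamic resources must not be cached at all.
    if path.startswith("/cgi-bin/"):
        return "no-store"
    # A hash-named asset never changes under the same URL.
    if HASHED_ASSET.search(path):
        return "public, max-age=315360000"
    # Everything else, including index.html, is validated every time.
    return "no-cache"
```

Falling back to no-cache for unknown paths is the conservative choice here: the cost is an extra validation request, never an outdated page.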

In Apache, we could implement the above ideas with the following snippet.

# Ensure `mod_headers` is enabled

# Assets named by hash can be cached forever
Header set Cache-Control "public, max-age=315360000"

# The entry cache must be validated
<Files index.html>
	Header set Cache-Control no-cache
</Files>

# Dynamic resources must **not** be cached
<Directory ${PREFIX}/cgi-bin>
	Header set Cache-Control no-store
</Directory>