I recently found myself peering under Django’s hood, trying to better understand how it manages static files and file uploads. I’ve been a Django user for years, yet I’ve never felt that I understood its storage layer.
What I found is a story told again and again in code: incremental change, organic growth, and strong path dependence. The storages layer is fantastically useful and flexible; for most people, it just works. On the other hand, if you’re actively building something rich and strange with it, then perhaps this historical perspective (and kibitzing!) will be of interest.
In the beginning
The very first version of Django shipped with support for file uploads. (Support for static files would have to wait.) To handle a variety of production scenarios, Django 1.0 introduced the notion of a Storage. Django storages are lowest-common-denominator abstract filesystems, but with a twist: they also surface a mapping between (private) filesystem paths and the (public) URLs where one can actually request those files.
Django 1.0 also shipped with a single concrete storage, FileSystemStorage, which simply wrapped the local filesystem. Since all of this code was strictly intended to be used with file uploads, the class defaulted to using MEDIA_URL as the base of its path-to-URL mapping, a default that lives on in Django even today.
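To make the "twist" concrete, here is a minimal sketch of a storage that maps one name to both a private path and a public URL. It is a toy model, not Django's code: the class name ToyFileSystemStorage and the MEDIA_ROOT and MEDIA_URL values are assumptions made up for illustration.

```python
import posixpath
from urllib.parse import urljoin

# Hypothetical stand-ins for Django settings; the values are illustrative only.
MEDIA_ROOT = "/srv/example/media"
MEDIA_URL = "/media/"


class ToyFileSystemStorage:
    """Toy model of a storage: maps a name to a private path and a public URL."""

    def __init__(self, location=None, base_url=None):
        # Fall back to the media settings when nothing is passed in,
        # mirroring the default described in the text.
        self.location = MEDIA_ROOT if location is None else location
        self.base_url = MEDIA_URL if base_url is None else base_url

    def path(self, name):
        # Private side: where the file lives on disk.
        return posixpath.join(self.location, name)

    def url(self, name):
        # Public side: where a browser would request the file.
        return urljoin(self.base_url, name)


storage = ToyFileSystemStorage()
upload_path = storage.path("avatars/me.png")
upload_url = storage.url("avatars/me.png")
```

With no arguments, the toy storage resolves everything against the media settings, which is exactly the coupling the rest of this post keeps running into.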
A couple years later, Django 1.3 shipped with its star feature: support for static files. The new staticfiles package became the Storage layer’s second real customer.
A small issue must have been apparent to Django’s developers at the time. It made sense to use FileSystemStorage to store static files locally, but the class defaults were no good: static files might need entirely different paths than uploaded media.
Oddly, rather than resolving this small problem by removing references to MEDIA_* in the storage layer, thereby clarifying its layering in the ecosystem, staticfiles opted to introduce a new derived class, StaticFilesStorage, whose sole purpose was to override the defaults to STATIC_URL. I’m not sure what the motivation was: it may have been historical, since staticfiles was originally a third-party package. Regardless, it seems to cause developer confusion even today.
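The two designs can be contrasted in a few lines. This is a sketch under stated assumptions, not Django's source: the Toy* class names and all four settings values are hypothetical, and the base class is reduced to its constructor.

```python
# Hypothetical settings values, for illustration only.
MEDIA_ROOT, MEDIA_URL = "/srv/example/media", "/media/"
STATIC_ROOT, STATIC_URL = "/srv/example/static", "/static/"


class ToyFileSystemStorage:
    """Toy base storage whose defaults are tied to the media settings."""

    def __init__(self, location=None, base_url=None):
        self.location = MEDIA_ROOT if location is None else location
        self.base_url = MEDIA_URL if base_url is None else base_url


class ToyStaticFilesStorage(ToyFileSystemStorage):
    """Exists only to swap the media defaults for the static ones."""

    def __init__(self, location=None, base_url=None):
        if location is None:
            location = STATIC_ROOT
        if base_url is None:
            base_url = STATIC_URL
        super().__init__(location, base_url)


media = ToyFileSystemStorage()
static = ToyStaticFilesStorage()
# The alternative: no subclass, the caller passes both values explicitly.
explicit = ToyFileSystemStorage(location=STATIC_ROOT, base_url=STATIC_URL)
```

Here static and explicit end up configured identically; the subclass adds no behavior, only different defaults.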
Other small sins were committed with Django 1.3. The staticfiles package had the task of finding static files and collecting them into a final location, defined by STATIC_ROOT. But where were the static files to be found? Enter Django 1.3’s Finder abstraction. Django 1.3 shipped with several finders, including the AppDirectoriesFinder, which looks for content in the static subdirectories of Django apps. Curiously, 1.3 also shipped with both a FileSystemFinder, which wraps (multiple) FileSystemStorage instances under the hood, and a BaseStorageFinder, which wraps an arbitrary Storage instance. I think the motivation for FileSystemFinder was to support Django’s convenient new STATICFILES_DIRS setting “out of the box”, but the partial functional overlap between these new finders also led to confusion.
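The core job of a finder, locating a relative path somewhere in an ordered list of source directories, can be sketched as follows. ToyDirectoriesFinder is a hypothetical name and the first-match-wins rule is an assumption of this sketch; it is not a copy of any Django finder.

```python
import os
import tempfile


class ToyDirectoriesFinder:
    """Toy finder: searches an ordered list of directories for a relative path."""

    def __init__(self, directories):
        self.directories = list(directories)

    def find(self, relative_path):
        # Return the first match, so earlier directories take precedence.
        for root in self.directories:
            candidate = os.path.join(root, relative_path)
            if os.path.isfile(candidate):
                return candidate
        return None


# Build two throwaway directories to search; only the second has the file.
first = tempfile.mkdtemp()
second = tempfile.mkdtemp()
with open(os.path.join(second, "site.css"), "w") as f:
    f.write("body { margin: 0 }")

finder = ToyDirectoriesFinder([first, second])
found = finder.find("site.css")
missing = finder.find("nope.js")
```

Note that nothing here needs a URL: the finder only ever deals in filesystem paths.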
Another strangeness shipped with Django 1.3: there were now real-world storages where the special “twist” of having to map between paths and URLs no longer made sense. The mapping continued to make sense for file uploads and collected static files, but for the storages used in finders, the URL side of the mapping was meaningless. No effort was made to clarify or refactor the API.
Cached static files
Django 1.4 included a key new staticfiles feature: cached static files. Caching gave developers the ability to automatically generate and append content hashes to filenames during collection (like style-91a0.css), permitting them to leverage far-future Expires headers for static content.
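The idea of a content-hashed filename can be sketched in a few lines. The function name hashed_name, the choice of SHA-256, the separator, and the 12-character length are all assumptions made for this example rather than a description of Django's implementation.

```python
import hashlib
import posixpath


def hashed_name(name, content, separator=".", length=12):
    """Insert a short content hash between a file's stem and its extension."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    stem, ext = posixpath.splitext(name)
    return f"{stem}{separator}{digest}{ext}"


v1 = hashed_name("css/style.css", b"body { color: red }")
v2 = hashed_name("css/style.css", b"body { color: blue }")
```

Because the name changes whenever the content does, v1 and v2 differ, and a browser can cache either one indefinitely without ever serving stale styles.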
Responsibility for hashing content is split in two. The first interested party is the collectstatic management command. After finishing collection, it looks for a magic method, post_process(), on the underlying Storage and calls it if present. The second party is the storage itself: its post_process() method is intended to be generic, performing arbitrary work and returning a list of impacted static files.
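The "call it if present" handshake can be sketched like this. Everything here is hypothetical and simplified: the class names, the finish_collection helper, and the argument and return shapes are assumptions for illustration, not the actual collectstatic code.

```python
class PlainStorage:
    """A storage with no post-processing hook."""


class HashingStorage:
    """A storage that opts in by defining post_process()."""

    def post_process(self, collected_names):
        # Pretend to rename each file and report what changed.
        return [(name, name + ".hashed") for name in collected_names]


def finish_collection(storage, collected_names):
    # Look the hook up dynamically; storages without it are simply skipped.
    hook = getattr(storage, "post_process", None)
    if callable(hook):
        return list(hook(collected_names))
    return []


plain_result = finish_collection(PlainStorage(), ["a.css"])
hashed_result = finish_collection(HashingStorage(), ["a.css"])
```

The command never needs to know what the hook does, only whether the storage happens to define one.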
The post_process() method is apparently not widely used: after a search across all public Python repositories on both GitHub and Bitbucket, the only implementation I found was Django’s own content hash generator. Tellingly, Django’s implementation is completely generic with respect to the underlying storage, living as it does in a mixin; it’s not clear to me it belongs on Storage at all.
Modern day Django
Fast forward to today, and the fantastic Django 1.7.4 release. Aside from a small refactoring to introduce the new ManifestFilesMixin (a slight variant on the previous CachedFilesMixin), and the introduction of deconstructibility to support using storages with 1.7’s new migrations, things have largely remained the same in this corner of core Django.
The Django community hasn’t stood still, however. The Django Storages project has implemented several commonly-used storages, including for Amazon S3, Azure, and other well-known cloud providers. And packages like Django Compressor have filled in the critical gap between static files, which are intended to be served directly, and the assets from which they are generated.
I think the fact that the ecosystem has flourished demonstrates that the original design, while imperfect, is still quite sound. I do think there is an opportunity for a beneficial (if backwards-incompatible) refactoring.
There’s an opportunity to clarify layering. Django’s storage abstractions should be independent of any specific use. For example, they should not refer back to MEDIA_* settings; media and static files should be strict consumers of the storage layer. It might also be worth reconsidering the restriction that storages must be constructible without any parameters; this has led to a flourishing of storage classes whose only purpose is to override defaults.
Then there’s the question of the precise responsibilities of Storage implementations. Path-to-URL mapping, so fundamental to storages in all cases in Django 1.0, is only sometimes needed today. In addition, there are plenty of real-world storages where common operations (directory listings, reading back written files) are either expensive in the underlying filesystem, or simply impossible. There is currently little clarity around which Storage methods are required in derived classes, and which are optional. The bottom line today seems to be: if you use an exotic Storage, and it blows up in your use case, then you’re out of luck.
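A small sketch shows how this failure mode plays out. The classes and the count_files consumer are hypothetical, invented for this example; the point is only that a backend can satisfy one caller and fail another at runtime, with nothing in the interface to warn either side.

```python
class ToyStorage:
    """Toy base class: every operation is declared, none is guaranteed."""

    def save(self, name, content):
        raise NotImplementedError("save")

    def listdir(self, path):
        raise NotImplementedError("listdir")


class WriteOnlyStorage(ToyStorage):
    """An 'exotic' backend that can accept writes but cannot list them."""

    def __init__(self):
        self.saved = {}

    def save(self, name, content):
        self.saved[name] = content
        return name


def count_files(storage, path=""):
    # A consumer that assumes listdir() works on every storage.
    return len(storage.listdir(path))


backend = WriteOnlyStorage()
backend.save("report.txt", b"hello")
try:
    count_files(backend)
    outcome = "ok"
except NotImplementedError as exc:
    outcome = f"unsupported: {exc}"
```

The write succeeds and the listing fails only once a consumer actually asks for it, which is the "out of luck" moment described above.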
Finally, the sheer number of third-party asset pipelines for Django shows that there’s a lot more room to grow. I suspect that, much like they did with migrations, the core Django team will take their time before finally deciding on the one true path forward.