Django Storage Minutia

I recently found myself peering under Django’s hood, trying to better understand how it manages static files and file uploads. I’ve been a Django user for years, yet I’ve never felt that I understood its storage layer.

What I found is a story told again and again in code: incremental change, organic growth, and strong path dependence. The storage layer is fantastically useful and flexible; for most people, it just works. On the other hand, if you’re actively building something rich and strange with it, then perhaps this historical perspective (and kibitzing!) will be of interest.

In the beginning

The very first version of Django shipped with support for file uploads. (Support for static files would have to wait.) To handle a variety of production scenarios, Django 1.0 introduced the notion of a Storage. Django storages are lowest-common-denominator abstract filesystems, but with a twist: they also surface a mapping between (private) filesystem paths and the (public) URLs where one can actually request those files.
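To make the “twist” concrete, here is a Django-free toy sketch of the idea. The class name, the constructor arguments, and the example paths are all made up for illustration; this is not Django’s actual code, only the shape of the path-to-URL mapping.

```python
from urllib.parse import quote


class ToyFileSystemStorage:
    """Toy stand-in for a storage: one name, two views of it."""

    def __init__(self, location, base_url):
        # Hypothetical parameters: a private directory and a public URL prefix.
        self.location = location.rstrip("/")
        # Normalise so that joining never drops or doubles a slash.
        self.base_url = base_url.rstrip("/") + "/"

    def path(self, name):
        # Private side: where the bytes live on disk.
        return self.location + "/" + name

    def url(self, name):
        # Public side: where a browser can request the same file.
        return self.base_url + quote(name)


uploads = ToyFileSystemStorage("/srv/example/media", "/media/")
print(uploads.path("avatars/me.png"))
print(uploads.url("avatars/me.png"))
```

The same relative name ("avatars/me.png") is the only thing callers hold on to; the storage decides both where it lives and where it is served from.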

Django 1.0 also shipped with a single concrete Storage implementation, FileSystemStorage, which simply wrapped the local filesystem. Since all of this code was strictly intended to be used with file uploads, the class defaulted to using MEDIA_ROOT and MEDIA_URL as the base of its path-to-URL mapping — a default that lives on in Django even today.

Static files

A couple of years later, Django 1.3 shipped with its star feature: support for static files. The new staticfiles package became the Storage layer’s second real customer.

A small issue must have been apparent to Django’s developers at the time. It made sense to use FileSystemStorage to store static files locally, but the class defaults were no good: static files might need entirely different paths than uploaded media.

Oddly, rather than resolving this small problem by removing references to MEDIA_* from the storage layer, which would have clarified its place in the ecosystem, staticfiles opted to introduce a new derived class, StaticFilesStorage, whose sole purpose was to override the defaults with STATIC_ROOT and STATIC_URL. I’m not sure what the motivation was: it may have been historical, since staticfiles was originally a third-party package. Regardless, it seems to cause developer confusion even today.
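The pattern is easy to see in miniature. The following is a toy model under stated assumptions: the settings object is a plain namespace with made-up paths, and the Toy* classes are hypothetical stand-ins rather than Django’s real classes.

```python
from types import SimpleNamespace

# Hypothetical stand-in for Django's settings module.
settings = SimpleNamespace(
    MEDIA_ROOT="/srv/example/media",
    MEDIA_URL="/media/",
    STATIC_ROOT="/srv/example/static",
    STATIC_URL="/static/",
)


class ToyFileSystemStorage:
    def __init__(self, location=None, base_url=None):
        # The general-purpose class quietly assumes it is storing uploads.
        self.location = settings.MEDIA_ROOT if location is None else location
        self.base_url = settings.MEDIA_URL if base_url is None else base_url


class ToyStaticFilesStorage(ToyFileSystemStorage):
    def __init__(self, location=None, base_url=None):
        # The subclass exists only to swap in a different pair of defaults.
        if location is None:
            location = settings.STATIC_ROOT
        if base_url is None:
            base_url = settings.STATIC_URL
        super().__init__(location, base_url)


media = ToyFileSystemStorage()
static = ToyStaticFilesStorage()
# The alternative: no subclass, the caller simply says what it wants.
explicit = ToyFileSystemStorage(settings.STATIC_ROOT, settings.STATIC_URL)
```

In this toy, static and explicit end up configured identically, which is the whole point: the subclass adds no behavior, only defaults.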

Other small sins were committed with Django 1.3. The staticfiles package had the task of finding static files and collecting them into a final location, defined by STATICFILES_STORAGE and STATIC_ROOT. But where were the static files to be found? Enter Django 1.3’s Finder abstraction. Django 1.3 shipped with several finders, including the AppDirectoriesFinder, which looks for content in the static subdirectories of Django apps. Curiously, 1.3 also shipped with both a FileSystemFinder, which wraps (multiple) FileSystemStorage instances under the hood, and a BaseStorageFinder, which wraps an arbitrary Storage instance. I think the motivation for FileSystemFinder was to support Django’s convenient new STATICFILES_DIRS setting “out of the box”, but the partial functional overlap between these new finders also led to confusion.
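Stripped to its essence, a finder answers one question: given a relative name, which source directory holds it? The sketch below is a toy, not Django’s Finder API; the class name, the find() signature, and the directory layout are assumptions made for illustration. Note that nothing in it needs a URL.

```python
import os
import tempfile


class ToyDirectoriesFinder:
    """Search a list of source directories, in order, for a relative name."""

    def __init__(self, roots):
        self.roots = list(roots)

    def find(self, name):
        for root in self.roots:
            candidate = os.path.join(root, *name.split("/"))
            if os.path.isfile(candidate):
                return candidate  # first match wins
        return None


# Build a throwaway layout: one app-level and one project-level directory.
workdir = tempfile.mkdtemp()
app_static = os.path.join(workdir, "blog", "static")
project_static = os.path.join(workdir, "assets")
for directory in (app_static, project_static):
    os.makedirs(directory)
with open(os.path.join(project_static, "site.css"), "w") as handle:
    handle.write("body { margin: 0 }")

finder = ToyDirectoriesFinder([app_static, project_static])
found = finder.find("site.css")
missing = finder.find("missing.js")
```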

Another strangeness shipped with Django 1.3: there were now real-world storages where the special “twist” of having to map between paths and URLs no longer made sense. The mapping continued to make sense for file uploads and collected static files, but for the storages used in finders, the URL side of the mapping was meaningless. No effort was made to clarify or refactor the API.

Cached static files

Django 1.4 included a key new staticfiles feature: cached static files. Caching gave developers the ability to automatically generate and append content hashes to filenames during collection (like style-91a0.css), permitting them to leverage far-future Expires headers for static content.
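The naming trick itself is tiny. Here is a minimal sketch, assuming SHA-256, a 12-character digest prefix, and a dot as the separator; all three are my choices for illustration and are not taken from Django’s implementation, and hashed_name is a hypothetical helper.

```python
import hashlib
import posixpath


def hashed_name(name, content, digest_len=12):
    """Return name with a short content digest inserted before the extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    root, ext = posixpath.splitext(name)
    return f"{root}.{digest}{ext}"


original = hashed_name("css/style.css", b"body { margin: 0 }")
unchanged = hashed_name("css/style.css", b"body { margin: 0 }")
edited = hashed_name("css/style.css", b"body { margin: 1px }")
print(original)
```

Because the name changes whenever the content does, a browser can cache each URL indefinitely: an edited file simply shows up under a new URL.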

Responsibility for hashing content is split in two. The first interested party is the collectstatic management command: after finishing collection, it looks for a magic method, post_process(), on the underlying Storage and calls it if present. The second is the Storage itself, whose post_process() does the actual work. The method is intended to be generic, performing arbitrary work and reporting back which static files it affected.

The post_process() method is apparently not widely used: after a search across all public Python repositories on both GitHub and Bitbucket, the only implementation I found was Django’s own content hash generator. Tellingly, Django’s implementation is completely generic with respect to the underlying storage, living as it does in a mixin; it’s not clear to me that it belongs on Storage at all.
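The division of labor can be sketched in a few lines. Everything here is a toy under stated assumptions: the in-memory storage, the mixin, and toy_collect() are hypothetical names, and the real collectstatic command and post_process() signature carry more detail than this.

```python
import hashlib
import posixpath


class ToyStorage:
    """In-memory stand-in for a storage; not Django's Storage class."""

    def __init__(self):
        self.files = {}

    def save(self, name, content):
        self.files[name] = content
        return name


class ToyHashingMixin:
    """Storage-agnostic: relies only on self.files and self.save()."""

    def post_process(self, names):
        for name in names:
            content = self.files[name]
            digest = hashlib.sha256(content).hexdigest()[:12]
            root, ext = posixpath.splitext(name)
            # Store a second copy under the hashed name and report the pair.
            yield name, self.save(f"{root}.{digest}{ext}", content)


class ToyHashingStorage(ToyHashingMixin, ToyStorage):
    pass


def toy_collect(storage, sources):
    """Copy every source into the storage, then run the optional hook."""
    for name, content in sources.items():
        storage.save(name, content)
    hook = getattr(storage, "post_process", None)
    if hook is None:
        return []  # plain storages are left alone
    return list(hook(sorted(sources)))


sources = {"css/style.css": b"body { margin: 0 }"}
plain_result = toy_collect(ToyStorage(), sources)
hashed_result = toy_collect(ToyHashingStorage(), sources)
```

The collector knows nothing about hashing and the mixin knows nothing about where bytes live, which is why the hook’s home on the storage class looks somewhat arbitrary.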

Modern day Django

Fast forward to today, and the fantastic Django 1.7.4 release. Aside from a small refactoring to introduce the new ManifestFilesMixin (a slight variant on the previous CachedFilesMixin), and the introduction of deconstructibility to support using storages with 1.7’s new migrations, things have largely remained the same in this corner of core Django.

The Django community hasn’t stood still, however. The Django Storages project has implemented several commonly used storages, including backends for Amazon S3, Azure, and other well-known cloud providers. And packages like Django Compressor have filled in the critical gap between static files, which are intended to be served directly, and the assets from which they are generated [1].

I think the fact that the ecosystem has flourished demonstrates that the original design, while imperfect, is still quite sound. That said, I do think there is an opportunity for a beneficial (if backwards-incompatible) refactoring.

There’s an opportunity to clarify layering. Django’s storage abstractions should be independent of any specific use. For example, they should not refer back to MEDIA_* settings; media and static files should be strict consumers of the storage layer. It might also be worth reconsidering the restriction that storages must be constructible without any parameters; this has led to a flourishing of storage classes whose only purpose is to override defaults.

Then there’s the question of the precise responsibilities of Storage implementations. Path-to-URL mapping, so fundamental to every storage in Django 1.0, is only sometimes needed today. In addition, there are plenty of real-world storages where common operations (directory listings, reading back written files) are either expensive in the underlying filesystem, or simply impossible. There is currently little clarity around which Storage methods are required in derived classes, and which are optional. The bottom line today seems to be: if you use an exotic Storage, and it blows up in your use case, then you’re out of luck.
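A toy model shows how the ambiguity bites. The classes and the supports() helper below are hypothetical and exist only to illustrate the point: every method is present on every storage, so nothing distinguishes “implemented” from “will explode” until the call is made.

```python
class ToyStorage:
    """Base class: every operation exists, but none is guaranteed to work."""

    def save(self, name, content):
        raise NotImplementedError("save")

    def listdir(self, path):
        raise NotImplementedError("listdir")

    def url(self, name):
        raise NotImplementedError("url")


class WriteOnlyBucket(ToyStorage):
    """An 'exotic' backend that can accept files but cannot list them."""

    def __init__(self):
        self._blobs = {}

    def save(self, name, content):
        self._blobs[name] = content
        return name


def supports(storage, method_name):
    """True if the storage's class overrides the base-class stub."""
    implementation = getattr(type(storage), method_name)
    return implementation is not getattr(ToyStorage, method_name)


bucket = WriteOnlyBucket()
can_save = supports(bucket, "save")
can_list = supports(bucket, "listdir")

try:
    bucket.listdir("")
    listing_failed = False
except NotImplementedError:
    listing_failed = True
```

A helper like supports() is one possible way for a consumer to ask before it leaps; an explicit split between required and optional methods would be another.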

Finally, the sheer number of third-party asset pipelines for Django shows that there’s a lot more room to grow. I suspect that, much like they did with migrations, the core Django team will take their time before finally deciding on the one true path forward.

[1] Asset pipelines are my secret reason for spending time here. After evaluating the big two, Compressor and Pipeline, Peter and I rolled our own for Cloak. It’s something we’re considering shipping publicly. I’m more closely aligned with Compressor in spirit, but it was primarily designed with runtime in mind; its “offline” compression feels like something of an afterthought.