Django Storage Minutia
I recently found myself peering under Django’s hood, trying to better understand how it manages static files and file uploads. I’ve been a Django user for years, yet I’ve never felt that I understood its storage layer.
What I found is a story told again and again in code: incremental change, organic growth, and strong path dependence. The storages layer is fantastically useful and flexible; for most people, it just works. On the other hand, if you’re actively building something rich and strange with it, then perhaps this historical perspective (and kibitzing!) will be of interest.
In the beginning
The very first version of Django shipped with support for file uploads. (Support for static files would have to wait.) To handle a variety of production scenarios, Django 1.0 introduced the notion of a Storage. Django storages are lowest-common-denominator abstract filesystems, but with a twist: they also surface a mapping between (private) filesystem paths and the (public) URLs where one can actually request those files.
Django 1.0 also shipped with a single concrete Storage implementation, FileSystemStorage, which simply wrapped the local filesystem.
Since all of this code was strictly intended to be used with file uploads, the class defaulted to using MEDIA_ROOT and MEDIA_URL as the base of its path-to-URL mapping, a default that lives on in Django even today.
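To make the “twist” concrete, here is a minimal sketch of a storage that pairs a private root directory with a public base URL. It is a toy model written for this post, not Django’s implementation: the class name ToyFileSystemStorage, the "/media/" prefix, and the file name are all made up for illustration.

```python
import os
import tempfile
from urllib.parse import urljoin


class ToyFileSystemStorage:
    """Toy stand-in for a storage: a root directory plus a base URL."""

    def __init__(self, location, base_url):
        self.location = os.path.abspath(location)
        # Make sure the base URL ends with a slash so urljoin appends to it.
        self.base_url = base_url if base_url.endswith("/") else base_url + "/"

    def path(self, name):
        # Private side of the mapping: where the bytes live on disk.
        return os.path.join(self.location, name)

    def url(self, name):
        # Public side of the mapping: where a client would request the file.
        return urljoin(self.base_url, name.replace(os.sep, "/"))

    def save(self, name, content):
        full_path = self.path(name)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "wb") as handle:
            handle.write(content)
        return name

    def exists(self, name):
        return os.path.exists(self.path(name))


# Hypothetical usage: a temporary directory plays the role of MEDIA_ROOT.
media_root = tempfile.mkdtemp()
storage = ToyFileSystemStorage(media_root, "/media/")
saved_name = storage.save("avatars/ada.png", b"not really a png")
```

The same name resolves two ways: storage.path(saved_name) is a location on disk, while storage.url(saved_name) is the address a browser would use.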
Static files
A couple of years later, Django 1.3 shipped with its star feature: support for static files. The new staticfiles package became the Storage layer’s second real customer.
A small issue must have been apparent to Django’s developers at the time. It made sense to use FileSystemStorage to store static files locally, but the class defaults were no good: static files might need entirely different paths than uploaded media.
Oddly, rather than resolving this small problem by removing references to MEDIA_* from the storage layer, thereby clarifying its layering in the ecosystem, staticfiles opted to introduce a new derived class, StaticFilesStorage, whose sole purpose was to override the defaults to STATIC_ROOT and STATIC_URL. I’m not sure what the motivation was: it may have been historical, since staticfiles was originally a third-party package. Regardless, it seems to cause developer confusion even today.
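The pattern is easier to see in miniature. The sketch below is a toy model, not Django’s source: the settings object is a hypothetical stand-in for Django’s settings, the paths are invented, and the Toy* class names are mine.

```python
from types import SimpleNamespace

# Hypothetical stand-in for Django's settings; the values are invented.
settings = SimpleNamespace(
    MEDIA_ROOT="/srv/site/media",
    MEDIA_URL="/media/",
    STATIC_ROOT="/srv/site/static",
    STATIC_URL="/static/",
)


class ToyFileSystemStorage:
    def __init__(self, location=None, base_url=None):
        # The generic class quietly falls back to the upload settings.
        self.location = settings.MEDIA_ROOT if location is None else location
        self.base_url = settings.MEDIA_URL if base_url is None else base_url


class ToyStaticFilesStorage(ToyFileSystemStorage):
    def __init__(self, location=None, base_url=None):
        # The subclass exists only to swap in the static-file settings.
        if location is None:
            location = settings.STATIC_ROOT
        if base_url is None:
            base_url = settings.STATIC_URL
        super().__init__(location, base_url)


uploads = ToyFileSystemStorage()
static = ToyStaticFilesStorage()

# The alternative hinted at above: one class, with its consumer passing
# explicit parameters instead of relying on baked-in defaults.
explicit_static = ToyFileSystemStorage(settings.STATIC_ROOT, settings.STATIC_URL)
```

Here static and explicit_static end up configured identically, which is the point: the subclass adds no behavior, only different defaults.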
Other small sins were committed with Django 1.3. The staticfiles package had the task of finding static files and collecting them into a final location, defined by STATICFILES_STORAGE and STATIC_ROOT. But where were the static files to be found? Enter Django 1.3’s Finder abstraction.
Django 1.3 shipped with several finders, including the AppDirectoriesFinder, which looks for content in the static subdirectories of Django apps. Curiously, 1.3 also shipped with both a FileSystemFinder, which wraps (multiple) FileSystemStorage instances under the hood, and a BaseStorageFinder, which wraps an arbitrary Storage instance. I think the motivation for FileSystemFinder was to support Django’s convenient new STATICFILES_DIRS setting “out of the box”, but the partial functional overlap between these new finders also led to confusion.
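As a rough illustration of how finders divide up the search, here is a toy model in which plain directories stand in for the storages that Django’s real finders wrap. Every name in it (ToyDirectoryFinder, find_static, the blog app, the assets directory) is hypothetical and chosen only for this sketch.

```python
import os
import tempfile


class ToyDirectoryFinder:
    """Searches an ordered list of root directories for a relative name."""

    def __init__(self, roots):
        self.roots = list(roots)

    def find(self, name):
        # The first root containing the file wins.
        for root in self.roots:
            candidate = os.path.join(root, name)
            if os.path.isfile(candidate):
                return candidate
        return None


def app_directories_finder(app_dirs):
    # Toy analogue of AppDirectoriesFinder: each app's "static" subdirectory.
    return ToyDirectoryFinder(os.path.join(d, "static") for d in app_dirs)


def file_system_finder(extra_dirs):
    # Toy analogue of FileSystemFinder: explicitly listed directories,
    # in the spirit of STATICFILES_DIRS.
    return ToyDirectoryFinder(extra_dirs)


def find_static(name, finders):
    # Ask each finder in turn, as a collection step might.
    for finder in finders:
        match = finder.find(name)
        if match is not None:
            return match
    return None


def write_file(path, text):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        handle.write(text)


# Hypothetical project layout, built in a temporary directory.
project = tempfile.mkdtemp()
blog_app = os.path.join(project, "blog")
shared_assets = os.path.join(project, "assets")
write_file(os.path.join(blog_app, "static", "blog", "post.css"), "p {}")
write_file(os.path.join(shared_assets, "site.css"), "body {}")

finders = [file_system_finder([shared_assets]), app_directories_finder([blog_app])]
post_css = find_static(os.path.join("blog", "post.css"), finders)
site_css = find_static("site.css", finders)
missing = find_static("nope.css", finders)
```

Note that both toy finders reduce to the same class with different root lists, which mirrors the functional overlap described above.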
Another strangeness shipped with Django 1.3: there were now real-world storages where the special “twist” of having to map between paths and URLs no longer made sense. The mapping continued to make sense for file uploads and collected static files, but for the storages used in finders, the URL side of the mapping was meaningless. No effort was made to clarify or refactor the API.
Cached static files
Django 1.4 included a key new staticfiles feature: cached static files. Caching gave developers the ability to automatically generate and append content hashes to filenames during collection (like style-91a0.css), permitting them to leverage far-future Expires headers for static content.
Responsibility for hashing content is split in two. The first interested party is the collectstatic management command. After finishing collection, it looks for a magic method, post_process(), on the underlying Storage and calls it if present. This method is intended to be generic, performing arbitrary work and returning a list of impacted static files.
The post_process() method is apparently not widely used: after a search across all public Python repositories on both GitHub and Bitbucket, the only implementation I found was Django’s own content hash generator. Tellingly, Django’s implementation is completely generic with respect to the underlying storage, living as it does in a mixin; it’s not clear to me it belongs on Storage at all.
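The division of labor can be sketched in a few lines. This is a toy model under stated assumptions, not Django’s code: the in-memory storage, the collect() function, and the mixin name are hypothetical, and the hashed file name format (a hyphen plus twelve hex digits of an MD5 digest) simply mirrors the style-91a0.css example above.

```python
import hashlib
import os


class InMemoryStorage:
    """Toy storage that keeps file contents in a dict."""

    def __init__(self):
        self.files = {}

    def save(self, name, content):
        self.files[name] = content
        return name

    def open(self, name):
        return self.files[name]


class HashedNamesMixin:
    """Generic post-processing: only relies on the host's open() and save()."""

    def hashed_name(self, name):
        root, ext = os.path.splitext(name)
        # Assumed naming scheme for this sketch: name-<12 hex digits>.ext
        digest = hashlib.md5(self.open(name)).hexdigest()[:12]
        return f"{root}-{digest}{ext}"

    def post_process(self, names):
        processed = []
        for name in names:
            new_name = self.hashed_name(name)
            self.save(new_name, self.open(name))
            processed.append((name, new_name))
        return processed


class HashedInMemoryStorage(HashedNamesMixin, InMemoryStorage):
    pass


def collect(storage, found_files):
    # Copy every found file into the destination storage.
    collected = [storage.save(name, data) for name, data in sorted(found_files.items())]
    # Then call the optional hook, but only if the storage defines one.
    hook = getattr(storage, "post_process", None)
    if callable(hook):
        return hook(collected)
    return []


found = {"style.css": b"body {}"}
plain_result = collect(InMemoryStorage(), found)
hashed_storage = HashedInMemoryStorage()
hashed_result = collect(hashed_storage, found)
```

Because the mixin never touches anything but open() and save(), it would work unchanged on top of any storage exposing those two methods, which is why the hook’s place on the storage itself looks questionable.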
Modern day Django
Fast forward to today, and the fantastic Django 1.7.4 release. Aside from a small refactoring to introduce the new ManifestFilesMixin (a slight variant on the previous CachedFilesMixin), and the introduction of deconstructibility to support using storages with 1.7’s new migrations, things have largely remained the same in this corner of core Django.
The Django community hasn’t stood still, however. The Django Storages project has implemented several commonly-used storages, including ones for Amazon S3, Azure, and other well-known cloud providers. And packages like Django Compressor have filled in the critical gap between static files, which are intended to be served directly, and the assets from which they are generated [1].
I think the fact that the ecosystem has flourished demonstrates that the original design, while imperfect, is still quite sound. I do think there is an opportunity for a beneficial (if backwards-incompatible) refactoring.
There’s an opportunity to clarify layering. Django’s storage abstractions should be independent of any specific use. For example, they should not refer back to MEDIA_* settings; media and static files should be strict consumers of the storage layer. It might also be worth reconsidering the restriction that storages must be constructible without any parameters; this has led to a flourishing of storage classes whose only purpose is to override defaults.
Then there’s the question of the precise responsibilities of Storage implementations. Path-to-URL mapping, so fundamental to storages in all cases in Django 1.0, is only sometimes needed today. In addition, there are plenty of real-world storages where common operations (directory listings, reading back written files) are either expensive in the underlying filesystem, or simply impossible. There is currently little clarity around which Storage methods are required in derived classes, and which are optional. The bottom line today seems to be: if you use an exotic Storage, and it blows up in your use case, then you’re out of luck.
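One way to picture the problem is with a toy base class in which every operation is declared but none is implemented, and an “exotic” subclass that fills in only what its backend can do. Everything here is hypothetical (ToyStorage, WriteOnlyBucket, implemented_operations) and is meant as a sketch of the ambiguity, not a description of Django’s API.

```python
class ToyStorage:
    """Toy base class: every operation is declared, none is implemented."""

    def save(self, name, content):
        raise NotImplementedError("save() is not supported")

    def open(self, name):
        raise NotImplementedError("open() is not supported")

    def listdir(self, path):
        raise NotImplementedError("listdir() is not supported")

    def url(self, name):
        raise NotImplementedError("url() is not supported")


class WriteOnlyBucket(ToyStorage):
    """Hypothetical exotic storage: accepts writes, cannot list or read back."""

    def __init__(self):
        self._blobs = {}

    def save(self, name, content):
        self._blobs[name] = content
        return name


def implemented_operations(storage, base=ToyStorage):
    # An operation counts as implemented if the class overrides the base stub.
    operations = ("listdir", "open", "save", "url")
    return [
        op for op in operations
        if getattr(type(storage), op) is not getattr(base, op)
    ]


bucket = WriteOnlyBucket()
supported = implemented_operations(bucket)

# A consumer that assumes listings exist only finds out at call time.
try:
    bucket.listdir("")
    listing_failed = False
except NotImplementedError:
    listing_failed = True
```

A consumer that needs directory listings has no way to know in advance that this storage cannot provide them, short of introspection like implemented_operations() or simply trying and failing.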
Finally, the sheer number of third-party asset pipelines for Django shows that there’s a lot more room to grow. I suspect that, much like they did with migrations, the core Django team will take their time before finally deciding on the one true path forward.
[1] Asset pipelines are my secret reason for spending time here. After evaluating the big two, Compressor and Pipeline, Peter and I rolled our own for Cloak. It’s something we’re considering shipping publicly. I’m more closely aligned with Compressor in spirit, but it was primarily designed with runtime in mind; its “offline” compression feels like somewhat of an afterthought.