Integrating Amazon's Simple Storage Service (S3)
At work today, I spent a fun few hours migrating our picture storage from local files to Amazon's Simple Storage Service (S3) web service. Our application lets customers upload images for display on their customized listing page (like an online Yellow Pages). I've been thinking about ways to both scale and "clusterize" our services as we grow both our customer base and consumer web traffic.
Like many cool applications these days, we use lighttpd to serve up our static images -- the scalability is fantastic and our web server's CPU usage is essentially zero no matter the bandwidth. OK, not zero, but close -- and lighttpd is the reason why. The web server hosts the images locally, and the directory is exported via NFS to our application servers for their use (which is mostly just writing new files). The server winds up getting many of the images in the OS's page cache, so serving them is obviously efficient. This all works great, and probably gives us at least a year's worth of traffic growth, but leaves the lighttpd / NFS server box as a single point of failure.
I've investigated the Linux-HA heartbeat and drbd technologies for building a clustered NFS server (and in fact I am using heartbeat to cluster our application servers). This seems promising, but we strongly, strongly favor simple solutions over perfect-but-complex ones, and a drbd/NFS cluster does not seem super simple.
That's where Amazon's S3 comes in -- it gives effectively unlimited storage and bandwidth, with Amazon's reliability, on a pay-as-you-go basis. Since our application nicely encapsulates all image management in a single Python class, I was able to move all image functionality to S3 in a few hours. Instead of generating URLs that point to our own web servers, the code just points to the S3-specific URLs, and the rest of the application is unaffected. So now in exchange for one day's work and a few dollars per month, I get a scalable, reliable image serving and storage solution. If I were working in a Fortune-500 company instead of a tiny startup, I might call this a Server-Oriented Scalable Grid Computing integration, but that seems like quite a mouthful for two hundred lines of Python :-)
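To give a rough idea of why the switch was so contained, here is a minimal sketch of the encapsulation idea. It is not our actual class: the names LocalImageStore, S3ImageStore, url_for, and listing_image_tag, plus the bucket and host names, are all made up for illustration, and the upload side (which would go through an S3 client library) is left out. The only assumption about S3 itself is the standard virtual-hosted URL form, https://<bucket>.s3.amazonaws.com/<key>.

```python
from urllib.parse import quote


class LocalImageStore:
    """Hypothetical original backend: images served by our own web server."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def url_for(self, image_name):
        # Percent-encode the name but keep "/" so sub-directories survive.
        return f"{self.base_url}/{quote(image_name)}"


class S3ImageStore:
    """Hypothetical replacement: same interface, but URLs point at S3."""

    def __init__(self, bucket):
        self.bucket = bucket

    def url_for(self, image_name):
        # Virtual-hosted-style S3 URL; the image name doubles as the object key.
        return f"https://{self.bucket}.s3.amazonaws.com/{quote(image_name)}"


def listing_image_tag(store, image_name):
    # The rest of the application only calls url_for(), so it does not
    # need to know which backend is behind the store object.
    return f'<img src="{store.url_for(image_name)}">'
```

With something like this in place, moving to S3 is mostly a matter of constructing S3ImageStore("example-bucket") where the application used to construct the local store; the code that renders listing pages stays the same.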
Categories: python