So, recently Nick G. asked: “Since you’ve worked with S3 a good bit, I’d like to get your take on using a service like S3 compared to using a local instance (or cluster) of MogileFS?”
I’d like to interject here and mention that in this case “a good bit” means I’ve used it in one application for data backup, at an early stage during which there was no good example (much less a released interface) for using S3 from PHP code. So I wrote and distributed my own. I’m sure it’s since fallen into disuse and that more active projects are favored. So that’s my “lots of experience”. Always take what I have to say with the appropriate amount of salt.
And my answer would be that each type of storage solution listed has both strengths and weaknesses, and determining which set best complements your application’s needs will tell you where you should invest. I would also throw another option into the pot: the SAN. While a SAN might not be in the range of your average garage tinkerer, it *is* in the range of medium or large startups with proper funding. I do, however, believe the question was geared more towards an “each of these versus S3” analysis, so that’s how I’ll approach it.
But first… let me get this out of the way. S3 loses, by default, if you absolutely cannot live without block device access. Do not pass go, do not collect $200. It’s a weakness and you’ll have to be willing to accept it for what it is.
S3 Vs. SAN (Storage Area Network)
Your most tried and true contender in the mass storage market is probably the SAN: those refrigerator-sized boxes which sit in warehouse-sized data-centers, thirstily consuming vast amounts of electricity and pushing bits through slender, delicate, orange fiber-optic cables. The basic sales pitch surrounding any SAN these days is built on the same points, in varying degrees:
Expandability: No modern SAN would be complete without the promise of expandable storage. On a quick-and-dirty level, a SAN has a bunch of disks which it slices and dices into pieces and then glues those pieces back together into a “logical unit”. So many, many hard drives become just one hard drive. Keep in mind, however, that you have to use a filesystem which supports dynamic expansion, and you almost always have to unmount the volume to pull it off, to boot.
Backup: At the small cost of “twice what you were planning to pay, plus $30,000 for the software”, you ought to be able to perform realtime, on-the-fly backups with any modern SAN. I would throw in negative commentary here, but I think the sales pitch bears its negative connotations in a fairly self-evident manner.
Clustering: You can have multiple machines accessing the same hard drive! Which is great, as long as you can set up and use a clustering filesystem. What they fail to tell you is that using a *normal*, non-cluster-aware FS will get you nothing but massive data corruption. So unless you plan on using some cookie-cutter system for accessing the storage, or are planning on spending big bucks to have one built for you… the clustering is going to be less than useful. Also, you cannot run multiple MySQL database instances against the same part of a shared disk like that, so get that idea out of your head too (disclaimer: I don’t know whether giving MySQL access to the raw partition fares any better in this case, but I somehow doubt it).
High availability/integrity: So long as you buy a bunch of extra hard drives for the machine, you can expect it to handle failures of individual disks gracefully. That is, if “gracefully” includes running 25% slower for a couple of hours while bits get shifted around… and then again when the broken drive is replaced… But no, you won’t lose your data.
Speed: Yea… SANs are fricken fast… no doubt. SANs usually run on a dedicated fiber-optic network (the aforementioned delicate orange cables), so a) they don’t saturate your IP network, and b) they aren’t limited to its speed.
So how does S3 stack up against the SAN? Well, let’s see… Expandability: S3 has the SAN beat hands down, with not only implied expansion but also implied contraction: with S3 you pay for what you use.
Backup: Amazon guarantees data retention, no need to pay extra. Clustering: again, covered; providing you have built your application to play nice in some way, there is no problem here.
High Availability and Integrity: Here there is more of a tradeoff, since a SAN write is guaranteed and immediately available, while S3 is write-once, eventually stored. One of the hurdles with S3 is that it may take a while (an unknown period of time) for a file stored in S3 to become available globally, making it less than ideal for, say, hosting the HTML generated by your CMS. That’s not to say it’s impossible, but there may be an indeterminate period when you have a page linked to and only half your viewers can access it. (You would think you could get around this by storing the data first and the index last, but there is no guarantee that the order in which items are sent is the order in which they will become available; a sketch of one way to hedge against this follows at the end of this comparison.)
And finally, Speed: Here the SAN wins out. You pay for bandwidth to connect to Amazon’s S3 service, and you can’t (and wouldn’t want to) pay the bills for a sustained multi-gigabit-per-second connection to S3 (ouch).
Therefore: if you can handle a) a small time-to-availability, b) non-block access, and c) speeds limited by your public internet connection, then S3 is probably the better choice. But for the total package… if you have the resources… the SAN is irreplaceable.
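About that time-to-availability hedge I mentioned: here is a rough sketch of the approach, using the boto3 library with a made-up bucket and keys, so treat it as illustrative rather than gospel. Write the content objects first, poll until they actually answer reads, and only then write the index that links to them.

```python
# Rough sketch, not gospel: store the page bodies first, wait until each one
# actually answers a read, and only then store the index that links to them.
# The bucket and key names here are made up.
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-cms-bucket"  # hypothetical bucket

def put_and_wait(key, body, timeout=60):
    """PUT an object, then poll until a read of it actually succeeds."""
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            s3.head_object(Bucket=BUCKET, Key=key)
            return True
        except ClientError:
            time.sleep(1)  # not visible yet; keep waiting
    return False

# The page bodies go in first...
for key, html in [("pages/a.html", b"<html>A</html>"),
                  ("pages/b.html", b"<html>B</html>")]:
    put_and_wait(key, html)

# ...and the index that links to them goes in last.
put_and_wait("index.html", b"<html>links to a and b</html>")
```

Even then the ordering isn’t strictly guaranteed, but it narrows the window considerably.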
S3 Vs. NAS (Network Attached Storage)
The NAS is like the SAN’s little brother. It tends to function much the same as a SAN, but is usually a) put on the IP network, which can cause saturation and limits speed, b) not as robust in the H.A. and data-integrity department, c) capped lower in its ultimate expandability, and d) a whole hell of a lot cheaper than a SAN.
So the NAS has carved out a well-deserved niche in small businesses and some home offices because it provides a bunch of local storage at a much more reasonable price. We therefore cannot evaluate its pros and cons on the same points as we did the SAN. A NAS is often used to store large files locally in a shared manner: many clients mount the shared volume and are able to work collaboratively on the files stored there. And for this reason S3 is not even thinking about encroaching on the NAS space. First off, working on a 100MB CAD file over home DSL is not feasible the way it is on a NAS; it would be an awful thing to wait for 100MB to save at 12KB/sec. Period. And the idea of using multi-user accounting software to have two accountants in the records at the same time is basically impossible…
If you’re thinking about the NAS in a data-center type environment, I’m going to consider it lumped in with either the homegrown cluster solution (small NAS) or the SAN (large NAS).
So if you need a NAS… stick with a NAS. HOWEVER, consider S3 as a very convenient, effective, and affordable alternative to something like a tape-based backup solution for that data.
S3 Vs. Home-Grown Cluster Storage
The home-grown clustering solution is an interesting one to tackle: NFS servers, or distributed filesystems (with or without local caching), or Samba servers, or NetWare servers, all with or without some sort of redundancy built in, and all with varying levels of support attached. And that’s your biggest challenge in this space: finding support.
You will have to build your application to take into account the eccentricities of the particular storage medium (varying levels of POSIX support, for example), but knowing what those quirks *are* will save you time, frustration, and gobs of money later on. Because if you’re using some random duct-taped solution that’s been all Mac’d out, it will probably do the trick. But what happens if the guy who designed it (and thus knows how all the pieces fit together) leaves the company or gets hit by a bus? Well… you’re probably out of luck. But with S3 you have a very large pool of people all rallying around one solution with one (OK, or two) access methods, and it simply is what it is.
There are really no surprises with S3, which is the first reason it beats out the custom, tricked-out storage solution. The second reason is that there is no assembly required, except maybe figuring out which access library to use. No assembly means no administration. No administration means better code. Better code means getting sold like a hot video sharing company. Well… one can dream.
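And since I mentioned access libraries: at their core they are all doing roughly the same thing. Here is a bare-bones sketch of the classic S3 REST call they wrap, which signs each request with an HMAC-SHA1 over a canonical string. The access key, secret, and bucket are placeholders, and real libraries handle far more edge cases than this.

```python
# Bare-bones sketch of a signed PUT against the classic S3 REST API.
# Credentials and bucket name below are placeholders.
import base64, hashlib, hmac, http.client
from email.utils import formatdate

ACCESS_KEY = "YOUR-ACCESS-KEY"
SECRET_KEY = b"YOUR-SECRET-KEY"
BUCKET = "example-bucket"

def s3_put(key, body, content_type="text/plain"):
    date = formatdate(usegmt=True)                      # RFC 1123 date header
    resource = "/%s/%s" % (BUCKET, key)                 # path-style resource
    # Canonical string: verb, (empty) MD5, type, date, resource.
    string_to_sign = "PUT\n\n%s\n%s\n%s" % (content_type, date, resource)
    signature = base64.b64encode(
        hmac.new(SECRET_KEY, string_to_sign.encode(), hashlib.sha1).digest()
    ).decode()

    conn = http.client.HTTPSConnection("s3.amazonaws.com")
    conn.request("PUT", resource, body, {
        "Date": date,
        "Content-Type": content_type,
        "Authorization": "AWS %s:%s" % (ACCESS_KEY, signature),
    })
    return conn.getresponse().status

print(s3_put("hello.txt", b"hello world"))  # 200 on success
```

That really is most of the “assembly” there is to speak of.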
S3 Vs. Local Storage
Aside from the obvious block access and the up-to-SCSI speeds that local storage provides, it loses to S3 in almost every way.
It’s not expandable very far. It’s not very fail-safe. It’s not distributed. It requires some form of backup. It requires power, cooling, room, and physical administration. My advice: if you can skip the hard drive, you SHOULD.
S3 Vs. MogileFS
MogileFS is an interesting comer in this particular thought exercise. It’s a kind of hybrid between the grow-your-own cluster and the local storage bit. It offers an intriguing combination of pros and cons, and is probably the most apples-to-apples comparison that can be made with S3 at all. Which makes me wish I’d had more of a chance to use it.
But the basic premise is that you have a distributed system which is easily expandable and handles data redundancy. My understanding is that you classify certain portions of the storage with a certain number of redundant copies required to be considered safe, and the data is stored on that many different nodes. When you request a file you are returned a link in such a way as to distribute the read load among the various servers housing the data. You also have a built-in fail-safe for a node going down: you shouldn’t be handed a link to a file on a downed node.
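To make that workflow concrete, here is a purely hypothetical sketch. The tracker class and its methods are stand-ins of my own invention, not the real MogileFS client API, but the store/fetch shape is the one described above: store a file under a replication class, then ask the tracker for several paths and try them in order.

```python
# Hypothetical sketch of the store/fetch cycle described above.
# "FakeTracker" is NOT the real MogileFS client; it just mimics the shape.
import urllib.request

class FakeTracker:
    """Stand-in for a MogileFS tracker connection (hypothetical)."""
    def store_file(self, key, cls, data):
        # A real tracker picks storage nodes based on the class's required
        # copy count and replicates the data to that many nodes.
        pass
    def get_paths(self, key):
        # A real tracker returns URLs on different storage nodes, ordered
        # so reads are spread out and downed nodes are skipped.
        return ["http://node3:7500/dev3/0/000/123.fid",
                "http://node7:7500/dev7/0/000/123.fid"]

tracker = FakeTracker()
tracker.store_file("photo:42:thumb", cls="thumbnails", data=b"jpeg bytes here")

blob = None
for url in tracker.get_paths("photo:42:thumb"):
    try:
        blob = urllib.request.urlopen(url, timeout=2).read()
        break            # first node that answers wins
    except OSError:
        continue         # node down; fall through to the next copy
```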
So what does all that mean? Well, if you went about trying to build yourself a non-authenticated, in-house version of Amazon’s S3 service, you would probably end up with something remarkably similar to MogileFS. I wouldn’t even be surprised to find out that S3 is modeled after Mogile. What’s more, Mogile has a proven track record when it comes to serving files for web-based applications.
So how do they actually compare? I would say that for a company deciding between Mogile and S3, it comes down to a few key factors: a) the source and destination of the traffic, b) the type of files being distributed, and c) the up-front investment.
As far as your traffic goes: if you’re planning on using Mogile primarily internally and the data will rarely leave the LAN, then you won’t be paying the bandwidth costs associated with S3, and that makes for a pretty simple decision. If you are distributing the files to a global audience, however, you might find that having S3 handle bandwidth, local availability, delivery speed, and high availability is a win. That said, I’d be fairly inclined to guarantee (as I’ve covered before) that raw bandwidth purchased from your ISP is a lot cheaper than bandwidth from Amazon AWS, so as long as you already have all the necessary equipment in place for redundancy, delivery, etc., Mogile’s advantages bring it within striking distance of S3.
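For a back-of-the-envelope feel for where that crossover sits, here is a tiny sketch with made-up (but ballpark) rates; the numbers themselves are assumptions, so plug in your own quotes.

```python
# Back-of-the-envelope comparison with assumed rates; plug in your own.
s3_per_gb        = 0.20    # assumed $/GB transferred out of S3
transit_per_mbps = 30.00   # assumed $/Mbps/month of committed ISP transit

def monthly_cost_s3(gb_per_month):
    return gb_per_month * s3_per_gb

def monthly_cost_transit(gb_per_month):
    # Convert an average monthly volume into a sustained Mbps commit.
    mbps = gb_per_month * 8 * 1024 / (30 * 24 * 3600)
    return mbps * transit_per_mbps

for volume in (100, 1000, 10000, 100000):   # GB per month
    print(volume, "GB:",
          "S3 $%.2f" % monthly_cost_s3(volume),
          "vs transit $%.2f" % monthly_cost_transit(volume))
```

With those assumptions, S3 is cheaper at small volumes and transit pulls ahead as the monthly volume climbs, which is the whole point: the equipment and people you already have (or don’t) decide which column you actually get to pay.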
If you are distributing primarily small files (images, etc.), then Mogile is not going to present you any challenges. If, however, you are serving 100MB video files or 650MB CD images, Mogile might actually work against you. When I tried to use Mog for that kind of application there was a limit on the size of an individual file that it was willing to transfer between hosts; in this respect Mog broke its own replication. DISCLAIMER: I only spent a week or so total with Mog (broken up into an hour here, an hour there), so this might have been a) known, or b) easily worked around, but my quick Googling at the time yielded little help. The idea of having to split large files was a deal-breaker at the time, and other things were pressing for my attention.
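For what it’s worth, the workaround that kind of limit pushes you towards looks roughly like this sketch: split the big file into fixed-size chunks, store each chunk under its own key, and keep a small manifest for reassembly. The store() helper and the 32MB chunk size are placeholders of mine, not anything Mogile or S3 gives you.

```python
# Sketch of chunked storage: split a large file into pieces, store each piece
# under its own key, and write a small manifest describing how to reassemble.
import json, os

CHUNK = 32 * 1024 * 1024   # 32MB pieces; assumed to be under the size limit

def store(key, data):
    # Placeholder: write the bytes to whatever storage layer you are using.
    pass

def store_chunked(path, key_prefix):
    keys = []
    with open(path, "rb") as f:
        for i, piece in enumerate(iter(lambda: f.read(CHUNK), b"")):
            key = "%s.part%04d" % (key_prefix, i)
            store(key, piece)
            keys.append(key)
    # Readers fetch the manifest first so they know what to stitch together.
    store(key_prefix + ".manifest",
          json.dumps({"size": os.path.getsize(path), "parts": keys}).encode())

store_chunked("/data/iso/distro.iso", "isos/distro")
```

It works, but it is exactly the kind of plumbing you were hoping the storage layer would handle for you.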
And the real thing that Mog requires, which S3 does not, is a hardware and manpower investment. Since you’re going to have to adapt your application in a similar manner to house data in either S3 or MogileFS, S3 wins out on sheer ease of setup… All you have to do is sign up for an AWS account, pop in a credit card number, and you’re on your way. That same hour. You also don’t run out of space with S3 like you can with Mog; granted, Mog can be easily expanded, but you have to put more hardware into it. S3 is simply already as large as you need it to be.
Summary
In the end, what these choices always come down to is some combination of the classic triangle: Time Vs. Money Vs. Manpower. And which storage is right for you depends on how much of each you are willing to commit. Something always has to give. The main advantage of S3 is that you’re borrowing on the fact that Amazon has already committed a lot of time and hardware resources, which you can leverage if the shoe fits.
More than likely what you’ll find is that the “fit” of something like S3 will be a seasonal thing. When you start out developing your application and you don’t have resources to throw at it, using S3 for your storage will make a lot of sense because you can avoid the whole issue of capacity planning and purchasing hardware with storage in mind. Then you will probably move into a quasi-funded mode where it starts to get (or outright is) too expensive to use S3 versus hiring an admin and throwing a couple of servers in a data-center. And then you might just come back full circle to a point where you’re drowning in physical administration, and spending a little extra for ease of use and peace of mind comes back into style.
So which is right for you? Probably all of the above, just at different times, for different uses, and for different reasons. The key to your success will likely lie in your ability to plan for when and where each type of storage is right. And to already have a path in mind for when it’s time to migrate.