But if it's not, you'll typically want cross-datacenter replication so that if one rack goes down you don't lose all your data. So then you're looking at something like GlusterFS/MooseFS/Ceph. But the latencies involved in synchronously replicating to multiple datacenters can really kill your performance. For example, try git cloning a large project onto a GlusterFS mount with >20ms ping between nodes. It's brutal.
Other products try to do asynchronous replication, EdgeFS is one I was looking at recently. This follows the general industry trend, like it or not, of "eventually consistent is consistent enough". Not much better than a cron job + rsync, in my opinion, but for some workloads it's good enough. If there's a partition you'll lose data.
Eventually you just give up and realize that a perfectly synchronized, geographically distributed POSIX filesystem is a pipe dream, so you bite the bullet, rewrite your app against S3, and call it a day.
NFS is just the protocol. Whether it's a single point of failure depends on the server-side implementation. In Amazon EFS it is not.
(disclaimer: I'm a PM-T on the EFS team)
We can store small files on NVMe disks in the metadata layer, so that you can read them with a latency of just a couple of ms, and scale to tens of thousands of concurrent small-file reads/writes per second.
You could try NetApp - it's significantly faster than EFS in my experience.
There are two options in AWS: one where you get an entire virtual storage system to yourself but also need to have some NetApp expertise (although less than on-premises). You can find pricing here:
And one where you just consume the NFS/SMB shares more akin to EFS (pricing at the bottom of the page):
In both instances you can get better pricing than what is on that page by working with a partner, but if you don't want to you can buy directly through AWS and pay a slight premium.
Depending on the use-case, Veritas InfoScale might be a better option. It can aggregate different tiers of storage, and present a multi-node shared-nothing file-system, with options for replication between AZs, regions, or different cloud providers.
Priced by core not capacity.
That may be true, but it's also because applications often have very sequential IO patterns even when they don't need to.
I hope we'll get some convenience wrappers around io_uring that make batching of many small IO calls in a synchronous manner simple and easy for cases where you don't want to deal with async runtimes. E.g. bulk_statx() or fsync_many() would be prime candidates for batching.
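Until such wrappers exist, the batching can be approximated today with a thread pool. This is a hedged sketch: bulk_statx is a hypothetical name from the comment above, not a real API, and a thread pool only imitates what an io_uring-backed implementation would do with a single syscall batch.

```python
# Hypothetical sketch of a synchronous "bulk stat" convenience wrapper.
# A real io_uring version would submit one batch of statx SQEs; here a
# thread pool stands in for that, keeping the caller's code synchronous.
import os
from concurrent.futures import ThreadPoolExecutor

def bulk_statx(paths, max_workers=32):
    """Stat many paths concurrently; results come back in input order.

    A missing path yields None rather than raising, so one bad entry
    doesn't abort the whole batch.
    """
    def safe_stat(p):
        try:
            return os.stat(p)
        except OSError:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_stat, paths))

results = bulk_statx(["/", "/definitely-missing-path-123456"])
```

The caller never sees futures or an event loop, which is the point of the wrapper: batching stays an implementation detail.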
... and then later you notice that you need a special SLA and Amazon Redshift to guarantee that a read from S3 will return the same value as the last write.
Even S3 is only eventually consistent. In particular, if a US user uploads images into your US bucket and you then serve the URLs to EU users, you might have loading problems. The correct solution, according to our support contact, is to wait a second after uploading to S3 to make sure at least the metadata has replicated to the EU before we post the image URL. That, or pay for Redshift to tell us how long to wait more precisely.
Because contrary to what everyone expects, S3 requests to US buckets might be served by delayed replicas in other countries if the request comes from outside the US.
I'm shocked that you got this answer. This is definitely not how you are supposed to operate.
If you need to ensure the sequentiality of a write followed by a read on S3, the idiomatic way is to enable versioning on your bucket, issue a write, and provide the version ID to whoever needs to read after that write.
Not only will that transparently ensure that you do not read stale data, it will even ensure that you read the result of that particular write, and not any subsequent write that could have happened in between.
This pattern is very easy to implement for sequential operations in a single function, like:
version = s3.put_object(...)["VersionId"]
data = s3.get_object(VersionId=version, ...)
In practice it never bothered me too much though. I prefer having explicit synchronization through versions, rather than having a blocking write waiting for all caches to be synchronized.
Also, this should only be necessary if you write to an already existing object. New keys will always trigger a cache invalidation from my understanding.
Also, we were already at a throughput where the s3 metadata nodes would sometimes go offline, so I'm not sure putting more work on them would have improved our overall situation.
Why would an unversioned put be less costly than a versioned put?
To me it should be pretty much the same. I would almost suspect unversioned puts are implemented on top of versioned ones.
Out of curiosity what kind of workload did you perform? I have been moving terabytes of files (some very small 1kB files, some huge 150GB files) and never noticed a single blip of change in behavior from S3 (and certainly not anything going "off")
Pretty simple, we started a bulk import, and then HTTPS requests to amazonaws.com started timing out... Later, support told us to spread out our requests to alleviate the load spike.
What we did was process roughly 50 million 1 MB JPEG images from the full-res S3 bucket into a derived-data S3 bucket.
What? That makes no sense. Do you have a source for that? I thought the explicit choice of region when creating a bucket limits where data is located. Why would they give you geo-replication for free? Also: "Amazon S3 creates buckets in a Region you specify. To optimize latency, minimize costs, or address regulatory requirements, choose any AWS Region that is geographically close to you." - https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket....
Also the consistency when creating objects is described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.... For new objects, you get read-after-write consistency and the example you gave contradicts that.
Mind the documented caveat for this case:
> The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.
The main downside is that you have to pick one datacenter as the main datacenter at any given time. This can change if the old one goes down or due to business needs, but you can't have two of them as the main datacenter at once for a given dataset.
Do you have any additional info? EdgeFS's github doesn't work; does repo access require a Nexenta sales call?
We're also looking into asynchronously replicated FSs, I think built-in caching + tiering is slightly nicer than cron + rsync; would love to know what other solutions you looked into.
Things like cron+rsync fail in boring ways.
Fancy things fail in fascinating ways.
Unix directories map directory and file names (keys) to inode numbers (values). And a file's inode (key) is a map to the data blocks (values) on disk that make up the file's content.
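The two key/value layers described above can be sketched as a toy model (all names and block IDs here are made up for illustration):

```python
# Toy model of the two mapping layers in a Unix filesystem.
directory = {"notes.txt": 42, "todo.txt": 43}        # name -> inode number
inode_table = {42: ["blk7", "blk9"], 43: ["blk2"]}   # inode -> data blocks
block_store = {"blk7": b"Hello, ", "blk9": b"world", "blk2": b"buy milk"}

def read_file(name):
    """Resolve name -> inode -> blocks, then concatenate the block data."""
    inode = directory[name]
    return b"".join(block_store[blk] for blk in inode_table[inode])
```

Two lookups through key/value maps, and you have file contents; that's the sense in which a filesystem is "just" layered key/value stores.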
S3 might not have POSIX semantics, but to say it's not a filesystem is incorrect.
Even if you drop the Unix (POSIX) part, most practically used file systems have features that are hard to guarantee in a distributed setting, and which in any case simply don't exist in S3.
Also, nowhere did I say or even imply that a key/value store is all a Unix file system is. Different filesystems have different features. Object storage is a type of filesystem with a very specific feature set.
It is an eventually consistent key-value store with some built-in optimization (indexes) for filtering/searching object timestamps and keys (e.g. listing keys by prefix), which allows for presenting them in a UI similar to how it is possible with files in a directory. That's about it.
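The "listing keys by prefix" part can be illustrated with a sketch: if the key index is kept sorted, all keys sharing a prefix form one contiguous slice, which is roughly how a flat key space can be presented like a directory tree. This is an illustrative model, not S3's actual implementation.

```python
import bisect

def list_by_prefix(sorted_keys, prefix):
    """Return all keys starting with `prefix` from an already-sorted list.

    Because the list is sorted, matches are contiguous: binary-search to
    the first candidate, then scan forward while the prefix still matches.
    """
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
        hi += 1
    return sorted_keys[lo:hi]

keys = sorted(["img/cat.jpg", "logs/2020/a.log", "logs/2021/b.log"])
```

No directories exist anywhere in this model; "logs/" only looks like a folder because of how the keys are named and listed.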
It seems like you'd be limiting yourself for concurrent access -- if everything is flowing through the MySQL cluster -- not a bad thing! Just perhaps warrants a caveat note. I'd expect S3 to smoke HopFS on concurrency.
With 500 prefixes it doesn't matter if you're Netflix or a student working on a distributed computing project, you can get that 1.6m requests.
There is no limit to the number of prefixes in a bucket. You can scale to be arbitrarily large.
Isn't rename in S3 effectively a copy-delete operation?
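(It is: S3 exposes no rename primitive, so a "rename" ends up as a copy followed by a delete. A toy dict model of a bucket shows the two-step, non-atomic nature; the real calls would be CopyObject and DeleteObject.)

```python
# Toy model of a bucket as a dict, illustrating that an S3 "rename" is
# really two operations, not an atomic metadata update.
def s3_rename(bucket, src, dst):
    bucket[dst] = bucket[src]   # CopyObject: copies the full object body
    del bucket[src]             # DeleteObject
    # A crash between the two calls leaves the object under both keys.

bucket = {"old/key": b"payload"}
s3_rename(bucket, "old/key", "new/key")
```

Note the copy is proportional to object size, which is why "renaming" large objects is expensive.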
And tag the files for expiration/clean up. S3 is not a file system and people should stop treating it like one - only to get bitten by these assumptions around it being a FS.
What is the access protocol? What tools am I using to access the POSIX presentation?
"HopsFS is a new implementation of the Hadoop Filesystem (HDFS), that supports multiple stateless NameNodes, where the metadata is stored in MySQL Cluster, an in-memory distributed database. HopsFS enables more scalable clusters than Apache HDFS (up to ten times larger clusters), and enables NameNode metadata to be both customized and analyzed, because it can now be easily accessed via a SQL API."
There are other papers that describe the HopsFS architecture in more detail if you are interested: https://www.usenix.org/system/files/conference/fast17/fast17...
I don’t see how it can be useful. Moving or renaming files in S3 seems more like maintenance than something you want to do on a regular basis.
There's no 'renaming' of any file in S3. I don't think AWS had S3 as a file system in mind. There's EBS for that.
EBS isn't a file system, it's lower level than that (it's a block store, hence the name; you bring your own filesystem). EFS and FSx are filesystems.
Don't you have to spin up an EC2 instance to use that?
The architecture of JuiceFS is simpler than HopsFS's: it does not have worker nodes (the client accesses S3 and manages the cache directly), and the metadata is stored in a highly optimized server (similar to the NN in HDFS; it can be scaled out by adding more nodes).
JuiceFS provides a POSIX client (using FUSE), a Hadoop SDK (in Java), an S3 gateway, and also a WebDAV gateway.
Disclaimer: Founder of JuiceFS here
Would've been more interesting had it taken advantage of newer technologies such as io_uring, NVMe over Fabrics, RDMA, etc.
If Filestore (our managed NFS product) is too large for you, I'd suggest having gcsfuse on each box (or just use the GCS Connector for Hadoop). You won't get the kind of NFS-style locking semantics that a real distributed filesystem would support, but it sounds like you mostly need your data scientists to be able to read and write from buckets as if they're a local filesystem (and you wouldn't expect them to actively be overwriting each other or something where caching gets in the way).
Edit: We used gcsfuse for our Supercomputing 2016 run with the folks at Fermilab. There wasn't time to rewrite anything (we went from idea => conference in a few weeks) and since we mostly just cared about throughput it worked great.
GCS has pretty high latency. So even if you write a single byte, expect it to be like 50+ms. This is slower than a spinning disk in an old computer. If you’re just updating a single person’s notebook, they’ll feel it a bit on save (but obviously each person and file is independent).
But you can also do about 1 Gbps per network stream (and putting several of those together, 32+ Gbps per VM) such that even a 1 MB file is also probably done in about that much time. I think for streaming writes (save model) gcsfuse may still do a local copy until some threshold before writing out.
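Putting the two numbers from above together (~50 ms first-byte latency, ~1 Gbps per stream), a quick back-of-envelope calculation shows why the fixed cost dominates for files in the small-to-medium range:

```python
# Back-of-envelope model using the figures quoted above; these are
# illustrative, not measured GCS guarantees.
LATENCY_S = 0.050            # ~50 ms first-byte latency
STREAM_BYTES_PER_S = 125e6   # 1 Gbps per network stream

def write_time_s(size_bytes):
    """Rough single-stream write time: fixed latency + transfer time."""
    return LATENCY_S + size_bytes / STREAM_BYTES_PER_S

t_1b = write_time_s(1)           # ~50 ms: pure latency
t_1mb = write_time_s(1_000_000)  # ~58 ms: transfer adds only ~8 ms
```

So a 1 byte write and a 1 MB write cost nearly the same wall-clock time, which matches the "even a 1 MB file is also probably done in about that much time" observation.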
I’d probably put your models directly on GCS though. That’s what we do in colab and with TensorFlow (sadly, it seems from a quick search that PyTorch doesn’t have this out of the box).
Filestore and multi-writer PD will naturally improve over time. But I’m guessing you need something “today”. Feel free to send me a note (contact in profile) if you want to share specifics.
Disclaimer: Founder of JuiceFS here
Is HopsFS helping in this area?
I have to disagree here. If you look at benchmarks on the internet, yes, it will look like S3 is dead slow. But that is a client problem, not an S3 problem. For instance, boto3 (s3transfer) is an awful implementation, so overengineered with a reimplementation of futures, task stealing, etc., that the download throughput is pathetic. Most often it will top out below 200 MB/s.
But S3 itself scales very well if you know how to use it, and skip boto3.
From my experience and benchmarks, each S3 connection will deliver up to 80MB/s, and with range requests you can easily have multiple parallel blocks downloaded.
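The range-splitting behind that kind of parallel download is simple to sketch. Each (start, end) pair below would become one GET with a `Range: bytes=start-end` header, fetched on its own connection (each good for ~80 MB/s per the figure above); the actual HTTP calls are omitted here.

```python
def split_ranges(total_size, part_size):
    """Yield inclusive (start, end) byte ranges covering an object.

    Suitable for HTTP Range requests, which use inclusive end offsets.
    """
    for start in range(0, total_size, part_size):
        end = min(start + part_size, total_size) - 1
        yield (start, end)

# e.g. a 250-byte object split into 100-byte parts
ranges = list(split_ranges(total_size=250, part_size=100))
```

With N parts in flight you get roughly N times the per-connection throughput, until you saturate the NIC.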
I wrote a simple library that does this called s3pd (https://github.com/NewbiZ/s3pd). It's not perfect and is process based instead of coroutines, but that will give you an idea.
For reference, using s3pd I can saturate any EC2 network interface that I found (tested up to 10Gb/s interfaces, with download speed >1GB/s).
Boto is really giving S3 bad press.
For small files the fixed cost will be the most important factor. The "transfer time" after this "first byte latency" might actually be 0, since all the data could be transferred within a single write call on each node.
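To make that concrete, a rough effective-throughput model shows how badly the fixed cost dominates for small objects. The latency and bandwidth numbers here are illustrative assumptions, not measurements:

```python
def effective_mb_per_s(size_bytes, latency_s=0.050, bw_bytes_per_s=100e6):
    """Effective throughput = size / (first-byte latency + transfer time).

    latency_s and bw_bytes_per_s are assumed values for illustration.
    """
    return size_bytes / (latency_s + size_bytes / bw_bytes_per_s) / 1e6

small = effective_mb_per_s(10_000)       # 10 kB object
large = effective_mb_per_s(100_000_000)  # 100 MB object
```

Under these assumptions the 10 kB object moves at well under 1 MB/s (latency-bound), while the 100 MB object gets within a few percent of line rate; the transfer-time term is effectively zero for the small case, as the comment says.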
Here are some strategies to make S3 go faster: https://news.ycombinator.com/item?id=19475726
You're right that S3 isn't for small files but for a lot of small files (think 500 bytes), I either plunge them to S3 through Kinesis Firehose or fit them gzipped into DynamoDB (depending on access patterns).
One could also consider using Amazon FSx, which can do IO to S3 natively.
My experience was the opposite. Small files work acceptably quickly, the cost can't be beat, and the reliability probably beats out anything else I've seen. We saw latencies of ~50ms to write out small files. Slower than an SSD, yes, but not really that slow.
You do have to make sure you're not doing things like re-establishing the connection to get that, though. If you have to do a TLS handshake… it won't be fast. Also, if you're in Python, boto's API encourages, IMO, people to call `get_bucket`, which will do an existence check on the bucket (and thus double the latency due to the extra round-trip); usually, you have out-of-band knowledge that the bucket will exist, and can skip the check. (If you're wrong, the subsequent object GET/PUT will error anyways.)
It's great that one individual thing is better than another individual thing, but if you look at the bigger picture, it generally isn't that individual thing by itself that you are using.
There is a different case that makes sense: you have object storage and it's not sufficient, so you go look for object storage suppliers that deliver something different that suits your need. Then it makes sense to look for a service that is relatively similar to what you are already using but better in a factor that is significant for your application (e.g. speed of object key changes).
Marketing just says: "Look at us, we are faster". I think that message is not going to matter unless that happens to be your exact problem in isolation, which isn't exactly common; systems don't tend to run in isolation.
Cool work! I love seeing people pushing distributed storage.
IIUC though, you make a similar choice as Avere and others. You're treating the object store as a distributed block store:
> In HopsFS-S3, we added configuration parameters to allow users to provide their Amazon S3 bucket to be used as the block data store. Similar to HopsFS, HopsFSS3 stores the small files, < 128 KB, associated with the file system’s metadata. For large files, > 128 KB, HopsFS-S3 will store the files in the user-provided bucket.
> HopsFSS3 implements variable-sized block storage to allow for any new appends to a file to be treated as new objects rather than overwriting existing objects
It's somewhat unclear to me, but I think the combination of these statements means "S3 is always treated as a block store, but sometimes the File == Variably-Sized-Block == Object." Is that right?
Using S3 / GCS / any object store as a block-store with a different frontend is a fine assumption for dedicated client or applications like HDFS-based ones. But it does mean you throw away interop with other services. For example, if your HDFS-speaking data pipeline produces a bunch of output and you want to read it via some tool that only speaks S3 (like something in Sagemaker or whatever), you're kind of trapped.
It sounds like you're already prepared to support variably-sized chunks / blocks, so I'd encourage you to have a "transparent mode". So many users love things like s3fs, gcsfuse and so on, because even if they're slow, they preserve interop. That's why we haven't gone the "blocks" route in the GCS Connector for Hadoop, interop is too valuable.
P.S. I'd love to see which things get easier for you if you are also able to use GCS directly (or at least know you're relying on our stronger semantics). A while back we finally ripped out all the consistency cache stuff in the Hadoop Connector once we'd rolled out the Megastore => Spanner migration. Being able to use Dual-Region buckets that are metadata consistent while actively running Hadoop workloads in two regions is kind of awesome.
If the file is "small" (under a configurable size, typically 128 KB), it is stored in the metadata layer, not on S3. Otherwise, if you just write the file once in one session (and it is under the 5 TB object size limit in S3), there will be one object in S3 (variable size - blocks in HDFS are by default fixed size). However, if you append to the file, then we add a new object (as a block) for the append.
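That placement policy can be sketched as a toy router (the threshold is from the description above; the function and store names are hypothetical, and this omits all the real metadata bookkeeping):

```python
# Toy sketch of the placement policy described above: small files live
# in the metadata layer; larger files go to S3 as one variable-sized
# object, and each append becomes a new object (block).
SMALL_FILE_LIMIT = 128 * 1024  # 128 KB

def place_write(path, data, metadata_store, s3_blocks):
    if len(data) < SMALL_FILE_LIMIT and path not in s3_blocks:
        metadata_store[path] = data                  # stored with metadata
    else:
        s3_blocks.setdefault(path, []).append(data)  # new object per append

meta, s3 = {}, {}
place_write("/small.txt", b"x" * 100, meta, s3)       # -> metadata layer
place_write("/big.bin", b"y" * 200_000, meta, s3)     # -> one S3 object
place_write("/big.bin", b"z" * 50, meta, s3)          # append -> new object
```

A reader of /big.bin would then concatenate its object list in order, which is why the background rewrite into a single S3-readable object (described below) is needed for plain S3 clients.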
We have a new version under development (working prototype) where we can (in the background) rewrite all the blocks in a single file as a single object, and make the object readable via the S3 API. It will be released some time next year. The idea is that you can mark directories as "S3 compatible" and only pay for rebalancing those ones as needed. But you then have the choice of doing the rebalancing on-demand or as a background task, prioritizing it, and so on. You know the tradeoffs.
Yes, it would be easier to do this with GCS. But we did AWS and Azure first, as we feel GCS is more hostile to third-party vendors. The talks we have given at google (to the colossus team a couple of years ago and to Google Cloud/AI - https://www.meetup.com/SF-Big-Analytics/discussions/57666504... ) are like black holes of information transfer.
I’m sorry if you’ve tried to talk to us and we’ve been unhelpful. I’d be happy to put you in touch with some GCS people specifically — the Colossus folks are multiple layers below, while the AI folks are multiple layers above. They were probably mostly not sure what to say!
We worked quite openly and frankly with the Twitter folks on our GCS connector. I'd be happy to help support doing the same with you. My contact info is in my profile.
(Though I’d definitely agree that we’ve also been surprisingly reticent to talk about Colossus, until recently the only public talk was some slides at FAST).
Good point, JuiceFS already provides this "transparent mode", which is called compatible mode.
ObjectiveFS has been around for several years. How is HopsFS better?
We chatted about this with the founders of ObjectiveFS before creating JuiceFS; they did NOT recommend using ObjectiveFS for big data workloads with Hadoop/Spark, which is why we started building JuiceFS in 2016.
`doesn’t fully support regular file system semantics or consistency guarantees (e.g. atomic rename of directories, mutual exclusion of open exclusive, append to file requires rewriting the whole file and no hard links).`
HopsFS does provide strongly consistent metadata operations like atomic directory rename, which is essential if you are running frameworks like Apache Spark.
It's conceptually similar to EMR in the way it works. You connect your AWS account and we'll deploy a cluster there. HopsFS will run on top of an S3 bucket in your organization.
You get a fully featured Spark environment (With metrics and logging included - no need for cloudwatch). UI with Jupyter notebooks, the Hopsworks feature store and ML capabilities that EMR does not provide.
EMRFS has dozens of optimisations for Spark/Hadoop workloads, e.g. S3 Select, partition pruning, optimised committers, etc., and since EMR is a core product it is continually being improved. Using HopsFS would negate all of that.
There is a blog post talking about this use case. Unfortunately, we have not published an English version; you can read it using Google Translate.
Disclaimer: Founder of JuiceFS here.