S3 is object-based, flat-file storage with unlimited total capacity. Unlimited doesn't mean we should throw files up there without thinking, though; storage still costs money. It isn't block-based, so it isn't meant for storing operating systems or live databases. But any type of file can be stored (including database file backups), and each object can range from 0 bytes up to 5 TB. General knowledge about S3 is one of the key categories on the AWS Certified Cloud Practitioner exam.
Files are stored in buckets, which we can think of as root-level folders. Bucket names must be globally unique because they resolve to URLs, which are global. Most of the time, we wouldn't expose these URLs publicly except for static S3 websites. When we name our buckets, AWS automatically builds a URL from the "s3" service name, our region, and the bucket name. For example:
https://s3.us-east-1.amazonaws.com/my-unique-bucket-name. I recommend reverse-domain naming, using an appropriate domain you own. For example, I start all my bucket names with
com.markfreedman. An exception is when we host a static site; in that case, the bucket name needs to match the site's domain name (in my case,
markfreedman.com, although I already have that domain hosted elsewhere).
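As a quick sketch, a path-style S3 URL can be assembled from the region and bucket name like this (the bucket name below is a made-up example following the reverse-domain convention):

```python
def path_style_url(bucket: str, region: str) -> str:
    """Build a path-style S3 URL for a bucket (illustrative only)."""
    return f"https://s3.{region}.amazonaws.com/{bucket}"

print(path_style_url("com.markfreedman.demo", "us-east-1"))
# https://s3.us-east-1.amazonaws.com/com.markfreedman.demo
```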
Although bucket names must be globally unique, storage of the buckets themselves is region-specific. We should select a bucket’s region based on latency requirements. If most access would be from a certain region, create the bucket in the closest available AWS region. Using CloudFront can alleviate this need, though.
Public access is blocked by default. AWS requires us to be explicit in exposing buckets to the public Internet. All those stories of hacked data (often exposed S3 buckets) should make us thankful for this default. We can secure buckets with resource-based bucket policies, as well as with IAM policies.
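To make the shape of a bucket policy concrete, here is a minimal read-only policy document, built and printed locally with the standard library. The statement ID, bucket name, and the decision to allow public reads are all illustrative; a real policy would be attached via the console, CLI, or SDK.

```python
import json

# Illustrative bucket policy allowing read-only GetObject access to a
# hypothetical bucket. Only apply something like this intentionally,
# e.g. for a static website bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-unique-bucket-name/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```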
We can also set lifecycle management rules on a bucket, specifying which storage class to move its objects to, and when to move them. More on storage classes below.
When we upload a file to an S3 bucket, AWS considers the file name to be the key, and refers to it as the key in the S3 APIs and SDKs. The S3 data model is a flat structure; there's no hierarchy of subfolders (sub-buckets?). This is why I described buckets as root-level folders. However, you can simulate a logical folder hierarchy by separating portions of the key name with forward slashes (/).
The file content is referred to as the value. Therefore, an S3 file is sometimes referred to as a key/value pair.
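The flat-keys-with-slashes idea can be sketched locally. Here a plain Python list stands in for a bucket's keys, and a small helper mimics the delimiter-based listing S3 performs to present "folders" (key names are invented):

```python
# A bucket's keys are flat strings; the slashes only *look* like folders.
keys = [
    "photos/2023/beach.jpg",
    "photos/2023/sunset.jpg",
    "photos/2024/hike.jpg",
    "docs/resume.pdf",
]

def top_level_prefixes(keys, delimiter="/"):
    """Mimic S3's delimiter-based listing: collapse keys into 'folders'."""
    prefixes = set()
    for key in keys:
        if delimiter in key:
            prefixes.add(key.split(delimiter, 1)[0] + delimiter)
    return sorted(prefixes)

print(top_level_prefixes(keys))  # ['docs/', 'photos/']
```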
Files can be versioned and encrypted, and can carry other metadata. We can secure files (objects) with policies scoped to specific objects, and we can set ACLs at the file (object) level. By default, the resource owner has full ACL rights to the file.
When we upload a file to an S3 bucket, we’ll know the upload was successful if an HTTP 200 code is returned. This is most important when uploading programmatically. If we do it manually, AWS will let us know if it succeeded or not.
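With boto3 (the AWS SDK for Python), the HTTP status comes back in the response's `ResponseMetadata`. A sketch of the check, with the real `put_object` call shown as a comment and a stubbed response dict in its place so the snippet runs without AWS credentials:

```python
# Real call (requires credentials and an existing bucket):
# response = boto3.client("s3").put_object(
#     Bucket="my-unique-bucket-name", Key="file.txt", Body=b"hello")
response = {"ResponseMetadata": {"HTTPStatusCode": 200}}  # stand-in for a real response

def upload_succeeded(response: dict) -> bool:
    """True if the S3 response reports HTTP 200."""
    return response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 200

print(upload_succeeded(response))  # True
```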
We can expect 99.99% availability, but AWS's SLA only guarantees 99.9%. It also guarantees 99.999999999% durability (11 x 9s), so we can be confident our files will always be there.
There are specific “data consistency” rules:
- Read after Write Consistency — when new files are uploaded (technically, PUTs of new objects), we can read the file immediately afterwards.
- Eventual Consistency — when existing files are updated (overwrite PUTs) or deleted (DELETEs), immediately attempting to read the file afterwards may return the old file content. It can take a short period of time (perhaps a few seconds or more) for the change to propagate throughout AWS (replication, cache clearing), which is why we may see the old file.
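One common way of coping with eventual consistency is a bounded retry: poll until the object's version (ETag) matches what we expect, then read the body. This is my own sketch of the pattern, run against an in-memory callable standing in for an S3 GET; the function and variable names are illustrative:

```python
import time

def read_with_retry(fetch, expected_etag, attempts=5, delay=0.01):
    """Poll until the fetched object's ETag matches, or give up.

    `fetch` is any callable returning (etag, body); in real code it
    would wrap an S3 GET request.
    """
    for _ in range(attempts):
        etag, body = fetch()
        if etag == expected_etag:
            return body
        time.sleep(delay)
    raise TimeoutError("object never reached the expected version")

# Simulate a replica that serves stale content for the first two reads.
reads = iter([("old", b"v1"), ("old", b"v1"), ("new", b"v2")])
print(read_with_retry(lambda: next(reads), "new"))  # b'v2'
```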
S3 supports tiered storage classes, which we can change on demand at the object level; we don't specify a class at bucket creation time. Keep in mind that lifecycle rules, by contrast, are specified at the bucket level and define the lifecycle of the objects in that bucket:
- S3 Standard (most common) is designed to sustain loss of 2 facilities concurrently, and has the best performance.
- S3 Standard-IA (Infrequent Access) is lower cost per GB stored, but we're charged a retrieval fee.
- S3 One Zone-IA is a lower cost version of Standard-IA for data that doesn't require multi-zone resilience. It's the only tier stored in a single Availability Zone; all the others are replicated across three or more zones.
- S3 Intelligent-Tiering lets AWS automatically move data to the most cost-effective tier by monitoring usage patterns. For most buckets I recommend this class, although it's best suited to long-lived data with unpredictable access patterns.
- S3 Glacier is a secure, durable, low cost archival tier, which allows for configurable retrieval times, from minutes to hours. It provides query-in-place functionality for data analysis of archived data.
- S3 Glacier Deep Archive is the lowest cost tier, but retrievals can take up to 12 hours. This is great for archived data that doesn't need to be readily available.
- S3 RRS (Reduced Redundancy Storage) is being phased out; it appears to be similar to S3 One Zone-IA.
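Lifecycle rules are what tie these classes together in practice. A sketch of a lifecycle configuration document of the shape the S3 API accepts, again just built locally with the standard library; the rule ID, prefix, and day counts are all made up for illustration:

```python
import json

# Illustrative lifecycle configuration: transition objects under logs/
# to Standard-IA after 30 days, Glacier after 90, delete after 365.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```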
The prices we're charged for using S3 (covered in another article) are based on:
- Storage Management
- Data Transfer
- Transfer Acceleration, which enables fast transfer to distant locations by using CloudFront edge locations, making use of backbone networks (much larger network “pipes”).
- Cross Region Replication, which automatically replicates objects to a bucket in another region for disaster recovery purposes (versioning must be enabled on both buckets).
- We can also configure buckets to require the requester pay for access.
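For Cross Region Replication specifically, the configuration document has a shape like the sketch below. The role ARN, rule ID, and bucket names are placeholders, and this only prints the document locally; applying it requires versioning on both buckets and an IAM role S3 can assume:

```python
import json

# Illustrative replication configuration for disaster recovery.
# All ARNs and names below are placeholders.
replication = {
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [
        {
            "ID": "dr-copy",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::my-dr-bucket"},
        }
    ],
}

print(json.dumps(replication, indent=2))
```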