Streaming PostgreSQL pg_dump to S3
Streaming pg_dump directly to S3 seems to work fine. I have a 350 GB database and did not want to create additional temporary drives. You need to make sure the multipart chunk size is big enough, otherwise you run into the 'too many segments' issue I hit (S3 multipart uploads are limited to 10,000 parts, so the AWS CLI default of 8 MB per part caps a streamed upload at roughly 80 GB). With the AWS CLI, the commands:
aws configure set default.s3.multipart_chunksize 200MB
time sudo -u postgres pg_dump -Z 9 -v DB_NAME | aws s3 cp - s3://BUCKET/DB_NAME.dump.gz
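An alternative (which I have not tested myself) to raising the chunk size is to tell the CLI roughly how big the stream will be, so it can pick part sizes that stay under the 10,000-part limit; --expected-size takes a byte count, and the value here is only a guess for a dump of this size:
time sudo -u postgres pg_dump -Z 9 -v DB_NAME | aws s3 cp - s3://BUCKET/DB_NAME.dump.gz --expected-size 150000000000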
With my DB it took about 8 hours and the result was a 130 GB file in S3. Restoring then has to be done with psql, as pg_restore does not accept the plain SQL dump that the command above creates. The directory format was not an option, since it writes a directory of files that cannot be piped; the custom format, on the other hand, is a single archive and can be piped.
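For reference, a custom-format dump could in principle be streamed the same way and restored by piping it into pg_restore, which reads an archive from stdin when no file is given; note that a piped archive rules out options that need a seekable file, such as parallel restore with -j (I have not timed this on a database this size):
sudo -u postgres pg_dump -Fc -Z 9 DB_NAME | aws s3 cp - s3://BUCKET/DB_NAME.dump
aws s3 cp s3://BUCKET/DB_NAME.dump - | sudo -u postgres pg_restore -d DB_NAME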
Finally, restoring the same way, without saving an intermediate file. Note that I had to uncompress the data with zcat before feeding it to psql:
wget -O - 'https://S3-URL/BUCKET/DB_NAME.dump.gz' | zcat | sudo -u postgres psql DB_NAME
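The wget approach needs the object to be publicly readable or a pre-signed URL; if the restore host has AWS credentials, the same stream should also work through the CLI, something like:
aws s3 cp s3://BUCKET/DB_NAME.dump.gz - | zcat | sudo -u postgres psql DB_NAME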
Restoring seems to take about the same time (~8 hours) as dumping; that probably depends on where your server is and how big it is (on AWS or elsewhere, mine is outside AWS).
You can use S3's multipart upload feature to stream the dump as it is being generated, but that is likely to be error-prone and less reliable. A better approach is to create an ephemeral EBS volume, dump your database to it, and then upload the compressed backup to S3/Glacier if that is where it needs to go.
If you want a backup for point-in-time recovery, doing a pg_basebackup to an EBS volume and archiving the WAL stream from the point after the backup means you can cut the time to recovery without keeping a full replica node; a rough sketch follows. If your concern is availability, then setting up replication is the way to go, although you will still want backups.
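Something along these lines on a recent PostgreSQL (the mount point, bucket and archive command are placeholders, and shipping WAL via aws s3 cp is just one common pattern):
sudo -u postgres pg_basebackup -D /mnt/ebs/base -Ft -z -X none
# then in postgresql.conf, ship the WAL stream somewhere durable:
#   archive_mode = on
#   archive_command = 'aws s3 cp %p s3://BUCKET/wal/%f'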
Replication is not backup: if someone drops a table on the origin, it will be dropped on the replica, so you still need PITR or checkpoint backups.