AWS S3 sync - only modified files, using git status

Problem

As you know, I build this website with my own custom site generator, so the whole website is re-created every time I change something. Similar tools, like Jekyll, are widely used.

I then use the aws s3 sync command to update the public version of the site in my S3 bucket. The only problem is that, since the local website is fully rebuilt every time, every local file looks newer than its remote copy in the S3 bucket (because of the newer modification timestamps), so all files get uploaded again even when their content has not changed.

Git Solution

The easiest solution to this problem is to let git work out which files changed and then pass only those files to the aws s3 sync command, which syncs them against the remote S3 bucket.

So, long story short, assuming the site’s directory is versioned with git, the following script will sync it with the remote S3 bucket, including adding new files, removing deleted files, etc.

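#!/bin/bash
# -e: stop at the first failing command, -x: print each command as it runs.
set -ex

# Collect the paths that git reports as modified, new or deleted.
# sed strips the status columns (e.g. " M " or "?? ") from each line.
# Note: the for loop splits on whitespace, so paths containing spaces are not supported.
FILES=()
for i in $( git status -s | sed 's/\s*[a-zA-Z?]\+ \(.*\)/\1/' ); do
    FILES+=( "$i" )
done
#echo "${FILES[@]}"

# Build one --include filter per path. The trailing * also covers the
# contents of new directories, which git status lists as "dir/".
CMDS=()
for i in "${FILES[@]}"; do
    CMDS+=("--include=$i""*")
done
#echo "${CMDS[@]}"

# xargs appends the --include filters after --exclude "*".
echo "${CMDS[@]}" | xargs aws s3 sync . s3://www.lambrospetrou.com --dryrun --delete --exclude "*"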

Important

You have to remove the --dryrun option in order to actually apply the changes remotely; otherwise the command only prints the operations it would perform without running them.

Explanation

The important part of the above script is the --include and --exclude filters. The order of the filters matters, because filters that appear later in the command take precedence over earlier ones; that’s why the exclude comes first and the includes last. If the exclude were last, nothing would be updated.
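To make the ordering rule concrete, here is a small sketch that models it in plain bash. It is not the AWS CLI's implementation, only a simplified illustration that assumes every file starts out included and that the last matching filter decides. The decide function and the filter strings are hypothetical names made up for this example.

```shell
#!/bin/bash
# Simplified model of the filter rule: every file starts as included,
# and the last filter whose pattern matches the file decides its fate.
# This is NOT the AWS CLI's code, only an illustration of the ordering.
decide() {
    local file=$1; shift
    local verdict="include"
    local filter kind pattern
    for filter in "$@"; do
        kind=${filter%%=*}      # "exclude" or "include"
        pattern=${filter#*=}    # glob pattern after the "="
        # An unquoted right-hand side of == is matched as a glob pattern.
        if [[ $file == $pattern ]]; then
            verdict=$kind
        fi
    done
    echo "$verdict"
}

EXCLUDE_FIRST=$( decide "index.html" "exclude=*" "include=index.html*" )
EXCLUDE_LAST=$(  decide "index.html" "include=index.html*" "exclude=*" )
echo "exclude first: $EXCLUDE_FIRST"   # prints "exclude first: include"
echo "exclude last:  $EXCLUDE_LAST"    # prints "exclude last:  exclude"
```

With the exclude first, the later include wins and the file is synced. With the exclude last, it overrides every include and nothing is left to upload.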

The two for-loops generate the required --include=FileX arguments, which are expanded using the "${CMDS[@]}" trick. Then xargs appends them as the last arguments of the aws s3 sync command, and it also handles very long lists of files that would exceed the command line length limit by splitting them across several invocations.
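The following sketch shows what the final command line looks like without touching S3. It assumes a hard-coded, hypothetical file list in place of the git status output and a placeholder bucket name (s3://example-bucket), and it puts echo in front of aws so the command is only printed, never executed.

```shell
#!/bin/bash
# Hypothetical file list standing in for the output of `git status -s`.
FILES=( "index.html" "articles/new-post/" )

# One --include filter per path, with a trailing * as in the script above.
CMDS=()
for f in "${FILES[@]}"; do
    CMDS+=( "--include=$f*" )
done

# `echo` stands in for the real `aws` binary, so this only prints the
# command that xargs would run. s3://example-bucket is a placeholder.
RESULT=$( echo "${CMDS[@]}" | xargs echo aws s3 sync . s3://example-bucket --dryrun --delete --exclude "*" )
echo "$RESULT"
```

The printed command ends with --exclude * followed by the two --include filters, which is exactly the order the filters need.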

In addition, I have to use git status instead of git diff; otherwise new files would not be synced, since they are untracked, not yet part of the index, and git diff does not list them.
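The difference is easy to check in a throwaway repository. The sketch below assumes git is installed; the temporary directory, the file names and the commit identity are all made up for the example.

```shell
#!/bin/bash
set -e

# Create a throwaway repository with one committed file.
REPO=$(mktemp -d)
cd "$REPO"
git init -q
echo "v1" > index.html
git add index.html
git -c user.name="Example" -c user.email="example@example.com" commit -q -m "initial"

echo "v2" > index.html     # modify a tracked file
echo "new" > about.html    # create a new, untracked file

DIFF_OUT=$(git diff --name-only)
STATUS_OUT=$(git status -s)

echo "git diff sees:";   echo "$DIFF_OUT"
echo "git status sees:"; echo "$STATUS_OUT"
```

Here git diff reports only index.html, while git status also lists about.html with the ?? marker, which is why the script reads its file list from git status.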

Conclusion

Using git along with the AWS CLI, it’s very easy to maintain my website and upload only the real diff, the modified files, each time. One can imagine this being used in a much more advanced scenario, with GitHub webhooks integrated with AWS CodePipeline or any other CI tool that releases your website automatically.

References