Simple Site Backup Pattern Using S3 and Hadoop

The idea here is that you want to back up your website’s document root and MySQL database on a daily basis. Keeping the backup files on your web server is fine if you break your site and need to revert quickly, but it’s no help if you lose the server. The best approach is to keep a copy on the web server for easy access and also store one offsite in case of catastrophe.

In this tutorial, we’ll keep 7 days of backups in a local /backup/ directory and 30 days of backups on Amazon’s S3. To put the files onto S3, I’m going to use hadoop! Not because I plan on doing Map/Reduce on my backups, but because it provides a simple command-line method for putting files into S3. It’s easier than writing my own program to store files on S3.

Note: In the past, I’ve written an article on storing backups on S3 using a deduplication technique. It’s pretty clever and reduces the total disk space consumed on S3. But it’s much more complex, and if you lost your web server and needed access to the backup files, you’d have to reconstruct all the code to reassemble your files. That would be a pain in a pinch. So, if you just want a super simple way to back up your files, one that lets you retrieve them easily from any machine or browser, this is your article.

Note: as a prerequisite, you need hadoop installed on your server. It should be as easy as “yum install hadoop” on a RHEL or CentOS machine. At the time of writing, I have hadoop-0.20-0.20.2+923.21-1 installed.
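
Before wiring anything into cron, it can help to confirm that the tools the scripts rely on are actually on the PATH. The following is only a minimal sketch and not part of the original scripts; the helper name require_cmd is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original scripts): succeed only if
# the named command can be found on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required command: $1" >&2
    return 1
}
```

Something like "require_cmd hadoop && require_cmd mysqldump || exit 1" near the top of a script would stop early rather than produce an empty backup.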

Make the directories

mkdir -p /scripts
mkdir -p /backup

/scripts/backup.sh (This makes a backup of your doc root and MySQL db in /backup/ and copies it to S3)

#!/bin/sh
 
echo mkdir -p /backup/
mkdir -p /backup/
DOMAIN=koopman.me
DOCUMENT_ROOT=/var/www/html
DATABASE_HOST=your_db_host
DATABASE_NAME=your_db_name
DATABASE_USER=your_db_user
DATABASE_PASS=your_db_pass
DATE=`date +"%Y-%m-%d_%H_%M_%S"`
 
# Braces are required here: "$DATE_" would be read as an (unset) variable named DATE_.
echo "mysqldump -h $DATABASE_HOST -u $DATABASE_USER -p --opt $DATABASE_NAME | gzip > /backup/${DATE}_$DATABASE_NAME.sql.gz"
mysqldump -h $DATABASE_HOST -u $DATABASE_USER -p$DATABASE_PASS --opt $DATABASE_NAME | gzip > /backup/${DATE}_$DATABASE_NAME.sql.gz
echo cd $DOCUMENT_ROOT
cd $DOCUMENT_ROOT
echo tar -czf /backup/$DATE.$DOMAIN.tar.gz .
tar -czf /backup/$DATE.$DOMAIN.tar.gz .
 
# Clean up local backups older than 7 days (S3 keeps 30 days):
echo "find /backup -type f -mtime +7 | xargs rm -f"
find /backup -type f -mtime +7 | xargs rm -f
 
# Send files to S3:
echo "hadoop fs -put /backup/* s3n://AWS_ID:AWS_KEY@BUCKET/backup/"
hadoop fs -put /backup/* s3n://your_aws_id:your_aws_secret@your_aws_bucket/backup/
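
One detail worth knowing when building file names like these: when a variable is followed directly by an underscore, the shell needs braces to tell where the variable name ends. Here is a small sketch of that, using two helper functions whose names are invented for illustration and which are not part of the script above.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative only) that build the
# backup file names from a timestamp plus a database name or a domain.

# $1 = timestamp, $2 = database name
db_backup_path() {
    stamp="$1"
    name="$2"
    # "${stamp}_" ends the variable name at the brace; a bare "$stamp_"
    # would be looked up as a variable called "stamp_" and expand to nothing.
    echo "/backup/${stamp}_${name}.sql.gz"
}

# $1 = timestamp, $2 = domain
site_backup_path() {
    stamp="$1"
    domain="$2"
    # A dot cannot be part of a variable name, so braces are optional here;
    # they are kept for consistency.
    echo "/backup/${stamp}.${domain}.tar.gz"
}
```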

/scripts/purge_hadoop.sh (This deletes backups older than 30 days from your_aws_bucket/backup)

#!/bin/bash

echo "hadoop fs -ls s3n://AWS_ID:AWS_KEY@your_aws_bucket/backup/* | sed -re 's/ +/\t/g' > /tmp/d.tmp"
hadoop fs -ls s3n://your_aws_id:your_aws_secret@your_aws_bucket/backup/* | sed -re "s/ +/\t/g" > /tmp/d.tmp

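# Compute the cutoff date once (uses GNU date, as shipped on RHEL/CentOS).
DELDATE=`date -d "30 days ago" +"%Y-%m-%d"`

# Split the listing on newlines only, so each $i is one whole line.
export IFS="
"
for i in `cat /tmp/d.tmp`; do
        DATE=`echo "$i" | cut -f 4`
        FILE=`echo "$i" | cut -f 6`
        echo "DATE=$DATE, FILE=$FILE, DELDATE=$DELDATE"
        # Skip lines without a date and path, such as a listing header.
        if [[ -n "$DATE" && -n "$FILE" && "$DATE" < "$DELDATE" ]]; then
                echo hadoop fs -rm s3n://AWS_ID:AWS_KEY@your_aws_bucket$FILE
                hadoop fs -rm s3n://your_aws_id:your_aws_secret@your_aws_bucket$FILE
        fi
done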
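
The date filter can also be tried out on its own, without touching S3. The sketch below is an alternative way to pick out the expired paths with awk, not the method used in the script above. It assumes the date, time, and path are the last three columns of each listing line and that paths contain no spaces; the function name expired_paths and the sample listing lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "hadoop fs -ls"-style lines on stdin and print
# the path of every entry whose date sorts before the cutoff.
# $1 = cutoff date in YYYY-MM-DD form
expired_paths() {
    awk -v cutoff="$1" '
        # Assumed layout: ... DATE TIME PATH as the last three fields.
        # Lines without a YYYY-MM-DD date in that position are skipped.
        NF >= 3 &&
        $(NF-2) ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ &&
        ($(NF-2) "") < (cutoff "") {
            print $NF
        }'
}

# Example with made-up listing lines (prints only the older path):
printf '%s\n' \
    'Found 2 items' \
    '-rwxrwxrwx   1   1024 2012-01-01 01:01 /backup/old.tar.gz' \
    '-rwxrwxrwx   1   2048 2012-03-01 01:01 /backup/new.tar.gz' |
    expired_paths 2012-02-01
```

Because YYYY-MM-DD dates sort the same way alphabetically as chronologically, a plain string comparison is enough to decide what has expired.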

Make sure to chmod 700 your /scripts/* files, since they contain your database password and AWS secret:

chmod 700 /scripts/backup.sh 
chmod 700 /scripts/purge_hadoop.sh

Next, make them run daily:

/etc/cron.d/backup (This runs at 01:01 daily)

1 1 * * * root /scripts/backup.sh > /var/log/backup.log 2>&1

/etc/cron.d/purge_hadoop (This runs at 00:05 daily)

5 0 * * * root /scripts/purge_hadoop.sh > /var/log/purge_hadoop.log 2>&1
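
With both jobs scheduled, the remaining piece is getting a file back when you need it. Retrieval is the mirror image of the upload, using "hadoop fs -get" instead of "-put". The following is a rough sketch only: the function name fetch_backup, the variable names, and the DRY_RUN switch are all invented for illustration, and the credentials are the same placeholders used in the scripts.

```shell
#!/bin/sh
# Placeholders; replace with your own values.
AWS_ID=your_aws_id
AWS_SECRET=your_aws_secret
AWS_BUCKET=your_aws_bucket

# Hypothetical helper: copy one backup file from S3 into a local directory.
# $1 = file name under /backup/ in the bucket
# $2 = local destination directory (defaults to /tmp)
fetch_backup() {
    src="s3n://$AWS_ID:$AWS_SECRET@$AWS_BUCKET/backup/$1"
    dest="${2:-/tmp}"
    # Echo the command with the credentials masked, as the backup scripts do.
    echo "hadoop fs -get s3n://AWS_ID:AWS_KEY@$AWS_BUCKET/backup/$1 $dest/"
    # Set DRY_RUN=1 to print the command without running it.
    if [ -z "${DRY_RUN:-}" ]; then
        hadoop fs -get "$src" "$dest/"
    fi
}
```

Running it once with DRY_RUN=1 set shows the command that would be executed before anything is downloaded.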

And that’s all. This is about as simple a pattern for a daily backup strategy as it gets.

Dave