Automated Backup and Restore using Duplicity and AWS S3

Need for Backup

It is hard to overstate how important it is to back up your files on a regular basis. If you choose not to back up your files, you risk losing important data. Backups are a basic necessity in any organisation, but they are seldom done the right way, and the thought of setting them up can be a mental barrier for many. This article is for those who want to make the task of backing up easy. Backups are needed for many reasons, including the ones below.

  • Hard drives do crash
  • Files can accidentally be deleted or become corrupt
  • Viruses can corrupt or delete files
  • You may upgrade to a new computer and need to move your files.

 

Scenario

There are numerous ways to implement a backup strategy; it all depends on your requirements. Consider a small company whose production application runs on a single cloud VPS, and let us assume the VPS runs a LAMP stack. Since we have only one server, we need a reliable and cost-efficient solution to back up our data against the scenarios listed above. We will access the backup data only when needed, not frequently.

 

Solution:

One way of meeting the above requirement is by using Duplicity and AWS S3. Duplicity is a Python-based command-line application that creates incremental, encrypted backups. “Incremental” means that each backup only stores data that has changed since the last backup run. This saves a lot of storage space. “AWS S3” is the storage service provided by Amazon.

Duplicity supports connecting to various backends like FTP, rsync, SCP, SSH, Google Cloud Storage, AWS S3, OpenStack Swift etc. It is not mandatory to use S3 as a backend. You can choose any of the supported Duplicity backends based upon your requirement.

Since AWS S3 is designed for 99.999999999% (11 9’s) durability and 99.99% availability, it is a safe and reliable choice of storage. It is also cheap: as of today, S3 costs $0.0300/GB per month for Standard, $0.0125/GB for Standard – Infrequent Access and $0.007/GB for Glacier storage. You can get detailed pricing of S3 at AWS S3 Pricing.
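To get a feel for these numbers, the monthly storage bill can be estimated with a small helper. This is only a sketch: monthly_cost is a hypothetical function, the prices are the per-GB figures quoted above, and request and retrieval charges are ignored.

```shell
# monthly_cost SIZE_GB PRICE_PER_GB (hypothetical helper)
# Multiplies the stored size by the per-GB monthly price quoted above.
monthly_cost() {
    awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb * price }'
}

# Example: 100 GB of backup data in each storage class
monthly_cost 100 0.0300   # Standard
monthly_cost 100 0.0125   # Standard - Infrequent Access
monthly_cost 100 0.007    # Glacier
```

The helper only covers storage; Infrequent Access and Glacier add retrieval fees, which matter when you actually restore.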

You can compare various cloud storage providers and select one of your choice. In this article, however, we will be using S3 as the storage option.

 

Let’s Implement it

Before you start, if you want to learn about S3 buckets or any other AWS service, you can do so by enrolling for a course on Udemy, where you can find an endless list of courses. As of 15th July 2016 (Udemy Offer), you get 50% off on all courses. Check for offers and coupon codes here.

So we have decided on the tool with which we will perform the backup and on the storage option. Let’s get our hands dirty and see how we can implement this.

The following steps are for the Ubuntu distribution. You can modify the installation commands as per the distribution you are using.

 

Install Duplicity

Using Package Manager:

Run the below commands as the root user.

You need to add the duplicity PPA so that you get the latest version of duplicity:

add-apt-repository ppa:duplicity-team/ppa
apt-get update
apt-get install duplicity

 

 

Installing from Source:

If you want to install duplicity from source, then follow the below steps.

‘0.7.06’ is the latest tarball version available at the time of writing this article.

cd /root
wget https://code.launchpad.net/duplicity/0.7-series/0.7.06/+download/duplicity-0.7.06.tar.gz

 

Unpack the source and move into the package directory that is created:

tar xzvf duplicity*
cd duplicity*

 

We can complete the actual installation by installing the build dependencies and then running the installer.

apt-get install librsync-dev python-setuptools python-lockfile python-dev

python setup.py install

 

By default, duplicity is installed under /usr/local. If you want to install it under /usr instead, you need to pass in the prefix flag:

python setup.py install --prefix=/usr

 

Duplicity needs python boto library for connecting to Amazon S3

apt-get install python-boto

 

We will be using a wrapper script to work with duplicity. The wrapper script makes use of s3cmd, so let’s install that as well.

apt-get install s3cmd

 

Create IAM user and AWS S3 Bucket

In simple terms, Amazon Web Services (AWS) is a secure cloud services platform. The services include EC2 (Elastic Compute Cloud), S3 (Simple Storage Service), RDS (Relational Database Service) and many more. You can refer to the AWS website for the various services that they offer.

We need to have an AWS account in order to create a bucket in S3. The ID with which you create the account is called the root account.

It is not recommended to use your root account for managing your services in AWS. In fact, it is recommended to delete the API credentials of the root account. You can instead create a new user and assign admin privileges to that user for managing the services.

In our example, let us create an IAM user named “backup-user” using IAM (Identity and Access Management). We need to generate API keys for this user, which duplicity will use to connect to the S3 bucket through API calls. The API key consists of an Access Key and a Secret Key. You can view the Secret Key only at the time of creating it; you will not have access to it again. You can follow the steps provided in create IAM user to complete the things said above.

Let us create the S3 bucket now. You can follow the steps provided in create s3 bucket for creating a bucket. Let us consider the name of the bucket to be ‘backup-bucket’.

We have created an IAM user ‘backup-user’, but we are yet to define what actions that user can perform. This can be done by attaching a policy to the user. There are various predefined policies available in the policy list, for example AdministratorAccess, AmazonS3ReadOnlyAccess and AmazonS3FullAccess. But in our scenario ‘backup-user’ will be used only by duplicity to store the backup data, so we do not need full access to Amazon S3. In fact, it is enough if we provide access only to ‘backup-bucket’, and only to put, get and delete objects. In order to achieve this, we need to create a custom policy. We will be using the following policy.

{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                       "s3:PutObject",
                       "s3:GetObject",
                       "s3:DeleteObject"
             ],
            "Resource": [
                "arn:aws:s3:::backup-bucket",
                "arn:aws:s3:::backup-bucket/*"
            ]
        }
    ]
}

Note: In place of ‘backup-bucket’ you need to use your bucket name.

After creating the policy, you need to attach it to ‘backup-user’. Only then will the user be able to access the bucket.
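If you manage more than one bucket, the policy document can be generated from the bucket name rather than edited by hand. The following is a minimal sketch: render_backup_policy is a hypothetical helper, and the usage comment assumes the AWS CLI is installed and configured with credentials that are allowed to manage IAM; the policy name duplicity-backup is only an example.

```shell
# render_backup_policy BUCKET (hypothetical helper)
# Prints the custom policy shown above with BUCKET filled in.
render_backup_policy() {
    bucket="$1"
    cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": [
                "arn:aws:s3:::${bucket}",
                "arn:aws:s3:::${bucket}/*"
            ]
        }
    ]
}
EOF
}

# Usage (assumes a configured AWS CLI; the policy name is an example):
#   render_backup_policy backup-bucket > backup-policy.json
#   aws iam put-user-policy --user-name backup-user \
#       --policy-name duplicity-backup \
#       --policy-document file://backup-policy.json
```

Attaching the document as an inline user policy is one option; creating it in the console and attaching it to the user, as described above, works just as well.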

 

Generate GPG Keys

Duplicity supports both asymmetric public-key encryption and symmetric password-only encryption. Symmetric encryption means using a simple password, which is fine in most cases. But we will be using a GPG asymmetric key pair for extra security.

The command below will store our keys in a hidden directory at /root/.gnupg if you are running as the root user.

gpg --gen-key

 

You will be asked a series of questions that will configure the parameters of the key pair.

Please select what kind of key you want:
   (1) RSA and RSA (default)
   (2) DSA and Elgamal
   (3) DSA (sign only)
   (4) RSA (sign only)
Your selection? 
RSA keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 
Requested keysize is 2048 bits
Please specify how long the key should be valid.
         0 = key does not expire
      <n>  = key expires in n days
      <n>w = key expires in n weeks
      <n>m = key expires in n months
      <n>y = key expires in n years
Key is valid for? (0) 
Key does not expire at all
Is this correct? (y/N) y

Press enter to accept the default “RSA and RSA” keys. Press enter twice again to accept the default keysize and no expiration date.

Type y to confirm your parameters.

You need a user ID to identify your key; the software constructs the user ID
from the Real Name, Comment and Email Address in this form:
"Heinrich Heine (Der Dichter) <heinrichh@duesseldorf.de>"

Real name: Your Name
Email address: you@example.com
Comment:
You selected this USER-ID:
Your Name <you@example.com>

Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? o

 

Enter the name, email address, and optionally, a comment that will be associated with this key. Type O to confirm your information.

Next, you will be setting up a passphrase to use with GPG. Make note of the passphrase.

Enter passphrase:
Repeat passphrase:

 

At this point, you will be asked to generate entropy. Entropy is basically a word that describes how much unpredictability is in a system. Your VPS needs entropy to create a key that is actually random. You might get a message like the one below.

We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.

Not enough random bytes available. Please do some other work to give
the OS a chance to collect more entropy! (Need 280 more bytes)

 

This means your server does not have enough randomness to create the keys. In such a case, you can run a command in a new terminal that generates a lot of disk activity.

For example, you can run the below command.

dd bs=1M count=1024 if=/dev/zero of=test conv=fdatasync

 

If you still cannot gather the random bytes required to create the keys, you can follow this article

When you’ve generated enough random pieces of information, your key will be created:

gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 05AB3DF5 marked as ultimately trusted
public and secret key created and signed.

gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u
pub 2048R/05AB3DF5 2016-03-16
Key fingerprint = AF21 2669 07F7 ADDE 4ECF 2A33 A57F 6998 05AB 3DF5
uid Your Name
sub 2048R/32866E3B 2016-03-16

 

The value after ‘2048R/’ on the pub line (05AB3DF5 in this example) is your public key ID. Make a note of it. If you need to check it again, you can list your public keys by running

gpg --list-keys
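If you want to pick the key ID out of that listing in a script, a small filter is enough. This sketch assumes the listing format shown above (a line such as "pub 2048R/05AB3DF5 2016-03-16"); newer GnuPG releases print the listing differently, and extract_key_ids is a hypothetical helper name.

```shell
# extract_key_ids (hypothetical helper)
# Reads a key listing on stdin and prints the short ID of every primary key,
# i.e. the part after the "/" on each "pub" line.
extract_key_ids() {
    awk '$1 == "pub" { split($2, parts, "/"); print parts[2] }'
}

# Usage:
#   gpg --list-keys | extract_key_ids
```

The value it prints is what goes into GPG_ENC_KEY and GPG_SIGN_KEY in the wrapper configuration later on.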

 

Important: Remember to back up your GPG key pair somewhere safe and off the current machine. Without this key pair your backups are useless to you: if you lose it, you will not be able to restore a backup. Follow the steps provided in Backing up GPG Key in order to back up your keys.

 

Backing up GPG Key

You can back up your GPG keys and use them on a different server by using the export and import options available with the gpg command

gpg --export -a "backup" > public.key

gpg --export-secret-key -a "backup" > private.key

 

The above commands will export the public and private keys. You can email these files by using a utility like mutt, swaks etc.

If you ever have to import the keys on a different server, then you can follow the below steps.

gpg --import public.key
gpg --allow-secret-key-import --import private.key

 

It is not enough to just import the key files on a different server. You also need to trust the key. Use the below command to trust a gpg key.

gpg --edit-key backup

 

Replace ‘backup’ with the name of your key. The above command will take you to the gpg prompt. Type ‘trust’ and press enter. Then type ‘5’ to trust ultimately and press enter. You can refer to the below snippet.

gpg (GnuPG) 1.4.11; Copyright (C) 2010 Free Software Foundation, Inc.

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law.

Secret key is available.

pub  2048R/05AB3DF5 created: 2016-03-16  expires: never       usage: SC 

                     trust: ultimate      validity: ultimate

sub  2048R/32866E3B  created: 2016-02-13  expires: never       usage: E   

[ultimate] (1). backup (test) <you@example.com>

gpg> trust

pub  2048R/05AB3DF5 created: 2016-03-16  expires: never       usage: SC

                     trust: ultimate      validity: ultimate

sub  2048R/32866E3B  created: 2016-02-13  expires: never       usage: E   

[ultimate] (1). backup (test) <you@example.com>

Please decide how far you trust this user to correctly verify other users’ keys

(by looking at passports, checking fingerprints from different sources, etc.)

  1 = I don’t know or won’t say

  2 = I do NOT trust

  3 = I trust marginally

  4 = I trust fully

  5 = I trust ultimately

  m = back to the main menu

Your decision? 5

Do you really want to set this key to ultimate trust? (y/N) y

pub  2048R/05AB3DF5 created: 2016-03-16  expires: never       usage: SC

                     trust: ultimate      validity: ultimate

sub  2048R/32866E3B  created: 2016-02-13  expires: never       usage: E   

[ultimate] (1). backup (test) <you@example.com>

What next?

So far we have installed duplicity and its prerequisites, created an S3 bucket and provided access to it, and created the GPG keys required for encrypting our backup data.

Though we have all the things ready for our backup setup, we need to analyse our requirements and configure duplicity based on them.

Consider that we have the below requirements

  1. Our topmost priority is backing up the DB. We need to retain at least two months of backup for the DB
  2. Our next requirement is to back up all the data under /var/www and /var/log. It is enough if we keep 15 days of this data as backup
  3. We will not be accessing the backed up data frequently. Compared to the Standard storage class, the Infrequent Access (IA) storage class provided by Amazon S3 charges less for storing data and more for retrieving it. IA was introduced specifically for storing infrequently accessed data. This satisfies our requirement, so we will be using IA as the storage class.

Since we have different lifecycle policies for the DB and the other miscellaneous data (/var/www, /var/log), we will create two folders underneath backup-bucket, which we created earlier. Let us name the folders ‘DB-backup’ for storing DB backup data and ‘Misc-backup’ for storing other miscellaneous backup data.

The bucket directory structure will look like,

For DB –> All Buckets / backup-bucket / DB-backup

For Misc –> All Buckets / backup-bucket / Misc-backup

 

Mysql dump script

In order to back up the DB, we need to take a mysqldump of all the databases before uploading them to S3. Create a script under ‘/root/duplicity’ with the below content and name it db_dump.sh

#!/bin/bash
# Dump every MySQL database to its own uncompressed .sql file
MYSQL_ROOT_USER="root"
MYSQL_ROOT_PASSWORD='<your password>'
OUTPUTDIR="/var/backup/mysql_backup"
MYSQLDUMP="/usr/bin/mysqldump"
MYSQL="/usr/bin/mysql"
sqldumpoptions="--skip-lock-tables --single-transaction --complete-insert --add-drop-table --quick --quote-names"
# make sure the output directory exists, then remove the dumps of the previous run
mkdir -p "$OUTPUTDIR"
rm -f "$OUTPUTDIR"/*sql >/dev/null 2>&1
# get the list of databases
databases=$("$MYSQL" --user="$MYSQL_ROOT_USER" --password="$MYSQL_ROOT_PASSWORD" -e "SHOW DATABASES;" | tr -d "| " | grep -v Database)

for db in $databases; do
    echo "$db"
    # $sqldumpoptions is left unquoted on purpose so that it splits into separate options
    "$MYSQLDUMP" --force --opt --user="$MYSQL_ROOT_USER" --password="$MYSQL_ROOT_PASSWORD" $sqldumpoptions --databases "$db" > "$OUTPUTDIR/$db.sql"
done
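The list returned by SHOW DATABASES also contains MySQL's internal schemas, which typically do not need to be part of a dump. As a sketch, and assuming you want to skip information_schema, performance_schema and sys, the list could be passed through a filter before the loop. filter_databases is a hypothetical helper, not part of the script above.

```shell
# filter_databases (hypothetical helper)
# Reads database names on stdin, one per line, and drops MySQL's internal
# schemas. -x matches whole lines only, so a database called "system_logs"
# is not affected by the "sys" entry.
filter_databases() {
    grep -v -x -e information_schema -e performance_schema -e sys
}

# Usage inside db_dump.sh (the mysql call is the one from the script above):
#   databases=$("$MYSQL" ... -e "SHOW DATABASES;" | tr -d "| " | grep -v Database | filter_databases)
```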

 

Notice that I did not zip the database dump files locally. This is deliberate: duplicity cannot take an effective incremental backup of zipped files. If you zip the dump files, you’ll see the reverse effect, with duplicity uploading a new copy of the dump files every time you run it instead of an incremental backup.

If you want to use a more advanced auto mysql backup option, then you can check here

 

Wrapper Script

Duplicity in itself is still a relative pain to use. It has many options, too many if you’re just starting out (you can try running duplicity --help). So we will be using a wrapper script to make our work easy. We will be using the script available at wrapper script

If you have git installed on your server, you can clone the repository by running

cd /root/duplicity

git clone https://github.com/chrissam/dup-s3-backup.git

Or you can download the zip file directly and then copy it to the server.

Once you clone/unzip the repo on your server, you can see a list of files inside the directory named duplicity-backup.

You can go through the README.md file to know how the script works. The two files that we will be using are

  • duplicity-backup.conf.example
  • duplicity-backup.sh

We will be creating two separate configuration files, one for the DB backup and one for the Misc backup. This is because we have to specify a different lifecycle policy for each backup. We will be configuring a few parameters inside each file. Create two copies of duplicity-backup.conf.example and name them ‘db-duplicity-backup.conf’ and ‘misc-duplicity-backup.conf’.

Consider that we are configuring the options for ‘misc-duplicity-backup.conf’. Below is the list of options that we are interested in.

For misc-duplicity-backup.conf:

ROOT="/"
DEST="s3://s3.amazonaws.com/backup-bucket/Misc-backup/"
AWS_ACCESS_KEY_ID="<your access key>"
AWS_SECRET_ACCESS_KEY="<your secret access key>"
STORAGECLASS="--s3-use-ia"
INCLIST=( "/var/www" \
          "/var/log" )
ENCRYPTION='yes'
PASSPHRASE='<your passphrase for GPG>'
GPG_ENC_KEY="05AB3DF5"
GPG_SIGN_KEY="05AB3DF5"
STATIC_OPTIONS="--full-if-older-than 15D --s3-use-new-style"
CLEAN_UP_TYPE="remove-all-but-n-full"
CLEAN_UP_VARIABLE="2"
LOGDIR="/var/log/duplicity"
LOG_FILE="duplicity-misc-`date +%Y-%m-%d_%H-%M`.txt"
LOG_FILE_OWNER="root:root"
VERBOSITY="-v3"
EMAIL_TO="<your mail id>"
EMAIL_FROM="<from email id>"
EMAIL_SUBJECT="<Subject for the mail>"

 

Most of the options are self-explanatory. Take a look at the value given for ‘DEST’. We have added the sub folder ‘Misc-backup’ after the bucket name to indicate that the data has to go inside Misc-backup. In the case of ‘db-duplicity-backup.conf’, the folder will be ‘DB-backup’.

INCLIST lists the directories that should be backed up. In the case of db-duplicity-backup.conf it should be “/var/backup/mysql_backup”, since that is where db_dump.sh will dump the files.

The three options that define the lifecycle policy are

  • STATIC_OPTIONS
  • CLEAN_UP_TYPE
  • CLEAN_UP_VARIABLE

The value “--full-if-older-than 15D --s3-use-new-style” indicates that a full backup is generated every 15 days. The “remove-all-but-n-full” value tells the script to keep only a fixed number of full backups, and CLEAN_UP_VARIABLE sets that number. The value of 2 indicates that it’ll keep 2 full backups and remove any full backups, together with their incremental backups, that are older than the last 2 full backups. In this way we maintain at least 15 days of backup at any point in time for the Misc backup.

The lifecycle policy values for ‘db-duplicity-backup.conf’ will be set as below

STATIC_OPTIONS="--full-if-older-than 15D --s3-use-new-style"
CLEAN_UP_TYPE="remove-all-but-n-full"
CLEAN_UP_VARIABLE="6"

 

By keeping the value of CLEAN_UP_VARIABLE as 6, we retain at least 60 days of backup, which meets the two-month requirement for the DB.

Note that an incremental backup is of no use if the corresponding full backup is deleted. This is because each incremental backup is chained to its full backup.

Note: The “--s3-use-ia” option requires duplicity version 0.7.06, which is the latest version at the time of writing.
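The retention that these settings give can be worked out with a little arithmetic. The sketch below assumes that backups run at least daily and that cleanup runs after every backup: right after a new full backup, the oldest kept chain is (n - 1) intervals old, and it grows to roughly n intervals just before the next full backup. retention_window is a hypothetical helper.

```shell
# retention_window FULL_INTERVAL_DAYS N_FULL (hypothetical helper)
# Prints the minimum and the approximate maximum number of days covered when
# a full backup is taken every FULL_INTERVAL_DAYS days and only the newest
# N_FULL backup chains are kept.
retention_window() {
    interval="$1"
    n_full="$2"
    min_days=$(( interval * (n_full - 1) ))
    max_days=$(( interval * n_full ))
    echo "$min_days $max_days"
}

retention_window 15 2   # Misc backup: 15 to about 30 days
retention_window 15 6   # DB backup: 75 to about 90 days
```

Both results sit on the safe side of the stated requirements of 15 days for Misc and two months for the DB.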
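Before relying on the IA storage class, it may be worth checking the installed duplicity version. The following is a minimal sketch, assuming that releases newer than 0.7.06 also accept the option and that `sort -V` (GNU coreutils) is available; version_ge is a hypothetical helper.

```shell
# version_ge A B (hypothetical helper)
# Succeeds if version A is greater than or equal to version B. The smaller of
# the two versions sorts first, so A >= B exactly when B is the first line.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes `duplicity --version` prints "duplicity <version>"):
#   installed=$(duplicity --version | awk '{ print $2 }')
#   version_ge "$installed" 0.7.06 || echo "duplicity is too old for the IA option" >&2
```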

 

Configure s3cmd (optional)

Our duplicity-backup.sh script uses s3cmd to perform a few actions, like checking the disk usage of the DEST that we defined in our configuration file. This is an optional step. If you do not configure it, you’ll get an error while executing the script, which you can ignore. Follow the below steps to configure s3cmd

s3cmd --configure

Enter new values or accept defaults in brackets with Enter.

Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3

Access Key []: <Enter your access key here>

Secret Key []: <Enter your secret key here>

Encryption password is used to protect your files from reading

by unauthorized persons while in transfer to S3

Encryption password []: <Enter your GPG key pass phrase>

Path to GPG program [/usr/bin/gpg]:

When using secure HTTPS protocol all communication with Amazon S3

servers is protected from 3rd party eavesdropping. This method is

slower than plain HTTP and can’t be used if you’re behind a proxy

Use HTTPS protocol [Yes]:

New settings:

  Access Key: <Your access key will be displayed here>

  Secret Key: <Secret access key will be displayed here>

  Encryption password: <Encryption password will be displayed here>

  Path to GPG program: /usr/bin/gpg

  Use HTTPS protocol: True

  HTTP Proxy server name:

  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] Y

Test the script

We have placed all our scripts and configuration files under /root/duplicity. The below command will back up our Miscellaneous data to S3.

cd /root/duplicity

./duplicity-backup.sh -c misc-duplicity-backup.conf -b

 

The -c flag denotes the configuration file to be used. The -b flag indicates that we are taking an incremental backup. On the very first run, the -b option takes a full backup since there is no prior data. From the second run onwards, it takes only incremental backups.

If you get any error, you can refer to the logs under ‘/var/log/duplicity’, which is the value given for LOGDIR in the configuration file.

You can see the usage of the duplicity-backup.sh script by running it without any flags.

Note: During a test run, add a small sample directory to INCLIST instead of the original directories in order to save time and money.

 

Verify it got backed up

If you have configured mail on your server, then by now you will have received a mail with the results of duplicity at the mail id which you mentioned in the configuration file. The content of the mail will look something like below

-------- START DUPLICITY-BACKUP SCRIPT --------

Attempting to acquire lock /var/log/duplicity/backup.lock
successfully acquired lock.
----------------[ Duplicity Cleanup ]----------------
-----------[ Source Disk Use Information ]-----------
2.0G /var/www
502M /var/log

---------[ Destination Disk Use Information ]--------
142M Amazon S3 type backend

--------- END DUPLICITY-BACKUP SCRIPT ---------

 

To verify it manually, log into the AWS console and navigate to All Buckets –> <your bucket> –> Misc-backup. You’ll see files with extensions like difftar.gpg, sigtar.gpg and manifest.gpg. For example,

duplicity-full.20160316T102237Z.vol1.difftar.gpg

duplicity-full.20160316T102237Z.vol2.difftar.gpg

duplicity-full.20160316T102237Z.vol3.difftar.gpg

duplicity-full.20160316T102237Z.sigtar.gpg

duplicity-full.20160316T102237Z.manifest.gpg

 

By default the script will upload data in parts if the size exceeds 25 MB. Hence you’ll see naming like vol1, vol2 etc for the difftar files.

 

Wrapper script for DB:

Let us create a single script to combine the steps of mysqldump and duplicity backup for DB. Under /root/duplicity, create a script mysql-duplicity-backup.sh with the below content.

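#!/bin/bash

_logname="db_backup.log$(date +%F_%R)"
_logpath=/var/log/mysql_backup_log

# Make sure the log directory exists
mkdir -p "$_logpath"

# Run the mysqldump script and keep a copy of its output
/root/duplicity/db_dump.sh 2>&1 | tee "$_logpath/$_logname"

sleep 10

# Look for error messages in the dump output, in either case
_error=$(grep -i error "$_logpath/$_logname")

if [ -z "$_error" ]
then
    # The dump is clean, so upload it with duplicity
    /root/duplicity/duplicity-backup.sh -c /root/duplicity/db-duplicity-backup.conf -b
else
    # The dump failed, so mail the log instead (replace the address with your own)
    mail -s "MySQL Backup Failure - $(date)" you@example.com < "$_logpath/$_logname"
fi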

The script will run the duplicity script only if the mysql dump completes without any error. Otherwise, it’ll send a mail with the error report on the mysql dump failure and exit. This way we can be sure that the DB backup is uploaded only after a successful dump.
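Searching the log for error messages depends on their exact wording. A sketch of an alternative is to decide on the exit status of the dump script instead. This assumes db_dump.sh is changed to exit with a non-zero status when any dump fails (as written, its exit status reflects only the last mysqldump call). run_logged is a hypothetical helper name.

```shell
# run_logged LOG_FILE COMMAND [ARGS...] (hypothetical helper)
# Runs COMMAND with its output captured in LOG_FILE, prints the log, and
# returns the exit status of COMMAND rather than that of the logging step.
run_logged() {
    log_file="$1"
    shift
    "$@" >"$log_file" 2>&1
    rc=$?
    cat "$log_file"
    return "$rc"
}

# Usage in the wrapper script:
#   if run_logged "$_logpath/$_logname" /root/duplicity/db_dump.sh; then
#       /root/duplicity/duplicity-backup.sh -c /root/duplicity/db-duplicity-backup.conf -b
#   fi
```

Writing to the log first and printing it afterwards avoids a pipe to tee, whose exit status would otherwise hide that of the dump.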

 

Schedule backup

We have everything in place to schedule our backup through cron. You can schedule the backup as per your requirement (daily, weekly, twice a day etc). The below example will run the cron jobs once every day, at 3 am (for DB) and 3:15 am (for Misc)

cd /etc/cron.d

vi duplicity

 

Add below content to the file

#Cron tab entry for backup

MAILTO=""

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

#Run mysql backup once a day at 3 am

00 3 * * * root if [ -x /root/duplicity/mysql-duplicity-backup.sh ]; then /root/duplicity/mysql-duplicity-backup.sh; fi

#Run misc backup once a day at 3:15 am

15 3 * * * root if [ -x /root/duplicity/duplicity-backup.sh ]; then /root/duplicity/duplicity-backup.sh -c /root/duplicity/misc-duplicity-backup.conf -b; fi

 

If the cron job executes successfully, you’ll be receiving two mails. One for DB and the other for Misc Backup.

We have successfully completed implementing our backup plan. Next, let us see how we can restore the backed up data.

 

Restore backup data

The restoration process is fairly simple in duplicity, and our wrapper script makes it even easier. If you are restoring on a different server, then you need to have the GPG keys which you used for encrypting the data.

For restoring data from a specific point in time, you need to mention the timestamp of that backup. The difftar files are named with the timestamp. For example, a file will be named duplicity-full.20160223T080419Z.vol1.difftar.gpg. Here 20160223T080419Z is the timestamp which you need to pass in order to restore data from that time.
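Pulling the timestamp out of a file name can be scripted. The helper below is a sketch that handles full backup file names like the one shown above; incremental file names carry two timestamps and would need different handling. backup_timestamp is a hypothetical name.

```shell
# backup_timestamp FILE (hypothetical helper)
# Prints the timestamp embedded in a full backup file name, which is the
# second dot-separated field.
backup_timestamp() {
    basename "$1" | cut -d. -f2
}

backup_timestamp duplicity-full.20160223T080419Z.vol1.difftar.gpg
```

The printed value can be passed to the wrapper script with the -t flag.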

Restore entire backup:

# You will be prompted for a restore directory
duplicity-backup.sh -c misc-duplicity-backup.conf --restore

# You can also provide a restore folder on the command line.
duplicity-backup.sh -c misc-duplicity-backup.conf --restore /home/user/restore-folder

Restore a specific file or directory from backup:

# You will be prompted for a file to restore to the current directory
duplicity-backup.sh -c misc-duplicity-backup.conf --restore-file

# Restores the file img/mom.jpg from 20160223T080419Z to the current directory
duplicity-backup.sh -c misc-duplicity-backup.conf -t 20160223T080419Z --restore-file img/mom.jpg

# Restores the file img/mom.jpg to /home/user/i-love-mom.jpg
duplicity-backup.sh -c misc-duplicity-backup.conf -t 20160223T080419Z --restore-file img/mom.jpg /home/user/i-love-mom.jpg

# Restores the directory rel/dir/path to /target/restorepath
duplicity-backup.sh -c db-duplicity-backup.conf -t 20160223T080419Z --restore-dir rel/dir/path /target/restorepath

 

Important: While restoring, it is not recommended to restore the data directly to the destination directories. Instead, restore the data to a different directory and then move it as per the requirement.
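One way to follow this advice is to restore into a staging directory, inspect the result, and only then swap it into place while keeping the old data aside. The helper below is a sketch under that assumption; promote_restore and the paths in the usage comment are hypothetical.

```shell
# promote_restore STAGING TARGET (hypothetical helper)
# Keeps the current TARGET as TARGET.pre-restore.<timestamp>, then moves the
# restored STAGING directory into its place.
promote_restore() {
    staging="$1"
    target="$2"
    if [ ! -d "$staging" ]; then
        echo "staging directory not found: $staging" >&2
        return 1
    fi
    if [ -e "$target" ]; then
        mv "$target" "$target.pre-restore.$(date +%Y%m%dT%H%M%S)" || return 1
    fi
    mv "$staging" "$target"
}

# Usage (example paths; check the restored files before promoting them):
#   duplicity-backup.sh -c misc-duplicity-backup.conf --restore /root/restore-staging
#   promote_restore /root/restore-staging/var/www /var/www
```

Keeping the previous directory under a timestamped name means a bad restore can be undone by moving it back.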

We have successfully implemented our backup and restore solution using Duplicity and Amazon S3.