Title: Greenback Backup

Author: Bob Schmidt

Date: Wed, 08 July 2020 17:13:07 +01:00

Summary: Paul Grenyer demonstrates a DevOps pipeline.

Body:

Why

When Naked Element was still a thing, we used DigitalOcean almost exclusively for our clients’ hosting. For the sorts of projects we were doing, it was the most straightforward and cost-effective solution. DigitalOcean [1] provided managed databases, but there was no facility to automatically back them up. This led us to develop a Python-based program which was triggered once a day to perform the backup, push it to AWS S3 and send a confirmation or failure email.

We used Python due to familiarity, ease of use and low installation dependencies. I’ll demonstrate this later on in the Dockerfile. S3 was used for storage as DigitalOcean did not have their equivalent, ‘Spaces’ [2], available in their UK data centre. The closest is in Amsterdam, but our clients preferred to have their data in the UK.

Fast forward to May 2020 and I’m working on a personal project which uses a PostgreSQL database. I tried to use a combination of AWS [3] and Terraform [4] for the project’s infrastructure (as this is what I am using for my day job), but it just became too much effort to bend AWS to my will and it’s also quite expensive. I decided to move back to DigitalOcean and got the equivalent setup sorted in a day. I could have taken advantage of AWS’ free tier for the database for 12 months, but AWS backup storage is not free and I wanted as much as possible with one provider and within the same virtual private cloud (VPC).

I was back to needing my own backup solution. The new project I am working on uses Docker [5] to run the main service. My Droplet (that’s what DigitalOcean calls its Linux server instances) setup is minimal: non-root user setup, firewall configuration and Docker install. The DigitalOcean Marketplace [6] includes a Docker image, so most of that is done for me with a few clicks. I could have also installed Python and configured a backup program to run each evening. I’d also have had to install the right version of the PostgreSQL client, which isn’t currently in the default Ubuntu repositories, so is a little involved. As I was already using Docker, it made sense to create a new Docker image to install everything and run a Python program to schedule and perform the backups. Of course some might argue that a whole Ubuntu install and configure in a Docker image is a bit much for one backup scheduler, but once it’s done it’s done and can easily be installed and run elsewhere as many times as is needed.

There are two more decisions to note. My new backup solution will use DigitalOcean Spaces, as I’m not bothered about my data being in Amsterdam, and I haven’t implemented an email server yet, so there are no notification emails. This resulted in me jumping out of bed as soon as I woke each morning to check Spaces to see if the backup had worked, rather than just checking for an email. It took two days to get it all working correctly!

What

I reached for Naked Element’s trusty Python backup program, affectionately named Greenback after the arch enemy of Danger Mouse (Green-back up, get it? No, me neither…), but discovered it was too specific and would need some work, though it would serve as a great template to start with.

It’s worth noting that I am a long way from a Python expert. I’m in the ‘reasonable working knowledge with lots of help from Google’ category. The first thing I needed the program to do was create the backup. At this point I was working locally, where I had the correct PostgreSQL client installed. The backup code lives in db_backup.py (see Listing 1).

import os
import subprocess
from datetime import datetime

db_connection_string = os.environ['DATABASE_URL']

class GreenBack:
  def backup(self):
    # Timestamp the file name, e.g. backup_08_07_2020_17_13_07.sql
    datestr = datetime.now().strftime("%d_%m_%Y_%H_%M_%S")
    backup_suffix = ".sql"
    backup_prefix = "backup_"

    destination = backup_prefix + datestr + backup_suffix
    # Pass the arguments as a list so nothing needs splitting on spaces
    backup_command = ['sh', 'backup_command.sh',
                      db_connection_string, destination]
    subprocess.check_output(backup_command)
    return destination

Listing 1

I want to keep anything sensitive out of the code and out of source control, so I’ve brought in the connection string from an environment variable. The method constructs a filename based on the current date and time, calls an external bash script to perform the backup:

  # $1: connection string
  # $2: destination file
  pg_dump "$1" > "$2"

and returns the backup file name. Of course, for Ubuntu I had to make the bash script executable. Next I needed to push the backup file to Spaces, which means more environment variables:

  region=''
  access_key=os.environ['SPACES_KEY']
  secret_access_key=os.environ['SPACES_SECRET']
  bucket_url=os.environ['SPACES_URL']
  backup_folder='dbbackups'
  bucket_name='findmytea'

so that the program can access Spaces, and another method:

class GreenBack:
  ...
  def archive(self, destination):
    session = boto3.session.Session()
    client = session.client('s3',
      region_name=region,
      endpoint_url=bucket_url,
      aws_access_key_id=access_key,
      aws_secret_access_key=secret_access_key)
    # Upload to <backup_folder>/<file name>, then remove the local copy
    client.upload_file(destination, bucket_name,
      backup_folder + '/' + destination)
    os.remove(destination)

It’s worth noting that DigitalOcean implemented the Spaces API to match the AWS S3 API so that the same tools can be used. The archive method creates a session and pushes the backup file to Spaces and then deletes it from the local file system. This is for reasons of disk space and security. A future enhancement to Greenback would be to automatically remove old backups from Spaces after a period of time.
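The future enhancement mentioned above could start from something like the following. This is only a sketch and not part of Greenback: the function name expired_backups and the keep_days parameter are hypothetical, and it assumes the backups keep the file naming scheme from Listing 1. It only decides which keys are old enough to delete, which keeps the logic easy to test without touching Spaces.

```python
from datetime import datetime, timedelta

# Matches the file names built in Listing 1, e.g. backup_08_07_2020_17_13_07.sql
BACKUP_FORMAT = "backup_%d_%m_%Y_%H_%M_%S.sql"


def expired_backups(keys, now, keep_days=30):
    """Return the keys whose embedded timestamp is older than keep_days.

    keep_days is a hypothetical retention period, not a Greenback setting.
    """
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for key in keys:
        # Drop any folder prefix such as dbbackups/
        name = key.rsplit("/", 1)[-1]
        try:
            taken = datetime.strptime(name, BACKUP_FORMAT)
        except ValueError:
            # Not a Greenback backup file, so leave it alone
            continue
        if taken < cutoff:
            expired.append(key)
    return expired
```

The keys could come from the boto3 client's list_objects_v2 call and each expired key could then be passed to delete_object, but wiring that up (including pagination and error handling) is left out here.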

The last thing the Python program needs to do is schedule the backups. A bit of Googling revealed an event loop which can be used to do this (see Listing 2).

class GreenBack:
  last_backup_date = ""

  def callback(self, n, loop):
    today = datetime.now().strftime("%Y-%m-%d")
    if self.last_backup_date != today:
      logging.info('Backup started')
      destination = self.backup()
      self.archive(destination)
      
      self.last_backup_date = today
      logging.info('Backup finished')
    loop.call_at(loop.time() + n, self.callback,
      n, loop)
...

event_loop = asyncio.get_event_loop()
try:
  bk = GreenBack()
  bk.callback(60, event_loop)
  event_loop.run_forever()
finally:
  logging.info('closing event loop')
  event_loop.close()

Listing 2

On startup, callback is executed. It checks last_backup_date against the current date and, if they don’t match, runs the backup and updates last_backup_date. Whether or not a backup ran, the callback method is then added back to the event loop with a one-minute delay. Calling event_loop.run_forever after the initial callback call means the program will wait forever and the process continues.
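The once-a-day check can be separated from the event loop so that it can be exercised without waiting for midnight. The following is a minimal sketch of that idea, not the Greenback code: DailyTrigger, run_ticks and the injectable clock are hypothetical names, and the loop stops after a fixed number of checks purely so that the example terminates.

```python
import asyncio
from datetime import date


class DailyTrigger:
    """Runs a job at most once per calendar day (hypothetical helper)."""

    def __init__(self, job):
        self.job = job
        self.last_run = None  # date of the most recent run, if any

    def tick(self, today):
        """Run the job if it has not yet run on `today`; return True if it ran."""
        if self.last_run == today:
            return False
        self.job()
        self.last_run = today
        return True


def run_ticks(trigger, interval, ticks, clock=date.today):
    """Check the trigger every `interval` seconds, stopping after `ticks` checks."""
    loop = asyncio.new_event_loop()
    remaining = [ticks]

    def callback():
        trigger.tick(clock())
        remaining[0] -= 1
        if remaining[0] > 0:
            # Reschedule ourselves, as Listing 2 does with call_at
            loop.call_later(interval, callback)
        else:
            loop.stop()

    try:
        loop.call_soon(callback)
        loop.run_forever()
    finally:
        loop.close()
```

Passing a fixed clock, for example lambda: date(2020, 7, 8), shows that several checks on the same day run the job only once, while a later date runs it again.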

Now that I had a Python backup program, I needed to create a Dockerfile that would be used to create a Docker image to set up the environment and start the program (Listing 3).

FROM ubuntu:xenial as ubuntu-env
WORKDIR /greenback

RUN apt update
RUN apt -y install python3 wget gnupg sysstat python3-pip

RUN pip3 install --upgrade pip
RUN pip3 install boto3 --upgrade
RUN pip3 install asyncio --upgrade

RUN echo 'deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main' > /etc/apt/sources.list.d/pgdg.list
RUN wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
RUN apt-key add ACCC4CF8.asc

RUN apt update
RUN apt -y install postgresql-client-12

COPY db_backup.py ./
COPY backup_command.sh ./

ENTRYPOINT ["python3", "db_backup.py"]

Listing 3

The Dockerfile starts with an Ubuntu image. This is a bare bones, but fully functioning, Ubuntu operating system. The Dockerfile then installs Python, its dependencies and the Greenback dependencies. Then it installs the PostgreSQL client, including adding the necessary repositories. Following that, it copies the required Greenback files into the image and tells it how to run Greenback.

I like to automate as much as possible so while I did plenty of manual Docker image building, tagging and pushing to the repository during development, I also created a BitBucket Pipeline [7], which would do the same on every check in (see Listing 4).

image: python:3.7.3

pipelines:
  default:
    - step:
        services:
          - docker
        script:
          - IMAGE="findmytea/greenback"
          - TAG=latest
          - docker login --username $DOCKER_USERNAME --password $DOCKER_PASSWORD
          - docker build -t $IMAGE:$TAG .
          - docker push $IMAGE:$TAG
Listing 4

Pipelines, BitBucket’s cloud-based Continuous Integration and Continuous Deployment feature, is familiar with Python and Docker, so it was quite simple to make it log in to Docker Hub [8], build, tag and push the image. To enable the pipeline all I had to do was add the bitbucket-pipelines.yml file to the root of the repository, check in, follow the BitBucket pipeline process in the UI to enable it and then add the build environment variables so the pipeline could log in to Docker Hub. I’d already created the image repository in Docker Hub.

The Greenback image shouldn’t change very often and there isn’t a straightforward way of automating the updating of Docker images from Docker Hub, so I wrote a bash script to do it, deploy_greenback (Listing 5).

sudo docker pull findmytea/greenback
sudo docker kill greenback
sudo docker rm greenback
sudo docker run -d --name greenback --restart always --env-file=.env findmytea/greenback:latest
sudo docker ps
sudo docker logs -f greenback

Listing 5

Now, with a single command I can fetch the latest Greenback image, stop and remove the currently running image instance, install the new image, list the running images to reassure myself the new instance is running and follow the Greenback logs. When the latest image is run, it is named for easy identification, configured to restart when the Docker service is restarted and told where to read the environment variables from. The environment variables are in a local file called .env:

  DATABASE_URL=...
  SPACES_KEY=...
  SPACES_SECRET=...
  SPACES_URL=https://ams3.digitaloceanspaces.com
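Because every one of these values is read from the environment, a missing entry in .env only shows up as a KeyError when the program starts. One option, sketched here as an assumption rather than something Greenback does, is to check all four names up front and report everything that is missing in one message. The names missing_settings and check_settings are hypothetical.

```python
import os

# Variable names taken from the .env file shown above
REQUIRED_SETTINGS = ("DATABASE_URL", "SPACES_KEY", "SPACES_SECRET", "SPACES_URL")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


def check_settings(env=None):
    """Raise a single, readable error listing everything that is missing."""
    missing = missing_settings(env)
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing))
```

Calling check_settings() before the event loop starts would make a misconfigured container fail immediately with a clear log line.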

And that’s it! Greenback is now running in a Docker image instance on the application server and backs up the database to Spaces just after midnight every night.

Finally

While Greenback isn’t a perfect solution, it works, is configurable, is a good platform for future enhancements and should require minimal configuration to be used with other projects in the future.

Greenback is checked into a public BitBucket repository and the full code can be found here: https://bitbucket.org/findmytea/greenback/.

The Greenback Docker image is in a public repository on Docker Hub and can be pulled with Docker: docker pull findmytea/greenback

References

[1] DigitalOcean: https://www.digitalocean.com/

[2] DigitalOcean Spaces: https://www.digitalocean.com/products/spaces/

[3] AWS: https://aws.amazon.com/

[4] Terraform: https://www.terraform.io/

[5] Docker: https://www.docker.com/

[6] DigitalOcean Marketplace: https://marketplace.digitalocean.com/

[7] BitBucket Pipelines: https://bitbucket.org/product/features/pipelines

[8] Docker Hub: https://hub.docker.com/

Paul Grenyer is a husband, father, software consultant, author, testing and agile evangelist.
