Automating Home Lab Backups with Borg


Backing up your data is one of the most important pillars of your digital presence, right alongside security and privacy. If you care about the data, it should be backed up. Period.

For better or for worse, most people have outsourced their backup solutions to cloud providers like Microsoft OneDrive, Apple iCloud, or Google Drive. However, if you're committed to your own data sovereignty, then you need to create your own backup solution.

🗄️
"Is my backup solution overkill? I sure hope so."

There are many ways to accomplish this, and after trying several, I've landed on Borg Backup as my favorite.

Step 1: Define Your Data

When designing a backup solution, the first thing you need to consider is what is important to back up. It might be tempting to back up entire images of your drives. However, I'd wager that a good percentage of the data on your drive(s) is not irreplaceable.

Operating systems, downloaded programs, and anything that can easily be re-downloaded are generally not worth backing up. These are replaceable. Personal documents, photos, configuration files, databases, and anything you have created yourself are likely unique and irreplaceable. This is what needs to be backed up.

📂
"The first goal is to consolidate irreplaceable data into as few parent directories as possible."

So the first goal is to consolidate irreplaceable data into as few parent directories as possible. This will make both backing up and restoring data much easier.

Since the majority of my services run in Docker containers, consolidating important data is very straightforward. Any data inside a volume needs to be backed up. This includes documents, media, configuration files, and databases.

The volumes are also mapped to the hosts, so I can put all files that need backing up into a single parent directory. This parent directory is located on the host at /home/{user}/{hostname}-data.
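As a rough sketch (the service name and container path here are made up, not from my actual stack), mapping a container's volume under that parent directory looks something like this:

docker run -d --name notes-app \
  -v /home/{user}/{hostname}-data/notes-app:/app/data \
  notes-app:latest

Anything the container writes to /app/data then lands under the single directory on the host that gets backed up.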

From here I can back up each server's data to a larger NAS, S3 bucket, or data lake. I also back up that storage server to another redundant offsite location for more resiliency.

Diagram of server backup model. Three servers are shown to be backing up to a central storage server, which backs itself up to an offsite backup.
Simplified model of my backup solution, following the 3-2-1 backup rule.

Step 2: Set up Borg Backup

Borg Backup is a powerful and performant open source backup tool. However, if you aren't comfortable on the command line, you may want to try Vorta, a GUI for Borg Backup on Linux, macOS, and Windows (via WSL). It has all the same features with a simple graphical interface.

Important: The rest of this setup requires a basic knowledge of bash and ssh.

On Linux, Borg Backup can be installed with most distributions' package managers (e.g. sudo apt update && sudo apt install borgbackup), but check the installation documentation for your specific operating system.

Once installed, you need to decide how your devices will communicate with each other. I decided to use ssh since it's the simplest, but you can also mount your backup directories as network drives on the host.
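For reference, the network-drive route could look something like this with sshfs (just a sketch, not what I ended up using, and sshfs may need to be installed separately):

sshfs {user}@{backup-server}:/mnt/backups /mnt/remote-backups

With that in place, Borg would write to /mnt/remote-backups as if it were a local path.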

Step 3: Configure SSH

Since Borg will be connecting via ssh, and I don't want to use passwords for automation, I set up proper key pairs for each server. This is done by generating ssh keys on the host that needs to be backed up. Here is how I accomplished that:

Note: From now on, I'll refer to the server being backed up as the "host" and the server receiving the backup as the "backup server".

Run:

ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519 -C "{client-hostname}"

This creates a fairly secure Edwards-curve public/private key pair that ssh will use to authenticate.

Note: This command also works on Windows in the new Windows Terminal, since recent versions of Windows ship with a built-in OpenSSH client.

Next, on the same host run:

cat ~/.ssh/id_ed25519.pub

This prints the public key so you can copy it over to the backup server. You can also just open the file in your text editor of choice.

Now, on the backup server, open your authorized_keys file:

nano ~/.ssh/authorized_keys

Paste in the public key from the previous step, and save.

Note: I prefer micro as my text editor, but I'm suggesting nano as it's usually installed by default on most distributions. Obviously you can use whichever text editor you prefer.
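As an alternative to pasting by hand, ssh-copy-id can append the key for you, assuming password authentication is still enabled on the backup server at this point:

ssh-copy-id -i ~/.ssh/id_ed25519.pub {user}@{backup-server}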

Before proceeding, I'd recommend testing an ssh connection from the host to the backup server. This allows you to ensure ssh is working before attempting to use Borg, and it also adds the backup server's public key to the known_hosts file on the host.
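Something as simple as this works:

ssh {user}@{backup-server}

Accept the fingerprint when prompted, make sure you land in a shell on the backup server, then exit.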

Step 4: Initialize the Repository

Now that your data is in place and your servers are connected, it's time to initialize the Borg repository.

This is where I'd highly recommend reading the Borg Backup Quick Start guide. It outlines Borg's commands and options way better than I ever could. Here's what I ran:

borg init --encryption=repokey {username}@{backup-server}:/mnt/backups/{host-name}-data

You will be prompted for a passphrase to unlock the encryption keys.

Note: On my backup server, I have some large hard drives in RAID mounted at /mnt/backups. You may want to use a different file path depending on your storage setup.

This creates a new Borg repository on the backup server at /mnt/backups/{host-name}-data with the encryption key stored on the backup server.
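Because the repokey lives inside the repository itself, I'd also suggest exporting a copy of the key somewhere safe in case the repository ever gets corrupted. Borg has a built-in command for that (the output filename here is just my choice):

borg key export {username}@{backup-server}:/mnt/backups/{host-name}-data ~/{host-name}-borg-key.txt

Keep that file (and your passphrase) somewhere outside the backup chain, like a password manager.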

The last step before we automate is to create an initial backup. Depending on how much data you are backing up, this can take a long time:

borg create {user}@{backup-server}:/mnt/backups/{host-name}-data::{snapshot-name} /home/{user}/{host-name}-data --progress

There's a lot here, so let's break down the borg create command.

{user}@{backup-server}
  The username and hostname to which Borg will be connecting. This format also lets Borg know to use an ssh connection.

:/mnt/backups/{host-name}-data
  The destination directory on the backup server. This should be the path of the repository you initialized earlier.

::{snapshot-name}
  Kinda self-explanatory, but this is the name of the snapshot that will be created. I like to use the current date and time in %Y-%m-%d_%H-%M-%S format.

/home/{user}/{host-name}-data
  The directory on the host that's being backed up.

--progress
  Tells Borg to display the backup progress in real time. It's optional, but it's nice to see progress, especially when backups take a long time.
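Putting it together with concrete values (the server and host names here are made up for illustration), a first snapshot might look like:

borg create jesse@backupserver:/mnt/backups/myhost-data::2024-01-15_02-00-00 /home/jesse/myhost-data --progress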
Regarding Permissions: You may need to execute borg as sudo in order to read some files, especially databases. This can present problems when we start to automate, so I recommend taking the time to fix the permissions of your files first.

This may require tweaking which user your services are running as and/or adding your current user to some groups that have read access to these files. Configuring permissions correctly can be a huge pain, but it is much more secure in the long run.
🔐
"Configuring permissions correctly can be a huge pain, but it's much more secure in the long run."

Assuming there were no errors, you now have a backup of your data and a base snapshot to build upon. The nice thing about Borg is that snapshots are additive and deduplicated, so each new snapshot only stores the changes since the last one. This saves on storage space and makes taking new snapshots a lot faster.
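If you want to double-check, listing the repository's archives should now show your new snapshot:

borg list {user}@{backup-server}:/mnt/backups/{host-name}-data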

Step 5: Automate

The problem with manual backups is that they rely on someone to remember to do them. Personally, I prefer to automate backups and manually check them once in a while.

🕸️
"The problem with manual backups is that they rely on someone to remember to do them."

If you've automated tasks on Linux before, you probably know where this is going. I created a bash script triggered by cron. Starting with the bash script, it looks like this:

nano ~/backup.sh

#!/bin/bash

# Set your variables. You can reduce the DELAY to speed up the script.
DELAY=0.5
DATE=$(date +%Y-%m-%d_%H-%M-%S)
USER=jesse
HOST=hostname
BACKUP_SERVER=backupservername
DATA_PATH="/home/$USER/$HOST-data"
BACKUP_PATH="/mnt/backups/$HOST-data"

echo "Running backups..."
sleep $DELAY
echo "Existing snapshots:"
echo "===================="
borg list "$USER@$BACKUP_SERVER:$BACKUP_PATH"
echo "===================="
sleep $DELAY
echo "Creating new snapshot for $DATE"
borg create "$USER@$BACKUP_SERVER:$BACKUP_PATH::$DATE" "$DATA_PATH" --progress && \
echo "Done."
sleep $DELAY
echo "Pruning..."
borg prune --list --keep-hourly=24 --keep-daily=7 --keep-weekly=3 --keep-monthly=6 "$USER@$BACKUP_SERVER:$BACKUP_PATH" && \
echo "Done."
sleep $DELAY
echo "Compacting Data..."
borg compact -v "$USER@$BACKUP_SERVER:$BACKUP_PATH" && \
echo "Done."
sleep $DELAY
echo "All backup tasks complete."

Feel free to use this script if you like; just make sure to change the variables to match your setup. Here's a description of those:

DELAY
  Deliberately slows down the script to make it more human readable. You can set this to a different number to speed up or slow down the output text. Numbers are in seconds.

DATE
  Outputs the date and time in YYYY-mm-dd_HH-MM-SS format. This helps identify exactly when each snapshot is created.

USER
  The user name executing the backup. This user must exist on both the host and the backup server, and must have read access to all files being backed up.

HOST
  The network host name of your host, used to facilitate the SSH connection.

BACKUP_SERVER
  The network host name of your backup server, used to facilitate the SSH connection.

DATA_PATH
  Absolute file path of the files being backed up.

BACKUP_PATH
  Absolute file path of the Borg repository on the backup server.
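Before handing the script off to cron, it's worth running it once by hand to confirm the ssh connection and permissions behave as expected:

bash ~/backup.sh

You could also mark it executable with chmod +x ~/backup.sh, though the cron entry below calls it through bash anyway.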

This script is designed to be human friendly, so there's a lot of extra output to fancy things up. If you're never going to run this manually, you may want to take out a lot of the extra fluff. In summary, the script does the following:

  1. Sets variables.
  2. Outputs a list of all the snapshots currently in the repository prior to the current backup run.
  3. Creates a new snapshot.
  4. Prunes any snapshots that don't meet the retention policy. I have the retention policy keeping a maximum of:
    1. 1 snapshot per hour for the last 24 hours.
    2. 1 snapshot per day for the last 7 days.
    3. 1 snapshot per week for the last 3 weeks.
    4. 1 snapshot per month for the last 6 months.
  5. Runs a compact on the repository, which frees up space on the backup server by removing repository data that is no longer referenced by any of the retained snapshots.

This all comes together to get a space-efficient versioned backup of all the data I care about. The last (and probably easiest) step is adding a cron job.

I ran crontab -e and, in my text editor of choice, added the following line to the bottom:

0 * * * * /bin/bash /home/{user}/backup.sh >> /home/{user}/backup.log 2>&1

This runs a backup every hour on the hour, which lines up with my retention policy. It also sends the output to a backup.log file in the same directory as the script.

You don't have to back up this often, especially if you're backing up a lot of data. I recommend backing up at least once a week if you touch the data daily. Check out crontab-generator.org to help with creating the right cron configuration for your situation.
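For example, a weekly run every Sunday at 03:00 (the timing here is just an illustration) would look like:

0 3 * * 0 /bin/bash /home/{user}/backup.sh >> /home/{user}/backup.log 2>&1

If you back up less often, you'd probably also want to relax the --keep-hourly part of the retention policy to match.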

Conclusion

That's it! Now all server data is backing up automatically with plenty of versions to fall back on in case of an incident. If you remember my diagram, I repeated this process for each of my servers and once more to an offsite backup. Is my backup solution overkill? I sure hope so. I'd rather have too many backups than too few.