Setting up a GraphDB cluster

Dec 6, 2019

Once you start getting serious about running Ontotext GraphDB, you are going to want to upgrade to the Enterprise Edition, and set up clusters. Not only will this allow you to scale your concurrent query capacity, but it will also add resilience and more flexible backup operations.

Unfortunately the official documentation about setting up clusters is about as clear as mud, so here’s a simple guide to getting up and running.

Apparently this is what a GraphDB cluster looks like (https://www.ontotext.com/products/graphdb/)

Prerequisites

In this guide, we’re talking about Ontotext GraphDB EE 9.0 running on Ubuntu Linux 18 (on Azure).

We’ve provisioned two VMs, one for a master, one for a worker, and attached and mounted an appropriately sized data disk to store our repository files.

The essential difference between the master and worker VMs is the size of the storage attached: the workers need a fast data disk of sufficient size to store all your triples, whereas the master needs minimal storage. See the Requirements reference.

The root storage directory of the database, referred to as the graphdb.home directory, contains the conf, data, logs and work directories, of which at least data should be on the fast data-disk. In this document, to avoid ambiguity, the master home is on our data-disk under /opt/graphdb-master-home and the worker home under /opt/graphdb-worker-home.

The distribution has been unarchived into /usr/src/graphdb-ee-9.0.0, and we’ve installed OpenJDK 11, and started the database, roughly following the first section of the Quick start guide.

There’s really no difference in this procedure from getting a single node of any GraphDB edition up and running — except that you will need to install your EE licence on the workers (and Workbench) to enable clustering.

Our real setup is of course all terraformed+ansible’d and running under systemd because we are a proper DevOps team! 🤗

The master node in this guide listens on port 8080.
The worker node listens on port 7080.
The GraphDB Workbench is listening on port 6080 on the master node.

You can change all of these in conf/graphdb.properties of course, but this guide uses the settings above to avoid ambiguity (note that these are not the default GraphDB settings).
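As a rough sketch of how those ports could be pinned per node, the helper below builds the -D options for one node. It assumes that graphdb.connector.port is the property controlling the HTTP port (worth checking against the configuration reference for your version); the function name node_opts is made up for illustration.

```shell
# Sketch: build the -D options that pin a node's home directory and HTTP port.
# Assumption: graphdb.connector.port is the property that sets the HTTP port.
node_opts() {
  local home="$1" port="$2"
  printf '%s %s\n' "-Dgraphdb.home=${home}" "-Dgraphdb.connector.port=${port}"
}

node_opts /opt/graphdb-master-home 8080   # master
node_opts /opt/graphdb-worker-home 7080   # worker
```

The same values could equally be written into each node's conf/graphdb.properties; the options form is just handy when the start-up command is generated by a script.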

Before proceeding, check you can connect to the GraphDB Workbench using your browser…
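The same check can be roughed out from the command line. This sketch asks each node for its repository list and prints the HTTP status it gets back. The ports are the ones used in this guide, 10.7.3.6 is the example worker address used later on, and the function names are hypothetical.

```shell
# Sketch: report the HTTP status each node returns for its repository list.
# Addresses and ports follow this guide; adjust them to your own setup.
node_urls() {
  printf '%s\n' \
    "master http://localhost:8080" \
    "worker http://10.7.3.6:7080" \
    "workbench http://localhost:6080"
}

check_nodes() {
  node_urls | while read -r name url; do
    # -o /dev/null discards the body; -w prints only the status code.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url/repositories")"
    echo "$name $url -> HTTP $code"
  done
}

# Run from the master node:
#   check_nodes
```

A node that is down or firewalled will time out rather than return a normal status, which is usually enough to tell you where to look.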

Repository types

GraphDB Repositories come in two flavours: worker repositories that store all the data (i.e. the triples) and run SPARQL queries, and master repositories that act as a coordination point, but do not contain triples.

When you create a clustered repository, you must create instances of both types, hosted on the corresponding nodes.

Setting up a clustered repository

Essentially for each repository you want to be clustered, you set up:

a master repository on a designated master node.
a worker repository on each of the designated worker nodes you want to be in the cluster.

You then connect the master repository to the worker repositories: the master maintains a transaction log for writes, to ensure that it has written the same data to all the worker repositories, and will distribute queries across workers. There is no sharding of data in GraphDB, so each worker has a copy of the entire repository, and each SPARQL query is executed in its entirety on just one of the workers.

A minimal ‘cluster’: M (master) repo1, W (worker) repo1

The repository IDs on the master and the worker nodes can be different — they just need to be linked together in config to function as a cluster — however it’s almost certainly a good idea to name them the same thing, or at least something derivative!

⚠️ Be careful not to create worker repositories on the master node, as the triple data will also be stored there, and masters are usually only sized to have enough resources to hold the transaction log, which is tiny in comparison to the data repositories.

You can create both types of repositories using the GraphDB Workbench, and connect them together using the Setup/Cluster management panel.

Alternatively, you can use the RDF4J console, which allows programmatic set-up. We’ll cover the RDF4J console first, but feel free to skip to the Workbench instructions below that if this blows your mind…

Set up a repository using the RDF4J console

GraphDB Reference: Creating a repository: Using the RDF4J console

We are going to create a clustered repository with ID repo-1, using the RDF4J console.

ℹ️ You need to set the graphdb.home directory using the GDB_JAVA_OPTS environment variable as shown below when starting the RDF4J console.
The console command needs to be executed by a user with write-permission to the graphdb.home directory, or it may not start up.
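A quick pre-flight check along these lines can save some head-scratching. It is only a sketch: the function name require_writable is invented here, and it tests the user running it, so run it as the same user that will start the console.

```shell
# Sketch: fail early if the current user cannot write to a graphdb.home directory.
require_writable() {
  local dir="$1"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    return 0
  fi
  echo "not writable by $(id -un): $dir" >&2
  return 1
}

# Example (as the user that will start the console):
#   require_writable /opt/graphdb-master-home && echo "ok to start the console"
```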

Master repository set up

To create the master repository, shell into the master node and run:

cd /opt/graphdb-master-home
sudo -u graphdb GDB_JAVA_OPTS=-Dgraphdb.home=$(pwd) /usr/src/graphdb-ee-9.0.0/bin/console
connect http://localhost:8080
create master
quit

This uses /usr/src/graphdb-ee-9.0.0/configs/templates/master.ttl as a template.

All you need to enter is an ID for the new repository, e.g. repo-1 and a human-readable name.

The master repository settings are stored in the repo as a Java-style .properties file, e.g. /opt/graphdb-master-home/data/repositories/repo-1/cluster.properties.

Worker repository set up

Now on our worker node(s), we run the console again, but create a worker repository:

cd /opt/graphdb-worker-home
sudo -u graphdb GDB_JAVA_OPTS=-Dgraphdb.home=$(pwd) /usr/src/graphdb-ee-9.0.0/bin/console
connect http://localhost:7080
create worker
quit

This uses /usr/src/graphdb-ee-9.0.0/configs/templates/worker.ttl as a template.

When you run the create command for the worker, you have to enter a bunch of parameters (as defined in the template). You can probably accept the defaults for everything except the repository ID (e.g. repo-1) and the human-readable name. See the configuring a repository documentation for details of what the options do, and note that some of them cannot be changed after repository creation.

The repository’s settings are stored in the repo as a Turtle .ttl file, e.g. /opt/graphdb-worker-home/data/repositories/repo-1/config.ttl.

Connect master to workers

We are now going to connect the master repository repo-1 on the master node (localhost in this example) to the worker repository repo-1 on a worker node (10.7.3.6 in this example).

On the master node, we'll use curl to call the Jolokia addClusterNode operation, passing the URL of the worker repository (http://10.7.3.6:7080/repositories/repo-1):

curl -H 'content-type: application/json' -d '{"type":"exec","mbean":"ReplicationCluster:name=ClusterInfo\/repo-1","operation":"addClusterNode","arguments":["http://10.7.3.6:7080/repositories/repo-1",0,true]}' http://localhost:8080/jolokia
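If you have more than one worker, the call above can be wrapped in a couple of small functions. This is a sketch rather than an official recipe: the function names are invented, the second worker address in the example is hypothetical, and the trailing arguments (0, true) are simply copied from the call above, since their meaning is not covered in this guide.

```shell
# Sketch: build the Jolokia request body that attaches one worker repository
# to a master repository. The trailing arguments (0, true) are copied verbatim
# from the call in this guide. An unescaped "/" in the MBean name is
# equivalent to the "\/" form in JSON.
add_worker_payload() {
  local master_repo="$1" worker_url="$2"
  printf '{"type":"exec","mbean":"ReplicationCluster:name=ClusterInfo/%s","operation":"addClusterNode","arguments":["%s",0,true]}\n' \
    "$master_repo" "$worker_url"
}

# POST the payload to the master's Jolokia endpoint.
add_worker() {
  local master_base="$1" master_repo="$2" worker_url="$3"
  curl -s -H 'content-type: application/json' \
    -d "$(add_worker_payload "$master_repo" "$worker_url")" \
    "$master_base/jolokia"
}

# Example: attach two workers (the second address is hypothetical):
#   for w in http://10.7.3.6:7080 http://10.7.3.7:7080; do
#     add_worker http://localhost:8080 repo-1 "$w/repositories/repo-1"
#   done
```

Keeping the payload builder separate from the curl call makes it easy to print and eyeball the JSON before sending anything to the master.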

The cluster properties are stored in the master node’s repository, e.g. /opt/graphdb-master-home/data/repositories/repo-1/cluster.properties.

We now have a cluster of one master and one worker: we can add additional workers for repo-1 by following the last two steps again. If we send queries to the master node (on port 8080), it will propagate them to one of the connected worker node(s), and return the result.
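To see that routing in action, a query can be sent to the master repository using the standard SPARQL protocol over HTTP. The sketch below assumes the repository endpoint follows the same /repositories/<id> pattern as the worker URL used above; the function names are made up for this example.

```shell
# Sketch: send a SPARQL SELECT to the master repository, which hands it to a worker.
repo_endpoint() {
  local base="$1" repo="$2"
  printf '%s/repositories/%s\n' "$base" "$repo"
}

sparql_select() {
  local endpoint="$1" query="$2"
  # -G turns the url-encoded data into a query string (a SPARQL protocol GET).
  curl -s -G "$endpoint" \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$query"
}

# Example, run on the master node:
#   sparql_select "$(repo_endpoint http://localhost:8080 repo-1)" \
#     'SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }'
```

Your applications would point at the same master endpoint, so a quick count like this is a reasonable smoke test before switching them over.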

Set up a repository using the GraphDB Workbench

The GraphDB Workbench has its own configuration, and doesn’t immediately know about the master database (which just happens to be running on the same node in our case), so it needs to be pointed to all the nodes in the cluster, using the Setup/Repositories panel, in order to be able to interact with them.

You will need to create a Remote Repository Location (reference) pointing at the master (in this guide, http://localhost:8080 because we’re running the Workbench on the same node as the master) and one for each of the worker node(s) (e.g. the worker on http://10.7.3.6:7080 in this guide).

Repository locations: the active location is shown with a connected plug

The active repository location (shown by the connected plug) is a property of the Workbench: it is the location where all operations, such as repository creation, take place. Be sure to activate the correct location before creating a repository, i.e. switch the location to a worker node to create a worker repository, and set it back to the master to create the master repository.

In the Workbench the master configuration is called “GRAPHDB-EE Master” and the worker is “GRAPHDB-EE Worker”, which you will see in the Type drop-down in the Create repository panel:

Create master repository, showing GRAPHDB-EE-Master type selected — Create the Master repo-1 with the master location active

Create worker repository, showing GRAPHDB-EE-Worker type selected — Create the Worker repo-1 with the worker location active

You can connect master and worker repositories made in the Workbench or the RDF4J console using drag-and-drop in the Setup/Cluster management panel rather than using the Jolokia API: simply drag a connection from the master repository to the corresponding worker repository. If you have multiple replicas of a worker repo (as you would in production), connect the master for that repo to all the worker replicas.

Crossed-through worker repository — A repository lacking a Location

If workers in the Setup/Cluster management panel show as “OFF” (crossed through), and the status pop-up says something like ‘the repository has been deleted’, but you know you have already connected the workers using the Jolokia API, it’s probably because you haven’t set up a Repository Location for them in the Workbench. Do that and the workers should show as “ON”. Essentially, the master’s cluster config has the worker repo URLs, but the Workbench doesn’t know about those locations yet, so it complains.
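When the Workbench and the master seem to disagree, it can help to ask the master directly. The sketch below uses Jolokia's generic "list" request to dump whatever MBeans are registered in the ReplicationCluster domain. The function names are hypothetical, and the exact attributes and operations you get back will depend on your GraphDB version.

```shell
# Sketch: ask Jolokia to list the MBeans in the ReplicationCluster domain,
# which shows what the master itself has registered for each clustered repo.
jolokia_list_payload() {
  local domain="$1"
  printf '{"type":"list","path":"%s"}\n' "$domain"
}

list_cluster_beans() {
  local master_base="$1"
  curl -s -H 'content-type: application/json' \
    -d "$(jolokia_list_payload ReplicationCluster)" \
    "$master_base/jolokia"
}

# Example, run on the master node:
#   list_cluster_beans http://localhost:8080
```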

Summary

Clustering GraphDB is actually pretty simple; it’s just the documentation that makes it look complicated!

Add the master and worker node Locations to the Workbench (if using the Workbench).
Create worker repositories on each of the worker nodes.
Create a corresponding master repository on the master node.
Connect the master repository to each of the worker repositories.
Configure your applications to send your SPARQL queries to the master node.

If you want to use multiple masters for more resilience, the setup should now be fairly obvious — I hope!