Setting up a GraphDB cluster
Once you start getting serious about running Ontotext GraphDB, you are going to want to upgrade to the Enterprise Edition, and set up clusters. Not only will this allow you to scale your concurrent query capacity, but it will also add resilience and more flexible backup operations.
Unfortunately the official documentation about setting up clusters is about as clear as mud, so here’s a simple guide to getting up and running.
Prerequisites
In this guide, we’re talking about Ontotext GraphDB EE 9.0 running on Ubuntu Linux 18 (on Azure).
We’ve provisioned two VMs, one for a master, one for a worker, and attached and mounted an appropriately sized data disk to store our repository files.
The essential difference between the master and worker VMs is the size of the storage attached: the workers need a fast data disk of sufficient size to store all your triples, the master needs minimal storage. See the Requirements reference.
The root storage directory of the database, referred to as the graphdb.home directory, contains the conf, data, logs and work directories, of which at least data should be on the fast data-disk. In this document, to avoid ambiguity, the master home is on our data-disk under /opt/graphdb-master-home and the worker home under /opt/graphdb-worker-home.
The distribution has been unarchived into /usr/src/graphdb-ee-9.0.0, and we’ve installed OpenJDK 11, and started the database, roughly following the first section of the Quick start guide.
There’s really no difference in this procedure from getting a single node of any GraphDB edition up and running — except that you will need to install your EE licence on the workers (and Workbench) to enable clustering.
Our real setup is of course all terraformed+ansible’d and running under systemd because we are a proper DevOps team! 🤗
- The master node in this guide listens on port 8080.
- The worker node listens on port 7080.
- The GraphDB Workbench is listening on port 6080 on the master node.
You can change all of these in conf/graphdb.properties of course, but this guide uses the settings above to avoid ambiguity (note that these are not the default GraphDB settings).
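As a rough sketch of how a port change could be scripted, the helper below sets a key in a Java-style properties file. The function name set_property is hypothetical, and graphdb.connector.port is assumed to be the property that controls the HTTP port, so check it against the configuration reference for your GraphDB version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set (or replace) one key in a Java-style .properties file.
# Usage: set_property <file> <key> <value>
set_property() {
  local file="$1" key="$2" value="$3"
  touch "$file"
  # Copy every line except an existing definition of the key (exact name match).
  awk -F'=' -v k="$key" \
    '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' \
    "$file" > "${file}.tmp"
  # Append the new definition and swap the file into place.
  printf '%s = %s\n' "$key" "$value" >> "${file}.tmp"
  mv "${file}.tmp" "$file"
}

# Example (not run here); the property name is an assumption:
# set_property /opt/graphdb-worker-home/conf/graphdb.properties graphdb.connector.port 7080
```

The node will most likely need a restart before a changed port takes effect.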
Before proceeding, check you can connect to the GraphDB Workbench using your browser…
Repository types
GraphDB Repositories come in two flavours: worker repositories that store all the data (i.e. the triples) and run SPARQL queries, and master repositories that act as a coordination point, but do not contain triples.
When you create a clustered repository, you must create instances of both types, hosted on the corresponding nodes.
Setting up a clustered repository
Essentially for each repository you want to be clustered, you set up:
- a master repository on a designated master node.
- a worker repository on each of the designated worker nodes you want to be in the cluster.
You then connect the master repository to the worker repositories: the master maintains a transaction log for writes, to ensure that it has written the same data to all the worker repositories, and will distribute queries across workers. There is no sharding of data in GraphDB, so each worker has a copy of the entire repository, and each SPARQL query is executed in its entirety on just one of the workers.
The repository IDs on the master and the worker nodes can be different — they just need to be linked together in config to function as a cluster — however it’s almost certainly a good idea to name them the same thing, or at least something derivative!
⚠️ Be careful not to create worker repositories on the master node, as the triple data will also be stored there, and masters are usually only sized to have enough resources to hold the transaction log, which is tiny in comparison to the data repositories.
You can create both types of repositories using the GraphDB Workbench, and connect them together using the Setup/Cluster management panel.
Alternatively, you can use the RDF4J console, which allows programmatic set-up. We’ll cover the RDF4J console first, but feel free to skip to the Workbench instructions below that if this blows your mind…
Set up a repository using the RDF4J console
GraphDB Reference: Creating a repository: Using the RDF4J console
We are going to create a clustered repository with ID repo-1, using the RDF4J console.
ℹ️ You need to set the graphdb.home directory using the GDB_JAVA_OPTS environment variable as shown below when starting the RDF4J console. The console command needs to be executed by a user with write-permission to the graphdb.home directory, or it may not start up.
Master repository set up
To create the master repository, shell into the master node and run:
cd /opt/graphdb-master-home
sudo -u graphdb GDB_JAVA_OPTS=-Dgraphdb.home=$(pwd) /usr/src/graphdb-ee-9.0.0/bin/console
connect http://localhost:8080
create master
quit
This uses /usr/src/graphdb-ee-9.0.0/configs/templates/master.ttl as a template.
All you need to enter is an ID for the new repository, e.g. repo-1, and a human-readable name.
The master repository settings are stored in the repo as a Java-style .properties file, e.g. /opt/graphdb-master-home/data/repositories/repo-1/cluster.properties.
Worker repository set up
Now on our worker node(s), we run the console again, but create a worker repository:
cd /opt/graphdb-worker-home
sudo -u graphdb GDB_JAVA_OPTS=-Dgraphdb.home=$(pwd) /usr/src/graphdb-ee-9.0.0/bin/console
connect http://localhost:7080
create worker
quit
This uses /usr/src/graphdb-ee-9.0.0/configs/templates/worker.ttl as a template.
You have to enter a bunch of parameters (as defined in the template) when you run the create command for the worker. You can probably accept the defaults for everything except the repository ID (e.g. repo-1) and the human-readable name, but see the configuring a repository documentation for details of what the options do, and note that some of them cannot be changed after repository creation.
The repository’s settings are stored in the repo as a Turtle .ttl file, e.g. /opt/graphdb-worker-home/data/repositories/repo-1/config.ttl.
Connect master to workers
We are now going to connect master repo repo-1 from the master node (localhost in this example) to worker repo repo-1 on a worker node (10.7.3.6 in this example):
On the master node, we’ll use the curl tool to call the Jolokia addClusterNode operation, giving the URL of the worker repository on the worker node (http://10.7.3.6:7080/repositories/repo-1):
curl -H 'content-type: application/json' -d "{\"type\":\"exec\",\"mbean\":\"ReplicationCluster:name=ClusterInfo\/repo-1\",\"operation\":\"addClusterNode\",\"arguments\":[\"http://10.7.3.6:7080/repositories/repo-1\",0,true]}" http://localhost:8080/jolokia
The cluster properties are stored in the master node’s repository, e.g. /opt/graphdb-master-home/data/repositories/repo-1/cluster.properties.
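The escaped JSON in that curl call is easy to get wrong by hand, so here is a minimal sketch that wraps it in two shell functions. The function names add_node_payload and add_worker are hypothetical. The mbean name, the operation and the trailing arguments (0, true) are copied unchanged from the command above, and their meaning is not covered in this guide.

```shell
#!/usr/bin/env bash
# Build the Jolokia "exec" request body for addClusterNode.
# Usage: add_node_payload <master-repo-id> <worker-repo-url>
# Note: the "\/" in the original command is just a JSON-escaped "/".
add_node_payload() {
  local repo_id="$1" worker_url="$2"
  printf '{"type":"exec","mbean":"ReplicationCluster:name=ClusterInfo/%s","operation":"addClusterNode","arguments":["%s",0,true]}' \
    "$repo_id" "$worker_url"
}

# POST the payload to the master node's Jolokia endpoint.
# Usage: add_worker <master-base-url> <master-repo-id> <worker-repo-url>
add_worker() {
  local master_url="$1" repo_id="$2" worker_url="$3"
  curl -sS -H 'content-type: application/json' \
    -d "$(add_node_payload "$repo_id" "$worker_url")" \
    "${master_url}/jolokia"
}

# Example (not run here):
# add_worker http://localhost:8080 repo-1 http://10.7.3.6:7080/repositories/repo-1
```

With something like this in place, attaching a further worker is one more add_worker call with that worker's repository URL.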
We now have a cluster of one master and one worker: we can add additional workers for repo-1 by following the last two steps again. If we send queries to the master node (on port 8080), it will propagate them to one of the connected worker node(s), and return the result.
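To check that queries really do go through the master, you can send one with curl. This is a sketch that assumes the master exposes the standard RDF4J REST layout (/repositories/<id>, with the query passed as a query parameter). The function names repo_endpoint and run_query are hypothetical.

```shell
#!/usr/bin/env bash
# Build the repository endpoint URL on a node.
# Usage: repo_endpoint <base-url> <repo-id>
repo_endpoint() {
  local base_url="${1%/}" repo_id="$2"   # strip one trailing slash, if any
  printf '%s/repositories/%s' "$base_url" "$repo_id"
}

# Send a SPARQL query to the endpoint and ask for JSON results.
# Usage: run_query <base-url> <repo-id> <sparql-query>
run_query() {
  local base_url="$1" repo_id="$2" query="$3"
  curl -sS -G \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=${query}" \
    "$(repo_endpoint "$base_url" "$repo_id")"
}

# Example (not run here): query the master node from this guide
# run_query http://localhost:8080 repo-1 'SELECT * WHERE { ?s ?p ?o } LIMIT 5'
```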
Set up a repository using the GraphDB Workbench
The GraphDB Workbench has its own configuration, and doesn’t immediately know about the master database (which just happens to be running on the same node in our case), so it needs to be pointed to all the nodes in the cluster, using the Setup/Repositories panel, in order to be able to interact with them.
You will need to create a Remote Repository Location (reference) pointing at the master (in this guide, http://localhost:8080 because we’re running the Workbench on the same node as the master) and one for each of the worker node(s) (e.g. the worker on http://10.7.3.6:7080 in this guide).
The active repository location is a property of the Workbench (as shown by the connected plug): it is the location where all operations, such as repository creation, will take place. Be sure to activate the correct location before creating a repository, i.e. switch the location to a worker node to create a worker repository, then set it back to the master to create the master repository.
In the Workbench the master configuration is called “GRAPHDB-EE Master” and the worker is “GRAPHDB-EE Worker”, which you will see in the Type drop-down in the Create repository panel.
You can connect master and worker repositories made in the Workbench or the RDF4J console using drag-and-drop in the Setup/Cluster management panel rather than using the Jolokia API: simply drag a connection from the master repository to the corresponding worker repository. If you have multiple replicas of a worker repo (as you would in production), connect the master for that repo to all the worker replicas.
If workers in the Setup/Cluster management panel show as “OFF” (crossed through), and the status pop-up says something like ‘the repository has been deleted’, but you know you have already connected the workers using the Jolokia API command above, it’s probably because you haven’t set up a Repository Location for it in the Workbench. Do that and the workers should show as “ON”. Essentially, the master cluster config has the worker repo URLs, but the Workbench doesn’t know about them yet, so complains.
Summary
Clustering GraphDB is actually pretty simple, it’s just the documentation that makes it look complicated!
- Add the master and worker node Locations to the Workbench (if using the Workbench).
- Create worker repositories on each of the worker nodes.
- Create a corresponding master repository on the master node.
- Connect the master repository to each of the worker repositories.
- Configure your applications to send your SPARQL queries to the master node.
If you want to use multiple masters for more resilience, the setup should now be fairly obvious — I hope!