Getting Started with Graph Embeddings in Neo4j

A brief introduction to how to turn the nodes of a network graph into vectors

Image by Savionasc, licensed under the Creative Commons Attribution-Share Alike 4.0 International license. No changes were made to the original image.

Introduction

The starting point for all machine learning is to turn your data into vectors/embeddings (if you don’t already have them). Maybe you are lucky and each data point already comes with a lot of columns of normalized floats that easily combine into an embedding. Or maybe you can easily derive them. Many different types of data can be used to generate vectors, such as text, images, and so on. But what about when your data is a graph, or otherwise consists of data points that are related to each other?

Over the course of the next several blog posts I am hoping to get into some of the nitty-gritty of how these vectors can be created and tuned. For this post, I am going to introduce the three methods available in the Graph Data Science (GDS) library of Neo4j. (We will save embedding tuning for a different post or two. There is a lot that goes into those hyperparameters!) We are going to use a small graph that is available through the Neo4j Sandbox (but you can also do this using the Neo4j Desktop or the custom Docker container I described in this post), a free tool for trying out Neo4j and GDS.

This post is the second of a series where we will look at doing data science with graphs, which started with…

  1. “Get going with Neo4j and Jupyter Lab through Docker”

(In future blog posts we will be using that Docker reference more.)

Getting started with Neo4j Sandbox

I have described this in a different blog post, so let’s just hit the highlights here. The first step is to create the Sandbox itself. You can do that here. We are going to create a new Sandbox instance by selecting “New Project” and then “Graph Data Science,” as shown below.

Create a Graph Data Science Sandbox database

Once it completes its setup, click the green button to the right and tell it to “Open in Browser.”

Now let’s click on the button in the upper left that looks like a database icon and see what we have in this pre-populated graph representing “Game of Thrones.” Cool. We have several node labels and relationship types, which will be super helpful moving forward. Your graph should look like the following when you issue the Cypher command MATCH (n) RETURN n:

Game of Thrones, visualized as a network graph. (Image by author.)

Using GDS to create an in-memory graph

The first step in using GDS is always to create an in-memory graph, which happens through the use of graph projections. The nice thing about graph projections is that you can (and usually should) get specific about which portion(s) of the graph you want to create embeddings for. In general, it is not a great idea to use the entire graph, particularly as the graph gets large. Further, some of the graph algorithms within GDS don’t work with bipartite or multipartite graphs. Lastly, working with in-memory graphs doesn’t permanently alter your database unless you use the algorithms with .write(), which writes results such as embeddings back as node properties. That will be super helpful when we want to do ML on the graph, and I will show how to do it in this post. So use in-memory graphs. You will love them!

There are two ways to create in-memory graphs, and both are graph data models expressed as projections. A projection specifies the node labels and relationship types to include, either of which can be all-inclusive. The two methods are the Cypher projection and the so-called “native” projection. The Cypher projection has the benefit of being simple to write while providing all of the flexibility of Cypher queries, but at the expense of being much slower than the native projection.

So let’s start by creating an in-memory graph. I am going to use native projections here, but they can easily be converted to Cypher projections, should you so wish. Suppose we want to look at all of the people within the graph. We would use

CALL gds.graph.create(
  'people',
  {
    Person: { label: 'Person' }
  },
  '*'
)
YIELD graphName, nodeCount, relationshipCount;

to create this in-memory graph. Here the node projection simply specifies every node that has the label Person. The edge projection '*' includes all of the edges among the nodes in the node projection.
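
For comparison, here is a minimal sketch of what the same graph looks like as a Cypher projection (assuming the GDS 1.x procedure gds.graph.create.cypher; the two queries below are just one way to express it):

CALL gds.graph.create.cypher(
  'people-cypher',
  // Node query: every Person node becomes a node in the projection
  'MATCH (p:Person) RETURN id(p) AS id',
  // Relationship query: any relationship between two people becomes an edge
  'MATCH (p1:Person)-[]->(p2:Person) RETURN id(p1) AS source, id(p2) AS target'
)
YIELD graphName, nodeCount, relationshipCount;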

We could create something a bit more specific, like a graph with multiple node labels, using the syntax

CALL gds.graph.create(
  'a-different-graph',
  {
    Person: { label: 'Person' },
    House: { label: 'House' }
  },
  '*'
)
YIELD graphName, nodeCount, relationshipCount;

So now we have both people and houses, which might be useful for ML tasks like predicting links between people and houses. (We will save that for a future post.)

Maybe we also want to include only specific relationship types between people and houses. (In Cypher, you can list those relationship types with a quick query like MATCH (p:Person)-[r]-(h:House) RETURN DISTINCT type(r).) Suppose we only care about the relationship BELONGS_TO. To create that in-memory graph, we would include a specific edge projection:

CALL gds.graph.create(
  'belongs-graph',
  {
    Person: { label: 'Person' },
    House: { label: 'House' }
  },
  {
    BELONGS: {
      type: 'BELONGS_TO',
      orientation: 'NATURAL'
    }
  }
)
YIELD graphName, nodeCount, relationshipCount;

The edge projection BELONGS includes a few things, namely the edge type and orientation. A note about the latter, though: some graph algorithms in GDS expect an orientation of 'UNDIRECTED', while the default orientation is 'NATURAL'. You are encouraged to consult the API docs to determine what each algorithm requires. When in doubt, it is safest to assume undirected, monopartite graphs.

Cool. Now we have some in-memory graphs (check out CALL gds.graph.list()). Best practice says you should drop any graphs you are not going to use with CALL gds.graph.drop(graph_name) to free up memory, as shown below.
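
For example, a quick housekeeping sketch using the graph names from above (we also drop 'people' here, since we will re-create it with a different projection in a moment):

// See which graphs are currently held in memory
CALL gds.graph.list()
YIELD graphName, nodeCount, relationshipCount;

// Drop the graphs we are done with to free up memory
CALL gds.graph.drop('people');
CALL gds.graph.drop('a-different-graph');
CALL gds.graph.drop('belongs-graph');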

Creating embeddings

There are three types of embeddings that you can create with GDS: FastRP, GraphSAGE, and node2vec. Each works in its own way to create embeddings of the nodes within the in-memory graph. Before we go through each, let’s go over some of the common parameters you will use to generate embeddings.

All of the embeddings (and, in fact, all of the graph algorithms) come with a few different methods. The ones we are going to use here are .stream() (which outputs the results to the screen) and .write() (which writes the computed values as node properties). For each of them, we will need to provide the name of the in-memory graph, some set of configuration parameters, and, via the YIELD statement, what the algorithm should return. The results are reported in terms of node IDs, which are internal identifiers rather than anything recognizable like a name, so we will show shortly how to convert them back into something human-readable. The configuration parameters tend to be specific to each algorithm, and you are encouraged to consult the API docs on them.

Let’s now look at the three embedding algorithms. FastRP, as the name suggests, is, well, fast. It uses sparse random projections, which are based on linear algebra, to create node embeddings from the structure of the graph. Another bonus is that it handles limited memory nicely, which means it will work well in the Sandbox. node2vec works similarly to the NLP vectorization approach word2vec: a random walk of a given length is computed for each node, and those walks are used to learn the embeddings. Finally, GraphSAGE is an inductive method, meaning you don’t need to recalculate the embeddings for the entire graph when a new node is added, as you must with the other two approaches. Additionally, GraphSAGE can use the properties of each node, which is not possible with the previous approaches.

You therefore might be tempted to think that you should always use GraphSAGE. However, it takes longer to run than the other two methods. FastRP, in addition to being very fast (and thus frequently used for baseline embeddings), can sometimes provide very high-quality embeddings. We will look at optimizing and comparing embedding results in a future blog post.

So now let’s start with an in-memory graph and look at the most basic way to create an embedding using FastRP. We will create a monopartite, undirected graph of people:

CALL gds.graph.create(
  'people',
  {
    Person: { label: 'Person' }
  },
  {
    ALL_INTERACTS: {
      type: 'INTERACTS',
      orientation: 'UNDIRECTED'
    }
  }
)
YIELD graphName, nodeCount, relationshipCount;

Note that when you create an undirected in-memory graph, you are creating relationship projections in both directions (natural and reversed).
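
(If you want to sanity-check this, list the graph and look at its relationship count; it should be roughly double the number of INTERACTS relationships in the database, since each one is projected in both directions. A quick sketch:)

CALL gds.graph.list('people')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;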

To get the FastRP embeddings we would use

CALL gds.fastRP.stream('people',
  {
    embeddingDimension: 10
  }
)
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS name, embedding;

Here we have told FastRP to create a 10-dimensional vector for each node and stream the results to the screen. The final line uses gds.util.asNode() to convert the internal node IDs into something we can understand (the character names, in this case). When we run this, we get results that look like this:

FastRP embeddings. (Image by author.)

If we want to write these as properties to the database, we would use

CALL gds.fastRP.write('people',
  {
    embeddingDimension: 10,
    writeProperty: 'fastrf_embedding'
  }
)

Now, if you look at a few Person nodes with MATCH (p:Person) RETURN p LIMIT 3, you will see that Jaime Lannister, for example, gives us

{
  "identity": 96,
  "labels": [
    "Knight",
    "Person"
  ],
  "properties": {
    "fastrf_embedding": [
      -0.57976233959198,
      1.2105076313018799,
      -0.7537267208099365,
      -0.6507896184921265,
      -0.23426271975040436,
      -0.8760757446289062,
      0.23972077667713165,
      -0.07020065188407898,
      -0.15781474113464355,
      -0.4160367250442505
    ],
    "pageRank": 13.522417121008036,
    "wcc_partition": 2,
    "gender": "male",
    "book_intro_chapter": "5",
    "name": "Jaime Lannister",
    "pageRank-1": 3.143866012990475,
    "community": 304,
    "title": "Ser",
    "age": 39,
    "birth_year": 266
  }
}

We can see that there is a nice embedding waiting for us.
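
(Since the embeddings are now just regular node properties, you can pull them back out with plain Cypher whenever you need them, for example to hand them off to an ML pipeline. A minimal sketch:)

// Retrieve the stored embedding for a single character
MATCH (p:Person {name: 'Jaime Lannister'})
RETURN p.name AS name, p.fastrf_embedding AS embedding;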

What’s coming next in the series?

In this post we demonstrated the creation of FastRP embeddings on a Neo4j Sandbox instance. But wait, what about node2vec and GraphSAGE?! Those methods require a bit more memory, so we will save them for a future post, where we will have more compute power via the Docker container described in the post linked above. We will also spend some time talking about how to tune these different embeddings, a step required of any ML-based solution. And then, of course, where would we be if we did not include a discussion of common graph ML tasks such as automated node classification and link prediction? Stay tuned!

