Comparison of Neo4j with Relational Databases

Hello,

Are you aware of published comparisons of Neo4j with other graph and/or relational databases regarding runtime and memory consumption?
Is there a tool contest in which Neo4j participated?
The comparisons I have found are not entirely satisfactory.

Sven

An RDBMS stores data in rows. Related rows are connected through primary and foreign keys, but they live in separate tables.

Neo4j (a graph database), by contrast, natively stores the data together with its connections to other data as physical relationships.

If your data model is normalized to 3NF or beyond, you sometimes need an association table to join on multiple IDs. This is not the case in Neo4j, because relationships map directly to the data.

An example can be found in the Neo4j introduction slides (intro to neo4j and graph database).

Dear Dominic,

thank you very much for your response!

I understand the basic difference between graph and relational databases and why it is plausible to believe that some graph database should be faster than some relational database in some cases.

What I am looking for is hard evidence that the implementation of Neo4j outperforms all relational database implementations for a given set of scenarios (and which scenarios these are).
Also, of course, I am interested in comparisons with other graph database implementations for the same reason.

Comparisons are hard to do well, because an adequate comparison needs an expert for each database solution. This is typically the case in tool contests, where the developers of the respective tools participate.

In essence, when is Neo4J provably better than any other technology?

Hi Sven,
If you are going to query 1 or 2 tables, then an RDBMS can work pretty well, depending on the size of each table. The biggest cost you incur in the RDBMS world is the JOIN. In Neo4j, JOINs are avoided and replaced by traversals. You can think of relationships as precomputed JOINs.

Say you have a Person table with 10 million rows, an Address table with 4 million rows, and a JOIN table with 10 million rows that connects persons to addresses.

When you want to find all the people living at an address in an RDBMS, these are the steps:

  1. Look up the address (index lookup)
  2. Look up the person IDs in the JOIN table (index lookup)
  3. Look up the persons by ID (index lookup)

So there are 3 index lookups.
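
To make the three lookups concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and index names are my own assumptions for illustration, and the query planner of a real RDBMS may choose a different plan than the one described in the comments.

```python
import sqlite3

# Hypothetical schema: names are assumptions, not from a real system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT);
    CREATE TABLE person_address (person_id INTEGER, address_id INTEGER);
    CREATE INDEX idx_address_street ON address(street);
    CREATE INDEX idx_pa_address ON person_address(address_id);
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany("INSERT INTO address VALUES (?, ?)",
                 [(10, "1 Main St"), (11, "2 Oak Ave")])
conn.executemany("INSERT INTO person_address VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 11)])


def people_at(street):
    """Return the names of all people living at the given street address."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM address a                                   -- step 1: find the address
        JOIN person_address pa ON pa.address_id = a.id   -- step 2: find person ids
        JOIN person p ON p.id = pa.person_id             -- step 3: find persons by id
        WHERE a.street = ?
        ORDER BY p.name
        """,
        (street,),
    )
    return [name for (name,) in rows]


print(people_at("1 Main St"))
```

Each of the three steps can be served by an index here, but each one is still a separate lookup whose cost depends on the size of the table behind it.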

In Neo4j, these are the steps:

  1. Lookup address (index lookup)
  2. Traverse to Persons (Insignificant cost as it is a pointer traversal)
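
The two steps above can be sketched in plain Python. This is only an illustration of the idea (each node holds direct references to its neighbors), not Neo4j's actual storage format or API. The `Node` class, `address_index`, and the `LIVES_AT` relationship type are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str
    props: dict
    # Direct references to neighboring nodes: (relationship_type, node)
    rels: list = field(default_factory=list)


# The only index in this sketch: street -> Address node
address_index = {}


def add_address(street):
    node = Node("Address", {"street": street})
    address_index[street] = node
    return node


def add_person(name, address):
    person = Node("Person", {"name": name})
    # The extra cost is paid here, at ingestion: both ends store the link.
    address.rels.append(("LIVES_AT", person))
    person.rels.append(("LIVES_AT", address))
    return person


def people_at(street):
    # Step 1: a single index lookup to find the starting node.
    address = address_index.get(street)
    if address is None:
        return []
    # Step 2: follow the references stored on the node; no further index is used.
    return [n.props["name"] for rel_type, n in address.rels
            if rel_type == "LIVES_AT" and n.label == "Person"]


main_st = add_address("1 Main St")
oak_ave = add_address("2 Oak Ave")
add_person("Ann", main_st)
add_person("Bob", main_st)
add_person("Cy", oak_ave)
print(people_at("1 Main St"))
```

Note that the work in step 2 depends only on how many relationships the one address node has, not on how many persons or addresses exist overall.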

In Neo4j, you incur an extra cost during ingestion when the relationships are created, but you reap the benefits at query time because the queries become traversals.

Once you start traversing more than 2 levels, Neo4j's advantages start to become apparent.

From that perspective, for those scenarios, Neo4j outperforms a traditional RDBMS. That does not mean it is the best choice in all kinds of scenarios.

Here is an example from my own experience of the value of Neo4j.

We have a database of one million patients with 100M nodes and 1B relationships. We generate a Sankey diagram that, for a given condition, shows all procedures in the following 90 days, with patient age and race as filters, across all patients. We can generate this diagram in under 5 seconds. We could not match this kind of performance with an RDBMS.

Hi Sven, there are use cases to consider before coming to a conclusion. What is your use case?

As mentioned, the question of traversals vs. joins is the key. You'll need to look at it with an understanding of how efficiency is affected as the data in your tables grows and as the number of joins increases.

RDBMS joins are at best O(log(n)) when indexes are present to support each of those joins, where n is the number of records in the table being joined against. This means that as that table grows, performance will eventually be affected. Because the cost is O(log(n)), it grows slowly, but in this age of big data and data lakes we are working with larger and larger sets of data. The point is that the efficiency of the join still depends on the amount of data in the table being joined.

With index-free adjacency, there are no joins when performing a traversal; these are pointer hops, O(1). No matter how much data is present, and no matter how many nodes there are for the label you are traversing to, the cost of traversing a node's relationships remains O(1), since we only have to look at this node's own relationships, not at all the relationships or nodes in the graph.

If we're only comparing a single traversal, the difference between an O(1) traversal and an O(log n) join is small enough to be trivial in most cases. But how many joins or traversals does the query require? Graphy problems usually require many traversals, sometimes an unknown number of them. Where in a graph db this would be O(1) * number of traversals, for a relational db it's log(n) * number of joins, where the ns differ depending on which tables are being joined. That can add up.
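
As a rough back-of-the-envelope illustration of how that adds up, here is a toy cost model in Python. It is a sketch under strong assumptions: one index lookup per join counted as log2(n) comparisons, one unit per pointer hop, and no constant factors, caching, or disk effects. The function names are made up for this example, and the table sizes are the ones from the Person/Address example earlier in the thread.

```python
import math


def join_cost(table_sizes):
    """Toy estimate: one index lookup per join, log2(n) comparisons each."""
    return sum(math.log2(n) for n in table_sizes)


def traversal_cost(hops):
    """Toy estimate: one unit per pointer hop, independent of graph size."""
    return hops


# Address (4M rows), JOIN table (10M rows), Person (10M rows)
sizes = [4_000_000, 10_000_000, 10_000_000]
print(f"joins:      ~{join_cost(sizes):.1f} comparisons per result path")
print(f"traversals: {traversal_cost(len(sizes))} hops per result path")
```

The absolute numbers mean little. What matters is that the join estimate grows with both the table sizes and the number of joins, while the traversal estimate grows only with the number of hops.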

Additionally, there's the issue of how to specify the join conditions. With vanilla SQL you need to know more about the data in your db: how to perform the joins, and which keys from one table are used to join to another. In Neo4j, a relationship between two nodes doesn't have to be based on any of their properties. We don't even need to access the properties at all. For that matter, we can traverse through nodes of different labels using relationships of multiple types, or we may not care much about the labels or types at all. In SQL, this would be the equivalent of being able to join through any table at will, without knowing ahead of time which tables are being joined, how to join against them, or how many joins will be performed. I'm not sure SQL can match that versatility. This is why graph dbs are good for discovering patterns or links between things when there needs to be a level of permissiveness and malleability in the rules followed to find those patterns (think Panama Papers, analysis of Trumpworld finances, fraud detection, etc.).
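
To illustrate what "traverse through any relationship type, for an unknown number of hops" looks like, here is a minimal Python sketch using a breadth-first search over a small in-memory graph. The data, the relationship types, and the `find_path` helper are all hypothetical, and the edges are followed in one direction only. In Neo4j itself you would express this as a variable-length pattern in a query rather than hand-written code.

```python
from collections import deque

# Hypothetical data: node -> list of (relationship_type, neighbor)
graph = {
    "alice": [("OWNS", "acme_ltd")],
    "acme_ltd": [("REGISTERED_AT", "po_box_1"), ("PAYS", "shell_co")],
    "shell_co": [("REGISTERED_AT", "po_box_1"), ("DIRECTED_BY", "bob")],
    "po_box_1": [],
    "bob": [],
}


def find_path(graph, start, goal, max_hops):
    """Breadth-first search that follows any relationship type, up to max_hops."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue  # hop budget used up on this branch
        for rel_type, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, rel_type, neighbor)]))
    return None


for step in find_path(graph, "alice", "bob", max_hops=4) or []:
    print(step)
```

Nothing in `find_path` knows which kinds of nodes or relationships exist ahead of time, which is the permissiveness described above. The SQL counterpart would need every participating table and join key spelled out.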

The caveat of course is that if your app doesn't need to make graphy queries like this, and it doesn't fit into traditionally graphy use cases, then you might not need a graph database. Relational dbs are fantastic for what they do well. But they are not a golden hammer, and their implementation can make them a poor tool to use for graphy use cases.