Tyler from Texas - Massive Dataset

Hello all.

My team is trying to figure out whether Neo4j meets our needs for our next project. We have a dataset of 200 million records of two different types, which I'll call A (180M records) nodes and B (20M records) nodes, and need to validate whether Neo4j can handle that. Every A node will be connected to approximately 10 B nodes. Query performance is not a huge concern since everything will be properly indexed; we are more concerned with how long it will take to import 200M records and the simple connections between them. Is it even possible to import this many records into a single Neo4j database? And what kind of processing power/storage is necessary for a database this size?

Thank you


If I understand the situation, what you describe is roughly 200 million nodes (two different labels, A and B) and 1.8 billion relationships, is that correct? There are a few things to be aware of once you pass 34 billion nodes or 34 billion relationships, but you are well below those numbers.

In brief, it is my understanding that there are larger Neo4j databases out there just by the numbers, so this could be loaded, and a ground-up load of Neo4j using the bulk loader (neo4j-admin import) is quite fast. The hardware specs and configuration will affect the load speed.
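As a rough sketch of what a bulk load could look like, here is a hypothetical invocation for Neo4j 5.x (4.x uses `neo4j-admin import` instead; the file names, labels, and relationship type below are placeholders, and the CSVs would need `:ID`, `:START_ID`, and `:END_ID` header columns):

```shell
# Offline bulk import into an empty database named "neo4j".
# a_nodes.csv / b_nodes.csv / rels.csv are hypothetical file names.
neo4j-admin database import full \
  --nodes=A=import/a_nodes.csv \
  --nodes=B=import/b_nodes.csv \
  --relationships=CONNECTED_TO=import/rels.csv \
  neo4j
```

Note the bulk importer only works against a database that has never been started; incremental additions later would go through Cypher instead.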

I think the real question is about the overall design and how it would be used, right?
Is the graph acyclic or cyclic? What is the complete meta model? Will there be additional nodes and relationships?
What would a typical Cypher query look like (e.g., path lengths, result-set size, how many properties)?
Neo4j can scale horizontally, so I wouldn't be concerned about the number of users, if $$ isn't an issue.

I would mention that indexes help you find nodes quickly (if you use an indexed label and property in the MATCH clause), but they aren't involved in following relationships or in searching relationship properties.
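For completeness, an indexed lookup on the A nodes would rely on something like the following (Neo4j 4.x+ syntax; the index name and the assumption that A nodes carry an `id` property are mine):

```cypher
CREATE INDEX a_id_index IF NOT EXISTS
FOR (n:A) ON (n.id);
```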

You might try the hardware sizing calculator. I've not used it myself, but it could be a useful starting point.


Thanks @Joel! This is useful information. Yes, 200M records and 1.8B relationships is right. In the future my team and I are looking to add more data, maybe up to 10x this amount, still under the 34B mark. As far as I know, the fastest way to do this is by using apoc.load.
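A batched load along those lines might look roughly like this (the file name, key names, relationship type, and batch size are all placeholders; `apoc.periodic.iterate` and `apoc.load.csv` require the APOC plugin):

```cypher
// Load relationships in batches of 10k to keep transactions small.
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('rels.csv') YIELD map RETURN map",
  "MATCH (a:AType { id: map.aId })
   MATCH (b:BType { id: map.bId })
   MERGE (a)-[:CONNECTED_TO]->(b)",
  { batchSize: 10000, parallel: false }
);
```

For the initial empty-database load, though, the offline bulk importer mentioned above is typically much faster than any Cypher-based approach.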

The main queries we'll be running look like this:

MATCH (a:AType { id: ITEM_ID })
WITH a AS start
MATCH (start)-[prop1]->(b:BType)<-[prop2]-(c:AType)
RETURN start AS a, b, c;


MATCH (a:AType { id: ITEM_ID })
WITH a AS start
MATCH (start)-[prop1]->(b:BType)<-[prop2]-(c:AType)-[prop3]->(d:BType)<-[prop4]-(e:AType)
RETURN start AS a, b, c, d, e;

These are the basics; there will be some limiting and filtering based on the relationship properties, but all queries will start from an indexed A node. We would limit the results to ~200 items/pathways.
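With the relationship-property filtering and the ~200-result cap folded in, the first query might look something like this (the `weight` property and its threshold are placeholders, not part of our actual model):

```cypher
MATCH (a:AType { id: ITEM_ID })
MATCH (a)-[r1]->(b:BType)<-[r2]-(c:AType)
WHERE r1.weight > 0.5 AND r2.weight > 0.5
RETURN a, b, c
LIMIT 200;
```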

At the moment there aren't going to be any additional nodes or relationships, and it is an acyclic graph.
Good to know it scales horizontally! There will probably be 5-10 properties on each node and 5-10 on each relationship. I used the hardware sizing calculator tool at the bottom of the page, but it maxes out around 50,000,000 nodes and 250,000,000 relationships. Doing the calculation by hand with 10 properties on each node and relationship puts us at roughly 882 GB of storage, which is within our reach.
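Sanity-checking that hand calculation with the classic Neo4j record sizes (15 bytes per node record, 34 per relationship record, and 41 per property record; these figures are from the older record store format, so treat the result as a ballpark only):

```python
# Ballpark storage estimate using classic Neo4j record sizes (older
# record-store format); real on-disk size varies with store format,
# property types, and string storage.
NODE_BYTES = 15
REL_BYTES = 34
PROP_BYTES = 41

nodes = 200_000_000
rels = 1_800_000_000
props_per_entity = 10  # properties per node and per relationship

node_store = nodes * NODE_BYTES
rel_store = rels * REL_BYTES
prop_store = (nodes + rels) * props_per_entity * PROP_BYTES

total_bytes = node_store + rel_store + prop_store
total_gb = total_bytes / 1e9
print(f"{total_gb:.0f} GB")  # roughly 884 GB, in line with the ~882 GB figure
```

Note this counts store files only; indexes, transaction logs, and page-cache headroom come on top of it.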