Minimum "viable" size of data for neo4j

Hi, I was wondering whether there's some rule of thumb for when Neo4j starts making sense over something like Postgres. I more or less understand the theory: roughly constant-time traversal in a graph DB vs O(log N) index lookups in Postgres. But at what dataset size does that difference start to show? Thousands of rows? Millions? I know it's a vast generalization.

I have a fairly simple model like (User)-[:ENROLLED_IN]->(Class), with thousands of classes, tens of thousands of users and hundreds of thousands of ENROLLED_IN edges (each user has 20-30 of them). It's not really a huge dataset, and I don't traverse very deep into the graph. I almost always just want a user with all their classes, or a class with all its users – in Postgres that always means hitting the join table between them.
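For reference, here's roughly what those two lookups look like in Cypher – a minimal sketch, assuming User and Class nodes keyed by an id property (the property names are mine, not from the model above):

```cypher
// All classes for one user (the id property is an assumption for illustration)
MATCH (u:User {id: $userId})-[:ENROLLED_IN]->(c:Class)
RETURN u, collect(c) AS classes;

// All users enrolled in one class
MATCH (c:Class {id: $classId})<-[:ENROLLED_IN]-(u:User)
RETURN c, collect(u) AS users;
```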

So I'm wondering whether I'd see noticeable performance gains with this dataset, or whether it's just that it feels more natural to store this as a graph...?

I'm also thinking a bit about the more distant future, where I want to let users pre-register for classes. Since capacities are limited, I want to do some optimization to maximize "satisfaction" – I believe this might be a heavily graph-based algorithm, but at this stage the idea is fairly vague and I don't want to do premature optimization.

Thank you

It depends a little bit. You're right that traversing that edge will always take a JOIN in Postgres. The size of your data matters, and so does how you've set up your machine. If your data is ultra-small, to the point where the entire dataset fits in memory, Neo4j will probably perform better for read/lookup operations, but the difference might be hard to notice because the data is fundamentally small and all operations happen in memory. The difference in speed comes from the hash join operations the database isn't doing (there are no joins in Neo4j, just pointer dereferencing to traverse edges).
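If you want to see that concretely, you can PROFILE a query and look at the operators and db hits in the plan – a rough sketch, assuming an index on :User(id) (the id property is illustrative):

```cypher
// PROFILE prints the executed plan with db hits per operator; for this traversal
// you should see an index seek followed by Expand(All), and no join operator
// (assumes an index exists on :User(id))
PROFILE
MATCH (u:User {id: $userId})-[:ENROLLED_IN]->(c:Class)
RETURN count(c) AS classCount;
```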

Typically the performance benefit stacks up the more relationships you traverse. For example, say you wanted to list students who share at least 3 classes in common; that requires traversing the relationship/doing the join at least twice, and so on. The more the relationship traversals/joins stack up, the bigger the performance gain will be.
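As a sketch of that two-hop case (again assuming an id property, here just to avoid counting each pair twice):

```cypher
// Pairs of students who share at least 3 classes
MATCH (a:User)-[:ENROLLED_IN]->(c:Class)<-[:ENROLLED_IN]-(b:User)
WHERE a.id < b.id            // count each pair once (id property is an assumption)
WITH a, b, count(c) AS shared
WHERE shared >= 3
RETURN a, b, shared
ORDER BY shared DESC;
```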

Don't underestimate, either, the value of having the data in a format that's more intuitive for you, and more flexible; there's benefit to storing the data in the way that feels most natural, because you'll spend less future coding time "adapting your thinking to the machine", so to speak, and more time working on the problem you want to solve.

Your "maximizing satisfaction" future query does sound very graphy and similar to recommendation engine stuff Neo4j has published about Cypher. (Example: Tutorial: Build a Cypher Recommendation Engine - Developer Guides) -- the question might be formulated this way: how do you recommend a class to a student that was enjoyed by students like them, that they haven't already taken? Yes that's very graphy.


Thank you for the detailed answer. It's a fair point to use the data structure that feels more natural for a given case.

My "maximising satisfaction" feature was less about recommendations, but rather about satisfying excessive demand for limited capacities in some optimized way. For example if 3 people fight for 1 spot, then maybe take into account how high it's in each of them's "wishlist", whether they have another meaningful combination available to them, what is their "klout", etc. Some sort of "optimizing the graph" basically :-)