When to share/reuse nodes and when not?

To be specific, let's look at the following simple example. We want to use Neo4j to model a survey system. Each survey consists of a set of questions, and each question has only two answers: Yes and No.

My idea is to model each question and each answer as a node and connect them with a relationship, e.g. HAS_ANSWER. Now I wonder whether the nodes representing answers should be shared/reused across surveys.

  1. If they are shared/reused, then even with millions of surveys in the database I will still have only 2 nodes, Yes and No, which seems good. However, these 2 nodes will have millions of incoming edges, which may cause performance problems. If so, at some point I may need to split the heavy nodes into smaller ones.

  2. Another approach is to duplicate the Yes/No answers across surveys. In this case, I will have millions of "duplicated" nodes in the database. On the other hand, I will not have very heavy nodes with millions of incoming edges.
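
To make the trade-off concrete, here is a minimal sketch in plain Python (not Neo4j or Cypher) that builds both models in memory and compares node count and the largest in-degree. The function names and the `Q0`, `Q0:Yes` naming scheme are made up for illustration only.

```python
from collections import Counter

def build_shared(num_questions):
    """Option 1: every question links to the same two answer nodes."""
    nodes = {"Yes", "No"}
    edges = []  # (question, answer) pairs standing in for HAS_ANSWER
    for q in range(num_questions):
        question = f"Q{q}"
        nodes.add(question)
        edges.append((question, "Yes"))
        edges.append((question, "No"))
    return nodes, edges

def build_duplicated(num_questions):
    """Option 2: every question gets its own private Yes/No nodes."""
    nodes = set()
    edges = []
    for q in range(num_questions):
        question = f"Q{q}"
        yes, no = f"Q{q}:Yes", f"Q{q}:No"
        nodes.update((question, yes, no))
        edges.append((question, yes))
        edges.append((question, no))
    return nodes, edges

def max_in_degree(edges):
    """Largest number of incoming edges on any single node."""
    return max(Counter(target for _, target in edges).values())

shared_nodes, shared_edges = build_shared(1000)
dup_nodes, dup_edges = build_duplicated(1000)
print(len(shared_nodes), max_in_degree(shared_edges))
print(len(dup_nodes), max_in_degree(dup_edges))
```

For 1000 questions, the shared model has 1002 nodes and each answer node has 1000 incoming edges, while the duplicated model has 3000 nodes and no node has more than 1 incoming edge. The number of edges is the same in both cases.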

Importantly, I anticipate that my queries will NOT traverse from the Yes/No nodes upward in the hierarchy, i.e. from answers to questions.

Which approach do you think is better? Maybe there is some limit, e.g. up to X million incoming relationships go with solution 1, but beyond that go with solution 2.


I have already gathered some experience with these storage questions.

In an ongoing data visualization project, I tracked events in a Google Form and later rebuilt those events in a graph database. To gain better insight into the ratings, I stored each rating in a node, so I could see, and also show, how big the rating distributions were. But writing Cypher queries for that was really complex at that point, and I am also not sure the data is correct.

  1. Neo4j should cope well with millions of relationships, especially since following each of them should cost about the same. Duplication might even increase runtime, since you will traverse more nodes.
    An easy thing to do in that regard is to aggregate your answers into survey result nodes from time to time, if you want a tabular view of the results. Another tactic would be to build new facts out of your results. You can find more about these in the O'Reilly Graph Databases book.

  2. From a visualization standpoint this seems great, but basic modelling suggests using unique answers as templates for your graph nodes. And in the context of a survey, a "Yes" answer might always lead to the same conclusion.
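
As a rough sketch of the aggregation idea from point 1, here it is in plain Python rather than Cypher. The response tuples and the `aggregate` function are hypothetical, made up only to show the shape of a "survey result" per question.

```python
from collections import Counter, defaultdict

# Hypothetical raw responses: (survey_id, question_id, chosen_answer)
responses = [
    ("s1", "q1", "Yes"),
    ("s1", "q1", "No"),
    ("s1", "q1", "Yes"),
    ("s1", "q2", "No"),
    ("s2", "q1", "Yes"),
]

def aggregate(responses):
    """Collapse individual answers into one result record per question."""
    results = defaultdict(Counter)
    for survey_id, question_id, answer in responses:
        results[(survey_id, question_id)][answer] += 1
    return results

results = aggregate(responses)

# Tabular view: survey, question, number of Yes, number of No
for (survey_id, question_id), counts in sorted(results.items()):
    print(survey_id, question_id, counts["Yes"], counts["No"])
```

Each key of `results` would correspond to one result node in the graph, so reporting queries could read the stored counts instead of walking every individual answer.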

So, with my limited knowledge, I'd go for 1.