Label vs property on change intensive nodes

Neo4j 4.0

Hi, I have a scenario and I am not sure what the best practice would be, both semantically and performance-wise.

The scenario is the following:

We have nodes in our database that represent content. This content can have branches and leaves, which are modelled via relationships and nodes. So basically a tree. Content, along with everything else you can do with it, can be marked as public. This exposes it to unauthenticated users.

In practice, if a content node (top level) is marked as public, its whole tree is traversed to mark everything in it as public as well (because the default behavior is to make everything inside that content public). The user can choose to hide some of its internal parts by making them non-public.
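To illustrate, the "mark the whole tree" step might look something like this sketch (the id value, the label approach, and the relationship direction are assumptions for illustration; children point at their parent, as in the queries later in this thread):

```cypher
// Mark a content root and its entire subtree as public.
MATCH (root:Content {id: 123})
SET root:Public
WITH root
// Any node reachable by incoming relationships is part of the tree.
MATCH (root)<-[*]-(descendant)
SET descendant:Public
```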

Now, I see two ways to do this:

1 - Use labels to mark the graph nodes as public
2 - Have a boolean property to tell if the node is public or private (maybe index the property)
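For concreteness, here is a minimal sketch of the two options in Cypher (the id value is just a placeholder):

```cypher
// Option 1: a label
MATCH (n:Content {id: 123})
SET n:Public;

// Option 2: a boolean property
MATCH (n:Content {id: 123})
SET n.public = true;
```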

Now, I expect that this is something that is going to be used frequently, so if we choose labels, we're going to see nodes constantly getting the public label added or removed. What is the best strategy for this, performance-wise and scalability-wise? I read the docs on the costs of each, and I know a property set is roughly double a label set, memory-allocation-wise. But that's it; I still need more info.

I fear the following:

  • Using labels may affect index performance and storage size negatively. Since I don't know how index fragmentation is handled in Neo4j 4, it could be a bad idea to use labels for this case. I don't want to have to worry about rebuilding indexes.

  • Running into performance bottlenecks due to locking.

  • If I choose property indexing, how different would it be from label indexing?

I am still not confident what is the most semantically correct choice, regardless of database I/O implications. What do you guys think?

Thanks for the help.

The decision you're making here is mostly about deciding where and when to put the metal to work, or where and when you want performance. That's it. Most graph database design is about optimizing the queries, because as a graph gets large and complex, many queries quickly become so demanding that they are effectively unanswerable. That said, here are your choices for optimization:

  • Query performance
    • Use labels (nodes can have multiple labels), and go ahead and keep on re-indexing.
  • Write performance
    • Use a boolean property, and filter on that.
  • Balanced but tricky
    • Use a relationship property.
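The relationship-property variant might look like this sketch (the id value is a placeholder, and the predicate assumes a boolean `public` property on each relationship):

```cypher
// Keep only paths where every hop is flagged public.
MATCH p = (a:Content {id: 123})<-[*]-(child)
WHERE all(r IN relationships(p) WHERE r.public = true)
RETURN p
```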

IMO, labels are a bit of a misuse for this, but they would do the job well. This might be a bit hacky, but you could index a :Content label, and use the :Public label in the WHERE clause:

CREATE (a:Content {id: 22, text: "Please, no more Lorem Ipsum."})
SET a:Public;

MATCH (a:Content)
WHERE a:Public
MATCH p = (a)<-[r]-(b)

Hey Tony, thanks for the detailed reply.

The performance trade-offs of one choice or the other are a concern, but not exactly what really worries me; it's the fragmentation. And I agree, it's a lot simpler to use labels for matching. I don't really understand why you used the WHERE clause, though; instead I would do it like this:

MATCH (a:Content:Public)
MATCH p = (a)<-[r]-(b)

And as far as I know, every label is already indexed, hence my worries about fragmentation. Marking and unmarking nodes with the :Public label a lot could make fragmentation the norm. But that's the assumption I am trying to validate, as I don't think it would be very healthy to re-index frequently, or that I'd even know how to determine the frequency of re-indexing.

Another upside I see to using the label is that I can easily inject it in our query engine (along with other sanitizations we do). It makes development safer and allows us to create generic Cypher queries that only require a label injection to be reused for public content on the same feature.

I'm a bit curious about the need for that. Labels are intended to be groupings of like things.

Labels do indeed naturally behave like an index, and are the first means of reducing the quantity of records to be operated on. However, there is no automatic indexing. Indices must be explicitly defined.

Fragmentation could only become an issue in an index when large volumes in that index are moved and removed often. Even then, it will only affect indices explicitly defined, and Neo4j makes it trivial to rebuild an index.
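In Neo4j 4.x, rebuilding a named index is just a drop and a re-create (the index name and property here are illustrative):

```cypher
// Drop the named index, then define it again from scratch.
DROP INDEX content_key;
CREATE INDEX content_key FOR (a:Content) ON (;
```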

Because doing it that way would enable adding/removing the :Public label without affecting any indices, while still leveraging an index on :Content. The query planner does its best to use indices where it can, but an index only on :Content may not be used when matching with a double-label like that.

I was implying a pattern like...

CREATE INDEX content_key FOR (a:Content) ON (;
CREATE INDEX content_name FOR (a:Content) ON (;
CALL db.index.fulltext.createNodeIndex("content_body",["Content"],["title", "descr", "body"]);

These indices would significantly accelerate queries that use just the id property, or just the name property. The fulltext index is just a good idea when dealing with content.

With indices defined like that, nothing you do with the :Public label will affect them. However, ensuring you are actually using the index still requires some Cypher magic.

Any of these could then be used as the first step of your query, using the index previously defined.

CALL db.index.fulltext.queryNodes("content_body", "something to search") YIELD node
MATCH (node:Content {id: 22})
MATCH (node:Content {name: "my-article"})

The results from any one of the lines above could then be followed by this clause, to give you the graph of only the nodes with the :Public label, without straining the index.

WHERE node:Public
WITH node
MATCH p=(node)<-[]-(child)
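One way to confirm the planner actually picked the index, rather than falling back to a label scan, is to PROFILE the query and look for a NodeIndexSeek operator in the plan:

```cypher
// Expect NodeIndexSeek on the :Content(id) index, not NodeByLabelScan.
PROFILE
MATCH (node:Content {id: 22})
WHERE node:Public
RETURN node
```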

Feel free to mix and match

MATCH (node:Content)
WHERE node:Public AND > 200 AND node.title STARTS WITH "note: "
WITH node
MATCH p=(node)<-[]-(child)

Oh wow, now that's some priceless information, and very difficult to put together. Thank you. The least I can do:

We do use labels for groupings; the public label is the only one that escapes that rule a little bit. It felt like a gray area, hence my original question on semantics: it didn't feel right, and I think that's exactly why. It's not really a grouping (well, kind of: public vs. non-public); it's more of a tool to simplify our Cypher and provide some level of a security blanket, where any Cypher query that goes through our query engine and does not explicitly say the data it wants is not public will return nothing. That minimizes the chances of accidental leaks.
Roughly, it works a bit like an OO query language built in TypeScript: return eng.match('n').relates('r', Direction.Right).node('z').return('*');

This allows us to configure the engine to inject things like labels easily, without interfering with how the query is built. So it adds a security layer over the queries we write; that includes tenancy labels, which we also have. There is a plan to use the new Neo4j 4 access control feature, but that still requires some research to implement.

Thanks a lot for the help; it's this kind of detail that is very hard to come by when working on very specific scenarios.


I just thought of an addendum to my example, that is probably best for your use case, and simplifies your Cypher a bit:

MATCH p=(target:Content {id: 123})<-[*]-(child:Content)
WHERE child:Public

Though this has the drawback of returning paths like (target)<--(private)<--(public), unless you put public/private on the relationship. Then you could do the following, very quickly, without any private entries in the result:

MATCH p=(target:Content {id: 123})<-[:PUBLIC*]-(child:Content)
WHERE child:Public
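Worth noting: a relationship's type cannot be changed in place, so toggling :PUBLIC on and off means deleting and recreating each relationship. If that churn is a concern, a boolean property on the relationship achieves the same filtering; this is a sketch assuming a generic :CHILD relationship type and a `public` property:

```cypher
// Same traversal, but visibility lives in a relationship property
// that can be flipped with SET instead of recreating relationships.
MATCH p = (target:Content {id: 123})<-[:CHILD*]-(child:Content)
WHERE child:Public
  AND all(r IN relationships(p) WHERE r.public = true)
RETURN p
```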

One thing to note (and this usually surprises people): an index on a boolean doesn't work very well.

so having
SET n.public = true
SET n.public = false

won't work so well, because indexes work better when there are many different values.

It would be a bit better to have:
REMOVE n.public

but I suspect having a Public label will be better
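For completeness, here is a sketch of how the remove-instead-of-false pattern reads at query time (exists() on a property is the 4.0-era syntax):

```cypher
// Only nodes that still carry the property are considered public;
// REMOVE n.public takes a node out of this result set.
MATCH (n:Content)
WHERE exists(n.public)
RETURN n
```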

Hi @clem, can you please provide a valid test case and a profiled proof for the statement "An index on a boolean doesn't work very well"?

This was told to me years ago by a wise DB expert.

I found this page from IBM that explains the details:

Hi @clem, even though a lot of wise DBAs agree with you, I don't think this thread should reach the blanket conclusion that an index on a boolean is bad without any evidence. It all depends on the query and the type of system.

I can write a query that uses the index, and at the same time rewrite that same query so that it does not use the index.

For years and years, DBAs have used bitmap indexes, and indexes on isActive, isMale, etc.

There are two types of system: OLTP and OLAP.

Also, as in your DB2 example (and likewise in Oracle and SQL Server), I can make a query use a low-cardinality index effectively by running statistics and creating custom distribution buckets. If I do that, even a low-cardinality index is very efficient.

From your example

Distribution statistics are not collected or parameter markers do not allow DB2 to make use of distribution statistics

Doesn't the DB2 optimizer take care of this and ensure it is not choosing an access plan that requires more data pages? Well, yes, it frequently does. However, the optimizer is only as good as the data and statistics available to it. The optimizer is also capable of mistakes.

To properly find an access plan that truly fits the data, collect distribution statistics and detailed statistics on all indexes. Also, the use of parameter markers (which show up in queries in the package cache as ?) prevents DB2 from making proper use of distribution statistics. DBAs can control what kind of statistics are collected, but they may not be able to control the use of parameter markers. There are advantages that may lead developers or vendors to choose parameter markers.

The real problem comes in when DB2 has to assume the distribution is standard and then comes up with skewed estimates; this can cause sort overflows and other effects that can both erratically and severely impact the performance of only certain instances of queries.

RDBMS databases have the capability to run manual, automatic, and custom statistics.

P.S. Since this thread has already been answered by Tony, I will pause answering in this thread.