Help optimize query

I am using Neo4j for analyzing LDAP servers, and would appreciate some help / feedback on the data model as well as the query for building relationships.

Currently, I am creating nodes for each LDAP entry with a label equal to its objectClass attribute (e.g. organizationalUnit, user, group, etc.).

The problem is, I want to create relationships between nodes based on the RDN, e.g. (OU=Users,DN=foo)-[:CONTAINS]->(CN=Bob,OU=Users,DN=foo). So far I have this:

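EXPLAIN MATCH (child)
// Only consider nodes that actually have a dn property
WHERE child.dn IS NOT NULL
WITH
    child,
    // Drop the first RDN and re-join the rest to get the parent's DN
    substring(reduce(parent_dn = "", rdn IN tail(split(child.dn, ",")) | parent_dn + "," + rdn), 1) AS parent_dn
// No label on either pattern, so this lookup cannot use an index
MATCH (parent {dn: parent_dn})
MERGE (parent)-[:CONTAINS]->(child)
RETURN count(parent);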

This is clearly inefficient: because neither MATCH specifies a label, Neo4j cannot use an index and has to scan every node.

I'm stuck on how I should optimize this because of that. Should I not use objectClass as a label since I have maybe 20 unique values? Should I just use one label for everything?

I originally thought it would be good to use objectClass as a label since Neo4j would color them differently and allow me to differentiate between types.
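
One option that would keep the per-objectClass labels is to give every entry a second, shared label and index dn on that, since a Neo4j node can carry several labels at once. A minimal sketch, assuming a hypothetical shared label :Entry and the Neo4j 4.x+ index syntax (older releases use CREATE INDEX ON :Entry(dn)). It also swaps the reduce() for a simpler substring() call, and like the original it does not handle escaped commas inside an RDN:

```cypher
// Assumption: every LDAP entry node also has the hypothetical shared label :Entry.
// Index dn on that label so lookups by dn no longer scan all nodes.
CREATE INDEX entry_dn IF NOT EXISTS FOR (e:Entry) ON (e.dn);

// Same relationship-building idea, now anchored on the indexed label.
MATCH (child:Entry)
WHERE child.dn CONTAINS ','
WITH child,
     // Skip the first RDN plus its trailing comma to get the parent's DN.
     substring(child.dn, size(split(child.dn, ',')[0]) + 1) AS parent_dn
MATCH (parent:Entry {dn: parent_dn})
MERGE (parent)-[:CONTAINS]->(child)
RETURN count(parent);
```

Prefixing the second statement with EXPLAIN or PROFILE should show whether the parent lookup is planned as an index seek rather than a scan.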

I'm far from an LDAP master, but whenever you have embedded strings like this, your data model probably isn't right. You shouldn't ever need to parse text in Cypher the way you're doing right now. This is a strong indication that what you need is three different properties and possibly separate labels.

For example, what you reference as (CN=Bob,OU=Users,DN=foo), you could consider modeling this as:

(c:CN { id: 'Bob' }), (o:OU { id: 'Users' }), (d:DN { id: 'foo' }), (entry)-[:REF]->(c), (entry)-[:REF]->(o), (entry)-[:REF]->(d)

In other words, if you can do that text parsing, do it once upfront when you load the model, and then never again.
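
As a rough illustration of the "parse once at load time" idea, here is a sketch of a variant that keeps the full DN as the key instead of splitting it into component nodes. It assumes the loader (outside Cypher) has already computed each entry's parent DN and passes the rows in as a parameter. The names $entries, parent_dn, and the shared label :Entry are all hypothetical:

```cypher
// Assumption: $entries is a list of maps built by the loader, e.g.
// {dn: 'CN=Bob,OU=Users,DN=foo', parent_dn: 'OU=Users,DN=foo'}.
// Root entries would carry parent_dn: null.
UNWIND $entries AS row
MERGE (e:Entry {dn: row.dn})
SET e.parent_dn = row.parent_dn;

// Later, link children to parents with plain property lookups,
// with no string parsing inside Cypher.
MATCH (child:Entry)
WHERE child.parent_dn IS NOT NULL
MATCH (parent:Entry {dn: child.parent_dn})
MERGE (parent)-[:CONTAINS]->(child);
```

The relationship-building step then only compares stored values, so it can benefit from an index on dn if one exists.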

Hi David,

This is exactly the right approach. I want to emphasize the fact that the underlying benefit of this approach is 'Scalability'.

@david_allen @ameyasoft The entire DN (CN=Bob,OU=Users,DN=foo) is really the "ID": it should be unique across all types of nodes, not just CN. "Bob" on its own isn't necessarily unique, even among CN entries. There could be CN=Bob,OU=Users,DN=foo as well as CN=Bob,OU=SomethingElse,DN=foo. Are you mainly suggesting that I use CN / OU / DN as node labels? In that case I could do (c:CN { dn: 'CN=Bob,OU=Users,DN=foo' }), (o:OU { dn: 'OU=Users,DN=foo' }), (d:DN { dn: 'DN=foo' }), (entry)-[:REF]->(c), (entry)-[:REF]->(o), (entry)-[:REF]->(d) so that nodes are indexed by dn (formerly id) and guaranteed to be unique. Or is there a benefit to keeping (c:CN { id: 'Bob' }) instead of the whole DN?
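
For the "guaranteed to be unique" part, a uniqueness constraint on dn is one way to have Neo4j enforce it, and the constraint comes with its own backing index, so a separate index on dn is not needed. A sketch, assuming a hypothetical shared label :Entry on every node and the syntax of Neo4j 4.4 or later (earlier versions use CREATE CONSTRAINT ON (e:Entry) ASSERT e.dn IS UNIQUE):

```cypher
// Assumption: all entries share the hypothetical label :Entry.
// Enforce that no two entries have the same full DN.
CREATE CONSTRAINT entry_dn_unique IF NOT EXISTS
FOR (e:Entry) REQUIRE e.dn IS UNIQUE;

// MERGE on the constrained property then finds or creates exactly one node per DN.
MERGE (e:Entry {dn: 'CN=Bob,OU=Users,DN=foo'})
RETURN e.dn;
```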

Also, regarding properties, I'm not sure how I would store the components as separate properties. The DN (the entire string) is kind of like a URL / path in the sense that the order matters. I may have OU=foo and OU=bar for one node, but OU=foo,OU=bar is totally different from OU=bar,OU=foo. Think of them as the components of a URL.

The relationships I want to create really follow from the DN's suffixes: CN=Bob,OU=Users,DN=foo is part of OU=Users,DN=foo, which is part of DN=foo.
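
Once that containment chain exists as CONTAINS relationships, the hierarchy can be walked in the graph instead of being re-derived from the string. A small sketch, assuming the relationships have already been built and that entries carry a hypothetical shared label :Entry:

```cypher
// Walk up from one entry to every ancestor, following CONTAINS
// one or more hops against the direction parent -> child.
MATCH (ancestor)-[:CONTAINS*]->(e:Entry {dn: 'CN=Bob,OU=Users,DN=foo'})
RETURN ancestor.dn;
```

For the example above, this would be expected to return the OU=Users,DN=foo and DN=foo entries, in no guaranteed order.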