Building a graph with unknown relationships and 'things'

geronimo4j · December 29, 2020, 9:14pm

I'm trying to figure out how to build a graph of things. Imagine the Graph Movie Database, but where Movie is a variable (it could be anything).

So of course if you use movies as the example, then you could have:

Tom Hanks ACTED_IN Forrest Gump but that assumes you know the thing is a Movie and a relevant relationship in this case is "ACTED_IN".

How can you develop a model that accounts for unknowns, so that if tomorrow the 'thing' is CEOs, you are prepared for a new Thing? Using a generic node and relationship like Tim Cook IS_CHILD_OF Apple comes to mind, but it doesn't feel like it would be as performant.

And when developing for this, how do you track what the relationships will all be? E.g. if you did use a specific node type like "movie" and have a specific "ACTED_IN" then how do you avoid writing all of the possible relationships out by hand? E.g. if you use "IS_CHILD_OF" that's one relationship, but if you now need "IS_CEO_OF", "IS_EMPLOYEE_OF", etc. there could be hundreds of thousands.

Is there a name for this type of model?

clem · December 29, 2020, 10:22pm

This is exactly why you need Neo4j.

You can add Labels (better thought of as Node types). For added flexibility, Nodes can have multiple Labels: if you give Tim Cook the Person Label but later realize he should also have the CEO Label, you can add the CEO Label and either keep the Person Label or remove it.

Relationships can have only one Type, but you can have more than one Relationship between the same pair of Nodes.

You might at first think that a Property should be added to a Node or a Relationship, but later on you might decide it's better to make Nodes (or relationships) out of the properties.

With the functions in the APOC library, you can modify the DB to reflect your better understanding of your data, for example by reversing the direction of a Relationship or renaming Labels. See: Graph Refactorings - APOC Extended Documentation

You have all this flexibility without needing to change your Schema, unlike in a Relational DB.
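For the "unknown things" case in the question, one way to use this flexibility is to take labels and relationship types from the incoming data instead of writing them out by hand. Below is a minimal sketch in Python that only builds the Cypher text. The helper names `safe_identifier` and `build_merge` are hypothetical, not part of Neo4j or its drivers. It assumes that labels and relationship types cannot be passed as ordinary query parameters (true of the Neo4j versions current when this thread was written, as far as I know), so they are validated and interpolated, while the node names stay as parameters.

```python
import re

# Hypothetical helpers for illustration; not a Neo4j or driver API.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name backtick-quoted for Cypher, or raise if it is not a plain identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe label or relationship type: {name!r}")
    return f"`{name}`"


def build_merge(subject_label: str, rel_type: str, object_label: str) -> str:
    """Build Cypher text that merges two named nodes and one relationship.

    Labels and the relationship type are interpolated after validation;
    the node names remain the query parameters $subject and $object.
    """
    s = safe_identifier(subject_label)
    r = safe_identifier(rel_type)
    o = safe_identifier(object_label)
    return (
        f"MERGE (a:{s} {{name: $subject}}) "
        f"MERGE (b:{o} {{name: $object}}) "
        f"MERGE (a)-[:{r}]->(b)"
    )


# Labels and relationship type arrive as data, e.g. from an import file.
query = build_merge("Person", "IS_CEO_OF", "Company")
params = {"subject": "Tim Cook", "object": "Apple"}
print(query)
```

The resulting `query` and `params` would then be handed to whatever driver or client you use. Validating the identifiers matters because they end up in the query text itself.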

If you are exploring what you can do, I recommend cloning your Neo4j DB (and keeping notes!) to back up your DB in case you make a huge change that you want to back out of.

This is the great thing about Neo4j: the Graph DB can evolve as your understanding of your data evolves. This is unlike an RDBMS, where you have to firm up your schema before you necessarily understand everything.

geronimo4j · December 29, 2020, 10:59pm

As I'm exploring, this all seems really encouraging, and I will likely use Neo4j. Thank you for your response! I didn't realize you could give nodes multiple labels. I think the thing I'm interested in is: is it against best practice to create something as generic as just a Thing, instead of a Person or CEO?

Since I think I want to minimize the specificity of the data, I feel like it might make sense to have all "topics" or "nouns" (things) just be a Thing and then specify as-needed with a field within that node, like type: CEO.

Is there any known best practice around how many nodes of a single type you have, so that you can minimize lookups, etc. when querying?

clem · December 29, 2020, 11:32pm

MATCH(n:Label) matching is done with a Set of pointers to all nodes that have that Label.

So, having multiple Labels on a Node means that pointer will be in multiple Sets. The Sets do have some overhead, but it should be relatively cheap in terms of memory. In terms of performance, scanning the smaller set of CEO nodes is much cheaper than scanning the larger set of Person nodes.

If all nodes have the Label Thing, it's not very useful, because it's equivalent to MATCH(n), which means scanning all Nodes, and that gets expensive as your DB grows. So it's better to think of all Nodes as being implicitly Things. (Unless other nodes are distinctly different from Things, e.g. Ideas.)

A lot of this depends too, on the types of queries you expect to make.

So hypothetically, if you have a left-handed, death metal, Swedish bassist, you could:

CREATE(p:Lefthanded:Deathmetal:Swedish:Bassist {name:"Lars"})

and match:
MATCH(p:Lefthanded:Deathmetal:Swedish:Bassist)

That match will look at the intersection of the Labels Lefthanded, Deathmetal, Swedish, Bassist which shouldn't be too expensive since the sets are small.
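The set-of-pointers description can be sketched as a toy model in Python. This is only an illustration of the idea described above, not Neo4j's actual storage format or query planner; the `LabelIndex` class and the node ids are made up for the example.

```python
from collections import defaultdict


class LabelIndex:
    """Toy model: one set of node ids per label."""

    def __init__(self):
        self.by_label = defaultdict(set)

    def add(self, node_id, *labels):
        # A node with several labels appears in several sets.
        for label in labels:
            self.by_label[label].add(node_id)

    def match(self, *labels):
        # Intersect the per-label sets, starting from the smallest one
        # so that the fewest ids have to be checked.
        sets = sorted((self.by_label.get(label, set()) for label in labels), key=len)
        if not sets:
            return set()
        result = set(sets[0])
        for other in sets[1:]:
            result &= other
        return result


index = LabelIndex()
index.add(1, "Lefthanded", "Deathmetal", "Swedish", "Bassist")  # Lars
index.add(2, "Swedish", "Bassist")
index.add(3, "Lefthanded", "Drummer")

lars_like = index.match("Lefthanded", "Deathmetal", "Swedish", "Bassist")
print(lars_like)  # {1}
```

In this model, a label shared by every node would produce a set as large as the whole graph, which is why a catch-all label adds nothing to the match.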

On the other hand, if you do this type of query a lot and the sets are large, you could make a single Label:

CREATE(p:LefthandedDeathmetalSwedishBassist {name:"Lars"})

Note that you could also put the Nationality as a Property:

CREATE(p:Lefthanded:Deathmetal:Bassist {name:"Lars", nationality:"Swedish"})

but this query then becomes more expensive, as you have to search the Properties:

MATCH(p:Lefthanded:Deathmetal:Bassist {nationality:"Swedish"})
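To make the extra cost concrete, here is a small Python toy model, again an illustration under assumptions rather than how Neo4j executes the query. The `nodes` dictionary and `match` function are hypothetical. Labels narrow the candidates first, and the property filter then has to open each remaining node's property map.

```python
# Toy data: node id -> labels and properties (made up for the example).
nodes = {
    1: {"labels": {"Lefthanded", "Deathmetal", "Bassist"},
        "props": {"name": "Lars", "nationality": "Swedish"}},
    2: {"labels": {"Lefthanded", "Deathmetal", "Bassist"},
        "props": {"name": "Odd", "nationality": "Norwegian"}},
    3: {"labels": {"Bassist"},
        "props": {"name": "Ana", "nationality": "Swedish"}},
}


def match(nodes, labels, **props):
    """Return ids of nodes that carry all labels and all property values."""
    # Step 1: keep only nodes that have every requested label.
    candidates = [(node_id, node) for node_id, node in nodes.items()
                  if labels <= node["labels"]]
    # Step 2: the extra work, inspecting each candidate's properties.
    return [node_id for node_id, node in candidates
            if all(node["props"].get(key) == value for key, value in props.items())]


swedish = match(nodes, {"Lefthanded", "Deathmetal", "Bassist"}, nationality="Swedish")
print(swedish)  # [1]
```

Neo4j does support indexes on a label and property combination, so whether the property version is really slower in practice depends on the indexes you create and the queries you run.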

The cool thing is that you can make modifications to your Neo4j data as your understanding evolves.

Also, once you get your data into Neo4j, you can use the Browser to explore your data. Often you will discover insights that hadn't occurred to you, which will make you rethink your Labels and Relationships.

geronimo4j · December 30, 2020, 12:11am

That makes a lot of sense, thank you again. I guess my lingering question (but maybe I'm just not yet seeing it!) is still: what if you don't want to have any concept of something being describable by a human? Labels like "Lefthanded, Deathmetal, Swedish, and Bassist" feel like they would need some level of human classification. What if you want to avoid importing all future things with some level of categorization that you need to track?

Like if I go in and manually import all movies right now, that's easy.

In a year, if somebody wanted to go and upload different hot sauces, then it's less easy unless they're all just one generic "Thing".

Am I missing something obvious here around how to avoid Node labels that could need some level of management/tracking, versus something more generic like Thing?

There would be Users, Posts, Things, Playlists.

Posts and Playlists can reference Things. So assuming 2+ posts for any "thing", the number of Things would be (hopefully) dwarfed by the number of posts, etc. It's not clear when to assume the set will be too large. Since you mentioned I can make modifications, I'm probably better off not over-engineering this; I'm just trying to fully understand the options. Thank you, again!

clem · December 30, 2020, 12:17am

Well, you can just create a Node without a label:

CREATE (NoMan)

See:
