Database load balancing

Today, after watching another cs50 video for load balancing, they suggested database partitioning in sql. For eg.

Rather than having flights table; separate them into flights_domestic and flights_international.

So, I would like to know implementing it in neo4j and what would be good practice between following:

Different nodes:

DomesticFlight
InternationalFlight

VS:

Single node:

Flight {type: 'domestic'}
Flight {type: 'international'}

I am just learning programming and have such basic questions. And it would be my pleasure to know good things first that even single node will be faster in neo4j or not.

Currently in Cypher, label access is cheaper than property access, so in this case we'd favor using labels for this.

Also keep in mind that Cypher allows multi-labeled nodes, so if you do need to address these nodes both as flights as well as domestic and international, there's no reason why you can't use :Flight for all flight nodes and then label the relevant subsets with :DomesticFlight and :InternationalFlight as well.

This type of partitioning looks like it's most useful for initial lookup of nodes, and this is usually one of the cheaper parts of a query, so although you may be saving a little on the initial lookup, the most expensive parts of a query are the traversals (which would be the table joins in SQL) used to figure out where flights are going and when (or to work backward and figure out from origin and destination and some time range which flights are available).

For those kinds of queries, we highly recommend partitioning via using specific relationship types, as selection of the types to traverse is cheaper than filtering.

See Max De Marzi's blog post on flight search with Neo4j.

1 Like

So, you mean MATCH (:Flight) will be faster than MATCH (:Flight {type: 'domestic'})? If so, will this ever be improved in near future? As in JavaScript, selection with particular element with id or class will be faster than selection of element node. I think similar approach should be implemented in neo4j?

Regarding next option with subset, will it not be better?

MATCH (f:Flight {type: 'domestic'})
MATCH (f) ...

Than:

MATCH (:Flight) - [:HAS] -> (d:DomesticFlght)
MATCH (d) ....

I should clarify, when the planner performs an index lookup, when the node is a starting node, then it will be quite fast.

But if you're doing a filtering operation, such as during the traversal somewhere (expanding to :Flight nodes WHERE flight.type = 'domestic') then it will be using property access, and that's slower than doing WHERE flight:Domestic or just using :Domestic within your MATCH pattern.

I don't think this pattern (:Flight) - [:HAS] -> (d:DomesticFlight) will come up in a flight graph, I don't think that makes much sense. The characteristic of the flight would best be modeled as a label (preferred) or property.

As for improvements, yes, changes are planned to the store sometime in the future, likely with a few minor releases out along the 4.x line.

Then what kind of subset were you talking about?

Ah, there will be no slower performance when I do MATCH (:Flight {type:'domestic'}) and will quite as faster as MATCH (:Flight) rather than filtering?

Sorry, I'm not understanding what you're asking here.

Please review the documentation on indexes which are used to find starting nodes in the graph. An index lookup is very different than a filter (when you traverse to a node and then perform property access and comparison) or a projection.

Okay finally, If I get it right; map projection and index lookup is very faster than performing property access. And I should not worry if I don't perform property access?

Index lookup will be faster than property access (properties are involved, but we're looking up via the index, not hitting the property store), but map projection does need to do property access.

In many queries you will need to do property access, and that's fine. You can't avoid it. I'm just saying that where you have a choice between modeling a type for a node (that you will need to filter on during large traversals, and where it makes sense) that filtering by label is more efficient than filtering by property.

So, the performance is same in both of the following queries?

MATCH (:Flight {type:'domestic'})

And

MATCH (f:Flight)
WHERE f.type = 'domestic'

I mean both queries will be behaving same internally?

This isn't map projection. Check out the map projection documentation for examples.

As for the snippets you provided, whether the property is inline in the MATCH or in the WHERE clause, this will be planned the same way.

This would likely use an index lookup, if this is used as a starting node in the query, and if you had an index on :Flight(type).

But if this was part of a longer pattern, and there was a better candidate for a starting node, then it might do a filter on property access instead.

For example:

MATCH (:Airport {code:'SAN'})<-[:FLIGHT_FROM]-(:Flight {type:'domestic'})-[:FLIGHT_TO]->(:Airport {code:'SFO'})
...

The planner would likely use an index lookup (if an index exists on :Airport(code)) to one node or the other (or both), expand one of the relationships, filter on the label (:Flight) and on the property, then expand into the other airport node.

If instead you used a separate label, :Domestic, then the filter operation would just be on the label, and property access on that flight node wouldn't be needed.

1 Like

I had read the docs but it seems I am not getting them quite clearly. I have read all the docs that you have linked in this post but I'm getting confusion one after another while you post an answer. So, can you please give me some examples on index lookup, map projection, and filter, etc. rather than documentation? This may seem noob to you but it's my fault that I'm not quite clear on them. Thanks.

Sure.

Map projection first, that should be easiest.

Map Projection

A map projection is a custom projection from either a map or node properties. Here's an example:

MATCH (p:Person {name:'Keanu Reeves'})-[:ACTED_IN]->(m:Movie)
WITH p, collect(m) as movies
RETURN p {.name, movies, whoa:true} as Keanu

Map projection is used in the RETURN so we get a custom map consisting of a mix of node properties and other values. We project out a map consisting of keanu's name, and then the movies collection from earlier (the key for this in the map will be the variable name 'movies', and the value will be the variable's value), and a custom property whoa which we set to true.

Map projections allow us to customize the map, using a custom projection. More examples are in the previous documentation.

Node Lookups

We need to lookup starting nodes in the graph before we start traversing and looking for matching patterns.

There are 4 main categories of node lookups:

  1. Lookup a node by id. This is the fastest lookup, but it requires that you know the graph id of the node (or relationship). We don't recommend storing ids outside of Neo4j for later lookup, since you might not know if a node has been deleted and recreated, and if so the id may not point to anything (because the node was deleted), or it might point to a whole different node (since we reuse ids post-deletion). It's uncommon to do these lookups.

  2. AllNodesScan. When no label is available in the MATCH pattern, then the planner has no choice but to scan through every node in the graph, performing property access (if properties are present in the pattern) to find the nodes that meet what you're looking for. For example, MATCH (n {name:'Keanu Reeves'}). No label is present, so every node in the graph must be checked. This lookup is the slowest available, since this lookup slows down the more nodes there are in the graph, and unless you're doing a graph-wide operation regardless of label, this is usually evidence of a mistake in the query, telling you that you should at least add a label to the MATCH pattern.

  3. NodeByLabelScan. If a label is present in the pattern, then the planner only has to look through nodes of that type. For example, MATCH (p:Person {name:'Keanu Reeves}). There is a :Person label present, so only :Person nodes need to be checked. If there is no index on :Person(name), then this will do a NodeByLabelScan followed by a property Filter operation, performing property access on those nodes until all matching nodes are found. This is a medium speed lookup, and is dependent upon how many nodes there are for the given label. Usually when there is a property present for which you want to filter quickly, you should add an index so this becomes an index lookup instead.

  4. NodeIndexSeek. When a label and property are present in a MATCH pattern, and there is an index for that label and property combination, then instead of the previous NodeByLabelScan + filter, we can do an index lookup instead. Take the example from the previous section: MATCH (p:Person {name:'Keanu Reeves}). Without an index present, this is a NodeByLabelScan followed by a property filter on p.name. But if we create an index for this: CREATE INDEX ON :Person(name), now the planner can use an index lookup instead. It does not access properties, but gets the nodes quickly via the index. So within a Cypher query you do not know if a certain lookup on a starting node is going to do a label scan and filter or an index lookup. You would need to know what indexes are available in your graph (CALL db.indexes()), and you could also check by running an EXPLAIN on the query, which will show you what operations the planner will use to execute the query. You could see from the query plan which type of lookup will be used to find your starting nodes.

After starting nodes are found, Neo4j uses traversal and filtering to find the rest of the pattern (as opposed to table joins, which is what a relational database would do). The executor will start from the starting nodes and expand relationships, filter, and perform whatever other operations are associated with the cypher query to find the result.

If this is still tough to understand, consider a real world example of the approaches you could use to find Keanu Reeve's phone number.

The analog for an AllNodesScan is: MATCH (p {name:'Keanu Reeves'}) RETURN p.phoneNumber. You have no label, you don't know what type of thing this is. So you start asking every single object in the world if they have the name Keanu Reeves, and if so you ask for their phone number. After you have asked every single object in the entire world, you now have the phone numbers of everything that told you that yes their name is Keanu Reeves.

The analog for a NodeByLabelScan is: MATCH (p:Person {name:'Keanu Reeves'}) RETURN p.phoneNumber. The difference now is that you know that this must be a person, so you only need to ask every person in the world if their name is Keanu Reeves, and if so you get their phone number, and you have to ask everyone in the world since there may be several people with that name. This is far more efficient than asking every thing in the world for their name, but it will still take a long time. If instead we had MATCH (p:Actor {name:'Keanu Reeves'}) now we know we only have to ask actors, which cuts down the number of people we need to ask significantly.

But consider if we had some web page where (for people only) we could enter a name and it would tell us the people who had that name anywhere in the world, and then we could get their phone numbers. This is the equivalent of an index on :Person(name), and is very fast (we don't need to ask people for their names, we ask the index instead, which was constructed specifically for looking up people by their name very quicky). Creating an index takes time (usually quick, but can be longer for millions or billions of nodes or more). But once the index is created, it knows how to lookup nodes of certain types by a certain property (or properties). The planner is aware of the indexes available, and when it sees a pattern with potential starting nodes that include the label and property (or properties) of an index, it will use an index lookup to find the starting nodes.

Filtering

As for traversal and filtering, consider this query:

MATCH (p:Person {name:'Keanu Reeves'})-[:ACTED_IN]->(m:Movie)
WHERE m.released = 2003
RETURN p, collect(m) as moviesIn2003

Note that this query is equivalent in every way to this one:

MATCH (p:Person {name:'Keanu Reeves'})-[:ACTED_IN]->(m:Movie {released:2003)
RETURN p, collect(m) as moviesIn2003

If we only have indexes on :Person(name) and :Movie(title), then the planner will choose to use p as the starting node, performing an index lookup for :Persons with the name of 'Keanu Reeves'. Then it will expand outgoing :ACTED_IN relationships, and then it will filter on the m node, first performing a label filter to make sure m is a :Movie node, then doing property access and making sure the node's released property equals 2003. Nodes that are not labeled as :Movie, or that do not have a released property, or have a released property that isn't equal to 2003 will be filtered out.

1 Like

Thank you so much for lovely answer. However, I have still few questions:

Are we okay to create index on non-unique field ie. in type property?

Flight {type: 'domestic'}

And to be a map projection, must it contain custom property? Sorry, I'm still not getting it properly. Is map projection is what we need to care in RETURN statement?

Indexed fields do not have to be unique on nodes of a given type, such as :Person(name), since multiple people can have the same name. If you do need uniqueness, then you need unique constraints. A unique constraint ensures that the property is unique for all nodes, and includes an index for quick lookup.

That said, labels are meant to aid in modeling of node types. A flight's type is better modeled as a label, not as a property.

Map projection is just a tool when you need a custom map of values returned. Before this existed, you could return all of the properties of a node, or return a custom map with map literals, but map projection makes it easy to mix and match and project out your own custom map. That's its purpose.

I think you're getting this confused with what I said about property access. Property access is just that: whenever you are accessing a node's properties, whether it's to do a comparison, or project it into a variable for later use or return, or if you're returning a node variable (since that by default also has to return all of that node's properties). Right now this is a bit expensive, but there are very few queries that don't need to access properties. My advice was on attempting to minimize unnecessary and redundant property access as much as possible.

I am just curious if we use property access like MATCH (f:Flight {type:'domestic'}) and later list all domestic flights and no more expanding queries, filtering. Then wouldn't it be enough fine with type is indexed than opposed to do with MATCH (d:DomesticFlight) and list all domestic flights?

You are correct in this case, and if you only use the type property for initial lookup (triggering an indexed lookup) then you're fine.

1 Like

I'm extremely sorry but after re-reading your statement I got confused.

Can you please give example on this (as selection of the types to traverse is cheaper than filtering)?

Also, if we have multi-label node, then it will be faster anyhow we have to traverse and do filtering along with the relationships? I mean you wanted to insist me to create multi-label node in any case instead of property accessible node specifically example with Flight which have domestic and international flights?

And when should I not use multi-label node?

See Max De Marzi's blog entry on flight search I linked to earlier. The idea is that the structures we use for relationships (when a node becomes dense at 50 relationships or so) are categorized optimally for selection by type and direction, so it does not need to perform a filtering across all connected relationships on a node it, just selects the interested types and directions per type to quickly get the interested relationships.

So if a node has 1000 :FRIENDS_WITH relationships and 3 :ENEMIES_WITH relationships, if a traversal is on :ENEMIES_WITH relationships then it goes through its data structure going through type first (there are two types) to select the :ENEMIES_WITH category, then selecting incoming and outgoing (since we didn't specify which direction we're interested in, but if we did we would be able to quickly select just the direction we wanted), and we would immediately have references to the only relationships we wanted to traverse, unaffected by the presence of the other relationship types we weren't interested in.

Otherwise, if we used generic :RELATES_TO relationships with type properties, we would have to filter every relationship on the node and do property access on them all to find the ones we were interested in.

A lot of the questions you're asking really depend upon what specific queries you want to run. As I mentioned earlier, in general when we're talking about some kind of "type" or "categorization" of a node, labels tend to be a better fit than properties, since that is the purpose of a label.

1 Like

Awesome! Thank you again.

Can you please look in my old question?