I am trying to get a better understanding of the terms inside the profile operators:
I understand (1) well enough, however I have trouble fully understanding (2-6).
For (2) Estimated rows, I don't understand how this number is determined. I also want to understand whether it is always better to reduce this number, or whether there are specific cases where that would be the exception.
For (3) DB hits, I understand that this refers to the number of database hits and that, when tuning a query, this number should be minimized. What I don't understand is why this value is sometimes higher or lower in specific cases (assuming no indexes).
Example: I have created 5000 nodes, then when I try to match one of the nodes,
PROFILE MATCH (t:Test1)
WHERE t._uuid = 'test1'
RETURN t
I took a look at the documentation. I now understand (2) Estimated rows, (4) Cache hits, and (5) Cache misses better. However, I still don't fully understand the DB hits as explained in the example. Furthermore, I don't see anything on (6) memory.
Thanks, the memory articles you gave were very useful for understanding it better. However, I still don't really understand DB hits fully. Is my explanation right for the above example: for the 2nd plan I get 10000 hits because there are 2 properties for each node (id and _uuid)? However, I still can't understand why the 1st plan has 5001 hits instead of 5000 hits.
Each operator will send a request to the storage engine to do work such as retrieving or updating data. A database hit is an abstract unit of this storage engine work. It is at a conceptual level, and if you want to know how it translates to actual work you can review the Neo4j Architecture in detail.
This is arguably the most important thing in a PROFILE plan, since Cypher operations execute per row. The idea with query tuning is to reduce the work done as much as possible while still getting the correct answer, so streamlining a query to reduce unnecessary rows during query execution is usually a win. Watch for where the rows spike in your query; that may be a good opportunity to tune.
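As a sketch of the idea (the :RELATED_TO relationship type here is hypothetical, just for illustration), moving a LIMIT earlier in the query can keep the row count down through the later operators:

```cypher
// Expands from every :Test1 node, then throws most rows away at the end
PROFILE MATCH (t:Test1)-[:RELATED_TO]->(other)
RETURN other
LIMIT 10

// Limits to 10 start nodes before expanding, so far fewer rows flow
// through the Expand operator (note: the two queries are not strictly
// equivalent, since the LIMIT applies to different rows)
PROFILE MATCH (t:Test1)
WITH t LIMIT 10
MATCH (t)-[:RELATED_TO]->(other)
RETURN other
```

Comparing the rows column between the two plans shows where the work was saved.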
These are calculated/estimated from graph statistics, and are usually important for the query planner when formulating and comparing plans. I haven't found them to be very useful for tuning myself, as these are often ballpark figures.
As Sameer says, these are abstract units of storage engine work whenever data in the database is touched, and as such db hits are not necessarily equal to each other 1:1. You can also treat this as a kind of ballpark figure: smoke to draw your attention to where there may be a fire to put out, though not always. Queries have to do work to deliver correct results, and that requires db hits. Watch for massive db hit spikes, and where you see them, let that draw your attention to the rows flowing between operators near those points of the query.
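To make that concrete with your example (the per-operator counts here are a sketch; exact db hit accounting varies by Neo4j version and storage engine):

```cypher
PROFILE MATCH (t:Test1)
WHERE t._uuid = 'test1'
RETURN t

// NodeByLabelScan :Test1   -> roughly one db hit per node scanned,
//                             sometimes plus one for the label lookup itself
// Filter t._uuid = 'test1' -> roughly one db hit per property access,
//                             so checking more properties per node means more hits
```

This is why the filter operator's db hits scale with both the number of rows reaching it and the number of properties it must read.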
4., 5., and 6. (cache hits, cache misses, and memory)
This has to do with the pagecache, which is the in-memory cache of the graph. Whenever possible the pagecache is used for db operations, to avoid the I/O hit of having to access the graph on disk. High cache hits and low cache misses indicate good utilization of the pagecache. If you start seeing higher cache misses and fewer cache hits, it may mean that your pagecache isn't big enough to hold the whole graph in memory, so you may need to look at adjusting your memory allocation, or increasing the memory in the system.
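Here's a sketch of the relevant settings in neo4j.conf (names as in Neo4j 4.x; the values are placeholders you'd size for your own graph and machine):

```
# neo4j.conf -- illustrative values only
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
# ideally large enough to hold the store files in memory
dbms.memory.pagecache.size=8g
```

You can also run `neo4j-admin memrec` to get recommended values for your system.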
Here are some resources for memory:
It's also important, in PROFILE and EXPLAIN plans, to check how nodes are looked up. In your example, for instance, we see a NodeByLabelScan followed by a Filter on the property. It's more efficient to create an index (on :Test1(_uuid) ) so the index can be used for quick lookup; you'd then see a NodeIndexSeek instead, with far fewer db hits, avoiding the need to filter across many rows.
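A sketch of creating that index (this is the Neo4j 4.x syntax; the index name is arbitrary, and older versions use `CREATE INDEX ON :Test1(_uuid)` instead):

```cypher
CREATE INDEX test1_uuid IF NOT EXISTS
FOR (t:Test1) ON (t._uuid);
```

After creating it, re-run your PROFILE and you should see a NodeIndexSeek replace the NodeByLabelScan + Filter pair.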
The section of query operators may be a helpful reference, most of the node lookup operators are at the top:
In general, lookup via index is going to be more performant than a label scan (which requires looking at all nodes with the label), which in turn is going to be more performant than an all nodes scan (which has to look at all nodes in the graph).
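As a quick sketch of how the same lookup profiles under each strategy:

```cypher
// All nodes scan + filter: touches every node in the graph
PROFILE MATCH (n) WHERE n._uuid = 'test1' RETURN n

// Label scan + filter: touches every :Test1 node
PROFILE MATCH (n:Test1) WHERE n._uuid = 'test1' RETURN n

// With an index on :Test1(_uuid), the same query as above plans a
// NodeIndexSeek instead and touches only the matching node(s)
PROFILE MATCH (n:Test1) WHERE n._uuid = 'test1' RETURN n
```

The queries are identical in result; only the operators the planner chooses (and their db hits) differ.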