Distance calculation between OSMNode nodes in a large graph never finishes

Hello to the Neo4j community. I started developing an OSM-based Neo4j project a few months ago. Everything ran well on a small 40k-node test graph, but I ran into difficulties scaling it to the large production graph with 500 million nodes. Please see the details below.

I have imported OpenStreetMap data using the OSM importer tool (GitHub - neo4j-contrib/osm: OSM Data Model for Neo4j). I tweaked the importer not to create OSMWayNodes, just OSMNodes. The OSMNodes are chained with NEXT relationships and the first node of a way is linked to the OSMWay node through a FIRST_NODE relationship.
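For illustration, the chain structure described above can be traversed with a query like this (a sketch assuming the labels and relationship types named above; the unbounded variable-length pattern is only practical on a small graph, hence the LIMIT on ways):

MATCH (w:OSMWay)-[:FIRST_NODE]->(first:OSMNode)
WITH w, first LIMIT 1
MATCH p = (first)-[:NEXT*0..]->(:OSMNode)
RETURN w, length(p) AS hops
ORDER BY hops DESC
LIMIT 1

This follows FIRST_NODE from a way to its first node and then walks the NEXT chain, returning the number of hops to the last node of that way.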

Next I calculated and added a 'distance' property to each NEXT relationship with this query:

call apoc.periodic.commit(
'MATCH (an:OSMNode)-[r:NEXT]->(bn:OSMNode)
WHERE NOT exists(r.distance)
WITH an, bn, r LIMIT $limit
SET r.distance = distance(an.location, bn.location)
RETURN COUNT(*)',
{limit: 10000}
);

In the small test graph with 40k nodes the query runs perfectly and produces the desired result. In the large database of 500 million nodes (the final production database), the query ran for 36 hours with no sign of finishing. It produced 49 neostore.transaction.db.X files of 263.3 MB each.

There was no indexing on the nodes in the graph. I plan to add an index on the OSMNode.location property for finding the nearest nodes to an input set of coordinates (I tested this and it worked in the small graph).
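For reference, in Neo4j 4.1 such an index can be created with the statement below (the index name is arbitrary). The resulting b-tree index on a point property also backs distance() range predicates, so nearest-node lookups can use it:

CREATE INDEX osm_node_location FOR (n:OSMNode) ON (n.location);

MATCH (n:OSMNode)
WHERE distance(n.location, point({latitude: $lat, longitude: $lon})) < 100
RETURN n
LIMIT 10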

I am running Neo4j v4.1.3, desktop v1.3.11, browser v4.2.0.

I attached a screenshot of the profiling result. Profiling was run on the small database, because there I could easily delete and recreate the distance properties on the NEXT relationships.

My issue is that the query does not scale to the 500-million-node database. Should I change the query, the graph structure, or the development environment, and what changes do I need to make for this query to run against the production graph?

Please let me know if I can give any more data or clarifications. Thanks a lot for the help.

Tiha

I think you will have much better luck with apoc.periodic.iterate instead of apoc.periodic.commit. The latter restarts the original search after every batch of 10,000, which can use a progressively larger amount of memory. The former instead runs one outer query that scans sequentially, while the inner query updates the relationships in batches. I think this will suffer less from slowing down as the database gets bigger. It will still take a while on a large database, but the time it takes is more likely to be a function of the database size, which is more reasonable. If the performance is still too slow, we could add a distance calculator to the OSM project as a procedure.

For now, try:

CALL apoc.periodic.iterate(
  'MATCH (an:OSMNode)-[r:NEXT]->(bn:OSMNode)
   WHERE NOT exists(r.distance) RETURN an, bn, r',
  'SET r.distance=distance(an.location,bn.location)',
{batchSize:10000, parallel:false});

Watch the machine's memory. If the process uses too much, or you see GC pauses logged in the Neo4j logs, consider a smaller batch size.
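If memory turns out to be the bottleneck, the relevant settings live in neo4j.conf; the values below are only illustrative and should be sized to the machine (heap for transaction state, page cache for the store files):

dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g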

The profiling you attached appeared to be after you already added all distances, since it claims to have produced no results. I think a profile of the original query would be more informative.

Also, keep in mind that the removal of the OSMWayNodes changes the OSM model in quite a fundamental way, and might make things more complex for you later. It all depends on what you plan to use this for. If it is for routing, then you very likely have lost some intersection information, although the original OSM model does not really have enough intersection information for proper routing anyway.

If you are only interested in an OSMNode-OSMNode connected graph, you might as well also delete (or not import) the OSMWays and OSMRelations.

Hi @craig.taverner , thanks a lot for your response. I watched all your presentation videos on OSM and Neo4j; it is an honour to get advice from you.

I will run the query you sent and check GC logging and neo4j logs.

The profiling behaviour of my query is pretty weird. I tested it again just now: I deleted the distance property from all NEXT relationships, then counted them like this:

match (an:OSMNode)-[r:NEXT]->(bn:OSMNode)
where exists(r.distance)
return count(*)

The query returned 0. Then I ran the query that adds the distance property with profiling, and it gave the same result as in the screenshot: 0 db hits. Running the count of r.distance again gave ~40k. So the distance property was added to all NEXT relationships even though profiling reported 0 db hits. I am not sure how this happened, but it may point towards a possible bug.

It would be great to add the distance calculation to the importer tool. I tried to do it, but it is a little more complicated than I thought: when the NEXT relationship is created between the nodes, the lat/lon properties of the nodes are not populated yet. So either the distance needs to be calculated after all the lat/lon properties have been added to the nodes, or the NEXT relationship needs to be created after the properties are added. If you could give some directions on how to proceed, I could try to implement it and open a PR to merge it back into the project. Whatever works best.

I removed the OSMWayNodes to reduce the graph size (from ~1 billion nodes to ~500 million) and to reduce the complexity of the graph and the queries. For my use case (routing along the OSMNodes and across connected OSMWays) it worked fine. I tested all the routing queries on the small dataset and they worked. I expect, or at least hope, that the routing queries will scale to the large graph too.

Thanks again for your help. I will report back with the findings after running your query to add the distance property.

I suspect that profiling the procedure call does not allow the profiler to see the inner queries, which is why you see no profile counts. You would need to profile an alternative query that does not use apoc.periodic.*, but uses LIMIT appropriately to only add a certain number of distances.
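For example, something like this should expose the actual plan and db hits, with a LIMIT so it only touches a small batch:

PROFILE
MATCH (an:OSMNode)-[r:NEXT]->(bn:OSMNode)
WHERE NOT exists(r.distance)
WITH an, bn, r LIMIT 1000
SET r.distance = distance(an.location, bn.location)
RETURN count(*)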

When I wrote the osm library I did try to find a way to calculate the distances during the original import, but as you say, when loading the relationships we only know the node ids from the osm-id to node-id mapping table, and do not know any of the node properties. I considered adding another mapping table for id to location, but that would have been a lot more work. It is still worth considering, and if you are interested in trying to add it yourself, I could try to give advice on how I think it could be done.

The short version is that the underlying bulk-import libraries inside the Neo4j code-base have support for mapping from externally defined identities to Neo4j-specific identities. This is so that the bulk importer can first load all the nodes, streaming directly onto the end of the 'node store' file, and then load all the relationships, also streaming onto the 'relationship store'. But when loading the relationships, they need to contain the correct Neo4j node IDs for both ends of every relationship. The input data will have only the external identities (e.g. in our case the OSM node IDs, which are probably the original PostgreSQL node table unique keys). The mapper is an in-memory hash table that allows us to look up the new Neo4j node ids. I suspect it is possible to provision an additional mapping table for mapping to locations as well, and we could then calculate and store the distances during the import. I did not investigate this further at the time.

Thank you, I'll keep in mind how to do profiling.

On adding the distance property at OSM import: the query you provided above was really fast, so there is no real need to calculate the distance at import time.
It would be rather costly to add it during import, and it is fast and easy (with the right query) once everything is in the right place in the DB. Thanks a lot for your help.