Running graph algorithms with APOC periodic.iterate

Hello,

I am trying to run closeness centrality on a graph with 624985 nodes and 54191395 edges using the following query (which completed within minutes on a smaller instance with 3566981 edges):

CALL algo.closeness('alias', 'through_citations', {graph:'huge', direction: 'BOTH', write:true, writeProperty:'GraphProperty_closeness.centrality_throughCitations'})
YIELD nodes,loadMillis, computeMillis, writeMillis;

This query has been running for more than 2 hours and has not written the property for any nodes. What steps can I take to check whether it will complete successfully? Is there a way to run this with an APOC procedure?

Thanks,
Lavanya

Update: I tried

CALL algo.memrec('alias', 'through_citations', "algo.closeness", {graph: "huge"}) YIELD nodes, relationships, requiredMemory, bytesMin, bytesMax RETURN nodes, relationships, requiredMemory, bytesMin, bytesMax

to check the memory requirements for running the algorithm, but with no luck:

Error
Neo.ClientError.Procedure.ProcedureCallFailed
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `algo.memrec`: Caused by: java.lang.IllegalArgumentException: The procedure [algo.closeness] does not support memrec or does not exist, the available and supported procedures are {beta.k1coloring, beta.modularityOptimization, beta.wcc, graph.load, labelPropagation, louvain, nodeSimilarity, pageRank, unionFind, wcc}.

This is probably not an answer. I am just sharing my experience with running graph algorithms against large graphs:

  • algo.memrec is not supported for all algorithms. You can see a list of the supported ones in the error message you got.
  • Make sure you give Neo4j a large enough heap and page cache.
  • Use the concurrency parameter if you have the Enterprise edition (see the sketch after this list).
  • Run the algorithm against a small portion of your graph first to see how it performs.
  • Use community detection algorithms (or any other algorithm) to break your graph into smaller subgraphs and then run closeness separately on every subgraph.
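
For example, this is the same closeness call as above with an explicit concurrency setting. This is only a sketch; the value of 4 is illustrative and you'd keep your own writeProperty name:

CALL algo.closeness('alias', 'through_citations',
  {graph:'huge', direction: 'BOTH', concurrency: 4, write:true,
   writeProperty:'GraphProperty_closeness.centrality_throughCitations'})
YIELD nodes, loadMillis, computeMillis, writeMillis;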

You can also turn on the debug log; it will show output and progress reports as the algorithm executes.

@shan @alicia.frame Thanks for the suggestions.

I have now computed the connected components of the graph with the query below, and I want to speed up the closeness computation by running it on each individual component:

CALL algo.unionFind('alias', 'through_citations', {graph:'huge', seedProperty:'GraphProperty_wcc_throughCitations', write:true, writeProperty:'GraphProperty_wcc_throughCitations'})
YIELD nodes AS Nodes, setCount AS NbrOfComponents, writeProperty AS PropertyName;
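
For reference, here is a quick sketch (using the property name written above) that I can run to see how the graph breaks down into components:

MATCH (n:alias)
WITH n.GraphProperty_wcc_throughCitations AS component, count(*) AS members
RETURN count(component) AS componentCount, max(members) AS largestComponent, avg(members) AS averageSize;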

Is a Cypher projection the only way to do this? I do not think the code below parallelizes the computation, does it?

CALL algo.closeness('MATCH (n:alias) RETURN id(n) AS id',
  'MATCH (n)-[:through_citations]-(m:alias) where n.GraphProperty_wcc_throughCitations == GraphProperty_wcc_throughCitations RETURN id(n) AS source, id(m) AS target', {graph:'huge', direction: 'BOTH', write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
YIELD nodes,loadMillis, computeMillis, writeMillis;

Thanks,
Lavanya

You could either:

  • Set an additional label on each alias with its community id, and then loop over each community label and run closeness centrality on that subset (see the labelling sketch below), or
  • Use a Cypher statement to identify all the community ids, and loop over the communities using a Cypher projection. Your query is almost there, but you'll need to update the node and relationship queries:
CALL algo.closeness('MATCH (n:alias) WHERE n.GraphProperty_wcc_throughCitations = [value] RETURN id(n) AS id',
  'MATCH (n)-[:through_citations]-(m:alias) WHERE n.GraphProperty_wcc_throughCitations = [value] RETURN id(n) AS source, id(m) AS target', 
  {graph:'cypher', write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
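
For the first option, here is a minimal sketch of tagging each alias with its community id as an extra label; the Community_ prefix and the use of apoc.create.addLabels are illustrative assumptions, not something you have to follow exactly:

CALL apoc.periodic.iterate(
  "MATCH (n:alias) RETURN n",
  "CALL apoc.create.addLabels(n, ['Community_' + toString(n.GraphProperty_wcc_throughCitations)]) YIELD node RETURN node",
  {batchSize:10000, parallel:true});

You can then loop over the distinct community labels and run closeness centrality against each label with the plain label-based loader.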

Closeness centrality is parallelized, but if you want to make the loop over the communities itself parallel, you could use something like apoc.mapParallel2.

I would probably use a threshold and ignore any components with fewer than (for example) 5 members, just to limit the number of communities you inspect (and if they're small, the closeness will be low anyway).
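
Putting those two ideas together, here is a sketch (not a tested recipe) that keeps only components with at least 5 members and maps over them in parallel with apoc.mapParallel2. The partition count of 8, the threshold of 5, and the property, relationship, and writeProperty names are illustrative and would need to match your graph:

MATCH (n:alias)
WITH n.GraphProperty_wcc_throughCitations AS component, count(*) AS members
WHERE members >= 5
WITH collect(component) AS components
CALL apoc.mapParallel2(
  "CALL algo.closeness(
     'MATCH (n:alias) WHERE n.GraphProperty_wcc_throughCitations = $component RETURN id(n) AS id',
     'MATCH (n:alias)-[:through_citations]-(m:alias)
      WHERE n.GraphProperty_wcc_throughCitations = $component
        AND m.GraphProperty_wcc_throughCitations = $component
      RETURN id(n) AS source, id(m) AS target',
     {graph:'cypher', params:{component: _}, write:true,
      writeProperty:'GraphProperty_closeness_centrality_throughCitations'})
   YIELD nodes RETURN nodes",
  {}, components, 8) YIELD value
RETURN count(*) AS componentsProcessed;

Each component value is handed to the fragment as _, which is then passed to the Cypher loader through the params map.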


@alicia.frame

Here is what I am using:

MATCH (n:alias) WITH DISTINCT n.GraphProperty_wcc_coauthors as value
CALL algo.closeness('MATCH (n:alias) WHERE n.GraphProperty_wcc_coauthors = $value RETURN id(n) AS id',
  'MATCH (n)-[:co_authors]-(m:alias) where m.GraphProperty_wcc_coauthors = $value RETURN id(n) AS source, id(m) AS target', 
  {graph:'cypher', params: {value: value}, write:true, writeProperty:'GraphProperty_closeness_centrality_coauthors'})
  YIELD nodes,loadMillis, computeMillis, writeMillis
  RETURN nodes,loadMillis, computeMillis, writeMillis

Will incorporate apoc.mapParallel2 soon.

Thanks

@alicia.frame

The above query returned the following error:

Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `algo.closeness`: Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2

Kindly let me know how I can troubleshoot this.

Best,
Lavanya

@shan @alicia.frame @andrew_bowman

Update:

CALL apoc.periodic.iterate(
  "MATCH (comp:GraphProperty_wcc_throughTopic) RETURN comp.GraphProperty_component AS component", 
  "CALL algo.closeness('MATCH (n:alias  {GraphProperty_wcc_throughTopic : $component}) RETURN id(n) AS id',
  'MATCH (n)-[r:through_topic]-(m:alias) RETURN id(n) AS source, id(m) AS target, r.weight as weight', 
  {graph:'cypher', params: {component: component}, write:true, writeProperty:'GraphProperty_closeness_centrality_throughTopic'})
  YIELD nodes,loadMillis, computeMillis, writeMillis
  RETURN nodes,loadMillis, computeMillis, writeMillis", {batchSize:5000, parallel:true})           
YIELD batches, total, errorMessages;

I got the above query working for a smaller instance. For my bigger instance, with 624985 nodes and 54191395 edges broken down into 390639 connected components on which closeness centrality is set to run, the query has been running for more than 30 minutes now. Do I have to switch gears and try apoc.mapParallel2, as @alicia.frame suggested in this thread?
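
One refinement I could try, following the threshold suggestion earlier in the thread, is to drive the iteration from a statement that skips tiny components, in place of the first statement above (the cutoff of 5 is just an example):

MATCH (n:alias)
WITH n.GraphProperty_wcc_throughTopic AS component, count(*) AS members
WHERE members >= 5
RETURN component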

Thanks,
Lavanya