After wasting a lot of time with LOAD CSV, I succeeded in importing 176M records from Semantic Scholar into my Neo4j database using neo4j-admin import
I was able to complete the operation 2.7 hours, once I upgraded to an r5.2xlarge with 64GB RAM
Thanks for your help @stefan.armbruster!
Now I am trying to run benchmarking queries on the database, and I am running into more trouble.
Here are some stats on the data:
MATCH (p:Paper) return count(p); --> 176452537
MATCH (a:Author) RETURN count(a); --> 58748409
MATCH ()-[c:CITES]->() return count(c); --> 581057025
MATCH ()-[h:HAS_AUTHOR]->() return count(h); --> 452616106
I have copied the queries below that are giving me the most trouble.
I expect them to be quite costly, but I let the first one run all day and it still did not finish, so I aborted it for now...
Can anyone give me advice how to construct them better?
"Find which papers have the most citations"
EXPLAIN MATCH (:Paper)-[r:CITES]->(p:Paper)
RETURN p.title, p.id, COUNT(r)
ORDER BY COUNT(r) DESC
LIMIT 10;
+-------------------+----------------+-------------------------+---------------+--------------------------+
| Operator | Estimated Rows | Identifiers | Ordered by | Other |
+-------------------+----------------+-------------------------+---------------+--------------------------+
| +ProduceResults | 24105 | COUNT(r), p.id, p.title | COUNT(r) DESC | |
| | +----------------+-------------------------+---------------+--------------------------+
| +Top | 24105 | COUNT(r), p.id, p.title | COUNT(r) DESC | COUNT(r); 10 |
| | +----------------+-------------------------+---------------+--------------------------+
| +EagerAggregation | 24105 | COUNT(r), p.id, p.title | | p.title, p.id |
| | +----------------+-------------------------+---------------+--------------------------+
| +Filter | 581057025 | anon[7], p, r | | `anon[7]`:Paper |
| | +----------------+-------------------------+---------------+--------------------------+
| +Expand(All) | 581057025 | anon[7], p, r | | (p)<-[r:CITES]-(anon[7]) |
| | +----------------+-------------------------+---------------+--------------------------+
| +NodeByLabelScan | 176452537 | p | | :Paper |
+-------------------+----------------+-------------------------+---------------+--------------------------+
"Find which authors have published the most papers"
EXPLAIN MATCH (:Paper)-[r:HAS_AUTHOR]->(a:Author)
RETURN a.name, a.id, COUNT(r)
ORDER BY COUNT(r) DESC
LIMIT 10;
+-------------------+----------------+------------------------+---------------+-------------------------------+
| Operator | Estimated Rows | Identifiers | Ordered by | Other |
+-------------------+----------------+------------------------+---------------+-------------------------------+
| +ProduceResults | 21275 | COUNT(r), a.id, a.name | COUNT(r) DESC | |
| | +----------------+------------------------+---------------+-------------------------------+
| +Top | 21275 | COUNT(r), a.id, a.name | COUNT(r) DESC | COUNT(r); 10 |
| | +----------------+------------------------+---------------+-------------------------------+
| +EagerAggregation | 21275 | COUNT(r), a.id, a.name | | a.name, a.id |
| | +----------------+------------------------+---------------+-------------------------------+
| +Filter | 452616106 | anon[7], a, r | | `anon[7]`:Paper |
| | +----------------+------------------------+---------------+-------------------------------+
| +Expand(All) | 452616106 | anon[7], a, r | | (a)<-[r:HAS_AUTHOR]-(anon[7]) |
| | +----------------+------------------------+---------------+-------------------------------+
| +NodeByLabelScan | 58748409 | a | | :Author |
+-------------------+----------------+------------------------+---------------+-------------------------------+
"Find which authors have the most citations"
EXPLAIN MATCH (:Paper)-[r:CITES]-(:Paper)-[HAS_AUTHOR]-(a:Author)
RETURN a.name, a.id, COUNT(r)
ORDER BY COUNT(r) DESC
LIMIT 10;
+-------------------+----------------+-------------------------------------+---------------+--------------------------------+
| Operator | Estimated Rows | Identifiers | Ordered by | Other |
+-------------------+----------------+-------------------------------------+---------------+--------------------------------+
| +ProduceResults | 54598 | COUNT(r), a.id, a.name | COUNT(r) DESC | |
| | +----------------+-------------------------------------+---------------+--------------------------------+
| +Top | 54598 | COUNT(r), a.id, a.name | COUNT(r) DESC | COUNT(r); 10 |
| | +----------------+-------------------------------------+---------------+--------------------------------+
| +EagerAggregation | 54598 | COUNT(r), a.id, a.name | | a.name, a.id |
| | +----------------+-------------------------------------+---------------+--------------------------------+
| +Filter | 2980923603 | a, anon[26], anon[7], r, HAS_AUTHOR | | `anon[7]`:Paper |
| | +----------------+-------------------------------------+---------------+--------------------------------+
| +Expand(All) | 2980923610 | a, anon[26], anon[7], r, HAS_AUTHOR | | (anon[26])-[r:CITES]-(anon[7]) |
| | +----------------+-------------------------------------+---------------+--------------------------------+
| +Filter | 452616107 | anon[26], HAS_AUTHOR, a | | `anon[26]`:Paper |
| | +----------------+-------------------------------------+---------------+--------------------------------+
| +Expand(All) | 452616107 | anon[26], HAS_AUTHOR, a | | (a)-[HAS_AUTHOR:]-(anon[26]) |
| | +----------------+-------------------------------------+---------------+--------------------------------+
| +NodeByLabelScan | 58748409 | a | | :Author |
+-------------------+----------------+-------------------------------------+---------------+--------------------------------+
@mike_r_black suggested the following structure (Thanks!), but I am not sure what makes it better, and how I can use that lesson to improve my query design...
The fact that it only touches 176M rows instead of 581M rows seems like a good sign though.
I guess the idea is that you can just run a "SIZE()" query on each node individually, which is relatively cheap. (Is that just an O(1) operation on each node? Basically calculating the length of the list of relations?)
"Better way to find which papers have the most citations"
EXPLAIN MATCH (p:Paper)
RETURN p, SIZE( (:Paper)-[:CITES]->(p) ) AS cite_count
ORDER BY cite_count DESC
LIMIT 10;
+------------------+----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Identifiers | Ordered by | Other |
+------------------+----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 10 | anon[33], cite_count, p | cite_count DESC | |
| | +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +Top | 10 | anon[33], cite_count, p | cite_count DESC | cite_count; 10 |
| | +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +Projection | 176452537 | anon[33], cite_count, p | | {cite_count : SIZE(`anon[33]`)} |
| | +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +RollUpApply | 176452537 | anon[33], p | | anon[33] |
| |\ +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Projection | 2 | anon[32], NODE33, REL41, p | | { : PathExpression(NodePathStep(Variable( NODE33),SingleRelationshipPathStep(Variable( REL41),OUTGOING,NilPathStep)))} |
| | | +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Filter | 2 | NODE33, REL41, p | | ` NODE33`:Paper |
| | | +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Expand(All) | 2 | NODE33, REL41, p | | (p)<-[ REL41:CITES]-( NODE33) |
| | | +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Argument | 1 | p | | |
| | +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 176452537 | p | | :Paper |
+------------------+----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+