I have some questions about importing data into Neo4j.
I have a large volume of data (100k JSON files, each containing 200k records).
What is the best way to import this data?
I am currently using PySpark and neo4j-admin import. Is there an alternative method, or can I import this much data using PySpark alone?
Maybe this blog post is helpful.
If you can describe a more specific issue you are having with your current method, the community may be able to give you more ideas.
Using Apache Spark alone will most likely result in deadlock situations for large graphs. Creating files and using neo4j-admin import is currently the best option, I believe.
One possibility might be to run a clustering algorithm on your graph in Spark and import the clusters separately so you get rid of the deadlocks. At the end you need to recreate the connections between clusters, of course.
It is by no means an easy solution, though.
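To make the file-based approach concrete, here is a minimal sketch of turning JSON records into the CSV node and relationship files that neo4j-admin import expects. The labels (`Person`, `KNOWS`), field names (`id`, `name`, `friend_id`), and file names are made-up examples; only the `:ID`, `:LABEL`, `:START_ID`, `:END_ID`, and `:TYPE` header conventions come from the neo4j-admin import CSV format:

```python
import csv

# Hypothetical shape of one JSON file's records:
records = [
    {"id": 1, "name": "Alice", "friend_id": 2},
    {"id": 2, "name": "Bob", "friend_id": 1},
]

# Nodes file: ":ID" and ":LABEL" are part of the neo4j-admin
# import CSV header format.
with open("persons.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["personId:ID", "name", ":LABEL"])
    for r in records:
        w.writerow([r["id"], r["name"], "Person"])

# Relationships file: ":START_ID", ":END_ID", ":TYPE" headers.
with open("knows.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow([":START_ID", ":END_ID", ":TYPE"])
    for r in records:
        w.writerow([r["id"], r["friend_id"], "KNOWS"])
```

At 100k files you would generate these CSVs from Spark rather than a single-process loop, but the header format is the same. The exact neo4j-admin import flags vary by Neo4j version, so check the manual for yours.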
You could try this utility written in Python.
It has a YAML config file where you can specify the file URL and the corresponding Cypher to ingest the data. It imports each file in sequence.
If you want to parallelize the import, you can create multiple YAML config files and run them in parallel.
As others mentioned, running in parallel opens up the possibility of deadlocks, because creating a relationship locks the nodes on both sides.
neo4j-admin import would still be the fastest way to import a huge amount of initial data.
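Since relationship creation locks both endpoint nodes, one way to parallelize without deadlocks is to batch the relationships so that no two relationships in the same batch share a node: each batch can then be written by concurrent workers, with the batches run one after another. A rough greedy sketch in plain Python (the data and function name are made up for illustration):

```python
def batch_relationships(rels):
    """Greedily assign (start, end) pairs to batches so that no two
    relationships in the same batch touch the same node. Batches run
    sequentially; within a batch, writes can proceed in parallel
    without contending for the same node locks."""
    batches = []      # list of lists of (start, end) pairs
    batch_nodes = []  # node ids already used by each batch
    for start, end in rels:
        for nodes, batch in zip(batch_nodes, batches):
            if start not in nodes and end not in nodes:
                batch.append((start, end))
                nodes.update((start, end))
                break
        else:  # no conflict-free batch found: open a new one
            batches.append([(start, end)])
            batch_nodes.append({start, end})
    return batches

rels = [(1, 2), (2, 3), (4, 5), (1, 3)]
print(batch_relationships(rels))
# → [[(1, 2), (4, 5)], [(2, 3)], [(1, 3)]]
```

This is essentially a simplified version of the clustering idea mentioned above; for hub-heavy graphs the batches around high-degree nodes get small, which is exactly where parallel writes deadlock.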