Query of load csv not completing even after 12 hours

Hi, I have been using Neo4j for quite a while now. I ran this query earlier before my computer crashed 7 days ago and somehow unable to run it now. I need to create a graph database out of a csv of bank transactions. The original dataset has around 5 million rows and has around 60 columns. This is the query I used, starting from 'Export CSV from real data' demo by Nicole White:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID}) 
SET transaction.base_currency_amount =toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region= line.REGION,
transaction.scope = line.SCOPE,
transaction.COUNTRY_RISK_SCORE= line.COUNTRY_RISK_SCORE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);

I tried:
Using LIMIT 0 before running query as per Micheal Hunger's suggestion in a post about 'Loading Large datasets'.
Used single MERGE per statement (this is first merge and there are 4 other merges to be used) as suggested by Michael again in another post.
Tried CALL apoc.periodic.iterate and apoc.cypher.parallel but doesn't work with LOAD CSV (seem to work only with MERGE and CREATE queries without LOAD CSV).
Increased max heap size to 16G as my laptop is of 16GB RAM. Btw finding it difficult to write this post as I tried running again now with 'PROFILE ' and it is still running since an hour.

Help needed to load query of this 5 million rows dataset. Any help would highly be appreciated.Thanks in advance! I am using Neo4j 3.5.1 on PC.

Do you have duplicate entries in your file? if not you don't need the distinct
I would also move the filter up before the split.

Did you create the constraint for :Transaciton(id) upfront?

Check your statement with EXPLAIN so you see what it does.

I would use heap-size: 12G, page-cache 2G and

1 Like

No, it doesn't . I will remove the DISTINCT. Do you think that would make a difference anyway?
I am sorry, what do you mean by moving the filter up?

Yes, I created constraint for Transaction node
The explain statement gives a plan as shown in figure with either 0 or 1 estimated rows. Can't understand it:


Oh, interesting. Why 12G? Does it give some cushion for a 16GB RAM?
Can't find page-cache in neo4j.conf file. Where can I set it?

The key was to NOT USE DISTINCT. Thanks again Michael, it works. Struggled with this for 4 days!

1 Like

The OS also needs memory as does teh database for page-cache.

It's odd, so you have 5M lines? how many columns, and how long is a line (in chars)?
Or do you have an example line?

It's odd that the distinct affects the runtime so much, would be good to test and fix if it's an issue.

Hi Michael,

So I have 5 million rows and around 76 variables. Please find below the header and top 2 sample lines from the database. I checked that 1st and 2nd lines have 684 and 687 characters respectively and the header has 1557 characters. Hope this helps you help me. Thanks in advance!

//Headers

"","REGISTRATION_COUNTRY_CODE","ADDRESS_COUNTRY_CODE","NATIONALITY_COUNTRY_CODE","V1","ORG_UNIT_NAME","ENTITY_SOURCE_REFERENCE_ID","LINE_OF_BUSINESS","BUSINESS_TYPE_CODE","CUSTOMER_TYPE_CODE","CUSTOMER_TYPE_ALTERNATIVE_CODE","ENTITY_BRANCH_CODE","RISK_LEVEL","CUSTOMER_STATUS_CODE","DATE_OPENED","DATE_CLOSED","PEP_SOURCE","PEP_CODE","CLASSIFICATION_EXCEPTION_CODE","ENTITY_OFFSHORE_FLAG","ACCOUNT_SOURCE_REFERENCE_ID","CURRENCY_CODE","AMOUNT_IN_BASE_CURRENCY","BASE_CURRENCY","CREDIT_DEBIT_CODE","ORIGINAL_CURRENCY","AMOUNT_IN_ORIGINAL_CURRENCY","TRANSACTION_TYPE_CODE","VALUE_DATE","BOOKING_DATE","SWIFT_FLAG","DATE_INSERTED","TRANSACTION_BRANCH_CODE","SENDER_COUNTRY","F50_ORIGINATOR_OFFS_FLAG","F59_BENEFICIARY_OFFS_FLAG","F52_ORDERING_BANK_CTRY_CODE","F52_ORDERING_BANK_OFFS_FLAG","F52_202_OUT_ORD_BANK_CTRY_CODE","F57_BENEFICIARY_BANK_CTRY_CODE","F57_BENEFICIARY_BANK_OFFS_FLAG","OFFSHORE_ORDERING_SIDE","OFFSHORE_BENEF_SIDE","F70_REMITTANCE_INFO","SOURCE_SYSTEM_CODE","SOURCE_SYSTEM_SUB_CODE","SWIFTDIN","SWIFTDOUT","TRANSACTION_SOURCE_REF_ID","DEBIT_PARTY_CTRY_CODE","CREDIT_PARTY_CTRY_CODE","INSTRUMENT","REGION","SCOPE","F59_BENEFICIARY_CTRY_CODE","F50_ORIGINATOR_CTRY_CODE","ORDER_TYPE","ACCOUNT_ID","ACCOUNT_TYPE_CODE","TRANSACTION_ID","CUTOFF_DATE","day_of_transac","suspicious_customers_transac","F50_ORIGINATOR","F59_BENEFICIARY","panama_flag","freq_transacs_panama","transac_amounts_panama","Risk..Score.x","Risk..Score.y","Risk..Score","COUNTRY_RISK_SCORE","RISK_SCORE","CUSTOMER_RISK_SCORE","CUSTOMER_RISK_SCORE_SCALED","F52_ORDERING_BANK"

//1st line


"1","66","66","66",101277,"RBI",2816221,"CORP","66-ANDER-KONTO","HILF",36,31000,"No Risk Assigned","ACTIVE","18-MAY-05",NA,"NONE","NO_PEP","RVV","N","02816221-937","EUR",3.63,"EUR","C","EUR",3.63,"SEC-DOM-EX","06-JUL-17","06-JUL-17","N","08-JUL-17","31000",NA,NA,NA,"AT","N",NA,NA,NA,"N","N","Gutschrift WP-KESt-Verrechnung BLZ 37449 VTNr.: 080005705 70961177/05.^07.2017 Depot 60.003.506 Basisgeschäft 75937477/05.07.2017","BDT","BDT",0,0,"BDT37449170705W1707057096117702",NA,NA,"SECURITIES","DOMESTIC","EXTERNAL",NA,NA,NA,294231040,"G-00",3348757869,"06-JUL-17",6,0,"Browsecat","Lakin and Sons",0,1,3525.3,0,0,0,0,1175.43333333333,1175.43333333333,-0.244258005859137,"VISA 13 digit"

//2nd line


"2","66","66","66",101278,"RBI",2816221,"CORP","66-ANDER-KONTO","HILF",36,31000,"No Risk Assigned","ACTIVE","18-MAY-05",NA,"NONE","NO_PEP","RVV","N","02816221-937","EUR",1186.64,"EUR","C","EUR",1186.64,"SEC-DOM-EX","06-JUL-17","06-JUL-17","N","08-JUL-17","31000",NA,NA,NA,"AT","N",NA,NA,NA,"N","N","Rücknahme Fonds 75937477-05.07.17 Depot       60.003.506 37449 AT0000A^00XN0        R-VIP 75 I (T) 8,000 Stk zu     148,330 EUR   05.07.17","BDT","BDT",0,0,"BDT37449170705W1707057593747700",NA,NA,"SECURITIES","DOMESTIC","EXTERNAL",NA,NA,NA,294231040,"G-00",3348757867,"06-JUL-17",6,0,"Skiba","Bosco-Cremin",0,1,3525.3,0,0,0,0,1175.43333333333,1175.43333333333,-0.244258005859137,"Voyager"

Bharat

Thanks a lot for the details! How long did it run after the DISTINCT was removed?

Around 20 Minutes? Sorry I actually didn't use EXPLAIN , so dont have exact time

i used also the same real data demo of the 'response complaint data model
see below further
//Complaints,companies, responses update relaties
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///Consumer_Complaints_ANC.csv" AS line
MATCH (:Response)-[r:TO]->(:Complaint)
SET r.timely = CASE line.Timely response? WHEN 'Yes' THEN true ELSE false END,
r.disputed = CASE line.Consumer disputed? WHEN 'Yes' THEN true ELSE false END
the complaints and response totaal above 2mn are in the data base as well as the relation wish to set additional variable to the relation

cypher is still running for one hour now any suggestions are welcome

you are not matching your response/complaint to your CSV data.
You should at least look them both up by id or such.