Query of load csv not completing even after 12 hours

ammubharatram · January 17, 2019, 11:19am

Hi, I have been using Neo4j for quite a while now. I ran this query earlier before my computer crashed 7 days ago and somehow unable to run it now. I need to create a graph database out of a csv of bank transactions. The original dataset has around 5 million rows and has around 60 columns. This is the query I used, starting from 'Export CSV from real data' demo by Nicole White:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID}) 
SET transaction.base_currency_amount =toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region= line.REGION,
transaction.scope = line.SCOPE,
transaction.COUNTRY_RISK_SCORE= line.COUNTRY_RISK_SCORE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);

I tried:
Using LIMIT 0 before running query as per Micheal Hunger's suggestion in a post about 'Loading Large datasets'.
Used single MERGE per statement (this is first merge and there are 4 other merges to be used) as suggested by Michael again in another post.
Tried CALL apoc.periodic.iterate and apoc.cypher.parallel but doesn't work with LOAD CSV (seem to work only with MERGE and CREATE queries without LOAD CSV).
Increased max heap size to 16G as my laptop is of 16GB RAM. Btw finding it difficult to write this post as I tried running again now with 'PROFILE ' and it is still running since an hour.

Help needed to load query of this 5 million rows dataset. Any help would highly be appreciated.Thanks in advance! I am using Neo4j 3.5.1 on PC.

michael.hunger · January 17, 2019, 11:43am

Do you have duplicate entries in your file? if not you don't need the distinct
I would also move the filter up before the split.

Did you create the constraint for :Transaciton(id) upfront?

Check your statement with EXPLAIN so you see what it does.

I would use heap-size: 12G, page-cache 2G and

ammubharatram · January 17, 2019, 11:50am

No, it doesn't . I will remove the DISTINCT. Do you think that would make a difference anyway?
I am sorry, what do you mean by moving the filter up?

Yes, I created constraint for Transaction node
The explain statement gives a plan as shown in figure with either 0 or 1 estimated rows. Can't understand it:

Oh, interesting. Why 12G? Does it give some cushion for a 16GB RAM?
Can't find page-cache in neo4j.conf file. Where can I set it?

ammubharatram · January 17, 2019, 1:39pm

The key was to NOT USE DISTINCT. Thanks again Michael, it works. Struggled with this for 4 days!

michael.hunger · January 17, 2019, 10:19pm

The OS also needs memory as does teh database for page-cache.

It's odd, so you have 5M lines? how many columns, and how long is a line (in chars)?
Or do you have an example line?

It's odd that the distinct affects the runtime so much, would be good to test and fix if it's an issue.

ammubharatram · January 18, 2019, 11:37am

Hi Michael,

So I have 5 million rows and around 76 variables. Please find below the header and top 2 sample lines from the database. I checked that 1st and 2nd lines have 684 and 687 characters respectively and the header has 1557 characters. Hope this helps you help me. Thanks in advance!

//Headers

"","REGISTRATION_COUNTRY_CODE","ADDRESS_COUNTRY_CODE","NATIONALITY_COUNTRY_CODE","V1","ORG_UNIT_NAME","ENTITY_SOURCE_REFERENCE_ID","LINE_OF_BUSINESS","BUSINESS_TYPE_CODE","CUSTOMER_TYPE_CODE","CUSTOMER_TYPE_ALTERNATIVE_CODE","ENTITY_BRANCH_CODE","RISK_LEVEL","CUSTOMER_STATUS_CODE","DATE_OPENED","DATE_CLOSED","PEP_SOURCE","PEP_CODE","CLASSIFICATION_EXCEPTION_CODE","ENTITY_OFFSHORE_FLAG","ACCOUNT_SOURCE_REFERENCE_ID","CURRENCY_CODE","AMOUNT_IN_BASE_CURRENCY","BASE_CURRENCY","CREDIT_DEBIT_CODE","ORIGINAL_CURRENCY","AMOUNT_IN_ORIGINAL_CURRENCY","TRANSACTION_TYPE_CODE","VALUE_DATE","BOOKING_DATE","SWIFT_FLAG","DATE_INSERTED","TRANSACTION_BRANCH_CODE","SENDER_COUNTRY","F50_ORIGINATOR_OFFS_FLAG","F59_BENEFICIARY_OFFS_FLAG","F52_ORDERING_BANK_CTRY_CODE","F52_ORDERING_BANK_OFFS_FLAG","F52_202_OUT_ORD_BANK_CTRY_CODE","F57_BENEFICIARY_BANK_CTRY_CODE","F57_BENEFICIARY_BANK_OFFS_FLAG","OFFSHORE_ORDERING_SIDE","OFFSHORE_BENEF_SIDE","F70_REMITTANCE_INFO","SOURCE_SYSTEM_CODE","SOURCE_SYSTEM_SUB_CODE","SWIFTDIN","SWIFTDOUT","TRANSACTION_SOURCE_REF_ID","DEBIT_PARTY_CTRY_CODE","CREDIT_PARTY_CTRY_CODE","INSTRUMENT","REGION","SCOPE","F59_BENEFICIARY_CTRY_CODE","F50_ORIGINATOR_CTRY_CODE","ORDER_TYPE","ACCOUNT_ID","ACCOUNT_TYPE_CODE","TRANSACTION_ID","CUTOFF_DATE","day_of_transac","suspicious_customers_transac","F50_ORIGINATOR","F59_BENEFICIARY","panama_flag","freq_transacs_panama","transac_amounts_panama","Risk..Score.x","Risk..Score.y","Risk..Score","COUNTRY_RISK_SCORE","RISK_SCORE","CUSTOMER_RISK_SCORE","CUSTOMER_RISK_SCORE_SCALED","F52_ORDERING_BANK"

//1st line


"1","66","66","66",101277,"RBI",2816221,"CORP","66-ANDER-KONTO","HILF",36,31000,"No Risk Assigned","ACTIVE","18-MAY-05",NA,"NONE","NO_PEP","RVV","N","02816221-937","EUR",3.63,"EUR","C","EUR",3.63,"SEC-DOM-EX","06-JUL-17","06-JUL-17","N","08-JUL-17","31000",NA,NA,NA,"AT","N",NA,NA,NA,"N","N","Gutschrift WP-KESt-Verrechnung BLZ 37449 VTNr.: 080005705 70961177/05.^07.2017 Depot 60.003.506 Basisgeschäft 75937477/05.07.2017","BDT","BDT",0,0,"BDT37449170705W1707057096117702",NA,NA,"SECURITIES","DOMESTIC","EXTERNAL",NA,NA,NA,294231040,"G-00",3348757869,"06-JUL-17",6,0,"Browsecat","Lakin and Sons",0,1,3525.3,0,0,0,0,1175.43333333333,1175.43333333333,-0.244258005859137,"VISA 13 digit"

//2nd line


"2","66","66","66",101278,"RBI",2816221,"CORP","66-ANDER-KONTO","HILF",36,31000,"No Risk Assigned","ACTIVE","18-MAY-05",NA,"NONE","NO_PEP","RVV","N","02816221-937","EUR",1186.64,"EUR","C","EUR",1186.64,"SEC-DOM-EX","06-JUL-17","06-JUL-17","N","08-JUL-17","31000",NA,NA,NA,"AT","N",NA,NA,NA,"N","N","Rücknahme Fonds 75937477-05.07.17 Depot       60.003.506 37449 AT0000A^00XN0        R-VIP 75 I (T) 8,000 Stk zu     148,330 EUR   05.07.17","BDT","BDT",0,0,"BDT37449170705W1707057593747700",NA,NA,"SECURITIES","DOMESTIC","EXTERNAL",NA,NA,NA,294231040,"G-00",3348757867,"06-JUL-17",6,0,"Skiba","Bosco-Cremin",0,1,3525.3,0,0,0,0,1175.43333333333,1175.43333333333,-0.244258005859137,"Voyager"

Bharat

michael.hunger · January 18, 2019, 1:32pm

Thanks a lot for the details! How long did it run after the DISTINCT was removed?

ammubharatram · January 18, 2019, 1:35pm

Around 20 Minutes? Sorry I actually didn't use EXPLAIN , so dont have exact time

arthcras · April 30, 2019, 12:41am

i used also the same real data demo of the 'response complaint data model
see below further
//Complaints,companies, responses update relaties
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///Consumer_Complaints_ANC.csv" AS line
MATCH (:Response)-[r:TO]->(:Complaint)
SET r.timely = CASE line.Timely response? WHEN 'Yes' THEN true ELSE false END,
r.disputed = CASE line.Consumer disputed? WHEN 'Yes' THEN true ELSE false END
the complaints and response totaal above 2mn are in the data base as well as the relation wish to set additional variable to the relation

cypher is still running for one hour now any suggestions are welcome

michael.hunger · June 19, 2019, 10:11pm

you are not matching your response/complaint to your CSV data.
You should at least look them both up by id or such.

Topic		Replies	Views
CSV import issue Import / Export	26	532	June 21, 2023
Loading very a very large dataset Newbie Questions import	9	523	February 21, 2022
Creating relationship using MERGE runs endlessly Cypher	12	531	May 12, 2021
LOAD CSV failed to apply transaction and Neo4j cannot accept any query until restart Neo4j Graph Platform load-csv , transaction , index	0	360	November 25, 2021
Load CSV data Neo4j Graph Platform migrated	4	101	January 1, 2023

Query of load csv not completing even after 12 hours

Related Topics