I am trying to create a graph where nodes are products and they're linked by the ids of customer who've bought both products. The basic idea is contained in "Graph-Based Structures for the Market Baskets Analysis", and the key figure is here:
Eventually I want to assign a weight to the relationships based on how many customers bought two products, probably with a Jaccard algorithm in the data science library.
It wasn't hard to create a radial structure:
LOAD CSV WITH HEADERS FROM 'file:///BV-1350-Market_Basket_Input_Data.csv' AS line
WITH
line.HUB_PRODUCT_KEY AS HUB_PRODUCT_KEY,
line.HUB_CUSTOMER_KEY AS HUB_CUSTOMER_KEY
MERGE (p:Product {hub_product_key: HUB_PRODUCT_KEY})
MERGE (c:Customer {hub_product_key: HUB_PRODUCT_KEY,
hub_customer_key: HUB_CUSTOMER_KEY})
WITH p,c
MATCH (p),(c)
WHERE
p.hub_product_key = c.hub_product_key
MERGE (p)-[:BOUGHT_BY]->(c)
RETURN p,c
And my hope was that I'd be able to create a subgraph consisting of products linked by customers who'd bought more than one product:
MATCH path=(p1:Product)-[:BOUGHT_BY]->(c1:Customer)<-[:BOUGHT_BY]-(p2:Product)
RETURN path
This returned no changes, no records
, even though I've verified in Snowflake that there are customers who have bought more than one product.
I don't think this is the best approach anymore, but I include it for completeness in case it's helpful.
Then I thought I might create product nodes with a 'customer id' property containing a list of all the customers who've bought that product:
LOAD CSV WITH HEADERS FROM 'file:///BV-1350-Market_Basket_Input_Data.csv' AS line
MERGE (p:Product {hub_product_key: line.HUB_PRODUCT_KEY})
ON CREATE SET p.hub_customer_key = [line.hub_customer_key]
ON MATCH SET p.hub_customer_key = p.hub_customer_key + line.hub_customer_key
With this I would create some sort of relationship which looks at the list of customer ids for products and links them that way.
I've since discovered that Neo4j hates you trying to put more than one value in a property, and at any rate I keep getting Property values can only be of primitive types or arrays thereof
errors.
This still seems like the best approach, but I could be wrong.
Finally, I tried loading product nodes in with the customer ids as properties, which I'd use to make links between products by matching on the customer ids, and this tanked the database so hard I have to completely reset it:
LOAD CSV WITH HEADERS FROM 'file:///BV-1350-Market_Basket_Input_Data.csv' AS line
WITH
line.HUB_PRODUCT_KEY AS HUB_PRODUCT_KEY,
line.HUB_CUSTOMER_KEY AS HUB_CUSTOMER_KEY
MERGE (p:Product {hub_product_key: HUB_PRODUCT_KEY,
hub_customer_key: HUB_CUSTOMER_KEY})
WITH p
MATCH (p1:Product), (p2:Product)
WHERE
p1.hub_customer_key = p2.hub_customer_key AND
NOT p1.hub_product_key = p2.hub_product_key
MERGE (p1)-[:CUSTOMER {customer_key:p1.hub_customer_key}]->(p2)
RETURN p1, p2
I'm not sure why, because when I inspect the product nodes they each contain one customer id each, so I don't know why there'd be so much draw it'd crash my local instance.
So I guess my question is: given what I've tried so far, how can I set up a graph of products that are connected by customers who've purchased both products such that I can then assign weights between the products.