Posting a random idea/use case for using Neo4j for customer churn in the telecom domain. We all know that predicting customer churn is important for any company, and taking the necessary steps to retain customers matters in almost all industries. We focus specifically on telcos, and we are brainstorming with our data to improve a churn model built with XGBoost. Traditional ML algorithms work well and give a decent accuracy, but we see the potential of graph databases here. The customer data is all in a traditional RDBMS, so the challenge is: how do we use it with Neo4j? Pushing huge volumes of data every day would take significant effort and time. Are there any effective ways to do this?
Any thoughts are welcome!
There are a lot of different integration approaches, from batch to streaming, but yes, basically you end up taking the data from the RDBMS and writing it into a graph shape so that you can use graph approaches like GDS (Graph Data Science). Which exact approach you take depends on what's already in place, how fresh you need the data to be, and so on.
Some customers use the Spark connector (Neo4j Connector for Apache Spark v5.0.0 - Neo4j Spark Connector) to pull batches and engineer them into a graph shape. Others use the Kafka integration to set up source connectors for their original DBs and replicate data into the graph. There are a bunch of other ways besides these too, depending on what you're trying to do.
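To make the Spark route concrete, here's a minimal PySpark sketch of the batch pattern: read a table over JDBC, then write it to Neo4j as nodes via the Spark connector's `org.neo4j.spark.DataSource` format. The table name, column names, and connection details are illustrative assumptions, not anything from this thread:

```python
# Sketch: pull a batch from an RDBMS via JDBC and write it into Neo4j as nodes.
# Assumes the Neo4j Spark Connector jar is on the Spark classpath. All table,
# column, and connection details below are made-up placeholders.

def neo4j_write_options(url, user, password, labels, key):
    """Build the option dict for a Neo4j Spark Connector node write."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
        "labels": labels,      # e.g. ":Customer"
        "node.keys": key,      # merge key so re-runs don't duplicate nodes
    }

def run_job():
    # Imports deferred so the helper above stays importable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-to-neo4j").getOrCreate()

    customers = (spark.read.format("jdbc")
                 .option("url", "jdbc:postgresql://host:5432/telco")  # assumption
                 .option("dbtable", "customers")                      # assumption
                 .option("user", "etl").option("password", "***")
                 .load())

    (customers.write.format("org.neo4j.spark.DataSource")
     .mode("Overwrite")  # with node.keys set, this merges on re-runs
     .options(**neo4j_write_options(
         "neo4j://localhost:7687", "neo4j", "***",
         ":Customer", "customerId"))
     .save())
```

The same skeleton runs fine as an AWS Glue PySpark job; only the session setup and credential handling change.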
Thanks for writing, David. That was helpful. If you can point to some example GitHub repos or a demo, that would be great. The frequency of the data update will be daily, and the solution is in AWS. I guess services like AWS Glue could help automate a PySpark solution for converting the data from the RDBMS to the graph DB.
You can write some scripts to connect to the RDBMS and pump the data into the graph store. The specifics of the script will vary from DB to DB, but the central idea is that you connect to the source and periodically pull the data from the source system into the target. The frequency of the pull can be tuned based on the capacity of the store you are pushing the data into, as well as the amount of information available at the source. If it is just a one-time data upload, it is better to trigger it on an event at a point in time; but if it is a regular occurrence, you can use a scheduler of your choice to pull the data from the source and push it into the store.
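A minimal sketch of such a periodic pull, assuming the source table has a `last_updated` timestamp column to act as a watermark, and batching the Neo4j side with `UNWIND` + `MERGE` so each batch is one statement rather than one statement per row. Table, column, and label names are illustrative assumptions:

```python
# Sketch of an incremental pull: select only rows changed since the last run,
# then MERGE them into Neo4j in batches. Names here are made-up placeholders.

def incremental_query(table, watermark_column, last_watermark):
    """SQL that fetches only rows changed since the previous run's watermark."""
    return (f"SELECT * FROM {table} "
            f"WHERE {watermark_column} > '{last_watermark}' "
            f"ORDER BY {watermark_column}")

def merge_cypher(label, key):
    """Batched upsert: one UNWIND statement per batch of rows."""
    return (f"UNWIND $rows AS row "
            f"MERGE (n:{label} {{{key}: row.{key}}}) "
            f"SET n += row")

def chunks(rows, size=1000):
    """Split the pulled rows into fixed-size batches."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def sync_once(rows, driver, label="Customer", key="customer_id"):
    """Push one incremental pull into Neo4j via the official Python driver."""
    cypher = merge_cypher(label, key)
    with driver.session() as session:
        for batch in chunks(rows):
            session.run(cypher, rows=batch)
```

A daily scheduler (cron, Airflow, an EventBridge-triggered Glue job, etc.) would call the pull with the stored watermark, run `sync_once`, and persist the new watermark for the next run.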
Hope this helps clarify things.