Help with Link Prediction and Graph Design to Solve - Different Categories of Persons, Identifiers, and Works

Hi there,

TL;DR: I am using linkPrediction (1.8 preview :slightly_smiling_face:) but am worried that my graph projection doesn't fit the problem well. I am having trouble melding co-author information with identifier and name information (a single person can have multiple names and IDs), and extra trouble with categorical membership types.

Essentially, I am trying to work out whether graph algorithms will work for a few problems I have. I thought they would, partly because reviewing co-authorship is one way our domain experts figure things out manually as well. But after running the pipelines and reviewing the outputs on the real data, it's not looking so hot, and I am trying to understand whether my graph design and plan of attack make sense.

Here is the lay of the land --
I have two types of "people" -- Writers and Publishing Companies:

  • Writers and Publishers can both have the following:
    • Primary ID
    • Slightly less primary ID (that ends up being multiple for the same person)
    • Legitimate multiple names for the same person
    • Membership to various organizations
  • Works have a set of writers and publishers that are "ON_WORK", like co-authorship. Publishers tend to be involved in many, many more things than specific writers (a number of publishers have 300k different people they work with).

  • When we receive new Work information, in general one person will have good information;
    • however, the others involved sometimes come with sparse information, like just a first and last name.

The approach I originally took was to have a node for each Identifier and Name (because each "person" can have multiple of each), and I essentially did something like (person1)-[:HAS_ID]-(ID)-[:HAS_ID/NAME]-(person2) for each thing. Same for people that are on a work together.

  • Certain identifiers are more important than others, and I was going down the route of dropping the intermediary node: just set up relationships directly between the people that share an ID or MembershipType, and assign a default weight to each of those relationship types.

I intended to keep the relationship types separate, since in reality I am not sure what an appropriate weight is for each kind of shared ID.
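To make the "drop the intermediary node" idea concrete, here is a minimal Python sketch of what I mean, run outside the database. The identifier type names, the `SHARES_` prefix, and the weight values are all placeholders I made up for illustration; they are not tuned and not anything GDS prescribes.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical default weights per identifier type (assumed values, not tuned).
DEFAULT_WEIGHTS = {"PRIMARY_ID": 1.0, "SECONDARY_ID": 0.6, "NAME": 0.3}


def shared_identifier_edges(claims, weights=DEFAULT_WEIGHTS):
    """Turn (person, id_type, value) claims into direct person-person edges.

    Returns {(person_a, person_b, rel_type): weight}. The relationship type
    stays separate per identifier type instead of being summed into one score.
    """
    # Group the people that hold each (id_type, value) combination.
    holders = defaultdict(set)
    for person, id_type, value in claims:
        holders[(id_type, value)].add(person)

    edges = {}
    for (id_type, _value), persons in holders.items():
        # Every unordered pair of holders gets one edge of this type.
        for a, b in combinations(sorted(persons), 2):
            edges[(a, b, f"SHARES_{id_type}")] = weights[id_type]
    return edges


claims = [
    ("p1", "PRIMARY_ID", "A-100"),
    ("p2", "PRIMARY_ID", "A-100"),
    ("p2", "NAME", "J. Smith"),
    ("p3", "NAME", "J. Smith"),
    ("p1", "SECONDARY_ID", "X-7"),
]
edges = shared_identifier_edges(claims)
```

One thing this makes obvious: a value shared by n people produces n*(n-1)/2 edges, so anything that behaves like a super node (a very common name, a large membership group) blows up quickly under this scheme.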

Another question: does keeping the works individualized serve me at all? I tried the "Co-Author" approach, where the number of times people collaborated becomes a weighted relationship between them. I'm not sure why, but that feels different in its level of richness than retaining the individual works in the network. It did quite poorly (although it looked great on the train and test metrics from the pipeline, and I tried a number of different approaches for the class imbalance issues), partially because the submissions we get really aren't rich in the information they provide, even regarding co-authorship.
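For reference, the "Co-Author" collapse I am describing amounts to something like the following Python sketch. The work and person IDs are made-up sample data, and this is only an illustration of the counting, not the exact query I ran.

```python
from collections import Counter
from itertools import combinations


def coauthor_weights(works):
    """Collapse works into weighted person-person pairs.

    works: {work_id: iterable of person ids on that work}.
    Returns a Counter mapping each unordered pair (a, b), with a < b,
    to the number of works the two appear on together.
    """
    counts = Counter()
    for people in works.values():
        # De-duplicate within a work, then count each pair once per work.
        for a, b in combinations(sorted(set(people)), 2):
            counts[(a, b)] += 1
    return counts


# Made-up sample: two writers and one publisher across three works.
works = {
    "w1": ["p1", "p2", "pub1"],
    "w2": ["p1", "p2"],
    "w3": ["p2", "pub1"],
}
weights = coauthor_weights(works)
```

After the collapse, all that is left is the pair counts. You can no longer tell, for example, whether p1, p2, and pub1 were ever all on the same work, which may be part of the richness I feel is missing.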

Where I am extra stuck, though, is membership. These membership groups are kind of super nodes as well (just like the big publishing companies), so I wanted to just have a node property that could be compared categorically. Obviously that's not really supported cleanly. I tried one-hot encoding, but wasn't sure how to get the encoding onto the in-memory graph, because I also need to collect the distinct list of membership types to do that.
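Here is a rough Python sketch of the encoding step I have in mind, done outside the database. The node IDs and organization names are invented for the example, and giving nodes without any membership an all-zero vector is my own assumption about how to handle them, not something I have confirmed is the right call.

```python
def one_hot_memberships(node_memberships):
    """One-hot (multi-hot) encode membership types per node.

    node_memberships: {node_id: set of membership types, possibly empty}.
    Returns (vocab, vectors): the sorted distinct membership types, and a
    fixed-length list of floats per node in that same order.
    """
    # Pass 1: collect the distinct membership types across all nodes.
    vocab = sorted({m for ms in node_memberships.values() for m in ms})
    index = {m: i for i, m in enumerate(vocab)}

    # Pass 2: build one vector per node; nodes with no membership stay all zeros.
    vectors = {}
    for node, ms in node_memberships.items():
        vec = [0.0] * len(vocab)
        for m in ms:
            vec[index[m]] = 1.0
        vectors[node] = vec
    return vocab, vectors


# Invented sample: two people plus a name node that has no membership at all.
sample = {
    "p1": {"ORG_A"},
    "p2": {"ORG_A", "ORG_C"},
    "name:J. Smith": set(),
}
vocab, vectors = one_hot_memberships(sample)
```

Every node ends up with a vector of the same length, which I believe is what a list-valued node property would need in order to be usable as a feature. How to write that property back and include it in the projection is the part I am unsure about.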

Any help is greatly appreciated!

Best regards,

Things I've watched and read --
9 - Building an ML Pipeline in Neo4j Link Prediction Deep Dive - YouTube
Exploring Supervised Entity Resolution in Neo4j - Neo4j Graph Database Platform
And a ton of documentation, googling about class weights and other principles, and also sleuthing these forums.

Essentially, my issue is that one of the key characteristics that helps me understand whether folks may or may not be the same person (taking a link prediction approach) is a categorical variable that represents membership in one of 230 different organizations. Each of these organizations contains tens of thousands to a few million members.

Previously I had been using each organization as its own node, and creating a relationship to the effect of [:HAS_MEMBERSHIP]. I am trying to understand whether that is a better approach than one-hot encoding, which I also wasn't really able to accomplish because not all the nodes have a membership: a number of the nodes represent things like "fullnames", "identifiertype1", or "identifiertypeother".

Also, I think I have some misunderstanding about the ways to match people up. I keep running into the issue that everything is always homogeneous and friendly in the examples I've seen people demonstrate.

And while I've finally figured out a way to some degree to treat the people as a same type of node and
