Best way to model this data

I'm totally new to neo4j, and I have data very similar to the example movie data, but I'm not sure if the same model works for my case.

I have Shows nodes and People nodes, with relationships such as acted in, directed in etc., but my issue is that there are many more roles in my data than the obvious ones. It's very likely that at some point more will be added beyond the scope of what I can think of. For example, one Person may be credited as assistant director and flower arranger.
To further complicate matters, all these roles have English and Chinese translations, which I do not want to duplicate. So I can create a relationship :DIRECTED_IN and give it properties

english: "director",
chinese: "導演"

But when I create another relationship of the type :DIRECTED_IN, it will not have these properties, so it will need to be translated again. If a translation is changed or corrected, it would have to be done for each :DIRECTED_IN relationship. Perhaps this is how it should be, but it seemed awkward to me.
My first thought was to do this:

(a: Person)-[:WORKED_AS]->(b:Role)-[:ON]->(c:Show)

But then I found myself having to make the roles separately for each person anyway so it wasn't really any better. Should I be dealing with translations in a different way?

I read some other posts about translations, and I am now considering this, with each person having unique Role nodes as described above.

(b:Role)-[:HAS_TRANSLATION]->(d:Translation {en: "Director", zh: "導演"})

Does this make more sense? Still a little unsure if I'm overcomplicating things.

There isn't really a "right" or "best" way to model your data. How to structure your data should only be informed by what that data represents, and how you are going to use it. One of the biggest things you should be thinking about, is what kind of queries you'll be running. Maybe even create a few ideal queries, that you'd want to be able to make, when the data is there. With those, consider which data patterns will be easier/faster to work with, to get the data you want.

IMO, the direction you're going is the right one, but there's a few semantic clarifications I would make.

  1. (:Person)-[:WORKED_AS]->(:Role) Role here isn't a singular role. In fact, you'll wind up with one of these, for every credit in a movie.

Some possibilities:

Maybe keep the data simple, and the language out for now?

(:Person)-[:CREDIT {role: 'director'}]->(:Show)
# You could then have *one big node* which maps `en` roles to other languages:
(:Lang { code: "zh", director: "導演", ... })

Keep the language in, with the "Role" semantics clarified

(:Person)-[:CREDITED_AS]->(c:Credit)-[:ON]->(c:Show),
(c)-[:ROLE]->(:Role)-[:TRANS]->(:Translation {en: 'Director'})

Split the language even further, if that's the part you're really focused on:

(:Person)-[:CREDITED_AS]->(c:Credit)-[:ON]->(c:Show),
(c)-[:ROLE]->(:Role)-[:TRANS {code: 'en']->(:Translation {code: 'en', value: 'Director'})->(:Lang {code: 'en', name: 'English'})

The real questions you need to carefully analyze:

  1. Why are you building the graph.
  2. What kind of data do you want out.
  3. How are you going to use the data.
  4. How might you want to change the data.

Answers to those questions could help us better help you.

Thank you for all your help.
The second option you suggest about clarifying the semantics seems like a good choice for me. The first one would be difficult as data can come in either in Chinese or English only at first, so the role property of :CREDIT would be inconsistent.
You make a very good point about how the data will be used in the end.
Languages won't be mixed, so it should be possible to query for, let's say, all the shows an actor worked on - either by the actor's English or Chinese name, and return all the show names in that same language. I would like to be able to form queries using either English or Chinese to search by, without it being overly complicated to search for either. So being able to do something like MATCH (a:Person{name:{zh:"李小龍"}})-[:CREDIT {role: {zh:"演員"}]-(s:Show) RETURN s.title.zh would be most convenient in terms of getting the data back, but like I said, that would mean the :CREDIT relationships will have to duplicate translations. I'm not even sure how I'd form the query if translations are done as in your third example, but I will do my best to work it out.

Sorry, I'll try and answer your questions more properly too:

  1. One reason is research/archive, so it's important to be able to make queries etc. that I haven't thought of yet - so maybe I'd want to know if the proportion of female actors working on more than 2 shows per year is greater in 2019 than 2006. Maybe I'd like to see if directors are always working with the same group of actors, or whether shows in certain genres are more likely to be funded by certain funding bodies. Later it's intended to have a website and public api to make this easier to access for research purposes, but it's important that it works in either English or Chinese even at the query level.
  2. I think I answered this.
  3. And this.
  4. New data will be added regularly, in terms of people and shows. If the new data is in Chinese only, for example, it would be good if it's not necessary to retranslate anything that's already translated elsewhere, like people's names, roles etc. New properties will probably get added later to either people or shows. For example I will add funding bodies later and add relationships between them and shows.

Hope that's enough info - I realise perhaps there is not a 'best' way but I'm grateful for any advice because I really don't have any experience with this kind of database. Thanks.

So obviously I wrote this before actually testing any of my proposed ideas and ran into "Property values can only be of primitive types or arrays thereof". I realise I need some time to reevaluate what I'm doing and try things out with the actual data. I still think this is a good case for using a graph database, but I'm going to have to get a bit more familiar with it before importing this massive amount of data into it.

You can't have objects stored in node-properties, so this isn't ok:

(a:Person {name:{zh:"李小龍"}})

...but you could do this:

(a:Person {name_zh:"李小龍"})

While that would make the data a little messy, with everything you've told me it's probably the best way to go. You could then keep the data and graph simple, while still supporting nodes without translations.

(p:Person {name_en:"Bob", name_zh:"鲍勃"})
(c:Credit {name_en:"Director", name_zh:"導演"})
(p)-[:CREDITED]->(c:Credit)-[:ON]->(s:Show)
(c)-[:ROLE]->(r:Role)
...

The problem you're going to run into is duplication, and MATCH/MERGE.

// Does the movie "Matrix" already exist in the DB?
MERGE (m:Movie {name_en: "Matrix"}) // what about zh? 矩阵?
MATCH (m:Movie) WHERE m.name_en = "Matrix" OR m.name_zh = "矩阵" // And if it hasn't been added yet?

If you can come up with an identity-scheme, or automatic pseudo-translation scheme, or maybe just use Google Translation API ($20/million characters), before you send stuff into the graph? Somewhere, at some point, you need fields that can be used as an identity, so you don't have to know if a node has been added or not. "How will you write your data-import into MERGE commands, and have no CREATE commands?" If you can manage that, you've got a solution.

This is very helpful, thank you!

I did consider some kind of pseudo-translation to run on data before it goes into the database, and this would probably be possible with roles, but when it comes to people's names, or even show names, it will not be so easy.

At the moment I am working on the MERGE commands to import the data, and struggling a bit with if/else and dealing with all the possible cases in cypher. I do think it will be possible to make it work like this, but I will start a separate thread if I have any questions, since it's no longer about data modelling.

When modelling the airports in our airline network and routes between them, with localized properties on each of them, I found it easiest to have a Destination node that had all of the relationships with other Destinations, etc and then have one DestinationLabels node for each locale, with the relationship labeled by the locale.

This graph drives several of our guest-facing APIs. If I need localized labels for an airport, I can traverse the relationship based on the negotiated locale and the name is always name, etc. I can also use COALESCE() to get the localized value if it exists, but default to English if it hasn't been translated yet.

MERGE (d:Destination {code: "YYC"})
SET
    d.timezone = "America/Edmonton",
    d.cityCode = "YYC",
    d.latitude = 51.131394,
    d.longitude = -114.010551,
    d.countryCode = "CA"
MERGE (d)-[:enCA])->(en:DestinationLabels)
SET
    en.provinceStateName = "Alberta",
    en.airportName = "Calgary International Airport",
    en.name = "Calgary",
    en.countryName = "Canada"
MERGE (d)-[:zhCN]->(zh:DestinationLabels)
SET
    zh.provinceStateName = "艾伯塔省",
    zh.airportName = "卡尔加里国际机场",
    zh.name = "卡尔加里",
    zh.countryName = "加拿大";
2 Likes

I like that. It's clean and clever. Only works if Ivor can define a PK (like your :Dest code).

@ivor, if you can find a way to define a language-agnostic identity for your nodes, use @dave.pirie's method. It'll save you headaches, with only one drawback:
Where you only have one translation, you'll see that one in the other lang. I.E.: if you only have ZH for a specific Director, you'll get the ZH value when you try to get EN.

Thanks a lot for this suggestion - it seems like a great way of working with language data. I'm not sure I can make it work for my data in this project, but I will definitely try it out.

@ivor, if you don't figure out a way to define an identity field for entries in your data, you're gonna have a bad time. (In the form of fragmentation and duplication)

Thank you for continuing to follow up and help me. I implemented your suggestion of using name_en and name_zh, and wrote the necessary logic to enter data with either/both, merging when necessary, and avoided any problems or duplicates with people. I've found that I can (and will have to due to the nature of the data) rely on the Chinese name string being unique for people. Unfortunately I still have some data that exists only in English at the moment. Of course I'm generating uuids when I enter data, but for importing the data initially, I accept that it's going to be more than a bit fiddly.

Honestly, that sounds like the best way to go about it! Nicely done.

It also means you should be able to combine the Dave's airport system with the one you have, to get clean and simple language/translation in the data.

You're almost there! Congrats!