Turning my nested course data into a graph

Hi there, I am new to neo4j, but I like graphs. I'd like to insert some information about a course I am taking into the database. Right now here is what the data looks like:

As you can see, some definitions depend on each other. I'd like to model this in my graph.

Here are some other relationships I'd like to model.

  • Each topic has definitions, theorems, propositions, and exercises (HAS relationship)
  • Theorems may use other theorems in their proof (USES relationship)
  • For an exercise or proof that must be written by the learner, we can obtain a list of the definitions and theorems that are available to use. A definition or theorem is available if it was stated or proven before the exercise. (IS_AVAILABLE relationship)

(One way I was thinking of implementing IS_AVAILABLE is to give everything a number attribute and check whether that number is less than the current item's number. That way Definition 1.7.1 cannot use Definition 1.7.2, Theorem 1.8.1, etc.)
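A minimal sketch of that numbering idea, assuming every item carries a dotted number in its label (as in "Definition 1.7.1"). The helper names `item_number` and `is_available` are made up for illustration. Comparing tuples of integers rather than strings keeps 1.10.1 after 1.9.1.

```python
import re
from typing import Tuple


def item_number(label: str) -> Tuple[int, ...]:
    """Extract the dotted number from a label, e.g. 'Definition 1.7.1' -> (1, 7, 1)."""
    match = re.search(r"\d+(?:\.\d+)*", label)
    if match is None:
        raise ValueError(f"no number found in {label!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_available(candidate: str, current: str) -> bool:
    """True if `candidate` is numbered strictly before `current`."""
    return item_number(candidate) < item_number(current)


print(is_available("Definition 1.7.1", "Definition 1.7.2"))
print(is_available("Theorem 1.8.1", "Definition 1.7.1"))
```

This only captures ordering by number. Whether an earlier item is actually relevant to an exercise is a separate question that the USES relationship would have to answer.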

So now I have defined a few relationships. I'd like to know how you would suggest I get all of my information into the database and construct these relationships. Note that the data is ordered, e.g.)

The application I'm using for these notes is workflowy, and I've found this on their page about exporting: https://workflowy.zendesk.com/hc/en-us/articles/202610369-How-to-export-or-copy-content-from-Workflowy.

So it seems like I could either extract the data in one of those formats or directly as text. I'm wondering if someone could outline how I could most efficiently get this data into my database. Thanks!

Update: I managed to extract the data as a text file, so if having that would be useful, here it is: sample_data.txt (17.5 KB)

There are a lot of modeling and processing questions there. Focusing on the data load: based on the Zendesk link, I suggest investigating the OPML format. It's structured (XML) and will be easier than text processing. There are a number of converters from OPML to other formats if you're more comfortable with JSON or GraphQL. Neo4j has an excellent helper library called APOC; it has an XML load procedure (apoc.load.xml), and you should be able to find plenty of examples.
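To get a feel for why the structured format is easier, here is a rough sketch that reads a small OPML fragment with Python's standard library and lists parent/child pairs, which is roughly the shape a HAS relationship would need. The fragment and the function name `parent_child_pairs` are made up for illustration; it is not the APOC route, just a quick way to inspect the nesting.

```python
import xml.etree.ElementTree as ET

# Made-up OPML fragment using nested <outline> elements with a `text` attribute.
SAMPLE_OPML = """<opml version="2.0">
  <body>
    <outline text="Logic">
      <outline text="Definitions">
        <outline text="Definition 1.2.1" _note="What we make stuff out of" />
        <outline text="Definition 1.3.1" />
      </outline>
    </outline>
  </body>
</opml>
"""


def parent_child_pairs(opml_text):
    """Yield (parent_text, child_text) for each directly nested pair of outlines."""
    root = ET.fromstring(opml_text)
    for parent in root.iter("outline"):
        # findall() without a path prefix returns direct children only.
        for child in parent.findall("outline"):
            yield parent.get("text"), child.get("text")


pairs = list(parent_child_pairs(SAMPLE_OPML))
for parent_text, child_text in pairs:
    print(f"({parent_text}) -[:HAS]-> ({child_text})")
```

Note that the `text` values are not guaranteed to be unique, so a real import would probably need to assign its own identifier to each outline element.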

I think you're on the right track with your initial relationships. I'd defer implementing IS_AVAILABLE till a later phase, unless you think the source data is corrupt. If you need to edit the data directly in the graph, you can add an RI check to the UI or to the database (multiple options).

Hope that helps

Hi Robert,

Thanks so much for your reply. Thanks for explaining a possible way I could import the data, I'm going to try that soon. And I'll let you know how that goes.

As for the relationships, I didn't really understand your comments on the IS_AVAILABLE relationship. Were you trying to say that this could be extrapolated from the source data in some way (based on your comment that it wouldn't be available if it was corrupt)?

As for your last comment about editing the data directly in the graph, I'm not too sure what adding an RI check to the UI or database means.

My end goal is to be able to edit the data directly from a website, but until I have that, I will have to edit it by hand.

My bad, when I read your initial description I assumed "IS_AVAILABLE" was a check to be performed while loading data. My advice to defer was based on that. I used the term Referential Integrity (RI) because I saw it as a check to preserve integrity. RI checks can be done in (each) UI or in the database.

I understand the intent now: you want to limit the user's (learner's) choices. I suspect this filter is something you'd want to perform at runtime rather than by creating relationships, but I would need to understand the use case more to evaluate the trade-off.

Hope that helps, sorry for the confusion.

Post a sample OPML, I'll try to look at it, that will go a long way to determining how easy it will be to create the HAS & USES relationships.

Hi Robert,

No worries for the confusion.

I'll explain in full detail what I'm trying to do. I'm trying to create a new method of learning which is similar to a textbook, but provides the learner the data in the form of a tree structure, where you can keep opening nodes to view more detail. This way you can decide whether you just want to get the concepts without any of the details (to reduce cognitive load) but still have the ability to look deeper.

As for the data, I exported it, and here is a sample (it shows the nesting of the definitions), though right now the relationships saying whether a definition is used in another definition don't exist (the image I drew in the original post).

    <outline text="Mathematics">
      <outline text="Logic">
        <outline text="Content">
          <outline text="Structures &amp;amp; Languages">
            <outline text="Proof Techniques">
              <outline text="For Proving things about terms or formulas, since they are given in terms of a recursive definition, in the base case we show it holds on the atomic elements, then in the inductive step we assume it holds on the smaller elements, and show it holds on this element. Structural Induction.">
                <outline text="Example proof, is show that for any formula, the number of left parens is equal to right parens. ">
                  <outline text="Base case:  Nothing has parens" />
                  <outline text="Ind Case: Consider a formula of the form 3 4 5, assume it true for the formulas that they are constructed from (alpha, beta). It's true because each of 3, 4, 5 add an even number of braces, and even + even is even." />
            <outline text="Definitions">
              <outline text="Language">
                <outline text="Definition 1.2.1: A first-order language L is an infinite collection of distinct symbols, no one of which is properly contained in another, separated into the following categories" _note="What we make stuff out of">
                  <outline text="Parenthesis: (, )." />
                  <outline text="Connectives: V, ~" />
                  <outline text="Quantifier: (forall)" />
                  <outline text="Variables: for each positive integer n: v1, v2, ..., vn, ... . The set of variable symbols will bed denoted as Vars." />
                  <outline text="Equality symbol: =" />
                  <outline text="Constant symbols: Some set of zero or more symbols" />
                  <outline text="Function symbols: For each positive integer n, some set of zero or more n-ary function symbols." />
                  <outline text="Relation symbols: For each positive integer n, some set of zero or or more n-ary relation symbols" />
                <outline text="Definition 1.3.1. Term of L" _note="Nouns of our language">
                  <outline text=" If L is a language, a term of L is a nonempty finite string t of symbols from L such that either:">
                    <outline text="1. t is a variable, or" />
                    <outline text="2. t is a constant symbol, or" />
                    <outline text="3. t :≡ ft1t2 . . . tn, where f is an n-ary function symbol of L and each of the ti is a term of L" />
                <outline text="Definition 1.3.3. Formula of L" _note="Assertions about the objects of the structure">
                  <outline text="If L is a first-order language, a formula of L is a nonempty finite string φ of symbols from L such that either:">
                    <outline text="1. φ :≡ = t1t2, where t1 and t2 are terms of L, or" />
                    <outline text="2. φ :≡ Rt1t2 . . . tn, where R is an n-ary relation symbol of L and t1, t2,. . . , tn are all terms of L, or" />
                    <outline text="3. φ :≡ (¬α), where α is a formula of L, or" />
                    <outline text="4. φ :≡ (α ∨ β), where α and β are formulas of L, or" />
                    <outline text="5. φ :≡ (∀v)(α), where v is a variable and α is a formula of L." />
                    <outline text="If a formula ψ contains the subformula (∀v)(α) [meaning that the string of symbols that constitute the formula (∀v)(α) is a substring of the string of symbols that make up ψ], we will say that the scope of the quantifier ∀ is α" />
                <outline text="Definition 1.5.1. Number Theory Language">
                  <outline text="The language L_NT is {0, S, +, ·, E, &amp;lt;}, where 0 is a constant symbol, S is a unary function symbol, +, ·, and E are binary function symbols, and &amp;lt; is a binary relation symbol. " />
                <outline text="Definition 1.5.2. Suppose that v is a variable and φ is a formula. We will say that v is free in φ if" _note="In the integral from 1 to x of 1/t dt, x is free but t is not, it doesn't make sense to choose a value for t, but it does for x">
                  <outline text="1. φ is atomic and v occurs in (is a symbol in) φ, or" />
                  <outline text="2. φ :≡ (¬α) and v is free in α, or" />
                  <outline text="3. φ :≡ (α ∨ β) and v is free in at least one of α or β, or" />
                  <outline text="4. φ :≡ (∀u)(α) and v is not u and v is free in α." />
                  <outline text="Note: (∀v1∀v2(v1 + v2 = 0)) ∨ v1 = S(0). Is free in v1, because it is free in the RHS, note that it is not free in LHS of or symbol" />
                <outline text="Definition 1.5.3. A sentence in a language L is a formula of L that contains no free variables." _note="It's either true or false, since there are no free vars">
                  <outline text="Ex: L = {0, 1, 2, +}, 1+1 = 2, and (∀x)(x+1=x)" />

I'm going to make it an online website using a tree-view (https://www.jstree.com/).

For example, inside of the "Mathematics" node, you would find "Logic" and "Combinatorics" (Combinatorics omitted for brevity in the above) as children. I'm thinking this might be more of a SUBFIELD relationship, being more specific than HAS. Then inside of "Combinatorics" there would be sections (a SECTION relationship).

In terms of the relationships existing at runtime vs. not, I'm not entirely sure how that would work, but my thought process was that I should put them inside the database. My reasoning comes from the following paragraph. (Honestly, I'm not sure what the best choice is here, so any reasons you could give to help me decide would be great.)

For example every time a new mathematical definition is introduced, if that definition uses any other definitions, I would like them to be able to open that definition right there and it is displayed (could be another treeview opened in a split). Also if a student is trying to prove a statement, I want them to be able to see exactly what tools they have at their disposal (IS_AVAILABLE).

Anyways, hopefully this gives you more of an idea as to what I'm trying to do.
Thanks so much.

I am following.

Subfield is generally called a Subclass (i.e. Mathematics is a class, Logic is a subclass), but Field and Subfield are functionally equivalent.

There's definitely potential (and challenges) for parsing the text and making relationships from it. Fewer challenges than "natural language", so doable.

Hi Robert,

I have now moved my focus to the implementation side of things. I realised that my data right now will probably require some editing, as I want to include some mathematical formatting. So I'm not too worried about getting the data in via an import. (Let me know if that seems sane.)

Instead what I'd like to do is construct an API so that I can construct the tree via the web interface. Here's what I understand so far.

  1. Obtain a computer elsewhere (server) , install neo4j on it.
    I found this link with some possibilities: Hosting Neo4j in the Cloud - Developer Guides. What would the next couple of steps look like once I have a machine there? Would I be SSH'ing into it and writing the code for the web API? Would I still be able to view the graph graphically like I can on the desktop?

  2. Use a different language on the server to intercept HTTP requests (GET, POST, DELETE), and upon each one of these act on the graph, for example by constructing a new node. Specifically, the request is turned into a Cypher query and the result is returned in some format that works well with jstree. (One possibility could be nodejs, python and neo4j.)
    For example, let's say I'm reconstructing the data I gave in the last comment from scratch. Then I might make a request (POST?) which has some data like "Title": "Mathematics", then that gets created in the tree
    Then I might make a new request where I am wanting to make the Logic (SUBCLASS), then I suppose in the front-end I would want the option to choose from relationships and somehow specify exactly where in the tree this new data will be appended to. (Q: How would one do this? Would I have to give the full path to the part in the tree? Like "TOPIC/SUBCLASS/SUBCLASS/DEFINITION" ?, or would there be a way to uniquely identify the parent node directly and then simply append the new child to the parent. )

  3. The front end would then get back a message (a response from the POST request), and update the tree with this information.
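On the question in step 2 of how to say where the new data goes: one option is to give every node its own identifier and have the request name the parent directly, rather than sending a full path. The sketch below only builds the Cypher text and its parameters. The `id` property, the relationship whitelist, and the function name `build_add_child` are assumptions for illustration, not something the thread has settled on.

```python
import uuid

# Assumed whitelist. Relationship types cannot be passed as Cypher parameters,
# so the type is validated before it is formatted into the query text.
ALLOWED_RELATIONSHIPS = {"HAS", "SUBFIELD", "SECTION"}


def build_add_child(parent_id, rel_type, title):
    """Return (cypher, params) that attach a new child to the node with `parent_id`."""
    if rel_type not in ALLOWED_RELATIONSHIPS:
        raise ValueError(f"unknown relationship type: {rel_type!r}")
    cypher = (
        "MATCH (parent {id: $parent_id}) "
        f"CREATE (parent)-[:{rel_type}]->(child {{id: $child_id, title: $title}}) "
        "RETURN child.id AS id, child.title AS title"
    )
    params = {
        "parent_id": parent_id,
        "child_id": str(uuid.uuid4()),  # new identifier handed back to the front end
        "title": title,
    }
    return cypher, params


cypher, params = build_add_child("some-parent-id", "SUBFIELD", "Logic")
print(cypher)
print(params)
```

The returned `id` is what the front end would store on the tree node, so a later request can use it as `parent_id` without knowing the path from the root.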

Also I'm thinking that returning all the information wouldn't be required, but only as they start opening out the structure of it, we load in more data (say they click on a node that has the attribute not_loaded=true, then we would go and load the next couple layers of the tree starting at that point)
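The lazy-loading idea could look roughly like this. It is a sketch against a toy in-memory tree standing in for the database; the `TREE` data, the `load_subtree` name, and the exact shape of the returned dictionaries are assumptions, with `not_loaded` taken from the description above.

```python
# Toy stand-in for the database: node id -> (title, child ids).
TREE = {
    "math": ("Mathematics", ["logic", "comb"]),
    "logic": ("Logic", ["defs"]),
    "comb": ("Combinatorics", []),
    "defs": ("Definitions", ["d121"]),
    "d121": ("Definition 1.2.1", []),
}


def load_subtree(node_id, depth):
    """Return a node with `depth` levels of children below it.

    Nodes at the cut-off that still have children are marked not_loaded,
    so the client knows to request them when the user opens the node.
    """
    title, child_ids = TREE[node_id]
    node = {"id": node_id, "text": title, "children": [], "not_loaded": False}
    if depth == 0:
        node["not_loaded"] = bool(child_ids)
        return node
    node["children"] = [load_subtree(child, depth - 1) for child in child_ids]
    return node


top = load_subtree("math", 1)
for child in top["children"]:
    print(child["text"], child["not_loaded"])
```

When the user opens "Logic", the client would call the same function again with that node's id and merge the result into the tree.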

Before I told you that I wanted it to have a similar structure as a textbook, so different levels of nesting so inside of a topic, we have sub(classes/fields) and inside of those we have definitions and theorems, etc. These relationships are only structural in a sense, and I'd like a way to filter this type of relationship. Like in the above example, I want to load the next couple layers of the tree only based on a structural (SUB(ANYTHING)) type of relationship. How would that look?
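One way this filter might look in Cypher is a variable-length pattern restricted to the structural relationship types, so that USES-style links are never followed. The sketch below just builds that query string. The relationship names, the `id` property, and the parent-to-child direction are assumptions and would need to match the real graph.

```python
# Assumed set of purely structural relationship types.
STRUCTURAL = ("SUBFIELD", "SECTION", "HAS")


def structural_query(max_depth):
    """Build a Cypher query that walks only structural relationships, up to `max_depth` hops."""
    if not isinstance(max_depth, int) or max_depth < 1:
        raise ValueError("max_depth must be a positive integer")
    rel_types = "|".join(STRUCTURAL)
    return (
        f"MATCH path = (start {{id: $id}})-[:{rel_types}*1..{max_depth}]->(n) "
        "RETURN n.id AS id, n.title AS title, length(path) AS depth"
    )


print(structural_query(2))
```

The `depth` column tells the front end which layer each returned node belongs to, which is enough to rebuild the nesting for the tree view.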


You might look at Aura (cloud-based Neo4j). Otherwise install Neo4j Desktop locally (to start).

Unless you're wedded to a particular UI framework, you might want to consider GRAND stack. treejs is compatible with React and shouldn't pose too much of a problem. We should be looking at GraphQL.

Here's a link highlighting the use with Aura - Video: Apollo React Hooks & Deploying To Aura | Building A Real Estate Search App w/ GRANDstack Part 5 - Neo4j Graph Data Platform

Basic getting started with GRAND stack - Video: Hands-On With The GRANDstack Starter Project - GraphQL, React, Apollo & Neo4j Database - Neo4j Graph Data Platform

Filtering relationships/nodes is straightforward. Cypher is powerful and should support all your requirements; if needed, you can embed Cypher in the GraphQL.

Updating the graph is also straightforward, using GraphQL mutations, or you can revert to Cypher directly.

Hi Robert,

I took your advice and started figuring out the GRAND stack. So far I have gotten the starter project (about users and businesses) and started modifying the schema.graphql to match my use case:

type AreaOfStudy {
    title: String
    subfields: [Subfield]
        @cypher(
            statement: "MATCH (this) <- [:SUBFIELD_OF] - (sub) RETURN sub"
        )
}

type Subfield {
    title: String!
    topics: [Topic]
}

type Topic {
    topicId: ID!
    title: String!
    sections: [Section]
}

type Section {
    sectionId: ID!
    title: String!
    definitions: [Definition]
}

type Definition {
    definitionId: ID!
    title: String!
    content: String
    definitionsUsed: [Definition]
}

type Proof {
    proofId: ID!
    title: String!
    content: String
    definitionsUsed: [Definition]
    theoremsUsed: [Theorem]
}

type Theorem {
    theoremId: ID!
    title: String!
    content: String
    definitionsUsed: [Definition]
}

type Exercise {
    theoremId: ID!
    title: String!
    content: String
    definitionsUsed: [Definition]
    suggestedKnowledge: [Theorem]
}

type User {
    userId: ID!
    name: String!
    completedExercises: [Exercise]
}

type Knowledge {
    definitions: [Definition]
    theorems: [Theorem]
}

type Query {
    allAreasOfStudy: [AreaOfStudy]
    allSubfieldsOf(aos_id: String!): [Subfield]
    allSectionsOf(sub_id: String!): [Section]
    allTopicsOf(sec_id: String!): [Topic]
    allDefinitionsOf(top_id: String!): [Definition]
    allTheoremsOf(top_id: String!): [Theorem]
    allExercisesOf(top_id: String!): [Exercise]
    definitionsUsedBy(def_id: ID!): [Definition]
}


In addition I now have neo4j-desktop with some sample data:

And from the graphql playground I am able to get some data:

Though I am having an issue when trying to get all the subfields of the area of study: (the forum isn't letting me embed more than two items as a new user so I'm putting an external link to it) https://i.imgur.com/cTFEZxx.png

Here is the full error:

  "errors": [
      "message": "Unknown function 'apoc.cypher.runFirstColumn' (line 1, column 106 (offset: 105))\n\"RETURN areaOfStudy{.title, subfields: [areaOfStudy_subfields IN apoc.cypher.runFirstColumn(\"MATCH (this) <- [:SUBFIELD_OF] - (sub) RETURN sub\", {this: areaOfStudy}, true) | areaOfStudy_subfields{.title}]} AS areaOfStudy\"\n                                                                          ^",
      "locations": [
          "line": 2,
          "column": 3
      "path": [
      "extensions": {
        "code": "INTERNAL_SERVER_ERROR",
        "exception": {
          "code": "Neo.ClientError.Statement.SyntaxError",
          "name": "Neo4jError",
          "stacktrace": [
            "Neo4jError: Unknown function 'apoc.cypher.runFirstColumn' (line 1, column 106 (offset: 105))",
            "\"RETURN areaOfStudy{.title, subfields: [areaOfStudy_subfields IN apoc.cypher.runFirstColumn(\"MATCH (this) <- [:SUBFIELD_OF] - (sub) RETURN sub\", {this: areaOfStudy}, true) | areaOfStudy_subfields{.title}]} AS areaOfStudy\"",
            "                                                                          ^",
            ": ",
            "    at captureStacktrace (/home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/result.js:263:15)",
            "    at new Result (/home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/result.js:68:19)",
            "    at newCompletedResult (/home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/transaction.js:449:10)",
            "    at Object.run (/home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/transaction.js:287:14)",
            "    at Transaction.run (/home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/transaction.js:123:32)",
            "    at /home/cjm/projs/Math/api/node_modules/neo4j-graphql-js/dist/index.js:156:25",
            "    at TransactionExecutor._safeExecuteTransactionWork (/home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/internal/transaction-executor.js:134:22)",
            "    at TransactionExecutor._executeTransactionInsidePromise (/home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/internal/transaction-executor.js:122:32)",
            "    at /home/cjm/projs/Math/api/node_modules/neo4j-driver/lib/internal/transaction-executor.js:61:15",
            "    at new Promise (<anonymous>)"
  "data": {
    "AreaOfStudy": null

If you could help me figure out the issue here that would be great! (I know it has to do with apoc.cypher.runFirstColumn not being defined or something in relation to that.)

Anyways, thanks for showing me GRANDstack, and helping me start figuring this out!