Neo4j Database – Part 06 – Creating Our Data Model

In the last blog I discussed the various types of data loading options Neo4j offers. In this blog I will take the Northwind database and convert it to a Neo4j graph database model.

Contents

  • Converting an RDBMS Model
  • Data Types
  • Converting Tables that do not participate in Joins
  • Converting One to One/One to Many Joins
  • Converting a Self-Join
  • Converting a Many to Many Join (No Relationship Properties)
  • Converting two more One to Many Joins
  • Converting a Many to Many Join (With Relationship Properties)
  • Finished Model

Converting an RDBMS Model

I like the Northwind database for simple tutorials. I have used it for years because it contains one to one, one to many and many to many relationships. It turns out that Neo4j examples also like it and you will find lots of examples using this database. Rather than shy away and use another example database like AdventureWorks, I am still going to use Northwind but convert it from my perspective rather than rehash what is already written.

Data Types

When moving to Neo4j you need to think about the data types. For example; Neo4j does not have a datetime type and therefore you need to make a decision on how you will represent a date and time in Neo4j. Some designs store the number of milliseconds since a given date. Neo4j does have a timestamp() function that does this. Some store the date as YYYYMMDD. There are many ways and this value can be reformatted during the extract and transform phase when loading the data. Also by default a value will be stored as a string in Neo4j so you will need to use one of the functions toInteger(), toFloat() or toBoolean() to get the correct representation.

Converting Tables that do not participate in Joins

Tables that do not participate in joins are easy candidates for converting. Each row becomes a node and each column in the row is added as a property. The table name is added as a Label to each node. Northwind does not have any tables that do not participate in any joins.

Converting One to One/One to Many Joins

Denormalization is a natural design decision in NoSQL databases unlike RDBMS’s which only perform it if there are performance or scalability issues. Neo4j kind of sits on the fence of RDBMS/NoSQL and since it thrives on data relations, actually retaining a fully normalized model is not a bad design decision. It always boils down to the queries you will be performing. Let’s use the Products table as our example here. Always profile your data to check if it is a one to one or a one to many. In our case the Categories table has a one to many mapping. Do we retain categories as separate nodes or do we denormalize the data and add it into each Product node. If you add it to the Product node it is likely you will need to index it. A reason for not adding it would be if the Categories table has lots of columns. There would be a lot of duplication for maybe no benefit.

For this tutorial I am going to include the CategoryName into each Product node and exclude the other three columns. I will then index this column.

The Suppliers table also has a one to many mapping to the Products table however; it has allot of useful columns so we will leave it as it is and build individual Supplier nodes. Before we do that we need to look at the primary keys for Products and Suppliers. We have a ProductID and a SupplierID. Do we include these columns in our Neo4j model? If the key is a Natural Key then the answer is yes every time. If we use an alternate field that is also unique it is likely you will create a unique constraint. Using our Suppliers example; we could use CompanyName however; it’s a large string and this would create a large index. For this tutorial I will keep both integer keys. Another question is “do we include foreign keys”? Not in the Neo4j model however; keep them in the CSV file and it can be used to create the SUPPLIES relationship.

So looking at the RDBMS schema above, we will convert it to the following Neo4j schema.

We will create two CSV files using the following SQL.

SELECT P.*, C.CategoryName
FROM   Products   AS P
JOIN   Categories AS C ON (p.CategoryID = c.CategoryID)

SELECT *
FROM   Suppliers

Quick note on the CSV files. We will be using LOAD CSV:

  • Save them as UTF-8 (No BOM)
  • Always include a header. Remove spaces between the commas.
  • Watch out for commas stored in description or note type fields.
  • If the CSV contains NULL for a field, if you replace it with a blank field so you get two commas side by side, LOAD CSV will not add it as a property. Anything else will be added as a property into Neo4j.
  • Comment our or change the server setting dbms.directories.import if you need to specify a different local folder.

Here are the LOAD CSV commands to create the Product and Supplier nodes and the SUPPLIES relationship.

-- Product
LOAD CSV WITH HEADERS FROM "file:///products.csv" AS data
MERGE (n:Product {ProductID: data.ProductID})
ON CREATE SET 
n.Product         = data.ProductName,
n.Category        = data.CategoryName,
n.QuantityPerUnit = data.QuantityPerUnit,
n.UnitPrice       = toFloat(data.UnitPrice),
n.UnitsInStock    = toInteger(data.UnitsInStock),
n.UnitsOnOrder    = toInteger(data.UnitsOnOrder),
n.ReorderLevel    = toInteger(data.ReorderLevel),
n.Discontinued    = (data.Discontinued  "0")

CREATE CONSTRAINT ON (n:Product) ASSERT n.ProductID IS UNIQUE

-- Supplier
LOAD CSV WITH HEADERS FROM "file:///suppliers.csv" AS data
MERGE (n:Supplier {SupplierID: data.SupplierID})
ON CREATE SET
n.Company       = data.CompanyName,
n.Contact       = data.ContactName,
n.ContactTitle  = data.ContactTitle,
n.Address       = data.Address,
n.City          = data.City,
n.Region        = data.Region,
n.PostalCode    = data.PostalCode,
n.Country       = data.Country,
n.Phone         = data.Phone,
n.Fax           = data.Fax,
n.HomePage      = data.HomePage

CREATE CONSTRAINT ON (n:Supplier) ASSERT n.SupplierID IS UNIQUE

-- Product to Supplier Relationship
LOAD CSV WITH HEADERS FROM "file:///products.csv" AS data
MATCH (p:Product),(s:Supplier)
WHERE p.ProductID = data.ProductID AND s.SupplierID = data.SupplierID
CREATE (s)-[:SUPPLIES]->(p)

Notice:

  • We did not have to include every field in our node
  • We also took the opportunity to rename the properties.
  • We included the to… functions to ensure the correct data type was used.
  • The discontinued property in the CSV was not a string so I had to use the expression (data.Discontinued “0”) in order to create a Boolean type. I could not use toBoolean() because it expects a value of “true” or “false”, not 1 or 1.

After the data is loaded in we should see something like the following:

MATCH (n) -[]- (o) WHERE n.Company = ‘New Orleans Cajun Delights’ RETURN *

Converting a Self-Join

The employees table has a self-join which creates a hierarchy of employee to manager. The table uses the ReportsTo column to point to the EmployeeID of its manager. In order to model I will use a slightly different approach to what we did in the previous example. Instead of including the ReportsTo column in the CSV, I will generate a second CSV which just the relationships. Also the Employee table includes three dates and I will be converting the dates to YYYYMMDD in the CSV file.

So looking at the RDBMS schema above, we will convert it to the following Neo4j schema.

We will create two CSV files using the following SQL.

SELECT EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, CONVERT(VARCHAR,BirthDate,112) AS [BirthDate], 
       CONVERT(VARCHAR,HireDate,112) AS [HireDate], Address, City, Region, PostalCode, Country, Extension, Notes
FROM   Employees

SELECT E2.EmployeeID, E1.EmployeeID AS [ReportsTo]
FROM   Employees AS E1
JOIN   Employees AS E2 ON (E1.EmployeeID = E2.ReportsTo)

Here are the LOAD CSV commands to create the Employee nodes and the REPORTS_TO relationship.

-- Employee
LOAD CSV WITH HEADERS FROM "file:///employees.csv" AS data
MERGE (n:Employee {EmployeeID: data.EmployeeID})
ON CREATE SET
n.LastName        = data.LastName,
n.FirstName       = data.FirstName,
n.Title           = data.Title,
n.TitleOfCourtesy = data.TitleOfCourtesy,
n.BirthDate       = data.BirthDate,
n.HireDate        = data.HireDate,
n.Address         = data.Address,
n.City            = data.City,
n.Region          = data.Region,
n.PostalCode      = data.PostalCode,
n.Country         = data.Country,
n.HomePhone       = data.HomePhone,
n.Extension       = data.Extension,
n.Notes           = data.Notes

CREATE CONSTRAINT ON (n:Employee) ASSERT n.EmployeeID IS UNIQUE

-- Employee Relationship to Manager
LOAD CSV WITH HEADERS FROM "file:///Employees_ReportsTo.csv" AS data
MATCH (n:Employee {EmployeeID: data.EmployeeID}), (o:Employee {EmployeeID: data.ReportsTo})
MERGE (n)-[:REPORTS_TO]->(o)

After the data is loaded in we should see something like the following:

MATCH (n:Employee) RETURN n

Converting a Many to Many Join (No Relationship Properties)

When describing a many to many join in an RDBMS it is normal to create an intermediate table which has a one to many relationship to each of the other two tables. This is not a requirement for Neo4j. The only two columns we are interested in are TerritoryDescription and RegionDescription. We can create a Territory node with these two properties.

So looking at the RDBMS schema above, we will convert it to the following Neo4j schema.

We will create two CSV files using the following SQL.

SELECT T.TerritoryID, T.TerritoryDescription, R.RegionDescription
FROM   Territories AS T
JOIN   Region      AS R ON (R.RegionID = T.RegionID)

SELECT *
FROM   EmployeeTerritories

Here are the LOAD CSV commands to create the Territory nodes and the SELLS_TO relationship.

-- Territories
LOAD CSV WITH HEADERS FROM "file:///Territories.csv" AS data
MERGE (n:Territory {TerritoryID: data.TerritoryID})
ON CREATE SET
n.Territory   = data.TerritoryDescription,
n.Region      = data.RegionDescription

CREATE CONSTRAINT ON (n:Territory) ASSERT n.TerritoryID IS UNIQUE

-- Employee to Territory Relationship
LOAD CSV WITH HEADERS FROM "file:///Employees_Territories.csv" AS data
MATCH (n:Employee {EmployeeID: data.EmployeeID}), (o:Territory {TerritoryID: data.TerritoryID})
MERGE (n)-[:SELLS_TO]->(o)

After the data is loaded in we should see something like the following:

MATCH (n:Employee)-[SELLS_TO]->(o:Territory) RETURN *

Converting two more One to Many Joins

The Customers table has a one to many relationship with the Orders table. As we did with the previous examples; we will remove the foreign key and create the relationship from a CSV file. Orders also have a one to many mapping between Orders and Shippers. Since Shippers just has two columns, we will pull these into the Orders table.

So looking at the RDBMS schema above, we will convert it to the following Neo4j schema.

We will create two CSV files using the following SQL.

SELECT *
FROM   Customers

SELECT O.OrderID, O.CustomerID, O.EmployeeID, CONVERT(VARCHAR,O.OrderDate,112) AS [OrderDate], 
       CONVERT(VARCHAR,O.RequiredDate,112) AS [RequiredDate], CONVERT(VARCHAR,O.ShippedDate,112) AS [ShippedDate],
       S.CompanyName AS [ShippingCompany], S.Phone AS [ShippingPhone], O.ShipName, O.ShipAddress, O.ShipCity,
       O.ShipRegion, O.ShipPostalCode, O.ShipCountry
FROM   Orders   AS O
JOIN   Shippers AS S ON (S.ShipperID = O.ShipVia)

Here are the LOAD CSV commands to create the Customer and Order nodes and the BUYS relationship.

-- Customer
LOAD CSV WITH HEADERS FROM "file:///customers.csv" AS data
MERGE (n:Customer {CustomerID: data.CustomerID})
ON CREATE SET
n.CustomerID   = data.CustomerID,
n.Company      = data.CompanyName,
n.Contact      = data.ContactName,
n.ContactTitle = data.ContactTitle,
n.Address      = data.Address,
n.City         = data.City,
n.Region       = data.Region,
n.PostalCode   = data.PostalCode,
n.Country      = data.Country,
n.Phone        = data.Phone,
n.Fax          = data.Fax

CREATE CONSTRAINT ON (n:Customer) ASSERT n.CustomerID IS UNIQUE

-- Orders
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS data
MERGE (n:Order {OrderID: data.OrderID})
ON CREATE SET
n.OrderDate        = data.OrderDate,
n.RequiredDate     = data.RequiredDate,
n.ShippedDate      = data.ShippedDate,
n.ShippingCompany  = data.ShippingCompany,
n.ShippingPhone    = data.ShippingPhone,
n.ShipName         = data.ShipName,
n.ShipAddress      = data.ShipAddress,
n.ShipCity         = data.ShipCity,
n.ShipRegion       = data.ShipRegion,
n.ShipPostalCode   = data.ShipPostalCode,
n.ShipCountry      = data.ShipCountry

LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS data
MATCH (o:Order {OrderID: data.OrderID}), (c:Customer {CustomerID: data.CustomerID})
MERGE (c)-[:BUYS]->(o)

CREATE CONSTRAINT ON (n:Order) ASSERT n.OrderID IS UNIQUE

After the data is loaded in we should see something like the following:

MATCH (c:Customer)-[:BUYS]->(o:Order) WHERE c.CustomerID = ‘ALFKI’ RETURN *

Converting a Many to Many Join (With Relationship Properties)

The last change we need to perform is when we have a many to many relationship and the intermediate table has additional columns of data. The Order Details table is a join between Products and Orders. The table also includes three numerical properties. An order includes one or more products and for each product in the order, we need to know the quantity, price and any discounts. We already have Product and Order nodes created previously. All we now need is to create an ORDER_ITEM relationship which includes the three properties described.

So looking at the RDBMS schema above, we will convert it to the following Neo4j schema.

I will create one CSV files using the following SQL.

SELECT *
FROM   [Order Details]

Here are the LOAD CSV commands to create the ORDER_ITEM relationship.

LOAD CSV WITH HEADERS FROM "file:///orderdetails.csv" AS data
MATCH (o:Order {OrderID: data.OrderID}), (p:Product {ProductID: data.ProductID})
MERGE (o)-[:ORDER_ITEM {UnitPrice: toFloat(data.UnitPrice), Quantity: toInteger(data.Quantity), Discount: toInteger(data.Discount)}]->(p)

After the data is loaded in we should see something like the following:

MATCH (o:Order)-[:ORDER_ITEM]->(p:Product) WHERE o.OrderID = ‘10248’ RETURN *

Finished Model

Here is the before and after. We have gone from 11 tables to 6 nodes.

In the next blog I will show some example Cypher queries.