Open source graph databases, first used by social networks like Facebook and Twitter, are seeing mainstream adoption.
Graph databases, which use graph structures for semantic queries, came into prominence through social networks like Facebook and Twitter. But they're used for far more now than just linking connections between friends and relatives. Graph databases give organizations the capability to analyze and understand vast graphs of connected data.
Open Source Graph Databases
Open source graph databases are proving especially popular, as companies increasingly shun proprietary software and vendor lock-in for data management and storage. Open source also gives software developers more flexibility and makes it easier to control up-front costs.
All of the major social networks use open source graph databases. Twitter created the open source FlockDB for managing wide but shallow network graphs. Google's Cayley was inspired by the graph database behind Freebase and its Knowledge Graph, the knowledge base behind its search engine. Facebook uses Apache Giraph, which was built for high scalability.
"Remember back to Alta Vista before Google? Alta Vista was good, but Google was so much better because it actually understood how all the pages on the web linked together," said Quinn Slack, co-founder and CEO of Sourcegraph, which uses a massive open source graph database in its product, a search engine for open source development code.
What Can Graph Databases Do?
Graph databases not only find connections between different points of data, but they also can rank the relevance or weight of those relationships.
"Graphs represent a natural way to model data as they tend to allow it to be directly stored in the manner that we think and reason about it in the real world," said Stephen Mallette, vice president of the Apache TinkerPop open source graph database project and a software engineer at DataStax, a company that develops and provides commercial support for an enterprise edition of the NoSQL Cassandra database.
"The value proposition for graphs has spread past the innovators and early adopters at this point," he said.
How Do Graph Databases Work?
Relational databases perform poorly at ferreting out relationships. They require the data to be modeled up front, with tables joined through foreign keys. More joins mean drastically poorer performance, which can make relational databases untenable, especially for online applications. The up-front modeling also limits the flexibility to change with business needs.
Most NoSQL databases -- whether key-value-, document- or column-oriented -- also struggle to link disconnected data and graphs, according to an ebook written by executives from Neo4j, which introduced its open source graph database in 2010 and, like many vendors with products built on open source technology, now offers both community and enterprise editions of the database.
While it can be done, it becomes increasingly difficult as companies move beyond modestly sized operations. Twitter and Facebook, meanwhile, deal with billions of relationships.
Rather than requiring applications to create a network out of disconnected data, graph databases store connected data as connected data.
They store data in individual nodes, also called vertices, that represent entities -- a person, product, piece of data -- and store the relationships between them as edges. One node might hold a product name while another might hold a vendor name, with the edge between them indicating that the vendor supplies that product.
Some definitions require index-free adjacency, meaning that connected nodes physically "point" to each other in the database.
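The node-and-edge model and the idea of index-free adjacency can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` and `Edge` classes, the `connect` helper and the sample vendor and product names are hypothetical and do not come from any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A vertex: an entity with a label, properties and outgoing edges."""
    label: str
    properties: dict
    # Index-free adjacency (illustrative): the node holds direct references
    # to its edges, so following a relationship needs no global index lookup.
    out_edges: list = field(default_factory=list)


@dataclass
class Edge:
    """A directed, typed relationship between two nodes."""
    rel_type: str
    source: Node
    target: Node


def connect(source: Node, rel_type: str, target: Node) -> Edge:
    """Create an edge and attach it to the source node (hypothetical helper)."""
    edge = Edge(rel_type, source, target)
    source.out_edges.append(edge)
    return edge


# Hypothetical sample data: one vendor node, one product node.
vendor = Node("Vendor", {"name": "Acme Supply"})
product = Node("Product", {"name": "Espresso Machine"})
connect(vendor, "SUPPLIES", product)

# Walk from the vendor to the products it supplies by following references.
supplied = [
    edge.target.properties["name"]
    for edge in vendor.out_edges
    if edge.rel_type == "SUPPLIES"
]
print(supplied)
```

Here the relationship is a stored object in its own right rather than something reconstructed at query time by joining tables.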
As a result, relationship-heavy questions can be put to the graph directly. "For example, we can ask the graph to find for us all the flavors of ice cream liked by people who enjoy espresso but dislike Brussels sprouts, and who live in a particular neighborhood," the Neo4j authors point out in their ebook.
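The ice cream question quoted above can be sketched as a small traversal. Everything in this example is an assumption made for illustration: the people, the neighborhood names, the relationship types (`LIKES`, `DISLIKES`, `LIVES_IN`, `LIKES_FLAVOR`) and the `flavors_liked_in` function. A real graph database would express the same pattern in its own query language.

```python
from collections import defaultdict

# Hypothetical sample data as (source, relationship, target) triples.
EDGES = [
    ("alice", "LIKES", "espresso"),
    ("alice", "DISLIKES", "brussels sprouts"),
    ("alice", "LIVES_IN", "riverside"),
    ("alice", "LIKES_FLAVOR", "pistachio"),
    ("alice", "LIKES_FLAVOR", "vanilla"),
    ("bob", "LIKES", "espresso"),
    ("bob", "LIKES", "brussels sprouts"),
    ("bob", "LIVES_IN", "riverside"),
    ("bob", "LIKES_FLAVOR", "chocolate"),
    ("carol", "LIKES", "espresso"),
    ("carol", "DISLIKES", "brussels sprouts"),
    ("carol", "LIVES_IN", "hilltop"),
    ("carol", "LIKES_FLAVOR", "mint"),
]

outgoing = defaultdict(set)  # (source, relationship) -> targets
incoming = defaultdict(set)  # (target, relationship) -> sources
for source, rel, target in EDGES:
    outgoing[(source, rel)].add(target)
    incoming[(target, rel)].add(source)


def flavors_liked_in(neighborhood):
    """Flavors liked by residents who like espresso and dislike Brussels sprouts."""
    flavors = set()
    # Start at the neighborhood node and follow LIVES_IN edges backwards.
    for person in incoming.get((neighborhood, "LIVES_IN"), set()):
        likes = outgoing.get((person, "LIKES"), set())
        dislikes = outgoing.get((person, "DISLIKES"), set())
        if "espresso" in likes and "brussels sprouts" in dislikes:
            flavors |= outgoing.get((person, "LIKES_FLAVOR"), set())
    return sorted(flavors)


print(flavors_liked_in("riverside"))  # ['pistachio', 'vanilla']
```

The query starts from one node and touches only the edges attached to the people it reaches, rather than scanning every record.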
"Whether we want to understand relationships between customers, elements in a telephone or data center network, entertainment producers and consumers, or genes and proteins, the ability to understand and analyze vast graphs of highly connected data will be key in determining which companies outperform their competitors over the coming decade," they state.
Who Uses Graph Databases?
Social networks aren't the only companies using graph databases, of course. Here are some other notable examples of initiatives in which graph databases play a key role:
- Graph database technology helped the International Consortium of Investigative Journalists link connections in the Panama Papers, which involved the leak of 2.6TB of data from Panamanian law firm Mossack Fonseca about hidden offshore accounts. Those connections included couples at the same address who were not married, bank accounts used for money laundering and emails between various people not necessarily named on the accounts.
- Montefiore Medical Center in New York built a data lake based on graph database technology and has teamed up with Mayo Clinic on a predictive algorithm using various physical indicators to predict when patients are likely to have a major adverse event within 48 hours.
- Walmart uses graph technology to generate product recommendations for its online retail operations.
When to Use a Graph Database?
Graph databases aren't the answer to every business problem. At Geisinger Health System, which has become known for its deep dive into health care analytics, Chief Data Officer Nicholas Marko, MD, said standard BI tools are adequate for 80 percent of the organization’s needs.
The most logical graph database use cases are those in which understanding relationships and their strength is paramount, and the need for performance, flexibility and reduced latency outstrips the capabilities of batch processing of aggregates.
For instance, detecting credit card fraud requires comparing purchases on the card with the card holder's normal buying patterns. In this case, the ability to flag suspicious activity in real time becomes vital.
"I think that you generally want to look to graph databases when the data complexity is high and when there is high value in the relationships within the data," said DataStax's Mallette.
"A graph will really shine under these conditions, because the data modeling is intuitive and relationships are considered first-class citizens."
He added, "Since relationships are first-class citizens, the entities (domain objects) in the graph become straightforward to connect and traverse to arbitrary depth, thus allowing for complex reasoning over the data that would be otherwise quite difficult with [a relational] or other NoSQL database."
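Traversing "to arbitrary depth" can be illustrated with a breadth-first search whose depth limit is just a parameter. This is a generic sketch, not TinkerPop's or any vendor's implementation; the `FOLLOWS` graph and the `reachable_within` function are hypothetical.

```python
from collections import deque

# Hypothetical "follows" graph stored as adjacency lists.
FOLLOWS = {
    "ann": ["ben"],
    "ben": ["cara", "dev"],
    "cara": ["eli"],
    "dev": [],
    "eli": ["ann"],  # a cycle back to the start
}


def reachable_within(graph, start, max_depth):
    """Return {node: hops} for every node reachable from start in at most max_depth hops."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, []):
            if neighbor not in depths:  # visited check also guards against cycles
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    del depths[start]  # report only the nodes that were reached
    return depths


print(reachable_within(FOLLOWS, "ann", 2))
print(reachable_within(FOLLOWS, "ann", 3))
```

Going one level deeper only changes the `max_depth` argument. In a relational model, each extra level would typically mean another join.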
Graph Database Strengths and Weaknesses
Graph database strengths include:
- Performance. While query performance degrades quickly in relational databases with the number of joins, graph database performance remains fairly constant as the dataset grows.
- Flexibility. New kinds of relationships, nodes, labels and subgraphs can be added without interfering with existing queries and applications.
- Agility. The schema-free nature of the graph data model means the data model can evolve in concert with iterative software delivery practices.
The issues users face with graph databases include partitioning and density. In a distributed environment, massive graphs are farmed out across a multi-machine compute cluster. Market leaders are addressing the need to limit cross-machine communication by putting information often retrieved together on the same machine.
And if a customer has bought a lot of products at a shopping site, that creates a dense graph, much of which might be irrelevant information to the query at hand. The database needs to filter very specifically to each unique query.
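The partitioning issue described above can be made concrete by counting how many edges cross machine boundaries under different placements. The graph, the two placements and the `cross_machine_edges` function are illustrative assumptions, not how any particular product assigns data to machines.

```python
# Hypothetical undirected graph: two tight clusters joined by one edge.
EDGES = [
    ("ann", "ben"), ("ben", "cara"), ("ann", "cara"),
    ("dev", "eli"), ("eli", "fay"), ("dev", "fay"),
    ("cara", "dev"),
]


def cross_machine_edges(edges, placement):
    """Count edges whose endpoints are stored on different machines."""
    return sum(1 for a, b in edges if placement[a] != placement[b])


# Placement that ignores the graph's structure (machine 0 or 1 per node).
scattered = {"ann": 0, "ben": 1, "cara": 0, "dev": 1, "eli": 0, "fay": 1}
# Placement that keeps nodes often retrieved together on the same machine.
clustered = {"ann": 0, "ben": 0, "cara": 0, "dev": 1, "eli": 1, "fay": 1}

print(cross_machine_edges(EDGES, scattered))  # 5
print(cross_machine_edges(EDGES, clustered))  # 1
```

Every cross-machine edge is a potential network hop during a traversal, which is why co-locating related data matters.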
Graph databases are not good at quickly doing global aggregations, Mallette said, though the open source TinkerPop helps mitigate the problems.
As an example, he said, "to simply count all the vertices in a graph requires iterating over every vertex in a graph. In a graph of billions of vertices, that can take a long time and if you were doing more than a count -- finding all the 'product' vertices then traversing to 'sale' vertices to calculate 'sales by month' -- it would likely become even more expensive."
Using a graph database for applications with those kinds of requirements "will present a weak spot that you will have to be aware of," Mallette said. "The problem is not insurmountable, but if most of your application requires this type of analysis in real-time, you might need to reconsider your graph model or, in some cases, consider other data storage approaches."
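Mallette's example can be sketched as follows. The data layout, the vertex labels and both functions are hypothetical; the explicit loops stand in for the full scan he describes, which is trivial on six vertices and costly on billions.

```python
from collections import Counter

# Hypothetical vertices keyed by id, each with a label and properties.
VERTICES = {
    "p1": {"label": "product", "name": "kettle"},
    "p2": {"label": "product", "name": "grinder"},
    "s1": {"label": "sale", "month": "2016-01"},
    "s2": {"label": "sale", "month": "2016-01"},
    "s3": {"label": "sale", "month": "2016-02"},
    "c1": {"label": "customer", "name": "ann"},
}
# Hypothetical edges from each product vertex to its sale vertices.
SOLD_IN = {"p1": ["s1", "s3"], "p2": ["s2"]}


def count_vertices(vertices):
    """Global count: visits every vertex once."""
    return sum(1 for _ in vertices)


def sales_by_month(vertices, sold_in):
    """Scan all vertices, keep products, then traverse to their sales."""
    totals = Counter()
    for vertex_id, props in vertices.items():  # full scan of the graph
        if props["label"] != "product":
            continue
        for sale_id in sold_in.get(vertex_id, []):  # traverse product -> sale
            totals[vertices[sale_id]["month"]] += 1
    return dict(totals)


print(count_vertices(VERTICES))
print(sales_by_month(VERTICES, SOLD_IN))
```

Both functions must touch the whole graph before they can answer, unlike a traversal that starts from a single node and follows only its edges.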
The Graph Database Market
Though IDC expects the overall database market to reach $50 billion by 2017, graph databases make up only a sliver of that. Forrester Research projects that 25 percent of enterprises will use graph databases by 2017.
Neo4j, the most popular graph database, comes in at No. 21 on the overall DB-Engines ranking. OrientDB, an open source document/graph hybrid, is the second most popular graph database, followed by Titan, an open source project used in DataStax and other offerings.
Neo4j recently released version 3.0, with an architecture overhaul focused primarily on a new data store; graph-native storage is another issue in this market.
Big Players and the Graph Database Trend
All the major database players, even those best known for their proprietary software, have released or are working on graph capabilities.
Microsoft CEO Satya Nadella recently cited LinkedIn's graph technology as one of the company's most attractive features in prompting Microsoft's $26.2 billion acquisition. Microsoft has also been working on Graph Engine, a distributed, in-memory, large graph processing engine.
Oracle in March released Parallel Graph Analytics (PGX) v1.2, employing parallelism to increase performance, a new query language for graph pattern matching, and a new algorithm and APIs to help you build a recommendation engine on top of your graph.
Cloud market leader Amazon Web Services' NoSQL DynamoDB offers a Titan plug-in for graphs.
DataStax and IBM recently announced commercial products built on TinkerPop, which attained top-level project status in May with the Apache Software Foundation.
TinkerPop is an open source graph computing framework for both real-time, transactional graph databases (OLTP) and batch analytic graph processors (OLAP). It can be used for small graphs on a single machine or massive graphs that require a distributed environment.
The project is focused on creating industry standards for graph databases, including a standard language, which it calls Gremlin. Its Gremlin traversal machine, meanwhile, is designed to work across languages.
TinkerPop likely plays a role in the growing interest in graphs, Mallette said.
"Without a project like TinkerPop, the graph database world would be quite fragmented. Every graph system would have its own API, its own method for doing queries and no simple methods for integration. That fragmentation would look like a risky technology choice. Imagine what the relational database market would look like without JDBC [the API in Java for accessing a database]," he said. "TinkerPop alleviates that risk by unifying the APIs for interacting with a graph system, making it possible to avoid vendor lock-in and lower the learning curve across all graphs."
Susan Hall has been a journalist for more than 20 years at news outlets including the Seattle Post-Intelligencer, Dallas Times Herald and MSNBC.com. She writes for The New Stack and FierceHealthIT, among other publications.