Beyond NoSQL: The case for distributed SQL

0 Posted by paul bratcher - 23rd July 2020 - Technology

In the beginning, there were files. Later there were navigational databases based on structured files. Then there were IMS and CODASYL, and around 40 years ago we had some of the first relational databases. Throughout much of the 1980s and 1990s “database” strictly meant “relational database.” SQL ruled.

Then with the growing popularity of object-oriented programming languages, some thought the solution to the “impedance mismatch” of object-oriented languages and relational databases was to map objects in the database. Thus we ended up with “object-oriented databases.” The funny thing about object databases was that in many cases they were basically a normal database with an object mapper built-in. These waned in popularity and the next real mass-market attempt was “NoSQL” in the 2010s.

The attack on SQL

NoSQL attacked both relational databases and SQL in the same vein. The main problem this time was that the Internet had destroyed the underlying premise of the 40-year-old relational database management system (RDBMS) architecture. These databases were designed to conserve precious disk space and scale vertically. There were now way too many users and way too much for one fat server to handle. NoSQL databases said that if you had a database with no joins, no standard query language (because implementing SQL takes time), and no data integrity then you could scale horizontally and handle that volume. This solved the issue of vertical scale but introduced new problems.

Developed in parallel with these online transaction processing systems (OLTP) was another type of mainly relational database called an online analytical processing system (OLAP). These databases supported the relational structure but executed queries with the understanding that they would return massive amounts of data. Businesses in the 1980s and 1990s were still largely driven by batch processing. In addition, OLAP systems developed the ability for developers and analysts to imagine and store data as n-dimensional cubes. If you imagine a two-dimensional array and lookups based on two indices so that you are basically as efficient as constant time but then take that and add another dimension or another so that you can do what are essentially lookups of three or more factors (say supply, demand, and the number of competitors)—you could more efficiently analyze and forecast things. Constructing these, however, is laborious and a very batch-oriented effort.

Around the same time as scale-out NoSQL, graph databases emerged. Many things are not “relational” per se, or not based on set theory and relational algebra, but instead on parent-child or friend-of-a-friend relationships. A classic example is product line to product brand to model to components in the model. If you want to know “what motherboard is in my laptop,” you find out that manufacturers have complicated sourcing and the brand or model number may not be enough. If you want to know what-all motherboards are used in a product line, in classic (non-CTE or Common Table Expression) SQL you have to walk tables and issue queries in multiple steps. Initially, most graph databases didn’t shard at all. In truth, many types of graph analysis can be done without actually storing the data as a graph.

NoSQL promises kept and promises broken

NoSQL databases did scale much, much better than Oracle Database, DB2, or SQL Server, which are all based on a 40-year-old design. However, each type of NoSQL database had new restrictions:

Key-value stores: There is no simpler lookup than db.get(key). However, much of the world’s data and use cases cannot be structured this way. Moreover, we are really talking about a caching strategy. Primary key lookups are fast in any database; it is merely what is in memory that matters. In the best case, these scale like a hash map. However, if you have to do 30 database trips to put your data back together or do any kind of complicated query — this isn’t going to work. These are now more frequently implemented as caches in front of other databases. (Example: Redis.)
Document databases: These achieved their popularity because they use JSON and objects are easy to serialize to JSON. The first versions of these databases had no joins, and getting your whole “entity” into one giant document had its own drawbacks. With no transactional guarantees, you also had data integrity issues. Today, some document databases support a less robust form of transaction, but it is not the same level of guarantee most people are used to. Also, even for simple queries these are often slow in terms of latency — even if they scale better in terms of throughout. (Examples: MongoDB, Amazon DocumentDB.)
Column stores: These are as fast as key-value stores for lookups and they can store more complicated data structures. However, doing something that looks like a join across three tables (in RDBMS lingo) or three collections (in MongoDB lingo) is painful at best. These are really great for time series data (give me everything that happened between 1:00pm and 2:00pm).

And there are other, more esoteric NoSQL databases. However, what all of these databases have had in common is a lack of support for common database idioms and a tendency to focus on a “special purpose.” Some popular NoSQL databases (e.g. MongoDB) wrote great database front-ends and ecosystem tools that made it really easy for developers to adopt, but engineered serious limitations in their storage engine — not to mention limitations in resilience and scalability.

Database standards are still important

One of the things that made relational databases dominant was that they had a common ecosystem of tools. First, there was SQL. Although dialects could be different — as a developer or analyst if you went from SQL Server 6.5 to Oracle 7, you might have to fix your queries and use “(+)” for outer joins — but simple stuff worked and hard stuff was reasonably easy to translate.

Secondly, you had ODBC and, later, JDBC, among others. Nearly any tool that could connect to one RDBMS (unless it was made specifically to manage that RDBMS) could connect to any other RDBMS. There are lots of people who connect to an RDBMS daily, and suck the data into Excel in order to analyze it. I am not referring to Tableau or any of hundreds of other tools; I am talking about the “mothership,” Excel.

NoSQL did away with standards. MongoDB does not use SQL as a primary language. When MongoDB’s closest competitor Couchbase was looking for a query language to replace their Java-based mapreduce framework, they created their own SQL dialect.

Standards are important whether it is to support the ecosystem of tools, or because a lot of people who query databases are not developers — and they know SQL.

GraphQL and the rise of state management

You know who has two thumbs and just wants the state of his app to make its way into the database and does not care how? This guy. And it turns out an entire generation of developers. GraphQL — which has nothing to do with graph databases — stores your object graph in an underlying datastore. It frees the developer from worrying about this problem.

An earlier attempt at this were object-relational mapping tools, or ORMs, like Hibernate. They took an object and basically turned it into SQL based on an object-to-table mapping setup. Many of the first few generations of this were difficult to configure. Moreover, we were on a learning curve.

Most GraphQL implementations work with object-relational mapping tools like Sequelize or TypeORM. Instead of leaking the state management concern throughout your code, a well structured GraphQL implementation and API will write and return the relevant data as changes happen to your object graph. Who, at the application level, cares how the data is stored, really?

One of the underpinnings of object-oriented and NoSQL databases was that the application developer had to be aware of the intricacies of how data is stored in the database. Naturally this was hard for developers to master with newer technologies, but it is not hard anymore. Because GraphQL removes this concern altogether.

Enter NewSQL or distributed SQL

Google had a database problem and wrote a paper and later an implementation called “Spanner,” which described how a globally distributed relational database would work. Spanner sparked a new wave of innovation in relational database technology. You could actually have a relational database and have it scale not just with shards but across the world if needed. And we are talking scale in the modern sense, not the oft-disappointing and ever-complicated RAC/Streams/GoldenGate way.

So the premise of “storing objects” in a relational system was wrong. What if the main problem with relational databases was the back end and not the front end? This is the idea behind so-called “NewSQL” or more properly “distributed SQL” databases. The idea is to combine NoSQL storage learnings and Google’s Spanner idea with a mature, open source, RDBMS front end like PostgreSQL or MySQL/MariaDB.

What does that mean? It means you can have your cake and eat it too. It means you can have multiple nodes and scale horizontally — including across cloud availability zones. It means you can have multiple data centers or cloud geographic regions — with one database. It means you can have true reliability, a database cluster that never goes down as far as users are concerned.

Meanwhile, the entire SQL ecosystem still works! You can do this without rebuilding your entire IT infrastructure. While you might not be game to “rip and replace” your traditional RDBMS, most companies are not trying to use more Oracle. And best of all, you can still use SQL and all of your tools both in the cloud and around the globe.

read more at https://www.cio.com by Andrew C. Oliver

Cio

Technology