
The Evolution of Database Architecture and the Future of FFMVIC Data Management

Note

This post represents the professional opinions of the author and provides general technology recommendations. Its content has not been endorsed by the BFS Architecture Forum or DEECA's Architectural Review Board and does not necessarily represent the views of the broader team or the Department as a whole. The post is intended as educational content discussing emerging technology patterns and does not provide direction to any ongoing teams or projects.

The database, data storage and data management landscape has seen significant changes over the past two decades, challenging traditional assumptions about transactional and analytical workloads.

This post provides an overview of this evolution from monolithic databases to NoSQL and HTAP aspirations, to the emerging pattern of composable data lakehouses. The article concludes with generalised recommendations for adopting these patterns in FFMVIC systems.

In the Beginning There was Version 2

Larry Ellison famously believed that customers won't buy version 1 software, so the first public release of Oracle was Oracle V2, which was soon followed by IBM's DB2 (IBM Database 2).

In the 1980s and early 1990s, these Relational Database Management Systems (RDBMS) served both operational and analytical workloads: they performed Online Transaction Processing (OLTP) operations during business hours and switched to performing Online Analytical Processing (OLAP) reports during off-peak hours.

Tip

Online here differs from our modern use of the term; in the 1980s, Online meant something close to what we now call synchronous, i.e., the operator would submit a query and get their results right away, as opposed to Offline (batch-oriented) systems, where the operator would submit a query and the results would be printed hours, days or weeks later.

This approach was sufficient when datasets were small and compute resources were expensive.

A brief recap:

OLTP systems optimise for high-frequency transactional operations and are characterised by millisecond response times, point lookups, and small data modifications. These workloads benefit from row-oriented storage, where individual records are stored contiguously, enabling rapid retrieval and modification of rows/items. They focus on write performance, ACID compliance and maintaining referential integrity across numerous concurrent transactions.

OLAP systems conversely optimise for complex analytical queries involving full-table scans and large-scale aggregations across historical datasets. These workloads benefit from columnar storage where values for each column are stored together, allowing queries to read only specific columns required for analysis. This approach dramatically reduces I/O requirements and improves compression ratios compared with OLTP systems.
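The row-versus-column distinction can be made concrete with a toy sketch in pure Python (the table contents and column names here are hypothetical, and real engines of course operate on disk pages, not Python objects):

```python
# Row-oriented layout: each record stored contiguously -- ideal for
# point lookups and modifying a whole row at once (OLTP).
rows = [
    {"id": 1, "region": "Gippsland", "hectares": 120},
    {"id": 2, "region": "Hume", "hectares": 45},
    {"id": 3, "region": "Gippsland", "hectares": 300},
]

# Column-oriented layout: all values of each column stored together --
# an aggregation touches only the columns it needs (OLAP).
columns = {
    "id": [1, 2, 3],
    "region": ["Gippsland", "Hume", "Gippsland"],
    "hectares": [120, 45, 300],
}

# OLTP-style point lookup reads one contiguous record.
record = next(r for r in rows if r["id"] == 2)

# OLAP-style aggregation scans a single column and ignores the rest,
# which is why columnar storage slashes I/O for analytical queries.
total_hectares = sum(columns["hectares"])
```

The columnar aggregation never touches the `id` or `region` values at all, which is the essence of the I/O saving described above.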

The fundamental conflict in resource utilisation (particularly CPU, memory and I/O patterns) between these opposing workloads resulted in the architectural separation of the two. Database pioneer Michael Stonebraker articulated this in his influential paper "One Size Fits All: An Idea Whose Time Has Come and Gone", arguing for specialised database engines tailored to specific workload requirements. This led to the widespread adoption of dedicated OLTP databases for operational tasks and separate data warehousing platforms such as Teradata and Microsoft SQL Server Analysis Services for analytical workloads.

The NoSQL Movement

The mid-2000s brought the NoSQL movement, driven by internet-scale companies requiring scalability beyond what traditional relational database systems were capable of. MongoDB, Cassandra and DynamoDB demonstrated that abandoning strict ACID guarantees could enable massive scale for specific use cases.

Though NoSQL databases have been shown to suffer from many of their own challenges (MongoDB is still infamously not ACID-compliant in 2025), the NoSQL movement introduced several concepts to modern data architectures:

  • Distributed storage across commodity hardware provided cost-effective scalability
  • Eventual consistency models demonstrated acceptable trade-offs between consistency and availability
  • Schema flexibility enabled rapid application development without predefined data structures

The ecosystem which later formed around Apache Hadoop extended these concepts to analytical workloads. MapReduce and later Apache Spark demonstrated that analytical processing could be distributed across commodity hardware, significantly challenging the need for expensive proprietary data warehouses. These systems introduced technologies such as the Hadoop Distributed File System (HDFS) and data lakes, which allow cost-effective storage of vast quantities of structured and unstructured data.

The Rise and Fall of HTAP

Around 2014, Gartner coined the term Hybrid Transactional/Analytical Processing (HTAP) to describe a group of emerging systems which promised to perform both OLTP and OLAP operations efficiently on the same data store, eliminating ETL pipelines and providing real time analytics on transactional data.

HTAP systems brought innovative concepts to the table. SingleStore combined in-memory row stores with disk-based column stores, whilst TiDB paired its TiKV row store with TiFlash, a columnar engine based on ClickHouse. These systems showed that careful architectural design could address some fundamental conflicts between transactional and analytical workloads.

However, monolithic HTAP systems failed to achieve widespread market adoption. Several factors contributed to this:

  • Operational risk: Associated with replacing established OLTP systems.
  • Scalability requirements: Most organisations do not face Facebook or Google's scaling challenges. Coupled with advances in CPU processing and I/O interfaces, most organisations realised that they can service their transactional workloads with single-machine databases (OpenAI famously uses a single-node PostgreSQL).
  • Cloud costs: Shared object storage is much more economical in cloud environments than the fast local SSDs HTAP designs require.
  • Organisational structures: In most organisations, OLTP and OLAP responsibilities remain with different teams.
  • Technical complexity: Optimising a single HTAP system for fundamentally different access patterns resulted in performance compromises.

The Write-Ahead Log (WAL)

While companies were trying, and mostly failing, to make the universal HTAP dream a reality, the humble Write-Ahead Log (WAL) evolved to become a universal interface between operational and analytical systems. Originally designed for database recovery by recording all modifications before they are applied to the data files, the WAL provides a chronological stream of every database change, making it ideal for Change Data Capture (CDC) applications.
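This dual role of the WAL, recovery mechanism and change stream, can be sketched in a few lines of Python (an illustrative toy, not how any real database implements its log; the key names are hypothetical):

```python
wal = []      # the append-only write-ahead log
table = {}    # the "database" state

def apply_change(op, key, value=None):
    """Log the change first, then apply it -- the write-ahead discipline."""
    wal.append({"op": op, "key": key, "value": value})
    if op == "upsert":
        table[key] = value
    elif op == "delete":
        table.pop(key, None)

def replay(log):
    """Rebuild state from the log alone: this is crash recovery, and it is
    also exactly what a downstream CDC consumer does with the stream."""
    state = {}
    for entry in log:
        if entry["op"] == "upsert":
            state[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            state.pop(entry["key"], None)
    return state

apply_change("upsert", "permit-42", {"status": "issued"})
apply_change("upsert", "permit-42", {"status": "revoked"})
apply_change("delete", "permit-7")

assert replay(wal) == table   # a replica replaying the log converges
```

Because the log is ordered and complete, anything that can read it, a recovering database or an analytical replica, arrives at the same state.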

Architectures started to emerge which treat the OLTP database as the upstream source and stream events from the relational DB's WAL to downstream analytical systems. CDC tools such as Debezium capture database changes from PostgreSQL, MySQL, and SQL Server transaction logs and publish them to streaming platforms such as Azure Event Hubs and Apache Kafka. This architecture enables near real time data movement and analysis without impacting operational system performance.

Stream processing technologies such as Apache Flink, Kafka Streams and Azure Stream Analytics can further transform these change streams, applying business logic and aggregations before loading data into analytical systems.
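A hedged sketch of such a transformation step is below. The `op`/`before`/`after` envelope follows Debezium's documented change-event shape (c = create, u = update, d = delete), but the payload's field names are hypothetical, and a real pipeline would run inside Flink, Kafka Streams or Stream Analytics rather than a plain loop:

```python
def to_analytics_record(event):
    """Keep only inserts and updates, and project just the columns the
    analytical store needs -- a typical CDC stream transformation."""
    if event["op"] not in ("c", "u"):
        return None
    after = event["after"]
    return {"id": after["id"], "area_ha": after["area_ha"]}

# A simulated stream of Debezium-style change events.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "area_ha": 120, "notes": "x"}},
    {"op": "u", "before": {"id": 1}, "after": {"id": 1, "area_ha": 125, "notes": "y"}},
    {"op": "d", "before": {"id": 1}, "after": None},
]

records = [r for r in (to_analytics_record(e) for e in events) if r is not None]
```

The delete event is filtered out and the `notes` column dropped before anything reaches the analytical system, keeping the downstream store lean.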

This approach separates operational efficiency from analytical processing requirements while maintaining near real time data freshness for analytics.

The Foundation of Data Lakehouses

The industry is also converging on open table formats as the standard storage layer for analytical systems. Apache Iceberg has emerged as the leading format, providing ACID transactions, schema evolution, and time travel capabilities over object storage such as Azure Data Lake Storage. Delta Lake and Apache Hudi offer similar capabilities, but Iceberg's vendor-neutral governance and broad ecosystem support have driven wide adoption.

These formats provide crucial capabilities:

  • ACID Transactions: Ensuring data integrity for concurrent operations
  • Schema Evolution: Allowing table schemas to change without disruptive rewrites
  • Time Travel: Enabling queries against historical data versions
  • Partition Evolution: Facilitating partitioning scheme changes without data migration
  • Hidden Partitioning: Abstracting physical layout details from users

This combination of object storage and open table formats has become known as the data lakehouse architecture, which combines the scalability and cost effectiveness of data lakes with the data management features of traditional data warehouses.
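The snapshot mechanism behind capabilities such as ACID commits and time travel can be illustrated with a toy model. This is only a conceptual sketch: real table formats such as Iceberg track immutable data and metadata files on object storage, not in-memory dicts:

```python
snapshots = []   # ordered, immutable table versions (append-only history)

def commit(new_version):
    """Each commit produces a new immutable snapshot of the table."""
    snapshots.append(dict(new_version))

def read(as_of=None):
    """Read the latest snapshot, or an older one -- i.e. time travel."""
    return snapshots[-1 if as_of is None else as_of]

commit({"row-1": "v1"})
commit({"row-1": "v2", "row-2": "v1"})

latest = read()              # current table state
historical = read(as_of=0)   # time travel to the first snapshot
```

Because old snapshots are never mutated, readers and writers don't block each other, and historical queries are simply reads of an earlier version.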

Query Engines

Another architectural pattern which has become established is the separation of storage from compute through standardised query engines. Apache Spark is the dominant force for complex ETL processing and machine learning workloads and provides distributed computing capabilities across clusters. Trino (formerly PrestoSQL) has also carved out a niche for itself as it specialises in federated queries across multiple data sources, enabling organisations to query data in place without movement.

These engines share common characteristics: optimisation for analytical query patterns, support for modern data formats and SQL-like interfaces familiar to database teams. The separation of storage and compute enables elastic scaling, where query engines can be provisioned on demand while data remains in cost effective object storage.

Key Players and Technologies

These patterns and technological forces have resulted in the following companies and technologies becoming the key players of the data management landscape in 2025:

Databricks emerged from the Apache Spark project at UC Berkeley and has established itself as the leading lakehouse platform. The company pioneered the concept of unified analytics platforms combining data engineering, data science, and business intelligence on data lake storage. Built around Apache Spark, Databricks provides distributed computing capabilities with collaborative notebooks and managed infrastructure.

Snowflake pioneered cloud data warehousing by separating storage and compute, enabling independent scaling of each layer. The platform's multi cluster shared data architecture allows multiple compute clusters to operate on the same data simultaneously without contention.

PostgreSQL has emerged as the dominant choice for new transactional systems, driven by several technological advantages. Its extensible architecture has enabled the development of sophisticated features such as PostGIS for geospatial data, TimescaleDB for time-series workloads and pgvector for AI/ML embeddings. This extensibility, combined with comprehensive SQL standard compliance, robust JSON support and advanced indexing capabilities, has made PostgreSQL the default choice for new OLTP workloads.

Neon represents a new generation of database companies, offering serverless PostgreSQL with storage and compute separation. Neon replaced PostgreSQL's storage layer with a cloud native storage system which enabled features such as git-like database branching and point-in-time recovery.

Crunchy Data established itself as a prominent PostgreSQL support and managed service provider. The company also developed Crunchy Data Warehouse which integrates analytical capabilities directly in PostgreSQL.

DuckDB emerged as an embedded analytical database and has come to be known as "SQLite for analytics." DuckDB provides high performance analytical query processing in a single, embeddable database engine, eliminating the complexity of distributed systems for workloads that can fit in single-machine servers.

ClickHouse has evolved into a high performance columnar OLAP database. ClickHouse excels at analytics on large datasets through aggressive optimisation for analytical query patterns and efficient use of modern hardware capabilities.

Adding Analytics to PostgreSQL

Neon and Crunchy Data pursued different strategies to add analytical capabilities to PostgreSQL:

Neon's approach focuses on replacing PostgreSQL's storage layer. By implementing a log-structured storage system and separating storage from compute, Neon enables features such as branching (creating instant database copies for development or analytics) and automatic scaling. This architecture allows analytical workloads to run on separate compute instances without impacting transactional performance.

Crunchy Data's approach involves integrating extensions on top of PostgreSQL. The Crunchy Data Warehouse combines PostgreSQL for operational data with DuckDB for analytical processing and Apache Iceberg for historical data management. This combination provides a unified SQL interface while using specialised engines for different workloads: PostgreSQL handles transactions, DuckDB processes analytical queries, and Iceberg manages the data lakehouse layer.

The HTAP Challenge for Analytics Platforms

Both Databricks and Snowflake faced significant technological challenges in their attempts to expand into transactional workloads:

Databricks built its platform around Apache Spark, optimised for batch processing and analytical workloads. The fundamental architecture assumes data is written in large batches and read by distributed scans. Attempting to add OLTP capabilities revealed critical limitations: Spark's job scheduling overhead makes single row operations prohibitively expensive, Delta Lake's file-based storage cannot efficiently handle high-frequency updates, and the lack of traditional database features such as indexes, constraints, and triggers limits transactional use cases.

Snowflake encountered similar architectural impediments. The platform's micro-partition architecture struggles with OLTP patterns: each update potentially requires rewriting entire micro-partitions, and the platform lacks the low-latency response times required for transactional applications. Furthermore, scalable compute designed for analytical workloads is often too costly for sustained transactional processing.

In May 2025, Databricks purchased Neon for approximately USD $1 billion, followed shortly by Snowflake acquiring Crunchy Data for USD $250 million. These acquisitions are both an acknowledgment of Databricks and Snowflake's challenges in building unified HTAP around their analytics frameworks, and a recognition that true HTAP requires composing multiple technologies.

By acquiring proven PostgreSQL teams, Databricks and Snowflake will finally be able to combine compelling transactional capabilities with their analytical platforms. Both companies will likely integrate their new Postgres-based engines using CDC patterns to stream DB changes to their analytics platform.

ClickHouse's Unique Positioning

Also in May 2025, ClickHouse raised USD $350 million, valuing the company at over USD $6 billion. This underscores the unique position ClickHouse occupies in this landscape by explicitly rejecting HTAP ambitions in favour of pure OLAP excellence.

Architectural purity: ClickHouse has optimised aggressively for analytical patterns without transactional compromises. ClickHouse implements columnar storage with sophisticated compression, vectorised query execution using SIMD instructions supported by modern CPUs, and skip indexes that eliminate unnecessary data scanning. This approach has resulted in orders of magnitude performance advantages over Snowflake and other competing systems.

Ecosystem integration: ClickHouse recognised that modern data management requires specialised engines, and as such positioned itself as a specialised component within the larger ecosystem. Rather than trying to compete with general purpose platforms, ClickHouse occupies the position of an acceleration layer, ingesting from streaming platforms such as RabbitMQ and Kafka, querying data in lakehouse formats (e.g., Iceberg), and serving as a high performance cache for frequently accessed analytical data.

This focus on analytical excellence and embracing compositional architectures has coincided with the emerging industry recognition that specialised engines, properly integrated, outperform monolithic HTAP attempts.

Azure's First-party Technologies

Microsoft Azure's approach to modern data architecture reflects a market-following strategy. Azure Data Lake Storage (Gen2) is Microsoft's HDFS-compatible storage service. Azure Synapse Analytics is Microsoft's attempt to provide a managed Apache Spark platform and compete with Databricks. Azure Stream Analytics is Microsoft's simplified answer to Apache Flink, while Fabric represents Azure's attempt at a unified data lakehouse (though, as usual with Microsoft, Fabric is more of a "brand" than a singular technology).

These offerings primarily show Microsoft's adoption of proven concepts and technologies from market leaders such as Databricks and Snowflake. These platforms provide good integration with other Azure services and enable a path to modern data management for organisations committed to the pure Microsoft ecosystem; but often lag behind market leaders in performance and functionality.

Recommendations for FFMVIC Systems

Based on these observations and emerging patterns, the following general recommendations can be made for future adoption by FFMVIC systems:

  1. Trusted Relational Databases for Operational Systems: continue to use established relational database management systems, specifically Azure SQL and Azure PostgreSQL, for storage of structured data for operational systems.

  2. Data Lakehouses for Unstructured Data: gradually adopt the lakehouse architecture, using open table formats (i.e., Iceberg) and ADLS Gen2 for storage of unstructured data (e.g., documents and raster imagery).

  3. Establish CDC for Analytical Workloads: Implement CDC mechanisms to replicate data from operational platforms such as eMap and Tarnook into the data lakehouse without impacting operational performance.

  4. Modernise Pipelines: Simple ETLs can use Python in Azure Functions coupled with established libraries such as Pandas, but for complex ETL/ELT and machine learning applications, Apache Spark (via Azure Databricks) provides a strong platform with unique advantages.

  5. ClickHouse for Analytics: Deploy ClickHouse for analytical workloads, particularly real time operational and interactive analytics.

  6. Gradual Retirement of Legacy Data Warehouse: Once the lakehouses are established, ETL pipelines have been uplifted and ClickHouse expertise has been developed, the stage is set for gradual decommissioning of legacy data warehouses.
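To make recommendation 4 concrete, a deliberately tiny extract-transform-load step might look like the sketch below. It uses only the standard library so it stays self-contained; a real Azure Function would more likely use Pandas, and the CSV content and column names here are hypothetical:

```python
import csv
import io

# Extract: read source data (an in-memory CSV stands in for a blob or file).
raw = io.StringIO("region,hectares\nGippsland,120\nHume,45\nGippsland,300\n")
reader = csv.DictReader(raw)

# Transform: aggregate hectares per region.
totals = {}
for row in reader:
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["hectares"])

# Load: write the aggregate out as CSV (here to an in-memory buffer;
# a real pipeline would write to the lakehouse).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["region", "total_hectares"])
for region, total in sorted(totals.items()):
    writer.writerow([region, total])
```

At this scale the standard library suffices; the case for Spark only emerges once datasets or transformations outgrow a single machine.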

The integration strategy should focus on change data capture (CDC) from operational systems to analytical storage using technologies such as Azure Service Bus and Event Hubs. This pattern maintains separation of concerns between operational and analytical systems and teams while providing a path to cost-effective storage of large amounts of data and near real time data analytics.

Further Reading