Follow the mainstream technology hype machine and one could easily be led to believe that data modeling is dead. The story goes something like, “Thanks to agile technologies like Hadoop and techniques such as schema on read, prescribed data modeling is now replaced by instantaneous execution on demand.”
If you were laughing, choking, and wiping the coffee off your screen, then you are clearly someone who has done real work with Hadoop and recognizes that line of thinking as utter nonsense. There is a fundamental misunderstanding about schema on read. In fact, schema on read is just a point on a spectrum that begins with old-school schema on write and fully curated models. There’s more to the data modeling story.
When a schema on read model is formed, it is predicated on the fact that a great deal of structure was already encoded on write. There is a file format; there are columns, or keys in a JSON document. All those structural decisions represent a kind of modeling that influences what can be accomplished on read. The fact that those pre-existing characteristics are in place does not detract from the fact that schema on read is a much more agile process, one that allows models to be developed incrementally. This is a radical departure from traditional data models, which were highly linear and built on a cascade of limitations. Nevertheless, early decisions about how to organize data, lay it out, and make it efficient continue to be a big part of working with big data systems, just as they were in the relational world.
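The interplay described above can be sketched in a few lines of Python. This is a toy illustration, not any particular tool's API: the record format and field names (all hypothetical) were fixed at write time, while each reader projects only the fields its query needs.

```python
import json

# The "write" side already made structural decisions: JSON lines as the
# file format, and these particular field names. (Records are invented
# for illustration.)
raw_lines = [
    '{"user": "alice", "action": "login", "ts": 1700000000}',
    '{"user": "bob", "action": "logout", "ts": 1700000042}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the fields this
    query cares about, ignoring everything else in each record."""
    for line in lines:
        record = json.loads(line)  # parsing relies on the write-time format
        yield {f: record.get(f) for f in fields}

# Two different read-time "schemas" over the same stored bytes.
actions = list(read_with_schema(raw_lines, ["user", "action"]))
times = list(read_with_schema(raw_lines, ["ts"]))
```

Neither reader had to declare its schema before the data was written, yet both depend entirely on the structure the writer chose.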
To further tease out the truth, it helps to understand the two types of modeling that take place: physical and logical. Physical modeling refers to designing systems for efficient performance. The underlying assumptions about the technical capabilities of the system impose structure on models, whether a single machine back in the 1980s or the distributed Hadoop clusters of today. Key design, partitioning, file formats, column versus row format, collocation of data, and the like are all examples of physical data modeling considerations. While not a new challenge, changing technology makes it clear that simply shoehorning techniques developed for relational database environments into Hadoop is not an effective strategy.
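Two of those physical choices, column versus row layout and partitioning by key, can be made concrete with a small sketch. This is a toy model of the idea, not a storage engine; the table and city names are invented for illustration.

```python
# The same logical table, under two physical layouts.
rows = [
    {"city": "Austin", "temp": 31},
    {"city": "Boston", "temp": 22},
    {"city": "Austin", "temp": 33},
]

# Row format: each record is stored together, so scanning one field
# still touches every whole record.
avg_row = sum(r["temp"] for r in rows) / len(rows)

# Column format: each field is stored contiguously, so a scan reads
# only the values of the column it needs.
columns = {
    "city": [r["city"] for r in rows],
    "temp": [r["temp"] for r in rows],
}
avg_col = sum(columns["temp"]) / len(columns["temp"])

# Partitioning is another physical choice: collocate related rows by a
# key so queries can skip irrelevant partitions entirely.
partitions = {}
for r in rows:
    partitions.setdefault(r["city"], []).append(r)
```

Both layouts yield the same answer; what differs is how much data a given query must touch, which is exactly the kind of decision physical modeling is about.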
Logical models are perhaps changing even more, although the underlying influence in determining how one organizes and thinks about data is consistent. The difference is that we have new considerations such as hierarchical structure in JSON, arrays, key-value pairs, and nesting, as well as other hierarchies that are less regular. These structures were not part of logical models of the past because a flat table of related rows could not readily express them.
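A short example (with an invented order document) shows the gap. A single JSON document can nest maps and variable-length arrays, while rendering the same data relationally forces it into multiple flat tables joined by a key.

```python
import json

# One JSON document holds a nested map ("customer") and a
# variable-length array ("items") -- shapes a single flat row cannot.
order = json.loads("""
{
  "order_id": 7,
  "customer": {"name": "alice", "tier": "gold"},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B2", "qty": 1}
  ]
}
""")

# The relational rendering of the same data needs two tables
# linked by a join key.
order_row = {
    "order_id": order["order_id"],
    "customer_name": order["customer"]["name"],
    "customer_tier": order["customer"]["tier"],
}
item_rows = [
    {"order_id": order["order_id"], **item} for item in order["items"]
]
```

The hierarchy is a first-class part of the document's logical model, rather than something reconstructed through joins.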
Is data modeling dead? Nope. Not only is data modeling crucial to modern data systems, but it is also evolving in compelling new ways. Data still has well-defined structure. What has changed is that the person working with the data does not have to invest effort up front to define and capture every element of the schema ahead of time. Incremental modeling allows the user to parse only the columns a query needs and to define the schema after the data has already been written. Agile modeling is a major step forward from having to start over every time.
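Incremental modeling can be sketched as schema inference that grows as records arrive, rather than a declaration made before any data exists. This is a simplified illustration with invented records, not a description of any specific tool.

```python
import json

# Records written over time; a later record introduces a field
# ("referrer") that no up-front schema anticipated.
records = [
    '{"user": "alice", "ts": 1700000000}',
    '{"user": "bob", "ts": 1700000042, "referrer": "email"}',
]

schema = {}  # field name -> inferred type, grown lazily
parsed = []
for line in records:
    rec = json.loads(line)
    for field, value in rec.items():
        # First sighting of a field defines it; nothing had to be
        # declared before the data was written.
        schema.setdefault(field, type(value).__name__)
    parsed.append(rec)
```

The schema ends up fully defined, but it was discovered from the data rather than imposed on it, which is the agility the article describes.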
We are at a new beginning for data modeling. We are investing in codifying the language, patterns, and structure of data representation to reflect real world problems and use cases. With the emergence of a common language of patterns, we will be able to draw on experience to form broader theories about process and further define best practices. The rich history of data modeling provides a fantastic foundation on which to build the future of data modeling. Taking the time to understand these changes and participate in the journey will be one of the factors that separate those who make full, efficient use of their data from those who unwittingly place obstacles in their own way.