Fallacies of Data Engineering | Ethan M Knox

Ethan M. Knox

Some Fallacies of Data Engineering

In a recent interview I was asked to identify some of the common pitfalls that face new Data Engineers; specifically, what commonly held misconceptions can afflict Engineers early in their careers. After the interview I realized that these lessons hold merit for anyone embarking on the road to Engineering zen, so I decided to share them.

Fallacy 1: EL is an easy problem

When application developers move from transactional system design to data pipelining, there can be a certain amount of hubris regarding EL. Production applications consume data from a multitude of stores including relational / object databases, REST / SOAP APIs, flat system filestores, and of course real-time user inputs, all in an exceptionally performant manner. Compared to real-time transactions with millisecond SLAs, how hard can batch processing of source data really be? The reality is that EL and transactional development are disciplines with dissimilar challenges, each requiring a distinct skill set and method of thinking. While transactional development is most often concerned with the current state of an entity, EL development mandates that we consider many states and effectively reflect data "in all times at once". Transactional developers need to guard against race conditions in procedural algorithms, but that requirement often pales in comparison to the level of sophistication required for distributed EL processes. Batch EL processes can run for seconds, minutes, or hours, and the exposure to network / source / destination system errors grows in tandem with process duration. As a consequence, EL development demands state management that can survive and recover from failures at every layer.
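One common shape for that kind of failure-tolerant state management is a persisted watermark that only advances after a load commits. The sketch below is illustrative, not a production framework: the file path, the `extract` / `load` callables, and the checkpoint format are all hypothetical, and a real pipeline would keep this state somewhere more durable than a local file.

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical local state file


def load_checkpoint():
    """Return the last successfully loaded watermark, or None on first run."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["high_watermark"]
    return None


def save_checkpoint(high_watermark):
    """Persist the watermark only after the load step has committed."""
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump({"high_watermark": high_watermark}, f)


def run_batch(extract, load):
    """One EL batch: extract rows newer than the checkpoint, load them,
    then advance the checkpoint. A crash before save_checkpoint() means
    the same window is retried on the next run, so load() must be
    idempotent (e.g. an upsert keyed on a natural key)."""
    since = load_checkpoint()
    rows, new_watermark = extract(since)
    if rows:
        load(rows)
        save_checkpoint(new_watermark)
    return len(rows)
```

The key design choice is the ordering: the checkpoint moves forward only after the destination has the data, so any failure mid-batch leaves the process safe to re-run.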

Fallacy 2: There is one silver bullet EL solution

A freshman tactic that goes hand-in-hand with underestimating the challenges of EL is to try to implement a single catch-all technique for replicating source data. Different source data entities will have very different characteristics in change patterns, volume, underlying business processes, latency, etc. Accounting for these factors is essential to designing effective, performant EL processes. A small reference table with no update tracking may be cheapest to reload wholesale; a high-volume table with reliable update timestamps calls for incremental extraction; rows that mutate in place without timestamps may require change data capture.
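As a hedged sketch of that idea (the profile attributes, thresholds, and strategy names below are all hypothetical, not a prescription), a dispatcher might choose an extraction pattern from a source's characteristics rather than forcing one technique everywhere:

```python
from dataclasses import dataclass


@dataclass
class SourceProfile:
    # Illustrative attributes describing a source entity.
    row_count: int
    has_update_timestamp: bool
    rows_mutate_in_place: bool


def choose_strategy(p: SourceProfile) -> str:
    """Pick an extraction pattern from the source's characteristics.
    The row-count threshold here is illustrative, not prescriptive."""
    if p.row_count < 100_000 and not p.has_update_timestamp:
        return "full refresh"  # small and untracked: just reload it
    if p.has_update_timestamp:
        return "incremental by updated_at"  # pull only changed rows
    if p.rows_mutate_in_place:
        return "change data capture"  # need the DB log to see updates
    return "append-only by primary key"  # immutable rows: pull new ids
```

A real system would weigh more factors (latency requirements, source load tolerance, deletion semantics), but the point stands: the strategy belongs to the entity, not to the pipeline.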

Fallacy 3: The source is unchanging

The applications and vendors we extract data from pretty much all have development teams. These teams come to work with the purpose of making changes (hopefully improvements) to the source applications and systems. It seems logical, therefore, to expect that our sources will be in a constant state of flux; yet a common pitfall of EL and transform processes is to design algorithms that depend on the source schema, data relationships, and application business logic remaining static.

One way to guard against unexpected structural changes is with explicit EL definitions. When staging and source schemas no longer match, this should be a halt-level error. Finding extra or missing attributes in API calls should throw exceptions (unless ragged data is expected and accounted for). The system should be designed so that changes require attention. Screens do the work of capturing behavioral changes: CASE statements that avoid ELSE assignments, paired with a screen looking for NULL values, can act as a flag that underlying business logic has changed or new values have been introduced. Confidence interval screens can watch for sudden changes to a column's behavior - such as all NULL values when a column is deprecated in favor of a newly normalized table the dev team forgot to mention.

Which brings up another important strategy: it is critical to communicate with the development teams. Data Engineers need to champion a culture where developers think of them before they run a rake task, or drop several columns they no longer deem useful. A dialog with every data source development team, both internal and external, is essential to stay ahead of changes to the underlying source data.
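The screening ideas above can be sketched in a few lines. This is a hedged, minimal illustration (the function names, row shape, and 5% threshold are assumptions, not a standard); the first screen is the Python analog of a CASE expression with no ELSE branch, and the second is a crude stand-in for a confidence-interval screen:

```python
def screen_unmapped_values(rows, column, known_values):
    """Kimball-style screen: flag any value the transform has no mapping
    for -- the analog of a CASE with no ELSE plus a NULL-value check."""
    unmapped = {r[column] for r in rows if r[column] not in known_values}
    if unmapped:
        raise ValueError(
            f"unmapped values in {column}: {sorted(map(repr, unmapped))}"
        )


def screen_null_rate(rows, column, max_null_fraction=0.05):
    """A sudden spike in NULLs often means the source deprecated or
    moved the column; halt rather than silently load bad data."""
    if not rows:
        return
    nulls = sum(1 for r in rows if r[column] is None)
    fraction = nulls / len(rows)
    if fraction > max_null_fraction:
        raise ValueError(f"{column} is {fraction:.0%} NULL, above threshold")
```

In practice these checks would write to an error-event table and page someone rather than simply raising, but the principle is the same: a source change should stop the pipeline, not quietly corrupt the warehouse.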

Fallacy 4: JSON == Semi Structured == BIG DATA!!

Semi-structured data has a very important place in the modern world. Ragged and non-conforming hierarchies, like a Google Analytics session record, would be a nightmare to support transactionally in a fully normalized relational schema. But beware: the strategy of arbitrarily stuffing JSON objects into RDBMS columns is dangerous, and often a symptom of poor project planning and missing requirements analysis. Stakeholders who do not want to make hard modeling decisions may see object stores as an easy way to avoid commitment and decision-making. The end product will be a data model that is painful to traverse, inconsistent in both key and value structure, and impossible for the average user to derive value from. Dimensional modeling works for data warehouses; when a solution conflicts with dimensional structure, there should be heavy scrutiny to understand the motivation.
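One way to force that modeling decision is to flatten the payload at load time: promote the keys analysts actually query into explicit, typed columns, and keep only the genuinely ragged remainder as a documented overflow field. The sketch below assumes a made-up session payload; the field names are illustrative and not a real Google Analytics schema.

```python
import json
from datetime import datetime

PROMOTED_KEYS = {"session_id", "started_at", "page_views"}


def flatten_session(raw_json: str) -> dict:
    """Commit to a model instead of landing the whole blob in one
    JSON column: typed columns for the queried keys, a single
    documented overflow field for the ragged rest."""
    doc = json.loads(raw_json)
    return {
        "session_id": str(doc["session_id"]),
        "started_at": datetime.fromisoformat(doc["started_at"]),
        "page_views": int(doc.get("page_views", 0)),
        "extra": json.dumps(
            {k: v for k, v in doc.items() if k not in PROMOTED_KEYS}
        ),
    }
```

Missing promoted keys now fail loudly at load time instead of surfacing months later as a broken dashboard, and the typed columns can participate in a dimensional model.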

Fallacy 5: The underlying architecture of your warehouse / pipeline / query planner is reliable & static

Microservices go down. RDBMS engines are improved and consume transactions in new ways. Vendor API versions are upgraded and deprecated with little or no notice. Resilience requires that each step of an application accounts for the possibility of failure in processes immediately upstream and downstream, and that flags are thrown as early as possible. Monitoring and safeguards belong in every stage of the process, from comprehensive alerts in AWS CloudWatch, to verbose Kimball-style screens of source data, to integration testing of the finalized data marts. Quality monitoring needs to go beyond process success and failure. If a job succeeds in 2 hours that used to succeed in 12 seconds, that is cause for concern. If a job that used to produce 20k rows a day now produces 300k, this needs investigation. Accounting for anomalies in every aspect of the data pipeline is key to building credibility in the data warehouse.
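Those "it succeeded, but something is off" checks can be as simple as comparing each run to its recent history. This is a deliberately minimal sketch (the function name, the tuple-based history, and the 5x ratio are all assumptions; a real deployment might compute z-scores over a metrics store and route alerts through something like CloudWatch):

```python
def screen_job_metrics(duration_s, rows_out, history, max_ratio=5.0):
    """Flag runs whose duration or row count swings far from the
    recent norm, even though the job itself reported success.
    history: list of (duration_s, rows_out) tuples from prior runs."""
    if not history:
        return []  # nothing to compare against yet
    avg_duration = sum(d for d, _ in history) / len(history)
    avg_rows = sum(r for _, r in history) / len(history)
    alerts = []
    if duration_s > avg_duration * max_ratio:
        alerts.append(f"duration {duration_s}s vs typical {avg_duration:.0f}s")
    if rows_out > avg_rows * max_ratio or rows_out < avg_rows / max_ratio:
        alerts.append(f"row count {rows_out} vs typical {avg_rows:.0f}")
    return alerts
```

The 12-seconds-to-2-hours and 20k-to-300k examples from above would both trip these checks, turning a silent anomaly into an actionable alert.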

Fallacy 6: Iterative development doesn’t work for data

Modern organizations run on a cycle of continuous improvement with a business value return, and the data pipeline is not excluded from that paradigm. What makes CI development less intuitive for data engineering is the nature of the deliverable. It is relatively easy to envision a feature delivery cycle for an e-commerce checkout page: ship a basic working checkout, then add saved payment methods, then streamline toward one-click ordering.

But delivering features and improvements in a data warehouse is less intuitive. It may look something like: land a minimally working pipeline first, then layer in screens and tests, fetch optimizations, and transaction safety.

It would feel great as an Engineer to implement a full screen and test suite from day one, optimize the fetch methods to handle larger record counts, wrap DML operations in appropriate transaction groups for better idempotency, and beautifully craft each module in the production pipeline. However, delivering a functional (if not perfect) production solution early is, in most cases, better. We can deliver code that runs on Monday and the business can start answering questions on Tuesday. By delivering the earliest working version possible, the business value can exist in parallel with our improvements to the process. As long as the organization is invested in the development of scalable, maintainable software (and willing to let the Data Engineering team immediately pay back technical debt), iterative development lets us deliver more value in less time.

This list is far from exhaustive, but it is a start toward thinking comprehensively about what it means to build a truly production-quality data pipeline.