Some Fallacies of Data Engineering
In a recent interview I was asked to identify some of the common pitfalls that face new Data Engineers; specifically, what are commonly held misconceptions that can afflict Engineers early in thier careers. After the review I realized that these lessons learned hold merit for anyone embarking on the road to Engineering zen - so I decided to share.
Fallacy 1: EL is an easy problem
When application developers move from transactional system design to data pipelining, there can be a certain amount of hubris regarding EL. Production applications consume data from a multitude of stores including relational / object databases, rest / soap apis, flat system filestores, and of course real-time user inputs, all in an exceptionally performant manner. Compared to real-time transactions with millisecond SLAs, how hard can batch processing of source data really be? The reality is that EL and transactional development are disciplines with dissimilar challenges, each requiring a distinct skill set and method of thinking. While transactional development is most often concerned with the current state of an entity, EL development mandates that we consider many states and effectively reflect data “in all times at once”. Transactional developers need to guard against race conditions in procedural algorithms, but that requirement often pales in comparison to the level of sophistication required for distributed EL processes. Batch EL processes can run for seconds, minutes, or hours, and the exposure to network / source / destination system errors grows in tandem with process duration. As a consequence, EL development demands state management that can survive and recover from failures in all layers.
Fallacy 2: There is one silver bullet EL solution
A freshman tactic that goes hand-in-hand with underestimating the challenges of EL is to try and implement a single catch-all technique for replicating source data. Different source data entities will have very different characteristics in change patterns, volume, underlying business processes, latency etc. Accounting for these factors is essential to designing effective, performant EL processes. For example:
- Insert-only sources with a sequenced (reliable) natural key can be replicated with a simple predicate based on the current highest natural key. This is ideal when the source data allows.
- For records with hard deletes, a delete reconciliation strategy (either a batch hash lookup against the current staging and source, or an audit managed on the source side) will need to be implemented. If volume and lookback window provides it may be cheaper to dump-and-reload (full refresh) this source instead.
- Underlying source system APIs may need to be called into play - like using the postgres XMIN, XMAX or OID values, or DB_TRX_ID on innoDB driven MySQL instances - may supplement weak source data models to determine recently updated or unique rows. These can be expensive at scale since they are not indexed, and hypothetically they need to wrap around at some point (breaking incremental lookups) so there are considerations either way.
- For records with late-arriving data, lookback upserts over a window of time may be the solution. If a vendor incrementally updates event records for up to 60 days, calling that API daily for as far back as records are in flux might be the EL decision. …And so on. It is critical that the source data is understood by the engineering team, which leads to the next point … The source data is understood / intuitive: When it comes to EL pipelines and data warehouse transformations, assumption is deadly. Too many fledgling engineers take for granted that line_items.price is the price paid for a line item, or that rental_contract.updated_at is the last time a row was modified. Data profiling is the first and most essential step to any EL process; look for the expected and unexpected, look for the weird, and find a subject matter expert that can explain the results. “Why are there rows in the production_runs table that have a created_at timestamp later than the packaged_at timestamp?” and “Why are there both estimated_expense and expense keys in the API response?” are questions that need to be documented and answered before an EL strategy can be formulated. If there isn’t a known source of answers, it may mean diving into the application code that touches the model (and then documenting the findings). Data Engineers do not have the luxury of thinking a thing represents an event in a certain way; they need to know beyond a doubt if the data warehouse is to have any credibility.
Fallacy 3: The source is unchanging
The applications and vendors we extract data from pretty much all have development teams. These teams come to work with the purpose of making changes (hopefully improvements) to the source applications and systems. It seems logical, therefore, to expect that our sources will be in a constant state of flux; yet a common pitfall of EL and transform processes is to design algorithms that depend on the source schema, data relationships and application business logic to remain static. One way to guard against unexpected structural changes is with explicit EL definitions. When staging and source schemas no longer match, this should be a halt-level error. Finding extra or missing attributes in api calls should throw exceptions (unless ragged data is expected and accounted for). The system should be designed in a way that changes require attention. Screens do the work of capturing behavioral changes. CASE statements that avoid ELSE assignments paired with a screen looking for NULL values can act as a flag that underlying business logic has changed, or new values have been introduced. Confidence interval screens can watch for sudden changes to a column’s behavior - such as all NULL values when a column is deprecated for a newly normalized table the dev team forgot to mention… which brings up another important strategy: It is critical to communicate with the development teams. Data Engineers need to champion a culture where developers think of them before they run a rake task, or drop several columns they no longer deem useful. A dialog with every data source development team, both internal and external, is essential to stay ahead of changes to the underlying source data.
Fallacy 4: JSON == Semi Structured == BIG DATA!!
Semi-structured data has a very important place in the modern world. Ragged and non-conforming hierarchies, like a Google Analytics session record, would be a nightmare to support transactionally in a fully normalized relational schema. But beware: the strategy of arbitrarily stuffing json objects into RDBMS columns is dangerous, often a symptom of poor project planning and missing requirements analysis. Stakeholders who do not want to make hard modeling decisions may see object stores as an easy way to avoid commitment and decision-making. The end product will be a data model that is painful to traverse, inconsistent in both key and value structure, and impossible for the the average user to derive value from. Dimensional modeling works for data warehouses; when a solution conflicts with dimensional structure, there should be heavy scrutiny to understand the motivation.
Fallacy 5: The underlying architecture of your warehouse / pipeline / query planner is reliable & static
Microservices go down. RDBMS engines are improved and consume transactions in new ways. Vendor API versions are upgraded and deprecated with little or no notice. Resilience requires that each step of an application accounts for the possibility of failure in processes immediately upstream and downstream, and that flags are thrown as early as possible. Monitoring and safeguards belong in every stage of the process - from comprehensive alerts in AWS Cloudwatch to verbose Kimball-style screens of source data, to integration testing of the finalized data marts, quality monitoring needs to go beyond process success and failure. If a job succeeds in 2 hours that used to succeed in 12 seconds, that is cause for concern. If a job that used to produce 20k rows a day now produces 300k, this needs investigation. Accounting for anomalies in every aspect of the data pipeline is key to building credibility in the data warehouse.
Fallacy 6: Iterative development doesn’t work for data
Modern organizations run on a cycle of continuous improvement with a business value return, and the data pipeline is not excluded from that paradigm. What makes CI development less intuitive for data engineering is the nature of the deliverable. It is relatively easy to envision a feature delivery cycle for an e-commerce checkout page:
- Iteration 1: Add quick checkout button
- Iteration 2: Move quick checkout based on UX feedback
- Iteration 3: Add cross-device cookie merging for quick checkout carts
But delivering features and improvements in a data warehouse is less intuitive. It may look something like:
- Iteration 1: Add EL task that can handle current record throughput
- Iteration 2: Refactor task to fetch records in batch increments for scalability
- Iteration 3: Add statistical screens to EL task for greater consumer confidence
- Iteration 4: Refactor screens as composed class to reduce technical debt
It would feel great as an Engineer to implement a full screen and test suite from day one, optimize the fetch methods to handle larger record counts, wrap DML operations in appropriate transaction groups for better idempotentcy, and beautifully craft each module in the production pipeline. However delivering a functional (if not perfect) production solution early is, in most cases, better. We can deliver code that runs on Monday and the business can start answering questions on Tuesday. By delivering the earliest working version possible, the business value can exist in parallel with our improvements to the process. As long as the organization is invested in the development of scalable, maintainable software (and willing to let the Data Engineering team immediately pay back technical debt), iterative development lets us deliver more value in less time.
This list is far from exhaustive, but does help to start comprehensive thinking about what it means to build a truly production-quality data pipeline.