AdSysAr

Blog on software engineering

Friday, September 15, 2006

There Are No Pure Data Problems

Printed in Computerworld; October 17, 2005


While I agree with Ken Karacsony’s assessment that too much ETL is a sign of potential problems (COMPUTERWORLD, SEPTEMBER 05, 2005), I have a very different opinion about what lies at the heart of the issue and what kind of solution it deserves. Before continuing with the rest of my response, I would like to emphasize that everything I assert here applies mainly to the online transaction processing (OLTP) side of the IT domain; things look somewhat different on the online analytical processing (OLAP) side.
While Ken Karacsony states that what we see is a sign of “poor data management,” I tend to think that, much more often, it is a sign of poor engineering practices in general rather than of poor data management in particular.
Data (that is, the values of certain business-related attributes) tends to be the most tangible and visible aspect of a system that we, as well as our business partners, can observe. Given that we work for the business community, the way business users see and perceive the data matters much more than our own (IT professional) perception of it. Quite often, when business users say “we have data problems,” we should interpret their statement as “something is wrong with the system; I do not know exactly what it is, I just know that it gives me a wrong answer, please fix it.”
There is no such thing as a “pure data problem,” because in any business application data always exists within the context of a business process. Whenever data is taken out of that business-process context, e.g. stored in relational DBMS tables, it loses a considerable portion of its semantic significance. For instance, let’s assume that a typical database for a financial services company has an Address record defined. While it may be sufficient in very simple cases to have just one flavor of address in the database, as business process complexity grows, data analysts and system developers will find themselves dealing with numerous variations of the Address structure: Current Client Residence Address, Property Address, Client Correspondence Address, Shipping Address, Billing Address, Third Party Address, etc. While all these Address records may have an identical physical structure, semantically they are very different. For example, running an automated home appraisal against the wrong address, say the Current Client Residence Address instead of the Property Address, will produce a wrong result that is impossible to catch outside of the business process context. Giving the Shipping department a Billing address instead of the Shipping one is probably also a bad idea.
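To make the point a bit more concrete, here is a minimal sketch, in Java, of how the different address “flavors” can be kept distinct in application code. All class and method names here are hypothetical, invented purely for illustration:

```java
// The raw record, as it might sit in a shared table: physically complete,
// but carrying no indication of what kind of address it is.
class Address {
    final String street, city, state, zip;
    Address(String street, String city, String state, String zip) {
        this.street = street; this.city = city; this.state = state; this.zip = zip;
    }
}

// Thin wrapper types restore the business-process meaning that the bare row loses.
class PropertyAddress        { final Address value; PropertyAddress(Address v)        { value = v; } }
class ClientResidenceAddress { final Address value; ClientResidenceAddress(Address v) { value = v; } }

class AppraisalService {
    // The automated appraisal can only be asked about the property being appraised;
    // handing it the client's residence address by mistake will not even compile.
    double estimateValue(PropertyAddress property) {
        return 0.0; // the actual appraisal model is outside the scope of this sketch
    }
}
```

The physical structure is identical in every case; only the type names carry the semantics, and that is exactly what evaporates when the same row is read straight out of a shared table.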
One way to ensure that data is not taken out of its business context is to build cohesive systems around a logical unit of the business and expose these systems to each other only through semantically rich messages. The advantage that the messaging style of integration has over the shared-database style is precisely this ability to transmit not only the shared data but also the shared business-context semantics. While it is not hard to maintain a similar degree of clarity with a shared-database design, in the absence of a very mature development process a shared database, by its nature serving many different owners at the same time, will rapidly lose its initial design crispness because it cannot keep up with the numerous modification requests. This in turn leads to data overloading, redundancy, and inconsistency, and in the end to poor “data quality” at the application level. Do not get me wrong: I am not against the shared-data-store integration approach; I am just recommending being realistic about the complexity of the method within the confines of the modern business environment. I would recommend using shared-data integration within the scope of a single business unit, while using message-based integration for inter-departmental development as well as at the enterprise level. It is significantly easier to provide a highly cohesive development environment within the boundaries of a single business unit, due to the natural uniformity of the unit’s business priorities.
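As an illustration of what I mean by a semantically rich message, here is a hypothetical sketch (again, every name is invented for the example): instead of letting another department read an address row out of a shared table, the owning system publishes an event that states what the data means and which business step produced it.

```java
// A hypothetical, semantically rich message: the payload carries not just the
// shared data (the address fields) but the business context that gives it meaning.
public class PropertyAppraisalRequested {

    private final String loanApplicationId;  // which business transaction this belongs to
    private final String propertyStreet;     // explicitly the address of the property being
    private final String propertyCity;       // appraised -- not the client's residence or
    private final String propertyState;      // correspondence address
    private final String propertyZip;
    private final String requestingUnit;     // the business unit that triggered the request

    public PropertyAppraisalRequested(String loanApplicationId,
                                      String propertyStreet, String propertyCity,
                                      String propertyState, String propertyZip,
                                      String requestingUnit) {
        this.loanApplicationId = loanApplicationId;
        this.propertyStreet = propertyStreet;
        this.propertyCity = propertyCity;
        this.propertyState = propertyState;
        this.propertyZip = propertyZip;
        this.requestingUnit = requestingUnit;
    }

    // getters omitted for brevity
}
```

The receiving system never has to guess which of the many address flavors it has been handed, or why; the message name and its fields say so explicitly.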
Except in the area of ad hoc reporting, our clients do not deal with databases -- they deal with business applications. I would also argue that too much ad hoc reporting signals problems with the business process design, the application workflow design, and/or the UI design. Too many OLTP applications are poorly designed and thus have poor usability, forcing users to compensate by requesting a high volume of “canned” reports as well as sophisticated ad hoc reporting capabilities. In a world of carefully designed applications, it is the applications, not the databases, that are the centers of customer interaction. As an example, I recently worked on a project where we were able to either completely eliminate, or migrate into an application’s human workflow process, more than half of the reports initially requested by the business users.
The solution to the “too much ETL” problem in the OLTP world is thus less centralization and looser coupling of the OLTP systems, not more centralization and tighter application coupling through a common data store. One can argue that it is always possible to introduce a layer of indirection (e.g. XML) between the application logic and the common database’s physical schema, thus providing a level of flexibility and decoupling. While this may work for some companies, in my personal experience this type of design proved harder to maintain than the more robust asynchronous, middleware-based messaging, because it mixes two different design paradigms.
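For readers who have not worked in this style, here is a rough sketch of what asynchronous, middleware-based messaging looks like in code. It uses the standard JMS API; the class name, the queue, and the XML payload are of course hypothetical, and any comparable messaging middleware would serve the same purpose:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

// A hypothetical publisher: the owning system sends the appraisal request as a
// message and keeps running; the receiving system picks it up on its own schedule.
public class AppraisalRequestPublisher {

    private final ConnectionFactory connectionFactory; // supplied by the middleware (e.g. via JNDI)
    private final Queue appraisalQueue;                 // the queue the appraisal system listens on

    public AppraisalRequestPublisher(ConnectionFactory connectionFactory, Queue appraisalQueue) {
        this.connectionFactory = connectionFactory;
        this.appraisalQueue = appraisalQueue;
    }

    public void publish(String appraisalRequestXml) throws JMSException {
        Connection connection = connectionFactory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(appraisalQueue);
            TextMessage message = session.createTextMessage(appraisalRequestXml);
            producer.send(message);
        } finally {
            connection.close(); // closing the connection also closes the session and producer
        }
    }
}
```

Because neither side blocks on the other or reaches into the other’s schema, each system can be modified and released on its own schedule, which is exactly the decoupling I am arguing for.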
I would be interested in hearing from COMPUTERWORLD readers about any medium- to large-sized company that has succeeded in building multi-departmental Operational Data Stores that worked well with multiple inter-departmental systems through a number of consecutive releases. I predict that it will be hard to find a significant number of cases to discuss at all, and it will be especially difficult to find examples from companies with a dynamic business process that requires the constant introduction of new products and services. The main reason for the lack of success, from my point of view, is not technical in nature. It is relatively easy to build tightly coupled applications integrated via a common data store, especially if it is done under the umbrella of a single program with a mature system development culture. The problem is in the “Realpolitik” of the modern business environment: we live in, and work for businesses in, an age of ever-accelerating global competition. It is almost impossible to coordinate the business plans of various departments, and the subsequent deployment schedules of multiple IT projects each working on its own group of business priorities, well enough to keep systems built around one shared database current. When one of the interdependent development teams misses a deliverable deadline, the political pressure to separate becomes hard to resist. And if a commercial off-the-shelf (COTS) software package is acquired, or a corporate merger or acquisition takes place, the whole idea of all applications working with one common data format is immediately thrown out the window.
So we in IT need to learn how to build systems that do not require rigid release synchronization across the multiple OLTP systems belonging to disparate business units. Decoupling can provide us with the flexibility required to modify our systems on a coordinated, but not prohibitively rigid, schedule.
Finally, it is important to emphasize that while loose coupling gives us an opportunity to modify different systems on different schedules without corrupting the coupled systems, loosely coupled does not mean “loosely managed.” Loose coupling provides us with a degree of flexibility in implementation and deployment. This additional flexibility gives our business partners the ability to move rapidly when they need to, and at the same time gives IT the ability to contain and manage the challenges caused by the ever-increasing rate of business change. We have to acknowledge that developing loosely coupled applications with well-delineated responsibilities that work well together across an enterprise is a very challenging engineering problem. If not managed well, this type of system development may turn the advantage of loose coupling into the disadvantage of “delayed-action” semantic problems. A mature IT development process is absolutely necessary to overcome this engineering problem and deliver this type of information infrastructure to our business partners. From this perspective, it is worthwhile for any organization striving to build a well-integrated, enterprise-level IT infrastructure to look into the SEI Capability Maturity Model (CMM). Specifically, Maturity Level 3, called the Defined Level, addresses issues of development process consistency across the whole enterprise; from my point of view, it is a prerequisite to the physical integration of enterprise systems into a consistent whole. The CMM describes this level as one at which “the standard process for developing and maintaining software across the organization is documented, including both software engineering and management processes, and these processes are integrated into a coherent whole.” Unless an organization is prepared to operate at this level, it should not have high hopes for success in the integration area.
So, to summarize: successful data management begins by taking the focus away from the data itself. Instead, the focus should be on the general level of system engineering and its main aspects, e.g., Requirements Analysis and Management, Business Domain Modeling, Configuration Management, QA Process, etc.
I would argue that any medium to large company that has not reached CMM Level 3 and is trying to get its “data under control” has little chance of succeeding in this undertaking, regardless of which integration style it uses.
