Blog on software engineering

Friday, October 31, 2008

MDM Is Not Enough, Semantic Enterprise Is Needed.



This article introduces the concept of the Semantic Enterprise and outlines the connection between the Semantic Enterprise and Master Data Management (MDM). The article also shows that a successful transition to the Semantic Enterprise requires significant improvements in enterprise metadata management, and especially in business metadata management. It explains the importance of having both the business and information technology (IT) communities support an enterprise-level semantic continuum by committing to Enterprise Architecture tenets that would bring the two communities together in a more synergetic environment.
Once broadly realized, the critical and indispensable nature of the relationship between Business, Information and Technology architectures will generate demand for improvements to modeling tools that vendors will have to meet in order to remain relevant.

Master Data Management and Metadata

Currently, most tool vendors define Master Data Management as the capability to create and maintain a single, authoritative physical source of “master” data. The purpose of this source is to make shared data – data that has a single content and format – available to all the enterprise systems that need to reference it. As such, Master Data is typically called Reference Data[1].
While MDM, by this definition, is an important technical pursuit in its own right, there is a larger phenomenon behind it.
The broader issue is the semantic integrity (or rather the lack of it) of shared data, particularly at the enterprise level. As Vickie Farrell, the former VP of marketing for Cerebra, puts it: “…lack of what Gartner calls "semantic reconciliation" among data from different sources is inherent in a diverse, dynamic and autonomous organization. … Resolving discrepancies in metadata descriptions from multiple tools, not to mention cultural and historical differences, involves more than physically consolidating metadata into a common repository.”[2]

In other words, it is always possible, and arguably quite easy, to misinterpret any shared data in the absence of rich contextual information that unambiguously distinguishes between different possible meanings. A substantial portion of this rich metadata context should come from information about the business processes that generate and use the shared data. While this metadata continuum starts in the business function model layer (more on architectural layering later in the “Three-Layered Architectural Model” section), it should support a consistent interpretation of shared metadata that continues through the complete business-IT space, all the way through to the implementation and maintenance of the deployed applications and services.

[1] Some authors differentiate between reference and master data. See “Master Data versus Reference Data” by Malcolm Chisholm, Published in DM Review Exclusive Online Content in April 2006.
[2] “The Need for Active Metadata Integration: The Hard-Boiled Truth”, Vickie Farrell; DM Direct, September 2005.


Consider, for example, a scenario where the Marketing, Sales, and Customer Service departments all use the enterprise Current Customers set. In order for any enterprise to produce reconcilable financial and managerial reports, it is imperative that when systems from different departments access the same data from a single source, their interpretation of what constitutes “current customers” is also identical. In the case of historically different definitions embedded in legacy systems, each department should be aware of exactly which particular definition has been used for the enterprise master data and how to correlate that definition with its own departmental definition of “current customer”.
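To make the discrepancy concrete, here is a minimal sketch in Python. The two departmental definitions and all field names are hypothetical, invented purely for illustration; real definitions would live in business metadata, not application code.

```python
from datetime import date, timedelta

# Hypothetical departmental definitions of "current customer".
def is_current_customer_sales(cust, today):
    # Sales: any customer with a purchase in the last 12 months.
    return (today - cust["last_purchase"]) <= timedelta(days=365)

def is_current_customer_service(cust, today):
    # Customer Service: any customer with an unexpired support contract.
    return cust["contract_expires"] >= today

customer = {
    "id": 42,
    "last_purchase": date(2008, 1, 15),
    "contract_expires": date(2008, 6, 30),
}
today = date(2008, 10, 31)

# The same master record yields different answers under the two
# definitions -- exactly the discrepancy that a shared physical
# data store, by itself, cannot resolve.
print(is_current_customer_sales(customer, today))    # True
print(is_current_customer_service(customer, today))  # False
```

The point is that both departments read identical bytes from the master store, yet report different customer counts; only business-level metadata recording each definition makes the divergence visible and correlatable.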
The following architectural model helps to minimize the probability of errors similar to the one described above.

Three-layered architectural model
A simple Enterprise Architecture model that supports the desired metadata layering is shown in Figure 1[1]. This model links all three layers: Business Function, System Specification, and Physical / Implementation, into one Enterprise Metadata continuum, in order to guarantee information integrity for the whole enterprise.

All constituents that populate Enterprise Architecture models can be grouped into four architectural domains: Business, Information, Application, and Infrastructure[2]. Notice that according to the proposed partitioning, Data Architecture does not constitute its own domain but is actually a sub-domain of Information Architecture.

Management of metadata information that describes the physical (infrastructure) layer is a challenging problem in its own right. However, this topic is well outside of this article’s scope and is extensively covered by the ITIL and numerous other publications on Configuration Management Database (CMDB). For those interested in a detailed discussion of the enterprise infrastructure metadata topic, please see Charles Betz’s “Architecture and Patterns for IT Service Management, Resource Planning, and Governance: Making Shoes for the Cobbler's Children”.
For the purpose of this discussion, we will concentrate on the business function layer. A business function is a particular proficiency that an organization — typically an enterprise — possesses and operates to achieve a specific business goal. A business function is an abstraction of a business process that preserves the main characteristics of what is being delivered by the process, discarding most of the information about how it is done, and thus representing an externally visible view of the business process.

The foundational nature of the business-function layer definitions needs to be emphasized. Unless all applications that reference master data elements agree on the exact meaning of all shared information structures, within the appropriate business process context (e.g., “customer”, “current customer”, “returning customer”, “high-value customer”, etc.), there is very little practical value in creating shared physical data stores.

Unfortunately, the generation of business-function layer metadata, along with its maintenance and integration with metadata from other layers, presents a number of issues discussed below.

Business Function Layer and its Metadata
Despite ongoing discussion about the need to add more explicit, technology-independent, business function-level contextual information to metadata repositories, there are currently very few tools that offer such capabilities (I’d be happy to learn that I am wrong here). One of the reasons that integration of business function-level metadata across all architectural layers is so difficult to achieve is that the standards for metadata representation in each of the model layers are still emerging, especially in the top-most business function layer.
Recent progress by modeling tools vendors and the user community around the BPEL[1], XMI[2], CWM[3], and other standards positioned to answer the need for exchanging and storing metadata is very promising[4].
However, the integration and unification of different types of modeling tools, as predicted in 2005 by Michael J. Blechar of Gartner, is far from complete: “Before disparate modeling tools from the same or different vendors can truly become integrated, standards such as UML and BPMN must evolve and be coordinated. Gartner does not expect this effort to be completed — or, at least, approach a "good enough" solution to integrate best-of breed tools in any meaningful way — until 2007 or 2008.”[5]

While integration of modeling tools is definitely a serious problem, the main problem faced by most companies is the failure to realize that a common understanding of information is needed between the different constituencies of the modern enterprise, both external and internal. The common semantic enterprise space should extend in two dimensions: horizontal and vertical. The horizontal dimension addresses the interactions between different departments of the same company, as well as between a particular business entity and its external environment (i.e., business partners, suppliers, legal and regulatory compliance mandates, consumers, etc.). The vertical dimension addresses the information interchange between business users looking for productivity improvements and cost reductions on the one hand, and the IT community responsible for implementing the required automated information systems on the other.

Due to the extremely complex nature of modern IT environments, and the specialization that this complexity causes, further specialization in skill sets and fragmentation of the IT semantic space need to be addressed as well. One possible solution here is the creation of an organizational role at each level (business function, logical specification, and physical implementation) that is responsible for the semantic integrity of its own layer as well as for interlayer semantic integrity.
For example, business process models created by business process architects and the information models created by information architects should have referential integrity. For this wonderful, if not miraculous, event to finally take place, these models need to share a common semantic space (i.e., every information element and information structure that exists in the information models should be referenced in the process models and have identical meaning in all of them). The corollary is also true: no two different information elements can have the same meaning in all business process models within the scope of a single business unit domain. In addition, the existence of information elements (in information models) that are not used in any of the business process models is, in general, disallowed.
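The cross-model rules above lend themselves to a simple mechanical check. The following toy sketch (all model names and element sets are invented for illustration) flags information elements that no process model references, and process references that the information model does not define:

```python
# A toy consistency check between an information model and a set of
# business process models. Names are purely illustrative.
information_model = {"Customer", "Order", "OrderLine", "Invoice", "LegacyCode"}

process_models = {
    "Order Capture": {"Customer", "Order", "OrderLine"},
    "Billing":       {"Customer", "Order", "Invoice"},
}

# Every element referenced by at least one process model.
referenced = set().union(*process_models.values())

# Elements defined but never used by any process model -- disallowed
# by the rule stated above.
orphans = information_model - referenced

# Elements used by processes but missing from the information model --
# a referential integrity violation.
undefined = referenced - information_model

print("orphan elements:", sorted(orphans))
print("undefined elements:", sorted(undefined))
```

Trivial as it is, a check of this shape is exactly what integrated modeling tools would need to run continuously across the shared semantic space.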

Semantic Enterprise and Semantic Web

The concept of Enterprise semantic space (or Semantic Enterprise) is closely related to that of Semantic Web. While there are significant similarities between the two, there are also some significant differences.
The main similarity between the two concepts is the notion of well-defined information meaning (or semantics), which ensures that complex information management processes are successfully executed in a predictable manner. Both concepts require well-structured information models (a.k.a. ontologies), as well as tools that process these rich information models in order to establish and maintain a common understanding of information between the different parties involved in exchanging it. At the implementation level, it is safe to assume that UML- and XML-based technologies (MOF, XMI, RDF, OWL, etc.) will play the central role in the development of both concepts.
At the same time, the main difference is rooted in the degree of centralization, or rather de-centralization. While the creators of the Semantic Web presume a highly decentralized model with a relatively high degree of inconsistency, the Semantic Enterprise is significantly more sensitive to semantic inconsistency and thus requires a higher degree of centralization and control.
While detailed discussion of the Semantic Web is beyond the scope of this article, it is important to understand that these two concepts have significant overlap and would probably develop a common set of tools and approaches.


Unifying Data- and Process-centric Views on Metadata

From the enterprise metadata point of view, MDM, in a broad sense, is a sign of things to come. It highlights the existing need for a new approach to the integration of business function-level metadata at the Enterprise level, or Enterprise Business Metadata Integration[1] (EBMI). The term EBMI describes a consistently coordinated (unified) view of business function-level information at the Enterprise level. It implies neither a single physical format used by all systems, nor a virtual database that provides access to the information regardless of its physical location (similar to the Enterprise Information Integration technique). In actuality, EBMI is primarily concerned with conceptual- and logical-level information, rather than information at the physical implementation level. EBMI implies that there is a definition, agreed upon by all business and IT constituencies, for certain information structures, each within a particular well-defined business function context, as well as a robust translation mechanism between the different metadata formats used by multiple business and technology counterparts. Any physical implementation that is compliant with the definition above is naturally a part of the solution.
For example, two departments of the same company, Order Management and Fulfillment on one side, and Customer Service on the other, have historically had different definitions of what constitutes a “returning customer”, a “fulfilled order”, an “inventory level trigger”, etc. The company, as a whole, may or may not have implemented a single physical data store for this information. While it is highly desirable to factor out common enterprise information and store it in one central location to minimize redundancy, it is not critical. It is however absolutely critical for the long-term health of the enterprise to make sure that both departments are aware that they have differences in the customer and order semantics, which are rooted in implementations of the departmental business processes. Another critical component of the solution is a set of rules that correlates (translates) the two departmental definitions.

The EBMI concept is similar to the business modeling approach provided by Business Process Modeling (BPM) tool vendors since it presents rich contextual information about business functions that a company possesses in order to meet its business objectives. What makes this concept different from the traditional BPM approach is that it adds metadata, which describes the information (data) elements participating in the business processes. By providing a robust information architecture model that complements the business process modeling view, EBMI brings together the BPM approach and the data-centric approach (traditionally used by the data modeling and database programmers’ community).
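A correlation rule set of the kind described above can be sketched very simply. In this hypothetical example (all status vocabularies and names are invented), the rules themselves are metadata: they record how the two departmental definitions relate, not merely that they differ:

```python
# Hypothetical departmental vocabularies for customer status.
fulfillment_statuses = {"new", "repeat_buyer", "lapsed"}
service_statuses = {"first_contact", "returning", "inactive"}

# The correlation (translation) rules between the two vocabularies.
# In an EBMI setting these rules would live in the enterprise
# business metadata repository, agreed to by both departments.
fulfillment_to_service = {
    "new": "first_contact",
    "repeat_buyer": "returning",
    "lapsed": "inactive",
}

def translate(status):
    # Fail loudly when no correlation rule exists, rather than
    # silently guessing a mapping.
    if status not in fulfillment_to_service:
        raise ValueError(f"no correlation rule for {status!r}")
    return fulfillment_to_service[status]

print(translate("repeat_buyer"))  # returning
```

The essential design choice is that an unmapped status is an error, not a default: a missing rule signals a semantic gap that the business, not the code, must close.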


The examples and issues presented above demonstrate that the effort to establish a semantic enterprise continuum should be driven by both the business and the IT communities. Cooperation between the two communities, and especially the sponsorship of business leaders, is of primary importance. However, the goal of building the Semantic Enterprise cannot be accomplished without the necessary tools, even with the two communities working in concert. This is especially true for interlayer integration: while tool vendors have begun to provide better support for physical-layer metadata integration, support is almost non-existent for cross-layer interactions. Vendors that provide not only “horizontal” integration capabilities across physical sources (physical source data lineage), but also “vertical” integration capabilities between physical data and business process contextual information, will emerge victorious in the critical integration race.

[1] Web Services Business Process Execution Language, OASIS Standard WS-BPEL 2.0.
[2] XML Metadata Interchange standard, Object Management Group Standard.
[3] Common Warehouse Metamodel -- specification for modeling metadata for a data warehousing environment, Object Management Group Standard.
[5] Michael J. Blechar, Gartner Research Publication G00129905, “BPA, Object-Oriented and Data Modeling Tools Are Converging”, 2005.

[1] For a detailed discussion of the three-layered Enterprise Architecture model, please see “Quality Data Through Enterprise Information Architecture”.
[2] For a more detailed discussion of the four architectural domains, please see the Forrester Research Report “Creating the Information Architecture Function”.
[3] Charles T. Betz, “Architecture and Patterns for IT Service Management, Resource Planning, and Governance: Making Shoes for the Cobbler's Children”, Morgan Kaufmann, 2006, ISBN-10: 0123705932.

[1] I would prefer to use the term Enterprise Information Integration (EII) but unfortunately this term is already used to describe a virtual integrated data view centered on the physical implementation layer.

Monday, August 20, 2007

The Art and Craft of Great Software Architecture and Development: How to spot the dreaded non-coding architect

Monday, February 12, 2007

Please see the edited version on MSDN website:


Wednesday, September 27, 2006

Enterprise Information Architecture as a foundation for successful data quality management.


Data quality is a well-known problem and a very expensive one to fix. While it has been plaguing major US corporations for quite some time, lately it has become increasingly painful. The growing prominence of various regulatory compliance acts (e.g., SOX, GLB, Basel II, HIPAA, HOEPA, etc.) necessitates an adequate response to the problem.

A common approach to the data quality problem usually starts and ends with activities scoped to the physical data storage layer (frequently relational databases) in the classes of applications that heavily depend on enterprise data quality: e.g., Business Intelligence, financial reporting, market trend analysis, etc.
Not surprisingly, according to the trade publications, most of these efforts have minimal success. Given that in business applications data always exists within the context of a business process, all attempts to solve the “data quality problem” at the purely physical data level (i.e., databases and ETL tools) are doomed to fail.

Successful business data management begins by taking the focus away from data. The focus initially should be on the creation of Enterprise Architecture, especially its commonly missing Business and Information Architecture constituents. Information Architecture spans the Business and Technology architectures, brings them together, keeps them together, and provides the rich contextual environment necessary to solve the ubiquitous “data quality problem”.

Thus Enterprise Business, Information and Technology architectures are needed for successful data management.

Data Quality

Data Quality Deficiency Syndrome
Major business initiatives in a broad spectrum of industries, private and public sectors alike, have been delayed and even cancelled citing poor data quality as the main reason. The problem of poor information quality has become so severe, that it has moved to the top tier among the reasons for business customers’ dissatisfaction with their IT counterparts.
While there is little argument that poor data quality is probably the most noticeable issue, in the vast majority of cases it is accompanied by equally poor quality of systems engineering in general, i.e., requirements elicitation and management, application design, configuration management, change control, and overall project management. The popular belief that “if we just get data under control, the rest (usability, scalability, maintainability, modifiability, etc.) will also follow” has proven to be consistently wrong. I have never seen a company where the data quality level was significantly different from the overall quality level of the IT environment. If business applications in general are failing to meet (even) realistic expectations of the business users, then data is just one of the reasons cited, albeit the most frequent one. As a corollary, if business users are happy with the level of IT services, usually all the quality parameters of the IT organization’s effort are at a satisfactory level. I challenge the readers, from their prior experience and knowledge, to come up with an example where data quality in a company was significantly different from the rest of the information systems’ quality parameters.
What is commonly called the “poor data quality problem” should more appropriately be called the “data quality deficiency syndrome”. It is indeed just a symptom of a larger and more complex phenomenon that could be called “poor quality of systems engineering in general”. Data quality is just the most tangible and obvious problem that our business partners observe.1

What is data?
Since data plays such a prominent role in our discussion, let’s first agree on what data is. Generally, most agree that “data”2 is a statement accepted at face value. A large class of data consists of measurements of a variable.
While all the examples in this article assume numeric and alphanumeric values, the assertions should be applicable to image-typed values as well.

1 For a great example of how an attempt to deal with data architecture separately from all the other architectural issues leads to potentially significant problems, please see the following article in Microsoft’s The Architecture Journal: “Data Replication as an Enterprise SOA Antipattern”.

2 Data is the plural of “datum”.

Data context
The notion that data is produced by measurements or observations is very significant. It points to a concept that is absolutely critical to the success of any data quality improvement effort: the notion of data context, or metadata. In other words, a number just by itself, stripped of its context, is not really meaningful to business users. For example, the number 10 taken without an appropriate context bears little use. However, if one learns that we are talking about 10 cars and not 10 bikes, the number now yields a better understanding of the business situation. The more data context is available, the better our ability to understand what a piece of data really means. To continue with the example above, so far we have learned that we are talking about 10 cars. If we add to this context that we are talking about 10 cars awaiting detailing and then delivery to a specific party, let’s say “Z Car Shop”, that has already paid for them, we now have a much better understanding of the business circumstances surrounding this number. This is the crux of the “poor data quality problem”: lack of sufficient data context. We typically do not have enough supporting information to understand what a particular number (or a set of numbers) means, and thus we cannot make an accurate judgment about the validity and applicability of the data.3
As IT consultant Ellen Friedman puts it: “Trying to understand the business domain by understanding individual data elements out of context is like trying to understand a community by reading the phone book.”
The class of data that is the subject of this article always exists within the context of a business process. In order to solve the “poor data quality problem”, the data context should always be well-defined, well-understood, and well-managed by both data producers and consumers.
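The progression from a bare number to a context-rich record can be sketched directly. In this illustrative Python fragment (the types and field names are invented, loosely following the “10 cars” example above), the second type carries the business context as explicit metadata fields:

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    value: int                 # the bare number: little business meaning

@dataclass
class ContextualMeasurement:
    value: int
    unit: str                  # what is being counted
    business_state: str        # where in the business process the count applies
    counterparty: str          # whom the count concerns

bare = Measurement(10)
rich = ContextualMeasurement(
    value=10,
    unit="cars",
    business_state="awaiting detailing, paid, pending delivery",
    counterparty="Z Car Shop",
)

# Same number, but only the second record supports a business judgment.
print(rich.value, rich.unit, "for", rich.counterparty)
```

In a real system these context fields would be keys into enterprise metadata rather than free-form strings, but the shape of the improvement is the same: the context travels with the value.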

Data quality attributes

Professor Richard Wang of MIT defines 15 dimensions, or categories, of data quality problems: accuracy, objectivity, believability, reputation, relevancy, value-added, timeliness, completeness, amount of information, interpretability, ease of understanding, consistent representation, concise representation, access, and security.
A serious discussion of the above list would warrant a whole book; however, it is important to note that most of these attributes in fact represent the notion of data context. For the purpose of our discussion on data quality, the most relevant attributes are interpretability, ease of understanding, completeness, and timeliness.
The timeliness attribute, also known as the temporal aspect of information/data, is arguably the most intricate one from the data quality perspective.
There are at least two interpretations of data timeliness. The first deals with our ability to present required data to a data consumer on time. It is a derivative of good requirements and design, but in the context of this article, it is of little interest to us. The second is the notion of data having a distinctive “time/event stamp” related to the business process, allowing us to interpret data in conjunction with the appropriate business events. It is not hard to see that more than half of the data quality attributes in the list above are at least associated with, if not derived from, this interpretation of timeliness. The importance of the time/event attribute points to a fundamental problem with the conventional data modeling technique, i.e., entity-relationship modeling or the entity-relationship diagram (ERD). The ERD method lacks any mechanism similar to UML’s Event and State Transition diagrams. This gap in turn not only leads to a consistent under-representation of this extremely important aspect of data quality in conventional data models, but also creates a serious knowledge-management problem for a large group of players in the data quality arena.

3 It is necessary to acknowledge that there is a class of data problems that is rooted in technology. These are problems related to the usage of wrong technology, coding mistakes such as errors in calculations and transformations, etc. All these problems, however, are relatively easy to detect and thus fix if a rigorous development process is used. This class of data problems, while being vital in certain cases (e.g., losing the Mars Climate Orbiter in 1999 due to a programming error), is not the subject of this article.

According to J.M. Juran, a well-known authority in the quality control area who named and popularized the Pareto principle (commonly referred to today as the “80-20 principle”), data are of high quality “if they are fit for their intended uses in operations, decision making and planning. Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer.”4
Again, this definition points to the notion that data quality is dependent on our ability to understand data correctly and use them appropriately.
As an example, consider U.S. postal address data. Postal addresses are one of the very few data areas with well-defined and universally accepted standards. Yet even though an address can be validated against commercially available data banks, this is not enough. If a shipping address is used for billing and vice versa, or a borrower’s correspondence address is used for an appraisal, the results will obviously be wrong.
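The address example can be reduced to a tiny sketch: a perfectly valid address still needs its business role attached, and substituting one role for another should be impossible to do silently. The roles and addresses here are invented for illustration:

```python
# Addresses keyed by their business role, not stored as a bare string.
addresses = {
    "shipping": "1 Warehouse Rd, Springfield, IL 62701",
    "billing":  "PO Box 99, Springfield, IL 62701",
}

def address_for(purpose):
    # Refuse to silently substitute one role for another: a missing
    # role is an error, not an invitation to reuse whatever is there.
    if purpose not in addresses:
        raise KeyError(f"no address registered for purpose {purpose!r}")
    return addresses[purpose]

print(address_for("billing"))
```

Both strings would pass postal validation; only the role metadata prevents the billing/shipping mix-up described above.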

As already discussed above, the temporal aspect of information quality is extremely important for understanding and communication, but it is often lost. For example, in the mortgage-backed securities arena, there are two very similar processes with almost identical associated data. The first is the Asset Accounting Cycle, which starts at the end of the month for interest accrual due the next period. The second is the Cash Flow Distribution Cycle, which starts 15 days after the Asset Accounting Cycle begins. This difference of 15 calendar days, during which many changes to the status of a financial asset can take place, can make financial outcomes differ significantly; yet from the pure data modeling perspective, the database models in the two cases are very similar or even identical. A data modeler who is not intimately familiar with the nuances of a business process will not be able to discern the difference between the data associated with the disparate processes just by analyzing the data in the database.
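The two-cycle example can be made concrete with a sketch in which two records share an identical schema and only an explicit process-context tag (a form of the “time/event stamp” discussed above) distinguishes them. All field names, dates, and values are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class AssetRecord:
    asset_id: str
    balance: float
    as_of: date
    process_context: str   # the "time/event stamp" tying data to its cycle

accounting = AssetRecord("MBS-001", 1_000_000.0, date(2008, 9, 30),
                         process_context="Asset Accounting Cycle")
cash_flow = AssetRecord("MBS-001", 1_000_000.0, date(2008, 10, 15),
                        process_context="Cash Flow Distribution Cycle")

# Without process_context, the two records would be structurally
# indistinguishable to anyone inspecting the database alone.
print((cash_flow.as_of - accounting.as_of).days, "days apart")
```

The design point is that the distinguishing information is business process metadata; no amount of schema analysis would recover it from the data values themselves.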

4 Juran, Joseph M. and A. Blanton Godfrey, Juran's Quality Handbook, Fifth Edition, p. 2.2, McGraw-Hill, 1999


Architecture as metadata source
As previously discussed, conventional data modeling techniques lack a mechanism that can provide the sufficiently rich metadata necessary for any data quality improvement effort to succeed. At the same time, this rich contextual model is a natural byproduct of a successful Enterprise Architecture (EA) development process, so long as this process adheres to a rigorous engineering approach5.
Architecture definition
Architecture is one of the most used (and abused) terms in the areas of software and systems engineering. In order to get a good feel for the complexity of the systems architecture topic, it suffices to list some of the most commonly used architectural categories, methods, and models: Enterprise, Data, Application, Systems, Infrastructure, Zachman, Information, Business, Network, Security, Model Driven Architecture (MDA), and certainly the latest silver bullet: Service-Oriented Architecture (SOA). All of the above architecture types naturally have a whole body of theoretical and practical knowledge associated with them. Any in-depth discussion of the various architectural categories and approaches is clearly outside the scope of this article; however, it is important to concentrate on the concept of Enterprise Architecture, and the following definition by Philippe Kruchten provides the context for this discussion: “Architecture encompasses the set of significant decisions about the system structure”6.
Similarly, Eberhardt Rechtin states7: “A system is defined ... as a set of different elements so connected or related as to perform a unique function not performable by the elements alone”.
In order to emphasize the practical side of architecture development, the two definitions above can be further enriched. A long-time colleague of mine, Mike Regan, a systems architect with many successful system implementations under his belt, adds: “Architecture can be captured as a set of abstractions about the system that provide enough essential information to form the basis for communication, analysis, and decision making.”

From the above definitions, it is clear that system architecture is the fundamental organization of a system. System architecture contains definitions of the main system constituents, as well as the relationships among them.
Naturally, the architecture of a complex system is very complex as well. In order to deal with such architectural complexity, some decomposition method is needed. One such method is the Three-Layered Model.


5 It is prudent to concentrate on Enterprise-level architecture since it is the most semantically difficult level and thus has the highest return potential for data quality improvements. All the discussion points below will be also applicable to any lower level concepts: LOB, department, etc.
6 Philippe Kruchten,
7 Systems Architecting: Creating and building complex systems, Eberhardt Rechtin, Prentice-Hall, 1991

Three-Layered Model
All modern architectural approaches are centered on the concept of model layers: horizontally-oriented groups defined by a common relationship with other layers, usually their immediate neighbors above and below. A possible layering for EA consists of a capabilities (or business process) layer at the top, an information technology specifications layer in the middle, and an information technology physical implementation layer at the bottom8. This model assumes an information systems-centered approach; in other words, the purpose of this architectural model is to provide a path to successful information systems implementation.

A simplified Three-Layered Model is shown in Figure 1. Some key concepts are worth mentioning:
First, although business strategy is not a constituent of the Business Architecture layer, it represents a set of guidelines for enterprise actions regarding markets, products, business partners and clients. A more elaborate view of these actions is captured by Business Architecture.
Second, the model demonstrates that Enterprise Information Models reside in both the Conceptual and the Logical layers, and provide the foundation for consistent interaction between these layers.
Third, the Enterprise IT Governance Framework is defined in the top conceptual layer, while the IT standards and guidelines that support the Governance Framework are implemented in the Specification layer.
Finally, the Enterprise Specification layer defines only the Enterprise Integration Model for the departmental systems, not their internal architectures.

The discussion that follows the diagram expands on the notion of business process architecture and elaborates on the layering details.

8 This approach is inspired by Martin Fowler’s book “Analysis Patterns: Reusable Object Models”, as well as by OMG’s Model Driven Architecture (MDA)

Business Architecture
It is important to emphasize the business process layer as the foundation of our Enterprise Architecture model. Carnegie Mellon University (CMU) provides the following definition of Enterprise Architecture: “A means for describing business structures and processes that connect business structures”. Interestingly enough, this definition from the CMU Software Architecture Glossary is actually applicable to, and definitive of, EA as a whole, and not just the Business EA.
The EA definition used by US Government agencies and departments emphasizes a strategic set of assets that defines the business, the information necessary to operate the business, and the technologies necessary to support the business operations.
While this definition maps extremely well onto the three-layered view of EA proposed above, a word of caution is appropriate: while the three-layered model provides a good first approximation of Enterprise Architecture, it is by no means complete or rigorous. Both the business and technical constituencies of the EA model can, and should, in turn be decomposed into multiple sub-layers.
In the quest to make the Business Enterprise Architecture (BEA) layer a robust practical concept, BEA has morphed from the initial organizational chart-centered, and thus brittle, view into a business process-centered orientation, and lately into a business capabilities-centered view, becoming ever more resilient to business changes and transformations.
The current consensus around EA accentuates both its business and technical constituencies. This business-IT partnership is further highlighted by the advances of service-oriented architecture (SOA), which views the business process model and the supporting technology model as an assembly of inter-connected services.

9 CMU SEI, Software Architecture Glossary,

The original version is: “Enterprise Architecture ‘‘(A) means—‘‘(i) a strategic information asset base, which defines the mission; ‘‘(ii) the information necessary to perform the mission; ‘‘(iii) the technologies necessary to perform the mission; and ‘‘(iv) the transitional processes for implementing new technologies in response to changing mission needs; and ‘‘(B) includes—‘‘(i) a baseline architecture; ‘‘(ii) a target architecture; and ‘‘(iii) a sequencing plan…”

11 A Business-Oriented Foundation for Service Orientation, Ulrich Homann

Architectural model as a foundation for data quality improvement
Top Business layer
In the proposed three-layered view of the EA, the business process (or capabilities) layer includes a business domain class model. Since this model is implemented at the highest possible level of abstraction, it captures only foundational business entities and their relationships. Thus, the top-layer domain model is very stable and is not subject to change unless the most essential underlying business structures change. The information (or data) elements defined at this level of the domain model are cross-referenced against the business process model residing in the same top layer. In other words, every domain model element has at least one business process definition that references it. The reverse is also true: no information element called out in the business process definitions is absent from the domain model.
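The bidirectional cross-reference rule described above can be checked mechanically. The sketch below is purely illustrative (the element and process names are invented, not drawn from any real model), assuming each business process definition lists the information elements it touches:

```python
# Hypothetical sketch of the cross-reference invariant: every domain model
# element must be referenced by at least one business process definition,
# and every element a process references must exist in the domain model.

domain_model = {"Client", "Loan", "Property", "Payment"}

business_processes = {
    "Originate Loan": {"Client", "Loan", "Property"},
    "Collect Payment": {"Loan", "Payment"},
}

def check_cross_references(domain, processes):
    """Return (unreferenced domain elements, undefined process elements)."""
    referenced = set().union(*processes.values()) if processes else set()
    unreferenced = domain - referenced   # elements no process ever uses
    undefined = referenced - domain      # elements missing from the model
    return unreferenced, undefined

unreferenced, undefined = check_cross_references(domain_model, business_processes)
assert not unreferenced and not undefined  # the toy model above is consistent
```

Either non-empty set signals a break in the metadata continuum: an orphaned data element or an ungrounded process reference.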

It is also worth pointing out that only common enterprise-level information processes and elements are captured at the top layer of EA. For example, suppose that, for historical reasons, an enterprise consists of multiple lines of business (LOB), each carrying out its own unique business process with related information definitions, while all the LOBs also participate in the common enterprise process. In this case, only the common enterprise-level process will be modeled at the top (enterprise-level) business process layer. In extreme cases, this enterprise process will consist primarily of the interfaces between the LOB-level business processes.

Each of the enterprise’s LOBs will need its own three-layered model, in which top-level business entities and the corresponding information (or data) elements are unambiguously mapped to the enterprise-level model entities. Needless to say, only the elements that have counterparts at the enterprise level can be mapped. By relating LOB-level definitions to their common enterprise-level equivalents, we eliminate one of the main causes of low enterprise data quality: semantic mismatch (a.k.a. ambiguity) between different business units. And since our data elements are cross-referenced with the business process models, we should have enough contextual information to correlate information elements at the enterprise and LOB levels. In the most difficult cases, UML State Transition Diagrams should be created to capture the temporal and event aspects of the business processes.
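As a minimal sketch of this LOB-to-enterprise correlation (all LOB, element, and enterprise names here are hypothetical), the mapping can be kept as explicit data and validated so that no LOB term points at an undefined enterprise element:

```python
# Illustrative mapping of LOB-level data elements to their enterprise-level
# counterparts. Only LOB elements that have enterprise counterparts appear.

ENTERPRISE_ELEMENTS = {"Customer", "Account", "Address"}

LOB_TO_ENTERPRISE = {
    "retail":   {"Client": "Customer", "DepositAccount": "Account"},
    "mortgage": {"Borrower": "Customer", "PropertyAddress": "Address"},
}

def validate_mapping(lob_maps, enterprise):
    """Return every (lob, source, target) triple whose target is undefined."""
    return {(lob, src, dst)
            for lob, mapping in lob_maps.items()
            for src, dst in mapping.items()
            if dst not in enterprise}

# The toy mapping above is fully grounded in the enterprise model:
assert not validate_mapping(LOB_TO_ENTERPRISE, ENTERPRISE_ELEMENTS)
```

Keeping the correlation explicit like this is what allows "Client" in retail and "Borrower" in mortgage to be recognized as the same enterprise-level Customer.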

Specification layer
The Specification Layer of the three-layered EA model introduces system-related considerations and defines specifications for the enterprise-level information systems. These are the systems that need to be constructed to support the business processes defined at the top layer of the model. By defining system requirements in terms of the business processes, another major cause of low data quality is eliminated: a disconnect between the business and the technology views of the enterprise system.

For example, it is quite common for more than one system to operate on a data element defined at the top business layer. In this case, each system specification will define its own unique data attribute, but all these attributes are in turn mapped to the single element at the top layer. This top-down decomposition approach helps to alleviate a problem known as the “departmental information silo”.
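The many-to-one relationship just described can be sketched as a simple lookup table (system and attribute names below are invented for illustration):

```python
# Sketch of the many-to-one mapping: several system-level attributes
# resolve to one data element defined at the top business layer.

SYSTEM_ATTRIBUTE_MAP = {
    ("CRM", "cust_risk_score"):            "ClientRiskScore",
    ("LoanOrigination", "applicant_risk"): "ClientRiskScore",
    ("Billing", "invoice_total"):          "InvoiceAmount",
}

def systems_for_element(element, attr_map):
    """List every (system, attribute) pair implementing a top-layer element."""
    return sorted(key for key, value in attr_map.items() if value == element)

# Two different systems carry their own attribute for one business element:
assert systems_for_element("ClientRiskScore", SYSTEM_ATTRIBUTE_MAP) == [
    ("CRM", "cust_risk_score"),
    ("LoanOrigination", "applicant_risk"),
]
```

With such a table in place, a change to the top-layer element immediately identifies every departmental attribute that must follow, which is exactly how the silo effect is contained.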

Again, similar to the top layer, and in the spirit of correlating data with process contextual information, Business Use Case Realizations and System Use Cases (or similar artifacts) are introduced at this level to provide sufficient grounding for the data definitions. It is important to note that in addition to the enterprise systems, the system interfaces of LOB-level systems (which support business process connection points between the different LOBs) are also specified in this layer.

Implementation layer

In this layer, the platform-specific implementations are defined and built. Unlike at the Specification layer, multiple platform-specific implementations may be mapped to the same element defined at the Specification layer. This unambiguous, contextually-based mapping from possibly multiple technology-specific implementations to a data element defined at the technology-independent specification level is the foundation of a robust, high-quality data management approach.

It is impossible to overestimate the importance of the two-dimensional traceability in the discussed architectural model. The first dimension – vertical traceability between the model layers – provides a foundation for rich contextual connection between the business process and the system implementation that supports this process. The second dimension – horizontal traceability within the same model layer – provides a foundation for a rich contextual connection between the hierarchical organizational units, as well as the systems implemented at their respective levels.
A robust traceability mechanism is absolutely necessary for high data quality to become a reality. The architectural model provides a foundation for the information traceability and thus data quality, without which it is not possible to address a cluster of issues introduced by the modern business environment in general and especially by the legal and regulatory compliance concerns.
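The vertical dimension of this traceability can be sketched as a chain of links that is followed upward from any implementation artifact to its business-layer origin. All the artifact names below are hypothetical:

```python
# Minimal sketch of vertical traceability across the three layers: each
# implementation artifact links up to a specification element, which in
# turn links up to a business-layer element.

VERTICAL_LINKS = {
    # implementation layer -> specification layer
    "oracle.CUST_ADDR.LINE1": "spec.CustomerAddress",
    # specification layer -> business layer
    "spec.CustomerAddress": "business.ClientCorrespondenceAddress",
}

def trace_to_business(artifact, links):
    """Follow vertical links upward until the business layer is reached."""
    node = artifact
    while node in links:
        node = links[node]
    return node

assert trace_to_business("oracle.CUST_ADDR.LINE1", VERTICAL_LINKS) == \
    "business.ClientCorrespondenceAddress"
```

An artifact whose chain dead-ends before reaching the business layer is exactly the kind of contextless data element the article argues against.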

Friday, September 15, 2006

There Are No Pure Data Problems

Printed in Computerworld; October 17, 2005

While I agree with Ken Karacsony’s assessment that too much ETL is a sign of potential problems (COMPUTERWORLD, SEPTEMBER 05, 2005), I have a very different opinion on what is at the heart of the issue and what kind of solution it deserves. Before I continue with the rest of my response, I would like to emphasize that everything I am asserting is mainly relevant to the on-line transactional processing (OLTP) side of the IT domain. Things look somewhat different on the on-line analytical processing (OLAP) side.
While Ken Karacsony states that what we see is a sign of “poor data management,” I tend to think that, much more often, it is a sign of poor engineering practices in general rather than of poor data management alone.
Data (or the numeric values of certain business-related attributes) tends to be the most tangible and visible aspect that we, as well as our business partners, can observe. Given that we work for the business community, how the business users see and perceive the data is much more important than our own (IT professionals') perception. Quite often, when business users say “we have data problems,” we should interpret their statement as “something is wrong with the system, I do not know what it is exactly, I just know that it gives me a wrong answer, please fix it”.
There is no such thing as a “pure data problem,” because in any business application data always exists within the context of a business process. Whenever data is taken out of that (business process) context, e.g. stored in relational DBMS tables, it loses a considerable portion of its semantic significance. For instance, a typical database for a financial services company would have an Address record defined. While having just one flavor of address may be sufficient for very simple cases, with an increase in business process complexity, data analysts and system developers will find themselves dealing with numerous variations of the Address structure: Current Client Residence Address, Property Address, Client Correspondence Address, Shipping Address, Billing Address, Third Party Address, etc. While all these Address records may have identical physical structure, semantically they are very different. For example, running an automated home appraisal against the wrong address, i.e. the Current Client Residence Address instead of the Property Address, will produce a wrong result that is impossible to catch outside of the business process context. Giving the Shipping department a Billing Address instead of the Shipping Address is probably also a bad idea.
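One way to make these semantic distinctions explicit in code, sketched here with Python dataclasses (the class and function names are invented for illustration, not taken from any real system), is to give each address role its own type even though the physical structure is identical:

```python
# The column layout of every address is the same; only the types differ,
# so the semantic role travels with the data instead of being lost.
from dataclasses import dataclass

@dataclass(frozen=True)
class Address:
    street: str
    city: str
    zip_code: str

class PropertyAddress(Address): ...
class ClientResidenceAddress(Address): ...
class ShippingAddress(Address): ...
class BillingAddress(Address): ...

def appraise_home(address: PropertyAddress) -> float:
    # A real appraisal service would be called here; the explicit check
    # (or a static type checker such as mypy) rejects semantically
    # wrong inputs that share the same physical shape.
    if not isinstance(address, PropertyAddress):
        raise TypeError("appraisal requires a PropertyAddress")
    return 0.0  # placeholder value

home = PropertyAddress("12 Oak St", "Springfield", "01101")
residence = ClientResidenceAddress("9 Elm St", "Shelbyville", "01102")

appraise_home(home)  # fine
try:
    appraise_home(residence)  # wrong semantic type, caught explicitly
except TypeError:
    pass
```

The point is not the particular mechanism but that the business-process role of the data is carried in the type, so the "wrong address" error becomes catchable instead of silently producing a wrong appraisal.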
One way to ensure that data is not taken out of its business context is to build cohesive systems around a logical unit of the business, and to expose these systems to each other only through semantically-rich messages. The advantage that messaging-style integration has over the shared-database integration style is precisely this ability to transmit not only the shared data but also the shared business context semantics. While it is possible in principle to maintain a similar degree of clarity with the shared-database style, in the absence of a very mature development process a shared database, which by its very nature serves many different owners at the same time, will rapidly lose its initial design crispness due to the inability to keep up with numerous modification requests. This in turn leads to data overloading, redundancy, inconsistency and, in the end, poor “data quality” at the application level. Do not get me wrong: I am not against the shared data store integration approach; I am just recommending being realistic about the complexity of the method within the confines of the modern business environment. I would recommend using shared data integration within the scope of a single business unit, while using message-based integration for inter-departmental as well as enterprise-level development. It is significantly easier to provide a highly cohesive development environment within the boundaries of a single business unit due to the natural uniformity of the unit’s business priorities.
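What a "semantically rich" message might look like can be sketched as an envelope that carries the business-process context alongside the payload; the field and process names below are assumptions for illustration only:

```python
# Hedged sketch: the shared data travels wrapped in the context that gives
# it meaning, instead of as a bare database row.
import json

def build_address_message(address: dict, process: str, role: str) -> str:
    """Wrap shared data in its business-process context."""
    envelope = {
        "process": process,   # e.g. "LoanOrigination"
        "role": role,         # e.g. "PropertyAddress"
        "payload": address,
    }
    return json.dumps(envelope)

def read_address(message: str, expected_role: str) -> dict:
    """Consumers state which semantic role they expect; mismatches fail loudly."""
    envelope = json.loads(message)
    if envelope["role"] != expected_role:
        raise ValueError(f"expected {expected_role}, got {envelope['role']}")
    return envelope["payload"]

msg = build_address_message(
    {"street": "12 Oak St", "city": "Springfield"},
    process="LoanOrigination",
    role="PropertyAddress",
)
assert read_address(msg, "PropertyAddress")["city"] == "Springfield"
```

A consumer that asks for a ShippingAddress but receives a BillingAddress gets an immediate error rather than a silently wrong shipment, which is the essence of the messaging-style advantage described above.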
Except for the area of ad hoc reporting, our clients do not deal with databases -- they deal with business applications. I would also argue that too much ad hoc reporting signals problems with the business process design, and/or the application workflow design, and/or the UI design. Too many OLTP applications are poorly designed and thus have inadequate usability characteristics, forcing users to compensate by requesting a high volume of “canned” reports as well as sophisticated ad hoc reporting capabilities. In the world of carefully designed applications, it is the applications, and not the databases, that are the centers of customer interactions. As an example, I recently worked on a project where we were able to either completely eliminate, or migrate into an application’s human workflow process, more than half of the reports initially requested by the business users.
The solution to the “too much ETL” problem in the OLTP world is thus less centralization and looser coupling of the OLTP systems, not more centralization and tighter application coupling through a common data store. One can argue that it is always possible to introduce a layer of indirection (e.g. XML) between the application logic and the common database physical schema, thus providing a level of flexibility and decoupling. While this may work for some companies, in my personal experience this type of design has proved harder to maintain than the more robust asynchronous middleware-based messaging, because it mixes two different design paradigms.
I would be interested in hearing from COMPUTERWORLD readers about any medium- to large-sized company that has been successful in building multi-departmental Operational Data Stores that worked well with multiple inter-departmental systems through a number of consecutive releases. I predict that it will be hard to find a significant number of cases to discuss at all, and especially difficult to find examples from companies with a dynamic business process that requires constant introduction of new products and services. The main reason for the lack of success, from my point of view, is not technical in nature. It is relatively easy to build tightly-coupled applications integrated via a common data store, especially if it is done under the umbrella of a single program with a mature system development culture. The problem is in the “Realpolitik” of the modern business environment: we live in, and work for, businesses in the age of ever-accelerating global competition. It is almost impossible to coordinate the business plans of various departments, and the subsequent deployment schedules of multiple IT projects each working on its own group of business priorities, in order to keep systems built around one shared database current. When one of the interdependent development teams misses a deliverable deadline, the political pressure to separate becomes hard to resist. And if a commercial off-the-shelf (COTS) software package is acquired, or a corporate merger or acquisition takes place, the whole idea of all applications working with one common data format is immediately thrown out the window.
So we, in IT, need to learn how to build systems that will not require rigid release synchronization from the multiple OLTP systems belonging to disparate business units. Decoupling can provide us with the required flexibility to modify our systems on a coordinated, but not prohibitively-rigid schedule.
Finally, it is important to emphasize that while loose coupling gives us an opportunity to modify different systems on different schedules without corrupting the coupled systems, loosely-coupled does not mean “loosely-managed.” Loose coupling provides a degree of flexibility in implementation and deployment. This additional flexibility gives our business partners the ability to move rapidly when they need to, and at the same time provides IT with the ability to contain and manage the challenges caused by the ever-increasing rate of business change. We have to acknowledge that developing loosely coupled applications that work well together across an enterprise, with well-delineated responsibilities, is a very challenging engineering problem. If not managed well, this type of system development may turn the advantage of loose coupling into the disadvantage of “delayed-action” semantic problems. A mature IT development process is absolutely necessary to overcome this engineering problem and deliver this type of information infrastructure to our business partners. From this perspective, it is worthwhile for any organization striving to build a well-integrated enterprise-level IT infrastructure to look into the SEI Capability Maturity Model. Specifically, Maturity Level 3, called the Defined Level, addresses issues of development process consistency across the whole enterprise. From my point of view, it is a prerequisite to the physical integration of the enterprise systems into a consistent whole. The CMM manual describes this process level as one where “the standard process for developing and maintaining software across the organization is documented, including both software engineering and management processes, and these processes are integrated into a coherent whole.” Unless an organization is prepared to operate at this level, it should not have high hopes for success in the integration area.
So to summarize: successful data management begins by taking focus away from data. Instead, the focus should be on the general level of system engineering and its main aspects, i.e., Requirements Analysis and Management, Business Domain Modeling, Configuration Management, QA Process, etc.
I would argue that any medium to large company that has not reached CMM Level 3 and is trying to get “data under control” has little chance of succeeding in this undertaking, regardless of which integration style it uses.

Saturday, November 05, 2005

X-Engineering, Zero Latency Enterprise will put the spotlight on data quality

Article published in DMReview

According to James Champy, the co-author of The New York Times bestseller “Reengineering the Corporation”, the market players that can respond to critical market events faster than their competitors will end up as winners in the emerging new economy. It is safe to assume that most of these market players have already reengineered their business processes within the corporation's boundaries to achieve better efficiency. In order to win the next phase of the never-ending market race, they will also need to integrate their business processes with those of their suppliers and business partners. Additionally, the ability to quickly adjust processes to better respond to one’s customers will also become a decisive factor in the new economy.

In this type of economic environment, the latency between the initial market event (any kind of significant disruption to the market status quo) and a response from the integrated process chain cannot take months or even weeks. The winners will have days and sometimes just hours to react to the changes in the supply chain or a new customer trend.

Taking this into account, successful corporations that are aspiring to become winners in the new global race should be thinking about zero latency processing. For instance, in today’s marketplace, a new financial services product usually guarantees its inventor a head start of a few months, typically resulting in substantial financial gains. However, if the other market players can respond within days instead of months, they can practically eliminate the competitor’s advantage of being the first to the market.

X-Engineering will lead to Zero Latency Enterprise
Given that modern business processes rely heavily on information systems, and as market forces keep pushing companies towards faster updates to their business processes, the information systems’ implementation/deployment cycle becomes increasingly important. One of the more popular approaches -- Zero Latency Enterprise -- encourages the creation of a feedback loop from the analytical (OLAP) side of tactical, and possibly even strategic, decisioning into the operational (OLTP) systems in order to accelerate the event-response sequence.

As the update cycles accelerate, data quality will become even more important
Traditionally, the quality of data stored in the Enterprise Data Warehouse (EDW) significantly influences the quality of the decisioning process. In turn, the quality of data housed in the Enterprise Data Warehouse depends on the quality of data produced by the OLTP systems. With margins contracting more and more every year, it is conceivable that the difference between success and failure of some significant undertaking may depend on a relatively obscure operational attribute captured by some operational system and then consumed by the EDW. Unfortunately, the more complex the business process is, the more difficult it is for the OLTP systems to produce high quality data. Further, in the Near Real Time (NRT) Enterprise environment, the OLTP systems should be able to accept changes in the operational parameters that the OLAP / decision-support systems (DSS) produce. In order to support this fast update cycle, a rules-based or similar fast-deployment-cycle technology should be used. The existence of a Near Real Time feedback loop from the OLAP side back to the OLTP side of the enterprise supports very rapid changes to the business process, but at the same time exacerbates any inconsistencies and errors that arise when information is transformed and loaded from the OLTP systems into the EDW/OLAP side. For example, an erroneous calculation of loan processing costs in the EDW (based on an incorrectly captured operations time) may lead to an automated decision to open this financial product (loan type) to more clients. This decision would be automatically consumed by the appropriate operational systems with an Internet-enabled front end, and may significantly affect the business’ financial characteristics. If it turns out that the calculation was wrong, the net effect may be a substantial loss instead of a hefty profit.
The elimination of time-consuming manual steps in the process, while providing a corporation with the ability to respond very quickly to a market event, will at the same time put even more emphasis on decisioning and thus data quality.

Traditional Approach to Data Quality needs improvement
In a classical EDW environment Extract Transform Load (ETL) tools assume responsibility for data extraction from the source systems, as well as transformation, cleansing and loading the data into the EDW/OLAP systems.
At the same time, the OLTP and OLAP developers have to address a rather different set of issues. It is not surprising that there is oftentimes a mental “impedance mismatch” between the OLTP and the OLAP staff that results in disagreements about:
• Push versus Pull (Extract in ETL)
• Data transformation responsibilities and techniques
• Data cleansing approach

Traditionally, IT departments rely on the teams responsible for the EDW/OLAP processing to address the ETL issues, following the old and unfortunately ineffective creed: “you need it -- you do it”.
This approach does not work well, as the cost of loading the OLAP data stores with reliable, high-quality data, and especially of keeping the OLAP data stores semantically synchronized with changes in the source OLTP systems, is very high. In the most common scenarios, changes introduced on the OLTP side still require weeks, and sometimes months, to be correctly reflected in the EDW/OLAP systems.

Making it happen
Two factors have come together to change the traditional approach. As I have already pointed out above, there is a demand from business leadership to shorten implementation cycles, as well as the trend of integrating many inter-corporation business processes into one end-to-end, highly-efficient process. On the technology side, the emergence of and advancements in the Service Oriented Architecture (SOA) have created momentum in IT departments towards better understanding, and thus modeling of business processes.

With the SOA advancement, the OLTP side is becoming much better structured: the issues at the syntax and communication protocol level are addressed, and boundaries are now explicit. The asynchronous nature of communication requires understanding, capturing and transmitting time-state information. The advent of SOA is creating a foundation and an industry impetus to start viewing data issues in a new light, connecting them more closely with business process management (BPM). While the SOA way of thinking helps in itself, it is not sufficient to address the issue of semantic differences between the source and target systems within the scope of the SOA framework. These differences in semantics are impossible to address without capturing enough contextual information to reason about them:
• Timing
• Relationship to the rest of the domain
• Business process-level coordination.

Meet Data in Context.
On my most recent project (for a medium-sized financial services company), the project team developed an approach that addresses significant issues that until now were preventing this company, as well as other companies, from realizing the benefits of Near Real Time analytical decision-support technology. The cornerstone of this approach is the creation and rigorous maintenance of a rich contextual domain model on the OLTP side. The existence of this ontological model eliminates two main predicaments standing in the way of the zero latency enterprise.

First, the rich contextual OLTP-side model enables and facilitates better understanding of information within the context of the business process, which in turn enables business process integration both within, as well as across the corporation boundaries.

Second, the OLTP side of the enterprise can now output information according to the specs produced by the EDW/OLAP/ DSS side with improved quality and efficiency. While the ETL processes still exist, the cycle of producing the information required by the strategic and tactical decision makers is now significantly shorter.
Decentralized rich OLTP-side Domain Model is the key
The OLAP-side approach that capitalizes on a single enterprise metadata repository has not yet been successfully applied on the OLTP side of an enterprise. This is not surprising, given that business processes, and thus the various OLTP systems themselves, are much more diverse in nature than the more homogeneous OLAP-side systems. For instance, the operational processes of an acquisition department and of a trading desk differ in scope, use different terminology, and have different key performance indicators.

Data Architecture teams have realized that the diverse business process context may pose a problem on the way to creating a single OLTP-side metadata repository, and have suggested an alternative approach. This approach strips the data of most of its business process context in order to make it easier to correlate data from different processes and OLTP systems.

One example of this common technique is the data dictionary approach. Unfortunately, this approach does not work well in the long run: data divorced from its context rapidly becomes more or less useless as business process complexity increases. For instance, a typical data dictionary for a financial services company would have an Address structure defined. While it may be sufficient for very simple cases, with an increase in business process complexity, data analysts and system developers find themselves dealing with numerous variations of the Address structure: Current Client Residence Address, Property Address, Client Correspondence Address, Third Party Address, etc.

Furthermore, the Address case described above is relatively simple compared to three- or four-layered hierarchical data structures such as Credit Report and Credit Score. Credit Score, for example, may be aggregated at different levels: a Loan, a Borrower, a Borrower Group, etc., with the Borrower-level score possibly provided by a number of credit vendors. Credit vendors, in turn, may use different aggregations of Credit Scores from the three Credit Repositories: Experian, TransUnion and EquiFax. Considering that these Credit Repositories may themselves use different scoring models, it quickly becomes apparent how fast information complexity can grow.
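The layered aggregation just described can be sketched as a nested structure; every identifier and score value below is invented purely to show the shape of the hierarchy, not real credit data:

```python
# Sketch of the aggregation hierarchy: the same notion of "Credit Score"
# exists at the loan, borrower, vendor, and repository levels, and is only
# meaningful together with the level and provider that qualify it.

credit_report = {
    "loan_id": "L-1001",
    "loan_level_score": 712,
    "borrowers": [
        {
            "borrower_id": "B-1",
            "vendor_scores": [
                {"vendor": "VendorA",
                 "repository_scores": {
                     "Experian": 705, "TransUnion": 710, "EquiFax": 715}},
            ],
        },
    ],
}

def repository_scores(report, borrower_id, vendor):
    """Drill down to the repository-level scores for one borrower/vendor."""
    for borrower in report["borrowers"]:
        if borrower["borrower_id"] == borrower_id:
            for entry in borrower["vendor_scores"]:
                if entry["vendor"] == vendor:
                    return entry["repository_scores"]
    return None

assert repository_scores(credit_report, "B-1", "VendorA")["Experian"] == 705
```

A flat "CreditScore" column in a data dictionary collapses all four of these levels into one number, which is precisely how the contextless approach loses information.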

The ability to impeccably correlate data from the different OLTP systems across the contexts of their own business processes is absolutely essential to solving the data quality problem for conventional, and even more so for Zero Latency Enterprise, models. Unfortunately, quite often, due to the lack of a well-defined system development process, as well as a shortage of analysts with appropriate modeling skills, this correlation analysis (sometimes called “mapping”) is left to the data analysts and developers. The skill set possessed by these groups of professionals, specifically the physical integration of RDBMS-based OLTP systems, does not lend itself well to the domain/business process modeling problem. Add to this the rather common absence of any metadata repositories for the business process models that would be available to an average Java or .NET developer, and the result is a status quo of tightly-coupled physical database systems. This approach makes the entire OLTP side brittle, commonly producing unreliable “polluted” data consumed by the OLAP side. This is typically followed by a never-ending cycle of blame for the low data quality made apparent in the extraction, transformation and cleansing steps.

In order to successfully integrate OLTP systems, two main issues need to be addressed.
First, every application, or group of applications, that will be considered an independent processing entity with well-defined boundaries should have a rich meta-information repository. This repository unambiguously defines all the relevant data within the scope of the business processes supported by the system. For instance, I was recently part of an effort where the Domain model had three main parts: a Business class model, Business Use Case Realizations, and a System Use Case model. No data element would be added to the Domain model unless it was initially called out in the business and system use cases.
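That gating rule can be sketched as a repository that simply refuses an element with no use-case grounding; the class and use-case names below are hypothetical, not from the project mentioned above:

```python
# Sketch of the rule "no data element enters the Domain model unless a
# business or system use case calls it out".

class DomainModelRepository:
    def __init__(self):
        self.elements = {}  # element name -> set of referencing use cases

    def add_element(self, name, use_cases):
        """Register a data element only if at least one use case grounds it."""
        if not use_cases:
            raise ValueError(
                f"'{name}' is not called out in any business or system use case")
        self.elements[name] = set(use_cases)

repo = DomainModelRepository()
repo.add_element("PropertyAddress", {"UC-12 Order Appraisal"})

try:
    repo.add_element("MysteryField", set())  # rejected: no use-case grounding
except ValueError:
    pass
```

Enforcing the rule at the repository boundary, rather than by convention, is what keeps the metadata continuum intact as the model grows.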

Second, a well-defined process should be created and rigorously maintained for information correlation between the metadata repositories of the different areas. Each department should be responsible for the creation of its own Domain model, but the correlation process and the artifacts that capture the results of this process (in our case, we call them Overarching System Use Cases) are the joint responsibility of the departments that are integrating their business processes.