See: Gartner, "Magic Quadrant for Data Quality Tools," Ted Friedman, Andreas Bitterer. October 7, 2013.
Colin White of BI Research and Harriet Fryman of IBM help separate the reality from the hype by taking a look at use cases and the benefits customers are gaining from big data.
Data quality management includes the continuous analysis, observation, and improvement of all information in the organization. Best practice includes:
1. Data quality assessment, as a way for the practitioner to understand the scope of how poor data quality affects the ways that the business processes are intended to run, and to develop a business case for data quality management;
2. Data quality measurement, in which the data quality analysts synthesize the results of the assessment and concentrate on the data elements that are deemed critical based on the selected business users’ needs. This leads to the definition of performance metrics that feed management reporting via data quality scorecards;
3. Integrating data quality into the application infrastructure, by way of integrating data requirements analysis across the organization and by engineering data quality into the system development life cycle;
4. Operational data quality improvement, where data stewardship procedures are used to manage identified data quality rules and conformance to acceptability thresholds;
5. Data quality incident management, which allows the data quality analysts to review the degree to which the data does or does not meet the levels of acceptability, report, log, and track issues, and document the processes for remediation and improvement.
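The measurement and threshold steps above can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: the QualityRule class, the two rules, the sample customer records, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QualityRule:
    """A hypothetical data quality rule with an acceptability threshold."""
    name: str
    check: Callable[[dict], bool]  # returns True when a record conforms
    threshold: float               # minimum acceptable conformance ratio (0..1)


def score_rules(records: Iterable[dict], rules: List[QualityRule]) -> List[dict]:
    """Return one scorecard row per rule: conformance ratio and pass/fail."""
    records = list(records)
    scorecard = []
    for rule in rules:
        conforming = sum(1 for record in records if rule.check(record))
        ratio = conforming / len(records) if records else 1.0
        scorecard.append({
            "rule": rule.name,
            "conformance": ratio,
            "threshold": rule.threshold,
            "acceptable": ratio >= rule.threshold,
        })
    return scorecard


# Made-up sample data: one missing email, one malformed country code.
customers = [
    {"id": 1, "email": "ann@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "US"},
    {"id": 3, "email": "bob@example.com", "country": "??"},
    {"id": 4, "email": "cy@example.com", "country": "DE"},
]

rules = [
    QualityRule("email_present", lambda r: bool(r["email"]), threshold=0.90),
    QualityRule("country_is_two_letters",
                lambda r: r["country"].isalpha() and len(r["country"]) == 2,
                threshold=0.70),
]

scorecard = score_rules(customers, rules)
for row in scorecard:
    print(row)
```

Rows whose "acceptable" flag is False are the ones that would be logged and tracked as data quality incidents.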
A data model is a plan for building a database. To use a common analogy, the data model is equivalent to an architect's building plans.
Data modeling is a process used to define and analyze data requirements needed to support the business processes within the scope of corresponding information systems in organizations. To be effective, it must be simple enough to communicate to the end user the data structure required by the database yet detailed enough for the database design to use to create the physical structure.
A data model is a conceptual representation of the data structures that are required by a database. The data structures include the data objects, the associations between data objects, and the rules which govern operations on the objects. As the name implies, the data model focuses on what data is required and how it should be organized rather than what operations will be performed on the data.
Data modeling is the formalization and documentation of existing processes and events that occur during application software design and development. Data modeling techniques and tools capture and translate complex system designs into easily understood representations of the data flows and processes, creating a blueprint for construction and/or re-engineering.
A data model can be thought of as a diagram or flowchart that illustrates the relationships between data. Although capturing all the possible relationships in a data model can be very time-intensive, it's an important step and shouldn't be rushed. Well-documented models allow stakeholders to identify errors and make changes before any programming code has been written.
Data modeling is also used as a technique for detailing business requirements for specific databases. It is sometimes called database modeling because a data model is eventually implemented in a database.
There are three different types of data models produced while progressing from requirements to the actual database to be used for the information system:
1) Conceptual data models. These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams high-level conceptual models are often created as part of your initial requirements envisioning efforts as they are used to explore the high-level static business structures and concepts. On traditional teams conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs.
2) Logical data models (LDMs). LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects although often are on traditional projects (where they rarely seem to add much value in practice).
3) Physical data models (PDMs). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects and as a result the focus of this article is on physical modeling.
Although LDMs and PDMs sound very similar, and in fact they are, the level of detail that they model can be significantly different. This is because the goals of the two diagrams are different: you can use an LDM to explore domain concepts with your stakeholders and the PDM to define your database design.
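To make the physical level concrete, here is a minimal sketch of what a PDM might translate into, using Python's built-in sqlite3 module. The customer and customer_order tables and their columns are hypothetical examples, not taken from any model discussed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Physical model: tables, columns with types, keys, and one relationship.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ann')")
conn.execute("INSERT INTO customer_order VALUES (100, 1, '2013-10-07')")

# An order pointing at a customer that does not exist violates the relationship.
try:
    conn.execute("INSERT INTO customer_order VALUES (101, 99, '2013-10-08')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
print(orphan_rejected, order_count)
```

An LDM for the same domain would show only the Customer and Order entity types, their attributes, and the "places" relationship; the data types, key columns, and constraint enforcement appear only at the physical level.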
Data Modeling in Context of Business Process Integration
Data Modeling in the Context of Database Design
Database design is defined as: "design the logical and physical structure of one or more databases to accommodate the information needs of the users in an organization for a defined set of applications". The design process roughly follows five steps:
1. planning and analysis
2. conceptual design
3. logical design
4. physical design
5. implementation
The data model is one part of the conceptual design process. The other is typically the functional model. The data model focuses on what data should be stored in the database, while the functional model deals with how the data is processed. To put this in the context of the relational database, the data model is used to design the relational tables. The functional model is used to design the queries that will access and perform operations on those tables.
Data Model Components
The data model gets its inputs from the planning and analysis stage. Here the modeler, along with analysts, collects information about the requirements of the database by reviewing existing documentation and interviewing end-users.
The data model has two outputs. The first is an entity-relationship diagram, which represents the data structures in pictorial form. Because the diagram is easily learned, it is a valuable tool for communicating the model to the end-user.
The second component is a data document. This is a document that describes in detail the data objects, relationships, and rules required by the database. It serves as a data dictionary, providing the detail required by the database developer to construct the physical database.
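One way to picture such a data document is as a structured listing of tables, columns, and relationships. The sketch below derives one from an existing SQLite schema using the standard PRAGMA table_info and PRAGMA foreign_key_list statements. The product and order_line tables, and the build_data_dictionary helper, are hypothetical; in practice the document is written before the database exists, so this only illustrates the kind of detail it holds.

```python
import sqlite3


def build_data_dictionary(conn: sqlite3.Connection) -> dict:
    """Collect columns and foreign keys for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    dictionary = {}
    for table in tables:
        # table_info rows: (cid, name, type, notnull, default_value, pk)
        columns = [
            {"name": name, "type": col_type,
             "not_null": bool(notnull), "primary_key": bool(pk)}
            for _cid, name, col_type, notnull, _default, pk
            in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        relationships = [
            {"column": row[3], "references": f"{row[2]}.{row[4]}"}
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        ]
        dictionary[table] = {"columns": columns, "relationships": relationships}
    return dictionary


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE order_line (
    line_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL
);
""")

dictionary = build_data_dictionary(conn)
for table, entry in dictionary.items():
    print(table, entry)
```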
Managing data is challenging. Many efforts result in siloed information and fragmented views that damage competitiveness and increase costs. In the modern era of "big data" the best practice may be to create one central data repository with a uniform data governance architecture, yet allow each business unit to own its data.
The goal is to provide simple ways for both data scientists and non-technical users to explore, visualize and interpret data to reveal patterns, anomalies, key variables and potential relationships. Data Governance and Master Data Management (MDM) design is key to achieving this goal.
Master data management (MDM) comprises a set of processes and tools that define and manage master data. MDM lies at the core of many organizations’ operations, and the quality of that data shapes decision making. MDM helps organizations leverage trusted business information, which in turn helps increase profitability and reduce risk.
Master data is reference data about an organization’s core business entities. These entities include people (customers, employees, suppliers), things (products, assets, ledgers), and places (countries, cities, locations). The applications and technologies used to create and maintain master data are part of a master data management (MDM) system.
Recent developments in business intelligence (BI) aid in regulatory compliance and provide more usable and quality data for smarter decision making and spending. Virtual master data management (Virtual MDM) utilizes data virtualization and a persistent metadata server to implement a multi-level automated MDM hierarchy.
The benefits of MDM include:
● Improving business agility
● Providing a single trusted view of people, processes and applications
● Allowing strategic decision making
● Enhancing customer relationships
● Reducing operational costs
● Increasing compliance with regulatory requirements
MDM helps organizations handle four key issues:
● Data redundancy
● Data inconsistency
● Business inefficiency
● Supporting business change
MDM provides processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information. MDM seeks to ensure that an organization does not use multiple (potentially inconsistent) versions of the same master data in different parts of its operations and solves issues with the quality of data, consistent classification and identification of data, and data-reconciliation issues.
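The matching and consolidating steps can be illustrated with a small Python sketch. It makes simplifying assumptions that are not part of any MDM product: records match when their normalized email addresses are equal, and for each field the most recently updated non-empty value wins. The source names, field names, and sample records are hypothetical.

```python
from collections import defaultdict


def match_key(record: dict) -> str:
    """Assumed matching rule: the same normalized email means the same customer."""
    return record["email"].strip().lower()


def consolidate(records: list) -> dict:
    """Group records by match key and merge each group into one master record."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)

    master = {}
    for key, group in groups.items():
        merged = {}
        # ISO-format dates sort correctly as strings, oldest first.
        for record in sorted(group, key=lambda r: r["updated"]):
            for field, value in record.items():
                if field != "source" and value not in ("", None):
                    merged[field] = value  # newer non-empty values overwrite older ones
        merged["email"] = key                                 # store the normalized form
        merged["sources"] = sorted(r["source"] for r in group)  # keep lineage
        master[key] = merged
    return master


records = [
    {"source": "web_store", "email": "Ann@Example.com ", "name": "Ann Lee",
     "phone": "", "updated": "2013-01-05"},
    {"source": "billing", "email": "ann@example.com", "name": "Ann M. Lee",
     "phone": "555-0100", "updated": "2013-03-01"},
    {"source": "shipping", "email": "bob@example.com", "name": "Bob Roy",
     "phone": "555-0199", "updated": "2013-02-11"},
]

master = consolidate(records)
for key, record in master.items():
    print(key, record)
```

Real matching usually relies on fuzzier rules (names, addresses, phonetic keys) and configurable survivorship policies; the exact-email rule here is only the simplest case.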
MDM solutions include source identification, data collection, data transformation, normalization, rule administration, error detection and correction, data consolidation, data storage, data distribution, and data governance.
MDM tools include data networks, file systems, a data warehouse, data marts, an operational data store, data mining, data analysis, data virtualization, data federation and data visualization.
MDM requires an organization to implement policies and procedures for controlling how master data is created and
One of the main objectives of an MDM system is to publish an integrated, accurate, and consistent set of master data for use by other applications and users. This integrated set of master data is called the master data system of record (SOR). The SOR is the gold copy for any given piece of master data, and is the single place in an organization that the master data is guaranteed to be accurate and up to date.
Although an MDM system publishes the master data SOR for use by the rest of the IT environment, it is not necessarily the system where master data is created and maintained. The system responsible for maintaining any given piece of master data is called the system of entry (SOE). In most organizations today, master data is maintained by
Customer data is an example. A company may, for example, have customer master data that is maintained by multiple Web store fronts, by the retail organization, and by the shipping and billing systems. Creating a single SOR for customer data in such an environment is a complex task.
The long-term goal of an enterprise MDM environment is to solve this problem by creating an MDM system that is not only the SOR for any given type of master data, but also the SOE.
MDM can then be defined as a set of policies, procedures, applications and technologies for harmonizing and managing the system of record and systems of entry for the data and metadata associated with the key business entities of an organization.
Greenplum is a big data analytics company driving the future of Big Data analytics with breakthrough products that harness the skills of data science teams to help global organizations realize the full promise of business agility and become data-driven, predictive enterprises.
Greenplum Unified Analytics Platform (UAP) combines the co-processing of structured and unstructured data with a productivity engine that enables collaboration among your data science team. Greenplum UAP includes Greenplum Database, Greenplum HD, and Greenplum Chorus.
Oracle is uniquely qualified to combine everything needed to meet the big data challenge, including software and hardware, into one engineered system. The Oracle Big Data Appliance is an engineered system that combines optimized hardware with the most comprehensive software stack featuring specialized solutions developed by Oracle to deliver a complete, easy-to-deploy solution for acquiring, organizing and loading big data into Oracle Database 11g. It is designed to deliver extreme analytics on all data types, with enterprise-class performance, availability, supportability and security. With Big Data Connectors, the solution is tightly integrated with Oracle Exadata and Oracle Database, so you can analyze all your data together with extreme performance.
Once data has been loaded from Oracle Big Data Appliance into Oracle Database or Oracle Exadata, end users can use one of the following easy-to-use tools for in-database, advanced analytics:
Oracle R Enterprise – Oracle’s version of the widely used Project R statistical environment enables statisticians to use R on very large data sets without any modifications to the end user experience. Examples of R usage include predicting airline delays at particular airports and the submission of clinical trial analysis and results.
In-Database Data Mining – the ability to create complex models and deploy these on very large data volumes to drive predictive analytics. End-users can leverage the results of these predictive models in their BI tools without the need to know how to build the models. For example, regression models can be used to predict customer age based on purchasing behavior and demographic data.
In-Database Text Mining – the ability to mine text from micro blogs, CRM system comment fields and review sites combining Oracle Text and Oracle Data Mining. An example of text mining is sentiment analysis based on comments. Sentiment analysis tries to show how customers feel about certain companies, products or activities.
In-Database Semantic Analysis – the ability to create graphs and connections between various data points and data sets. Semantic analysis creates, for example, networks of relationships determining the value of a customer’s circle of friends. When looking at customer churn, customer value is based on the value of the customer’s network, rather than on just the value of the customer.
In-Database Spatial – the ability to add a spatial dimension to data and show data plotted on a map. This ability enables end users to understand geospatial relationships and trends much more efficiently. For example, spatial data can visualize a network of people and their geographical proximity. Customers who are in close proximity can readily influence each other’s purchasing behavior, an opportunity which can be easily missed if spatial visualization is left out.
In-Database MapReduce – the ability to write procedural logic and seamlessly leverage Oracle Database parallel execution. In-database MapReduce allows data scientists to create high-performance routines with complex logic.
In-database MapReduce can be exposed via SQL. Examples of leveraging in-database MapReduce are sessionization of weblogs or organization of Call Detail Records (CDRs).
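To show what sessionization of weblogs involves, here is a minimal sketch in plain Python. It is not Oracle's in-database MapReduce and does not use any Oracle API; it only mirrors the same shape, keying events by user (the "map" side) and walking each user's events in time order (the "reduce" side). The 30-minute inactivity timeout and the sample events are assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout


def sessionize(events):
    """Assign a per-user session number to (user, timestamp) events.

    A new session starts at a user's first event and whenever the gap
    since that user's previous event exceeds SESSION_GAP.
    """
    ordered = sorted(events, key=lambda e: (e[0], e[1]))  # group key, then time
    result = []
    for user, user_events in groupby(ordered, key=lambda e: e[0]):
        session, previous = 0, None
        for _, timestamp in user_events:
            if previous is None or timestamp - previous > SESSION_GAP:
                session += 1
            result.append((user, timestamp, session))
            previous = timestamp
    return result


# Made-up weblog events: (user id, request time).
events = [
    ("u1", datetime(2013, 10, 7, 9, 0)),
    ("u2", datetime(2013, 10, 7, 9, 5)),
    ("u1", datetime(2013, 10, 7, 9, 10)),
    ("u1", datetime(2013, 10, 7, 10, 0)),  # 50 minutes after the previous u1 event
]

sessions = sessionize(events)
for user, timestamp, session in sessions:
    print(user, timestamp.isoformat(), session)
```

Because each user's events are processed independently of every other user's, the per-user step can run in parallel, which is the property a MapReduce-style execution exploits.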