IBM SPSS data mining products and services ensure timely, reliable results by supporting the Cross-Industry Standard Process for Data Mining (CRISP-DM). Created by industry experts, CRISP-DM provides step-by-step guidelines, tasks, and objectives for every stage of the data mining process. CRISP-DM is the industry-standard process for data mining projects.
CRISP-DM demands that data mining be seen as an entire process, from communication of the business problem through data collection and management, data pre-processing, model building, and model evaluation to model deployment. Mastering the methodology therefore requires a combination of abilities ranging from data affinity through quantitative reasoning and sound business acumen to well-developed communication skills. At AsiaAnalytics, we have succeeded in uniting all of these factors in a winning package: world-class analytics technology (IBM SPSS Statistics and IBM SPSS Modeler) operated and supported by China's leading analysts, modellers, and data scientists.
The methodology consists of six steps, each of them equally important in the generation of meaningful analytical insights and the production of actionable results.
Business Understanding: In this initial phase, the aim is to identify and better grasp the business objectives of the data mining exercise. To do so, data scientists most often have to work together with business experts. A project plan will be produced in order to ensure the structure of the data mining exercise.
Data Understanding: In this stage, we collect the data, describe the data and verify its quality.
Data Preparation: This step consists of selecting the data within the dataset that we will use for data mining. To do so, we remove the unnecessary data and make sure that the selected data is free from outliers.
Modelling: Here, we select the most appropriate model to fulfil the business objectives. We then build the model and assess its structure.
Evaluation: In this phase, we assess the quality of the results produced by the model with respect to the business objectives. If necessary, we review some of the actions taken in the preceding steps.
Deployment: Depending on the requirements, deployment can be as simple as generating a report, but it can also consist of a complex integration into an existing system or the implementation of repeatable data mining processes.
All AsiaAnalytics consultants are systematically trained in this methodology and put it into practice whenever appropriate. As analytics pioneers, our team has more experience in employing the CRISP-DM methodology than anyone else in the region. This has allowed us to develop unparalleled excellence in its application, enabling us to reach maximum analytical precision and efficiency when offering solutions to our clients.
In the following, please find a more detailed description of the six steps in the CRISP-DM methodology:
Know “who, what, when, where, why, and how” from a business perspective.
Develop a thorough understanding of the project parameters: the current business situation, the primary business objective of the project, the criteria for success, and who will determine the success of the project.
Gather all of the data you will need for your project. If your data will come from more than one source, make sure your data mining tool can integrate the data. The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.
The project plan describes the intended plan for achieving the data mining goals, including outlining specific steps and a proposed timeline, an assessment of potential risks, and an initial assessment of the tools and techniques needed to support the project. Generally accepted industry timeline standards are: 50 to 70 percent of the time and effort in a data mining project involves the Data Preparation Phase; 20 to 30 percent involves the Data Understanding Phase; only 10 to 20 percent is spent in each of the Modeling, Evaluation, and Business Understanding Phases; and 5 to 10 percent is spent in the Deployment Planning Phase.
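To make these ranges concrete, the short Python sketch below converts them into working days for a given project budget. The percentage ranges are the ones quoted above; the 100-day budget and the names `PHASE_SHARES` and `effort_ranges` are assumptions made purely for illustration. Note that the lower bounds alone add up to 105 percent, so the figures are best read as rough guides rather than an exact partition of the project.

```python
# Phase shares quoted in the text, as (low %, high %) of total project effort.
PHASE_SHARES = {
    "Data Preparation": (50, 70),
    "Data Understanding": (20, 30),
    "Modeling": (10, 20),
    "Evaluation": (10, 20),
    "Business Understanding": (10, 20),
    "Deployment Planning": (5, 10),
}


def effort_ranges(total_days):
    """Return {phase: (low_days, high_days)} for a given project budget."""
    return {
        phase: (total_days * low / 100, total_days * high / 100)
        for phase, (low, high) in PHASE_SHARES.items()
    }


if __name__ == "__main__":
    # The 100-day budget is a hypothetical figure, not an industry value.
    for phase, (low, high) in effort_ranges(100).items():
        print(f"{phase}: {low:.0f} to {high:.0f} days")
```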
Decide what data to use for analysis and list the reasons for your decisions. This involves:
The data preparation phase covers all activities to construct the final dataset (the data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.
In this step, the data analyst outlines the resources, from personnel to software, that are available to accomplish the data mining project. Particularly important is discovering what data is available to meet the primary business goal. At this point, the data analyst also should list the assumptions made in the project, such as, “To address the business question, a minimum number of customers over age 50 is necessary.” The data analyst also should list the project risks, list potential solutions to those risks, create a glossary of business and data mining terms, and construct a cost-benefit analysis for the project.
The data mining goal states project objectives in technical terms, such as, “Predict how many widgets a customer will buy given their purchases in the past three years, demographic information (age, salary, city, etc.), and the item price.” Success also should be defined in these terms. For instance, success could be defined as achieving a certain level of predictive accuracy.
Here a data analyst acquires the necessary data, including loading and integrating this data if necessary. The analyst should make sure to report problems encountered and his or her solutions to aid with future replications of the project. For instance, data may have to be collected from several different sources, and some of these sources may have a long lag time. It is helpful to know this in advance to avoid potential delays.
During this step, the data analyst examines the “gross” or “surface” properties of the acquired data and reports on the results, examining issues such as the format of the data, the quantity of the data, the number of records and fields in each table, the identities of the fields, and any other surface features of the data. The key question to ask is: Does the data acquired satisfy the relevant requirements? For instance, if age is an important field and the data does not reflect the entire age range, it may be wise to collect a different set of data. This step also provides a basic understanding of the data on which subsequent steps will build.
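As a minimal sketch of such a surface description, the Python example below counts records, lists the fields, counts non-empty values per field, and reports the range of a numeric field such as age. The sample table and the function names `describe_table` and `numeric_range` are hypothetical and stand in for whatever data and tooling a real project uses.

```python
import csv
import io

# Hypothetical sample standing in for an acquired table.
SAMPLE_CSV = """customer_id,age,income
1,34,52000
2,51,
3,29,41000
"""


def describe_table(csv_text):
    """Report surface properties: record count, field names, filled values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = list(reader.fieldnames or [])
    # Count the non-empty values in each field.
    filled = {f: sum(1 for r in rows if r[f]) for f in fields}
    return {"records": len(rows), "fields": fields, "filled": filled}


def numeric_range(csv_text, field):
    """Return (min, max) of the non-empty values of a numeric field."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(r[field]) for r in rows if r[field]]
    return (min(values), max(values))


if __name__ == "__main__":
    print(describe_table(SAMPLE_CSV))
    print(numeric_range(SAMPLE_CSV, "age"))
```

Comparing the reported age range with the range the business question requires is one simple way to answer whether the acquired data satisfies the relevant requirements.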
This task tackles the data mining questions, which can be addressed using querying, visualisation, and reporting. For instance, a data analyst may query the data to discover the types of products that purchasers in a particular income group usually buy. Or the analyst may run a visualisation analysis to uncover potential fraud patterns. The data analyst should then create a data exploration report that outlines first findings, or an initial hypothesis, and the potential impact on the remainder of the project.
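The income-group query mentioned above can be sketched in a few lines of Python. The purchase records and the function name `top_products` are invented for illustration; in practice the same question would typically be put to a database or to the data mining tool itself.

```python
from collections import Counter

# Hypothetical purchase records: (income_group, product).
PURCHASES = [
    ("high", "laptop"), ("high", "laptop"), ("high", "phone"),
    ("low", "phone"), ("low", "phone"), ("low", "charger"),
]


def top_products(purchases, income_group, n=2):
    """Return the n most frequently bought products for one income group."""
    counts = Counter(
        product for group, product in purchases if group == income_group
    )
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_products(PURCHASES, "high"))
    print(top_products(PURCHASES, "low"))
```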
At this point, the analyst examines the quality of the data, addressing questions such as: Is the data complete? Missing values often occur, particularly if the data was collected across long periods of time. Some common items to check include: missing attributes and blank fields; whether all possible values are represented; the plausibility of values; the spelling of values; and whether attributes with different values have similar meanings (e.g., low fat, diet). The data analyst also should review any attributes that may give answers that conflict with common sense (e.g., teenagers with high income).
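Several of these checks are easy to automate. The Python sketch below flags missing values, implausible values, and records that conflict with common sense, and lists the distinct spellings of an attribute so that near-synonyms can be reviewed by hand. The sample records, the thresholds (ages between 0 and 120, a teenager defined as under 20, a high income defined as above 100,000), and the function names are all assumptions chosen for illustration.

```python
# Hypothetical records; None marks a missing value.
RECORDS = [
    {"age": 34, "income": 52000, "diet": "low fat"},
    {"age": 16, "income": 250000, "diet": "diet"},
    {"age": None, "income": 41000, "diet": "regular"},
    {"age": 230, "income": 38000, "diet": "Low Fat"},
]


def quality_report(records):
    """Return (record index, issue) pairs for simple data quality checks."""
    issues = []
    for i, rec in enumerate(records):
        age = rec.get("age")
        income = rec.get("income")
        if age is None:
            issues.append((i, "missing age"))
            continue
        if not 0 <= age <= 120:  # assumed plausible age range
            issues.append((i, "implausible age"))
        if age < 20 and income is not None and income > 100000:
            issues.append((i, "teenager with high income"))
    return issues


def distinct_values(records, field):
    """List normalised distinct values to help spot spelling variants."""
    return sorted({str(r[field]).strip().lower() for r in records})


if __name__ == "__main__":
    print(quality_report(RECORDS))
    print(distinct_values(RECORDS, "diet"))
```

Deciding whether values such as "low fat" and "diet" mean the same thing still requires judgement; the code only makes the candidates visible.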
In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase may be necessary. Modeling steps include the selection of the modeling technique, the generation of test design, the creation of models, and the assessment of models.
This task refers to choosing one or more specific modeling techniques, such as decision tree building with C4.5 or neural network generation with backpropagation. If assumptions are attached to the modeling technique, these should be recorded.
After building a model, the data analyst must test the model’s quality and validity, running empirical testing to determine the strength of the model. In supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the data set into a training set and a test set, build the model on the training set, and estimate its quality on the separate test set. In other words, the data analyst develops the model based on one set of existing data and tests its validity using a separate set of data. This enables the data analyst to measure how well the model can predict history before using it to predict the future. It is usually appropriate to design the test procedure before building the model; this also has implications for data preparation.
Once the test design is in place, the data analyst runs the modeling tool on the prepared data set to create one or more models.
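The split-and-test procedure can be illustrated with a small, self-contained Python sketch. Everything in it is a simplifying assumption: the data set is synthetic, the 70/30 split and the fixed random seed are arbitrary choices, and the "model" is a toy threshold classifier that assumes class 1 has the larger values. It is not the algorithm any particular tool uses. The point is only that the model is fitted on the training set and its error rate is measured on data it has not seen.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Toy model: a threshold halfway between the two class means."""
    class0 = [x for x, y in train if y == 0]
    class1 = [x for x, y in train if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def error_rate(threshold, rows):
    """Fraction of rows whose predicted class differs from the true class."""
    wrong = sum(1 for x, y in rows if (1 if x > threshold else 0) != y)
    return wrong / len(rows)


# Synthetic, well-separated data: (value, class label).
DATA = [(x, 0) for x in range(1, 11)] + [(x, 1) for x in range(21, 31)]

if __name__ == "__main__":
    train, test = train_test_split(DATA)
    threshold = fit_threshold(train)  # built on the training set only
    print("test error rate:", error_rate(threshold, test))
```

Because the test records play no part in fitting the threshold, the error rate measured on them is a fairer estimate of how the model would perform on new data than the error rate on the training set.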
The data mining analyst interprets the models according to his or her domain knowledge, the data mining success criteria, and the desired test design. The data mining analyst judges the success of the application of modeling and discovery techniques technically, but he or she should also work with business analysts and domain experts in order to interpret the data mining results in the business context. The data mining analyst may even choose to have the business analyst involved when creating the models for assistance in discovering potential problems with the data.
For example, a data mining project may test the factors that affect bank account closure. If data is collected at different times of the month, it could cause a significant difference in the account balances of the two data sets collected. (Because individuals tend to get paid at the end of the month, the data collected at that time would reflect higher account balances.) A business analyst familiar with the bank’s operations would note such a discrepancy immediately.
In this phase, the data mining analyst also tries to rank the models. He or she assesses the models according to the evaluation criteria and takes into account business objectives and business success criteria. In most data mining projects, the data mining analyst applies a single technique more than once or generates data mining results with different alternative techniques. In this task, he or she also compares all results according to the evaluation criteria.
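A ranking of this kind can be sketched as follows. The candidate models, their test-set error rates, and the maximum acceptable error of 10 percent are hypothetical figures; in a real project the evaluation criteria and the business success criterion come from the project plan.

```python
# Hypothetical test-set error rates for three candidate models.
RESULTS = {
    "decision tree": 0.12,
    "neural network": 0.09,
    "logistic regression": 0.15,
}


def rank_models(results, max_error):
    """Sort models by error rate and flag those meeting the success criterion."""
    ranked = sorted(results.items(), key=lambda item: item[1])
    return [(name, error, error <= max_error) for name, error in ranked]


if __name__ == "__main__":
    for name, error, ok in rank_models(RESULTS, max_error=0.10):
        status = "meets criterion" if ok else "does not meet criterion"
        print(f"{name}: error {error:.2f}, {status}")
```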
Determine whether and how well the results delivered by a given model will help you achieve your business goals. Is there any business reason why the model is deficient?
At this stage in the project the data analyst has built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Before proceeding to final deployment of the model built by the data analyst, it is important to more thoroughly evaluate the model and review the model’s construction to be certain it properly achieves the business objectives. Here it is critical to determine if some important business issue has not been sufficiently considered. At the end of this phase, the project leader then should decide exactly how to use the data mining results. The key steps here are the evaluation of results, the process review, and the determination of next steps.
Previous evaluation dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and determines if there is some business reason why this model is deficient. Another option here is to test the model(s) on real-world applications—if time and budget constraints permit. Moreover, evaluation also seeks to unveil additional challenges, information, or hints for future directions. At this stage, the data analyst summarises the assessment results in terms of business success criteria, including a final statement about whether the project already meets the initial business objectives.
It is now appropriate to do a more thorough review of the data mining engagement to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues (e.g., did we correctly build the model? Did we only use allowable attributes that are available for future deployment?).
At this stage, the project leader must decide whether to finish this project and move on to deployment or whether to initiate further iterations or set up new data mining projects.
Take the project results and decide how best to use them to address your business issue:
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organised and presented in a way that the customer can use. Deployment often involves applying “live” models within an organisation’s decision-making processes. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
In order to deploy the data mining result(s) into the business, this task takes the evaluation results and develops a strategy for deployment.
Monitoring and maintenance are important issues if the data mining result is to become part of the day-to-day business and its environment. A carefully prepared maintenance strategy avoids incorrect usage of data mining results.
At the end of the project, the project leader and his or her team write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences (if they have not already been documented as an ongoing activity) or it may be a final and comprehensive presentation of the data mining result(s). This report includes all of the previous deliverables and summarises and organises the results. Also, there often will be a meeting at the conclusion of the project, where the results are verbally presented to the customer.
The data analyst should assess failures and successes as well as potential areas of improvement for use in future projects. This step should include a summary of important experiences during the project and can include interviews with the significant project participants. This document could include pitfalls, misleading approaches, or hints for selecting the best-suited data mining techniques in similar situations. In ideal projects, experience documentation also covers any reports written by individual project members during the project phases and tasks.