You can use this algorithm to do an initial exploration of the data, and then later apply the results to create additional mining models with other algorithms that are computationally more intensive and more accurate.

As an ongoing promotional strategy, the marketing department for the Adventure Works Cycles company has decided to target potential customers by mailing out fliers. To reduce costs, they want to send fliers only to those customers who are likely to respond.
The company stores information in a database about demographics and response to a previous mailing. They want to use this data to see how demographics such as age and location can help predict response to a promotion, by comparing potential customers to customers who have similar characteristics and who have purchased from the company in the past.
Specifically, they want to see the differences between those customers who bought a bicycle and those customers who did not. By using the Microsoft Naive Bayes algorithm, the marketing department can quickly predict an outcome for a particular customer profile, and can therefore determine which customers are most likely to respond to the fliers. The Microsoft Naive Bayes algorithm calculates the probability of every state of each input column, given each possible state of the predictable column.
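To make that calculation concrete, here is a minimal from-scratch sketch in Python of such a conditional-probability table. The column names and the tiny dataset are invented for illustration; this is not the Analysis Services implementation.

```python
from collections import Counter, defaultdict

# Hypothetical cases with invented column names; the last column is the
# predictable column, the others are inputs.
cases = [
    {"CommuteDistance": "0-1 Miles",  "Region": "North", "BikeBuyer": "Yes"},
    {"CommuteDistance": "0-1 Miles",  "Region": "South", "BikeBuyer": "Yes"},
    {"CommuteDistance": "5-10 Miles", "Region": "North", "BikeBuyer": "No"},
    {"CommuteDistance": "5-10 Miles", "Region": "South", "BikeBuyer": "No"},
    {"CommuteDistance": "0-1 Miles",  "Region": "North", "BikeBuyer": "No"},
]

# Count each input state together with each state of the predictable column.
class_counts = Counter(case["BikeBuyer"] for case in cases)
state_counts = defaultdict(Counter)   # (input column, class) -> Counter of states
for case in cases:
    cls = case["BikeBuyer"]
    for column, state in case.items():
        if column != "BikeBuyer":
            state_counts[(column, cls)][state] += 1

# P(input state | predictable state): the table the algorithm builds.
for (column, cls), states in sorted(state_counts.items()):
    for state, count in sorted(states.items()):
        print(f"P({column}={state} | BikeBuyer={cls}) = {count / class_counts[cls]:.2f}")
```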
The Microsoft Naive Bayes Viewer lists each input column in the dataset and shows how the states of each column are distributed, given each state of the predictable column. You would use this view of the model to identify the input columns that are important for differentiating between states of the predictable column. For example, in the row for Commute Distance, the distribution of input values is visibly different for buyers versus non-buyers.

The Microsoft Clustering algorithm takes a different approach. Picture the cases in the dataset as a scatter plot, where each case is a point on the graph. The clusters group points on the graph and illustrate the relationships that the algorithm identifies.
After first defining the clusters, the algorithm calculates how well the clusters represent groupings of the points, and then tries to redefine the groupings to create clusters that better represent the data. The algorithm iterates through this process until it can no longer improve the results by redefining the clusters.
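As a sketch of that assign-and-redefine loop, here is a from-scratch K-means-style iteration in Python on one-dimensional points. The Microsoft Clustering algorithm also offers other techniques (such as EM), so this illustrates only the general cycle, not the product's implementation.

```python
import random

def cluster(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial cluster definitions
    while True:
        # Assign each case to the cluster that currently represents it best.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Redefine each cluster from the cases it now groups.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:           # no further improvement: stop
            return groups
        centers = new_centers

print(cluster([1.0, 1.2, 0.9, 8.0, 8.3, 7.9], k=2))
```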
You can customize the way the algorithm works by specifying a clustering technique, limiting the maximum number of clusters, or changing the amount of support required to create a cluster.

When you prepare data for use in training a clustering model, you should understand the requirements for the particular algorithm, including how much data is needed and how the data is used. The requirements for a clustering model are as follows:
- A single key column: each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.
- Input columns: each model must contain at least one input column that contains the values that are used to build the clusters. You can have as many input columns as you want, but depending on the number of values in each column, the addition of extra columns can increase the time it takes to train the model.
- An optional predictable column: the algorithm does not need a predictable column to build the model, but you can add a predictable column of almost any data type.
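Here is a sketch of these data-shape requirements, using pandas and scikit-learn as a stand-in for Analysis Services; the column names and the library choice are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented customer data shaped the way the requirements above describe.
df = pd.DataFrame({
    "CustomerKey":  [101, 102, 103, 104],          # single key column: identifies each case
    "Age":          [34, 51, 29, 42],               # input columns: values used to build clusters
    "YearlyIncome": [60000, 90000, 45000, 70000],
    "BikeBuyer":    [1, 0, 1, 0],                   # optional predictable column
})

# The key uniquely identifies each record but is not an input to clustering.
inputs = df.drop(columns=["CustomerKey", "BikeBuyer"])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(inputs)
df["Cluster"] = model.labels_
print(df)
```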
More generally, a data mining algorithm first analyzes the data that you provide, looking for patterns and trends, and then uses the results of this analysis over many iterations to find the optimal parameters for creating the mining model. These parameters are then applied across the entire data set to extract actionable patterns and detailed statistics. The resulting mining model can take various forms, such as a decision tree that predicts an outcome and describes how different criteria affect that outcome, or a set of rules that describes how products are grouped together in a transaction, along with the probabilities that those products are purchased together.
The algorithms provided in SQL Server Data Mining are the most popular, well-researched methods of deriving patterns from data.
To take one example, K-means clustering is one of the oldest clustering algorithms and is available widely in many different tools and with many different implementations and options. All of the Microsoft data mining algorithms can be extensively customized and are fully programmable, using the provided APIs.
You can also automate the creation, training, and retraining of models by using the data mining components in Integration Services. Choosing the best algorithm to use for a specific analytical task can be a challenge. While you can use different algorithms to perform the same business task, each algorithm produces a different result, and some algorithms can produce more than one type of result.
For example, you can use the Microsoft Decision Trees algorithm not only for prediction, but also as a way to reduce the number of columns in a dataset, because the decision tree can identify columns that do not affect the final mining model.
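As a sketch of that column-reduction idea, the following uses a scikit-learn decision tree (a stand-in, not the Microsoft Decision Trees algorithm itself) to keep only the columns that the tree actually splits on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Train a shallow tree on a sample dataset.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Columns with zero importance never appeared in a split; they are
# candidates to drop before training a more expensive model.
kept = [col for col, imp in zip(X.columns, tree.feature_importances_) if imp > 0]
print(f"kept {len(kept)} of {X.shape[1]} columns:", kept)
```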
While the details differ, the algorithms fall into a few broad categories:

- Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
- Regression algorithms predict one or more continuous numeric variables, such as profit or loss, based on other attributes in the dataset.
- Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
- Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis (see the sketch after this list).
- Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a series of clicks in a web site, or a series of log events preceding machine maintenance.
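Here is a minimal market-basket sketch in Python for the association category: it counts itemsets across a handful of transactions and reports the pair rules that clear a support threshold. The products and thresholds are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Each transaction is the set of products bought together.
transactions = [
    {"helmet", "bike", "gloves"},
    {"bike", "gloves"},
    {"helmet", "bike"},
    {"gloves", "water bottle"},
]

# Count every 1-item and 2-item itemset across all transactions.
support = Counter()
for basket in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            support[itemset] += 1

# Report rules {a} -> {b} whose pair support clears a 50% cutoff.
n = len(transactions)
for itemset, count in sorted(support.items()):
    if len(itemset) == 2 and count / n >= 0.5:
        a, b = itemset
        confidence = count / support[(a,)]   # P(b in basket | a in basket)
        print(f"{a} -> {b}: support={count / n:.2f}, confidence={confidence:.2f}")
```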
However, there is no reason that you should be limited to one algorithm in your solutions. Experienced analysts will sometimes use one algorithm to determine the most effective inputs (that is, variables), and then apply a different algorithm to predict a specific outcome based on that data.

The association algorithm exposes several tuning parameters that control how itemsets are built:

- Maximum itemset count: decreasing this value can potentially reduce the time that is required for creating the model, because processing of the model stops when the limit is reached.
- Maximum support: this parameter can be used to eliminate items that appear frequently and therefore potentially have little meaning. If this value is less than 1, it represents a percentage of the total cases; values greater than 1 represent the absolute number of cases that can contain the itemset.
- Minimum itemset size: if you increase this number, the model might contain fewer itemsets. This can be useful if you want to ignore single-item itemsets, for example. You cannot reduce model processing time by increasing the minimum value, because Analysis Services must calculate probabilities for single items anyway as part of processing; however, by setting this value higher, you can filter out smaller itemsets.
- Minimum support: if you set this value to less than 1, the minimum number of cases is calculated as a percentage of the total cases; if you set it to a whole number greater than 1, it is read as a count of cases that must contain the itemset (see the sketch after this list). The algorithm might automatically increase the value of this parameter if memory is limited.
- Prediction cache size: the default value is 0, in which case the algorithm produces as many predictions as requested in the query. Setting a value can improve prediction performance; for example, if the value is set to 3, the algorithm caches only 3 items for prediction.
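The fractional-versus-absolute convention used by the minimum and maximum support parameters above can be captured in a few lines; the function name here is invented for illustration.

```python
def minimum_case_count(support_value: float, total_cases: int) -> int:
    """Interpret a support parameter using the convention described above."""
    if support_value < 1:
        # Below 1: the value is a fraction (percentage) of the total cases.
        return int(support_value * total_cases)
    # 1 or greater: the value is an absolute count of cases.
    return int(support_value)

print(minimum_case_count(0.05, 10_000))  # fractional: 5% of 10,000 cases -> 500
print(minimum_case_count(500, 10_000))   # absolute: 500 cases
```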