Data science is a huge umbrella term. It covers everything from cleansing and preparing data to its eventual analysis. The scope is so wide that it can be confusing to discuss. Personally, I find it useful to have the following diagrams to better frame my thoughts when I discuss data science.
Here, I've distinguished between engineering tasks and analytics tasks. Data engineering-type work can include data extraction, loading, transformation, and cleaning of data. The sexy stuff that the public tends to associate with data science? That would be the data analytics work. That's where you derive useful insights, conclusions you can act on to produce superior business results.
When people get excited about data science in general, it's specifically predictive analytics that they're raving about. I'm going to spend some time walking you through predictive data analytics because you need to separate the truth from the hype. Even if you don't get into the nitty-gritty engineering aspects of the field, it's crucial that you know how predictive analytics works and what it entails. With data science improving at an accelerating rate, predictive analytics may become its own field. So for the purposes of this article, I'm going to focus on predictive analytics over explanatory analytics because it's more important that you know where the field is going than where it is right now.
Different Kinds of Analytics
When it comes to analytics, it helps to think in terms of whether we want to explain existing data points or predict future trends.
Of course, certain tasks may overlap, which is why both diagrams above are presented as spectrums. However, there will be tasks that are closer to one end than the other. With these two diagrams, I have started to introduce the basics of thinking about predictive data analytics and predictive data science, which are used for the express purpose of predicting future events.
Predictive Analytics: The Key Assumption Most People Miss
Predictive analytics has been defined as the use of data, statistical algorithms, and machine learning techniques "to analyze current and historical facts to make predictions about future or otherwise unknown events." The key assumption in this definition is the idea of using known historical and current data points to predict the future. Most people tend to skip over this assumption. Sometimes, people don't realize this assumption also acts as a limitation on the success of predictive analytics. Because this assumption is so important, I will repeat it below.
Predictive analytics involves using known historical and current data points to predict the future.
Suppose you want to apply predictive analytics to predict black swan events. That would be silly. By definition, black swan events are so surprising that they cannot be predicted from historical data. So ignore this key assumption at your peril; you could wind up wasting your time.
Please keep this assumption in mind. It will help you avoid ascribing too much effectiveness to predictive data science. Keeping this assumption in mind also helps you maintain objectivity when you evaluate your models during the analysis process, which I will cover in more detail later on.
The Six Steps of the Predictive Analytics Process
There are many predictive analytics techniques out there. I'm not going to cover them in detail; you can easily get an overview by googling for them. Instead, I think it's more valuable to cover the core steps of the analytics process. That way, when you find out more about the specific techniques, your knowledge of the core steps will help you frame your thinking.
The six steps of the predictive analytics process are:
- Defining the Project
- Collecting the Data
- Performing the Analysis
- Modeling
- Deploying the Model
- Continuous Model Monitoring
Step 1: Defining the Project
We define the project in terms of outcomes, deliverables, scope of work, and even data sets. We ask ourselves: what is it that we want to predict? What degree of accuracy must we achieve to consider the project a success? What are the data sets we need to have in order to perform the analysis?
Step 2: Collecting the Data
Remember the first diagram contrasting engineering tasks with analytics tasks? When we collect the data, we are in the engineering phase of the project. We may even need to build additional automation to make the collection, cleaning, and preparation of the data easier and faster. This may be grunt work, but it's crucial grunt work. Garbage in, garbage out, as the saying goes. You want to ensure the data you prepare is good enough at this stage.
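To make the "garbage in, garbage out" point concrete, here is a minimal sketch of a cleaning pass in Python. The field names `customer_id` and `monthly_spend` and the function `clean_records` are hypothetical, chosen only for illustration; a real pipeline would have its own schema and rules.

```python
import math


def clean_records(raw_rows):
    """Keep rows with a usable ID and a finite, non-negative spend.

    Hypothetical schema: each row is a dict with 'customer_id' and
    'monthly_spend'. Malformed rows and repeated IDs are dropped.
    """
    seen_ids = set()
    cleaned = []
    for row in raw_rows:
        try:
            customer_id = str(row["customer_id"]).strip()
            spend = float(row["monthly_spend"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or unparseable field
        if not customer_id or not math.isfinite(spend) or spend < 0:
            continue  # empty ID or implausible value
        if customer_id in seen_ids:
            continue  # keep only the first row per customer
        seen_ids.add(customer_id)
        cleaned.append({"customer_id": customer_id, "monthly_spend": spend})
    return cleaned


raw = [
    {"customer_id": " a1 ", "monthly_spend": "42.5"},
    {"customer_id": "a1", "monthly_spend": "10"},    # duplicate ID
    {"customer_id": "b2", "monthly_spend": "n/a"},   # unparseable
    {"customer_id": "c3"},                           # missing field
    {"customer_id": "d4", "monthly_spend": -5},      # negative spend
    {"customer_id": "e5", "monthly_spend": 7},
]
cleaned = clean_records(raw)
```

Silently dropping rows is a simplification. In practice you would probably want to count and log what was rejected, so you can tell whether the source data is getting worse.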
Step 3: Performing the Analysis
Now, we perform analysis on the sanitized data. The objective here is to derive several candidate models we may use for prediction. This step is usually done in tandem with the next step, which is creating the model itself. In this step, we frequently employ statistical methods suited to exploring the data. Plenty of testing is done here as well. We want to state our assumptions explicitly and test them using standard statistical techniques.
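One example of a standard statistical check at this stage is measuring whether a candidate input actually moves with the thing we want to predict. The sketch below computes the Pearson correlation coefficient in plain Python. It is only one of many possible checks, and the function name `pearson_correlation` and the sample numbers are mine, not part of any particular toolkit.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Made-up example: does ad spend move with sales?
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [2.0, 4.0, 6.0, 8.0]
r = pearson_correlation(ad_spend, sales)
```

A value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests little linear relationship. Correlation says nothing about causation, so treat it as a screening tool rather than proof.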
Step 4: Modeling
In this step, we use various statistical and machine learning approaches to generate a predictive model. We take several candidate models from the previous step and iteratively train them using training data. Once a model has been trained, we test it on new data it has not seen to find out how well it performs. We may further refine the models or even discard some of them altogether depending on the testing results. We may also return to earlier steps in order to generate more models, especially if we are not getting the level of accuracy we are looking for.
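The train-then-test loop can be sketched in a few lines of plain Python. Here I assume two deliberately simple candidates (a mean baseline and a one-variable least-squares line), synthetic data, and mean absolute error as the score. All names are hypothetical, and a real project would normally use a library and richer models.

```python
def fit_mean(train):
    """Baseline candidate: always predict the average training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Candidate: ordinary least-squares line through (x, y) pairs."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_absolute_error(model, rows):
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def select_model(candidates, train, test):
    """Train each candidate on `train`, score it on held-out `test`."""
    scores = {
        name: mean_absolute_error(fit(train), test)
        for name, fit in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data: y = 3x + 2. Earlier points train, later points test.
data = [(float(x), 3.0 * x + 2.0) for x in range(20)]
train, test = data[:15], data[15:]
best_name, scores = select_model(
    {"mean_baseline": fit_mean, "linear": fit_line}, train, test
)
```

Splitting by position (older rows for training, newer rows for testing) mirrors the key assumption from earlier: the model only ever sees the past and is judged on what came after it.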
Step 5: Deploying the Model
Eventually, we'll arrive at a predictive model we are willing to go live with. Now, we deploy it in real-life systems. At this stage, we are back to doing more engineering tasks. We want to deploy the new model in such a way that we can collect more data on its performance. Perhaps the model needs to incorporate new data points as they arrive for maximum effectiveness; in this case, more engineering is required to ensure this feature is properly implemented.
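As a rough illustration of deploying "in such a way that we can collect more data on its performance", the sketch below wraps a model so that every prediction is logged and can later be matched with the real outcome. The class `LoggedModel`, its methods, and the request IDs are hypothetical, and the in-memory dict stands in for whatever storage your production system actually uses.

```python
class LoggedModel:
    """Wrap a prediction function and remember what it predicted."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self.log = {}  # request_id -> {"x", "predicted", "actual"}

    def predict(self, request_id, x):
        prediction = self._predict_fn(x)
        self.log[request_id] = {"x": x, "predicted": prediction, "actual": None}
        return prediction

    def record_outcome(self, request_id, actual):
        """Attach the real-world result once it becomes known."""
        self.log[request_id]["actual"] = actual

    def resolved_errors(self):
        """Absolute errors for predictions whose outcome is known."""
        return [
            abs(entry["predicted"] - entry["actual"])
            for entry in self.log.values()
            if entry["actual"] is not None
        ]


model = LoggedModel(lambda x: 2 * x)   # stand-in for the trained model
first = model.predict("r1", 3)
model.record_outcome("r1", 7)          # the truth arrives later
second = model.predict("r2", 1)        # outcome not yet known
errors = model.resolved_errors()
```

The logged pairs of prediction and outcome are exactly the raw material the monitoring step needs.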
Step 6: Continuous Model Monitoring
Our predictive model has gone live. We are collecting data on how it's performing in real time. Since we will need to regularly tweak the model as time goes by, we need to continuously monitor the model and its effectiveness. Deciding on how much to tweak the model is more art than science. It can be hard to perform analysis in real time, so your data engineers and your data analysts need to work closely together here.
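One simple way to monitor a live model, sketched here under my own assumptions, is to track the average error over the most recent predictions and flag the model for review when that average drifts well above the error measured during testing. The class `DriftMonitor`, the window size, and the 1.5x tolerance are illustrative choices, not recommended values.

```python
from collections import deque


class DriftMonitor:
    """Flag a model when its recent error exceeds a baseline by a margin."""

    def __init__(self, window_size, baseline_error, tolerance=1.5):
        self.errors = deque(maxlen=window_size)  # keeps only recent errors
        self.baseline_error = baseline_error
        self.tolerance = tolerance

    def observe(self, predicted, actual):
        self.errors.append(abs(predicted - actual))

    def rolling_error(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_review(self):
        # Wait for a full window so a single bad prediction cannot trigger it.
        window_full = len(self.errors) == self.errors.maxlen
        threshold = self.baseline_error * self.tolerance
        return window_full and self.rolling_error() > threshold


monitor = DriftMonitor(window_size=3, baseline_error=1.0)
for predicted, actual in [(10, 10.5), (10, 9.5), (10, 10)]:
    monitor.observe(predicted, actual)
healthy = monitor.needs_review()

for predicted, actual in [(10, 14), (10, 15), (10, 13)]:
    monitor.observe(predicted, actual)
drifting = monitor.needs_review()
```

A flag like this only tells you that something changed. Deciding whether to retrain, re-tune, or leave the model alone is still the judgment call described above.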
Conclusion and Next Steps
I started this article by explaining how you should think about data science and its various tasks. Then we went deeper into the two extremes of analytics (i.e., explanatory and predictive) with the bulk of the article devoted to predictive analytics. I have laid out an overview of the predictive analytics process you can use to build a predictive model and deploy it in live systems. If you do this correctly, your organization should benefit from predictive analytics.
As good scientists and engineers, we want to push the frontier of what we can do. The natural next step after a successful deployment of a predictive model would be to build decision models on top of these predictive models. Using the predictions derived from your analytics, decision models can help in areas such as optimizing resources, developing decision logic, etc. Eventually, your organization may even consider expanding the scope of such decision models to generate the desired action for every interaction with every customer under all circumstances.
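To give a flavor of what a decision model built on top of a prediction might look like, here is a deliberately tiny sketch. It assumes the predictive model outputs a churn probability, and it compares the expected value saved by a retention offer against the offer's cost. The function `choose_action`, its parameters, and the numbers are all hypothetical.

```python
def choose_action(churn_probability, customer_value, offer_cost,
                  offer_success_rate):
    """Pick an action by comparing expected saving with the offer's cost.

    Assumes the offer retains a would-be churner with probability
    `offer_success_rate`; every input here is an illustrative estimate.
    """
    expected_saving = churn_probability * customer_value * offer_success_rate
    if expected_saving > offer_cost:
        return "send_retention_offer"
    return "no_action"


high_risk = choose_action(0.5, customer_value=100.0, offer_cost=20.0,
                          offer_success_rate=0.5)
low_risk = choose_action(0.1, customer_value=100.0, offer_cost=20.0,
                         offer_success_rate=0.5)
```

The prediction supplies one input (the probability). The decision logic around it encodes business knowledge that the predictive model knows nothing about.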
As you can see, there's a lot of exciting potential when it comes to predictive analytics for businesses. I hope that this introduction has given you enough information to start nudging your organization towards this path.