Operationalizing Data Science
- Greg Foss
- Feb 18, 2018
- 7 min read
Last week we decided to learn a bit more about Azure Machine Learning Studio, and after a week of playing around with it we want to share what we like, what we don’t and what we’ve learned. We started out by building a proof of concept. Here was our scenario.
The sales team would like to know roughly what orders they can expect in the upcoming week and month. They would also like to know if there are any current orders in the system that are anomalies; that is, did the customer deviate from a recognized ordering pattern? Finally, they’d like to know the trend of a product or product group: is it continuing to sell strongly or is it starting to decline?
That’s the scenario, but before we build the whole thing, let’s find out if we can do it technically with a proof of concept. The simplest measure would be:
Can we move sales data from our ERP system (SAP) into AML, perform basic data manipulation and then display the results in PowerBI?

Reminder: This is a POC, so we know we could do all of this directly in PowerBI, but once additional layers of functionality such as anomaly detection and regression are added, this would be the preferred model. The necessary steps:
Extract base data from SAP and load into SQL. Dump historical sales order item data.
Extract delta data updates from SAP using Netweaver Gateway, SSIS and SQL. For the SAP savvy: 2LIS_11_VAHDR & 2LIS_11_VAITM as a starting point.
Build Azure Experiment to read the data from SQL, modify it in some way and then load it back into SQL.
Use PowerBI to visualize the data and do some basic analytics.
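To make the later steps concrete, here is a rough, illustrative sketch in R of the shape of the sales order item data we end up with in SQL. The column names match the ones used in the scripts further down; the values are made up:
# Hypothetical sample of the base sales order item table (illustrative values only)
sample_orders <- data.frame(
  DateTime = c("2014-03-03 09:15:00", "2014-03-04 11:02:00"), # order timestamp
  SoldTo   = c("100231", "100231"),                           # sold-to customer
  ShipTo   = c("200114", "200987"),                           # ship-to location
  Material = c("MAT-0001", "MAT-0002"),                       # material number
  Quantity = c(120, 45),                                      # ordered quantity
  stringsAsFactors = FALSE
)
str(sample_orders)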
We will not cover steps one and two, as they are very SAP-centric and not really part of the POC; we know we can do this and have done it many times. We will also not go through the steps of creating a SQL database in Azure, as this has been covered ad infinitum in various blogs. So basically, we start with a SQL database containing a table of basic sales order data. In Azure ML we first go to the Experiments tab by clicking on the lab beaker icon.

From there we click on the Import Data option and drag it onto our canvas.

You will see a new module on the canvas.

On the right hand side of the screen you will see all the entries necessary for connecting to your Azure SQL database.

Now that you are connected to SQL, you can click the Run button at the bottom of the screen or right-click on the module and select Run Selected from the context menu. When the module has finished running, you can easily visualize the data you imported: click on the bubble at the bottom of the module and select the Visualize option.


You can gain some summary statistics by clicking on the column header and looking at the pane to the right.


Perhaps the most useful feature here for me is Missing Values. We know that we do not need to deal with missing values in our data, largely because it is coming from an SAP system and the data is relatively clean. However, for our POC we want to do something, so we’ll do some basic filtering, converting and aggregating before loading the data back into a different table in our database. First of all, we don’t want the SoldTo column; it would create too obvious a cross-correlation when we do more sophisticated things later. To easily filter it out, drag the Edit Metadata module onto the canvas.
You can always search for the module by typing its name into the search bar as well.

Now drag the connector from the database import to the input bubble on the Edit Metadata module. It is good practice to make notes on the modules; just double-click on one and type in whatever you like.

In the panel on the right, click on the Launch column selector button. You can easily select by name or by applying rules. For instance, with a rule you can omit all currency or date columns. It is easiest for us to just do this by name as we are only omitting one column. Move all the desired columns to the Selected Columns side. Then click on the check-mark at the bottom.

Our next step is to filter the data, because we do not want any line items that are before 2013-01-01. To do this we are going to use our first R script. Select the Execute R Script module from the panel on the left and drag it onto the canvas.

Now for a little explanation of the Execute R Script module.

There are three input nodes. The first two, from left to right, are for data imports; for instance, you could import two dataframes and rbind them in the script. The third one, on the far right, is for importing scripts or bundles. I found this useful when calling libraries that were not among the 400+ native libraries that Azure ML provides. Here is a nice blog on how to do that. The two output nodes are for the dataset (left) and the console output (right). I did not find coding in these modules very user-friendly. Code your R or Python scripts in R Studio or Jupyter respectively, work out all the bugs there, and then move that code into the module. I have not yet had the pleasure of playing around with the Python module, but for the R module the syntax is a touch different. The first thing you do is bring in the data from the port: Port 1 is maml.mapInputPort(1) and Port 2 is maml.mapInputPort(2). Once you have run your code, you push the data out using the command maml.mapOutputPort(“data.set”). In our case we just want to filter the data, so our script looks like this.
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
library(dplyr)
# Convert the DateTime column to a Date and drop the original column
dataset1$Dte <- as.Date(dataset1$DateTime)
dataset1$DateTime <- NULL
# Keep only line items dated after 2013-01-01
dataset1 <- filter(dataset1, Dte > as.Date("2013-01-01"))
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1");
Note: At this point you may be saying to yourself, “Why do the Edit Metadata step to filter out a column when you could just NULL it here?” Yes, of course, there are many ways to skin a cat.
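For completeness, that in-script alternative is a single line (assuming the column is named SoldTo, as in our data):
# Drop the SoldTo column inside the R script instead of using Edit Metadata
dataset1$SoldTo <- NULL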
We are going to continue this process with two more R modules. I coded all of this in R Studio, but when it came to putting it into Azure ML I broke it into nice little sections. There may be no real technical reason to do it this way, but it does make the code clean and easy to follow. After these two additional R modules I ended up with the following. Here is the code we used to convert the dates into week buckets.

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
# Bucket each date into the calendar week it falls in (weeks starting on Sunday)
dataset1$Week <- as.Date(cut(dataset1$Dte,
                             breaks = "week",
                             start.on.monday = FALSE))
dataset1$Dte <- NULL
maml.mapOutputPort("dataset1")
Here is the code we used to aggregate the quantities by week, material and ship-to location.
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
library(magrittr)
library(dplyr)
# Sum the ordered quantity per week, material and ship-to location
dataset1 <- dataset1 %>%
  group_by(Week, Material, ShipTo) %>%
  summarize(Qty = sum(Quantity))
maml.mapOutputPort("dataset1");
One of the nice features of putting your code in these little modules is that it encourages segmenting your code into buckets, which makes it easier to read and understand than hundreds or thousands of lines of code written one after the other.
The last step is to write the modified data back to the database. We want to write this data to a different table in Azure SQL. Drag and drop the Export Data module onto the canvas.

In the panel to the right, enter the parameters for your database.

Now when we run the entire experiment, data is read from SQL, manipulated and then written back to a new SQL table. This allows us to move on to the next step, which is importing the data into PowerBI.
Open up PowerBI and click on the Get Data button.
From the drop-down menu select the SQL Server database as the source of data.

Enter your server name and the database. If this is your first time logging into that server and database you will be prompted for credentials.

On the next screen you will see the database and the tables that are in it.
Select the table that you wrote to from AML Studio.

Then select the Edit button, because we are going to make some small modifications before loading the data into PowerBI. For instance, we want the date field to be a date. To do that, simply right-click on the column with the date in it and select Change Type -> Date/Time from the context menu.

When you are finished making the changes you want, click on the Close & Apply button.
Now that we have the data in PowerBI in a good format, we can run some metrics against it.
Because this is not a PowerBI demo, I’ll just jump to the final product.

This quick report, which filters by Material, ShipTo and Date, allows users to see the ordering pattern and forecast for a material. We could have done forecasting and other metrics in Azure ML, but PowerBI makes metrics such as mean, median, percentile, trend and forecasting super simple, so why bother? We started out this process to show how one could operationalize simple analytics using Azure, R and PowerBI. Let’s take this data to a new level and look at what could be done with more advanced analytics:
Use Azure ML to do anomaly detection and write the results to SQL. Then we could overlay those anomalies onto a chart in PowerBI. This would allow for a very user-friendly interface to show anomalies in ordering patterns based on ship-to, material and time. You could even use it to identify geographical anomalies. (A rough sketch of this idea follows the list.)
Use Azure ML to build a customer segmentation analysis based on a variety of features. This could be added to the PowerBI report.
Add inventory metrics to this model and use machine learning forecasting to drive material procurement and manufacturing schedules.
Just about any scenario you can imagine.
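To make the first idea slightly more concrete, here is a rough, hypothetical sketch of how a simple anomaly flag could be computed on the weekly aggregates inside another Execute R Script module. It uses a basic median/MAD rule per material and ship-to combination; it illustrates the approach rather than anything we built for this POC:
# Hypothetical sketch: flag weekly quantities that deviate strongly from
# the typical ordering pattern for each Material/ShipTo combination
dataset1 <- maml.mapInputPort(1) # class: data.frame (Week, Material, ShipTo, Qty)
library(dplyr)
dataset1 <- dataset1 %>%
  group_by(Material, ShipTo) %>%
  mutate(MedQty = median(Qty),
         MadQty = mad(Qty),
         # Flag weeks more than 3 robust deviations from the median
         Anomaly = ifelse(MadQty > 0 & abs(Qty - MedQty) > 3 * MadQty, 1, 0)) %>%
  ungroup()
maml.mapOutputPort("dataset1");
Written back to its own SQL table, a flag like this could be overlaid directly on the PowerBI chart.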