As a data scientist, I am truly spoiled by the Cleaning and Linking team. As we mentioned in our previous blog post, the atom-centric view created during Cleaning and Linking allows us to see how all of a company’s data fits together in the context of the smallest single entity, and it sets the stage for higher-level analytics. However, the analytical work has just begun. Big Data is all about finding new and interesting trends through statistics, machine learning, and other numerical methods. Before we can dive in and discover the deeper insights that predictive modeling offers, we must first engage in the process of Data Featurization.

What is Featurization?

What is featurization, you may ask? Simply put, featurization is the process of converting a nested JSON Object into Indicators: vectors of scalars, the form that analysis requires. That sentence alone probably requires some translating!

  • JSON stands for JavaScript Object Notation. It is a lightweight format for data that’s easy for machines to read and write. JSON is perfect for our uses because the various languages we use (Python, R, JavaScript) have strong, native interaction with this format. Most of our software interacts with data stored in a format that leverages JSON.
  • Scalar is a term from the Physics tradition, where it describes a quantity with magnitude alone. Here, we adapt it to a more computational framework to mean a single unit of measure, such as ‘1’, ‘3.14’, or ‘red.’
  • Vector is a term from the Linear Algebra tradition. A vector is an ordered collection of elements; programmers might call this a ‘list.’ An example vector might be [1, 2, 5, 6]. The short snippet after this list shows how all three of these ideas look in code.
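
To make those definitions concrete, here is how the three ideas look in JavaScript, the language used for the examples later in this post:

// Scalars: single units of measure.
var count = 1;
var pi = 3.14;
var color = 'red';

// A vector: an ordered collection of scalars (what programmers call a list).
var vector = [1, 2, 5, 6];

// A JSON-style object: nested key-value pairs, like the atoms shown later.
var atom = {"name": "Bob", "transactions": [{"amount": 1000}]};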

Vectors of scalars make advanced computation and statistical work possible in a way that raw data cannot. Suppose you wanted to determine the average cost of your business trips to visit a client. While you might have the receipts for your travels digitally stored and organized, it would be impossible to calculate the average until you tally up the per-trip costs somewhere (such as individual Excel cells). Once you’ve aggregated this data, it’s possible to see just where you can reduce some of your expenses!
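
In code, that tallying step is exactly what a vector enables. Here is a minimal sketch with made-up trip costs:

// Per-trip costs tallied into a vector (sample values, not real data).
var tripCosts = [412.50, 389.99, 514.25, 450.00];

// Once the data is a vector of scalars, the average is a one-liner.
var avgCost = tripCosts.reduce(function(sum, cost) { return sum + cost; }, 0) / tripCosts.length;
// avgCost ≈ 441.69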

How do we featurize data?

Now that I’ve given a general overview of what featurizing means, I’ll give you some examples of how we make this happen. After Cleaning and Linking, the data exists in an atomic format using an adaptation of JSON for storage. In this format, each atom is represented by a single, potentially nested Object consisting of key-value pairs. For example:

{"id": 1, "name": "Bob", phone_number:{"number": "1111"},
        "transactions": [{"date": "May", "amount": 1000}, {"date": "June", "amount": 1500}]},

{"id": 2, "name": "Steve", phone_number:{"number": "22222"},
        "transactions": [{"date": "May", "amount": 500}, {"date": "June", "amount": 1000}]},

{"id": 3, "name": "George", phone_number:{"number": "33333"},
        "transactions": [{"date": "May", "amount": 300}]}

This format is great for a couple of reasons:

  1. The Data Scientist has assurances that the data is correct, at least to the satisfaction of the customer.
  2. The Data Scientist does not have to worry about how various pieces of data connect. In this case, we know which pieces of data belong to which atom.

However, the data coming out of Cleaning and Linking is, by design, not tabular at all, while the statistical tests and modeling algorithms we want to apply expect rows and columns. To bridge this gap, we go through the process of featurization.

Our team has built a powerful platform, Aunsight, and has written a lot of code to make this process as simple and efficient as possible. What this process boils down to is writing JavaScript functions that compute a value for each atom. In the above example, one of our team members would write a function like this:

function avg_transaction_amt(data) {
    var n = data['transactions'].length;
    var total = 0;
    // Guard against atoms with no transactions, so we never divide by zero.
    if (n === 0) { return 0; }
    data['transactions'].forEach(function(tran) {
        total += tran['amount'];
    });
    return total / n;
}

That may look complicated, but the function is simply calculating the average ‘amount’ of the atom’s ‘transactions’. It is applied to each atom, producing one value per atom.
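
To make that concrete, here is a simplified sketch of applying the function to the array of atoms shown earlier (here called atoms); the real work happens inside Aunsight, so treat this as an approximation:

// Simplified stand-in for the platform step that runs feature functions.
var indicators = atoms.map(function(atom) {
    return {
        name: atom['name'],
        transaction_count: atom['transactions'].length,
        avg_transaction_amt: avg_transaction_amt(atom)
    };
});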

Now the data is arranged in a tabular or a vector format:

name     transaction_count   avg_transaction_amt
Bob      2                   1250
Steve    2                   750
George   1                   300

With the data arranged like this, I can begin to do some advanced analytics, like telling you that there’s a positive correlation between the number of transactions and average transaction size.
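
That claim is itself just another vector computation. For the curious, here is a hand-rolled Pearson correlation in JavaScript; it is an illustrative helper, not part of our platform:

// Pearson correlation coefficient between two equal-length vectors.
function correlation(xs, ys) {
    var n = xs.length;
    var meanX = xs.reduce(function(a, b) { return a + b; }, 0) / n;
    var meanY = ys.reduce(function(a, b) { return a + b; }, 0) / n;
    var cov = 0, varX = 0, varY = 0;
    for (var i = 0; i < n; i++) {
        cov  += (xs[i] - meanX) * (ys[i] - meanY);
        varX += (xs[i] - meanX) * (xs[i] - meanX);
        varY += (ys[i] - meanY) * (ys[i] - meanY);
    }
    return cov / Math.sqrt(varX * varY);
}

correlation([2, 2, 1], [1250, 750, 300]); // ≈ 0.85 for the rows above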

Sometimes we want to perform analysis that looks at different segments of the data: for example, how did the data look in May, and how did it look in June? To do this, the team specifies a set of Blocks, conditions we’d like to impose on the data, and establishes a set of filtering functions to apply to each atom prior to computation. In this case our team member would write the following:

function filter_transactions(tran) {
    // 'this.param' holds the month for the current Block (e.g. 'May').
    if (tran['date'] === this.param) { return 1; }
    return 0;
}

This function removes from the atom any transactions that fall outside the month in question. Applying our computing functions to the filtered atoms yields two sets of indicators, shown below (with a rough code sketch of the whole step after the tables):

month   name     transaction_count   avg_transaction_amt
May     Bob      1                   1000
May     Steve    1                   500
May     George   1                   300

month   name     transaction_count   avg_transaction_amt
June    Bob      1                   1500
June    Steve    1                   1000
June    George   0                   0
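
Outside the platform, that whole Block step can be approximated in plain JavaScript. This is a simplified sketch, not how Aunsight actually executes it, and featurize_for_month is a made-up helper name:

// Rough approximation of a Block: restrict each atom to one month,
// then recompute the same features on the filtered transactions.
function featurize_for_month(atoms, month) {
    return atoms.map(function(atom) {
        var filtered = atom['transactions'].filter(function(tran) {
            return tran['date'] === month; // same test as filter_transactions
        });
        return {
            month: month,
            name: atom['name'],
            transaction_count: filtered.length,
            avg_transaction_amt: avg_transaction_amt({transactions: filtered})
        };
    });
}

var mayIndicators  = featurize_for_month(atoms, 'May');
var juneIndicators = featurize_for_month(atoms, 'June');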

With that sort of information, our Data Scientists can start to produce interesting and actionable insights as well as raise important questions. The typical transaction amount increased a lot from May to June; why is that? Why didn’t the third atom (‘George’) have any transactions in June? Now that the data is in this tabular format, deeper analysis and predictive modeling can begin.

Insights Gleaned from Featurization Lead to Better Data

Sometimes, the insights that come out of this process can also highlight improvements that would make our future models more effective. Does the transaction database have additional information, such as transaction type? If so, that might be really important for determining a correlation between typical amount and frequency. Integrating the new data would require additional Cleaning and Linking and then refinement in the featurizing process.
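
For instance, if each transaction gained a hypothetical ‘type’ field after that additional Cleaning and Linking, a refined feature function might look like this sketch:

// Hypothetical: assumes transactions now carry a 'type' field
// (e.g. 'purchase' or 'refund') after further Cleaning and Linking.
function purchase_count(data) {
    return data['transactions'].filter(function(tran) {
        return tran['type'] === 'purchase';
    }).length;
}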

The Value in Featurization

This step adds value to our process in a number of ways:

  1. Most importantly, featurizing bridges the gap between data in our Cleaned & Linked format to a format that our statistical tests and predictive modeling algorithms can use.
  2. Featurizing lets us approach the same problem consistently, across client projects. This consistency lets our team:
    • Develop tools that can be used across many projects, saving effort down the line.
    • Optimize around a use-case that will be repeated over and over. When we have to invest in hardware or specialized solutions, we know it will pay dividends on all current and future projects.
    • Collaborate more efficiently. Having a consistent style means that our programmers can easily pick up where others have left off, as though they themselves had written the code. In addition, our analysts and strategists know what to expect as outputs, enabling them to think more abstractly about your data.

Featurization is a necessary and impactful step in the overall process of predictive analytics. This step allows us to discover deep insights from your data. Stay tuned for next month’s blog post, which will explain how all the steps up to this point come together in the modeling & predicting stage.