Modern enterprise data—tracking key performance indicators like conversions or click-throughs—exhibits a pathologically high dimensionality, which requires re-thinking data representation to make analysis tractable. For instance, the Malicious URL dataset has over 3.2 million categorical feature columns with a high degree of sparsity. Typical approaches for supervised machine learning, e.g., a simple model such as a logistic regression with interaction terms, don’t function at all without adaptation, as the parameters alone would require over 80TB of data to store. Specialized representations, like compressed sparse row format, can be integrated with sparsity-aware procedures, such as those found in XGBoost. However, sparse representation still incurs significant runtime costs and requires adhering to a subset of modelling approaches, such as decision trees or field-aware factorization machines. We demonstrate a chromatic approach to sparse learning, which uses approximate graph coloring to significantly collapse dataset width. By identifying the structure of mutual exclusivity between sparse columns, we can collapse the categorical features of the Malicious URL dataset to 499 dense columns, opening it up to application of a much broader set of machine learning algorithms. Even on sparse-capable methods, such as XGBoost, the use of an equivalent dense representation alone yields a 2x training speedup without any performance loss.
– Hello, everyone, my name is Vlad Feinberg, I’m the head of Machine Learning at Sisu Data. And today we’ll be talking about chromatic sparse learning.
So we’ve got a full presentation here.
We’ll start with some data expectations as to why we’re solving this kind of problem.
The kinds of data that researchers work with in deep learning settings tend to be very dense, like images or text. We're gonna be focusing on a different kind of data today: large, sparse datasets, and we'll be looking at some binary classification tasks. This is the kind of data that you might generate from JSON blobs like the one on the right here. If you were to take this JSON blob and explode it into a large sparse vector, you'd end up with a lot of zero entries across a lot of different categories.
And one of the reasons we're focusing on this kind of data is because there's been a trend of seeing more and more of it in the enterprise. JSONB is being stored in databases more and more. Companies like Segment, Heap, and Fivetran help you collect this data very easily. And when you collect all this information, especially across different entity resolutions, and you join it together with one-to-many joins, you end up with a lot of set-valued features: things like tweets I liked, things I retweeted, people I follow. In this kind of setting the schema changes very often, it may not be fully known what the full schema looks like, and even if you had it, if you were to fully normalize it, it would end up being very large. And that all amounts to a lot of manual ETL work to prepare for machine learning.
So on a more formal level, the kind of data that we're gonna be working with is gonna be very sparse: there are only about a hundred to a thousand nonzero entries per row, even though the overall width of the row spans a very large number of columns.
And one of the ways to convert the data into this kind of sparse binary format is to apply one-hot encoding over categories, like taking the states and generating 50 binary columns, one for each state; converting small counts to unary features; and, if something is sparse and continuous, binning it first and then one-hot encoding the bins.
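As a rough sketch of what this preprocessing might look like (the feature names, bin edges, and cap below are invented for illustration):

```python
def one_hot(prefix, value):
    # categorical value -> a single binary indicator feature
    return {f"{prefix}={value}": 1}

def unary(prefix, count, cap=5):
    # small count -> "thermometer" features: count=3 sets bits >=1, >=2, >=3
    return {f"{prefix}>={i}": 1 for i in range(1, min(count, cap) + 1)}

def binned(prefix, value, edges):
    # sparse continuous value -> bin index -> one-hot of the bin
    b = sum(value >= e for e in edges)
    return {f"{prefix}_bin{b}": 1}

row = {}
row.update(one_hot("state", "CA"))
row.update(unary("login_count", 3))
row.update(binned("latency_ms", 120.0, edges=[50, 100, 200]))
# row is now a sparse dict of binary features: {"state=CA": 1, "login_count>=1": 1, ...}
```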
The typical approach for supervised machine learning in this kind of setting is to apply the Hashing Trick. The Hashing Trick is a transformation that applies a random linear mapping, phi here, from a large input dimension n to a smaller dimension d. The way it works is by randomly hashing input features into a smaller number of buckets and adding up the feature values that get hashed into a particular bucket. The keys from this large-dimensional space don't even need to be integers; they can be anything that you hash into one of these d smaller buckets.
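A minimal sketch of the Hashing Trick, using a signed 64-bit hash (blake2b here is just one stable choice, not the hash any particular library uses):

```python
import hashlib

def feature_hasher(row, d):
    """Map a sparse {feature_name: value} dict to a dense d-dimensional vector.
    A signed hash keeps inner products unbiased in expectation."""
    out = [0.0] * d
    for name, value in row.items():
        h = int.from_bytes(hashlib.blake2b(name.encode(), digest_size=8).digest(), "big")
        sign = 1.0 if (h >> 63) & 1 else -1.0
        out[h % d] += sign * value  # all keys colliding in a bucket sum together
    return out

v = feature_hasher({"state=CA": 1, "topic=sports": 1, "age_bin3": 1}, d=16)
```

Note that hashing only visits the nonzero entries of each row, which is exactly why the transform is so cheap on sparse data.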
Now, the reason this works, at least in the linear case, which is what we're gonna go through, is that if you're happy with a classifier on your high-dimensional space, which is just a dot product between your large coefficient vector and your input feature vector, then this random mapping actually preserves norms from the high-dimensional space to the low-dimensional one in a relative manner. And applying this to a couple of different vectors, it turns out we can show that the inner product between two vectors in the low-dimensional hash space is relatively close to the inner product in the high-dimensional space. As a result, we can train our classifier on the low-dimensional data and approximate a classifier that's working on the high-dimensional data, and that approximation lets us learn efficiently.
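To illustrate the inner-product preservation, here's a small self-contained simulation (a sketch, not the exact construction from the talk): averaging the hashed inner product over many random signed-hash maps recovers the original inner product.

```python
import random

def signed_hash_project(x, d, seed):
    """One draw of the random linear map phi: each coordinate gets a random
    bucket in [0, d) and a random sign, then values are summed per bucket."""
    rnd = random.Random(seed)
    buckets = [rnd.randrange(d) for _ in x]
    signs = [rnd.choice((-1.0, 1.0)) for _ in x]
    out = [0.0] * d
    for xi, b, s in zip(x, buckets, signs):
        out[b] += s * xi
    return out

x = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
y = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
true_ip = sum(a * b for a, b in zip(x, y))  # 2.0 in the original space

trials, est = 2000, 0.0
for seed in range(trials):
    px = signed_hash_project(x, 4, seed)
    py = signed_hash_project(y, 4, seed)
    est += sum(a * b for a, b in zip(px, py))
est /= trials
# est concentrates near true_ip even though d=4 is much smaller than len(x)
```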
Now, this kind of approach actually follows from a very long line of research, and the Hashing Trick turns out to be an incredibly efficient way of doing this kind of compression. In order to hash an input, you just have to go over all the nonzero terms in that input once, so it's very quick to apply. But on the other hand, there are some theoretical concerns.
In particular, the worst case for the Hashing Trick happens to be when you have incredibly sparse data, at least in a theoretical sense.
If you had a bag-of-words representation, you would need sentences to be of length at least k in order for the relative norm, that crucial property that lets us learn in the hash space, to be preserved with fidelity one over k. And even though this is the case theoretically, the Hashing Trick is widely used in practice: you've got Spark with its FeatureHasher, and Vowpal Wabbit, a frequently used machine learning package, all of which have been applied successfully in large-scale sparse settings. So it's going to be a strong baseline for us to compare to.
Now, what we'll look at is optimizing for this particular case where you have very few nonzero entries in every single training example. What we would like to do is create a method that can scale well given this kind of structure. Can we do this in a way that doesn't impact classifier quality? And can we do this in a way that scales well with the number of parallel processors we have? So we're going to propose a method that solves this kind of data-dependent problem using graph coloring and some recent feature compression techniques.
This builds on related work from recent years. What really lets this happen is structure we can take advantage of: the fact that there are very few nonzero entries per row. At the end of the day, we're exploiting the fact that the graph we're going to construct here has a very small chromatic number compared to the number of features we're working with. Like all data-dependent properties, this is something we can use only in certain cases; it's advantageous over the Hashing Trick precisely because the data looks like this. The Hashing Trick is oblivious, and it's more powerful in general settings, but this approach is very applicable to a lot of the data that we see in the enterprise today.
So, let's dive into what this method looks like.
At the top level, what we're going to do is construct the co-occurrence graph G, the graph where features are the vertices and there's an edge between two features if they both occur in the same training point. This graph is something that we're going to color, in the graph coloring sense, and we're gonna create a dataset which has the colors themselves as the features. That chi-of-G dataset is gonna be very narrow, because we'll need very few colors to color this graph. After that, we're going to expand this data frame a little bit: compressing these color features and then one-hot encoding them makes them applicable to different classifiers. So let's dive into some details.
The first observation we're gonna make here is that in this kind of low-nonzero setting, features tend to be mutually exclusive. Here's a hypothetical medical survey: if we ask people when their most recent surgery was, they might respond to a multiple-choice question that's encoded with three binary columns. Because only one of these three columns can be set, they're all mutually exclusive; no row will have more than one of them set. In addition, the generative process for our data, the survey here, might have some control flow. In particular, there might be a section that you only answer if you're over 65 years old, so all of the questions there will have no answers whenever the answer to "are you under 65 years old?" is true. That's going to create some mutual exclusivity. On top of this, there might be natural dichotomies in the data, such as these two questions here, where there are people who are currently employed and people on COBRA insurance. So given all this, we can capture this structure with a graph where the vertices are the columns corresponding to answers to these questions, and edges occur whenever two columns have nonzero values in the same row.
So let's go through a more concrete example of how this graph gets built. We have here a pretty small data frame with six rows, and what we've done is create a vertex for each feature, so each of the different values that our categorical variables take on has an associated vertex. You'll notice that we have a row here with both minimally invasive hip surgery and Kaiser insurance, and as a result, we've put an edge between those two vertices. We went ahead and did this for the rest of the data frame, and we end up with this graph with two connected components and three edges between all the vertices.
Now, we're going to take this graph and color it. In other words, we're gonna assign each vertex some color such that adjacent vertices are never colored the same; that's just the definition of a proper coloring. Coloring in general is an NP-complete problem, but because we're working with sparse graphs, greedy strategies like greedy coloring work really well in practice, and they don't use many more colors than optimal.
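A sketch of building the co-occurrence graph and greedy-coloring it; the feature names mirror the survey example but are invented:

```python
from collections import defaultdict

def cooccurrence_graph(rows):
    """rows: iterable of sets of active feature names. Adds an edge between
    any two features that occur together in some row."""
    adj = defaultdict(set)
    for feats in rows:
        feats = sorted(feats)
        for i, u in enumerate(feats):
            adj[u]  # make sure isolated vertices show up too
            for v in feats[i + 1:]:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def greedy_color(adj):
    """Give each vertex the smallest color unused by its neighbors,
    visiting high-degree vertices first."""
    color = {}
    for v in sorted(adj, key=lambda u: -len(adj[u])):
        taken = {color[n] for n in adj[v] if n in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

rows = [{"hip_surgery", "kaiser"}, {"under_65"}, {"knee_surgery", "kaiser"},
        {"cobra"}, {"employed"}]
coloring = greedy_color(cooccurrence_graph(rows))
# "kaiser" conflicts with both surgeries, so two colors suffice here
```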
Now, given this coloring, we can construct categorical variables corresponding to the colors assigned to the features. Here we have the graph from before, now colored in red and blue, and categorical variables constructed from those colors. As we described, the first row only had the under-65 feature set to a nonzero value, and because that feature was red, we put it in the red column for that row. This kind of encoding fully represents the data frame that we constructed the graph from: for any given row, we placed edges between the features of that row, so any feature that occurs in a given row is guaranteed to have a different color than all of the other features in that same row. As a result, we can always represent rows in this chromatic representation, because we colored the graph generated from those rows.
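Here's roughly what that chromatic recoding of a row might look like; the coloring dict below is hardcoded for illustration, not produced by a real run:

```python
# Hypothetical proper coloring of the survey features (0 = red, 1 = blue)
color = {"under_65": 0, "hip_surgery": 0, "knee_surgery": 0,
         "kaiser": 1, "cobra": 1, "employed": 0}

def chromatic_encode(row_features, color, num_colors, absent="<none>"):
    """Collapse a set of active binary features into one categorical value per
    color column. A proper coloring guarantees no two active features in a row
    share a color, so no slot is ever written twice."""
    dense = [absent] * num_colors
    for f in row_features:
        assert dense[color[f]] == absent, "coloring was not proper for this row"
        dense[color[f]] = f
    return dense

encoded = chromatic_encode({"hip_surgery", "kaiser"}, color, num_colors=2)
# one categorical value per color column instead of millions of binary columns
```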
This doesn't actually finish our conversation; all we've done so far is reverse the one-hot encoding of the data frame we started with. This is actually a really great performance optimization for gradient boosted trees, because the training process for GBDTs scales with the number of columns you have, and what we've just done is take a binary data frame and shrink it into a much narrower categorical data frame, where the colors have fairly high cardinality over all of the different features we started with. So if we were just training gradient boosted trees, we would have a great performance win here. But if we wanted to apply this technique to linear models, Gaussian processes, neural nets, or factorization machines, the typical approach would be to one-hot encode the categorical features we're working with. Unfortunately, that would undo all the work we just did, so we're gonna need to think a little harder about how to adapt this method.
So, the thing we'll turn to is feature compression. A recent line of work in information theory has found an approach for preserving, in a much coarser quantized variable, as much information as possible between a binary label and a categorical variable that has some joint distribution with that label. The variable names are flipped here, because it's a different field: in this case we have X as our binary label, and it has some relationship with Y, our features, which can take on M different values; it's very high cardinality. What we can do with the algorithm proposed here is compress the M values of Y into far fewer values, K, for Z, while maintaining as much information between X and Z as possible.
And just last year, there was a work that found a much faster algorithm for computing this new derived Z, so this is what that looks like. We take our high-cardinality categorical feature, which here has ten different values, A through J, and line up its values in ascending order of conditional probability, the probability of our label equaling one given the value. What we're going to do is quantize it: take the many different categories and compress them into just four, which are the values Z can take on. It turns out that the optimal quantization, the one preserving the most mutual information between Z and our label, always groups values contiguously in this ascending order. So values A, B, and C all become the grinning face; D through F the upside-down face; G and H the star eyes; and I and J the crazy face. What we've done is compress many different values into a single one; we lose some information, but we do it so as to lose as little as possible. And thanks to this recent work, it's actually very efficient to do.
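The cited works give faster algorithms; as an illustrative sketch only, the optimal contiguous grouping can be found with a simple O(k·M²) dynamic program over categories pre-sorted by conditional probability:

```python
import math
from itertools import accumulate

def _segment_mi(pos, neg, tot_pos, tot_neg):
    # Contribution of one group z to I(label; Z): sum_x p(x, z) * log(p(x|z) / p(x))
    n, total = pos + neg, tot_pos + tot_neg
    if n == 0:
        return 0.0
    out = 0.0
    for cnt, marg in ((pos, tot_pos), (neg, tot_neg)):
        if cnt:
            out += (cnt / total) * math.log((cnt / n) / (marg / total))
    return out

def quantize(counts, k):
    """counts: per-category (positives, negatives), sorted by ascending
    conditional probability of the label. Returns the right boundaries of the
    k contiguous groups maximizing mutual information, plus that value."""
    m = len(counts)
    tp = sum(p for p, _ in counts)
    tn = sum(n for _, n in counts)
    pp = [0] + list(accumulate(p for p, _ in counts))  # prefix sums
    pn = [0] + list(accumulate(n for _, n in counts))
    seg = lambda i, j: _segment_mi(pp[j] - pp[i], pn[j] - pn[i], tp, tn)
    best = [[float("-inf")] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for end in range(g, m + 1):
            for start in range(g - 1, end):
                cand = best[g - 1][start] + seg(start, end)
                if cand > best[g][end]:
                    best[g][end], back[g][end] = cand, start
    bounds, end = [], m
    for g in range(k, 0, -1):
        bounds.append(end)
        end = back[g][end]
    return bounds[::-1], best[k][m]

# ten synthetic categories "A".."J" with ascending P(label=1 | category)
counts = [(i, 9 - i) for i in range(10)]
bounds, mi = quantize(counts, 4)  # right edges of the four groups
```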
The challenge is that we don't have just one categorical variable, but many. Luckily, this is not so hard to surmount: all we need to do is line up our different categorical variables in a row and solve the sum of all of these coarsening optimizations. As a result, we can get a very small dictionary for our original high-dimensional input, because we have this categorical representation of our data frame.
So given this coarsening, we've been able to compress the original data very effectively, and that's only something we could do because of the categorical representation we built out of the graph coloring. One warning about applying this method blindly: in order to compute these conditional probabilities, you actually need to look at the label. This poses a risk of target leakage; because you're cheating and looking at the label ahead of time, you can't reuse that sample for training your machine learning model. So it's important to split your dataset: one half to do this compression, and the other half for learning.
Now, why would we expect this full procedure to work? There were a lot of moving parts put together. At a high level, because our graphs are so sparse, we should expect that they will require very few colors to fully color, and as a result we'll get really great compression, taking this sparse binary matrix and making it into something very narrow and high cardinality, which is a very good input for our feature compression algorithm. In particular, looking to random graphs for inspiration, they take very few colors, about linearithmic in the average degree, to color fully.
So let’s take a look at what kind of datasets we’ll be evaluating on.
The first one is going to be the URL dataset. The kind of features that this has is various lexical and host-based information about the website. And the goal is to predict whether or not the website is malicious.
KDDA and KDDB contain inputs that capture interactions with educational software, so lots of different categorical values here, and the output to predict is whether the student interacting with this learning software got the answer right on their first attempt. And then finally, we're gonna look at a more classical internet use case, where you're trying to predict whether or not a user will follow a recommended topic based on, again, very set-valued categorical information about that user: who they follow, what topics they subscribe to.
This is what the general dimensions of the datasets look like. So we're dealing with millions of rows and millions of columns, especially sparse columns; the dense features we just set aside and combine back at the end of our dimensionality reduction process. Luckily, though, when you actually construct this co-occurrence graph, its degree is very small, and when you color this very sparse graph, the resulting number of colors is correspondingly small, as we'd expect from looking at the random model. The result is that something that took millions of features to represent initially can now be represented with hundreds of categorical variables, and that's gonna let us perform very effective compression.
So let’s take a look at what we have to do to actually get to those numbers.
At the top level, we start by working with records that are essentially just feature-value pairs, newline-separated. To construct the co-occurrence graph from this kind of input, each record needs to be mapped to a co-occurrence clique, because every feature in a record co-occurs, by definition, with every other feature in that record: it has an edge to all of them. So each record generates a complete graph over the features inside it, and we need to take the union of these complete graphs to get the full co-occurrence graph for the dataset. Once we have this graph, we can color it pretty quickly as a data-independent step. After that, we need to collect summary statistics to come up with the conditional probabilities for our different features, run the feature compression algorithm, and then recode the data: for every given feature, we look up its color, and then look up the compression quantization for that particular color. All of that maps into a very small set of final features, and that's the narrow dense dataset we have at the end.
Along the way, we had to solve a couple of issues to get this to work. One is the memory required to transform all of the strings that represent the features we started with into indices that we can work with internally, when we're representing our graphs and our features for feature compression. This is solved by taking the Hashing Trick and using it for a systems purpose: all we have to do is take the hash of the key, just a 64-bit hash, and use that as our feature key. By a birthday-paradox calculation, we don't actually need to worry about collisions happening on hashes this large, at least with high probability. This lets us get around maintaining a large dictionary from feature strings to their indices, which is actually a problem called out in the Spark docs, so maybe an approach like this could be helpful in other settings.
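A sketch of this "systems Hashing Trick" (the feature strings below are invented, and blake2b is just one deterministic 64-bit hash choice):

```python
import hashlib

def feature_key(name: str) -> int:
    """Stable 64-bit key for a feature string, replacing a string-to-index dict."""
    return int.from_bytes(hashlib.blake2b(name.encode(), digest_size=8).digest(), "big")

# Birthday-paradox check: with n distinct features, the collision probability
# is roughly n**2 / 2**65; for n = 10 million that's about 2.7e-6.
k1 = feature_key("host=example.com")
k2 = feature_key("path_token=login")
```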
After that, the toughest part was actually constructing the graph in the first place. These cliques are fairly large, they're all complete graphs, and, if you recall, we want to compute the union across all of them. Union is a commutative and associative operation, so map-reduce is a good gut instinct that some of you might have for how to solve this problem: we have a lot of rows, and we want to combine the graphs associated with each row. Unfortunately, when we tried doing this, there was a serial merge at the end. Because you're reducing in a tree-like pattern across all of your processors, the final reduce has to happen on one thread, since that's what delivers the final dictionary. And that merge at the end takes almost half of the total runtime, so you get into this very unfortunate situation where map-reduce doesn't actually help you scale your algorithm too well.
Taking a closer look at this, it's clear why: in the limiting case where all of our individual chunks of work essentially end up creating the same graph, you're going to use P times E memory in total, where you have P processors and E edges, and the lower bound on the runtime scales with the logarithm of the number of processors because of the tree depth of the reduce. So an alternative approach for computing the union of these edge sets is a sharded union via a hash join. We split the space of edges into P parts corresponding to the hash space of the edges. Once a mapper generates the edges associated with an example, it sends each edge to the corresponding shard, one of the P shards that split up the hash space, and the writer on that shard unions the edges belonging to its part of the hash space. What you get as a result is P disjoint sets of edges, where each writer deduplicates all of the edges it receives. So you never actually use more than E memory (you need some memory for local buffers, but that's about it), and the runtime doesn't require a serial merge at the end; it perfectly parallelizes across the mappers. When we ran this in practice, we noticed dramatic reductions in runtime and memory consumption, here on the URL dataset, which is actually the smallest dataset, showing that this approach improves over the tree-reduce mechanism.
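Here's a single-process sketch of the sharded-union idea; in the real system the mappers and shard writers run in parallel, but the routing and deduplication logic is the same:

```python
from itertools import combinations

def edge_shard(edge, num_shards):
    # hash-partition the edge space; each writer owns exactly one shard
    return hash(edge) % num_shards

def sharded_edge_union(records, num_shards):
    """Each record (a set of features) yields a clique; every edge is routed by
    hash to one shard, whose owner deduplicates locally. Because the same edge
    always lands on the same shard, the shards stay disjoint and no serial
    merge is needed at the end."""
    shards = [set() for _ in range(num_shards)]
    for feats in records:  # mappers would process records in parallel
        for edge in combinations(sorted(feats), 2):
            shards[edge_shard(edge, num_shards)].add(edge)
    return shards

records = [{"a", "b", "c"}, {"b", "c"}, {"d", "a"}]
shards = sharded_edge_union(records, num_shards=4)
total_edges = sum(len(s) for s in shards)  # the deduplicated union
```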
And then the final technique we use addresses, if you recall, that double-dipping concern. In order to split the dataset without actually creating two datasets, what we did is hash each individual training example and use the final bit of that hash to distinguish whether we're on the half of the data we're using to estimate conditional probabilities, or on the half we're using to train our classifier.
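A sketch of that hash-bit split (the row format below is invented; any stable hash of the raw record works):

```python
import hashlib

def is_stats_half(record_line: str) -> bool:
    """Route a row by the low bit of its hash: one half estimates conditional
    probabilities for feature compression, the other half trains the model.
    Deterministic, so no second copy of the data is ever materialized."""
    digest = hashlib.blake2b(record_line.encode(), digest_size=8).digest()
    return digest[-1] & 1 == 1

rows = [f"feature_a:1 feature_b:{i}" for i in range(1000)]
stats = [r for r in rows if is_stats_half(r)]
train = [r for r in rows if not is_stats_half(r)]
# roughly a 50/50 split, with no row in both halves
```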
So let’s take a look at what this resulted in end-to-end.
For starters, we took a look at some linear models. You wouldn't actually need to reduce dimensionality much for linear models if you're serving off of large servers; it's still useful in more memory-constrained settings, but this is mostly to assess the dimensionality reduction techniques themselves and see how well they preserve information. One thing we noticed is that, across the board, both the Hashing Trick and chromatic learning improve when you give them more features. But there's still a big rift in terms of how much you can get out of a given feature budget: chromatic learning reaches much higher final classifier accuracy, even in very compressed settings, when you use the chromatic learning plus feature compression approach. The x-axis here, which describes the number of bits necessary to index into the feature vector, is the log base 2 of the number of features our classifier has; at the very end here we're working with 2 to the 14 features, whereas at the very beginning it's just 2 to the 8. So with only around 256 features, for most of these datasets, chromatic learning captures the essence of what's necessary to do supervised learning.
To assess speed, we compared the end-to-end pipeline plus training on a gradient boosted tree that doesn't use the optimization I mentioned earlier, the graph coloring optimization. XGBoost is a very commonly used library, and it doesn't have this optimization built in, so this was a way of assessing how effective the dimensionality reduction is in terms of speed. What we noticed is that, compared to using the sparse routines built into that library, you get really large improvements in overall training time with chromatic learning. And a curious little piece of insight here is that you actually notice some improvement in accuracy as well. I'm not too sure exactly why this happens; I think it's because the coarsening that feature compression does acts as a sort of regularization, since you can't overfit on very infrequent features anymore. But I think the win is pretty clear across all of our different datasets.
And then we also took a look at neural networks. This is an approach that wouldn't really have been accessible for these kinds of datasets beforehand: when you're dealing with something like KDD12, which has over 50 million features, you can't even represent a mini-batch on most GPUs, even a V100 with 16 gigabytes of memory, because that initial feature vector just takes too much memory to represent. But if you can compress the data down to 1024 columns, then all of a sudden, even if those columns have many more values, you can represent the embedding vectors for all of these categorical values on lower-memory GPUs. And if you compare to other approaches that reduce the number of features you work with, like the Hashing Trick, and FT here, a frequency-based truncation where you first take away the features that occur very infrequently,
this approach of chromatic learning plus feature compression across the board retains significantly more information from the covariates, letting you predict better on the test set.
So, all together: when your graphs are sparse, you can compress them very well with graph coloring. What we've noticed is that you can get dramatic reductions in dataset width, and that brings advantages in terms of speed and overall input dimension, making things easier to train, but also, in some cases, actually improving your end learning performance; especially compared to other baselines for reducing the dimensionality of your data, chromatic learning found better final test performance. Along the way, in order to implement this, we went through a couple of systems techniques that made it all possible, and these are things that I think the Spark audience here can take with them to other settings: the systems Hashing Trick for reducing the memory of representing feature strings, sharded unions for a reduce with much less serial overhead, and finally an online data splitting approach that lets you create two datasets without requiring the memory for two separate copies. To conclude: not only is this a pretty successful method for dimensionality reduction, but it also gives us this co-occurrence graph, which I think is actually pretty interesting to investigate further. There's a lot of opportunity to explore how this graph interacts with learning in other settings, and now that we can actually generate it, it's an exciting opportunity to look deeper.
Vladimir Feinberg is the Head of Machine Learning at Sisu Data, where he leads the investigation and development of algorithms for large-scale streaming structured data. Prior to Sisu, Vlad was a graduate student in the UC Berkeley Ph.D. program, advised by professors Michael I. Jordan, Ion Stoica, and Joseph E. Gonzalez, focusing on systems and machine learning. Vlad graduated from Princeton University, working with Barbara Engelhardt on Gaussian process estimation procedures and Kai Li on 3D convolutional neural net optimization.