Many powerful Machine Learning algorithms are based on graphs, e.g., PageRank (Pregel), recommendation engines (collaborative filtering), text summarization, and other NLP tasks. The recent developments around Graph Neural Networks connect the worlds of graphs and Machine Learning even further.
Considering data pre-processing and feature engineering, which are both vital tasks in Machine Learning pipelines, extends this relationship across the entire ecosystem. In this session, we will investigate the entire range of Graphs and Machine Learning with many practical exercises.
Jörg: Thanks so much for joining this session about graph-powered machine learning. Just in case you’re still wondering why you should stay tuned: Graph ML is actually the future of ML. If we look at the outcomes of large enterprises such as Uber, Google, LinkedIn, and others, they’re currently getting a lot of their breakthrough results with graph-powered machine learning. Just as an example here on the left, DeepMind actually improved the Google Maps ETA prediction by over 30% by leveraging graph-powered machine learning. Other breakthroughs, for example in computational chemistry and drug discovery, are also currently often driven by graph-powered machine learning. And if that’s not enough, if you also look at Gartner’s predictions, they actually see it as one of the big trends in data and analytics for ’21, and see it in over 50% of their inquiries surrounding that. I’m Jörg, I’m the head of engineering and machine learning at ArangoDB, a scalable graph database, and basically my entire career has been switching back and forth between building large-scale distributed machine learning pipelines and databases.
So I’m currently quite happy that I can actually combine both in one role, and that I’m able to talk here at this conference. What will we cover today? The spectrum is rather large and probably way too much for a half-hour talk, so we’ll focus on these aspects: we’ll build up a little bit of graph insight, talk briefly about graph databases, then dive into what we can do with standard graph analytics, and then go further into actual graph machine learning research, with graph embeddings and then graph neural networks. In the end, I’ll also provide a lot more pointers; as you can assume, this talk can give you an overview, but unfortunately we cannot go too deep into any of these topics due to time.
So what is Graph ML, and what are all those different steps? If I had to classify Graph ML by the complexity of queries, and the systems addressing them, I would say the basic, atomic unit is a graph query. This is something you could ask of any graph database out there. For example, imagine here on the right we have a social network graph, which could be LinkedIn, Facebook, or any other social interaction graph. If you ask a simple question, “Who can introduce me to X?”, this is a very simple graph query, basically finding a path between two nodes in the graph. The next level of complexity is where we start analyzing the graph and trying to get insights out of it. A question here could be, for example, “Who is the most connected person in that graph?”
So who are the important influencers I should talk to if I want to spread a message or update people? Other questions could be, for example, “Where are disjoint sub-clusters? Where are disjoint sub-communities?” So if I had to classify what graph analytics is: I’m trying to analyze over the entirety of the graph, whereas with graph queries I often start with just something local, the connection between two nodes. Then the next level of complexity, the 2022 level, is graph machine learning. This is where we start to combine machine learning models together with graphs, and we can answer very powerful questions with that. Just imagine, for example, you can ask your graph, “Hey, what are potential connections in this social network graph which are not really in the data, but are likely given the information we have here?”
And this is something that you probably know from LinkedIn or Facebook, where you always get those recommendations: “Hey, you might want to talk to X, you might want an introduction to Y.” This is what we can solve by having a machine learning model trained for such predictions together with the graph data at hand. That would have been an example of link prediction, or relation prediction between nodes. Another example would be node feature prediction. A simple question could be, “Who is likely to churn?” Imagine your entire group of friends has just moved away from a network provider, from a certain service: that probably increases the chances of you churning as well. We’ll see how that differs from more traditional machine learning, if you view it in the graph context, on the next slides.
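To make the first level concrete, the path query from earlier, “Who can introduce me to X?”, can be sketched as a breadth-first search over adjacency lists. This is just a minimal illustration with a made-up toy network, not how a graph database would execute such a query internally:

```python
from collections import deque

def find_intro_path(graph, start, target):
    """Breadth-first search for a shortest introduction path in a social graph."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection found

# Hypothetical social graph as adjacency lists
network = {
    "me": ["alice", "bob"],
    "alice": ["me", "carol"],
    "bob": ["me"],
    "carol": ["alice", "x"],
    "x": ["carol"],
}
print(find_intro_path(network, "me", "x"))  # -> ['me', 'alice', 'carol', 'x']
```

Because BFS explores the graph level by level, the first path found is also one with the fewest hops, which is exactly what you want for an introduction chain.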
So with that, we can solve a wide array of other problems. As we already saw, often we want to see how certain things are connected, and those things can also be events. Just imagine the make pipeline for building your software project: in that sense, that is also a graph, a dependency graph. We already covered the question, “Who is important, who is influential?” And if we extend that, we can also try to identify unusual behaviors in that graph, i.e., anomaly detection: is someone currently trying to break into that system? Simply on interaction graphs, where we know what the expected behaviors are, because that is what we see all over the graph, very localized anomalies might be a sign that fraud or some new trend is going on there.
Also, if we look at knowledge graphs, it’s a lot about what information can be derived in terms of semantic inference. This is another big topic. And we can also ask questions about entire graphs: are they actually the same? This comes up, for example, if we are looking at protein or molecule structures, where we want to predict, “Are they the same, or are they at least similar molecules?” As discussed before, we’ll treat that in two different areas. First we’ll be concerned with graph analytics, which is basically trying to answer those questions from a graph: community detection, recommendations, path finding, and others. Then in the second part, we’ll go a bit more into detail and really try to learn new features about a graph. That could be, as mentioned, node classification, link prediction, or also classification of an entire graph.
How is graph machine learning different from, let’s call it, traditional machine learning as we probably all know it, such as standard TensorFlow and other tools? A default and very common assumption (actually not true for all machine learning models, but very common in traditional machine learning) is that individual data points are independent and identically distributed. So if we look at this graph of persons here on the right, it would mean that we treat each of those persons independently. In a graph, we actually have a lot more assumptions about how the neighborhood influences what we are trying to predict. We do not see individual data points as independent; we see them as dependent on their neighborhood. And there are a number of common assumptions about that.
One would be homophily, which is basically the assumption that neighbors are similar. So if a node in a graph has a certain label or property, it is very likely that the neighboring nodes have a similar property. And surprisingly, this is actually the case for many real-world datasets, such as social graphs and many others. Furthermore, there is often the assumption that we have structurally equivalent nodes: for example, really well-connected influencer nodes, which basically perform similar roles throughout an entire network. So they are not independent, as they serve similar roles within the network. And the last assumption we could make, which is pretty much the opposite of the first one, homophily, is heterophily. There we actually assume that nodes are different from their neighbors. One case where this often comes up is, for example, in social networks or dating sites, where gender might be different between neighboring nodes.
Then again, just briefly coming back to this learning path towards Graph ML, or also bridging the gap between graph databases and machine learning: I would actually see that as a continuum, but working for a graph database company, this is a question I get asked all the time. I would say the role of a mature graph database is to support graph analytics and Graph ML as well, but as mentioned, it often really starts out with simple graph queries, where we try to identify very explicit patterns, and then moves up to more complex graph algorithms, for example finding a shortest path. What is different about finding the shortest path is that we actually have to keep state throughout that computation. So it’s not just looking at something local, “Can I find a path?”; we actually need to keep state here.
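The state-keeping point can be illustrated with a weighted shortest-path computation in the style of Dijkstra’s algorithm, where a distance map is carried through the whole traversal. This is only a sketch over a made-up weighted graph:

```python
import heapq

def shortest_path_length(graph, start, target):
    """Dijkstra's algorithm: the `dist` map is the state we keep
    throughout the computation, unlike a purely local path lookup."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")  # target not reachable

# Hypothetical weighted graph: node -> [(neighbor, edge_weight), ...]
g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(shortest_path_length(g, "a", "c"))  # -> 3, via b, not the direct edge of weight 4
```

The direct edge a-c has weight 4, but the algorithm's accumulated state lets it discover the cheaper two-hop route through b.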
Next, as mentioned, is graph analytics, where we try to really get insights from an entire graph, such as identifying sub-communities. And the last step is Graph ML, where we actually combine graphs and machine learning models to learn new facts about a graph. The output often is actually also a new graph, with, for example, those predicted missing links we had in the first example.
The challenge in this field is that we often have different options for even similar queries. To make it concrete with the graph queries we mentioned earlier, let’s take a very common dataset, the Cora citation network. It’s basically a network of papers citing each other. An easy graph query would be, “Who cited paper X?”, because that is basically looking up the direct neighborhood of the node in that graph. Then, from a graph algorithms or graph analytics perspective: “What are the most cited papers?” and “What are common sub-communities?”
As you might guess, papers are cited differently. I will probably cite more people in the same field, and often I will also cite people in my direct neighborhood, for example at the same university. We can debate whether that’s good from a research perspective, but it’s often discoverable in real-world citation networks. Then, if we actually have a trained model and can apply Graph ML, we could do tasks such as predicting [inaudible], predicting missing citations. For example, “Hey, you’ve cited all those papers, but most likely you should also look at paper X, because it’s working in a very similar field.” Or also predicting node label features, as we can see here, for example class labels for a certain paper, just based on the citation graph.
If we just look at one of those tasks, the last one we saw, predicting labels for a certain paper: already there, we can deploy a number of different techniques. Probably one of the easier ones is label propagation, where each of the neighboring papers, so the cited papers, would send a message with their own labels to that paper, and then an aggregation algorithm takes all the incoming messages and picks the most common class label received from all the cited papers. We could also build a graph convolutional network to predict that. So basically we have the network predicting a new graph which would have those class labels, and surprisingly, internally it often works with similar techniques, such as this message passing between neighboring nodes. And the last option is that we could actually get an embedding of this graph, meaning we turn the individual nodes into vector representations.
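The first of these techniques, label propagation, can be sketched in a few lines: each unlabeled paper adopts the majority label among its cited neighbors. The citation graph and labels below are made up for illustration:

```python
from collections import Counter

def propagate_labels(graph, labels, unlabeled):
    """One label-propagation step: each unlabeled node adopts the most
    common label among its neighbors (its cited papers, in the Cora example)."""
    new_labels = dict(labels)
    for node in unlabeled:
        neighbor_labels = [labels[n] for n in graph.get(node, []) if n in labels]
        if neighbor_labels:
            new_labels[node] = Counter(neighbor_labels).most_common(1)[0][0]
    return new_labels

# Hypothetical citation graph: paper -> list of cited papers
citations = {"p4": ["p1", "p2", "p3"]}
known = {"p1": "neural_nets", "p2": "neural_nets", "p3": "theory"}
print(propagate_labels(citations, known, ["p4"])["p4"])  # -> 'neural_nets'
```

In practice this step is iterated until the labels stabilize, but even a single pass shows the message-passing idea: labels flow along edges and are aggregated at each node.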
Once we have such an embedding, if you have any experience with more traditional machine learning tools, we can feed it into any traditional machine learning technique, such as support vector machines or even a deep neural network, and then try to predict those labels based on the vector embeddings created from that graph, i.e., the node representations. So this is part of what makes it so hard. The other aspect that makes it hard, I think, is that we are often not trained to think in graphs. We actually leverage graphs all the time when we draw something on a whiteboard or sketch a problem, but when trying to solve a problem algorithmically, it’s often very challenging for us to transform it into a graph problem. So let’s take a very simple example of thinking in graphs: collaborative filtering.
Collaborative filtering again relies on this assumption of homophily: people in your network are similar to you. So based on what they have bought, if we classify them as similar to us, we get recommendations for what we might be interested in as well. Being in pandemic times, probably a good thing to figure out is, “What do we want to watch tonight?” And probably we’re interested in new stuff, because we have watched all our standard recommendations by now. So how can we build a recommendation system for movies based on a graph question? If we look at IMDb, we can express it rather easily as a bipartite graph. In a bipartite graph, we have two disjoint sets of nodes: here we have the set of user nodes and the set of movie nodes, and the edges in between always go from one node type to the other.
So in the IMDb example, a user would rate a movie, but a movie wouldn’t rate another movie, right? Edges only go from one node type to the other. If we then state collaborative filtering as a graph problem, the question is, “I want to find highly rated movies by people who are like me, so who also like the movies I rated highly.” If we go through that, at the first stage we find all the movies: we go from my own user node in that bipartite graph to the set of movies which I find very cool and have rated very highly in the past. Then I basically go back, and I want to find the set of users who are similar to me, i.e., the set of users who also rated those movies very highly.
So I’m basically traversing back along this path to the user side of the bipartite graph, and I have a set of users who, based on those ratings, are similar to me. And then the last step is just to go back to the movie side and find additional movies they rated very highly. This is basically how we can turn such a recommendation problem into a graph problem by thinking about it from a graph perspective. And if we look at what other people, like for example Uber Eats, are doing: they have a great blog post about how they actually deploy graph machine learning to do even more advanced product, restaurant, and dish recommendations.
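The three-hop traversal just described can be sketched directly over a made-up ratings table (users, movies, and ratings are all hypothetical):

```python
def recommend_movies(ratings, me, threshold=4):
    """Collaborative filtering as a three-hop bipartite graph traversal:
    hop 1: me -> my highly rated movies
    hop 2: those movies -> users who also rated them highly
    hop 3: those users -> their highly rated movies I haven't seen."""
    def liked(user):
        return {m for m, r in ratings.get(user, {}).items() if r >= threshold}

    my_movies = liked(me)                                # hop 1
    similar_users = {u for u in ratings                  # hop 2
                     if u != me and liked(u) & my_movies}
    recs = set()                                         # hop 3
    for u in similar_users:
        recs |= liked(u) - my_movies
    return recs

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}
ratings = {
    "me":    {"Matrix": 5, "Inception": 4},
    "alice": {"Matrix": 5, "Tenet": 5},
    "bob":   {"Cats": 2, "Tenet": 1},
}
print(recommend_movies(ratings, "me"))  # -> {'Tenet'}
```

Alice shares a highly rated movie with me, so her other favorite gets recommended; Bob rated nothing highly, so his movies never enter the result. A production system like the Uber Eats one would weight the hops and learn the similarities rather than using a fixed threshold.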
With that, let’s move on to the second part, and that would be graph embeddings. One of the challenges is that a graph is actually unstructured data, right? We cannot easily put it into vector form and then feed it into a neural network or other machine learning techniques, because those machine learning techniques really like numbers; they really like vectors or even tensors of numbers. So the challenge is: how can we turn something unstructured, such as a graph, or also words, which also don’t easily transform into a simple number, into a meaningful embedding? And what do we mean by meaningful embeddings? In the case of words, we don’t want to map each word just to a random number, because we would lose a lot of information. For example, the relationships between king and queen, or man and woman, or Spain and Madrid, Italy and Rome: these pairs are very similar, because each models a certain relationship between the two words.
If we just mapped them into a random space, we would basically lose all those connections. So the challenge is: how can we train an embedding which helps us maintain that semantic information about the relations in between? For words, this was done by Word2vec. Due to time, I cannot go into too much detail, but I included a number of links in the slides, so feel free to follow them. It actually works very similarly in the graph case, where a common first approach was called node2vec. The way we do it with node2vec: we first want to create sentences we can train on. So what we do first is sample random walks on our graph. Here, for example, we have random walks of length four, starting from different nodes.
By that, we basically have linear chains of node sequences, and that makes it much, much easier, because now we can treat it as a structured problem. We then train a dummy, so-called skip-gram network: from the first node, we try to predict all the surrounding nodes in those random walks. And by that we train this dummy network; it actually just has one hidden layer. As I said, it’s a dummy network: in the end we don’t really care about the network at all, but what we care about is the hidden layer in the middle, because that hidden layer will actually give us the embedding. We can choose how many dimensions it should have; usually it has a much lower dimension than the overall dictionary. And it works because similar words appear in similar neighborhoods.
For example, imagine again capital and country: they will often be used in similar sentence structures if we look at the word case, and the same holds for the graph case. So we can leverage that hidden layer as a computed embedding, and often just by doing that, we already get really good outcomes. This is one of the surprising results of graph machine learning, that we actually have some so-called inductive bias. This usually comes from the assumption of homophily, that nodes within the same neighborhood have very similar properties. And as this is the case for many real-world datasets, just by training the skip-gram network, which doesn’t have too much detailed information and is just a very shallow network, we already get pretty good embeddings, which we can then use for further machine learning prediction tasks.
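The random-walk sampling step can be sketched as follows. Note this is the simpler uniform (DeepWalk-style) sampling; real node2vec additionally biases the walks with its return and in-out parameters p and q, and the resulting walks would then be fed as "sentences" into a skip-gram trainer such as gensim's Word2Vec:

```python
import random

def sample_walks(graph, walk_length=4, walks_per_node=2, seed=42):
    """Sample uniform random walks; each walk is a 'sentence' of node ids
    that a skip-gram model can later be trained on to produce embeddings."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break  # dead end, stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy graph as adjacency lists
g = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
walks = sample_walks(g)
print(walks[0])  # a length-4 walk starting at 'a', depending on the seed
```

The key point is that the walks linearize the graph: afterwards, the embedding problem looks exactly like the Word2vec sentence problem.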
One of the downsides of node2vec is that, as soon as the graph changes, we would have to retrain it. So it really just works for static graphs, and furthermore, it also doesn’t scale very well. This is why there has been a lot of research over the past seven or eight years on graph neural networks, where we really take the graph structure into account when building a neural network. So how does that work? A really good talk on this is by [inaudible], linked here. Here again we take a graph as input and transform it into another graph, and based on that, we can add features, for example for node classification, for whole-graph classification, or also link prediction. So we are basically transforming our graph as part of the neural network training.
One of the downsides of early graph neural networks was their limitations in both inductiveness and the scale at which they could be deployed. This is where more current research is helping us to overcome that, and I just want to briefly present two papers, not go into too much detail, but basically show you the research directions happening out here. The first one is GraphSAGE. If we just dissect the title, let’s look at what it actually means. The first part of the title is concerned with representation learning on graphs. This is what we did before as well: we want to learn how to represent a graph, typically in some kind of vector or tensor form.
The two properties added by this research are, first, the inductive part, which means that we can go beyond having a fixed-size structure. Early approaches, as mentioned, only worked on a fixed-size graph, and as soon as I saw either a new graph or a changing graph, I would have to retrain. Imagine the Facebook graph, how often that changes, or the LinkedIn graph, whenever someone new joins; imagine we would have to retrain on every change. Of course, that doesn’t really work. So here we treat it as an inductive problem, which means we can extend it to new graphs we haven’t seen before and also to changing graphs. The second aspect this paper tackled was the scale of training: they basically introduced something like the mini-batches you might know from neural networks, which helps this approach scale out much better across larger clusters.
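The core GraphSAGE idea can be sketched without any deep learning framework: sample a fixed number of neighbors, aggregate their features (here with a plain mean), and combine the result with the node's own features. A real layer would also apply learned weight matrices and a nonlinearity, and stack several such layers; all names and feature values below are made up:

```python
import random

def sage_layer(graph, features, node, num_samples=2, seed=0):
    """One simplified GraphSAGE step: sample a fixed-size neighborhood,
    mean-aggregate the sampled features, and concatenate with the node's own.
    Fixed-size sampling is what makes mini-batch training possible, because
    the cost per node no longer depends on the full graph size."""
    rng = random.Random(seed)
    neighbors = graph.get(node, [])
    sampled = rng.sample(neighbors, min(num_samples, len(neighbors)))
    dim = len(features[node])
    if sampled:
        agg = [sum(features[n][i] for n in sampled) / len(sampled) for i in range(dim)]
    else:
        agg = [0.0] * dim  # isolated node: nothing to aggregate
    return features[node] + agg  # concatenation: [own features | aggregated]

feats = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
g = {"a": ["b", "c"]}
print(sage_layer(g, feats, "a"))  # -> [1.0, 0.0, 0.5, 1.0]
```

Because the layer is a function of a node's features and a sampled neighborhood, not of a fixed adjacency matrix, it applies unchanged to nodes and graphs never seen during training; that is the inductive part.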
With earlier approaches, I was often limited by the memory of a single node; again, just imagine a graph of Facebook or Uber Eats size, this doesn’t fit on a single machine. The next problem to be tackled is that those early papers really just looked at the structure: two nodes are connected, and this is the structural representation. But if you look at real-world examples, we actually have strong connections and weak connections. Two people might be really close, they write daily, while other connections are just very brief occurrences. As you can imagine, those should have different weights in the training. And this is where graph attention networks come in, where we can actually train and learn the importance of certain relations, of certain nodes in the neighborhood, over others.
So if we had to sketch the complexity again (as you see, I really enjoy doing that): the easiest approach to graph representation learning is to just do simple sampling using DeepWalk. Then we have node2vec, which comes with more complex sampling algorithms. We have simple graph convolutional networks doing something similar to message passing. Then we have approaches like GraphSAGE, taking that to an inductive and large-scale domain, and graph attention networks, which introduced techniques to go beyond just the structure of the network and take more features, also on the edges, into account.
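The attention idea at the top of this ladder can be sketched in isolation: given compatibility scores for a node's neighbors (in a real GAT these come from a trained attention mechanism, here they are simply given), softmax them into weights and take the weighted sum of the neighbor features. All numbers below are made up for illustration:

```python
import math

def attention_aggregate(scores, neighbor_feats):
    """Graph-attention-style aggregation: softmax the per-neighbor
    compatibility scores, then take the weighted sum of neighbor features.
    Strong connections (high scores) dominate the aggregated result."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(neighbor_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, neighbor_feats)) for i in range(dim)]

# Two neighbors: a strong daily contact (score 2.0) and a brief one (score 0.0)
out = attention_aggregate([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print([round(x, 3) for x in out])  # the daily contact dominates the aggregate
```

Unlike the uniform mean aggregation of a plain convolutional layer, the softmax weights let the model express that a daily interaction should count for much more than a one-off one.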
Thanks so much for listening. I know this was a lot of content and we stayed at a very high-level perspective, but I really hope it gave you some overview of the current field and also sparked some interest in it. I included a number of links and books here, and maybe just one highlight recommendation: a course I’m giving on [inaudible] next week about first steps with graph-powered machine learning, which will cover this in a lot more detail. Otherwise, the book recommendation here also covers this entire field, from simple graph representations, over graph thinking, to graph algorithms, and then also graph-powered machine learning and graph representation learning.
With that, again, thanks so much for listening, and please don’t forget to provide feedback for the session. Also feel free to reach out to me; I’m always happy to discuss more about graph neural networks. Thanks a lot.
Jörg Schad is VP of Engineering and Machine Learning at ArangoDB. In a previous life, he has worked on or built machine learning pipelines in healthcare, distributed systems at Mesosphere, and in-mem...