Table of Contents

**Data Representations**
- Graphical Data Representations
- Relations and Ontologies
- Application: Constructing Graphical Representations

**Learning from Graphs**
- Visualizing Networks
- Inferring Similarity
- Application: Holistic Networks
- Application: Dynamic Networks
- Application: Latent Representations

**Conclusion**

**Appendix A: Graph Theory**

**Appendix B: Technology**

**Graph:** a set of objects and relations; specified by an ordered pair $G = (V,E)$.

**Subgraph:** a subgraph $H$ of $G$, denoted $H \leq G$, is the graph induced by a subset of the nodes of $G$ together with all edges connecting pairs of vertices in that subset.

**Connectedness:** Two nodes are **connected** if there exists a path (sequence of edges) from one to the other. A **connected component** of a graph is a set of nodes which are all (mutually) connected. A graph itself is referred to as **connected** in the case that it has only a single connected component (all nodes are connected). A **k-connected** graph is a connected graph having more than k vertices which remains connected whenever fewer than k nodes are removed from the graph.

**Clique:** A **clique** is a group of nodes in a graph whose induced subgraph is complete.

**Complete:** A graph is **complete** if every pair of nodes is connected by a (unique) edge.

**Cycle:** A **cycle** is a sequence of edges in which the starting and ending node are the same. A graph having cycles is called **cyclic**, whereas a graph having no cycles is called **acyclic**.

**Direction:** A graph in which the edges in $E$ have a direction associated to them is called **directed**; equivalently, the underlying relation specified by the edges is non-symmetric. Conversely, a graph is called **undirected** if its edges have no direction associated to them (specified by a symmetric relation). A graph is called **mixed** in the case that it admits both directed and undirected edges. Finally, the **transpose** of a directed graph is the directed graph having the same set of nodes but with all edges reversed.

**Multigraph:** A **multigraph** is a graph which contains parallel edges: multiple edges between a single pair of vertices. Otherwise, a graph is referred to as **simple**.

**Tree:** A **tree** is an undirected graph in which any pair of vertices is connected by exactly one path.

**Weighted Graph:** A **weighted graph** is a graph in which each edge has some value assigned to it.

**Centrality:** The **betweenness centrality** of a node is the sum, over every pair of other nodes, of the ratio of the number of shortest paths connecting that pair which pass through this node to the total number of shortest paths connecting that pair. This is given by $$ g(v) = \sum_{s\neq t \neq v}\frac{\sigma_{st}(v)}{\sigma_{st}}$$ where $\sigma_{st}$ is the number of shortest paths from node $s$ to node $t$, and $\sigma_{st}(v)$ is the number of these paths which pass through $v$.

**Degree:** The **degree** of a node is the number of edges incident to it, denoted $\delta(n)$. In the case of a directed graph, this extends to an **in-degree** and **out-degree** of a node, denoted by $\delta^+(n)$ and $\delta^-(n)$, respectively.

**Dominant:** A **dominating set** of a graph is a subset of nodes such that every node in the graph not in this set is adjacent to at least one node in the set.

**Neighbors:** The **neighborhood** of a node is the set of nodes directly adjacent to it (connected by an edge). More generally, the k^{th} neighborhood of a node is the set of nodes connected to this node by a path of length k.

**Bridge:** An edge is called a **bridge** in the case that removing it from the graph would increase the number of connected components in the graph.

**Matching:** A **matching set** in a graph is a subset of edges having no common vertices. The **maximal matching set** of a graph is the largest such possible matching set in the graph.

Presented here is an overview of a general framework in which problems in artificial intelligence can be expressed in a graph-theoretic setting and how learning can be performed over this framework. Specifically, this includes an in-depth development of the framework from both a theoretical and applied perspective, with applications specific to network theory, social networks, manufacturing, and healthcare. The requisite graph-theoretic concepts are summarized as a bulleted list in the drop-down to the right, whilst a more detailed introduction to graph theory (with examples) is given in Appendix A.

Conceptually, one can think of data representations as a way to store and transmit information about
*something*, be it a group of entities, a collection of outcomes, or a population.
Let $X$ denote a dataset. By **data representation**, we simply mean any map $\psi: X \to \mathcal{F}$
from $X$ to a feature space $\mathcal{F}$. For some data point $x$, its image under $\psi$
is often called its **latent representation**.
Data representations are paramount to the success of most statistical learning methods. Changing the manner
in which an initial dataset is represented (that is,
finding a *good* feature space $\mathcal{F}$) is the core idea behind feature engineering,
dimensionality reduction, and unsupervised learning.
Also, many popular methods used in supervised learning implicitly learn better representations of a data
source, as part of their learning process.
For example, kernel based methods (such as support vector machines) work by embedding $X$ with respect to a
kernel function $\kappa$, with the embedding map
$\psi: X \to \mathcal{F}$, where the inner product of the images of two data points $x$ and $x'$ under $\psi$
(their latent representations) meets the following condition
$$\langle \psi(x), \psi(x')\rangle_{\mathcal{F}} = \kappa(x,x')$$
The power of this approach is that one need not compute the full latent representation $\psi(x)$ of every
data point (the feature space here can even be infinite dimensional!), and in fact, one
need not have any explicit knowledge of what $\psi$ actually is. In this setting, finding a *good* model simply means
having determined some embedding of $X$ into some feature space $\mathcal{F}$ in which the classes
being learned are separable. **The key takeaway is that the success of artificial intelligence and machine learning applications is inextricably tied to how a data
source is being represented.**
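The kernel identity above can be checked directly for a case where $\psi$ is known in closed form. The sketch below uses the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$, $\kappa(x,x') = \langle x, x'\rangle^2$, whose explicit feature map is $\psi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2)$; the particular points are arbitrary.

```python
import math

def kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = <x, y>^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def psi(x):
    """Explicit feature map for the degree-2 kernel on R^2."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 0.5)
# The kernel evaluates the inner product in feature space without
# ever constructing psi(x) or psi(y).
print(kernel(x, y))           # 16.0
print(inner(psi(x), psi(y)))  # 16.0 (up to floating point)
```

In practice the kernel side of this identity is all a support vector machine ever evaluates, which is what makes infinite-dimensional feature spaces (e.g., the RBF kernel) tractable.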

A commonality amongst the majority of popular machine learning paradigms is that they all result in or learn from
data represented as matrices (linear models, feed-forward networks) or, more generally, as tensors (convolutional networks). The framework developed herein, however, affords a far more flexible
representation for data which
does not lend itself well to such forms, such as hierarchical data or data consisting of various
relationships between entities/objects.

By **graphical data representations** in this context, we refer to a framework in which problems in data
science, machine learning, artificial intelligence, and the like, are expressed as graph theory problems,
to
which we can
then apply graph-theoretic methods to attempt to solve them. Often these graphical representations can
inform
and simplify future modeling methods and approaches, and can sometimes allow us to solve problems directly.

This approach is central to many areas of symbolic artificial intelligence such as knowledge graphs and
ontologies; a canonical example is the hierarchy of biological classification: *species, genus,
family, order, class, phylum, kingdom, domain,* etc.
It also affords a framework for rich and expressive representations of real-world data sets, particularly with data
having inherently complex, hierarchical, or relational structure. Once data
has been
expressed in this form, we can analyze the natural network topology and structure in these representations
which
we can use to develop concrete global visualization facilitating big-picture understanding; support
holistic,
high-level models; and reduce overall modeling complexity.

One challenge of dealing with data which is inherently relational in nature is that complex, higher-order
relationships can be hard to discover. Determining the degree of separation between two people in a
social network, for instance,
can be challenging to compute directly.
However, when approached under a graphical paradigm these
insights can often be gained with relative ease by using graph-theoretic algorithms related to paths,
traversals, or connectedness.

Suppose we are modeling a social network for a fixed set of people as users of some social
media platform using
the dataset below, where each individual is enumerated by $i \in \{0,1,...,32\}$.

Social Network as a Table

| User | Connected User |
|---|---|
| 0 | 7 |
| 10 | 15 |
| 18 | 19 |
| 12 | 16 |
| 5 | 8 |
| 7 | 2 |
| 1 | 2 |
| 15 | 12 |
| 21 | 20 |
| 16 | 17 |
| 26 | 20 |
| 2 | 9 |
| 12 | 13 |
| 26 | 29 |
| 7 | 6 |
| 27 | 25 |
| 29 | 22 |
| 12 | 17 |
| 20 | 27 |
| 27 | 23 |
| 26 | 21 |
| 2 | 6 |
| 7 | 1 |
| 9 | 3 |
| 26 | 30 |
| 15 | 14 |
| 27 | 22 |
| 3 | 5 |
| 25 | 28 |
| 26 | 27 |
| 6 | 8 |
| 13 | 14 |
| 26 | 22 |
| 21 | 29 |
| 26 | 31 |
| 24 | 25 |

From this dataset, imagine trying to answer questions, such as:

- *What is the degree of separation between user 28 and user 29?*
- *Does this social network have distinct sub-communities?*
- *Which user is most important to this social network, in the sense that they connect large sub-groups of users, without whom many of these sub-groups would have no connections whatsoever?*

These questions, while easily stated, are quite challenging to answer when this data is given in such a
tabular format,
but are trivial when represented and studied under a graphical framework. This is a prototypical example in
which the requisite *information* needed to
answer a question is in fact provided by a data source, but is *represented* in a form in which extracting
said information is more arduous, or perhaps even
obfuscates learning altogether. A graphical representation of this data can be seen in the visualization below, while a more detailed analysis of this data set is covered in the Network Topology in Action section in Appendix A.
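As a sketch of how directly such questions fall out of a graphical representation, the snippet below converts the edge table into an adjacency structure (treating connections as symmetric, an assumption discussed shortly) and answers the first question with a breadth-first search; no graph library is needed.

```python
from collections import deque

# The edge list from the table above, as (user, connected_user) pairs.
edges = [(0, 7), (10, 15), (18, 19), (12, 16), (5, 8), (7, 2), (1, 2),
         (15, 12), (21, 20), (16, 17), (26, 20), (2, 9), (12, 13), (26, 29),
         (7, 6), (27, 25), (29, 22), (12, 17), (20, 27), (27, 23), (26, 21),
         (2, 6), (7, 1), (9, 3), (26, 30), (15, 14), (27, 22), (3, 5),
         (25, 28), (26, 27), (6, 8), (13, 14), (26, 22), (21, 29), (26, 31),
         (24, 25)]

adj = {}
for u, v in edges:  # treat connections as symmetric (undirected)
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def degrees_of_separation(adj, source, target):
    """Length of the shortest path between two users, via breadth-first search."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None  # the users are not connected

print(degrees_of_separation(adj, 28, 29))  # 4
```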

While determining the underlying graphical structure in this case is rather natural since the
relation in question can simply be whether two users are connected on a given platform, there does exist
some subtlety in doing so. To see this, let us compare the
situations in which this data consists of
users on the two popular social media platforms: *Facebook* and *Twitter*.
In the standard vernacular of each, this is explicitly the relation **friends_with** on Facebook and
**follows** on Twitter, respectively.
While at a high level these settings may appear to be identical, in the sense that we're simply specifying whether
two users are connected on a given platform, notice that the mechanics of how users are connected on the
respective platforms are ultimately different. The relation **friends_with** is in fact a well-defined, symmetric
relation (for Facebook users): if user $A$ is **friends_with** user $B$, then user $B$ is **friends_with** user $A$.
However, this does not hold for the relation **follows** in the case of Twitter users. Namely, the fact that
user $A$ **follows** user $B$ does not necessarily mean that the converse, user $B$ **follows** user $A$,
holds true in general. In this case, it would likely be more appropriate to model the social network as an
**undirected graph** in the case that this dataset represents these individuals as Facebook users, and
conversely as a **directed graph**,
in the case that they are Twitter users.
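The distinction is easy to make concrete in code. The sketch below builds a directed ("follows"-style) adjacency, a symmetrized undirected ("friends_with"-style) adjacency, and the transpose of the directed graph from the same hypothetical edge list; the users A, B, C are invented for illustration.

```python
# A hypothetical, non-symmetric "follows" relation.
follows = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]

# Directed adjacency (Twitter-style): keep edges exactly as given.
directed = {}
for u, v in follows:
    directed.setdefault(u, set()).add(v)

# Undirected adjacency (Facebook-style): symmetrize the relation.
undirected = {}
for u, v in follows:
    undirected.setdefault(u, set()).add(v)
    undirected.setdefault(v, set()).add(u)

# Transpose of the directed graph: same nodes, all edges reversed.
transpose = {}
for u, v in follows:
    transpose.setdefault(v, set()).add(u)

print(sorted(directed["A"]))    # ['B', 'C'] -- A follows B and C
print(sorted(undirected["B"]))  # ['A', 'C'] -- symmetric connections
print(sorted(transpose["A"]))   # ['C'] -- only C follows A
```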

In the most general setting, we can think of the graphs constructed in this manner as being constructed with respect to some
**ontology**, denoted $\mathcal{O}$. A full treatment of ontologies, knowledge representations, and the
generalities of information science is beyond the scope of this work; for our needs, we may simply think of an
ontology as a set of objects together with relations amongst those objects. The graphs discussed in the
example above were constructed with respect to the set of social media users and the underlying relations defining a
connection on Facebook and Twitter.

Being able to computationally determine the similarity amongst a set of entities is valuable in dealing
with situations in which there are many such entities at hand. One such setting is the field of
manufacturing,
which generates large amounts of data and information, but for which effective
modeling can be incredibly challenging. This is due to many factors, such as the underlying complexity of the
manufacturing
processes themselves and the need to manage global supply chains, inventory systems, and yield, which
are
all further complicated by the
inherently stochastic nature of manufacturing complex products. There are often a multitude of factors
influencing an
outcome in manufacturing, from
the products being manufactured, to the location or machine used in fabrication, to where the underlying
components came from.
We now explore how we may leverage the complexity-reducing and holistic properties of graphical
representations
to aid in the development of intelligence applications within this field.

One way in which this can be done is by leveraging the relational structure of a bill of material. For some
manufactured product, the **bill of material** or **BoM** for short, is a list of all the components
(and
quantities) required to create and assemble that product, from the purchased (raw) components to the assembled
components to finished goods. The details can vary from situation to situation, but for simplicity's sake, let us
assume that we are dealing with a fixed set of all finished products and every component needed to
manufacture
them, in which each has a unique identifier. Let $\Omega$ denote the full set of all these components and
let us
simply refer to them as purchased and assembled components. Consider some assembled component $A \in
\Omega$.
Suppose that $A$ is manufactured using 1 of component $B$ and 2 of component $C$. Suppose further, that $B$
is a purchased
component, while $C$ is an assembled component, which is manufactured using 1 of component $D$ and 1 of
component $E$, both being purchased.
More concretely, $A$ could represent a type of bicycle, whose main
components are the bike frame ($B$), which is purchased, and two wheels ($C$), each of which is assembled
from a
tire ($D$) and rim ($E$).

Regardless of the context of these components, $A$ can be embedded under a
graphical representation by the weighted graph $\beta(A)$, which is given by $$\beta(A)=
(\{A,B,C,D,E\},\{(A,B,1),(A,C,2),(C,D,1),(C,E,1)\})$$
For brevity, we will simply refer to $\beta(A)$ as the BoM of $A$. Notice that, while the details may vary
between
BoMs or even across the finished products themselves, each BoM $\beta(A)$ will have similar (global) network
topologies in that they are finite, directed graphs with no cycles, which are called **directed acyclic
graphs** or **DAGs**. Moreover, each BoM forms a **polytree**, which is a DAG whose underlying
undirected graph is a tree -- we can treat $\beta(A)$ as either an undirected or directed graph depending on
the
context. In general, most applications of this approach will model this as a directed network, in that most
useful relationships between these components will be non-symmetric in nature. It is, however, convenient to
treat these networks as undirected in a general setting, especially for visualization purposes.
This is because both directions

**Downward Flow:** $$\text{finished products} \to \text{assembled components} \to \text{purchased components}$$

**Upward Flow:** $$\text{purchased components} \to \text{assembled components} \to \text{finished products}$$

*make sense*, and have use in various contexts, especially in simulating 'ripple' effects. If we want to
determine the supply level for each sub-component needed to manufacture a given amount of finished goods,
we'd use the **downward** directional flow, because the number of finished goods needed is given and
we're
propagating these values in a top-down manner. Conversely, to simulate the effects of an increase in the price of some
purchased component holistically, we'd use the **upward** flow, because this is a bottom-up
propagation across the network: a price increase in any purchased component will cause a corresponding
increase in any
finished good using that component, either directly or indirectly. More technically, this is equivalent to
considering the two directional flows as transposes of one another.
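The upward flow can be sketched in a few lines of Python on the bicycle BoM from above. The code builds the transpose (each sub-component mapped to its parents) and propagates a unit-cost increase bottom-up; the simple frontier traversal assumes the BoM is a polytree, as in this example.

```python
# Weighted BoM edges for the bicycle example: (component, sub_component, qty).
bom_edges = [("A", "B", 1), ("A", "C", 2), ("C", "D", 1), ("C", "E", 1)]

# Upward flow uses the transpose: map each sub-component to its parents.
parents = {}
for comp, sub, qty in bom_edges:
    parents.setdefault(sub, []).append((comp, qty))

def cost_impact(parents, component, delta):
    """Propagate a unit-cost increase bottom-up through a polytree BoM."""
    impact = {component: delta}
    frontier = [component]
    while frontier:
        sub = frontier.pop()
        for comp, qty in parents.get(sub, []):
            # Each unit of `comp` uses `qty` units of `sub`.
            impact[comp] = impact.get(comp, 0) + qty * impact[sub]
            frontier.append(comp)
    return impact

# A $1 increase in the tire (D) raises each wheel (C) by $1,
# and the bicycle (A) -- which uses two wheels -- by $2.
print(cost_impact(parents, "D", 1))  # {'D': 1, 'C': 1, 'A': 2}
```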

Notice that part $A$ will be the root node of the BoM $\beta(A)$ and in the case that $B$ is a sub-component
of
$A$, then $\beta(B)$ is a subgraph of $\beta(A)$, which we will denote as $\beta(B) \leq \beta(A)$.
Formally, we are
constructing graphical relations relative to the ontology specified by the set of all parts in $\Omega$ and
one
of the non-symmetric relations given above.

We can carry out this representation process on the entire set of products or parts which we want to
model,
and depending on the situation, we may treat the BoMs as individual graphs or consolidate them into a
single
network. Consider the following BoMs for parts $A$ and $G$, respectively
$$\beta(A)= (\{A,B,C,D,E,I\},\{(A,B,1),(A,C,2),(C,D,4),(C,E,2),(C,I,2)\})$$
$$\beta(G)= (\{G,F,C,H,I,D,E,J,K\},\{(G,F,1),(G,C,1),(F,H,3),(F,I,1),$$
$$ (C,I,2),(C,D,4),(C,E,2),(H,J,1),(H,K,2)\})$$
Notice that specifying even these simple BoMs in a non-graphical manner quickly becomes unwieldy to the
point of
becoming nearly intractable, as the number of nodes and edges increase even slightly. However, expressing
each
BoM as a network affords a far more compact and informative representation. Further, we can add to the
richness
and expressiveness of this representation. For example, we can specify these as weighted graphs, in which the
edge
from a component to a sub-component is sized according to how many of those sub-components are required in
the
manufacturing of the component. Likewise, we can quickly distinguish whether a component is a finished
product
(an assembled component which is never a sub-component), an assembled component, or a purchased component, coloring
them red, orange, and grey, respectively. We can also compare various topological properties of
$\beta(A)$
and $\beta(G)$ to establish a similarity metric between components.

An alternate approach is to consider these components in a more holistic manner, namely, as a single network, by consolidating $\beta(A)$ and $\beta(G)$. This allows us to study and analyze interdependencies between these finished products and will highlight components which are critical to both. If we are given a fixed number of finished goods needed to be manufactured, either by considering the number of open orders we have for each finished product or by forecasting future demand, we can propagate these requirements over this network and compute the overall quantity of each component required to meet them.
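A minimal sketch of this downward propagation over the consolidated network of $\beta(A)$ and $\beta(G)$ is given below. Nodes are processed in topological order so that shared components (such as $C$, used by both $A$ and $G$) are expanded exactly once; the demand figures are illustrative.

```python
from collections import deque

# Consolidated edge set of beta(A) and beta(G): (component, sub_component, qty).
edges = [("A", "B", 1), ("A", "C", 2), ("C", "D", 4), ("C", "E", 2),
         ("C", "I", 2), ("G", "F", 1), ("G", "C", 1), ("F", "H", 3),
         ("F", "I", 1), ("H", "J", 1), ("H", "K", 2)]

children, indegree, nodes = {}, {}, set()
for comp, sub, qty in edges:
    children.setdefault(comp, []).append((sub, qty))
    indegree[sub] = indegree.get(sub, 0) + 1
    nodes.update((comp, sub))

def total_requirements(demand):
    """Propagate finished-good demand down the consolidated BoM network."""
    need = dict(demand)
    remaining = dict(indegree)
    queue = deque(n for n in nodes if remaining.get(n, 0) == 0)  # finished goods
    while queue:
        comp = queue.popleft()
        for sub, qty in children.get(comp, []):
            need[sub] = need.get(sub, 0) + qty * need.get(comp, 0)
            remaining[sub] -= 1
            if remaining[sub] == 0:  # all demand on `sub` is now accumulated
                queue.append(sub)
    return need

# One unit each of finished products A and G.
print(total_requirements({"A": 1, "G": 1}))
```

For one unit each of $A$ and $G$, three units of the shared component $C$ are required in total, which in turn drives the demand on $D$, $E$, and $I$.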

We now have a better notion of how one can construct graphical data representations in real-world settings. As such, we now turn our attention to how one can make use of these representations.

One of the easiest, yet nonetheless effective, methods of making use of graphical representations is to
simply
visualize them -- provided that we can find a good visualization. The challenge is that coming up with
a
good visualization for a general network is often non-trivial. In the same way that a poor choice of scale
or
the type of plot to use can obfuscate an analysis and lead to an ultimately misleading visualization in a
more
traditional setting, the same applies to a graphical one, perhaps even more so. For a visualization of a
network
to be fruitful, it is typically best to arrange the nodes in such a way as to minimize the
number of times that edges cross or intersect one another. The lowest possible number of edge crossings of
a drawing of a graph (when drawn on a plane) is called its **crossing number**, and in the case that this
number is
zero, the graph is called a **planar graph**. This perhaps seemingly innocuous criterion is, in fact, a
very
hard problem. For example, consider the following visualization for a given graph, in which the nodes are
positioned randomly.

There is very little we can likely learn about this graph from analyzing this particular visualization, and
in
fact it may even lead us to believe that the underlying data set represented by this graph is very chaotic and has
little to no structure or noticeable patterns. It turns out, however, that this network is more
well-behaved
than this
visualization implies. Before going into more detail regarding this example, it can be helpful to better
understand why minimizing the number of edge crossings of a graph visualization is so challenging. Imagine
trying
to rearrange the nodes in this visualization so as to end up with the fewest edge crossings we can (you
can
try it if you like, the graph is interactive -- but don't try too hard, this is not a planar graph). How
will
you know that you're done, or rather, that you've rearranged the nodes so that the number of edge crossings is
now
minimal and equal to the crossing number of this graph? Better yet, what *is* the crossing number of
this graph? It turns out that computing the crossing number for a general graph, is quite hard; in fact, it
is
an NP-Hard problem! But even then, knowing the crossing number for a graph and obtaining such a
representation
are two different things.

Luckily for us, obtaining a visualization reaching the true minimum number of edge crossings is not a
necessity,
at least as it relates to representing data within the field of artificial intelligence. In fact, for most of
our
purposes, a visualization having a reasonable amount of edge crossings will likely be just as useful to us
as a
true planar visualization having only the minimal number of crossings! Such 'good enough' visualizations can
be
had by applying various heuristics and algorithms to generate candidate visualization layouts for a graph.
Several examples of these algorithms applied to this graph (and the resulting visualization) are shown
below.
Further, many of these algorithms optimize for criteria other than edge crossings.

While not perfect, in that each example has (unnecessary) edge crossings, we can see that these visualizations
are
far more informative than the original and that each conveys, at a minimum, slightly different information.
Various characteristics of this network's global topology are now more immediately evident. Through visual inspection,
one can confirm that this network is: **1-connected** (it is a
connected graph which can be disconnected by removing only a single node: 8, 12, 16, or 26),
**cyclic**
(the
path (0, 1), (1, 11), (11, 2), (2, 0) is a cycle), and **non-planar** (the sub-graph induced by the nodes
42,
44, 45, 43, 46, and 47 is non-planar).

There are a myriad more insights to be had regarding the local or node level structure of this network.

- **Subpopulations:** We can see that there are roughly 5 sub-clusters of nodes within this graph. This is particularly evident in visualization E, produced by the *Fruchterman-Reingold* force algorithm, consisting of a 'center' cluster and four 'outer' clusters, each being connected to the center cluster through nodes 8, 12, 16, and 26, respectively.
- **Critical Nodes:** Notice that node 0 is nearly always one of the centermost nodes in each visualization and that, from visualization B, the network does have a sort of tree-like structure with 0 being its root node. This could, of course, be nothing more than a coincidence, but it could likewise suggest that node 0 (and thus the entity represented by this node) plays a unique and perhaps critical role in this system; we will see more examples of identifying critical nodes in a network further in the paper.
- **Color Relations:** There appears to be some correlation between the coloring scheme and the relation of nodes. This is most evident in the *shell* layout algorithm in visualization F. Notice that every red node has a connection only to nodes colored red, green, or blue. This pattern can be seen across the network, as evidenced by the majority of connections in a color group going only to its neighboring color group along the outside of the circle. This pattern is not respected, however, by the black nodes, which are the only color group having connections across the center of the circle. Since this applies to roughly half of the black nodes, it could suggest there are some notable differences amongst these nodes, and it may make more sense to treat them as two separate groups in future analysis.

These visualizations may seem quite different, and one could argue that each tells a slightly different story about the underlying system that this network models. An important consideration to keep in mind, however, is that they are all identical from a graph-theoretic viewpoint, because they all specify the exact same graphical structure!

Another useful aspect of graphical data representations is the various ways in which one can determine
similarity or likeness amongst the set of entities represented by the nodes in the network. The beauty here
is
that these notions can be had in a, more or less, purely graphical way. For instance, the very fact that two
nodes are connected suggests, in and of itself, that these nodes are similar to one another or alike in
some
way. We can further quantify this notion of similarity by determining the magnitude of this
relation
using their distance in the network; that is to say, the length of the shortest path connecting them.
Depending
on
the context of what the edges represent, it could be that two directly adjacent nodes (shortest path is
simply a
single edge) are more similar to one another than two nodes connected by a shortest path of, say, length 3.
In
the example below, we compute the neighborhood of the green node in the given network; these nodes are
colored
using three shades of red (darkest to lightest), corresponding to the 1st-, 2nd-, and 3rd-neighbors of
the green node, respectively.

This coloring scheme gives us the nodes which are similar to the selected node by virtue of being not only connected but, in the case of the dark red nodes, directly adjacent to it. By the above remarks, it may make sense to conclude that the dark red nodes are more similar to the green node than the medium and light red nodes. Under this notion of similarity, this would suggest that the green node and the blue node are not similar, in that they are not connected. However, an alternative notion of likeness can be had by instead noticing that, at least qualitatively speaking, the green and blue nodes play similar roles in the network. They both have high out-degree and zero in-degree, and both are positioned near the 'center' of this network. Also, notice that the subgraphs induced by the neighborhoods of each node are both star graphs and are structurally identical to one another, up to the number of nodes/edges (click the button in the above visualization to see these neighborhoods).
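The k-th neighborhoods used for this coloring are just the layers of a breadth-first search. A small sketch, on an invented path-with-spur graph:

```python
from collections import deque

def neighborhoods(adj, node, k):
    """Return the 1st- through kth-neighborhoods of `node`: the sets of
    nodes at shortest-path distance exactly 1, 2, ..., k."""
    dist = {node: 0}
    queue = deque([node])
    layers = [set() for _ in range(k)]
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                if dist[w] <= k:
                    layers[dist[w] - 1].add(w)
                queue.append(w)
    return layers

# A small hypothetical graph: a path 0-1-2-3-4 with a spur 1-5.
adj = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1]}
print(neighborhoods(adj, 0, 3))  # [{1}, {2, 5}, {3}]
```

Assigning the darkest shade to `layers[0]` and progressively lighter shades to the later layers reproduces the coloring scheme described above.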

Now let us explore the wealth of information and insights which can be had by extending this approach to a
more
general setting to see how it enables more complete and holistic intelligence. We do so by returning to the field of manufacturing,
applying this framework to a collection of BoMs for a much larger set of products and components. This is
closer
to many real-world manufacturing applications, in which one is often dealing with thousands to millions of
distinct products and components. We extend this approach to the (synthetic) dataset of BoMs illustrated
below, across four
distinct product families (groupings of like products), representing ~5,500 distinct components. The nodes
are
colored according to the respective product family to which they belong.

**Click the button above to enable the visualization.**

Given the approach detailed above, this network is relatively straightforward to produce and yet greatly
simplifies a wide array of common problems, to the point that they are almost trivial to answer. For
example,
the grouping of products and components into product families often does not extend beyond the finished
products.
However, for any such assembled component or purchased
component, by simply starting at its corresponding node and traversing the network, we can efficiently
determine
the set of product families of all the finished products of which the item is a sub-component. Likewise,
simulating the effect of running out of a given component, or of having the cost of a purchased component
drastically increase, can be done by propagating these changes across the network. Given that the longest
path
in this network is 7 (meaning there is a product whose BoM has 7 levels), computing the effect of
such changes or monitoring supply levels comes with tremendous complexity, much of which can be reduced
with
graphical methods. In short, this graphical representation allows for a holistic model of all the
products made by a manufacturer.

Perhaps the most defining characteristic of this network's (global) topology is its connectedness, or
rather,
lack thereof. Note that the network comprises 8 connected components, in spite of the fact that it
represents products from only 4 product families. This means that the first product family (the red nodes)
actually comprises 5 (disconnected) sub-families, which have no common components between them. This
highlights the fact that product groupings are often done for marketing and sales purposes and do not
necessarily
reflect similarity in the manufacturing process. To explore further, we can compute the
connected components of the network.
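Computing the connected components themselves is a standard breadth-first traversal. A self-contained sketch, applied to a toy edge list standing in for two product families that share no components (the real ~5,500-node dataset is not reproduced here):

```python
from collections import deque

def connected_components(edges):
    """Group nodes into connected components via breadth-first search."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        components.append(comp)
    return components

# Two product families with no shared components form two components.
edges = [("A", "B"), ("A", "C"), ("C", "D"),   # family 1
         ("G", "F"), ("F", "H")]               # family 2
comps = connected_components(edges)
print(len(comps))  # 2
```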

Each node is now colored according to these connected components. By double-clicking on the cubes below the
components, we can further analyze the components as separate sub-networks.

Armed with no additional information, such as what these products are or how these particular product families
are
derived, we already have a good deal of a priori knowledge when it comes to developing artificial
intelligence
or machine learning applications related to these parts. We can further expand this knowledge by analyzing
the
roles that given components play within their respective sub-networks. Below we
compare connected Component 1 (green) and connected Component 5 (purple). The degree of various nodes in a
network is often one
of the most descriptive features. In this case, the degree of a component as a node in this network would
give
us a notion of how many distinct assembled components require this component (if using the downward network
flow). Thus, the higher the degree of a node, the more critical the corresponding component is to this
collection of
finished goods as a whole. Given the fact that components typically become more specialized the more you
move
'up' a BoM, high degree nodes can sometimes be purchased components or universal components, such as
packaging,
labeling, etc.
We could also use the concept of **betweenness centrality** to carry out this estimation, which, by
comparison,
would give us a notion of critical sub-components, which may or may not have high degree.
In the examples below, the red nodes are the ones having high degree. To compare these competing notions of
node
criticality, we size the nodes by their degree and centrality, respectively.
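The betweenness formula from the glossary can be evaluated directly with a brute-force sketch: run a breadth-first search from every node to get distances and shortest-path counts, then note that the number of shortest $s$-$t$ paths through $v$ is $\sigma_s(v)\,\sigma_v(t)$ whenever $v$ lies on a shortest $s$-$t$ path. This sums over ordered pairs $(s,t)$, so on an undirected graph each unordered pair is counted twice (some conventions halve the result); Brandes' algorithm is the efficient alternative for large networks.

```python
from collections import deque

def bfs_counts(adj, s):
    """Distances from s and the number of shortest paths to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj, v):
    """g(v) = sum over ordered pairs s != t != v of sigma_st(v) / sigma_st."""
    info = {u: bfs_counts(adj, u) for u in adj}
    dist_v, sigma_v = info[v]
    total = 0.0
    for s in adj:
        if s == v:
            continue
        dist_s, sigma_s = info[s]
        for t in adj:
            if t in (s, v) or t not in dist_s:
                continue
            # Shortest s-t paths through v: sigma_s[v] * sigma_v[t],
            # provided v actually lies on a shortest s-t path.
            if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

# On the path a-b-c, every shortest a-c path passes through b.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(adj, "b"))  # 2.0 (the ordered pairs (a,c) and (c,a))
```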

Notice that the network on the left has a much denser topology, resulting in a lack of 'hub' nodes having high centrality; thus centrality is not a defining characteristic of nodes in this sub-network. Equivalently, this means that there are many different ways to travel between nodes, which in the context of our problem signifies that many components in this network are more universal in nature. This is akin to the component $I$ in the BoM example above, which is a sub-component of $C$, $F$, and $G$. Conversely, in the network on the right, note that there are quite a few nodes having high betweenness centrality, several of which also have low degree (the larger white nodes). This is due to the sparseness of this topology. Notice that this sub-network has 3 relatively distinct clusters, which are connected only by these low-degree, high-centrality nodes. Intuitively, this means that, while the components in this sub-network are indeed similar, outside of these clusters there is perhaps more dissimilarity than in the denser sub-network on the left. While this sub-network is indeed connected, it has a much lower degree of connectedness than the sub-network on the left, in that it is only 1-connected.

In fact, this 'triangular' topology is evident in several of our sub-networks, all characterized by 3 dense clusters connected by high-centrality nodes of low degree. This fact becomes more apparent after 'flattening' these sub-networks by projecting them to two dimensions. This suggests there could be commonalities between finished goods in these sub-networks, even when they are, from a BoM perspective, dissimilar. This notion of determining a structural equivalence amongst these sub-networks is the conceptual basis of graph homomorphisms.

We can also use a graphical approach to model the flow of a system dynamically, to infer how these relational structures change over time. Inventory management is an incredibly challenging and complex process in manufacturing, spanning the management of a (perhaps global) supply chain, the optimization of how components are shipped and moved, and the isolation of scrap and yield issues. It would be of tremendous utility to have a representation of how components flow through inventory as a result of the manufacturing process, which we can achieve with graph-theoretic approaches in the following way. First, components are stored in inventory in some manner of stock-keeping units (SKUs). The components are then taken from these SKUs to be used in the fabrication of an assembled component, as a function of work orders, which specify the number of assembled components needed. This assembled component is then placed back into inventory in some SKU or is sent to a customer. In a sense, this can be thought of as a realization of the BoM for a given assembled component, in that we are now also concerned with specific instances of manufacturing assembled components and where the associated sub-components were stored and, subsequently, taken from inventory. Studying the system in such a temporal manner will allow us to better understand how this network grows, morphs, and evolves over time.
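One minimal way to sketch this idea is a timestamped multigraph in which SKUs and work orders are nodes and each edge records a movement of components; time windowing then amounts to filtering edges by timestamp. All SKU/work-order identifiers, timestamps, and quantities below are hypothetical.

```python
import networkx as nx

# Inventory-flow sketch: SKUs and work orders as nodes, each edge a
# timestamped movement of components (identifiers are hypothetical).
flow = nx.MultiDiGraph()
flow.add_edge("SKU-A", "WO-1001", t=1, qty=40)   # components pulled from inventory
flow.add_edge("SKU-B", "WO-1001", t=1, qty=15)
flow.add_edge("WO-1001", "SKU-C", t=2, qty=10)   # assembled components stored
flow.add_edge("SKU-C", "WO-1002", t=3, qty=4)    # ...and later consumed again

def snapshot(g, t_max):
    """Return the sub-network of all movements up to time t_max."""
    s = nx.MultiDiGraph()
    s.add_edges_from((u, v, d) for u, v, d in g.edges(data=True) if d["t"] <= t_max)
    return s

early = snapshot(flow, 2)        # the network as it looked at time 2
print(early.number_of_edges())
```

Sliding `t_max` forward replays how the network develops as new orders arrive, which is the effect the time-windowing controls below simulate.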

Clicking the

button will enable the visualization of how components are moving throughout inventory, as detailed above. Also, the time-windowing buttons can be used to simulate how this network develops and morphs over time, as new orders come in and new components are purchased. Finally, a common problem in inventory management is isolating and containing potentially defective components upon the realization of some defect or scrap problem. The button shows an example of how we may propagate these problems across this network. Here we compute all the potentially compromised material in inventory, which now needs to be quarantined and tested, based on the fact that the SKUs given by the red nodes were found to contain defective components which were later used in other orders.
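This propagation step amounts to computing everything reachable from the defective SKUs in the directed flow network. A minimal sketch with networkx (node names hypothetical):

```python
import networkx as nx

# Defect propagation sketch: anything downstream of a defective SKU in
# the directed flow network is potentially compromised.
flow = nx.DiGraph()
flow.add_edges_from([
    ("SKU-A", "WO-1001"), ("SKU-B", "WO-1001"),
    ("WO-1001", "SKU-C"), ("SKU-C", "WO-1002"),
    ("WO-1002", "SKU-D"), ("SKU-E", "WO-1003"),
])

defective = {"SKU-A"}            # the 'red' nodes found to be defective
compromised = set()
for sku in defective:
    # all nodes reachable from a defective SKU, respecting edge direction
    compromised |= nx.descendants(flow, sku)

print(sorted(compromised))
```

Material with no flow path from a defective SKU (here, `SKU-E` and `WO-1003`) is untouched by the propagation, which is what keeps the quarantine set small.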
As discussed previously, *learned* or *latent* representations are paramount to the majority of
modern machine learning methods. We now consider how one may leverage such latent representations in a
graph-theoretic manner by using an example from the field of healthcare. First, a bit of background.
Healthcare is a field in which AI applications can have tremendous value and impact, but healthcare data
possesses the usual complicating factors, such as: missing or messy data; siloed and disparate data sources;
inherent privacy concerns; and a lack of ubiquity and standardization. While there will often be little overlap between data amassed from various sources (such as different hospitals), diagnosis and procedure information is somewhat ubiquitous in the form of ICD (International Classification of Diseases) codes: a collection of alphanumeric codes, maintained by the World Health Organization, used throughout the medical field to represent diagnoses and procedures.

We will explore data from the MIMIC-III Clinical Database.

A word2vec model was trained on the ICD codes appearing in the MIMIC-III dataset. This is carried out by taking the list of all ICD codes from a patient's visit and treating them as code sequences. In the vernacular of NLP, the sequence of ICD codes for a patient's visit would serve as our sentences, while the individual ICD codes would be our words. A typical approach is to embed the words in a high-dimensional (real) space to maximize the expressiveness of the learned representations (100-dimensional vectors were used here) and then apply some dimensionality reduction technique so as to assess the outputs. Below is a visualization of the latent space of this model, projected down to 3 dimensions using a combination of principal component analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Each point in this visual represents one of the ICD codes appearing in the dataset.

One of the principal challenges of unsupervised learning is interpreting the results: there are a myriad of potential representations for a given dataset, none of which will likely be useful for every problem or context. However, we can represent this latent space graphically to aid in this task. To do so, we treat each point as a node and compute edges between them based on similarity. There are a variety of ways in which we can do this, such as computing the distribution of inter-node distances with respect to some distance metric and selecting a threshold, so that we define an edge between two points if their distance is less than this threshold. Alternatively, we could apply spectral clustering to the resultant similarity matrix and define an edge between nodes if they are in the same spectral cluster. The former was used here, as one major downside of the spectral clustering approach is that it sacrifices inter-cluster relations: every node in a given cluster will be connected.
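The thresholding approach can be sketched as follows; the embedded points here are illustrative stand-ins, not actual learned ICD representations.

```python
import numpy as np
import networkx as nx

# Illustrative 2-D 'embeddings': one tight cluster, one pair, one outlier.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0],
                   [9.0, 0.0]])

# pairwise Euclidean distances between all points
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
threshold = 0.5   # chosen from the distance distribution in practice

G = nx.Graph()
G.add_nodes_from(range(len(points)))
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        if dists[i, j] < threshold:
            G.add_edge(i, j)   # edge = 'similar enough'

components = list(nx.connected_components(G))
print(len(components))
```

Points closer than the threshold become connected, and the outlier remains an isolated node, just as in the ICD network described below.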

Clicking the button in the above model carries out this procedure on the learned ICD code representations. Notice that this yields a network with a modest number of edges, which also exhibits a noticeable structure. The network is rather disconnected, in that it has a large number of connected components and a good number of isolated nodes. An advantage of this approach over a pure clustering method is that it captures a wider dimension of similarity (via paths).

Such representations, while useful in and of themselves, are often developed for use in downstream AI applications. This means that any time we deal with data sources consisting of ICD codes, we can transfer these representations and learnings to those new applications.
For example, having such computational representations for the
diagnosis and procedure data, we can apply this encoding and study the structure of patients in the data set; that is,
to learn *patient-level* representations. This is done by applying the learned ICD representations to the ICD
code sequence for a patient's visit. Where a patient's visit was previously represented by a sequence of ICD
codes, it is now represented by a sequence of 100-dimensional real vectors. We can then apply more
advanced unsupervised learning methods to learn the desired patient-level representations. Here a stacked denoising convolutional autoencoder (SDCAE) was used to learn lower-dimensional representations of the patients.
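The transfer step itself is simple to sketch: each code in a visit is replaced by its learned vector, and the resulting matrix (zero-padded to a fixed length) becomes the autoencoder's input. The embedding table below is random, standing in for the trained word2vec vectors, and the padding length is a hypothetical choice.

```python
import numpy as np

# Random stand-in for the learned 100-dimensional ICD embedding table.
rng = np.random.default_rng(0)
codes = ["401.9", "250.00", "414.01", "585.9"]
embedding = {c: rng.normal(size=100) for c in codes}

def encode_visit(visit, max_len=8, dim=100):
    """Stack the code vectors for one visit, zero-padded to a fixed length."""
    mat = np.zeros((max_len, dim))
    for i, code in enumerate(visit[:max_len]):
        mat[i] = embedding[code]
    return mat

# A visit, previously a sequence of ICD codes, is now a matrix of vectors.
patient = encode_visit(["401.9", "250.00"])
print(patient.shape)
```

These fixed-size matrices are exactly the kind of input a convolutional autoencoder can consume.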

Notice that we could jump directly to this level by instead training an SDCAE on the patients' ICD sequences themselves. Applying transfer learning in this way, however, gives us the advantage of leveraging the learned contextual similarity of the ICD codes. Further, we can add significant robustness to the model by leveraging the 'wider' similarity criterion established by the connected components: randomly substituting ICD codes with other codes in their connected component! Below is an example of this model applied to a random subset of 6,000 patients from the dataset.
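The substitution trick can be sketched as follows, assuming a similarity network over hypothetical ICD codes: a code may be swapped for any peer in its connected component, since connected codes were deemed contextually similar.

```python
import random
import networkx as nx

# Hypothetical ICD similarity network with two connected components.
sim = nx.Graph()
sim.add_edges_from([("401.9", "401.1"), ("401.1", "405.99"),
                    ("250.00", "250.02")])

# precompute each code's connected component for fast lookup
component_of = {}
for comp in nx.connected_components(sim):
    for code in comp:
        component_of[code] = comp

def augment(visit, p=0.5, rng=random.Random(0)):
    """Randomly substitute codes with peers from the same component."""
    out = []
    for code in visit:
        peers = component_of.get(code, {code}) - {code}
        out.append(rng.choice(sorted(peers)) if peers and rng.random() < p else code)
    return out

augmented = augment(["401.9", "250.00", "401.9"])
print(augmented)
```

Every substituted code stays within its original component, so the augmented visits remain plausible under the learned similarity structure.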

Carrying out the same process as with the ICD codes, we can see that this patient representation network is far denser than the code representation network. In fact, this network has only 6 connected components. Notice that the largest component (the red edges), while rather densely connected, certainly has many 'holes' in it. We can think of the connected components here as defining various computational phenotypes, yielding the distinct subgroups evident in the network.

In closing, the framework of graphical data representations presented here has tremendous utility in settings where traditional artificial intelligence and machine learning methods suffer from diminished performance, such as data having a complex or relational structure. Additionally, it can be used to improve the performance of many traditional methods by giving practitioners a better understanding of the overall structure of their data, or in the development of new features in the feature engineering process.

So what exactly is graph theory? First off, a graph in this context is different from how one usually encounters the notion of a graph in data science and machine learning, namely the *graph of a function*, which is of course a way to visualize the relation of outputs to inputs under that function. Instead, in the context of graph theory, a **graph** refers to a mathematical structure used to model pair-wise relationships between entities of some kind.
A graph is composed of a set of objects, called **nodes** or **vertices**, and a set of relationships
between these objects, called **edges** or **links**. A graph can be specified by an ordered pair $G =
(V,
E)$, where $V$ and $E$ denote the set of nodes and edges, respectively. Graph theory is the branch of
mathematics concerned with studying and understanding the structure of these objects, including various
properties or attributes they might have, as well as developing various algorithms related to understanding
these structures.

One of the most fundamental properties of a graph $G = (V,E)$ is whether the edges in $E$ are **undirected** (the yellow, green, and purple graph examples) or **directed** (the blue and red graph examples). This delineation specifies whether the edges in the graph are bi-directional, or equivalently, whether the underlying relationship they represent is symmetric or non-symmetric. Many key graph-theoretic properties and algorithms must be defined differently in the context of undirected vs. directed graphs, or often are not defined at all in the opposing case. For example, the notion of a graph being connected in the undirected case must be replaced with weak and/or strong connectivity in the directed case. Even fundamental concepts such as the degree of a node require some alteration (the degree of a node in a directed graph is delineated into in-degree and out-degree).
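A quick illustration of these directed-graph notions with networkx, on a toy directed cycle with one extra dangling edge:

```python
import networkx as nx

# A directed 3-cycle (0 -> 1 -> 2 -> 0) plus a dangling edge 2 -> 3.
D = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3)])

# in-degree and out-degree replace the single undirected notion of degree
print(D.in_degree(2), D.out_degree(2))

# weakly connected: connected once edge directions are ignored
# strongly connected: every node reachable from every other, respecting direction
weak = nx.is_weakly_connected(D)
strong = nx.is_strongly_connected(D)
print(weak, strong)
```

Here the graph is weakly but not strongly connected: node 3 has no outgoing edges, so nothing is reachable from it.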

**Network Theory** or **Network Science** is a sub-field of graph theory in which the graphs studied represent or model some real-world entity or dynamical system. The key (albeit murky) distinction here is that network theory is concerned with graphs modeling some real-world system, which will often be accompanied by various meta-data beyond the graphical structure alone, such as node groupings (blue example graph), edge weights (green example graph), or concepts not expressed explicitly by the graph at hand. Properties related to the layout of complex networks are of particular interest. Namely, the **network topology**, or simply **topology**, of a network comprises the characteristics, properties, or attributes related to the layout and arrangement of nodes and edges, either from a global (the graph as a whole) or local (sub-graphical structures) perspective. Likewise, it is often of interest to analyze various properties of the nodes and edges themselves, since, after all, these are the very objects we are modeling. Common topological properties of interest include shape, connectedness, density, degree distribution, and centrality.

Consider the problem of modeling the dynamics related to some set of objects or entities enumerated by $i \in \{0,1,...,32\}$. Suppose we have the following relational dataset, commonly called an **edge list**:
$$\{(26, 30), (26, 31), (26, 21), (26, 29), (26, 20), (26, 22), (26, 27), (21, 20), (21, 29),$$
$$(29, 22), (0, 7), (7, 1), (7, 2), (7, 6), (1, 2), (2, 6), (2, 9), (6, 8), (9, 3), (3, 5), $$
$$(5, 8), (10, 15), (15, 12), (15, 14), (12, 13), (12, 16), (12, 17), (13, 14),$$
$$ (16, 17), (18, 19), (20, 27), (27, 22), (27, 23), (27, 25), (24, 25), (25, 28)
\}$$
in which the tuple $(i,j)$ signifies that the $i^{th}$ entity is related to the $j^{th}$ entity. We can represent this system in a graphical framework by taking it to be the graph $G = (V,E)$, where $V = \{0,1,...,32\}$ and $E$ is the edge list. Taking these edges to be undirected (or equivalently, this relationship to be symmetric) induces the following network.
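Constructing this network programmatically, e.g. with networkx, is a one-liner once the edge list is in hand:

```python
import networkx as nx

# The edge list from above, verbatim.
edges = [(26, 30), (26, 31), (26, 21), (26, 29), (26, 20), (26, 22),
         (26, 27), (21, 20), (21, 29), (29, 22), (0, 7), (7, 1), (7, 2),
         (7, 6), (1, 2), (2, 6), (2, 9), (6, 8), (9, 3), (3, 5), (5, 8),
         (10, 15), (15, 12), (15, 14), (12, 13), (12, 16), (12, 17),
         (13, 14), (16, 17), (18, 19), (20, 27), (27, 22), (27, 23),
         (27, 25), (24, 25), (25, 28)]

G = nx.Graph()                 # undirected: the relation is symmetric
G.add_nodes_from(range(33))    # V = {0, 1, ..., 32}
G.add_edges_from(edges)

print(G.number_of_nodes(), G.number_of_edges())
```

With the graph in hand, the topological quantities discussed below (components, degrees, centralities) are all single library calls.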

There exists a multitude of methods and approaches in which we can study and analyze the rich representation structures induced by viewing datasets in this way, the most applicable of which will often be determined by the problem at hand. Several such graph-theoretic concepts are illustrated below for the network induced by the edge list given above, as both an undirected and directed graph.

For example, we can study properties related to the network as a whole by summarizing its network topology, or by computing its **connected components** (example A), which illustrate potential sub-populations in our data. Specifically, this network has 5 connected components, meaning that a node from, say, the component of purple nodes is not only not directly adjacent to any non-purple node, but has no relation to any non-purple node whatsoever; there is no path from a purple node to any non-purple node.

We can develop models on $G$ in a more direct manner, using methods such as graph convolutions, node
embeddings,
or spectral clustering. This can be done by representing $G$ by its **adjacency
matrix** (example C). For a general graph $G = (V,E)$, the adjacency matrix $A$ is the $|V| \times |V|$
matrix, whose rows and columns
represent the nodes in the graph, and whose entries are given by
$$ a_{i,j} = \begin{cases}
1, & i \text{ and } j \text{ are adjacent} \\
0, & \text{otherwise}
\end{cases} $$
In the case of weighted edges, we can take $a_{i,j}$ to be the weight of the incident edge, if they're
adjacent.
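For example, the adjacency matrix of a small undirected graph can be computed directly with networkx; since the graph is undirected, the matrix is symmetric.

```python
import networkx as nx
import numpy as np

# A small undirected graph: a triangle (0,1,2) with a pendant node 3.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

# |V| x |V| adjacency matrix; nodelist fixes the row/column ordering
A = nx.to_numpy_array(G, nodelist=[0, 1, 2, 3])
print(A)
```

For a weighted graph, `nx.to_numpy_array` places each edge's weight attribute where the 1s appear here.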

Likewise, we can gain a deeper understanding of the various nodes themselves (or rather, the real-world objects they represent), and how they compare to one another. For example, we can ascertain how similar a pair of nodes is by determining whether they are in the same connected component, by comparing the size and shape of their **neighborhoods** (example D), or by checking whether they are **dominant** in the larger network (example H). Another powerful concept in network theory is **cliques**, which are critical to undirected probabilistic graphical models (Markov random fields) and give a stronger notion of how nodes group across a network. Of course, we can do a similar analysis at the edge level as well (example F).
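A few of these node-level computations, sketched on a small toy graph with networkx (note that `nx.dominating_set` returns *a* dominating set, not necessarily a minimum one):

```python
import networkx as nx

# Toy graph: a triangle (0,1,2) with a path 2-3-4 hanging off of it.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

hood = set(G.neighbors(2))                      # the neighborhood of node 2
cliques = [set(c) for c in nx.find_cliques(G)]  # maximal cliques
dom = nx.dominating_set(G)                      # some dominating set

print(hood, cliques, dom)
```

The triangle shows up as the single maximal clique of size 3, and every node outside the dominating set is adjacent to a member of it, by definition.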

We can view this network as a weighted graph in which we size the nodes and/or edges either by their topological properties or using meta-data. In example B, the edges are weighted by whether or not they are a **bridge**, and the nodes are sized by their **degree**. Opening the control pane on this figure allows you to select other possible weighting options. Often the most useful scheme is to imbue these attributes based on the available meta-data we have at hand. For example, if this network represents possible flights (edges) between a set of cities (nodes), we could weight the edges by how far a plane would have to travel from city to city, how long the flight would take, or how much a flight would cost. Likewise, we can represent various properties of the nodes by coloring them according to what state each city is in, and could size them based on how much the average flight arriving at each city costs (namely, the average weight of edges incident to a node, if the edges are weighted by flight cost). Note that, depending on the context of what cities and/or flights we're considering, it may be more appropriate to model these relations as a directed network instead (example G).
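The flight-network idea can be sketched as a weighted graph with networkx; the cities and costs below are made up for illustration.

```python
import networkx as nx

# Hypothetical flight network: cities as nodes, flights as weighted edges.
G = nx.Graph()
G.add_edge("DEN", "CHI", cost=150)
G.add_edge("CHI", "NYC", cost=120)
G.add_edge("DEN", "NYC", cost=300)
G.add_edge("NYC", "BOS", cost=90)    # removing this flight strands BOS

# bridges: edges whose removal increases the number of connected components
bridges = list(nx.bridges(G))

def avg_incident_cost(g, city):
    """Average cost of flights incident to a city (a possible node size)."""
    costs = [d["cost"] for _, _, d in g.edges(city, data=True)]
    return sum(costs) / len(costs)

print(bridges, avg_incident_cost(G, "NYC"))
```

The triangle DEN-CHI-NYC contains no bridges (any one of its flights can be rerouted around), while the lone NYC-BOS flight is a bridge.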

All modeling, examples, and visualizations were done in Python and JavaScript, principally using the following frameworks and libraries:

- **d3** (interactive visualizations, force simulation layouts)
- **gensim** (word2vec models, latent embedding models)
- **graphviz** (layout algorithms)
- **networkx** (creation of graphs, graph-theoretic algorithms, layout algorithms)
- **scikit-learn** (dimensionality reduction methods)
- **TensorFlow** (deep learning models)
- **three.js** (3D visualizations for complex networks)