Presented here is an overview of a general framework in which problems in artificial intelligence can be expressed in a graph-theoretic setting, along with how learning can be performed under this framework. Specifically, this includes an in-depth development of the framework from both a theoretical and applied perspective, with applications specific to network theory, social networks, manufacturing, and healthcare. The requisite graph-theoretic concepts are summarized as a bulleted list in the drop-down to the right, whilst a more detailed introduction to graph theory (with examples) is given in Appendix A.
Conceptually, one can think of data representations as a way to store and transmit information about
something, be it a group of entities, a collection of outcomes, or a population.
Let $X$ denote a dataset. By data representation, we simply mean any map $\psi: X \to \mathcal{F}$
from $X$ to a feature space $\mathcal{F}$. For some data point $x$, its image under $\psi$
is often called its latent representation.
Data representations are paramount to the success of most statistical learning methods. Changing the manner
in which an initial dataset is represented (that is,
finding a good feature space $\mathcal{F}$) is the core idea behind feature engineering,
dimensionality reduction, and unsupervised learning.
Also, many popular methods used in supervised learning implicitly learn better representations of a data
source, as part of their learning process.
For example, kernel-based methods (such as support vector machines) work by embedding $X$ with respect to a kernel function $\kappa$, via an embedding map $\psi: X \to \mathcal{F}$ in which the inner product of the images of two data points $x$ and $x'$ under $\psi$ (their latent representations) satisfies the condition
$$\langle \psi(x), \psi(x')\rangle_{\mathcal{F}} = \kappa(x,x')$$
The power of this approach is that one need not compute the full latent representation $\psi(x)$ of every data point (the feature space here can even be infinite dimensional!), and in fact, one need not have any explicit knowledge of what $\psi$ actually is.
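To make this concrete, here is a minimal sketch (in Python with numpy; the degree-2 polynomial kernel and the toy points are our own choices for illustration) verifying that evaluating $\kappa$ directly agrees with the inner product of an explicit feature map $\psi$, which the kernel trick lets us avoid ever constructing:

```python
import numpy as np

# Toy data points in R^2; the kernel k(x, x') = (<x, x'> + 1)^2 corresponds
# to an explicit feature map into the space of (scaled) degree-2 monomials.
x, x_prime = np.array([1.0, 2.0]), np.array([3.0, 0.5])

def kernel(u, v):
    """Degree-2 polynomial kernel, evaluated with no feature map at all."""
    return (u @ v + 1.0) ** 2

def psi(u):
    """The explicit feature map for this kernel (for inputs in R^2)."""
    u1, u2 = u
    return np.array([u1**2, u2**2, np.sqrt(2) * u1 * u2,
                     np.sqrt(2) * u1, np.sqrt(2) * u2, 1.0])

# <psi(x), psi(x')> == kappa(x, x'), so the embedding never needs to be built.
print(kernel(x, x_prime))       # 25.0
print(psi(x) @ psi(x_prime))    # 25.0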
A commonality amongst the majority of popular machine learning paradigms is that they all result in, or learn from, data represented as matrices (linear models, feed-forward networks) or, more generally, as tensors (convolutional networks). The framework developed herein, however, affords a far more flexible treatment of data which does not lend itself well to such representations, such as hierarchical data or data consisting of various relationships between entities/objects.
By graphical data representations in this context, we refer to a framework in which problems in data science, machine learning, artificial intelligence, and the like are expressed as graph theory problems, to which we can then apply graph-theoretic methods in an attempt to solve them. Often these graphical representations can inform and simplify subsequent modeling methods and approaches, and can sometimes allow us to solve problems directly. This approach is central to many areas of symbolic artificial intelligence, such as knowledge graphs and ontologies.
One challenge of dealing with data which is inherently relational in nature is that complex, higher-order relationships can be hard to discover. The degree of separation between two people in a social network, for instance, can be challenging to compute directly. However, when approached under a graphical paradigm, these insights can often be gained with relative ease by using graph-theoretic algorithms related to paths, traversals, or connectedness.
Suppose we are modeling a social network for a fixed set of people as users of some social
media platform using
the dataset below, where each individual is enumerated by $i \in \{0,1,...,32\}$.
User | Connected User |
---|---|
0 | 7 |
10 | 15 |
18 | 19 |
12 | 16 |
5 | 8 |
7 | 2 |
1 | 2 |
15 | 12 |
21 | 20 |
16 | 17 |
26 | 20 |
2 | 9 |
12 | 13 |
26 | 29 |
7 | 6 |
27 | 25 |
29 | 22 |
12 | 17 |
20 | 27 |
27 | 23 |
26 | 21 |
2 | 6 |
7 | 1 |
9 | 3 |
26 | 30 |
15 | 14 |
27 | 22 |
3 | 5 |
25 | 28 |
26 | 27 |
6 | 8 |
13 | 14 |
26 | 22 |
21 | 29 |
26 | 31 |
24 | 25 |
From this dataset, imagine trying to answer questions such as: Are two given users connected, either directly or through mutual connections? What is their degree of separation? Are there groups of users with no connections between them whatsoever?
These questions, while easily stated, are quite challenging to answer when the data is given in such a tabular format, but are trivial when represented and studied under a graphical framework. This is a prototypical example in which the requisite information needed to answer a question is in fact provided by a data source, but is represented in a form in which extracting said information is arduous, or perhaps even obfuscates learning altogether. A graphical representation of this data can be seen in the visualization below, while a more detailed analysis of this dataset is covered in the Network Topology in Action section in Appendix A.
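As a brief preview of how trivial these questions become under a graphical representation, here is a minimal sketch (using the networkx library; the specific queries are illustrative) that loads the connection table above as a graph:

```python
import networkx as nx

# Each row of the connection table becomes an undirected edge.
edges = [(0, 7), (10, 15), (18, 19), (12, 16), (5, 8), (7, 2), (1, 2),
         (15, 12), (21, 20), (16, 17), (26, 20), (2, 9), (12, 13), (26, 29),
         (7, 6), (27, 25), (29, 22), (12, 17), (20, 27), (27, 23), (26, 21),
         (2, 6), (7, 1), (9, 3), (26, 30), (15, 14), (27, 22), (3, 5),
         (25, 28), (26, 27), (6, 8), (13, 14), (26, 22), (21, 29), (26, 31),
         (24, 25)]
G = nx.Graph(edges)

# Degree of separation between users 0 and 3 (length of the shortest path).
print(nx.shortest_path_length(G, 0, 3))

# Are users 10 and 26 connected at all, even indirectly?
print(nx.has_path(G, 10, 26))

# Which groups of users (appearing in the table) are entirely disconnected?
print([sorted(c) for c in nx.connected_components(G)])
```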
While determining the underlying graphical structure in this case is rather natural, since the relation in question can simply be whether two users are connected on a given platform, there does exist some subtlety in doing so. To see this, let us compare the situations in which this data describes users of two popular social media platforms: Facebook and Twitter. In the standard vernacular of each, this is explicitly the relation friends_with on Facebook and follows on Twitter, respectively.
While at a high level these settings may appear identical, in the sense that we're simply specifying whether two users are connected on a given platform, notice that the mechanics of how users are connected on the respective platforms are ultimately different. The relation friends_with is in fact a well-defined, symmetric relation (for Facebook users): if user $A$ is friends_with user $B$, then user $B$ is friends_with user $A$. However, this does not hold for the relation follows in the case of Twitter users. Namely, that user $A$ follows user $B$ does not necessarily mean that the converse, user $B$ follows user $A$, holds true in general. It would therefore likely be more appropriate to model the social network as an undirected graph in the case that this dataset represents these individuals as Facebook users, and conversely as a directed graph in the case that they are Twitter users.
In the most general setting, we can think of the graphs constructed in this manner as being constructed with respect to some ontology, denoted $\mathcal{O}$. A full treatment of ontologies, knowledge representations, and the generalities of information science is beyond the scope of this work; for our needs, we may simply think of an ontology as a set of objects together with relations amongst those objects. The graphs discussed in the example above were constructed with respect to the set of social media users and the underlying relations defining a connection on Facebook and Twitter.
Being able to computationally determine the similarity amongst a set of entities is valuable in situations in which there are many such entities at hand. One such setting is the field of manufacturing, which generates large amounts of data and information, but for which effective modeling can be incredibly challenging. This is due to many factors, such as the underlying complexity of the manufacturing processes themselves and the need to manage global supply chains, inventory systems, and yield, all of which are further complicated by the inherently stochastic nature of manufacturing complex products. There are often a multitude of factors influencing an outcome in manufacturing, from the products being manufactured, to the location or machine being used in fabrication, to where the underlying components came from. We now explore how we may leverage the complexity-reducing and holistic properties of graphical representations to aid in the development of intelligence applications within this field.
One way in which this can be done is by leveraging the relational structure of a bill of materials. For some manufactured product, the bill of materials, or BoM for short, is a list of all the components (and quantities) required to create and assemble that product, from the purchased (raw) components, to the assembled components, to finished goods. The details can vary from situation to situation, but for simplicity's sake, let us assume that we are dealing with a fixed set of all finished products and every component needed to manufacture them, in which each has a unique identifier. Let $\Omega$ denote the full set of all these components, and let us simply refer to them as purchased and assembled components. Consider some assembled component $A \in \Omega$. Suppose that $A$ is manufactured using 1 of component $B$ and 2 of component $C$. Suppose further that $B$ is a purchased component, while $C$ is an assembled component, which is manufactured using 1 of component $D$ and 1 of component $E$, both being purchased.
More concretely, $A$ could represent a type of bicycle, whose main components are the bike frame ($B$), which is purchased, and two wheels ($C$), each of which is assembled from a tire ($D$) and rim ($E$).
Regardless of the context of these components, $A$ can be embedded under a graphical representation by the weighted graph $\beta(A)$, which is given by $$\beta(A)= (\{A,B,C,D,E\},\{(A,B,1),(A,C,2),(C,D,1),(C,E,1)\})$$
For brevity, we will simply refer to $\beta(A)$ as the BoM of $A$. Notice that, while the details may vary between BoMs or even across the finished products themselves, each BoM $\beta(A)$ will have a similar (global) network topology, in that each is a finite, directed graph with no cycles, which is called a directed acyclic graph or DAG. Moreover, each BoM forms a polytree, which is a DAG whose underlying undirected graph is a tree -- we can treat $\beta(A)$ as either an undirected or directed graph depending on the context. In general, most applications of this approach will model this as a directed network, in that most useful relationships between these components are non-symmetric in nature. It is, however, convenient to treat these networks as undirected in a general setting, especially for visualization purposes.
This is because both directions make sense and have use in various contexts, especially in simulating 'ripple' effects. If we want to determine the supply level for each sub-component needed to manufacture a given amount of finished goods, we'd use the downward directional flow, because the number of finished goods needed is given and we're propagating these values in a top-down manner. Conversely, to simulate the effects of an increase in the price of some purchased component holistically, we'd use the upward flow, because this is a bottom-up propagation across the network: a price increase in any purchased component will cause a corresponding increase in any finished good using that component, either directly or indirectly. More technically, this is equivalent to considering the two directional flows as transposes of one another.
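As a minimal sketch of the downward flow (using networkx; the `requirements` helper is our own illustration), we can propagate a demand for finished goods down the weighted BoM from the bicycle example:

```python
import networkx as nx

# The BoM beta(A) as a directed graph; edge (parent, child) carries the
# quantity of the child needed to build one unit of the parent.
bom = nx.DiGraph()
bom.add_weighted_edges_from([("A", "B", 1), ("A", "C", 2),
                             ("C", "D", 1), ("C", "E", 1)], weight="qty")

def requirements(g, part, amount):
    """Top-down ('downward flow') propagation: total quantity of every
    sub-component needed to manufacture `amount` units of `part`."""
    totals = {}
    def visit(node, multiplier):
        for child in g.successors(node):
            needed = multiplier * g[node][child]["qty"]
            totals[child] = totals.get(child, 0) + needed
            visit(child, needed)
    visit(part, amount)
    return totals

# To build 10 bicycles (A): 10 frames (B), 20 wheels (C), 20 tires and rims.
print(requirements(bom, "A", 10))   # {'B': 10, 'C': 20, 'D': 20, 'E': 20}
```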
Notice that part $A$ will be the root node of the BoM $\beta(A)$ and in the case that $B$ is a sub-component
of
$A$, then $\beta(B)$ is a subgraph of $\beta(A)$, which we will denote as $\beta(B) \leq \beta(A)$.
Formally, we are
constructing graphical relations relative to the ontology specified by the set of all parts in $\Omega$ and
one
of the non-symmetric relations given above.
We can carry out this representation process for the entire set of products or parts which we want to model, and depending on the situation, we may treat the BoMs as individual graphs or consolidate them into a single network. Consider the following BoMs for parts $A$ and $G$, respectively:
$$\beta(A)= (\{A,B,C,D,E,I\},\{(A,B,1),(A,C,2),(C,D,4),(C,E,2),(C,I,2)\})$$
$$\beta(G)= (\{G,F,C,H,I,D,E,J,K\},\{(G,F,1),(G,C,1),(F,H,3),(F,I,1),$$
$$ (C,I,2),(C,D,4),(C,E,2),(H,J,1),(H,K,2)\})$$
Notice that specifying even these simple BoMs in a non-graphical manner quickly becomes unwieldy, to the point of becoming nearly intractable as the number of nodes and edges increases even slightly. However, expressing each BoM as a network affords a far more compact and informative representation. Further, we can add to the richness and expressiveness of this representation. For example, we can specify these as weighted graphs, in which the edge from a component to a sub-component is sized according to how many of those sub-components are required in the manufacturing of the component. Likewise, we can quickly distinguish whether a component is a finished product (an assembled component which is never a sub-component), an assembled component, or a purchased component, and color them red, orange, and grey, respectively. We can also compare various topological properties of $\beta(A)$ and $\beta(G)$ to establish a similarity metric between components.
An alternate approach is to consider these components in a more holistic manner, namely, as a single network, by consolidating $\beta(A)$ and $\beta(G)$. This allows us to study and analyze interdependencies between these finished products and will highlight components which are critical to both. If we are given a fixed number of finished goods needed to be manufactured, either by considering the number of open orders we have for each finished product or by forecasting future demand, we can propagate these requirements over this network and compute the overall quantity of each component required to meet them.
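A minimal sketch of this consolidation (using networkx, with edge weights omitted for brevity) might look as follows; the shared, critical components fall out as the common descendants of the two finished products:

```python
import networkx as nx

beta_A = nx.DiGraph([("A", "B"), ("A", "C"), ("C", "D"), ("C", "E"), ("C", "I")])
beta_G = nx.DiGraph([("G", "F"), ("G", "C"), ("F", "H"), ("F", "I"), ("C", "I"),
                     ("C", "D"), ("C", "E"), ("H", "J"), ("H", "K")])

# Consolidate the individual BoMs into a single network.
combined = nx.compose(beta_A, beta_G)

# Components critical to both finished products: nodes reachable from both
# A and G in the combined network.
shared = nx.descendants(combined, "A") & nx.descendants(combined, "G")
print(sorted(shared))   # ['C', 'D', 'E', 'I']
```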
We now have a better notion of how one can construct graphical data representations in real-world settings, so let us turn our attention to how one can make use of these representations.
One of the easiest, yet nonetheless effective, methods of making use of graphical representations is to simply visualize them, provided that we can find a good visualization. The challenge is that coming up with a good visualization for a general network is often non-trivial. In the same way that a poor choice of scale or plot type can obfuscate an analysis and lead to an ultimately misleading visualization in a more traditional setting, the same applies to a graphical one, perhaps even more so. For a visualization of a network to be fruitful, it is typically best to lay it out in such a way as to minimize the number of times that edges cross or intersect one another. The lowest possible number of edge crossings of a drawing of a graph (when drawn on a plane) is called its crossing number, and in the case that this number is zero, the graph is called a planar graph. This perhaps seemingly innocuous criterion is, in fact, very hard to satisfy. For example, consider the following visualization for a given graph, in which the nodes are positioned randomly.
There is very little we can likely learn about this graph from analyzing this particular visualization, and in fact it may even lead us to believe that the underlying dataset represented by this graph is very chaotic and has little to no structure nor noticeable patterns. It turns out, however, that this network is more well-behaved than this visualization implies. Before going into more detail regarding this example, it can be helpful to better understand why minimizing the number of edge crossings of a graph visualization is so challenging. Imagine trying to rearrange the nodes in this visualization so as to end up with the fewest edge crossings we can (you can try it if you like, the graph is interactive -- but don't try too hard, this is not a planar graph). How will you know that you're done, or rather, that you've rearranged the nodes so that the number of edge crossings is now minimal and is equal to the crossing number of this graph? Better yet, what is the crossing number of this graph? It turns out that computing the crossing number for a general graph is quite hard; in fact, it is an NP-hard problem! But even then, knowing the crossing number for a graph and obtaining such a representation are two different things.
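It is worth noting that merely deciding planarity, unlike computing the crossing number, can be done in linear time. A minimal sketch (using networkx's `check_planarity`, with the complete graph $K_5$ as a stand-in):

```python
import networkx as nx

# K5, the complete graph on 5 nodes, is a classic non-planar graph.
G = nx.complete_graph(5)

# Deciding planarity (crossing number zero) is solvable in linear time,
# even though computing the crossing number in general is NP-hard.
is_planar, certificate = nx.check_planarity(G)
print(is_planar)   # False

# Removing any single edge from K5 yields a planar graph.
G.remove_edge(0, 1)
print(nx.check_planarity(G)[0])   # True
```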
Luckily for us, obtaining a visualization reaching the true minimum number of edge crossings is not a necessity, at least as it relates to representing data within the field of artificial intelligence. In fact, for most of our purposes, a visualization having a reasonable number of edge crossings will likely be just as useful to us as a true planar visualization having only the minimal number of crossings! Such 'good enough' visualizations can be had by applying various heuristics and algorithms to generate candidate visualization layouts for a graph. Several examples of these algorithms applied to this graph (and the resulting visualizations) are shown below. Note that many of these algorithms optimize criteria other than edge crossings, such as spacing nodes evenly or balancing attractive and repulsive forces in a physical simulation of the network.
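As a rough sketch of generating such candidate layouts (using networkx and matplotlib on a stand-in graph; the particular layout algorithms chosen are illustrative):

```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.random_geometric_graph(40, 0.25, seed=42)

# Two common layout heuristics: a force-directed (spring) layout, and the
# Kamada-Kawai layout, which positions nodes by approximating graph distances.
layouts = {"spring": nx.spring_layout(G, seed=7),
           "kamada_kawai": nx.kamada_kawai_layout(G)}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, pos) in zip(axes, layouts.items()):
    nx.draw(G, pos=pos, ax=ax, node_size=40)
    ax.set_title(name)
plt.show()
```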
While not perfect, in that each example has (unnecessary) edge crossings, we can see that these visualizations are far more informative than the original and that each conveys, at a minimum, slightly different information. Various characteristics of this network's global topology are now more immediately evident. Through visual inspection, one can confirm that this network is: 1-connected (it is a connected graph which can be disconnected by removing only a single node: 8, 12, 16, or 26), cyclic (the path (0, 1), (1, 11), (11, 2), (2, 0) is a cycle), and non-planar (the sub-graph induced by the nodes 42, 44, 45, 43, 46, and 47 is non-planar).
There are a myriad of further insights to be had regarding the local, or node-level, structure of this network. These visualizations may seem quite different, and one could argue that each tells a slightly different story about the underlying system which this network models. An important consideration to keep in mind, however, is that they are all identical from a graph-theoretic viewpoint, because they all specify the exact same graphical structure!
Another useful aspect of graphical data representations is the various ways in which one can determine similarity or likeness amongst the set of entities represented by the nodes in the network. The beauty here is that these notions can be had in a, more or less, purely graphical way. For instance, the very fact that two nodes are connected suggests, in and of itself, that these nodes are similar to one another or alike in some way. We can further quantify this notion of similarity by determining the magnitude of this relation using their distance in the network; that is to say, the length of the shortest path connecting them. Depending on the context of what the edges represent, it could be that two directly adjacent nodes (whose shortest path is simply a single edge) are more similar to one another than two nodes connected by a shortest path of, say, length 3. In the example below, we compute the neighborhood of the green node in the given network; these nodes are colored using three shades of red (darkest to lightest) corresponding to the 1st-, 2nd-, and 3rd-neighbors of the green node, respectively.
This coloring scheme gives us the nodes which are similar to the selected node by virtue of being not only connected but, in the case of the dark red nodes, directly adjacent to it. By the above remarks, it may make sense to conclude that the dark red nodes are more similar to the green node than the medium and light red nodes. Under this notion of similarity, this would suggest that the green node and the blue node are not similar, in that they are not connected. However, an alternative notion of likeness can be had by instead noticing that, at least qualitatively speaking, the green and blue nodes play similar roles in the network. They both have high out-degree and zero in-degree, and both are positioned near the 'center' of this network. Also, notice that the subgraphs induced by the neighborhoods of the two nodes are both star graphs and are structurally identical to one another, up to the number of nodes/edges (click the button in the above visualization to see these neighborhoods).
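A minimal sketch of computing these k-th neighbors (using networkx, with a stand-in graph in place of the interactive network above):

```python
import networkx as nx

G = nx.barbell_graph(5, 2)   # a stand-in network for illustration
source = 0

# Shortest-path distance from the source to every node within 3 hops;
# grouping by distance gives the 1st-, 2nd-, and 3rd-neighbors above.
dist = nx.single_source_shortest_path_length(G, source, cutoff=3)
for k in (1, 2, 3):
    print(k, sorted(n for n, d in dist.items() if d == k))
```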
Now let us explore the wealth of information and insights which can be had by extending this approach to a more general setting, to see how it enables more complete and holistic intelligence. We do so by returning to the field of manufacturing and applying this framework to a collection of BoMs for a much larger set of products and components. This is closer to many real-world manufacturing applications, in which one is often dealing with thousands to millions of distinct products and components. We extend this approach to the (synthetic) dataset of BoMs illustrated below, spanning four distinct product families (groupings of like products) and representing ~5,500 distinct components. The nodes are colored according to the product family to which they belong. Click the button above to enable the visualization.
Given the approach detailed above, this network is relatively straightforward to produce, and yet it greatly simplifies a wide array of common problems, to the point that many are almost trivial to answer. For example, the grouping of products and components into product families often does not extend beyond the finished products. However, for any assembled or purchased component, by simply starting at its corresponding node and traversing the network, we can efficiently determine the set of product families of all the finished products in which the item appears as a sub-component. Likewise, simulating the effect of running out of a given component, or of having the cost of a purchased component drastically increase, can be done by propagating these changes across the network. Given that the longest path in this network is 7 (meaning there is a product whose BoM has 7 levels), computing the effect of such changes or monitoring supply levels comes with tremendous complexity, much of which can be reduced with graphical methods. In short, this graphical representation allows for a holistic model of all the products made by a manufacturer.
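As a sketch of the family-lookup traversal just described (using networkx on a toy network; the `family` labels and the `families_using` helper are hypothetical illustrations):

```python
import networkx as nx

# A toy consolidated BoM network: edges point from a component to its
# sub-components, and only finished goods carry a (hypothetical) family label.
bom = nx.DiGraph([("A", "C"), ("G", "C"), ("C", "I"), ("G", "F"), ("F", "I")])
family = {"A": "road bikes", "G": "mountain bikes"}   # finished goods only

def families_using(g, component):
    """Traverse 'upward' from a component to every finished good that uses
    it (directly or indirectly), collecting their product families."""
    return {family[n] for n in nx.ancestors(g, component) if n in family}

print(families_using(bom, "I"))   # {'road bikes', 'mountain bikes'}
```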
Perhaps the most defining characteristic of this network's (global) topology is its connectedness, or rather, lack thereof. Note that the network is composed of 8 connected components, in spite of the fact that it represents products from only 4 product families. This means that the first product family (the red nodes) is in actuality composed of 5 (disconnected) sub-families, which have no common components between them. This highlights the fact that product groupings are often done for marketing and sales purposes and do not necessarily reflect similarity in the manufacturing process. To explore further, we can compute the connected components of the network.
Each node is now colored according to these connected components. By double-clicking on the cubes below the
components, we can further analyze the components as separate sub-networks.
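A minimal sketch of this decomposition (using networkx, with a toy graph standing in for the full component network):

```python
import networkx as nx

# A toy consolidated network of BoMs; in practice this would be the full
# network of ~5,500 components, treated as undirected for this analysis.
G = nx.Graph([("A", "B"), ("A", "C"), ("G", "F"), ("F", "H"), ("X", "Y")])

components = list(nx.connected_components(G))
print(len(components))                  # 3 disconnected sub-families
for c in components:
    sub = G.subgraph(c)                 # analyze each as its own sub-network
    print(sorted(c), sub.number_of_edges())
```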
Armed with no additional information, such as what these products are or how these particular product families were derived, we already have a good deal of a priori knowledge when it comes to developing artificial intelligence or machine learning applications related to these parts. We can further expand this knowledge by analyzing the roles that given components play within their respective sub-networks. Below we compare connected Component 1 (green) and connected Component 5 (purple). The degree of a node is often one of the most descriptive features in a network. In this case, the degree of a component as a node in this network gives us a notion of how many distinct assembled components require this component (if using the downward network flow). Thus, the higher the degree of a node, the more critical the corresponding component is to this collection of finished goods as a whole. Given the fact that components typically become more specialized the further you move 'up' a BoM, high degree nodes can sometimes be purchased components or universal components, such as packaging, labeling, etc.
We could also use the concept of betweenness centrality to carry out this estimation, which, by comparison, would give us a notion of critical sub-components, which may or may not have high degree. In the examples below, the red nodes are the ones having high degree. To compare these competing notions of node criticality, we size the nodes by their degree and centrality, respectively.
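These two notions of criticality are straightforward to compute; a minimal sketch (using networkx, with a barbell graph standing in for a clusters-plus-bridge topology):

```python
import networkx as nx

G = nx.barbell_graph(6, 1)   # two dense clusters joined through one node

degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)

# The bridging node has low degree but by far the highest betweenness --
# the 'critical sub-component' pattern described above.
bridge = max(betweenness, key=betweenness.get)
print(bridge, degree[bridge], round(betweenness[bridge], 3))
```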
Notice that the network on the left has a much denser topology, resulting in a lack of 'hub' nodes having high centrality, and thus centrality is not a defining characteristic of nodes in this sub-network. Equivalently, this means that there are many different ways to travel between nodes, which, in the context of our problem, signifies that many components in this network are more universal in nature. This is akin to the component $I$ in the BoM example above, which is a sub-component of $C$, $F$, and $G$. Conversely, in the network on the right, note that there are quite a few nodes having high betweenness centrality, several of which also have low degree (the larger white nodes). This is due to the sparseness of this topology. Notice that this sub-network has 3 relatively distinct clusters, which are connected only by these low degree, high centrality nodes. Intuitively, this means that, while the components in this sub-network are indeed similar, there is perhaps more dissimilarity outside of these clusters than in the denser sub-network on the left. While this sub-network is indeed connected, it has a much lower degree of connectedness than the sub-network on the left, in that it is only 1-connected.
In fact, this 'triangular' topology is evident in several of our sub-networks, all characterized by having
3
dense clusters connected by high centrality nodes of low degree. This fact becomes more immediate after
'flattening' these sub-networks by projecting them to two dimensions. This suggests there could be
commonalities
between finished goods in these sub-networks, even when they are, from a BoM perspective, dissimilar. This
notion
of determining a structural equivalence amongst these sub-networks is the conceptual basis of graph
homomorphisms.
We can also use a graphical approach to model the flow of a system dynamically, to infer how these relational structures change over time. Inventory management is an incredibly challenging and complex process in manufacturing, from managing a perhaps global supply chain, to optimizing how components are shipped and moved, to isolating scrap and yield issues. It would be of tremendous utility to have a representation of how these components are flowing throughout inventory as a result of the manufacturing process, which we can achieve with graph-theoretic approaches in the following way. First, components are stored in inventory in some manner of stock-keeping units (SKUs). The components are then taken from these SKUs to be used in the fabrication of an assembled component, as a function of work orders, which specify the number of assembled components needed. This assembled component is then placed back in inventory in some SKU or is sent to a customer. In a sense, this can be thought of as a realization of the BoM for a given assembled component, in that we are now also concerned with specific instances of manufacturing assembled components and where the associated sub-components were stored and, subsequently, taken from inventory. Studying the system in such a temporal manner will allow us to better understand how this network grows, morphs, and evolves over time.
Clicking the button will enable the visualization of how components are moving throughout inventory, as detailed above. The time-windowing buttons can be used to simulate how this network develops and morphs over time, as new orders come in and new components are purchased. Finally, a common problem in inventory management is isolating and containing potentially defective components upon the realization of some defect or scrap problem. The button shows an example of how we may propagate these problems across this network. Here we compute all the potentially compromised material in inventory, which now needs to be quarantined and tested, based on the fact that the SKUs given by the red nodes were found to have defective components which were later used in other orders.
As discussed previously, learned or latent representations are paramount to the majority of modern machine learning methods. We now consider how one may leverage such latent representations in a graph-theoretic manner, by using an example from the field of healthcare. First, a bit of background. Healthcare is a field in which AI applications can have tremendous value and impact, but healthcare data possesses the usual complicating factors, such as missing or messy data, siloed and disparate data sources, inherent privacy concerns, and a lack of ubiquity and standardization. While there will often be little overlap in data amassed from various sources (such as between different hospitals), data related to diagnosis and procedure information is somewhat ubiquitous in the form of ICD (International Classification of Diseases) codes: a collection of alphanumeric codes maintained by the World Health Organization and used within the medical field to represent diagnoses and procedures.
We will explore data from the MIMIC-III Clinical Database.
A word2vec model was trained on the ICD codes appearing in the MIMIC-III dataset. This is carried out by taking the list of all ICD codes from each patient's visit and treating them as code sequences. In the vernacular of NLP, the sequence of ICD codes for a patient's visit would serve as our sentences, whilst the individual ICD codes would be our words.
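A minimal sketch of this training setup (using the gensim library; the toy visit sequences shown stand in for the actual MIMIC-III data):

```python
from gensim.models import Word2Vec

# Each patient visit yields one 'sentence': the list of ICD codes recorded
# for that visit (toy sequences here in place of MIMIC-III data).
visits = [["4019", "25000", "4280"],
          ["4280", "42731", "4019"],
          ["25000", "2724", "4019"]]

# 100-dimensional embeddings, mirroring the setup described above.
model = Word2Vec(sentences=visits, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=50)

vec = model.wv["4019"]            # the latent representation of an ICD code
print(vec.shape)                  # (100,)
print(model.wv.most_similar("4019", topn=2))
```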
A typical approach is to embed the words in a high-dimensional (real) space to maximize the expressiveness of the learned representations (100-dimensional vectors were used here) and then apply some dimensionality reduction technique so as to assess the outputs. Below is a visualization of the latent space of this model, projected down to 3 dimensions using a combination of principal component analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Each point in this visual represents one of the ICD codes appearing in the dataset.
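A minimal sketch of this two-stage projection (using scikit-learn; random vectors stand in for the learned embeddings, and the intermediate PCA dimension is our own choice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the learned 100-dimensional ICD code embeddings.
embeddings = np.random.default_rng(0).normal(size=(500, 100))

# First reduce with PCA (cheap, linear), then project to 3 dimensions
# with t-SNE, as in the visualization above.
reduced = PCA(n_components=50).fit_transform(embeddings)
projected = TSNE(n_components=3, perplexity=30,
                 random_state=0).fit_transform(reduced)
print(projected.shape)   # (500, 3)
```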
One of the principal challenges of unsupervised learning is interpreting the results: there are a myriad of potential representations for a given dataset, none of which will likely be useful for every problem or context. However, we can represent this latent space graphically to aid in this task. To do so, we can treat each point as a node and compute edges between them based on similarity. There are a variety of ways in which we can do this, such as computing the distribution of inter-node distances with respect to some distance metric and selecting a threshold, so that we define an edge between two points if their distance is less than this threshold. Likewise, we could apply spectral clustering to the resultant similarity matrix and define an edge between nodes if they are in the same spectral cluster. The former was used here, as one major downside of the spectral clustering approach is that it sacrifices inter-cluster relations: every node in a given cluster will be connected.
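A minimal sketch of the threshold approach (using numpy, scipy, and networkx; random vectors stand in for the learned representations, and the percentile used for the threshold is an arbitrary illustrative choice):

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

# Stand-in for the learned code embeddings.
points = np.random.default_rng(1).normal(size=(200, 100))

# Compute all pairwise distances, then pick a threshold from their
# distribution (here, the 1st percentile; the choice is problem-specific).
dists = squareform(pdist(points, metric="euclidean"))
threshold = np.percentile(dists[np.triu_indices_from(dists, k=1)], 1)

# An edge connects two points whenever their distance is below the threshold.
G = nx.Graph()
G.add_nodes_from(range(len(points)))
i, j = np.where(np.triu(dists < threshold, k=1))
G.add_edges_from(zip(i.tolist(), j.tolist()))
print(G.number_of_edges(), nx.number_connected_components(G))
```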
Clicking the button in the above model carries out this procedure on the learned ICD code representations. Notice that this yields a network with a modest number of edges, which also exhibits a noticeable structure. The network is rather disconnected, in that it has a large number of connected components and a good deal of isolated nodes. An advantage of this approach over a pure clustering method is that it captures a wider dimension of similarity (via paths).
Such representations, while useful in and of themselves, are often developed for use in downstream AI applications. This means that any time we deal with a data source consisting of ICD codes, we can transfer these representations and learnings to the new application.
For example, having such computational representations for the diagnosis and procedure data, we can apply this encoding to study the structure of patients in the dataset; that is, to learn patient-level representations. This is done by applying the learned ICD representations to the ICD code sequence for a patient's visit. Where a patient's visit was previously represented by a sequence of ICD codes, it is now represented by a sequence of 100-dimensional real vectors. We can then apply more advanced unsupervised learning methods to learn the desired patient-level representations. Here, a stacked denoising convolutional autoencoder (SDCAE) was used to learn lower-dimensional representations of the patients.
Notice that we could jump directly to this level by instead training an SDCAE on the patients' ICD sequences themselves. Applying transfer learning in this way, however, gives us the advantage of leveraging the learned contextual similarity of the ICD codes. Further, we can add significant robustness to the model by leveraging the 'wider' similarity criterion established by the connected components: randomly substituting ICD codes with other codes in their connected component!
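A minimal sketch of this substitution augmentation (the toy code network and the `augment` helper are our own illustrations, not the exact procedure used here):

```python
import random
import networkx as nx

# A toy version of the ICD-code similarity network built above.
code_graph = nx.Graph([("4019", "4280"), ("4280", "42731"), ("25000", "2724")])
component_of = {code: comp for comp in nx.connected_components(code_graph)
                for code in comp}

def augment(visit, p=0.2, rng=random.Random(0)):
    """Randomly replace codes with other codes from the same connected
    component -- the 'wider' similarity criterion described above."""
    out = []
    for code in visit:
        peers = sorted(component_of.get(code, {code}) - {code})
        out.append(rng.choice(peers) if peers and rng.random() < p else code)
    return out

# Codes with no peers (e.g. an isolated code) are always left unchanged.
print(augment(["4019", "25000", "V3000"]))
```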
Below is an example of this model applied to a random subset of 6,000 patients from the dataset. Carrying out the same process as with the ICD codes, we can see that this patient representation network is far denser than the code representation network. In fact, this network has only 6 connected components. Notice that the largest component (the red edges), while on the one hand rather densely connected, certainly has many 'holes' in it. We can think of the connected components here as defining various computational phenotypes, yielding the distinct subgroups evident in the network.
In closing, the framework of graphical data representations presented here has tremendous utility in settings where traditional artificial intelligence and machine learning methods suffer from diminished performance, such as data having a complex or relational structure. Additionally, it can be used to improve the performance of many traditional methods by giving practitioners a better understanding of the overall structure of their data, or by informing the development of new features in the feature engineering process.
So what exactly is graph theory? First off, a graph in this context is different from how one usually encounters the notion of a graph in data science and machine learning, that is, the graph of a function, which is of course a way to visualize the relation of outputs to inputs under this function. Instead, in the context of graph theory, a graph refers to a mathematical structure which is used to model pair-wise relationships between entities of some kind. A graph is composed of a set of objects, called nodes or vertices, and a set of relationships between these objects, called edges or links. A graph can be specified by an ordered pair $G = (V, E)$, where $V$ and $E$ denote the set of nodes and edges, respectively. Graph theory is the branch of mathematics concerned with studying and understanding the structure of these objects, including various properties or attributes they might have, as well as developing various algorithms related to understanding these structures.
One of the most fundamental properties for a graph $G = (V,E)$ is whether the edges in $E$ are undirected (the yellow, green, and purple graph examples) or directed (the blue and red graph examples). This delineation specifies whether the edges in the graph are bi-directional, or equivalently, whether the underlying relationship which they represent is symmetric or non-symmetric. Many key graph-theoretic properties and algorithms must be defined differently in the context of undirected vs directed graphs, or often are not defined at all in the opposing case. For example, the notion of a graph being connected in the undirected case must instead be replaced with weakly and/or strongly connected in the directed case. Even fundamental concepts such as the degree of a node require some alteration (the degree of a node in a directed graph can be delineated into in-degree and out-degree).
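A minimal sketch of these directed-graph distinctions (using networkx):

```python
import networkx as nx

D = nx.DiGraph([(0, 1), (0, 2), (2, 1)])

# In a directed graph, degree splits into in-degree and out-degree.
print(D.in_degree(1), D.out_degree(1))    # 2 0
print(dict(D.degree()))                   # total degree per node

# Likewise, 'connected' splits into weak vs strong connectivity.
print(nx.is_weakly_connected(D), nx.is_strongly_connected(D))  # True False
```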
Network Theory or Network Science is a sub-field of graph theory in which the graphs studied represent or model some real-world entity or dynamical system. The key (albeit murky) distinction here is that network theory is concerned with graphs which model some real-world system and which will often be accompanied by various meta-data beyond the graphical structure alone, such as node groupings (blue example graph), edge weights (green example graph), or concepts not expressed explicitly by the graph at hand. Properties related to the layout of complex networks are of particular interest. Namely, the network topology, or simply topology, of a network comprises the characteristics, properties, or attributes related to the layout and arrangement of its nodes and edges, either from a global (the graph as a whole) or local (sub-graphical structures) perspective. Likewise, it is often of interest to analyze various properties of the nodes and edges themselves since, after all, these are the very objects which we are modeling. Common topological properties of interest include shape, connectedness, density, degree distribution, and centrality.
Consider the problem of modeling the dynamics related to some set of objects or entities enumerated by $i \in \{0,1,...,32\}$. Suppose we have the following relational dataset, commonly called an edge list $$\{(26, 30), (26, 31), (26, 21), (26, 29), (26, 20), (26, 22), (26, 27), (21, 20), (21, 29),$$ $$(29, 22), (0, 7), (7, 1), (7, 2), (7, 6), (1, 2), (2, 6), (2, 9), (6, 8), (9, 3), (3, 5), $$ $$(5, 8), (10, 15), (15, 12), (15, 14), (12, 13), (12, 16), (12, 17), (13, 14),$$ $$ (16, 17), (18, 19), (20, 27), (27, 22), (27, 23), (27, 25), (24, 25), (25, 28) \}$$ in which the tuple $(i,j)$ signifies that the ith entity is related to the jth entity. We can represent this system in a graphical framework by taking it to be the graph $G = (V,E)$, where $V = \{0,1,...,32\}$ and $E$ is the edge list above. Taking these edges to be undirected (or equivalently, this relationship to be symmetric) induces the following network.
There exists a multitude of methods and approaches in which we can study and analyze the rich representation structures induced by viewing datasets in this way, the most applicable of which will often be determined by the problem at hand. Several such graph-theoretic concepts are illustrated below for the network induced by the edge list given above, as both an undirected and directed graph.
For example, we can study properties related to the network as a whole by summarizing its network topology, or by computing its connected components (example A), which illustrate potential sub-populations in our data. Specifically, this network has 5 connected components, meaning that a node from, say, the component of purple nodes is not only not directly adjacent to any non-purple node, but has no relation to any non-purple node whatsoever; there is no path from a purple node to any non-purple node.
We can develop models on $G$ in a more direct manner, using methods such as graph convolutions, node embeddings, or spectral clustering. This can be done by representing $G$ by its adjacency matrix (example C). For a general graph $G = (V,E)$, the adjacency matrix $A$ is the $|V| \times |V|$ matrix whose rows and columns represent the nodes in the graph, and in which
$$ a_{i,j} = \begin{cases}
1, & i \text{ and } j \text{ are adjacent} \\
0, & \text{otherwise}
\end{cases}
$$
In the case of weighted edges, we can take $a_{i,j}$ to be the weight of the edge connecting nodes $i$ and $j$, if they're adjacent.
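A minimal sketch of constructing adjacency matrices (using networkx):

```python
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

# The |V| x |V| adjacency matrix: a_ij = 1 iff nodes i and j are adjacent.
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes))
print(A)

# For weighted graphs, the entries hold edge weights instead of 1s.
H = nx.Graph()
H.add_weighted_edges_from([(0, 1, 2.5), (1, 2, 0.5)])
print(nx.to_numpy_array(H, nodelist=[0, 1, 2]))
```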
Likewise, we can gain a deeper understanding of the various nodes themselves (or rather, the real-world objects they represent) and how they compare to one another. We can ascertain how similar a pair of nodes is by determining whether they are in the same connected component, by comparing the size and shape of their neighborhoods (example D), or perhaps by whether or not they are dominant in the larger network (example H). Another powerful concept in network theory is that of cliques, which are critical to undirected probabilistic graphical models (Markov random fields) and give a stronger notion of how nodes group across a network. Of course, we can do a similar analysis at the edge level as well (example F).
We can view this network as a weighted graph, in which we size the nodes and/or edges either by their topological properties or using meta-data. In example B, the edges are weighted by whether or not they are a bridge, and the nodes are sized by their degree. Opening the control pane on this figure allows you to select other possible weighting options. Often the most useful scheme is to imbue these attributes based on the available meta-data we have at hand. For example, if this network represents possible flights (edges) between a set of cities (nodes), we could perhaps weight the edges by how far a plane would have to travel from city to city, how long the flight would take, or how much a flight would cost. Likewise, we can represent various properties of the nodes by coloring them according to what state the city is in, and could size them based on how much the average flight arriving at each city costs (namely, the average weight of edges incident to a node, if the edges are weighted by flight cost). Note that, depending on the context of what cities and/or flights we're considering, it may be more appropriate to model these relations as a directed network instead (example G).
All modeling, examples, and visualizations were done in Python and JavaScript, principally using the following frameworks and libraries: