I’m writing this blog post to help our new engineers starting in the summer. It can also be helpful for anyone interested in learning how LynxKite works. Every large application is like a thousand-year-old city: full of clever ideas, some of which became solid pillars while others got covered up with new clever ideas. Put on your archeologist’s hat and let’s dive in!
Let’s get, build, and run LynxKite:
git clone https://github.com/lynxkite/lynxkite.git
cd lynxkite
conda env create --name lk --file conda-env.yml
conda activate lk
cp conf/kiterc_template ~/.kiterc
./run.sh
If everything went well, you can open http://localhost:2200/
in your browser. Take
the Flight Routes tutorial for a spin.
(Or another tutorial.)
You can see that workspaces are built from boxes. Where are these boxes in the code? Let’s look at PageRank.
A box on the UI is called a “frontend operation” in the code. PageRank in particular is found in
GraphComputationOperations.scala
:
register("Compute PageRank")(new ProjectTransformation(_) {
params ++= List(
Param("name", "Attribute name", defaultValue = "page_rank"),
Choice("weights", "Weight attribute", options = FEOption.noWeight +: project.edgeAttrList[Double]),
NonNegInt("iterations", "Number of iterations", default = 5),
Ratio("damping", "Damping factor", defaultValue = "0.85"),
Choice("direction", "Direction", options = Direction.attrOptionsWithDefault("outgoing edges")),
)
def enabled = project.hasEdgeBundle
def apply() = {
assert(params("name").nonEmpty, "Please set an attribute name.")
val op = graph_operations.PageRank(params("damping").toDouble, params("iterations").toInt)
val weightsName = params("weights")
val direction = Direction(
params("direction"),
project.edgeBundle,
reversed = true)
val es = direction.edgeBundle
val weights =
if (weightsName == FEOption.noWeight.id) es.const(1.0)
else direction.pull(project.edgeAttributes(params("weights"))).runtimeSafeCast[Double]
project.newVertexAttribute(
params("name"),
op(op.es, es)(op.weights, weights).result.pagerank,
help)
}
})
It manages the UI of the box, translates the frontend operation into a series of backend operations, and updates the “project” state. (The box output.) Looking at these one by one:
params
is an instance of
ParameterHolder
and makes it easy to define and access parameters. Parameters have a “kind” field which controls
how they are displayed on the UI. Some “kinds” are simple text boxes, others are complex JavaScript
monsters. They all live in harmony in
app/scripts/operation
.
Changing yes/no parameters into checkboxes by adding a new “kind” could be a great starter task.
Backend operations are the stars of the show! They actually implement algorithms. We explicitly
create op
, an instance of graph_operations.PageRank
. The op(op.es, es)(op.weights,
weights).result.pagerank
expression at the end fills in the inputs of the operation. op.es
and
op.weights
are the input “slots” and es
and weights
are the entities that go into them. The
lines of code that apply the direction and generate unit weights also create backend operations
indirectly. No actual computation is performed at this point, but new “metagraph entities” are
created.
Lastly, project.newVertexAttribute()
mutates the “project” by adding the new attribute that we
just computed. The “project” is the graph state that many boxes take as input/output. It is
a bundle of metagraph entities (vertices, edges, attributes) structured to be presentable to the
user.
Because this box is a subclass of
ProjectTransformation
it has a project
variable initialized to its input state and will use it to create its output.
A box can be re-executed many times. Before #220 we executed all of them on every little change. It’s cheap because they don’t touch the data. They work at the “metagraph” level. Plus it’s not like creating many backend operations will cause a pile-up of PageRanks. If the backend operation has already existed then we are basically getting a new reference to it.
Time to look at a backend operation! Let’s head to
PageRank.scala
.
Adding new algorithms to LynxKite almost always means adding new backend operations. The inputs of frontend operations (boxes) are “states”, such as a project (a graph) or a table. The inputs of backend operations on the other hand are metagraph entities with carefully defined relationships. The same goes for the outputs.
Because we had to define inputs so many times, we made it very comfortable and type-safe. On the other hand a lot of Scala wizardry is taking place behind the scenes, which can obfuscate things a bit. I find it a good trade in the end, because we write operation signatures all the time, but basically never have to touch the “magic” code.
For PageRank the signature is:
class Input extends MagicInputSignature {
val (vs, es) = graph
val weights = edgeAttribute[Double](es)
}
class Output(implicit instance: MetaGraphOperationInstance, inputs: Input) extends MagicOutput(instance) {
val pagerank = vertexAttribute[Double](inputs.vs.entity)
}
These es
and weights
identifiers are exactly the ones used in the frontend operation to specify
the inputs (op.es
and op.weights
). We didn’t have to pass in vs
because passing in es
already unambiguously determined its value. In fact we could omit es
too, since weights
has to
be an attribute on es
.
Likewise pagerank
is used to access the output of the operation.
In the PageRank.execute()
method we finally arrive at the implementation of an algorithm. It is
only here that we finally get to touch data. This is a Spark operation, so the data appears as
RDDs.
Looking back at the history in Git shows that the code used to be fairly simple in 2014 but picked up a series of optimizations over time.
First we added our own implementations of some Spark operations that take advantage of how we store
graph data. (Partitioned by the hash of the ID, but then sorted by ID within each partition.) These
are the reduceBySortedKey()
and sortedLeftOuterJoin()
methods.
Later HybridRDDs were added to handle skewed datasets. (Graphs with a small number of
very-high-degree vertices.) See
HybridRDD.scala
for how this works.
But if you look at the for (i <- 0 until iterations)
loop, that is still readable as a PageRank
implementation.
Scrolling down in
GraphComputationOperations.scala
,
the next box under Compute PageRank is Find Steiner tree. It’s pretty much
the same, but it uses graph_operations.Dapcstp
instead of
graph_operations.PageRank
as the backend operation. When we look inside
Dapcstp.scala
,
it has a signature just like PageRank, but no execute
method!
LynxKite has a concept of execution domains. We have a Spark domain for running operations
implemented with Spark and a Sphynx domain for operations that are implemented on our
high-performance single-machine backend written in Go. (In reality we have half a dozen domains,
see
BigGraphEnvironment.scala
for details.)
All domains can advertise that they have the data for a metagraph entity, that they can compute an
operation, and how data can be relocated to them. They have a compute()
method that has
to make the operation happen. The Spark domain’s compute
will call the execute()
method on an
operation, while the Sphynx domain’s compute
sends an RPC to Sphynx.
We find the implementation of Dapcstp
in
dapcstp.go
.
If you look around the sphynx
directory, you will also find some operations implemented in Python
and others used from the NetworKit library through a SWIG wrapper
for its C++ API. Sphynx has proven to be a great way to integrate single-machine algorithms!
By now you know everything needed to start creating your own boxes with new backend operations. To finish this post I will point you to the parts of LynxKite that are propping up the whole operation.
MetaGraph.scala
defines the metagraph entities, the operation base classes, and all the utilities for creating
operation signatures. VertexSet
, EdgeBundle
, Attribute[T]
, Scalar
, and Table
are simple
and very widely used classes defined here.
MetaGraphManager.scala
is making the operations persistent. The MetaGraphManager
is passed around a lot, often as an
implicit parameter. It is needed anywhere where we want to be able to create backend operations.
DataManager.scala
orchestrates the work of the domains. This is the class that calls a domain’s compute()
to get
something computed.
Workspace.scala
is the internal representation of workspaces and boxes and it also has the code for executing a
workspace. (Not in the execute()
sense, just creating the output states.)
Project.scala
is the code for the graph state (“project”) that boxes input and output. Unexpectedly, it also has
the code for the directory structure in which workspaces, wizards, and snapshots are stored.
graph_api/Scripting.scala
has the implicit classes that are responsible for the op(op.es, es).result
form of invoking
operations.
graph_util/Scripting.scala
is less important, but has a few convenience methods that are often used in frontend operations.
For example the reverse
method on EdgeBundle
s comes from here.
Hopefully this is useful for someone diving into LynxKite. Unlike an archeological site, LynxKite cannot be cordoned off. There is a risk that in a few months time there will be new clever ideas plastered all over this. See What’s new in LynxKite? or watch the LynxKite repository on GitHub to keep on top of it: