Introduction

LynxKite from Lynx Analytics is a graph analytics platform. It can ingest vast amounts of data, interpret it as huge graphs (aka networks) and enable its users to turn the immense information hidden as billions of network connections into business value.

It does that by providing fast data discovery via innovative visualization options, featuring a rich set of business relevant graph algorithms and facilitating various ways of propagating information via the network connections.

With a distributed architecture powered by Apache Spark, it can scale up to any size of data.

But don’t just believe us — try it! We hope this user guide will be a good companion in your journey of network data mining and you will strike gold for your enterprise with LynxKite!

Hotkeys

For faster navigation you can access certain LynxKite features via hotkeys. The keys available depend on where you are in the program. You can always see the list of currently available hotkeys by pressing the ? key.

Workspace browser

The workspace browser is the interface that welcomes you when you navigate to LynxKite in a browser. Like a file browser, it makes it possible to navigate a folder structure and delete or move items. It also allows creating new folders and workspaces — commonly referred to as entries.

To make navigation easier the workspace browser remembers the last folder that was open.

Folders

Folders make it possible to keep the workspaces and other items in LynxKite organized. A common way to group items is by user, so that the workspaces and snapshots of one user are kept in a separate folder from those of another. This organization is encouraged by assigning a private folder to each user inside the Users folder.

Folders can have access control settings. A list of users who can read or write the folder contents can be specified by opening the settings panel. See the section on User authentication & access control for more details.

Administrator users have access to everything and can fine-tune the access control settings to set up any desired system of permissions. This is recommended as part of the LynxKite installation procedure.

Click New folder to create a new folder inside the current folder.

Access the dropdown menu for a folder in the workspace browser to discard, duplicate, or rename the folder. The rename command also makes it possible to move the folder to a different path.

Workspaces

Workspaces allow users to describe complex computation flows visually. For a detailed description see the Workspace user interface section.

Click New workspace to create a new, empty workspace inside the current folder. The workspace immediately opens when created and you can start importing data into it.

Access the dropdown menu for a workspace in the workspace browser to discard, duplicate, or rename the workspace. The rename command also makes it possible to move the workspace to a different path.

Discarding a workspace moves it to the Trash folder in your home folder. This provides a means to undo a deletion: just navigate to Trash and move the workspace back to its original location. Discarding a workspace that is already inside Trash deletes it irretrievably. Delete Trash to permanently discard everything inside it.

Wizards

Wizards are dedicated tools that distill complex analysis workflows into a series of simple steps. See Authoring wizards to learn how they are created.

Wizards appear in the workspace browser with a dedicated wizard icon.

If you click a wizard, a copy will be created in your user directory. This copy is marked as in-progress and its icon changes accordingly. When you click an in-progress wizard, it opens normally and you can continue where you left off.

If you want to edit the workspace behind the wizard, open its dropdown menu in the workspace browser and choose the Open workspace option.

You can also access the workspace of an in-progress wizard by opening the wizard and clicking the View workspace / Fine tune in workspace button.

Wizard screenshot

After opening a wizard, you can fill out the parameters for each step. Click on a heading to move to that step. You can move back or forward as much as you like. Your changes are captured in your "In progress wizards" directory.

Steps with visualizations or large parameter lists benefit from a full-screen view. Click the icon on the current step to switch to maximized view. Click the icon to return to the sequential view.

Snapshots

Snapshots are saved box output states from workspaces. Once a snapshot is saved (see Saving snapshots) it is detached from all workspaces. A snapshot can be of any type that a box output can, such as a project or a table.

Snapshots can be loaded back into a workspace with an Import snapshot box.

Snapshot content can be viewed inside the workspace browser. Click on the snapshot entry to open/close the snapshot viewer.

SQL for snapshots

There is a SQL interface on the workspace browser page that can be expanded by clicking on the plus button. It can be used to query all snapshots available in the current folder, including those in subfolders. To refer to a table, first give the path from your current folder to the snapshot. For project snapshots, then append . and the name of the table you want to access. The table reference must be enclosed in backticks (`), as in the examples below.

For example, let’s say you are in your private folder where you have a subfolder called Premier_League, in which you have a project snapshot named Arsenal. If you want to access the vertices table of the Arsenal project snapshot from your private folder, you need to refer to it by `Premier_League/Arsenal.vertices`. If you are already in the Premier_League folder, the reference shortens to `Arsenal.vertices`.

The SQL interface on the workspace browser page can also be used to reference table snapshots. For example, let’s say you have a table snapshot called Players which has the data of all football players playing in the Premier League. Then you can reference it the same way as the tables in project snapshots: for example, you can list all Arsenal players with select * from `Players` where team = "Arsenal". Notice that you still need to enclose the name of the snapshot in backticks.
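The reference rules above can be sketched as a small Scala helper. This is only an illustration: the object and function names are hypothetical and not part of LynxKite. The helper simply assembles the backtick-quoted reference from a relative snapshot path and an optional table name.

```scala
object SnapshotTableRef {
  // Hypothetical helper for illustration; not a LynxKite API.
  // relativePath: path from the current folder to the snapshot.
  // table: Some(name) for a table inside a project snapshot,
  //        None when the snapshot itself is a table.
  def reference(relativePath: String, table: Option[String]): String = {
    val name = table match {
      case Some(t) => relativePath + "." + t
      case None => relativePath
    }
    "`" + name + "`"
  }

  // Rebuilds the Players query from the text.
  def playersQuery: String =
    "select * from " + reference("Players", None) + " where team = \"Arsenal\""
}
```

Calling SnapshotTableRef.reference("Premier_League/Arsenal", Some("vertices")) yields the same reference as in the private folder example.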

For details about querying project snapshots, see the documentation for the SQL1 box.

The table browser

The table browser helps to find available table and column names for the global SQL box or for SQL boxes in the workspace. The following hints help with usage:

  • Drag table and column names into the editor box with your mouse.

  • Double-clicking a name also works in the global SQL editor.

  • Click on the icon to expand a directory, a snapshot or a table.

Exporting results

The first few rows of query results can be inspected in the browser. The full results can be exported into files. LynxKite provides a range of export formats. For details about the available formats, see the documentation of the Export to CSV, Export to JDBC, Export to JSON, Export to ORC, and Export to Parquet operations.

Built-ins

The built-ins directory is created by default for every LynxKite instance. It contains helpful built-in workspaces which can be used as custom boxes. Built-ins are loaded automatically every time LynxKite restarts and should not be modified directly.

Workspace user interface

A workspace can be opened from the Workspace browser. This section describes the user interface of a workspace.

Workspace header bar

The workspace title bar contains the name of the workspace, its full path (the folders it is in), and buttons for various program functions.

If the workspace is in the Root folder, only the name of the workspace is shown. When you dive into a custom box, the workspace title changes and shows the custom box’s name and path.

Workspace header buttons

Not all the buttons listed here are accessible at all times. See the details below on when each function is available.

Save selection as custom box

Creates a custom box from the selected boxes. Only available if at least one box is selected. The custom box will be saved under the specified full path. A full path in the LynxKite directory system has the following form: top_folder/subfolder_1/subfolder_2/…/subfolder_n/name
Keep in mind that there is no leading slash at the beginning of the path. When you edit the workspace a/b/…/workspace_name, the list of custom boxes shown on the UI is limited to those in the special directories built-ins, custom_boxes, a/custom_boxes, a/b/custom_boxes, …

Save as Python code

Generates Python API code for the selected boxes. If nothing is selected, the whole workspace is used.

Delete selected boxes

Removes the selected boxes. Only available if at least one box is selected.

Dive out of custom box

Closes the custom box workspace and returns to the main workspace. Only available if a custom box workspace is opened.

Dive into custom box

Opens the selected custom box as a workspace. Only available if a custom box is selected.

Select boxes on drag

If this mode is enabled, boxes can be selected by dragging a selection rectangle. You can still pan (move the viewport) by clicking and dragging while holding Shift, or select boxes individually (and add boxes to the selection by holding Ctrl).

Pan workspace on drag

If this mode is enabled, clicking and dragging will move the viewport. Boxes can be selected in two ways: individually by clicking (holding Ctrl adds further boxes to the selection), or by dragging a selection rectangle while holding Shift.

Undo

Undoes the last change performed on the workspace.

Redo

Redoes the last undone change. Only available if you haven’t performed any new changes since the last undo.

Save workspace as

Makes a copy of the current workspace with a new name. You will have write permissions to the new copy even if you did not have them for the original.

Close workspace

Closes the workspace.

Boxes and arrows

Workspaces allow users to describe complex computation flows visually by creating workflows represented by boxes and arrows. Boxes represent operations and they are connected by arrows. The sequence of operations applied to the data is shown on a path determined by the arrows.

After creating a new workspace, the viewport is empty, except for the Anchor located in the left corner. The anchor can be used to explain the overall purpose of the workspace. You can add a description and an image, and set parameters (more details: Parametric parameters). The URL to an image is useful when you want to reuse the workflow as a custom box in another workspace: in that case the image will serve as the custom box’s icon. Preferably this should be a link to a local image, like images/icons/anchor.png.

You can add a box to the workspace by dragging an operation from The operation toolbox. Clicking on the box opens its Box parameters popup, which allows you to set the parameters.

A box can have inputs (on its left) and outputs (on its right). A box will indicate the number of boxes that can be connected to it and the type of the required input or output (for example: project, table).

You can add arrows to the viewport by connecting the boxes. Boxes can be connected in two ways:

  • Automatically, by hovering the input of one box over the output of another.

  • Manually, by clicking on the output of one box, then dragging the arrow to the input of another.

When two boxes are connected, the computation of the selected operation starts. The color of the output will indicate the status:

  • Red: error, something’s wrong

  • Blue: not yet computed

  • Yellow: currently computing

  • Green: computed

Clicking on the output of a box will open State popups.

Tips & tricks

Instead of clicking on the search bar, you can press the / key. After finding the box you want, you can press Enter to place the box under your mouse. You can place multiple boxes without leaving the search bar.

Boxes and connected box sequences can be copy-pasted, even to different workspaces and LynxKite instances. A limitation here is that the custom boxes are not copied, so they have to be present on the target instance too.

The copy-paste mechanism is implemented via serializing to YAML, a human-readable and editable textual format, so you can even save box sequences to text files or share them via email. Such a YAML file (if it has a .yaml extension) can also simply be dragged and dropped into a LynxKite workspace.

Hold SHIFT while moving a box to align it to a grid.

Box parameters popup

Clicking on a box opens its box parameters popup. This popup allows you to set the parameters of the box. A faint trail connects the popup to the box it controls. Click the box again, or click on the icon in the top right corner to close the popup.

Click More about "…​" to expand the help page for the box. It can be useful to review the help page when using a box for the first time.

The short description for each parameter can also be accessed by clicking or hovering over the icons by each parameter.

Applying boxes to segmentations

What if you wanted to compute PageRank for the communities in the graph?

If you want to apply a box to a segmentation, first add the box as normal. Then in the box parameters popup adjust the special Apply to parameter to pick the segmentation. This special parameter is added for all project-typed inputs, making it possible to work with segmentations (and the segmentations of those segmentations, etc.) inside projects.

Parametric parameters

Parametric parameters can reference workspace parameters.

For example, consider a workspace with two Import CSV boxes, one importing accounts-2017.csv and the other importing transactions-2017.csv. You could add a workspace parameter called date with default value 2017. Make the file name parameter of the import boxes parametric by clicking the icon to the right of the parameter input. Change the file name parameters to accounts-$date.csv and transactions-$date.csv. Now 2017 will be substituted for $date, importing the same files as before.

One benefit of this is that you can change the date in a single place (on the anchor box) instead of having to update multiple boxes when the time comes.

Another benefit is that if your workspace is used as a custom box in another workspace, the workspace parameters are specified by the user. Parametric parameters allow you to pass these user-specified parameters on to boxes in the workspace.

Even complex parameters, like a list of vertex attributes, can be toggled to become parametric. In this case the original input field is replaced by a simple text field.

Parametric parameters are evaluated using Scala string interpolation. This means that Scala expressions can be embedded in these parameters. For example, you could write accounts-${date.toInt + 1}.csv.
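The substitution can be tried out in plain Scala. The following is a minimal sketch, assuming the workspace parameter date is visible as a String value. The object name is made up for the example, and the val stands in for the value that LynxKite would supply.

```scala
object ParametricParameterSketch {
  // Stand-in for the workspace parameter "date"; its value is a String.
  val date = "2017"

  // Plain substitution, as in accounts-$date.csv.
  val accounts = s"accounts-$date.csv"

  // An embedded Scala expression: convert to Int, add one, substitute.
  val nextAccounts = s"accounts-${date.toInt + 1}.csv"
}
```

Here accounts is accounts-2017.csv and nextAccounts is accounts-2018.csv.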

Unexpected parameters

Unexpected parameters are parameters that have been set at some point on the box, but are no longer recognized.

The list of parameters for many boxes is determined dynamically. For example in Aggregate on neighbors there is one parameter for each vertex attribute. If you have configured an aggregation for attribute X but then changed the input to no longer have an attribute called X, then the parameter that sets aggregation on X becomes an unexpected parameter.

Unexpected parameters are treated as errors. You can click the icon to the right to remove the unexpected parameter. Or you can change the input so that the parameter becomes recognized again.

Box metadata

Click the icon in the popup header to access the box metadata. Click the icon to return to the parameter editor.

Box ID

The internal identifier of this box within the workspace. This is only visible when storing the box in a text format.

Operation

The operation that this box represents. You can edit this to change the type of the box. For example you could turn an Import CSV box into an Import Parquet box.

State popups

Click on an output of a box to open that output state in a popup. Click the output again, or click on the icon in the top right corner to close the popup. You can also press ESC to close the last used popup.

Different output types have different data and features available in their popups. But they all have some things in common.

Saving snapshots

The toolbar at the top of the state popup always contains an icon for saving the state as a snapshot. The snapshot will be saved outside of the workspace, in the directory tree. Snapshots are independent of the workspaces from which they were saved. Use them to share final results, or record intermediate results for comparison.

To save a snapshot you have to specify the full path of the snapshot. A full path in the LynxKite directory system has the following form: top_folder/subfolder_1/subfolder_2/…​/subfolder_n/name
Keep in mind that there is no leading slash at the beginning of the path.

Snapshots can be loaded back into a workspace with an Import snapshot box.

Instruments

Boxes like Graph visualization, SQL1, and Custom plot are essential for looking at your data. It is very natural to want to take a quick look at the data in the middle of a complex workspace.

One option is to quickly create and attach a Graph visualization box, see what the graph looks like at that point, and then delete the box. Instruments are effectively the same, except that no temporary box is added to the workspace. This means instruments can be used even on read-only workspaces.

The instrument buttons are in the popup toolbar. For example, the SQL and Visualize buttons correspond to the SQL1 and Graph visualization boxes. If you click on SQL, the popup contents are replaced by the box parameters of the SQL1 box at the top and the output state of the SQL1 box at the bottom.

The output state of the instrument once again has a toolbar for snapshotting and applying instruments. This makes it possible to apply one instrument after the other.

Instruments are not saved into the workspace. But they are built from regular boxes, so the same calculations can always be reproduced using conventional boxes.

Project state

A "project" is a rich type that represents graphs and their segmentations in one bundle. The popup for a project output shows basic information about the graph, such as the number of vertices and edges. It lists the scalars, attributes, and segmentations. Scalar values are displayed, attribute histograms are available on click, and segmentations can be opened to dig deeper.

The Projects chapter gives a more in-depth description of projects.

Table state

Tables are the same in LynxKite as in relational databases and spreadsheet programs: they are a matrix of columns and rows. Tables are the input and output of SQL queries. Projects can be built from tables via Use table as vertices, Use table as edges, and similar operations.

Plot state

The plot state is a data visualization created via the Custom plot box, or one of the built-in plotting boxes.

Export state

Export boxes, such as Export to CSV, allow you to configure an export operation. The output of these boxes is an export state. It is the export state which actually allows triggering the often resource-intensive computation of creating the output files.

This two-step process avoids accidental exports while editing the workspace. It also provides metadata information about the output, for example a file path. To trigger the export, click on the icon.

Custom boxes

It is easy to extend LynxKite with custom boxes that are specific to a project or organization. Wrapping logical parts of your workspaces in custom boxes makes the workspace easier to understand and avoids repetition.

A custom box is simply another workspace. If you place a workspace in the X/Y/custom_boxes directory, you will be able to use it as a custom box in any workspaces recursively under X/Y. If you place a workspace in the top-level custom_boxes directory, any workspace in this LynxKite instance will be able to use it. This system of scoping makes it possible to organize project-specific or universally useful custom boxes.
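The scoping rule can be sketched in a few lines of Scala. This is an illustration of the rule as described in this guide, not LynxKite's actual implementation, and the object and function names are hypothetical. Given the full path of the workspace being edited, it lists the directories whose workspaces would be offered as custom boxes.

```scala
object CustomBoxScope {
  // Directories searched for custom boxes when editing the workspace at
  // workspacePath (a full path without a leading slash).
  def visibleDirs(workspacePath: String): Seq[String] = {
    // Drop the workspace name to get the enclosing folders.
    val folders = workspacePath.split("/").toSeq.dropRight(1)
    // custom_boxes at the top level and inside every enclosing folder.
    val scoped = (0 to folders.length).map { n =>
      (folders.take(n) :+ "custom_boxes").mkString("/")
    }
    // The built-ins directory is always available.
    "built-ins" +: scoped
  }
}
```

For a workspace at a/b/my_workspace this gives built-ins, custom_boxes, a/custom_boxes, and a/b/custom_boxes.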

If you place a workspace in custom_boxes, it will appear in the box catalog under the "Custom boxes" category, and in the box search. You can place it in a workspace.

An ordinary workspace used this way will result in a custom box that has no inputs and outputs. That is not very useful! To fix that, just add Input and Output boxes to the workspace of the custom box.

It is inconvenient to work with Input boxes, because their output is missing. It will be filled in when the custom box is used in another workspace. But when you’re editing the workspace of the custom box directly, there is nothing coming in yet. There are two solutions to this:

  • Place your custom box in a workspace. Connect its inputs. Select it and dive into the custom box with the Dive into custom box button. Now you will see and edit the workspace of the custom box in the context of the parent workspace. The input box will have a valid output: the state that is coming in from the parent workspace.

    Any changes you make will affect all instances of the custom box.

  • It is often the case that your workspace grows and you reach a point where you want to extract part of it into a custom box. Do not create a workspace in custom_boxes manually in this case. It is simpler to select the part of the workspace that you want to wrap into a custom box and click the Save selection as custom box button instead.

    The workspaces of custom boxes created this way will automatically have the input and output boxes set up.

Custom box parameters

Your custom box now has inputs and outputs and can provide useful functionality. Custom boxes can also take parameters. This is configured through the Anchor box of the workspace of the custom box.

You can set the name, type, and default value of the parameters. The following parameter types are supported:

  • Text: Anything that the user can type. It could be a string or a number. This will appear as a plain input box in the custom box’s parameters popup.

  • Boolean: Will appear as a true/false dropdown selection in the box parameters popup.

  • Code: Will appear as a multi-line code editor to the user.

  • Vertex attribute, edge attribute, segmentation, scalar, column: These types allow the user to select an attribute, segmentation, scalar, or column of the input via a dropdown list. If the custom box has multiple inputs, the options belonging to all the inputs will be offered in the list.

To make use of the custom box’s parameters in the workspace of the custom box, you need to access them from Parametric parameters. Regardless of their type, all the parameters are seen as Strings from the Scala code of the parametric parameters. Use .toInt, .toDouble, .toBoolean on them if you need to do more than simple string substitution.
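The following minimal sketch shows why the conversions matter. The parameter names count, ratio, and enabled are invented for the example, and the vals stand in for the String values that a custom box would receive.

```scala
object CustomBoxParameterSketch {
  // Stand-ins for custom box parameters; every value arrives as a String.
  val count = "10"
  val ratio = "0.5"
  val enabled = "true"

  // Without conversion, + concatenates Strings.
  val concatenated = s"${count + 1}"
  // With conversion, the arithmetic is numeric.
  val incremented = s"${count.toInt + 1}"
  val scaled = s"${count.toInt * ratio.toDouble}"
  // Boolean parameters need .toBoolean before they can be used in conditions.
  val label = if (enabled.toBoolean) "on" else "off"
}
```

In this sketch concatenated is the text 101, while incremented is 11, scaled is 5.0, and label is on.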

Authoring wizards

You can build complex analysis workflows in LynxKite workspaces. You can encapsulate such workflows in Custom boxes so that other LynxKite users can reuse them. Another way to share your work is in the form of wizards.

To turn a workspace into a wizard, open the parameters of the Anchor box and set the Wizard parameter to yes. Now your workspace is a wizard. But it doesn’t have any steps yet.

Each step in a wizard corresponds to a parameter or state popup from the workspace. There are two ways to add steps to the wizard. The anchor box has a table of steps:

Screenshot of wizard steps

In this table you can specify:

  • The title of the step. This appears on the wizard view in a large font.

  • The description of the step. This is a multi-line field where you can add more text to the step using Markdown syntax. This makes it possible to use formatted text with images and links.

  • The box from which you want to use the parameter or output state.

  • The popup column lets you choose "parameters" (to use the parameter popup) or one of the output states of the box.

  • The order of the steps. Use the buttons on the right to move a step up or down, or to delete it.

You can also quickly add steps to a wizard from a parameter or state popup. Once the workspace is configured as a wizard, each popup will have an icon in the header bar. Click this icon to add or remove the popup as a step.

Using custom boxes as steps in a wizard makes it possible to create interfaces specially crafted for a specific use case.

Using wizards

Once a workspace has been configured as a wizard, clicking it in the workspace browser takes you to the wizard view.

Wizard view screenshot

If the In progress setting is disabled in the Anchor box, opening the wizard creates a copy of it. This way multiple users can work off of the same wizard without interfering with each other. The copies will be created with the In progress setting enabled. Opening these copies then will not create further copies.

See our section on Wizards in the workspace browser for more about how wizards look from outside of the workspace.

Scala guide

You can derive attributes in LynxKite by implementing the derivation formulas using Scala. For a general introduction to the Scala language, see the Tour of Scala.

Getting started

The simplest way of using Scala to derive attributes is to just provide a one-liner expression in Derive vertex attribute or Derive edge attribute. The examples below are for deriving vertex attributes. The only difference from deriving edge attributes is the way vertex attributes can be accessed.

A simple example:

6.0 * 7.0

will generate a constant Double attribute of value 42.0. You can also use values of other attributes in the expression:

6.0 * age

assuming that there is already an age attribute defined. LynxKite can also accept a list of Scala expressions:

val x = age + 1.0
val y = num_friends + 2.0
y / x

In this case, the value of the last expression will be taken as the value of the derived attribute. More complex code can be structured using functions:

// The = sign makes each function return the value of its body.
def getAge() = {
  age + 1.0
}
def getNumFriends() = {
  num_friends + 2.0
}
getNumFriends() / getAge()

Allowed types

LynxKite uses Scala data types internally, so there is no need for type conversion between LynxKite and the derivation script. However, to support persistence, the available types for both input (the type of vertex and edge attributes the script can use) and result are restricted to the following:

  • Double

  • String

  • Int

  • Long

  • Vector[X] where X is a supported type

  • (X, Y) where X and Y are supported types

Values of other types need to be manually converted before returning from the Scala script. For input types, you can use, for example, either of Convert vertex attribute to String or Convert vertex attribute to Double.
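As a sketch of what results of the supported types can look like, the snippet below builds a Vector, a pair, and a converted value in plain Scala. The age and name values are hypothetical stand-ins for vertex attributes, and the object wrapper exists only to make the example self-contained.

```scala
object AllowedTypesSketch {
  // Hypothetical attribute values for a single vertex.
  val age: Double = 20.0
  val name: String = "Adam"

  // Vector[X] where X is a supported type.
  val asVector: Vector[Double] = Vector(age, age * 2.0)

  // (X, Y) where X and Y are supported types.
  val asPair: (String, Double) = (name, age)

  // Boolean is not in the list above, so convert it before returning.
  val isAdult: String = (age >= 18.0).toString
}
```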

Apache Spark status

LynxKite uses Apache Spark as its distributed computation backend. The status of the backend is reflected by the elements in the bottom right corner of the page.

A single LynxKite operation is often performed as a sequence of multiple Spark stages. A single Spark stage is further subdivided into Spark tasks. Tasks are the smallest unit of work. Each task is assigned to one of the machines in the cluster.

The rotating cogwheel in the bottom right indicates that Spark is calculating something.

The Stop calculation button appears when you hover over the cogwheel. It sends an interruption signal to Spark. This signal aborts work on all Spark stages. The tasks that are in progress will still be finished, but the outstanding tasks and stages will be cancelled. The button cancels all Spark stages, not just the ones initiated by the user pressing the button. For this reason the button is restricted to admin users.

The little colorful rectangles represent Spark stages. The height of the rectangle indicates the percentage of tasks completed in the stage. The color corresponds to the type of work it does.

Projects

Projects are a rich box output type that represent graphs and their segmentations in one bundle. The state popup for a project output shows basic information about the graph, such as the number of vertices and edges. It lists the scalars, attributes, and segmentations. Scalar values are displayed, attribute histograms are available on click, and segmentations can be opened to dig deeper.

Scalars

Scalars are data that correspond to the whole graph.

For example, you can compute the average of any numeric vertex attribute with Aggregate vertex attribute globally. This average will show up as a scalar in the output project.

Machine learning models

Machine learning models are one type of scalar. They are created by a machine learning operation (for example Train linear regression model) and used for prediction with the Predict with model operation or for classification with the Classify with model operation.

Press the plus button to access detailed information about a machine learning model.

Method

The machine learning algorithm used to create this model.

Label

The name of the attribute that this model is trained to predict. (The dependent variable.)

This will not appear for unsupervised machine learning models.

Scaling

Details about the pre-processing step applied to the features before training. The two phases are centering and scaling. The first phase (centering) subtracts the mean from all elements, so the resulting data set has a mean of 0. The second phase (scaling) divides all the elements by the standard deviation. The means and deviations in these steps are computed columnwise.

Suppose we have an original data item (a, b). After these two steps, the data item that is used for the training will be ((a-m1)/d1, (b-m2)/d2), where m1 and d1 are the mean and the standard deviation for the first column (the a’s) and m2 and d2 are the mean and the standard deviation for the second column (the b’s).

Note that both steps are optional: it depends on the model, whether they are applied or not.
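The two phases can be sketched for a single feature column as follows. This is an illustration rather than LynxKite's implementation: the names are hypothetical, and the sketch assumes the population standard deviation, while the actual model may use a different estimator.

```scala
object ScalingSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population standard deviation (an assumption of this sketch).
  def stdDev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }

  // Centering followed by scaling: x becomes (x - m) / d.
  // A constant column (d == 0) is not handled here.
  def standardize(column: Seq[Double]): Seq[Double] = {
    val m = mean(column)
    val d = stdDev(column)
    column.map(x => (x - m) / d)
  }
}
```

Applying standardize to each column separately gives the ((a-m1)/d1, (b-m2)/d2) form described above.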

Features

The list of the feature attributes that this model uses for predictions. (The independent variables.)

Details

For decision tree classification models:

  • The i-th element of support is the number of occurrences of the i-th class in the training data divided by the size of the training data.
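The definition of support can be written out as a short Scala function. This sketch assumes that class labels are encoded as the integers 0 to numClasses - 1. That encoding and all names are assumptions made for the example.

```scala
object SupportSketch {
  // Fraction of the training rows that belong to each class,
  // listed in the order of the class index.
  def support(labels: Seq[Int], numClasses: Int): Seq[Double] =
    (0 until numClasses).map { c =>
      labels.count(_ == c).toDouble / labels.size
    }
}
```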

For linear regression and logistic regression models:

  • intercept is the constant parameter in the regression equation of the model.

  • coefficients are the coefficients in the regression equation of the model.

For linear regression models:

  • R-squared is the coefficient of determination, an index of the linear correlation between the features and the label.

  • MAPE is the mean absolute percentage error, a measure of prediction accuracy.

  • T-values can be used for the hypothesis test of coefficient significances. This will not appear for the lasso model.

For logistic regression models:

  • Z-values can be used for the hypothesis test of coefficient significances.

  • pseudo R-squared, or McFadden’s R-squared in our case, is an index of the logistic correlation between the features and the label.

  • threshold is the probability threshold for binary classification. If the outcome probability of the label 1.0 is greater than the threshold, the model will predict the classification label as 1.0. The threshold is obtained by maximizing the F-score.

  • F-score is a measure of test accuracy for binary classifications.
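How a threshold can be chosen by maximizing the F-score is sketched below. The sketch assumes the F1 variant of the F-score and a fixed list of candidate thresholds. Both are assumptions, and the names are hypothetical, so this does not describe how LynxKite searches for the threshold internally.

```scala
object ThresholdSketch {
  // F1 score when label 1.0 is predicted for probabilities above the threshold.
  def f1(probs: Seq[Double], labels: Seq[Double], threshold: Double): Double = {
    val predicted = probs.map(p => if (p > threshold) 1.0 else 0.0)
    val pairs = predicted.zip(labels)
    val tp = pairs.count { case (p, l) => p == 1.0 && l == 1.0 }
    val fp = pairs.count { case (p, l) => p == 1.0 && l == 0.0 }
    val fn = pairs.count { case (p, l) => p == 0.0 && l == 1.0 }
    // F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
    if (tp == 0) 0.0 else 2.0 * tp / (2.0 * tp + fp + fn)
  }

  // Picks the candidate threshold with the highest F1 score.
  def bestThreshold(
      probs: Seq[Double], labels: Seq[Double], candidates: Seq[Double]): Double =
    candidates.maxBy(t => f1(probs, labels, t))
}
```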

For KMeans clustering models:

  • cluster centers are the vectors of the KMeans cluster centers.

  • cost is the k-means cost (sum of squared distances of points to their nearest center) for this model on the training data.
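The cost definition translates directly into code. The following is a minimal sketch with hypothetical names that represents points and cluster centers as plain Scala vectors of equal length.

```scala
object KMeansCostSketch {
  // Squared Euclidean distance between two vectors of the same length.
  def squaredDistance(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum over all points of the squared distance to the nearest center.
  def cost(points: Seq[Vector[Double]], centers: Seq[Vector[Double]]): Double =
    points.map(p => centers.map(c => squaredDistance(p, c)).min).sum
}
```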

Vertex and edge attributes

Vertex attributes are values that are defined on some or all individual vertices of the graph. Edge attributes are values that are defined on some or all individual edges of the graph.

Each attribute has a type. For each vertex/edge the attribute is either undefined or the value of the attribute is a value from the attribute’s type.

Clicking on a vertex or edge attribute opens a menu with the following information and controls.

  • The type of the attribute (e.g. String, Double, …​).

  • A short description of how the attribute was created, if available, with a link to a relevant help page.

  • A histogram of the attribute, if the attribute is already computed. A menu item to compute the histogram otherwise. By default, for performance reasons, histograms are only computed on a sample of all the available data. Click the "precise" checkbox to request a computation using all the data. Click the "logarithmic" checkbox, to use a logarithmic X-axis with logarithmic buckets. (Useful when the distribution is strongly skewed.)

  • If you are viewing the project in a Graph visualization box: Controls for adding the attribute to the current visualization, if Concrete vertices view or Bucketed view is enabled. See details in Concrete visualization options.

There are lots of ways you can create attributes:

Undefined values

Sometimes a vertex (or an edge) does not have any value for a particular attribute. For example, in a Facebook graph, the user’s hometown might or might not be given. In such a case, we say that this attribute is undefined for that particular vertex (or edge). Usually, an undefined value represents the fact that the information is unknown. Indeed, some algorithms (e.g., Predict attribute by viral modeling) work on undefined attribute values, and their job is to fill them in with reasonable estimates.

Note that an empty string and an undefined value are two different concepts. Suppose, for example, that a person’s name is represented by three attributes: FirstName, MiddleName, and LastName. In this case, MiddleName could be the empty string (meaning that the person in question has no middle name), or it could be undefined (meaning that their middle name is not known). Thus, the empty string is treated as an ordinary String attribute.

Differences between undefined and defined values:

  • In histograms, undefined values are not counted, whereas defined values (including the empty string) are counted.

  • Filters work only on defined attributes. (See Filter by attributes.)

  • Derive edge attribute and Derive vertex attribute allow you to choose whether to evaluate the expression if some of the inputs are undefined.

Fill vertex attributes with constant default values can be used to replace undefined values with a constant. By replacing them with a special value, they can be made part of histograms or filters.
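The behavior of undefined values in histograms can be modeled in Python, using None to stand for "undefined". This is only a conceptual model with made-up data and function names. It does not show how LynxKite stores attributes.

```python
from collections import Counter

# None models an undefined value. The empty string is an ordinary defined value.
hometown = {"Adam": "Budapest", "Eve": None, "Bob": "", "Joe": "Budapest"}


def histogram(attribute):
    """Count defined values only; undefined values are skipped."""
    return Counter(v for v in attribute.values() if v is not None)


def fill_with_default(attribute, default):
    """Replace undefined values with a constant, leaving defined values untouched."""
    return {k: default if v is None else v for k, v in attribute.items()}


print(histogram(hometown))
# Counts "Budapest" twice and "" once. Eve is not counted at all.
print(histogram(fill_with_default(hometown, "unknown")))
# Eve now appears under the special value "unknown".
```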

CSV export/import and undefined

When exporting attributes, LynxKite differentiates between undefined attributes and empty strings. For example, if attribute attr is undefined for Adam and Eve, but is defined to be the empty string for Bob and Joe, here’s what the output looks like. Note that the empty string is denoted by "", whereas the undefined value is completely empty (i.e., there is nothing between the commas):

"name","attr","age"
"Adam",,20.3
"Eve",,18.2
"Bob","",50.3
"Joe","",2.0

Note, however, that importing this data from a CSV file will treat undefined values as empty strings. So, in this case, the distinction between undefined values and empty strings is lost. One way to overcome this difficulty is to replace empty strings with another, unique string (e.g., "@") before exporting to CSV files. (Other export and import formats do not suffer from this limitation.)
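The following Python sketch reproduces the export convention shown above and demonstrates how the distinction disappears on import. The formatting helpers are hypothetical, and Python's standard csv reader is used as a stand-in for a typical CSV importer. It is not LynxKite's import code.

```python
import csv
import io


def format_field(value):
    """Undefined (None) becomes nothing; strings are quoted; numbers are written as-is."""
    if value is None:
        return ""
    if isinstance(value, str):
        return '"' + value.replace('"', '""') + '"'
    return str(value)


def format_row(values):
    return ",".join(format_field(v) for v in values)


rows = [("Adam", None, 20.3), ("Bob", "", 50.3)]
lines = [format_row(row) for row in rows]
print(lines)   # ['"Adam",,20.3', '"Bob","",50.3']

# Reading the lines back: both the undefined value and the empty string
# arrive as '', so they can no longer be told apart.
parsed = list(csv.reader(io.StringIO("\n".join(lines))))
print(parsed)  # [['Adam', '', '20.3'], ['Bob', '', '50.3']]
```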

Creating undefined values

It might be necessary to create attributes that are undefined for certain vertices/edges. (An example use case is when you want to create input for a fingerprinting or a viral modeling operation.) This can be done with the Derive vertex attribute (or Derive edge attribute) operation. For example, the Scala expression

if (attr > 0) Some(attr) else None

will return attr whenever its value is positive, and undefined otherwise.

Segmentations

Segmentations are connected sub-projects. The segmentation of a project is a graph, just like the graph in the base project. The vertices of the segmentation are also called "segments". A set of edges exists between the base project and its segmentation, representing membership in a segment. (To distinguish these special edges we also call them "links".)

For example the Find maximal cliques operation creates a new segmentation, in which each segment represents a clique in the base project. Vertices of the base project are linked to the segments which represent cliques that they belong to.

Segmentations serve as the foundation of many advanced operations. For example the average age for each clique can be calculated using the Aggregate to segmentation operation and the average size of the cliques that a person belongs to can be calculated with Aggregate from segmentation.
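The two aggregations mentioned above can be illustrated with a small Python model in which the links are (vertex, segment) pairs. The data and the helper average_by are made up for this sketch. The real operations offer more aggregators than the average.

```python
from collections import Counter, defaultdict

age = {"Adam": 20.0, "Eve": 18.0, "Bob": 50.0}
# Links between base vertices and segments (cliques).
links = [("Adam", "c1"), ("Eve", "c1"), ("Bob", "c1"), ("Eve", "c2"), ("Bob", "c2")]


def average_by(pairs, attribute):
    """For (target, source) pairs, average attribute[source] for each target."""
    groups = defaultdict(list)
    for target, source in pairs:
        if source in attribute:  # sources without a value are skipped
            groups[target].append(attribute[source])
    return {target: sum(values) / len(values) for target, values in groups.items()}


# Aggregate to segmentation: average age of the members of each clique.
average_age = average_by([(segment, vertex) for vertex, segment in links], age)

# Aggregate from segmentation: average size of the cliques each person belongs to.
size = Counter(segment for _, segment in links)
average_clique_size = average_by(links, size)
# Adam is only in c1 (size 3); Eve and Bob are in c1 and c2 (sizes 3 and 2).
```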

Segmentations can be opened on the right-hand side by clicking them and choosing "Open" in the menu. They can be visualized the usual way. The links are displayed when both the base project and its segmentation are visualized. This works when both sides are visualized as bucketed graphs, when they are visualized as concrete vertices, or even when one side is bucketed and the other is concrete. This can be used to gain unique insights about the structure of relationships in the graph.

Segmentations act much like projects, and you can even import existing projects to act as segmentations. (In this case it is possible that the links will represent a relationship other than membership.) Segmentations, however, do not have their own operation history. Their history is part of the base project’s history. This also affects the undo button.

Graph visualizations

You can create graph visualizations by adding the operation Graph visualization to your workflows or by clicking on the "Visualize" button in the State popups.

There are multiple types of graph visualizations, but in every case you see some objects connected by some arcs. You can choose to open the Concrete vertices view or the Bucketed view.

Visualized objects can represent vertices or groups of vertices of the graph. In the same way, arcs on the screen might represent multiple edges in the graph. For example, if there are multiple parallel edges A → B, they will still be represented by a single visualized arc. Also, when we display groups of vertices, a single arc going from one group to another represents all the edges in the graph going from one group to the other.

You can visualize graph attributes in various ways, see details in section Concrete visualization options.

Regardless of the visualization mode you can do the same basic adjustments on the visualization screen:

Zooming in/out

Use your mouse wheel or scroll gesture to zoom in and out. Left double-click and right double-click can also be used for this.

Panning

Hold down your left mouse button anywhere on the visualization screen and drag the graph around.

Zooming objects in/out

Hold down the Shift button while zooming in and out to only change the size of objects (vertices, edges).

Concrete vertices view

Shows some selected center vertices and their neighborhood with all the edges among these vertices. The set of the center vertices and the size of the neighborhood can be selected by the user.

The first line shows the "Visualization settings":

Display

The first button lets you select between 2D and 3D visualization. 3D allows for showing more vertices efficiently, but that mode has fewer features. You cannot (yet) visualize attributes in 3D mode and cannot select and move around vertices.

Layout animation

(Only in 2D mode) If the second button is enabled, layout animation will continuously do a physical simulation on the displayed graph as if edges were springs. You can move vertices around and the graph will reorganize itself.

Label attraction

When animation is enabled, this will make vertices with the same label attract each other, which results in same label vertices being grouped together.

Layout style

When animation is enabled, this option determines the exact physics of the simulation. The different options can be useful depending on the structure of the network that is visualized.

The available options are:

Expanded

Try to expand the graph as much as possible.

Centralized

High-degree nodes in the center, low-degree nodes on the periphery.

Decentralized

Low-degree nodes in the center, high-degree nodes on the periphery.

Neutral

Degree is not factored into the layout.

Centers

Lists "center" vertex IDs, that is the vertices whose neighborhood we are displaying. You can change this list manually, using the Pick button.

Radius

You can set the neighborhood radius from 0 to 10. 0 means center vertices only. 1 means center vertices and their immediate neighbors. 2 also contains neighbors of neighbors. And so on.
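The meaning of the radius can be sketched as a breadth-first search that stops after the given number of steps. This is an illustration only. It assumes that edge direction is ignored when collecting neighbors, which this guide does not state, and the function name is hypothetical.

```python
from collections import deque


def neighborhood(edges, centers, radius):
    """Return all vertices within `radius` steps of any center vertex."""
    neighbors = {}
    for a, b in edges:
        # Assumption: edges are followed in both directions.
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    distance = {center: 0 for center in centers}
    queue = deque(centers)
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == radius:
            continue  # do not expand beyond the radius
        for other in neighbors.get(vertex, ()):
            if other not in distance:
                distance[other] = distance[vertex] + 1
                queue.append(other)
    return set(distance)


edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
print(neighborhood(edges, [1], 0))  # {1}
print(neighborhood(edges, [1], 2))  # {1, 2, 3}
```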

Pick button

This button is used to select a new set of centers. The vertices placed there will be ones that satisfy all the currently set restrictions (see below). The available options are:

Center count

The number of centers to be picked. (Default: 1)

Restrictions narrow down the potential set of candidates that will be chosen when you click on the Pick button. They have the same syntax as filters. (See Filter by attributes.) There are two ways to specify them:

Use project attribute filters

(Default.) Use the currently set vertex attribute filters as restrictions.

Use custom restrictions

Manually enter restrictions. When switching to this mode, the project filters are automatically copied into the custom restriction list, which can be edited then.

After picking one set of centers with the Pick button the button is replaced by the Next button. Clicking this button will iterate over samples that match the conditions. The samples will show up in a deterministic order. You can skip to an arbitrary sample by clicking on the button. There you can manually enter a position in the sequence and pick it by clicking on Pick by offset.

Concrete visualization options

Vertex visualizations

Label

Shows the value of the attribute as a label on the displayed vertices.

Color

Colors vertices based on this attribute. A different color will be selected for each value of the attribute. If the attribute is numeric, the selected color will be a continuous function of the attribute value. This is available for String and Double attributes.

Opacity

Changes the opacity of vertices based on this attribute. The higher the value of the attribute the more opaque the vertex will get.

Icon

Displays each vertex by an icon based on the value of this attribute. The available icons are "circle", "square", "hexagon", "female", "male", "person", "phone", "home", "triangle", "pentagon", "star", "sim", "radio". If the value of the attribute is one of the above strings, then the corresponding icon will be selected. For other values we select arbitrary icons. When we run out of icons, we fall back to circle. This is only available for String attributes.

Image

Interprets the value of the attribute as an image URL and displays the referenced image in place of the vertex. This can be used e.g. to show Facebook profile pictures.

Size

The size of vertices will be set based on this attribute. Only available for numeric attributes.

Position

Available on attributes of type (Double, Double). The attribute will be interpreted as (X, Y) coordinates on the plane and vertices will be laid out on the screen based on these coordinates. (You can create a (Double, Double) from two Double attributes using the Convert vertex attributes to position operation.)

Geo coordinates

Available on attributes of type (Double, Double). The attribute will be interpreted as a latitude-longitude coordinate and the vertices will be put on a world map based on this coordinate. (You can create a (Double, Double) attribute from two Double attributes using the Convert vertex attributes to position operation.)

Slider

Available for Double attributes. Adds an interactive slider to the visualization. As you move the slider from the minimum to the maximum value of the attribute, the vertices change their color. Vertices below the selected value get the first color, vertices above the selected value get the second color.

You can choose the color scheme to use. If you choose a color scheme where vertices can become transparent, the edges of the transparent vertices will also disappear. This is a great option for visualizing the evolution of a graph over time.

Edge visualizations

Edge label

Will show the value of the attribute as a label on each edge.

Edge color

Will color edges based on this attribute. A different color will be selected for each value of the attribute. If the attribute is numeric, the selected color will be a continuous function of the attribute value. Coloring is available for String and Double attributes.

Width

The width of each edge will be set based on this attribute. Only available for numeric attributes.

Color maps

When an attribute is visualized as Vertex color, Label color, or Edge color, you can also choose a color map in the same menu. LynxKite offers a wide choice of sequential and divergent color maps. Divergent color maps will have their neutral color assigned to zero values, while sequential color maps simply span from the minimal value to the maximal.

Lightness is an important property of color maps. A good color map is as linear as possible in lightness charts. For more discussion see Matplotlib’s Choosing Colormaps article.

Lightness charts for the available color maps:

Sequential colormaps
Divergent colormaps

Bucketed view

Shows a consolidated view of all the vertices of the graph. Vertices can be grouped by up to two attributes and the system visualizes the sizes of the groups and the number of edges going among the groups.

To add a vertex attribute to the visualization, click the attribute and pick "Visualize as" X or Y.

For String attributes, the created buckets will correspond to the possible values of the attribute. If the attribute has more possible values than the number of buckets selected by the user, then the program will show buckets for the most frequent values and create an extra Other bucket for the rest.

For Double attributes buckets will correspond to intervals. We split the interval [min, max] (where min and max are the minimum and maximum values of the attribute respectively) into subintervals of the same length. E.g. we might end up with buckets [0, 10), [10, 20), [20, 30].

If logarithmic mode is selected for the attribute then the subintervals are selected so that they have the same length on the logarithmic scale. E.g. a possible bucketing is [1, 2), [2, 4), [4, 8]. In logarithmic mode, if the attribute has any non-positive values, then an extra bucket will be created which will contain all non-positive values.
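The two bucketing schemes can be sketched as follows. The functions are hypothetical helpers that compute the bucket boundaries and look up the bucket of a value. The extra bucket for non-positive values in logarithmic mode is left out of this sketch.

```python
def linear_bounds(minimum, maximum, count):
    """Boundaries of `count` buckets of equal length between minimum and maximum."""
    width = (maximum - minimum) / count
    return [minimum + i * width for i in range(count + 1)]


def logarithmic_bounds(minimum, maximum, count):
    """Boundaries of equal length on the logarithmic scale. Requires minimum > 0."""
    ratio = (maximum / minimum) ** (1 / count)
    return [minimum * ratio ** i for i in range(count + 1)]


def bucket_index(value, bounds):
    """Buckets are half-open, except the last one, which includes the maximum."""
    for i in range(len(bounds) - 2):
        if value < bounds[i + 1]:
            return i
    return len(bounds) - 2


bounds = linear_bounds(0, 30, 3)          # 0, 10, 20, 30
print(bucket_index(10, bounds))           # 1, because 10 belongs to [10, 20)
print(bucket_index(30, bounds))           # 2, because the last bucket is [20, 30]
log_bounds = logarithmic_bounds(1, 8, 3)  # approximately 1, 2, 4, 8
```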

Edge attributes can also be added to the visualization to be used for calculating the width of the aggregate edges.

By default the visualization has 4×4 buckets, but this can be adjusted in the visualization settings list.

Relative edge density

Bucketed view by default comes in absolute edge density mode. Absolute edge density means the thickness of an edge going from bucket A to bucket B corresponds to the number of edges going from a vertex in bucket A to a vertex in bucket B (or in the weighted case: to the sum of the weights on such edges). This makes the edges going between large buckets typically much thicker than those going between smaller buckets.

Relative edge density, on the other hand, is calculated by dividing the number of edges between bucket A and bucket B by [size of bucket A] × [size of bucket B]. This way, the individual bucket sizes are not reflected in the thickness of the edges.
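The difference between the two modes can be shown with a small unweighted example in Python. The data and the function name edge_densities are invented for this illustration.

```python
from collections import Counter


def edge_densities(bucket_of, edges):
    """Return absolute and relative edge densities for each (source, target) bucket pair."""
    sizes = Counter(bucket_of.values())
    absolute = Counter((bucket_of[a], bucket_of[b]) for a, b in edges)
    relative = {
        (src, dst): count / (sizes[src] * sizes[dst])
        for (src, dst), count in absolute.items()
    }
    return absolute, relative


# Bucket A has four vertices, bucket B has two.
bucket_of = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}
edges = [(1, 5), (2, 5), (3, 6), (5, 6)]
absolute, relative = edge_densities(bucket_of, edges)
print(absolute[("A", "B")], relative[("A", "B")])  # 3 0.375
print(absolute[("B", "B")], relative[("B", "B")])  # 1 0.25
```

In absolute terms the A to B arc is three times as thick as the B to B arc. In relative terms the difference shrinks to 0.375 versus 0.25, because bucket A is larger.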

Precise and approximate counts

For very large graphs the bucketed view numbers are extrapolated from a sample. Precise calculation would not produce a visible change in the visualization, so most often it is not necessary. It can be desirable however if the numbers from the visualization are to be used in a report.

Click the "approximate counts" option to switch it to "precise counts".

Color customization

A color customization panel is accessible in visualizations. Click on the white tab on the left to access the panel.

The panel allows you to copy the visualized data to the clipboard and customize the color settings. You can invert the colors, and increase or decrease brightness, contrast, and saturation. For geographic visualizations the same settings can be applied separately to the map background.

Ray tracing

LynxKite has an optional feature for generating ray traced graph visualizations. These visualizations can give simple graphs a more striking look in presentations and marketing materials.

To enable ray tracing the administrator has to install POV-Ray and the graphray Python package found in the tools directory of the LynxKite installation.

Open a graph visualization and click to get a relatively quick draft render. If you are satisfied with the layout, click "Render in high quality" to get the final render. Right-click the final image to save it locally.

Ray tracing supports the following visualization features:

  • Vertex colors.

  • Vertex sizes.

  • Highlighting of center vertex.

  • Vertex shapes are translated to simpler 3D shapes.

  • The relative layout and scaling will be reproduced exactly. Only the camera positioning is different.

The rendered image is generated to match the width and height of the popup. Make the popup smaller for faster render times, or larger for higher resolution. The generated picture has a transparent background.

LynxKite internals

Prefixed paths

LynxKite provides read and write access to distributed file systems for the purpose of importing and exporting project data. To make this access secure and convenient, paths are specified relative to prefixes.

Prefixes are configured during LynxKite deployment through the prefix_definitions.txt file.

For example, let’s say we want to import a file on Amazon S3. The file is in bucket my-company, at data/file.csv. The full Hadoop path to this file would be:

s3n://<key id>:<secret key>@my-company/data/file.csv

During deployment, the COMPANY_S3 prefix has been configured:

COMPANY_S3="s3n://<key id>:<secret key>@my-company/"

In this case the file can be referenced for the import operation as:

COMPANY_S3$/data/file.csv

This scheme has a number of benefits:

  • The user has to type less.

  • The credentials can remain secret from all users.

  • The credentials can be changed at a single location and it will be applied to all file operations.

  • The root directory can be relocated without affecting users.
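The prefix scheme can be pictured as a simple lookup followed by string concatenation. The resolver below is a hypothetical sketch, not LynxKite's implementation. In particular, dropping the leading slash of the relative part to avoid a double slash is an assumption of this example.

```python
# Mirrors the prefix definition from the example above, placeholders included.
PREFIXES = {"COMPANY_S3": "s3n://<key id>:<secret key>@my-company/"}


def resolve(prefixed_path, prefixes=PREFIXES):
    """Turn PREFIX$relative/path into a full path using the configured prefixes."""
    name, separator, relative = prefixed_path.partition("$")
    if not separator or name not in prefixes:
        raise ValueError("Unknown or missing prefix in: " + prefixed_path)
    # Assumption: a leading slash in the relative part is ignored.
    return prefixes[name] + relative.lstrip("/")


print(resolve("COMPANY_S3$/data/file.csv"))
# s3n://<key id>:<secret key>@my-company/data/file.csv
```

Because users only ever type the prefixed form, changing the entry in the lookup table is enough to rotate credentials or relocate the root directory.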

User authentication & access control

User authentication is an optional feature and can be turned on or off depending on the deployment. If user authentication is enabled LynxKite data can only be accessed after authentication. The logout link is only displayed at the bottom right if authentication is enabled.

Access control lists

Access rights are controlled at two levels: the folder level and the file prefix level. The latter is only relevant to administrators and is described in the Admin Manual; the former is described below.

A folder has two access control lists: one for reading and one for writing. A user has read access to a folder if they are on its access control list and have read access to the parent folders recursively. Similarly, write access requires being on the write access control list plus read access for all parents. Being on the write access control list implies being on the read access list on every local level.

  • The users with read access to a folder can view its contents.

  • The users with write access to a folder can create, delete and rename workspaces, snapshots and subfolders, see every workspace and snapshot, and perform any changes (including modifying the writing list). Note that renaming requires write access on both the original and the target folder if those two are different. Similarly, copying (duplicating) a workspace or a folder requires write access to the target folder.

The access control lists can be modified in the folder settings. The lists are comma-delimited and * (asterisk) can be used as a wildcard. * means all logged in users. *@lynxanalytics.com, for example, means all users with user names matching that pattern.
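The rules above can be modeled with a short Python sketch. The folder table, the user names and all function names are invented for this illustration, and administrator users are not modeled. It is not the actual access control code.

```python
import re

# Hypothetical folders with comma-delimited access control lists.
FOLDERS = {
    "Users": {"read": "*", "write": ""},
    "Users/eve@lynxanalytics.com": {
        "read": "*@lynxanalytics.com",
        "write": "eve@lynxanalytics.com",
    },
}


def matches(user, acl):
    """True if the user name matches any pattern of the list; * is a wildcard."""
    for pattern in (p.strip() for p in acl.split(",")):
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if pattern and re.fullmatch(regex, user):
            return True
    return False


def parents(path):
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def can_read(user, path):
    acl = FOLDERS[path]
    # Being on the write list implies being on the read list.
    on_list = matches(user, acl["read"]) or matches(user, acl["write"])
    return on_list and all(can_read(user, p) for p in parents(path))


def can_write(user, path):
    # Write access needs the write list plus read access to every parent.
    return matches(user, FOLDERS[path]["write"]) and all(
        can_read(user, p) for p in parents(path)
    )


home = "Users/eve@lynxanalytics.com"
print(can_read("bob@lynxanalytics.com", home))   # True
print(can_write("bob@lynxanalytics.com", home))  # False
print(can_read("mallory@example.com", home))     # False
```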

When creating a folder, you have the choice of setting it to private, publicly readable or publicly writable. These options provide different default access control lists, but the lists can be freely modified later.

If a user has no read access to a folder, it will not show up for them in the folder list.

If a user has read-only access to a folder, they can always create copies of the workspaces and make changes to the copies.

To protect your workspaces from other users you have to put them in a folder writable only by you.

Administrator users

Administrator users have special privileges:

  • Administrators can read and write all folders, regardless of the access control lists. They can also change these access control lists.

  • Administrators can create new users, including new administrators. The users are managed through the /users page.

Home folder

A home folder is created for every user automatically. This folder has read and write access only by that user by default.

Database connections

LynxKite can connect to databases via JDBC. JDBC is a widely adopted database connection interface and all major databases support it.

Installation

To be able to connect to a database LynxKite requires the JDBC drivers for the database to be installed. LynxKite comes with the JDBC drivers for MySQL, PostgreSQL, and SQLite pre-installed. For accessing other databases you will need to acquire the driver from the vendor. The driver is a jar file. You have to add the full path of the jar file to KITE_EXTRA_JARS in .kiterc and restart LynxKite.

Usage

The database for import/export operations is specified via a connection URL. The driver is responsible for interpreting the connection URL. Please consult the documentation for the JDBC driver for the connection URL syntax.

If you are in a controlled network environment, make sure that the LynxKite application and all the Spark executors are allowed to connect to the database server.

SQL syntax

SQL is a rich language for expressing database queries. A simple example of such a query is:

select last_age + (2018 - last_update_year) as age_in_2018 from input
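To see what this query computes, it can be tried on a tiny table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in SQL engine. LynxKite itself evaluates queries with Apache Spark SQL, and the table contents here are invented.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "create table input (name text, last_age real, last_update_year integer)"
)
connection.executemany(
    "insert into input values (?, ?, ?)",
    [("Adam", 20.0, 2015), ("Eve", 18.0, 2018)],
)

query = "select last_age + (2018 - last_update_year) as age_in_2018 from input"
ages = [row[0] for row in connection.execute(query)]
# Adam was 20 in 2015, so the query yields 23.0 for him. Eve stays 18.0.
connection.close()
```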

For a concise description of the query syntax see the Databricks documentation for SELECT queries.

SQL also comes with a variety of built-in functions. See the list of built-in functions in the Apache Spark SQL documentation.

LynxKite adds the following built-in functions:

geodistance(lat1, lon1, lat2, lon2)

Computes the geographic distance between two points defined by their GPS coordinates.

hash(string, salt)

Computes a cryptographic hash of string. See Hash vertex attribute.

most_common(column)

Returns the most common value for a string column.

string_intersect(set1, set2)

For two sets of strings (as returned by collect_set()) returns the common subset.
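As an illustration of what a function like geodistance computes, here is a sketch of the haversine great-circle distance in Python. This guide does not specify the formula or the unit used by geodistance, so the haversine formula, the result in meters and the Earth radius below are all assumptions of this example.

```python
import math

EARTH_RADIUS_METERS = 6371000.0  # mean Earth radius; an assumption of this sketch


def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_METERS * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
print(haversine_distance(0.0, 0.0, 0.0, 1.0))
```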

Operations

Each box in a workspace represents a LynxKite operation. There are operations for adding new attributes (such as Compute PageRank), changing the graph structure (such as Reverse edge direction), importing and exporting data, and for creating Segmentations.

The operation toolbox

There are several ways to add a box to the workspace. If you know its name, typing the slash key (/) will bring up the search menu, where operations can be found by name. The same menu can also be accessed via the magnifier icon.

In case you do not know the name of the operation, functional groups called "categories" will help you find what you need. These categories are listed below, along with their toolbox icon.

Once you have found the operation, drag it to the workspace with the mouse to create a box for it. As you drag, you can touch its inputs to other boxes to set up its connections with one motion. (Or you can add the connections later. See Boxes and arrows.)

Alternatively, you can press Enter on the operation to add its box at the current mouse position. This allows you to search for and add multiple operations in quick succession.

Categories

Import operations

These operations import external data to LynxKite. Example: Import CSV.

Build graph

These operations can build graphs - without importing data to LynxKite. Example: Create example graph.

Subgraph

These operations create subgraphs - a graph formed from a subset of the vertices and edges of the original graph. Example: Filter by attributes.

Build segmentation

These operations create Segmentations. Example: Find connected components.

Use segmentation

These operations modify Segmentations. Example: Copy edges to base project.

Structure

The operations in this category can change the overall graph structure by adding or discarding vertices and/or edges. Examples: Add reversed edges, and Merge vertices by attribute.

Scalars

The operations in this category manipulate global graph attributes (aka scalars). For example, Correlate two attributes computes the Pearson-correlation coefficient of two attributes, and stores the result in a scalar.

Vertex attributes

These operations manipulate (create, discard, convert etc.) vertex attributes. These operations perform their task without looking at other edges or vertices and they are not available if the graph has no vertices. Example: Add constant vertex attribute.

Edge attributes

These operations are similar to vertex attribute operations, but they manipulate edge attributes. They are not available if the graph has no edges. Example: Add random edge attribute.

Attribute propagation

These operations compute vertex attributes from attributes of their neighboring elements. They only differ in how we define "neighboring elements". For example, in operation Aggregate to segmentation, these neighboring elements are all the vertices that belong to the same segment (the segment being the vertex whose attribute this operation computes). Another example is Aggregate edge attribute to vertices; in this case the "neighboring elements" are the edges that leave or enter the vertex. Yet another example is Aggregate on neighbors; the "neighboring elements" here are the other vertices connected to the vertex.

Graph computation

Graph computation operations are similar to the vertex (or edge) attribute operations inasmuch as they compute new attributes for each vertex (or edge). However, they are somewhat more complex, since they are not restricted to that single vertex (or edge) in their computation. For example, Compute degree creates a vertex attribute that depends on how many neighbors a given vertex has, so it depends on the neighborhood of the vertex. A more complex example is Compute PageRank, which is not even restricted to the immediate neighborhood of a vertex: it depends on the entire graph. One might say that this category is about metrics that describe the graph structure in some way.

Machine learning operations

These operations perform machine learning. A machine learning model is trained on a set of data, and it can perform prediction or classification on a new set of data. For example, a logistic regression model can be trained by the operation Train a logistic regression model and it can classify new data with the operation Classify with model.

Workflow

Utility features to efficiently manage workflows. Examples: Users can add a Comment or create a Project union.

Manage project

Utility features to manage and personalize projects by manipulating (discarding, copying, renaming, etc.) attributes, scalars and segmentations. Example: Rename edge attributes.

Visualization operations

Visualization features. Examples: users can create charts with Custom plot, or visualize a subset of the graph with Graph visualization.

Export operations

These operations export data from LynxKite. Example: Export to CSV.

Custom boxes

Users can add previously created custom boxes or Built-ins to their workflow by selecting them from the Custom box menu.

Experimental operations

LynxKite includes cutting-edge algorithms that are under active scientific research. Most of these algorithms are already ready for production use on large datasets. But some of the most recent algorithms are not yet able to handle very large datasets efficiently. Their implementation is subject to future change.

They are marked with the following line:

Warning! Experimental operation.

These experimental operations are included in LynxKite as a preview. Feedback on them is very much appreciated. If you find them useful, let the development team know, so we can prioritize them for improved scalability.

The list of operations

Add constant edge attribute

Adds an attribute with a fixed value to every edge.

Example use case

Add a constant edge attribute with value 'A' to the graph in project A. Then add a constant edge attribute with value 'B' to the graph in project B. Use the same attribute name in both cases. From then on, if a union graph is created from these two graphs, the edge attribute will tell which graph the edge originally belonged to.

Parameters

Attribute name

The new attribute will be created under this name.

Value

The attribute value. Should be a number if Type is set to Double.

Type

The operation can create either Double (numeric) or String typed attributes.

Add constant vertex attribute

Adds an attribute with a fixed value to every vertex.

Example use case

Add a constant vertex attribute with value 'A' to the graph in project A. Then add a constant vertex attribute with value 'B' to the graph in project B. Use the same attribute name in both cases. From then on, if a union graph is created from these two graphs, the vertex attribute will tell which graph the vertex originally belonged to.

Parameters

Attribute name

The new attribute will be created under this name.

Value

The attribute value. Should be a number if Type is set to Double.

Type

The operation can create either Double (numeric) or String typed attributes.

Add popularity x similarity optimized edges

Creates a graph with a given number of vertices and average degrees. The edges will follow a power-law (also known as scale-free) distribution and have high clustering. Vertices get two attributes called "radial" and "angular" that can later be used for edge strength evaluation or link prediction. The algorithm is based on paper 1 and paper 2.

The edges are generated by simulating hyperbolic growth. Vertices are added one by one and at the time of each addition new edges are created in two ways. First, the new vertex is added and it creates edges from itself to older vertices ("external" edges). Then some new edges are added between older vertices ("internal" edges). This way the average number of edges added per vertex will be slightly more than externalDegree + internalDegree.

External degree

The number of edges a vertex creates from itself upon addition to the growing graph.

Internal degree

The average number of edges created between older vertices whenever a new vertex is added to the growing graph.

Exponent

The exponent of the power-law degree distribution. Values can be 0.5 - 1, endpoints excluded.

Seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.

Add random edge attribute

Generates a new random Double attribute with the specified distribution, which can be either (1) a Standard Normal (i.e., Gaussian) distribution with a mean of 0 and a standard deviation of 1, or (2) a Standard Uniform distribution where values fall between 0 and 1.

Attribute name

The new attribute will be created under this name.

Distribution

The desired random distribution.

Seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.
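The effect of the Distribution and Seed parameters can be illustrated with NumPy. This is a sketch of the general idea, not LynxKite's generator: the function random_attribute is hypothetical, and the actual values LynxKite produces for a given seed will differ.

```python
import numpy as np

def random_attribute(size, distribution, seed):
    """Return `size` random doubles drawn from the named distribution."""
    rng = np.random.default_rng(seed)
    if distribution == "Standard Normal":
        return rng.standard_normal(size)  # mean 0, standard deviation 1
    if distribution == "Standard Uniform":
        return rng.random(size)  # values in [0, 1)
    raise ValueError(f"unknown distribution: {distribution}")

first = random_attribute(1000, "Standard Uniform", seed=42)
rerun = random_attribute(1000, "Standard Uniform", seed=42)
independent = random_attribute(1000, "Standard Uniform", seed=43)
```

The re-run with the same seed reproduces the first result exactly, while the run with a different seed is an independent sample.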

Add random vertex attribute

Generates a new random Double attribute with the specified distribution, which can be either (1) a Standard Normal (i.e., Gaussian) distribution with a mean of 0 and a standard deviation of 1, or (2) a Standard Uniform distribution where values fall between 0 and 1.

Attribute name

The new attribute will be created under this name.

Distribution

The desired random distribution.

Seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.

Add rank attribute

Creates a new vertex attribute that is the rank of the vertex when ordered by the key attribute. Rank 0 will be the vertex with the highest or lowest key attribute value (depending on the direction of the ordering). String attributes will be ranked alphabetically.

This operation makes it easy to find the top (or bottom) N vertices by an attribute. First, create the ranking attribute. Then filter by this attribute.

Rank attribute name

The new attribute will be created under this name.

Key attribute name

The attribute to rank by.

Order

With ascending ordering rank 0 belongs to the vertex with the minimal key attribute value or the vertex that is at the beginning of the alphabet. With descending ordering rank 0 belongs to the vertex with the maximal key attribute value or the vertex that is at the end of the alphabet.
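The "rank, then filter" recipe for finding the top N vertices can be sketched with pandas. The vertex table and the helper add_rank are hypothetical; the sketch only mirrors the behavior described above (rank 0 goes to the first vertex in the chosen ordering).

```python
import pandas as pd

vs = pd.DataFrame({"name": ["Adam", "Eve", "Bob", "Isolated Joe"],
                   "age": [20.3, 18.2, 50.3, 2.0]})

def add_rank(df, key, ascending=True):
    """Number the rows 0, 1, 2, ... in the order of the key attribute."""
    order = df[key].sort_values(ascending=ascending).index
    rank = pd.Series(list(range(len(order))), index=order, dtype=float)
    return rank.reindex(df.index)

# Descending order: rank 0 belongs to the oldest person.
vs["age_rank"] = add_rank(vs, "age", ascending=False)

# Filtering on the rank attribute yields the top 2 vertices by age.
top_2 = vs[vs["age_rank"] < 2]["name"].tolist()
```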

Add reversed edges

For every A → B edge adds a new B → A edge, copying over the attributes of the original. Thus this operation will double the number of edges in the project.

Using this operation you end up with a graph with symmetric edges: if A → B exists then B → A also exists. This is the closest you can get to an "undirected" graph.

Optionally, a new edge attribute (a 'distinguishing attribute') will be created so that you can tell the original edges from the new edges after the operation. Edges where this attribute is 0 are original edges; edges where this attribute is 1 are new edges.

Distinguishing edge attribute

The name of the distinguishing edge attribute; leave it empty if the attribute should not be created.
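As an illustration of what the operation produces, here is a pandas sketch on a hypothetical edge table with src, dst and weight columns. The function name and column names are assumptions made for the example.

```python
import pandas as pd

es = pd.DataFrame({"src": [0, 1], "dst": [1, 2], "weight": [1.0, 2.0]})

def add_reversed_edges(edges, distinguishing_attr=None):
    """Append a B -> A copy of every A -> B edge, keeping its attributes."""
    original = edges.copy()
    # Swapping the two column names reverses the direction of every edge.
    reversed_edges = edges.rename(columns={"src": "dst", "dst": "src"})
    if distinguishing_attr:
        original[distinguishing_attr] = 0.0        # original edges
        reversed_edges[distinguishing_attr] = 1.0  # newly added edges
    return pd.concat([original, reversed_edges], ignore_index=True)

symmetric = add_reversed_edges(es, "is_reversed")
```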

Aggregate edge attribute globally

Aggregates edge attributes across the entire graph into one scalar for each attribute. For example you could use it to calculate the average call duration across an entire call dataset.

Generated name prefix

Save the aggregated values with this prefix.

The available aggregators are:

  • For Double attributes:

    • average

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • min

    • std_deviation (standard deviation)

    • sum

  • For other attributes:

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

Aggregate edge attribute to vertices

Aggregates an attribute on all the edges going in or out of vertices. For example it can calculate the average duration of calls for each person in a call dataset.

Generated name prefix

Save the aggregated attributes with this prefix.

Aggregate on
  • incoming edges: Aggregate across the edges coming in to each vertex.

  • outgoing edges: Aggregate across the edges going out of each vertex.

  • all edges: Aggregate across all the edges going in or out of each vertex.

The available aggregators are:

  • For Double attributes:

    • average

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • median

    • min

    • most_common

    • set (all the unique values, as a Set attribute)

    • std_deviation (standard deviation)

    • sum

    • vector (all the values, as a Vector attribute)

  • For String attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • majority_100 (the value that 100% agree on, or empty string)

    • majority_50 (the value that 50% agree on, or empty string)

    • most_common

    • set (all the unique values, as a Set attribute)

    • vector (all the values, as a Vector attribute)

  • For other attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • most_common

    • set (all the unique values, as a Set attribute)
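The three Aggregate on settings can be compared on a small example. This pandas sketch uses a hypothetical call table; it assumes that under "all edges" an edge contributes once to its source and once to its destination.

```python
import pandas as pd

# Hypothetical call dataset: one row per call (edge).
es = pd.DataFrame({"src": [0, 0, 1], "dst": [1, 2, 0],
                   "duration": [10.0, 20.0, 60.0]})

# outgoing edges: group by the source vertex.
outgoing_avg = es.groupby("src")["duration"].mean()

# incoming edges: group by the destination vertex.
incoming_avg = es.groupby("dst")["duration"].mean()

# all edges: every edge is seen from both of its endpoints.
both_ends = pd.concat([
    es.rename(columns={"src": "v"})[["v", "duration"]],
    es.rename(columns={"dst": "v"})[["v", "duration"]],
])
all_avg = both_ends.groupby("v")["duration"].mean()
```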

Aggregate from segmentation

Aggregates vertex attributes across all the segments that a vertex in the base project belongs to. For example, it can calculate the average size of cliques a person belongs to.

Generated name prefix

Save the aggregated attributes with this prefix.

The available aggregators are:

  • For Double attributes:

    • average

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • median

    • min

    • most_common

    • set (all the unique values, as a Set attribute)

    • std_deviation (standard deviation)

    • sum

    • vector (all the values, as a Vector attribute)

  • For String attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • majority_100 (the value that 100% agree on, or empty string)

    • majority_50 (the value that 50% agree on, or empty string)

    • most_common

    • set (all the unique values, as a Set attribute)

    • vector (all the values, as a Vector attribute)

  • For other attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • most_common

    • set (all the unique values, as a Set attribute)

Aggregate on neighbors

Aggregates across the vertices that are connected to each vertex. You can use the Aggregate on parameter to define how exactly this aggregation will take place: choosing one of the 'edges' settings can result in a neighboring vertex being taken into account several times (depending on the number of edges between the vertex and its neighboring vertex); whereas choosing one of the 'neighbors' settings will result in each neighboring vertex being taken into account once.

For example, it can calculate the average age of the friends of each person.

Generated name prefix

Save the aggregated attributes with this prefix.

Aggregate on
  • incoming edges: Aggregate across the edges coming in to each vertex.

  • outgoing edges: Aggregate across the edges going out of each vertex.

  • all edges: Aggregate across all the edges going in or out of each vertex.

  • symmetric edges: Aggregate across the 'symmetric' edges for each vertex: this means that if you have n edges going from A to B and k edges going from B to A, then min(n,k) edges will be taken into account for both A and B.

  • in-neighbors: For each vertex A, aggregate across those vertices that have an outgoing edge to A.

  • out-neighbors: For each vertex A, aggregate across those vertices that have an incoming edge from A.

  • all neighbors: For each vertex A, aggregate across those vertices that either have an outgoing edge to or an incoming edge from A.

  • symmetric neighbors: For each vertex A, aggregate across those vertices that have both an outgoing edge to and an incoming edge from A.

The available aggregators are:

  • For Double attributes:

    • average

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • median

    • min

    • most_common

    • set (all the unique values, as a Set attribute)

    • std_deviation (standard deviation)

    • sum

    • vector (all the values, as a Vector attribute)

  • For String attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • majority_100 (the value that 100% agree on, or empty string)

    • majority_50 (the value that 50% agree on, or empty string)

    • most_common

    • set (all the unique values, as a Set attribute)

    • vector (all the values, as a Vector attribute)

  • For other attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • most_common

    • set (all the unique values, as a Set attribute)
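The difference between the 'edges' and 'neighbors' settings shows up as soon as there are parallel edges. The following pandas sketch uses hypothetical data and computes the average age of the friends of vertex 0 both ways.

```python
import pandas as pd

age = pd.Series({0: 20.0, 1: 30.0, 2: 40.0})
# Two parallel edges 1 -> 0 and a single edge 2 -> 0.
es = pd.DataFrame({"src": [1, 1, 2], "dst": [0, 0, 0]})

# incoming edges: a neighbor is counted once per edge.
per_edge = es.assign(neighbor_age=es["src"].map(age))
avg_over_incoming_edges = per_edge.groupby("dst")["neighbor_age"].mean()

# in-neighbors: each distinct neighbor is counted once.
per_neighbor = per_edge.drop_duplicates(["src", "dst"])
avg_over_in_neighbors = per_neighbor.groupby("dst")["neighbor_age"].mean()
```

With the 'edges' setting vertex 1 weighs twice as much as vertex 2, so the average is pulled toward its age. With the 'neighbors' setting both count equally.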

Aggregate to segmentation

Aggregates vertex attributes across all the vertices that belong to a segment. For example, it can calculate the average age of each clique.

The available aggregators are:

  • For Double attributes:

    • average

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • median

    • min

    • most_common

    • set (all the unique values, as a Set attribute)

    • std_deviation (standard deviation)

    • sum

    • vector (all the values, as a Vector attribute)

  • For String attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • majority_100 (the value that 100% agree on, or empty string)

    • majority_50 (the value that 50% agree on, or empty string)

    • most_common

    • set (all the unique values, as a Set attribute)

    • vector (all the values, as a Vector attribute)

  • For other attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • most_common

    • set (all the unique values, as a Set attribute)
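Conceptually, the aggregation follows the belongs-to links between base vertices and segments. A minimal pandas sketch, where the links table and its column names are assumptions made for the example:

```python
import pandas as pd

age = pd.Series({0: 20.0, 1: 30.0, 2: 40.0}, name="age")

# Hypothetical belongs-to links: vertex 1 is a member of both segments.
links = pd.DataFrame({"vertex": [0, 1, 1, 2],
                      "segment": ["s1", "s1", "s2", "s2"]})

# Attach the vertex attribute to each link, then aggregate per segment.
linked = links.join(age, on="vertex")
segment_avg_age = linked.groupby("segment")["age"].mean()
```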

Aggregate vertex attribute globally

Aggregates vertex attributes across the entire graph into one scalar for each attribute. For example you could use it to calculate the average age across an entire dataset of people.

Generated name prefix

Save the aggregated values with this prefix.

The available aggregators are:

  • For Double attributes:

    • average

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • min

    • std_deviation (standard deviation)

    • sum

  • For other attributes:

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

Anchor

This special box represents the workspace itself. There is always exactly one instance of it. It allows you to control workspace-wide settings as parameters on this box. It can also serve to anchor your workspace with a high-level description.

Description

An overall description of the purpose of this workspace.

Parameters

Workspaces containing output boxes can be used as custom boxes in other workspaces. Here you can define what parameters the custom box created from this workspace shall have.

Parameters can also be used as workspace-wide constants. For example if you want to import accounts-2017.csv and transactions-2017.csv, you could create a date parameter with default value set to 2017 and import the files as accounts-$date.csv and transactions-$date.csv. (Make sure to mark these parametric file names as parametric.) This makes it easy to change the date for all imported files at once later.
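The effect of a workspace-wide constant can be mimicked in plain Python with string.Template. This is only an analogy: LynxKite performs the substitution itself, and the helper resolve is hypothetical.

```python
from string import Template

def resolve(parametric_name, parameters):
    """Replace $name placeholders with the workspace parameter values."""
    return Template(parametric_name).substitute(parameters)

parameters = {"date": "2017"}
files = [resolve(name, parameters)
         for name in ["accounts-$date.csv", "transactions-$date.csv"]]

# Changing the single parameter updates every file name at once.
files_next_year = [resolve(name, {"date": "2018"})
                   for name in ["accounts-$date.csv", "transactions-$date.csv"]]
```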

Approximate clustering coefficient

Scalable algorithm to calculate the approximate local clustering coefficient attribute for every vertex. It quantifies how close the vertex’s neighbors are to being a clique. In practice a high (close to 1.0) clustering coefficient means that the neighbors of a vertex are highly interconnected, while 0.0 means there are no edges between the neighbors of the vertex.

Attribute name

The new attribute will be created under this name.

The precision of the algorithm

This algorithm is an approximation. This parameter sets the trade-off between the quality of the approximation and the memory and time consumption of the algorithm.

Approximate embeddedness

Scalable algorithm to calculate the approximate overlap size of vertex neighborhoods along the edges. If an A → B edge has an embeddedness of N, it means A and B have N common neighbors. The approximate embeddedness is undefined for loop edges.

Attribute name

The new attribute will be created under this name.

The precision of the algorithm

This algorithm is an approximation. This parameter sets the trade-off between the quality of the approximation and the memory and time consumption of the algorithm.

Check cliques

Validates that the segments of the segmentation are in fact cliques.

Creates a new invalid_cliques scalar, which lists non-clique segment IDs up to a certain number.

Segment IDs to check

The validation can be restricted to a subset of the segments, resulting in quicker operation.

Edges required in both directions

Whether edges have to exist in both directions between all members of a clique.

Classify with model

Creates classifications from a model and vertex attributes of the graph. For classifications with nominal outputs, an additional probability attribute is created to represent the probability of the corresponding outcome.

Classification vertex attribute name

The new attribute of the classification will be created under this name.

Name and parameters of the model

The model used for the classifications and a mapping from vertex attributes to the model’s features.

Every feature of the model needs to be mapped to a vertex attribute.

Find vertex coloring

Finds a coloring of the vertices of the graph such that no two neighboring vertices have the same color. The colors are represented by numbers. Tries to find a coloring with few colors.

Vertex coloring is used in scheduling problems to distribute resources among parties which simultaneously and asynchronously request them. See https://en.wikipedia.org/wiki/Graph_coloring for details.

Attribute name

The new attribute will be created under this name.
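A simple way to obtain a valid coloring with few colors is a greedy heuristic. The sketch below is one such heuristic and is not necessarily the algorithm LynxKite uses. The function name and the adjacency-set input format are assumptions.

```python
def greedy_coloring(neighbors):
    """Give each vertex the smallest color not used by its colored neighbors."""
    # Visiting high-degree vertices first tends to keep the color count low.
    order = sorted(neighbors, key=lambda v: (-len(neighbors[v]), v))
    color = {}
    for v in order:
        used = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coloring = greedy_coloring(graph)
```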

Combine segmentations

Creates a new segmentation from the selected existing segmentations. Each new segment corresponds to one original segment from each of the original segmentations, and the new segment is the intersection of all the corresponding segments. We keep non-empty resulting segments only. Edges between segmentations are discarded.

If you have segmentations A and B with two segments each, such as:

  • A = { "men", "women" }

  • B = { "people younger than 20", "people older than 20" }

then the combined segmentation will have four segments:

  • { "men younger than 20", "men older than 20", "women younger than 20", "women older than 20" }

New segmentation name

The new segmentation will be saved under this name.

Segmentations

The segmentations to combine. Select two or more.
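The intersection rule can be sketched with plain Python sets. The dictionaries below are hypothetical stand-ins for segmentations, mapping a segment name to the set of its member vertex IDs.

```python
from itertools import product

A = {"men": {1, 2, 3}, "women": {4, 5}}
B = {"younger than 20": {1, 4}, "older than 20": {2, 3, 5}}

def combine(*segmentations):
    """Intersect one segment from each segmentation; keep non-empty results."""
    combined = {}
    for combo in product(*(seg.items() for seg in segmentations)):
        names = tuple(name for name, _ in combo)
        members = set.intersection(*(m for _, m in combo))
        if members:
            combined[names] = members
    return combined

result = combine(A, B)
```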

Comment

Adds a comment to the workspace. As with any box, you can freely place your comment anywhere on the workspace. Adding comments does not have any effect on the computation but can potentially make your workflow easier to understand for others — or even for your future self.

Markdown can be used to present formatted text or embed links and images.

Comment

Markdown text to be displayed in the workspace.

Compare segmentation edges

Compares the edge sets of two segmentations and computes precision and recall. In order to make this work, the edges of both segmentation graphs should be matchable against each other. Therefore, this operation only allows comparing segmentations which were created using the Use base project as segmentation operation from the same project. (More precisely, a one-to-one correspondence is needed between the vertices of both segmentations and the base project.)

You can use this operation for example to evaluate different colocation results against a reference result.

One of the input segmentations is the golden (or reference) graph, against which the other one, the test graph, will be evaluated. The precision and recall values are computed the following way:

numGoldenEdges := number of edges in the golden segmentation graph
numTestEdges := number of edges in the test segmentation graph
numCommonEdges := number of common edges in the two segmentation graphs
precision := numCommonEdges / numTestEdges
recall := numCommonEdges / numGoldenEdges

The results will be created as scalars in the test segmentation. Parallel edges are treated as one edge. Also, for each matching edge an edge attribute is created in both segmentation graphs.

Golden segmentation

Segmentation containing the golden edges.

Test segmentation

Segmentation containing the test edges.
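The precision and recall definitions given for this operation translate directly into code. In this sketch the edges are hypothetical (source, destination) pairs and parallel edges collapse into one, as described above.

```python
def compare_edges(golden_edges, test_edges):
    """Return (precision, recall) of the test edge set against the golden one."""
    golden = set(golden_edges)  # parallel edges are treated as one edge
    test = set(test_edges)
    common = golden & test
    precision = len(common) / len(test)
    recall = len(common) / len(golden)
    return precision, recall

golden = [(1, 2), (2, 3), (3, 4), (3, 4)]
test = [(1, 2), (2, 3), (4, 5), (5, 6)]
precision, recall = compare_edges(golden, test)
```

Here two of the four test edges are correct (precision 0.5) and two of the three distinct golden edges were found (recall 2/3).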

Compute centrality

Calculates an approximation of the centrality for every vertex. Higher centrality means that the vertex is more embedded in the graph. Multiple different centrality measures have been defined in the literature. You can choose the specific centrality measure as a parameter to this operation.

Attribute name

The new attribute will be created under this name.

Maximal diameter to check

The algorithm works by counting the shortest paths up to a certain length in each iteration. This parameter sets the maximal length to check, so it has a strong influence over the run time of the operation.

A setting lower than the actual diameter of the graph can theoretically introduce unbounded error to the results. In typical small world graphs this effect may be acceptable, however.

The centrality algorithm to use
  • The harmonic centrality of the vertex A is the sum of the reciprocals of the lengths of all shortest paths to A.

  • Lin’s centrality of the vertex A is the square of the size of its coreachable set divided by the sum of the lengths of the shortest paths to A.

  • Average distance of the vertex A is the sum of the lengths of the shortest paths to A divided by the size of its coreachable set.

The precision of the algorithm

The centrality algorithm is an approximation. This parameter sets the trade-off between the quality of the approximation and the memory and time consumption of the algorithm. In most cases the default value is good enough. On very large graphs it may help to use a lower number in order to speed up the algorithm or meet memory constraints.

Direction
  • incoming edges: Calculating paths from vertices.

  • outgoing edges: Calculating paths to vertices.

  • all edges: Calculating paths in both directions, effectively on an undirected graph.
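To make the role of the Maximal diameter to check parameter concrete, here is an exact harmonic centrality computed with a breadth-first search that stops at a given path length. LynxKite computes an approximation instead, so this is only a reference sketch. The function name and the in_neighbors input format are assumptions.

```python
from collections import deque

def harmonic_centrality(in_neighbors, max_diameter):
    """Sum 1/d over vertices that reach the target in at most max_diameter steps."""
    result = {}
    for target in in_neighbors:
        dist = {target: 0}
        queue = deque([target])
        while queue:
            v = queue.popleft()
            if dist[v] == max_diameter:
                continue  # do not look at paths longer than the limit
            for u in in_neighbors[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        result[target] = sum(1.0 / d for d in dist.values() if d > 0)
    return result

# Path graph 0 -> 1 -> 2 -> 3; in_neighbors[v] lists vertices with an edge to v.
in_neighbors = {0: [], 1: [0], 2: [1], 3: [2]}
exact = harmonic_centrality(in_neighbors, max_diameter=10)
truncated = harmonic_centrality(in_neighbors, max_diameter=1)
```

With the limit set below the real diameter, vertex 3 only sees its direct predecessor and its centrality is underestimated.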

Compute clustering coefficient

Calculates the local clustering coefficient attribute for every vertex. It quantifies how close the vertex’s neighbors are to being a clique. In practice a high (close to 1.0) clustering coefficient means that the neighbors of a vertex are highly interconnected, while 0.0 means there are no edges between the neighbors of the vertex.

Attribute name

The new attribute will be created under this name.
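A minimal sketch of the local clustering coefficient on an undirected graph follows. It makes two assumptions that may differ from LynxKite: edge directions are ignored, and vertices with fewer than two neighbors get 0.0.

```python
from itertools import combinations

def local_clustering(neighbors):
    """Fraction of neighbor pairs that are themselves connected."""
    result = {}
    for v, nbrs in neighbors.items():
        nbrs = nbrs - {v}  # ignore loop edges
        k = len(nbrs)
        if k < 2:
            result[v] = 0.0  # assumed convention for this sketch
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        result[v] = links / (k * (k - 1) / 2)
    return result

# Triangle (0, 1, 2) plus a pendant vertex 3 attached to vertex 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
cc = local_clustering(graph)
```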

Compute degree

For every vertex, this operation calculates either the number of edges it is connected to or the number of neighboring vertices it is connected to. You can use the Count parameter to control this calculation: choosing one of the 'edges' settings can result in a neighboring vertex being counted several times (depending on the number of edges between the vertex and the neighboring vertex); whereas choosing one of the 'neighbors' settings will result in each neighboring vertex counted once.

Attribute name

The new attribute will be created under this name.

Count
  • incoming edges: Count the edges coming in to each vertex.

  • outgoing edges: Count the edges going out of each vertex.

  • all edges: Count all the edges going in or out of each vertex.

  • symmetric edges: Count the 'symmetric' edges for each vertex: this means that if you have n edges going from A to B and k edges going from B to A, then min(n,k) edges will be taken into account for both A and B.

  • in-neighbors: For each vertex A, count those vertices that have an outgoing edge to A.

  • out-neighbors: For each vertex A, count those vertices that have an incoming edge from A.

  • all neighbors: For each vertex A, count those vertices that either have an outgoing edge to or an incoming edge from A.

  • symmetric neighbors: For each vertex A, count those vertices that have both an outgoing edge to and an incoming edge from A.
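The difference between the Count settings can be checked on a tiny hypothetical edge list with parallel edges:

```python
import pandas as pd

# Two parallel A -> B edges, one B -> A edge and one A -> C edge.
es = pd.DataFrame({"src": ["A", "A", "B", "A"],
                   "dst": ["B", "B", "A", "C"]})

# outgoing edges of A: every edge counts.
out_edges_a = int((es["src"] == "A").sum())

# out-neighbors of A: each distinct destination counts once.
out_neighbors_a = int(es.loc[es["src"] == "A", "dst"].nunique())

# symmetric edges between A and B: min(n, k) as defined above.
n = int(((es["src"] == "A") & (es["dst"] == "B")).sum())
k = int(((es["src"] == "B") & (es["dst"] == "A")).sum())
symmetric_edges_ab = min(n, k)
```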

Compute dispersion

Calculates the extent to which two people’s mutual friends are not themselves well-connected. The dispersion attribute for an A → B edge is the number of pairs of nodes that are both connected to A and B but are not directly connected to each other.

Dispersion ignores edge directions.

It is a useful signal for identifying romantic partnerships — connections with high dispersion — according to Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook.

A normalized dispersion metric is also generated by this operation. It is normalized against the embeddedness of the edge with the formula recommended in the cited article: disp(u,v)^0.61 / (emb(u,v) + 5). It does not necessarily fall in the (0,1) range.

Attribute name

The new edge attribute will be created under this name.
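The following sketch computes dispersion, embeddedness and the normalized dispersion for a single edge. It follows the simplified definition given in this section (pairs of common neighbors that are not directly connected) and treats the graph as undirected. The function name and input format are assumptions.

```python
from itertools import combinations

def dispersion(neighbors, a, b):
    """Return (dispersion, embeddedness, normalized dispersion) of edge a-b."""
    common = (neighbors[a] & neighbors[b]) - {a, b}
    # Count pairs of mutual friends that are not connected to each other.
    disp = sum(1 for s, t in combinations(common, 2) if t not in neighbors[s])
    emb = len(common)
    normalized = disp ** 0.61 / (emb + 5)
    return disp, emb, normalized

# A and B share the friends C, D and E; only C and D know each other.
graph = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A", "B"},
}
disp, emb, normalized = dispersion(graph, "A", "B")
```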

Compute distance via shortest path

Calculates the length of the shortest path from a given set of vertices for every vertex. To use this operation, a set of starting vertices s_j has to be specified, each with a starting distance sd(s_j). Edges represent a unit distance by default, but this can be overridden using an attribute. This operation will compute for each vertex v_i the smallest distance from a starting vertex, also counting the starting distance of the starting vertex: d(v_i) = min_j(sd(s_j) + D(s_j, v_i, I)) where D(x, y, I) is the length of the shortest path between x and y using at most I edges.

For example, vertices can be cities and edges can be flights with a given cost between the cities. Given a set of starting cities, which might as well be only one city, this operation can compute the lowest cost for reaching each city with a given maximum number of flight changes. In addition to that, an optional base cost can be specified for each starting city, which will be counted into each path starting from that city. For example, that could be the price of getting to the given city by train.

If a city can be reached from more than one of the starting cities, then still only one cost value is computed: the one from the starting city where the route has the lowest cost. If a starting city can be reached from another starting city in a cheaper way than the starting cost, then the assigned cost of that city will be the cheaper cost.

Attribute name

The new attribute will be created under this name.

Edge distance attribute

The attribute containing the distances corresponding to edges. (Cost in the above example.)

Negative values are allowed but there must be no loops where the sum of distances is negative.

Starting distance attribute

A numeric attribute that specifies the initial distances of the vertices that we consider already reachable before starting this operation. (In the above example, specify this for the elements of the starting set, and leave this undefined for the rest of the vertices.)

Maximum number of iterations

The maximum number of edges considered for a shortest-distance path.
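The computation described in this section can be sketched as a Bellman-Ford style relaxation that runs for a bounded number of rounds. This is an illustration under stated assumptions, not LynxKite's implementation: the function name, the edge-list format and the city names are hypothetical.

```python
import math

def shortest_distances(edges, starting_distance, max_iterations):
    """Smallest cost to reach each vertex using at most max_iterations edges.

    edges: list of (src, dst, distance) tuples.
    starting_distance: {vertex: starting distance} for the starting set.
    """
    dist = dict(starting_distance)
    for _ in range(max_iterations):
        updated = dict(dist)
        for src, dst, d in edges:
            # Relax against the previous round so paths grow by one edge per round.
            if src in dist and dist[src] + d < updated.get(dst, math.inf):
                updated[dst] = dist[src] + d
        if updated == dist:
            break  # nothing changed, further rounds are pointless
        dist = updated
    return dist

flights = [("Budapest", "Vienna", 50), ("Vienna", "Paris", 100),
           ("Budapest", "Paris", 300), ("London", "Paris", 200)]
# Getting to London by train costs 30 before the first flight.
starts = {"Budapest": 0, "London": 30}

direct_only = shortest_distances(flights, starts, max_iterations=1)
one_change = shortest_distances(flights, starts, max_iterations=2)
```

With direct flights only, Paris is cheapest from London (30 + 200). Allowing one change makes the Budapest, Vienna, Paris route (50 + 100) the cheapest.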

Compute embeddedness

Calculates the overlap size of vertex neighborhoods along the edges. If an A → B edge has an embeddedness of N, it means A and B have N common neighbors.

Attribute name

The new attribute will be created under this name.

Compute hyperbolic edge probability

Adds an edge attribute, hyperbolic edge probability, based on the hyperbolic distances between vertices. It indicates how likely each edge would be to exist if the input graph had been grown by the popularity x similarity model. On a general level it is a metric of edge strength. Probabilities are guaranteed to satisfy 0 <= p <= 1. Vertices must have two Double vertex attributes to be used as radial and angular coordinates.

Radial

The vertex attribute to be used as radial coordinates. Should not contain negative values.

Angular

The vertex attribute to be used as angular coordinates. Values should be between 0 and 2 * Pi.

Compute in Python

Executes custom Python code to define new attributes or scalars.

The following example computes two new vertex attributes (with_title and age_squared), two new edge attributes (score and names), and two new scalars (hello and average_age). (You can try it on the example graph which has the attributes used in this code.)

vs['with_title'] = 'The Honorable ' + vs.name
vs['age_squared'] = vs.age ** 2
es['score'] = es.weight + es.comment.str.len()
es['names'] = 'from ' + vs.name[es.src].values + ' to ' + vs.name[es.dst].values
scalars.hello = scalars.greeting.lower()
scalars.average_age = vs.age.mean()

scalars is a SimpleNamespace that makes it easy to get and set scalars.

vs (for "vertices") and es (for "edges") are both Pandas DataFrames. You can write natural Python code and use the usual APIs and packages to compute new attributes. Pandas and Numpy are already imported as pd and np. es can have src and dst columns which are the indexes of the source and destination vertex for each edge. These can be used to index into vs as in the example.

Assign the new columns to these same DataFrames to output new vertex or edge attributes.

When you write this Python code, the input data may not be available yet. And you may want to keep building on the output of the box without having to wait for the Python code to execute. To make this possible, LynxKite has to know the inputs and outputs of your code without executing it. You can specify them through the Inputs and Outputs parameters. For outputs you must also declare their types.

The currently supported types for outputs are:

  • float to create a Double-typed attribute or scalar.

  • str to create a String-typed attribute or scalar.

In the previous example we would set:

  • Inputs: vs.name, vs.age, es.weight, es.comment, es.src, es.dst, scalars.greeting

  • Outputs: vs.with_title: str, vs.age_squared: float, es.score: float, es.names: str, scalars.hello: str, scalars.average_age: float

Code

The Python code you want to run. See the operation description for details.

Inputs

A comma-separated list of attributes and scalars that your code wants to use. For example, vs.my_attribute, vs.another_attribute, scalars.my_scalar.

Outputs

A comma-separated list of attributes and scalars that your code generates. These must be annotated with the type of the attribute or scalar. For example, vs.my_new_attribute: str, vs.another_new_attribute: float, scalars.my_new_scalar: str.
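To experiment with such code outside LynxKite, the inputs can be mocked with small pandas DataFrames. In the sketch below the contents of vs, es and scalars are hypothetical stand-ins for what LynxKite would supply. The six assignments are the example code from this section, unchanged.

```python
import pandas as pd
from types import SimpleNamespace

# Hypothetical stand-ins for the inputs declared in the Inputs parameter.
vs = pd.DataFrame({"name": ["Adam", "Eve", "Bob"], "age": [20.3, 18.2, 50.3]})
es = pd.DataFrame({
    "src": [0, 1, 2],
    "dst": [1, 0, 0],
    "weight": [1.0, 2.0, 3.0],
    "comment": ["Adam loves Eve", "Eve loves Adam", "Bob envies Adam"],
})
scalars = SimpleNamespace(greeting="Hello world!")

# The code that would go into the Code parameter.
vs['with_title'] = 'The Honorable ' + vs.name
vs['age_squared'] = vs.age ** 2
es['score'] = es.weight + es.comment.str.len()
es['names'] = 'from ' + vs.name[es.src].values + ' to ' + vs.name[es.dst].values
scalars.hello = scalars.greeting.lower()
scalars.average_age = vs.age.mean()
```

Each new column or scalar created here corresponds to one entry of the Outputs parameter, with float for the numeric results and str for the text results.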

Compute inputs

Triggers the computations for all entities associated with its input.

  • For table inputs, it computes the table.

  • For project inputs, it computes the vertices and edges, their attributes, scalars, and the same transitively for all segments plus the segmentation links.

Compute PageRank

Calculates PageRank for every vertex. PageRank is calculated by simulating random walks on the graph. The PageRank of a vertex reflects the likelihood that the walk leads to that vertex.

Let’s imagine a social graph with information flowing along the edges. In this case high PageRank means that the vertex is more likely to be the target of the information.

Similarly, it may be useful to identify information sources in the reversed graph. Simply reverse the edges before running the operation to calculate the reverse PageRank.

Attribute name

The new attribute will be created under this name.

Weight attribute

The edge weights. Edges with greater weight correspond to higher probabilities in the theoretical random walk.

Number of iterations

PageRank is an iterative algorithm. More iterations take more time but can lead to more precise results. As a rule of thumb set the number of iterations to the diameter of the graph, or to the median shortest path.

Damping factor

The probability of continuing the random walk at each step. Higher damping factors lead to longer random walks.

Direction
  • incoming edges: Simulate random walk in the reverse edge direction. Finds the most influential sources.

  • outgoing edges: Simulate random walk in the original edge direction. Finds the most popular destinations.

  • all edges: Simulate random walk in both directions.
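How the weight, iteration count and damping factor interact can be seen in a textbook power-iteration sketch. This is not LynxKite's implementation: the function is hypothetical, it follows the original edge direction, it normalizes ranks to sum to 1, and it spreads the rank of vertices without outgoing edges evenly. All of these choices are assumptions.

```python
def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; edges are (src, dst, weight)."""
    n = len(vertices)
    out_weight = {v: 0.0 for v in vertices}
    for src, _, w in edges:
        out_weight[src] += w
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank sitting on vertices with no outgoing edges is redistributed evenly.
        dangling = sum(rank[v] for v in vertices if out_weight[v] == 0.0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in vertices}
        for src, dst, w in edges:
            # The walk continues along an edge with probability
            # proportional to the edge weight.
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank

vertices = ["A", "B", "C"]
edges = [("A", "C", 1.0), ("B", "C", 1.0), ("C", "A", 1.0)]
rank = pagerank(vertices, edges)
```

Vertex C is the target of two edges and ends up with the highest rank, while B, which nobody points to, only receives the share from random restarts.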

Connect vertices on attribute

Creates edges between vertices that are equal in a chosen attribute. If the source attribute of A equals the destination attribute of B, an A → B edge will be generated.

The two attributes must be of the same data type.

For example, if you connect nodes based on the "name" attribute, then everyone called "John Smith" will be connected to all the other "John Smiths".

Source attribute

An A → B edge is generated when this attribute on A matches the destination attribute on B.

Destination attribute

An A → B edge is generated when the source attribute on A matches this attribute on B.
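The matching rule is essentially a join of the vertex set with itself. A pandas sketch with a hypothetical vertex table follows. Whether a vertex is connected to itself when both attributes are the same is not stated in this section, so the sketch assumes that such loop edges are dropped.

```python
import pandas as pd

vs = pd.DataFrame({"id": [0, 1, 2, 3],
                   "name": ["John Smith", "John Smith", "Jane Doe", "John Smith"]})

def connect_on_attribute(vertices, src_attr, dst_attr):
    """Create an A -> B edge when src_attr on A equals dst_attr on B."""
    left = vertices[["id", src_attr]].rename(
        columns={"id": "src", src_attr: "key"})
    right = vertices[["id", dst_attr]].rename(
        columns={"id": "dst", dst_attr: "key"})
    pairs = left.merge(right, on="key")
    # Assumption for this sketch: a vertex is not connected to itself.
    pairs = pairs[pairs["src"] != pairs["dst"]]
    return pairs[["src", "dst"]].reset_index(drop=True)

new_edges = connect_on_attribute(vs, "name", "name")
```

The three John Smiths become fully connected in both directions, and Jane Doe stays isolated.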

Convert edge attribute to Double

Converts the selected String typed edge attributes to Double (floating point number) type.

The attributes will be converted in-place. If you want to keep the original String attribute as well, make a copy first!

Edge attribute

The attributes to be converted.

Convert edge attribute to String

Converts the selected edge attributes to String type.

The attributes will be converted in-place. If you want to keep the original attributes as well, make a copy first!

Edge attribute

The attributes to be converted.

Convert vertex attribute to Double

Converts the selected String typed vertex attributes to Double (floating point number) type.

The attributes will be converted in-place. If you want to keep the original String attribute as well, make a copy first!

Vertex attribute

The attributes to be converted.

Convert vertex attribute to String

Converts the selected vertex attributes to String type.

The attributes will be converted in-place. If you want to keep the original attributes as well, make a copy first!

Vertex attribute

The attributes to be converted.

Convert vertex attributes to position

Creates an attribute of type (Double, Double) from two Double attributes. The created attribute can be used as an X-Y or latitude-longitude location.

Save as

The new attribute will be created under this name.

X or latitude

The attribute that makes up the first coordinate.

Y or longitude

The attribute that makes up the second coordinate.

Copy edge attribute

Creates a copy of an edge attribute.

Old name

The attribute to copy.

New name

The name of the copy.

Copy edges to base project

Copies the edges from a segmentation to the base project. The copy is performed along the links between the segmentation and the base project. If two segments are connected with some edges, then each edge will be copied to each pair of members of the segments.

The operation will create edges.

After opening this operation from the toolbox, you will be shown the number of edges that will be created.

This operation has a potential to create a very large number of edges. If the predicted number is too high, try to eliminate very large segments or filter the edges of the segmentation before running it!
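To see why the edge count can grow so quickly, here is a minimal sketch of the copying rule under a simplified in-memory model. The names are hypothetical and this is not LynxKite's actual code: each segment edge is expanded to every (member of the source segment, member of the destination segment) pair.

```scala
// Hypothetical in-memory model of "Copy edges to base project".
object CopyEdgesToBaseSketch {
  // members: segment ID -> base vertices belonging to that segment.
  // segmentEdges: (source segment, destination segment) pairs.
  def copyToBase(
      members: Map[Int, Seq[String]],
      segmentEdges: Seq[(Int, Int)]): Seq[(String, String)] =
    for {
      (s1, s2) <- segmentEdges
      a <- members.getOrElse(s1, Seq.empty)
      b <- members.getOrElse(s2, Seq.empty)
    } yield (a, b)

  // The number of edges that would be created: the sum of |s1| * |s2| over all segment edges.
  def predictedEdgeCount(members: Map[Int, Seq[String]], segmentEdges: Seq[(Int, Int)]): Long =
    segmentEdges.map { case (s1, s2) =>
      members.getOrElse(s1, Seq.empty).size.toLong * members.getOrElse(s2, Seq.empty).size
    }.sum
}
```

A single edge between a segment of 2 members and a segment of 3 members already yields 6 base edges; two segments of a million members each would yield 10^12.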

Copy edges to segmentation

Copies the edges from the base project to the segmentation. The copy is performed along the links between the base project and the segmentation. If a base vertex belongs to no segments, its edges will not be found in the result. If a base vertex belongs to multiple segments, its edges will have multiple copies in the result.

Copy scalar from other project

This operation can take a scalar from another project and copy it to the current project.

It can be useful if we trained a machine learning model in one project, and would like to apply this model in another project for predicting undefined attribute values.

Other project’s name

The name of the other project from where we want to copy a scalar.

Name of the scalar in the other project

The name of the scalar in the other project. If it is a simple string, then the scalar with that name has to be in the root of the other project. If it is a .-separated string, then it means a scalar in a segmentation of the other project. The syntax for this case is: seg_1.seg_2.….seg_n.scalar.

Name for the scalar in this project

This will be the name of the copied scalar in this project.

Copy scalar

Creates a copy of a scalar.

Old name

The scalar to copy.

New name

The name of the copy.

Copy segmentation

Creates a copy of a segmentation.

Old name

The segmentation to copy.

New name

The name of the copy.

Copy vertex attribute

Creates a copy of a vertex attribute.

Old name

The attribute to copy.

New name

The name of the copy.

Copy vertex attributes from segmentation

Copies all vertex attributes from the segmentation to the parent.

This operation is only available when each vertex belongs to just one segment. (As in the case of connected components, for example.)

Example use case

You have performed Link project and segmentation by fingerprint. At this point there is a sparse one-to-one connection between the project vertices and the segmentation vertices. You can use Copy vertex attributes from segmentation and Copy vertex attributes to segmentation to copy all attributes from one side to the other.

Parameters

Attribute name prefix

A prefix for the new attribute names. Leave empty for no prefix.

Copy vertex attributes to segmentation

Copies all vertex attributes from the parent to the segmentation.

This operation is available only when each segment contains just one vertex.

Example use case

You have performed Link project and segmentation by fingerprint. At this point there is a sparse one-to-one connection between the project vertices and the segmentation vertices. You can use Copy vertex attributes from segmentation and Copy vertex attributes to segmentation to copy all attributes from one side to the other.

Parameters

Attribute name prefix

A prefix for the new attribute names. Leave empty for no prefix.

Correlate two attributes

Calculates the Pearson correlation coefficient of two attributes. Only vertices where both attributes are defined are considered.

Note that correlation is undefined if at least one of the attributes is a constant.
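For reference, the calculation can be sketched in plain Scala as follows. The object and method names are hypothetical and the attributes are modeled as sequences of Option[Double]. It shows both rules from the description: only vertices with both attributes defined are used, and a constant attribute yields no result.

```scala
// Hypothetical sketch of the Pearson correlation on partially defined attributes.
object PearsonSketch {
  // Returns None when fewer than two usable pairs exist or when either
  // attribute is constant (zero variance), because the coefficient is undefined.
  def correlate(xs: Seq[Option[Double]], ys: Seq[Option[Double]]): Option[Double] = {
    // Keep only the vertices where both attributes are defined.
    val pairs = xs.zip(ys).collect { case (Some(x), Some(y)) => (x, y) }
    if (pairs.size < 2) None
    else {
      val n = pairs.size.toDouble
      val meanX = pairs.map(_._1).sum / n
      val meanY = pairs.map(_._2).sum / n
      val cov = pairs.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val varX = pairs.map { case (x, _) => math.pow(x - meanX, 2) }.sum
      val varY = pairs.map { case (_, y) => math.pow(y - meanY, 2) }.sum
      if (varX == 0.0 || varY == 0.0) None
      else Some(cov / math.sqrt(varX * varY))
    }
  }
}
```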

First attribute

The correlation of these two attributes will be calculated.

Second attribute

The correlation of these two attributes will be calculated.

Create edges from co-occurrence

Connects vertices in the parent project if they co-occur in any segments. Multiple co-occurrences will result in multiple parallel edges. Loop edges are generated for each segment that a vertex belongs to. The attributes of the segment are copied to the edges created from it.

This operation will create edges.

After opening this operation from the toolbox, you will be shown the number of edges that will be created.

Co-occurrence has a potential to create a very large number of edges. If the predicted number is too high, try to eliminate very large segments before running the operation!
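The following sketch illustrates the rule on a small in-memory model. The names are hypothetical, and it assumes that an edge is created for every ordered pair of members of a segment, loops included. The segment name stands in for the segment attributes that are copied to the edges.

```scala
// Hypothetical in-memory model of "Create edges from co-occurrence".
object CoOccurrenceSketch {
  // segments: segment name -> members (base vertex IDs).
  // Returns (source, destination, segment) triples.
  def edges(segments: Map[String, Seq[Int]]): Seq[(Int, Int, String)] =
    for {
      (segment, members) <- segments.toSeq
      a <- members
      b <- members // a == b is allowed: this produces the loop edges.
    } yield (a, b, segment)
}
```

Under this assumption a segment with k members contributes k × k edges, so a single segment of 100,000 members would already produce 10 billion edges.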

Create edges from set overlaps

Connects segments with large enough overlaps.

Example use case

Communities are generated as a set of vertices, with no edges between them. But you may be interested in looking for some structure there, to see which communities are connected to others. You can generate edges between the communities by looking at how many vertices of the base project they have in common.

Parameters

Minimal overlap for connecting two segments

Two segments will be connected if they have at least this many members in common.
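As an illustration, the overlap rule can be sketched like this. The names are hypothetical. The sketch emits one edge per unordered pair of segments, whereas the text above does not say whether LynxKite creates edges in one or both directions.

```scala
// Hypothetical in-memory model of "Create edges from set overlaps".
object SetOverlapSketch {
  // segments: segment name -> set of member vertex IDs.
  // Returns (segment, segment, overlap size) for each sufficiently overlapping pair.
  def overlapEdges(segments: Map[String, Set[Int]], minOverlap: Int): Seq[(String, String, Int)] = {
    val names = segments.keys.toSeq.sorted
    for {
      i <- names.indices
      j <- (i + 1) until names.size
      overlap = (segments(names(i)) intersect segments(names(j))).size
      if overlap >= minOverlap
    } yield (names(i), names(j), overlap)
  }
}
```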

Create example graph

Creates a small test graph with 4 people and 4 edges between them.

The vertices and their attributes are:

name          age   gender  income     location
Adam          20.3  Male    1000       coordinates of New York
Eve           18.2  Female  undefined  coordinates of Budapest
Bob           50.3  Male    2000       coordinates of Singapore
Isolated Joe  2.0   Male    undefined  coordinates of Sydney

The edges and their attributes are:

src   dst   comment          weight
Adam  Eve   Adam loves Eve   1
Eve   Adam  Eve loves Adam   2
Bob   Adam  Bob envies Adam  3
Bob   Eve   Bob loves Eve    4

As silly as this graph is, it is useful for quickly trying a wide range of features.

Create random edges

Creates edges randomly, so that each vertex will have a degree uniformly chosen between 0 and 2 × the provided parameter.

For example, you can create a random graph by first applying the Create vertices operation and then creating the random edges.

Average degree

The degree of a vertex will be chosen uniformly between 0 and 2 × this number. This results in generating number of vertices × average degree edges.

Seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.

Create scale-free random edges

Creates edges randomly so that the resulting graph is scale-free.

This is an iterative algorithm. We start with one edge per vertex and in each iteration the number of edges gets approximately multiplied by Per iteration edge number multiplier.

Number of iterations

Each iteration increases the number of edges by the specified multiplier. A higher number of iterations will result in a more scale-free degree distribution, but also slower performance.

Per iteration edge number multiplier

Each iteration increases the number of edges by the specified multiplier. The edge count starts from the number of vertices, so with N iterations and m as the multiplier you will have about m^N edges per vertex (m^N × the number of vertices in total) by the end.
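The growth can be checked with quick arithmetic. The sketch below assumes the edge count is multiplied by exactly the given multiplier in every iteration (the description says "approximately"), starting from one edge per vertex. The names are hypothetical.

```scala
// Hypothetical estimate of the edge count after the iterations finish.
object ScaleFreeEstimateSketch {
  // Start with one edge per vertex, then multiply by `multiplier` in each iteration.
  def estimatedEdges(vertices: Long, iterations: Int, multiplier: Double): Double =
    vertices * math.pow(multiplier, iterations)
}
```

For 1,000 vertices, 3 iterations, and a multiplier of 2, the estimate is 1,000 × 2 × 2 × 2 = 8,000 edges.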

Create vertices

Creates a new vertex set with no edges. Two attributes are generated: id and ordinal. id is the internal vertex ID, while ordinal is an index for the vertex: it goes from zero to the vertex set size.

Vertex set size

The number of vertices to create.

Custom plot

Creates a plot from the input table. The plot can be defined using the Vegas plotting API in Scala. This API makes it easy to define Vega-Lite plots in code.

Your code has to evaluate to a vegas.Vegas object. For your convenience vegas._ is already imported. An example of a simple plot would be:

Vegas()
  .withData(table)
  .encodeX("name", Nom)
  .encodeY("age", Quant)
  .encodeColor("gender", Nom)
  .mark(Bar)

Vegas() is the entry point to the plotting API. You can provide a title if you like: Vegas("My Favorite Plot").

LynxKite fetches a sample of up to 10,000 rows from your table for the purpose of the plot. This data is made available in the table variable (as Seq[Map[String, Any]]). .withData(table) binds this data to the plot. You can transform the data before plotting if necessary:

val doubled = table.map(row =>
  row.updated("age", row("age").asInstanceOf[Double] * 2))

Vegas()
  .withData(doubled)
  .encodeX("name", Nom)
  .encodeY("age", Quant)

(The goals of this trivial example would be better achieved by other means. But the same approach can be used to build very intelligent graphs.)

.encodeX() and .encodeY() specify which fields of the table to visualize, and how to visualize them. X, Y, and Color are the most basic examples, but there are several more. See the Vega-Lite docs on Encodings for details.

At the simplest, you have to specify the data type of the field: Quantitative (for numbers), Temporal (for dates), Ordinal (for ranking), or Nominal (for categories).

You can also specify details of the axis, such as switching it to logarithmic scale:

Vegas()
  .withData(table)
  .encodeX("age", Quant, scale=Scale(scaleType=ScaleType.Log))

By default each row in the table results in one visual element in the visualization. This is great for scatter plots, where you want to display each row as a dot. But it is not suitable for histograms, where you want each bar to represent the count of rows that fall within a range of values (a bin). This can also be specified as part of the encoding! For example, for a simple histogram by age:

Vegas()
  .withData(table)
  .encodeX("age", Quant, bin=Bin(maxbins=10.0))
  .encodeY(field="*", Quantitative, aggregate=AggOps.Count)
  .mark(Bar)

.mark(Bar) specifies the visual element to use. The default is Circle. Line, Area, and more are available and documented in the Vega-Lite docs on Marks.

For inspiration take a look at the Vega-Lite Example Gallery. Most of these can be easily reproduced in LynxKite. For example Becker’s Barley Trellis Plot can be specified as:

Vegas()
  .withData(table)
  .encodeRow("site", Ordinal)
  .encodeColor("year", Nom)
  .encodeX("yield", Quant,
    aggregate=AggOps.Median, scale=Scale(zero=false))
  .encodeY("variety", Ordinal,
    sortField=Sort("yield", op=AggOps.Median), scale=Scale(bandSize=12))
  .mark(Point)

LynxKite comes with several Built-ins, many of them based on the Custom plot box. You can dive into these custom boxes to see the code used to build them.

For details about the Scala API see the Vegas 0.3.9 DSL specification or review a collection of examples.

Plot code

Scala code for defining the plot.

Define segmentation links from matching attributes

Connect vertices in the base project with segments based on matching attributes.

This operation can be used (among other things) to create connections between two projects once one has been imported as a segmentation of the other. (See Use other project as segmentation.)

Identifying vertex attribute in base project

A vertex will be connected to a segment if the selected vertex attribute of the vertex matches the selected vertex attribute of the segment.

Identifying vertex attribute in the segmentation

A vertex will be connected to a segment if the selected vertex attribute of the vertex matches the selected vertex attribute of the segment.

Derive column

Derives a new column on a table input via an SQL expression. Outputs a table.

Name

The name of the new column.

Value

The SQL expression to define the new column.

Derive edge attribute

Generates a new attribute based on existing attributes. The value expression can be an arbitrary Scala expression, and it can refer to existing attributes on the edge as if they were local variables. It can also refer to attributes of the source and destination vertex of the edge using the format src$attribute and dst$attribute.

For example you can write weight * scala.math.abs(src$age - dst$age) to generate a new attribute that is the weighted age difference of the two endpoints of the edge.

You can also refer to graph attributes (aka scalars) in the Scala expression. For example, assuming that you have a graph attribute age_average, you can use the expression if (src$age < age_average / 2 && dst$age > age_average * 2) 1.0 else 0.0 to identify connections between relatively young and relatively old people.

Back quotes can be used to refer to attribute names that are not valid Scala identifiers.

The Scala expression can only return specific types:

  • Double,

  • String,

  • Int,

  • Long,

  • Vectors combined from the above.

In case you do not want to define the expression for every input, you can return an Option created from the above types. E.g. if (income > 1000) Some(age) else None.

Save as

The new attribute will be created under this name.

Only run on defined attributes
  • true: The new attribute will only be defined on edges for which all the attributes used in the expression are defined.

  • false: The new attribute is defined on all edges. In this case the Scala expression does not pass the attributes using their original types, but wraps them into Options. E.g. if you have an attribute income: Double you would see it as income: Option[Double], making income.getOrElse(0.0) a valid expression.
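The difference between the two settings can be modeled in plain Scala. In the sketch below the attributes are written as explicit function parameters (in LynxKite they are available as local variables, with src$age and dst$age for the endpoints). The function names and the chosen fallback values are hypothetical.

```scala
// Hypothetical stand-ins for a derived edge attribute expression.
object DeriveEdgeAttributeSketch {
  // "Only run on defined attributes" = true: the inputs have their plain types,
  // and the expression is only evaluated where all of them are defined.
  def weightedAgeDiff(weight: Double, srcAge: Double, dstAge: Double): Double =
    weight * scala.math.abs(srcAge - dstAge)

  // "Only run on defined attributes" = false: every input arrives as an Option,
  // so the expression has to decide what to do with undefined values.
  def weightedAgeDiffWithDefaults(
      weight: Option[Double], srcAge: Option[Double], dstAge: Option[Double]): Double =
    weight.getOrElse(1.0) * scala.math.abs(srcAge.getOrElse(0.0) - dstAge.getOrElse(0.0))
}
```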

Value

The Scala expression. You can enter multiple lines in the editor.

Persist result

If enabled, the output attribute will be saved to disk once it is calculated. If disabled, the attribute will be re-computed each time its output is used. Persistence can improve performance at the cost of disk space.

Derive scalar

Generates a new scalar (graph attribute) based on existing scalars. The value expression can be an arbitrary Scala expression, and it can refer to existing scalars as if they were local variables.

For example you could derive a new scalar as something_sum / something_count to get the average of something.

Save as

The new scalar will be created under this name.

Value

The Scala expression. You can enter multiple lines in the editor.

Derive vertex attribute

Generates a new attribute based on existing vertex attributes. The value expression can be an arbitrary Scala expression, and it can refer to existing attributes as if they were local variables.

For example you can write age * 2 to generate a new attribute that is the double of the age attribute. Or you can write if (gender == "Male") "Mr " + name else "Ms " + name for a more complex example.

You can also refer to graph attributes (aka scalars) in the Scala expression. For example, assuming that you have a graph attribute income_average, you can use the expression if (income > income_average) 1.0 else 0.0 to identify people whose income is above average.

Back quotes can be used to refer to attribute names that are not valid Scala identifiers.

The Scala expression can only return specific types:

  • Double,

  • String,

  • Int,

  • Long,

  • Vectors combined from the above.

In case you do not want to define the expression for every input, you can return an Option created from the above types. E.g. if (income > 1000) Some(age) else None.
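The example expressions above behave like the bodies of ordinary Scala functions. The sketch below wraps them in functions with explicit parameters so they can be tried outside LynxKite. The function names are hypothetical, and in LynxKite you would enter only the expression itself.

```scala
// Hypothetical stand-ins for derived vertex attribute expressions.
object DeriveVertexAttributeSketch {
  // String result, defined for every vertex that has gender and name.
  def title(gender: String, name: String): String =
    if (gender == "Male") "Mr " + name else "Ms " + name

  // Option result: the new attribute stays undefined where None is returned.
  def ageIfHighIncome(income: Double, age: Double): Option[Double] =
    if (income > 1000) Some(age) else None
}
```

On the example graph, Bob (income 2000) would get his age as the new attribute, while for Adam (income 1000) the attribute would remain undefined.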

Save as

The new attribute will be created under this name.

Only run on defined attributes
  • true: The new attribute will only be defined on vertices for which all the attributes used in the expression are defined.

  • false: The new attribute is defined on all vertices. In this case the Scala expression does not pass the attributes using their original types, but wraps them into Options. E.g. if you have an attribute income: Double you would see it as income: Option[Double], making income.getOrElse(0.0) a valid expression.

Value

The Scala expression. You can enter multiple lines in the editor.

Persist result

If enabled, the output attribute will be saved to disk once it is calculated. If disabled, the attribute will be re-computed each time its output is used. Persistence can improve performance at the cost of disk space.

Discard edge attributes

Throws away edge attributes.

Name

The attributes to discard.

Discard edges

Throws away all edges. This implies discarding all edge attributes too.

Discard loop edges

Discards edges that connect a vertex to itself.

Discard scalars

Throws away scalar values.

Name

The scalars to discard.

Discard segmentation

Throws away a segmentation.

Name

The segmentation to discard.

Discard vertex attributes

Throws away vertex attributes.

Name

The vertex attributes to discard.

Embed with t-SNE

Embeds high-dimensional data into two dimensions using the scikit-learn implementation of t-SNE.

The name of the embedding

The new attribute will be created under this name.

Vector

The high-dimensional vertex attribute that we want to embed into 2D.

Perplexity

Size of the vertex neighborhood to consider.

Embed vertices

Creates a vertex embedding using the PyTorch Geometric implementation of the node2vec algorithm.

The name of the embedding

The new attribute will be created under this name.

Iterations

Number of training iterations.

Dimensions

The size of each embedding vector.

Walks per node

Number of random walks collected for each vertex.

Walk length

Length of the random walks collected for each vertex.

Context size

The random walks will be cut with a rolling window of this size. This allows reusing the same walk for multiple vertices.

Export to CSV

CSV stands for comma-separated values. It is a common human-readable file format where each record is on a separate line and fields of the record are simply separated with a comma or other delimiter. CSV does not store data types, so all fields become strings when importing from this format.

Path

The distributed file-system path of the output file. It defaults to <auto>, in which case the path is auto generated from the parameters and the type of export (e.g. Export to CSV). This means that the same export operation with the same parameters always generates the same path.

Delimiter

The delimiter separating the fields in each line.

Quote

The character used for quoting strings that contain the delimiter. If the string also contains the quote character, it will be escaped with a backslash (\).

Quote all strings

Quotes all string values if set. Only quotes in the necessary cases otherwise.

Include header

Whether or not to include the header in the CSV file. If the data is exported as multiple CSV files the header will be included in each of them. When such a data set is directly downloaded, the header will appear multiple times in the resulting file.

Escape character

The character used for escaping quotes inside an already quoted value.
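How the delimiter, quote, and escape settings interact can be sketched with a small function. This is an illustration of the rules described above, not the actual CSV writer. The names and defaults are hypothetical, and it assumes that a value is quoted when it contains the delimiter or the quote character.

```scala
// Hypothetical sketch of writing a single CSV field.
object CsvQuoteSketch {
  def field(
      value: String,
      delimiter: Char = ',',
      quote: Char = '"',
      escape: Char = '\\',
      quoteAll: Boolean = false): String = {
    // Assumption: quoting is "necessary" if the value contains the delimiter or the quote.
    val needsQuoting = quoteAll || value.exists(c => c == delimiter || c == quote)
    if (!needsQuoting) value
    else {
      // Put the escape character in front of every quote character inside the value.
      val escaped = value.map(c => if (c == quote) s"$escape$c" else c.toString).mkString
      s"$quote$escaped$quote"
    }
  }
}
```

For example, the value a,b is written as "a,b", while plain is left unquoted unless Quote all strings is set.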

Null value

The string representation of a null value. This is how nulls are going to be written in the CSV file.

Date format

The string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat.

Timestamp format

The string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat.
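To preview what a given pattern produces, you can try it with java.text.SimpleDateFormat directly. The helper below is a hypothetical convenience wrapper. It fixes the time zone to UTC and the locale to Locale.ROOT so that the output does not depend on the machine it runs on.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

// Hypothetical helper for trying out date and timestamp patterns.
object DateFormatSketch {
  def format(pattern: String, epochMillis: Long): String = {
    val fmt = new SimpleDateFormat(pattern, Locale.ROOT)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.format(new Date(epochMillis))
  }
}
```

For instance, the pattern yyyy-MM-dd applied to the epoch start gives 1970-01-01, and yyyy-MM-dd'T'HH:mm:ss applied to one day later gives 1970-01-02T00:00:00.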

Drop leading white space

A flag indicating whether or not leading whitespaces from values being written should be skipped.

Drop trailing white space

A flag indicating whether or not trailing whitespaces from values being written should be skipped.

Version

Version is the version number of the result of the export operation. It is a non-negative integer. LynxKite treats export operations like other operations: it remembers the result (which in this case is the knowledge that the export was successfully done) and won’t repeat the calculation. However, there might be a need to export an already exported table with the same set of parameters (e.g. the exported file is lost). In this case you need to change the version number, so that the parameters are not the same as in the previous export.

Export for download

Set this to "true" if the purpose of this export is file download: in this case LynxKite will repartition the data into one single file, which will be downloaded. The default "no" will result in no such repartition: this performs much better when other, partition-aware tools are used to import the exported data.

Save mode

The following modes can be used: "error if exists", "overwrite", "append", "ignore". In the last case already existing data will not be modified.

Export to Hive

Export a table directly to Apache Hive.

Table

The name of the database table to export to.

Mode

Describes whether LynxKite should expect a table to already exist and how to handle this case.

The table must not exist means the table will be created and it is an error if it already exists.

Drop the table if it already exists means the table will be deleted and re-created if it already exists. Use this mode with great care. This method cannot be used if you specify any fields to partition by, the reason being that the underlying Spark library will delete all other partitions in the table in this case.

Insert into an existing table requires the table to already exist and it will add the exported data at the end of the existing table.

Partition by

The list of column names (if any) which you wish the table to be partitioned by. This cannot be used in conjunction with the "Drop the table if it already exists" mode.

Export to JDBC

JDBC is used to connect to relational databases such as MySQL. See Database connections for setup steps required for connecting to a database.

JDBC URL

The connection URL for the database. This typically includes the username and password. The exact syntax entirely depends on the database type. Please consult the documentation of the database.

Table

The name of the database table to export to.

Mode

Describes whether LynxKite should expect a table to already exist and how to handle this case.

The table must not exist means the table will be created and it is an error if it already exists.

Drop the table if it already exists means the table will be deleted and re-created if it already exists. Use this mode with great care.

Insert into an existing table requires the table to already exist and it will add the exported data at the end of the existing table.

Export to JSON

JSON is a rich human-readable data format. It produces larger files than CSV but can represent data types. Each line of the file stores one record encoded as a JSON object.

Path

The distributed file-system path of the output file. It defaults to <auto>, in which case the path is auto generated from the parameters and the type of export (e.g. Export to CSV). This means that the same export operation with the same parameters always generates the same path.

Version

Version is the version number of the result of the export operation. It is a non-negative integer. LynxKite treats export operations like other operations: it remembers the result (which in this case is the knowledge that the export was successfully done) and won’t repeat the calculation. However, there might be a need to export an already exported table with the same set of parameters (e.g. the exported file is lost). In this case you need to change the version number, so that the parameters are not the same as in the previous export.

Export for download

Set this to "true" if the purpose of this export is file download: in this case LynxKite will repartition the data into one single file, which will be downloaded. The default "no" will result in no such repartition: this performs much better when other, partition-aware tools are used to import the exported data.

Save mode

The following modes can be used: "error if exists", "overwrite", "append", "ignore". In the last case already existing data will not be modified.

Export to ORC

Apache ORC is a columnar data storage format.

Path

The distributed file-system path of the output file. It defaults to <auto>, in which case the path is auto generated from the parameters and the type of export (e.g. Export to CSV). This means that the same export operation with the same parameters always generates the same path.

Version

Version is the version number of the result of the export operation. It is a non-negative integer. LynxKite treats export operations like other operations: it remembers the result (which in this case is the knowledge that the export was successfully done) and won’t repeat the calculation. However, there might be a need to export an already exported table with the same set of parameters (e.g. the exported file is lost). In this case you need to change the version number, so that the parameters are not the same as in the previous export.

Export for download

Set this to "true" if the purpose of this export is file download: in this case LynxKite will repartition the data into one single file, which will be downloaded. The default "no" will result in no such repartition: this performs much better when other, partition-aware tools are used to import the exported data.

Save mode

The following modes can be used: "error if exists", "overwrite", "append", "ignore". In the last case already existing data will not be modified.

Export to Parquet

Apache Parquet is a columnar data storage format.

Path

The distributed file-system path of the output file. It defaults to <auto>, in which case the path is auto generated from the parameters and the type of export (e.g. Export to CSV). This means that the same export operation with the same parameters always generates the same path.

Version

Version is the version number of the result of the export operation. It is a non-negative integer. LynxKite treats export operations like other operations: it remembers the result (which in this case is the knowledge that the export was successfully done) and won’t repeat the calculation. However, there might be a need to export an already exported table with the same set of parameters (e.g. the exported file is lost). In this case you need to change the version number, so that the parameters are not the same as in the previous export.

Export for download

Set this to "true" if the purpose of this export is file download: in this case LynxKite will repartition the data into one single file, which will be downloaded. The default "no" will result in no such repartition: this performs much better when other, partition-aware tools are used to import the exported data.

Save mode

The following modes can be used: "error if exists", "overwrite", "append", "ignore". In the last case already existing data will not be modified.

Expose internal edge ID

Exposes the internal edge ID as an attribute. Useful if you want to identify edges, for example in an exported dataset.

Attribute name

The ID attribute will be saved under this name.

Expose internal vertex ID

Exposes the internal vertex ID as an attribute. This attribute is automatically generated by operations that generate new vertex sets. (In most cases this is already available as attribute ‘id’.) But you can regenerate it with this operation if necessary.

Attribute name

The ID attribute will be saved under this name.

External computation 1

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 10

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 2

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 3

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 4

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 5

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 6

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 7

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 8

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

External computation 9

This box represents computation outside of LynxKite. See the @external decorator in the Python API.

Snapshot prefix

The external computation will save the results as a snapshot. This is the prefix of the name of that snapshot.

Fill edge attributes with constant default values

An attribute may not be defined on every edge. This operation sets a default value for the edges where it was not defined.

The default values for each attribute

The given value will be set for edges where the attribute is not defined. No change for attributes for which the default value is left empty. The default value must be numeric for Double attributes.

Fill vertex attributes with constant default values

An attribute may not be defined on every vertex. This operation sets a default value for the vertices where it was not defined.

The default values for each attribute

The given value will be set for vertices where the attribute is not defined. No change for attributes for which the default value is left empty. The default value must be numeric for Double attributes.

Filter by attributes

Keeps only vertices and edges that match the specified filters.

You can specify filters for multiple attributes at the same time, in which case you will be left with vertices/edges that match all of your filters.

Regardless of the exact filter, whenever you specify a filter for an attribute you always restrict to those edges/vertices where the attribute is defined. E.g. if you have a filter requiring age > 10, then you will only keep vertices where the age attribute is defined and the value of age is more than ten.

The filtering syntax depends on the type of the attribute in most cases.

Match all filter

For every attribute type, * matches all defined values. This is useful for discarding vertices/edges where a specific attribute is undefined.

Comma separated list

This filter is a comma-separated list of values you want to match. It can be used for String, Double, and Long types. For example medium,high would be a String filter to match these two values only, i.e., it would exclude low values. Another example is 19,20,30.

Comparison filters

These filters are available for String, Double, and Long types. You can specify bounds, with the <, >, <=, >= operators; furthermore, = and == are also accepted as operators, providing exact matching. For example >=12.5 will match values no less than 12.5. Another example is <=apple: this matches the word apple itself plus those words that come before apple in a lexicographic ordering.
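The semantics of the comparison filters can be sketched as follows. This is not LynxKite's parser: the names are hypothetical, and the sketch only shows how the operators map to numeric and lexicographic comparisons.

```scala
// Hypothetical sketch of evaluating a comparison filter such as ">=12.5" or "<=apple".
object ComparisonFilterSketch {
  // Longer operators come first so that ">=" is not read as ">" followed by "=12.5".
  private val Comparison = """\s*(<=|>=|==|=|<|>)\s*(.+?)\s*""".r

  // cmp is negative, zero, or positive, as returned by compare/compareTo.
  private def holds(op: String, cmp: Int): Boolean = op match {
    case "<"  => cmp < 0
    case "<=" => cmp <= 0
    case ">"  => cmp > 0
    case ">=" => cmp >= 0
    case _    => cmp == 0 // "=" and "==" both mean exact matching.
  }

  def matchesDouble(filter: String, x: Double): Boolean = filter match {
    case Comparison(op, bound) => holds(op, java.lang.Double.compare(x, bound.toDouble))
    case _ => throw new IllegalArgumentException(s"Not a comparison filter: $filter")
  }

  // Strings are compared lexicographically.
  def matchesString(filter: String, x: String): Boolean = filter match {
    case Comparison(op, bound) => holds(op, x.compareTo(bound))
    case _ => throw new IllegalArgumentException(s"Not a comparison filter: $filter")
  }
}
```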

Interval filters

For String, Double, and Long types you can specify intervals with brackets. Parentheses, ( ), denote an exclusive boundary and square brackets, [ ], denote an inclusive boundary. The lower and upper boundaries can be both inclusive or exclusive, or they can be different. For example, [0,10) will match x if 0 ≤ x < 10. Another example is [2018-03-01,2018-04-22]; this matches those dates that fall between the given dates (inclusively), assuming that the filtered attribute in question is a string representing a date in the given format (YYYY-MM-DD).
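The bracket semantics can be illustrated with a small Python sketch. This is not LynxKite's parser; the function names parse_interval and matches_interval are hypothetical, and the convert parameter is an assumption used here to switch between numeric and lexicographic (string) comparison.

```python
import re

def parse_interval(spec):
    """Parse an interval such as '[0,10)' into (low, high, low_inclusive, high_inclusive)."""
    m = re.fullmatch(r"\s*([\[(])\s*([^,]+?)\s*,\s*([^,]+?)\s*([\])])\s*", spec)
    if m is None:
        raise ValueError(f"not an interval filter: {spec!r}")
    open_bracket, low, high, close_bracket = m.groups()
    return low, high, open_bracket == "[", close_bracket == "]"

def matches_interval(spec, value, convert=float):
    """Return True if value falls inside the interval described by spec."""
    low, high, low_inclusive, high_inclusive = parse_interval(spec)
    low, high, value = convert(low), convert(high), convert(value)
    above = value >= low if low_inclusive else value > low
    below = value <= high if high_inclusive else value < high
    return above and below

# Numeric comparison: 0 is inside [0,10), 10 is not.
print(matches_interval("[0,10)", 0), matches_interval("[0,10)", 10))
# String comparison works for dates written as YYYY-MM-DD.
print(matches_interval("[2018-03-01,2018-04-22]", "2018-04-22", convert=str))
```

String comparison gives the right answer for dates only because the YYYY-MM-DD format sorts lexicographically in the same order as chronologically.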

Regex filters

For String attributes, regex filters can also be applied. The following tips and examples can be useful:

  • regex(xyz) for finding strings that contain xyz.

  • regex(^Abc) for strings that start with Abc.

  • regex(Abc$) for strings that end with Abc.

  • regex((.)\1) for strings with double letters, like abbc.

  • regex(\d) or regex([0-9]) for strings that contain a digit, like a2c.

  • regex(^\d+$) for strings that are valid integer numbers, like 123.

  • regex(A|B) for strings that contain either A or B.

  • Regex is case sensitive.

  • For a more detailed explanation see https://en.wikipedia.org/wiki/Regular_expression
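The patterns listed above can be tried out with Python's re module. This is only a sketch: LynxKite evaluates regular expressions with its own engine, and regex dialects differ in details, but these simple patterns behave the same way in Python. The helper contains_match is a hypothetical name; it uses re.search because the filter looks for a match anywhere in the string.

```python
import re

def contains_match(pattern, text):
    """True if the pattern matches anywhere inside text (case sensitive)."""
    return re.search(pattern, text) is not None

# pattern -> (a string that should match, a string that should not)
examples = {
    r"xyz": ("wxyz!", "xy z"),
    r"^Abc": ("Abcdef", "xAbc"),
    r"Abc$": ("xAbc", "Abcdef"),
    r"(.)\1": ("abbc", "abc"),
    r"\d": ("a2c", "abc"),
    r"^\d+$": ("123", "12a"),
    r"A|B": ("xBy", "xyz"),
}

results = {
    pattern: (contains_match(pattern, yes), contains_match(pattern, no))
    for pattern, (yes, no) in examples.items()
}
for pattern, outcome in results.items():
    print(pattern, outcome)
```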

Pairwise interval filters

For the (Double, Double) type, you can use interval filters to filter the first and second coordinates. List the intervals for the first and second coordinates separated with a comma. Intervals can be specified with brackets, just like for the simple interval filters. For example [0,2), [3,4] will match (x, y) if 0 ≤ x < 2 and 3 ≤ y ≤ 4.

All and exists filters

These filters can be used for attributes whose type is Vector. The filter all(…) will match the Vector only when the internal filter matches all elements of the Vector. You can also use forall as a synonym. For example all(<0) for a Vector[Double] will match when the Vector contains only negative items. (This would include empty Vector values.) The second filter in this category is any(…); this will match the Vector only when the internal filter matches at least one element of the Vector. Synonyms are exists and some. For example any(male) for a Vector[String] will match when the Vector contains at least one male. (This would not include empty vectors, but would include those where all elements are male.)
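The behavior on empty vectors follows the usual convention for universal and existential checks, which Python's built-in all() and any() share. A minimal sketch, with hypothetical helper names:

```python
def matches_all(vector, inner):
    """all(...): every element must satisfy the inner filter."""
    return all(inner(x) for x in vector)

def matches_any(vector, inner):
    """any(...): at least one element must satisfy the inner filter."""
    return any(inner(x) for x in vector)

def is_negative(x):
    return x < 0

def is_male(s):
    return s == "male"

print(matches_all([-2.0, -0.5], is_negative))   # every element is below zero
print(matches_all([], is_negative))             # an empty vector matches all(...)
print(matches_any([], is_male))                 # an empty vector never matches any(...)
print(matches_any(["male", "male"], is_male))   # all-male vectors still match any(...)
```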

Negation filter

Any filter can be prefixed with ! to negate it. For example !medium will exclude medium values. Another typical use case for this is specifying ! (a single exclamation mark character) as the filter for a String attribute. This is interpreted as non-empty, so it will restrict to those vertices/edges where the String attribute is defined and its value is not the empty string. Remember, all filters work on defined values only, so !* will not match any vertices/edges.

Quoted strings

If you need a string filter that contains a character with a special meaning (e.g., >), use double quotes around the string. E.g., >"=apple" matches exactly those strings that are lexicographically greater than the string =apple. All characters but quote (") and backslash (\) retain their verbatim meaning in such a quoted string. The quotation character is used to show the boundaries of the string and the backslash character can be used to provide a verbatim double quote or a backslash in the quoted string. Thus, the filter "=()\"\\" matches =()"\.
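The escaping rule can be sketched as a small unquoting function. This is an illustration of the rule as described, not LynxKite's implementation; unquote is a hypothetical name, and treating a backslash before any other character as "keep that character" is an assumption.

```python
def unquote(quoted):
    """Turn a double-quoted filter string into the verbatim string it stands for."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("expected a double-quoted string")
    out = []
    chars = iter(quoted[1:-1])
    for ch in chars:
        if ch == "\\":
            # A backslash makes the next character verbatim.
            escaped = next(chars, None)
            if escaped is None:
                raise ValueError("dangling backslash")
            out.append(escaped)
        elif ch == '"':
            raise ValueError("unescaped quote inside the string")
        else:
            out.append(ch)
    return "".join(out)

# The example from the text: the filter "=()\"\\" stands for =()"\
print(unquote(r'"=()\"\\"'))
```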

Find connected components

Creates a segment for every connected component of the graph.

Connected components are maximal vertex sets where a path exists between each pair of vertices.

Segmentation name

The new segmentation will be saved under this name.

Edge direction

Ignore directions

The algorithm adds reversed edges before calculating the components.

Require both directions

The algorithm discards non-symmetric edges before calculating the components.
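The difference between the two edge direction settings can be seen in a small union-find sketch. It is a single-machine illustration, not the distributed algorithm LynxKite runs; the function name and the require_both_directions flag are hypothetical.

```python
def connected_components(vertices, edges, require_both_directions=False):
    """Group vertices into components, optionally keeping only symmetric edges."""
    edge_set = set(edges)
    if require_both_directions:
        # Discard edges that have no reversed counterpart.
        usable = [(a, b) for (a, b) in edge_set if (b, a) in edge_set]
    else:
        # Ignoring directions: every edge connects its endpoints both ways.
        usable = list(edge_set)

    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in usable:
        parent[find(a)] = find(b)

    components = {}
    for v in vertices:
        components.setdefault(find(v), set()).add(v)
    return sorted(sorted(c) for c in components.values())

vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 1), (2, 3)]
print(connected_components(vertices, edges))
print(connected_components(vertices, edges, require_both_directions=True))
```

With directions ignored, the one-way edge 2 → 3 is enough to pull vertex 3 into the component of 1 and 2. When both directions are required, vertex 3 ends up in a component of its own.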

Find infocom communities

Creates a segmentation of overlapping communities.

The algorithm finds maximal cliques, then merges them into communities. Two cliques are merged if they sufficiently overlap. More details can be found in Information Communities: The Network Structure of Communication.

It often makes sense to filter out high degree vertices before detecting communities. In a social graph real people are unlikely to maintain thousands of connections. Filtering high degree vertices out is also known to speed up the algorithm significantly.

Name for maximal cliques segmentation

A new segmentation with the maximal cliques will be saved under this name.

Name for communities segmentation

The new segmentation with the infocom communities will be saved under this name.

Edges required in both directions

Whether edges have to exist in both directions between all members of a clique.

If the direction of the edges is not important, set this to false. This will allow placing two vertices into the same clique even if they are only connected in one direction.

Minimum clique size

Cliques smaller than this will not be collected.

This improves the performance of the algorithm, and small cliques are often not a good indicator anyway.

Adjacency threshold for clique overlaps

Clique overlap is a measure of the overlap between two cliques relative to their sizes. It is normalized to [0, 1). This parameter controls when to merge cliques into a community.

A lower threshold results in fewer, larger communities. If the threshold is low enough, a single giant community may emerge. Conversely, increasing the threshold eventually makes the giant community disassemble.

Find maximal cliques

Creates a segmentation of vertices based on the maximal cliques they are members of. A maximal clique is a maximal set of vertices where there is an edge between every pair of vertices. Since one vertex can be part of multiple maximal cliques this segmentation might be overlapping.
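To make the definition concrete, here is a sketch that enumerates maximal cliques with the textbook Bron-Kerbosch recursion. This is a swapped-in, single-machine technique for illustration and is not claimed to be the algorithm LynxKite uses. The parameter names bothdir and min_size are hypothetical stand-ins for the operation's two parameters.

```python
def maximal_cliques(vertices, edges, bothdir=True, min_size=1):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    edge_set = {(a, b) for a, b in edges if a != b}
    neighbors = {v: set() for v in vertices}
    for a, b in edge_set:
        # With bothdir, a pair counts as connected only if both directions exist.
        if not bothdir or (b, a) in edge_set:
            neighbors[a].add(b)
            neighbors[b].add(a)

    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can extend the clique, so it is maximal.
            if len(clique) >= min_size:
                cliques.append(sorted(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & neighbors[v], excluded & neighbors[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(vertices), set())
    return sorted(cliques)

vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (1, 3), (3, 1), (3, 4)]
print(maximal_cliques(vertices, edges, bothdir=True, min_size=2))
print(maximal_cliques(vertices, edges, bothdir=False, min_size=2))
```

Vertex 3 appears in two cliques once the one-way edge 3 → 4 is allowed to count, which shows why the resulting segmentation can overlap.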

Segmentation name

The new segmentation will be saved under this name.

Edges required in both directions

Whether edges have to exist in both directions between all members of a clique.

If the direction of the edges is not important, set this to false. This will allow placing two vertices into the same clique even if they are only connected in one direction.

Minimum clique size

Cliques smaller than this will not be collected.

This improves the performance of the algorithm, and small cliques are often not a good indicator anyway.

Find modular clustering

Tries to find a partitioning of the vertices with high modularity.

Edges that go between vertices in the same segment increase modularity, while edges that go from one segment to the other decrease modularity. The algorithm iteratively merges and splits segments and moves vertices between segments until it cannot find changes that would significantly improve the modularity score.
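As a rough illustration of what a modularity score measures, the sketch below computes the standard Newman modularity of a given partition for a weighted graph with edge directions ignored. The exact formula LynxKite optimizes may differ; this definition, the function name and the sample data are assumptions made for the example.

```python
def modularity(edges, segment_of):
    """Newman modularity of a partition; edges are (a, b, weight) triples."""
    total = sum(w for _, _, w in edges)
    inside = {}   # total weight of edges inside each segment
    degree = {}   # total weighted degree of each segment
    for a, b, w in edges:
        sa, sb = segment_of[a], segment_of[b]
        degree[sa] = degree.get(sa, 0.0) + w
        degree[sb] = degree.get(sb, 0.0) + w
        if sa == sb:
            inside[sa] = inside.get(sa, 0.0) + w
    return sum(
        inside.get(s, 0.0) / total - (d / (2 * total)) ** 2
        for s, d in degree.items()
    )

# Two triangles joined by a single edge, all weights 1.
edges = [(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
         (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0),
         (3, 4, 1.0)]
split = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
single = {v: "x" for v in range(1, 7)}
q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split, q_single)
```

Putting each triangle in its own segment scores higher than putting everything in one segment, because only one edge crosses the boundary.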

Segmentation name

The new segmentation will be saved under this name.

Weight attribute

The attribute to use as edge weights.

Maximum number of iterations to do

After this number of iterations we stop regardless of modularity increment. Use -1 for unlimited.

Minimal modularity increment in an iteration to keep going

If the average modularity increment in the last few iterations goes below this then we stop the algorithm and settle with the clustering found.

Find Steiner tree

Given a directed graph in which each vertex has two associated quantities, the "gain" and the "root cost", and each edge has an associated quantity, the "cost", this operation will yield a forest (a set of trees) that is a subgraph of the given graph. Furthermore, in this subgraph, the sum of the gains minus the sum of the (edge and root) costs approximates the maximal possible value.

The operation will result in four outputs: (1) a new edge attribute, which will specify which edges are part of the optimal solution. Its value will be 1.0 for edges that are part of the optimal forest and not defined otherwise; (2) a new vertex attribute, which will specify which vertices are part of the optimal solution. Its value will be 1.0 for vertices that are part of the optimal forest and not defined otherwise; (3) a new scalar value that contains the net gain, that is, the total sum of the gains minus the total sum of the (edge and root) costs; and (4) a new vertex attribute that will specify the root vertices in the optimal solution: it will be 1.0 for the root vertices and not defined otherwise.

Output edge attribute name

The new edge attribute will be created under this name, to pinpoint the edges in the solution.

Output vertex attribute name

The new vertex attribute will be created under this name, to pinpoint the vertices in the solution.

The profit scalar variable

This new scalar variable will be created under this name.

Output vertex attribute name for the solution root points

The new vertex attribute will be created under this name, to pinpoint the tree roots in the optimal solution.

Cost attribute

The edge attribute specified here will determine the cost for including the given edge in the solution. Negative and undefined values are treated as 0.

Cost for using the point as root

The vertex attribute specified here determines the cost for allowing the given vertex to be a starting point (the root) of a tree in the solution forest. Negative or undefined values mean that the vertex cannot be used as a root point.

Reward for reaching the vertex

This vertex attribute specifies the reward (gain) for including the given vertex in the solution. Negative or undefined values are treated as 0.

Find triangles

Creates a segment for every triangle in the graph. A triangle is defined as 3 pairwise connected vertices, regardless of the direction and number of edges between them. This means that triangles with one or more multiple edges are still only counted once, and the operation does not differentiate between directed and undirected triangles. Since one vertex can be part of multiple triangles this segmentation might be overlapping.
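The counting rules (multiple edges counted once, direction ignored by default) can be sketched as follows. This is a simple single-machine enumeration for illustration, not LynxKite's implementation; find_triangles and its bothdir flag are hypothetical names.

```python
from itertools import combinations

def find_triangles(edges, bothdir=False):
    """Return each triangle once, as a sorted tuple of its three vertices."""
    # A set removes parallel edges; loops cannot be part of a triangle.
    edge_set = {(a, b) for a, b in edges if a != b}
    neighbors = {}
    for a, b in edge_set:
        if bothdir and (b, a) not in edge_set:
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triangles = set()
    for v, adjacent in neighbors.items():
        for a, b in combinations(sorted(adjacent), 2):
            if b in neighbors.get(a, set()):
                triangles.add(tuple(sorted((v, a, b))))
    return sorted(triangles)

# The edge 1 -> 2 appears twice but the triangle (1, 2, 3) is reported once.
edges = [(1, 2), (1, 2), (2, 3), (3, 1), (3, 4), (4, 1)]
print(find_triangles(edges))
print(find_triangles(edges, bothdir=True))
```

Vertices 1 and 3 belong to both triangles, so the segmentation built from this result would overlap.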

Segmentation name

The new segmentation will be saved under this name.

Edges required in both directions

Whether edges have to exist in both directions between all members of a triangle.

If the direction of the edges is not important, set this to false. This will allow placing two vertices into the same triangle even if they are only connected in one direction.

Fingerprint based on attributes

In a graph that has two different String identifier attributes (e.g. Facebook ID and MSISDN) this operation will match the vertices that only have the first attribute defined with the vertices that only have the second attribute defined. For the well-matched vertices the new attributes will be added. (For example if a vertex only had an MSISDN and we found a matching Facebook ID, this will be saved as the Facebook ID of the vertex.)

The matched vertices will not be automatically merged, but this can easily be performed with the Merge vertices by attribute operation on either of the two identifier attributes.

The matches are identified by calculating a similarity score between vertices and picking a matching that ensures a high total similarity score across the matched pairs.

The similarity calculation is based on the network structure: the more alike their neighborhoods are, the more similar two vertices are considered. Vertex attributes are not considered in the calculation.

Parameters

First ID attribute

Two identifying attributes have to be selected.

Second ID attribute

Two identifying attributes have to be selected.

Edge weights

What Double edge attribute to use as edge weight. The edge weights are also considered when calculating the similarity between two vertices.

Minimum overlap

The number of common neighbors two vertices must have to be considered for matching. It must be at least 1. (If two vertices have no common neighbors their similarity would be zero anyway.)

Minimum similarity

The similarity threshold below which two vertices will not be considered a match even if there are no better matches for them. Similarity is normalized to [0, 1].

Fingerprinting algorithm additional parameters

You can use this box to further tweak how the fingerprinting operation works. Consult with a Lynx expert if you think you need this.

Graph visualization

Creates a visualization from the input project. You can use the state view of the project to define the parameters and layout of the visualization. See Graph visualizations for more details.

Grow segmentation

Grows the segmentation along edges of the parent graph.

This operation modifies this segmentation by growing each segment with the neighbors of its elements. For example if vertex A is a member of segment X and edge A → B exists in the original graph then B also becomes a member of X (depending on the value of the direction parameter).

This operation can be used together with Use base project as segmentation to create a segmentation of neighborhoods.
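One growth step can be sketched like this. The direction labels "outgoing", "incoming" and "both" are hypothetical names chosen for the example and may not match the options offered by the Direction parameter.

```python
def grow_segmentation(segments, edges, direction="outgoing"):
    """Add to each segment the neighbors of its current members (one step)."""
    grown = {}
    for name, members in segments.items():
        added = set()
        for src, dst in edges:
            if direction in ("outgoing", "both") and src in members:
                added.add(dst)   # follow A -> B from a member A
            if direction in ("incoming", "both") and dst in members:
                added.add(src)   # follow C -> A back from a member A
        grown[name] = set(members) | added
    return grown

segments = {"X": {"A"}}
edges = [("A", "B"), ("C", "A")]
print(grow_segmentation(segments, edges, "outgoing"))
print(grow_segmentation(segments, edges, "incoming"))
print(grow_segmentation(segments, edges, "both"))
```

Only the neighbors of the original members are added, so each call grows a segment by exactly one hop.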

Direction

Adds the neighbors to the segments using this direction.

Hash vertex attribute

Uses the SHA-256 algorithm to hash an attribute: all values of the attribute get replaced by a seemingly random value. The same original values get replaced by the same new value and different original values get (almost certainly) replaced by different new values.

Treat the salt like a password for the data. Choose a long string that the recipient of the data has no chance of guessing. (Do not use the name of a person or project.)

The salt must begin with the prefix SECRET( and end with ), for example SECRET(qCXoC7l0VYiN8Qp). This is important, because LynxKite will replace such strings with three asterisks when writing log files. Thus, the salt cannot appear in log files. Caveat: Please note that the salt must still be saved to disk as part of the workspace; only the log files are filtered this way.

To illustrate the mechanics of irreversible hashing and the importance of a good salt string, consider the following example. We have a data set of phone calls and we have hashed the phone numbers. Arthur gets access to the hashed data and learns or guesses the salt. Arthur can now apply the same hashing to the phone number of Guinevere as was used on the original data set and look her up in the graph. He can also apply hashing to the phone numbers of all the knights of the round table and see which knight Guinevere has been making calls to.
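The scenario above can be replayed with Python's hashlib. How LynxKite combines the salt with the value is not stated here, so the simple concatenation in hash_value is an assumption, and the phone numbers are made up for the example.

```python
import hashlib

def hash_value(value, salt):
    # Assumption: the value and the salt are concatenated before hashing.
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()

salt = "SECRET(qCXoC7l0VYiN8Qp)"
phone_numbers = {"Guinevere": "555-0101", "Lancelot": "555-0102", "Gawain": "555-0103"}

# The published data set: one call between two hashed phone numbers.
hashed_calls = [(hash_value("555-0101", salt), hash_value("555-0102", salt))]

# Arthur knows the salt, so he can hash every number he is interested in
# and build a reverse lookup table.
lookup = {hash_value(number, salt): name for name, number in phone_numbers.items()}
revealed = [(lookup[a], lookup[b]) for a, b in hashed_calls]
print(revealed)

# Without the right salt the same numbers hash to unrelated values.
wrong_guess = hash_value("555-0101", "SECRET(guess)")
print(wrong_guess in lookup)
```

The hash itself is never reversed. The attack works purely by re-hashing candidate values, which is why the salt has to stay secret and hard to guess.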

Vertex attribute

The attribute(s) which will be hashed.

Salt

The value of the salt.

Import CSV

CSV stands for comma-separated values. It is a common human-readable file format where each record is on a separate line and fields of the record are simply separated with a comma or other delimiter. CSV does not store data types, so all fields become strings when importing from this format.

File

Upload a file by clicking the button or specify a path explicitly. Wildcard (foo/*.csv) and glob (foo/{bar,baz}.csv) patterns are accepted. See Prefixed paths for more details on specifying paths.

Columns in file

The names of all the columns in the file, as a comma-separated list. If empty, the column names will be read from the file. (Use this if the file has a header.)

Delimiter

The delimiter separating the fields in each line.

Quote character

The character used for escaping quoted values where the delimiter can be part of the value.

Escape character

The character used for escaping quotes inside an already quoted value.

Null value

The string representation of a null value in the CSV file. For example if set to undefined, every occurrence of undefined in the CSV file will be converted to a Scala null. By default this is an empty string, so empty strings are converted to nulls upon import.

Date format

The string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat.

Timestamp format

The string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat.

Ignore leading white space

A flag indicating whether or not leading whitespaces from values being read should be skipped.

Ignore trailing white space

A flag indicating whether or not trailing whitespaces from values being read should be skipped.

Comment character

Every line beginning with this character is skipped, if set. For example if the comment character is # the following line is ignored in the CSV file: # This is a comment.

Error handling

What should happen if a line has more or fewer fields than the number of columns?

Fail on any malformed line will cause the import to fail if there is such a line.

Ignore malformed lines will simply omit such lines from the table. In this case an erroneously defined column list can result in an empty table.

Salvage malformed lines: truncate or fill with nulls will still import the problematic lines, dropping some data or inserting undefined values.
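The three behaviors can be compared with a short sketch built on Python's csv module. The import itself is done by Spark, so this is only an illustration; the mode labels "fail", "ignore" and "salvage" are hypothetical shorthand for the three options above.

```python
import csv
import io

def read_rows(text, columns, mode="fail"):
    """Read CSV text, handling lines whose field count differs from len(columns)."""
    rows = []
    for line_number, fields in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(fields) != len(columns):
            if mode == "fail":
                raise ValueError(f"malformed line {line_number}: {fields}")
            if mode == "ignore":
                continue
            # salvage: pad short lines with None, truncate long lines
            fields = (fields + [None] * len(columns))[:len(columns)]
        rows.append(dict(zip(columns, fields)))
    return rows

text = "a,b,c\n1,2\n3,4,5,6\n"
columns = ["x", "y", "z"]
print(read_rows(text, columns, mode="ignore"))
print(read_rows(text, columns, mode="salvage"))
```

If the column list were given as two names instead of three, "ignore" would drop the well-formed first line as well, which is how a wrong column list can lead to an empty table.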

Infer types

Automatically detects data types in the CSV. For example a column full of numbers will become a Double. If disabled, all columns are imported as Strings.

Columns to import

The columns to import. Leave empty to import all columns.

Limit

Number of rows to import at the most. Leave empty to import all rows.

SQL

Spark SQL query to execute before writing the imported data to storage. The input table can be referred to as this in the query. For example: SELECT * FROM this WHERE date = '2019-01-01'

Table GUID

Click this button to actually kick off the import. You can click it again later to repeat the import. (Useful if the source data has changed.)

Import from Hive

Import an Apache Hive table directly to LynxKite.

Hive table

The name of the Hive table to import.

Columns to import

The columns to import. Leave empty to import all columns.

Limit

Number of rows to import at the most. Leave empty to import all rows.

SQL

Spark SQL query to execute before writing the imported data to storage. The input table can be referred to as this in the query. For example: SELECT * FROM this WHERE date = '2019-01-01'

Table GUID

Click this button to actually kick off the import. You can click it again later to repeat the import. (Useful if the source data has changed.)

Import JDBC

JDBC is used to connect to relational databases such as MySQL. See Database connections for setup steps required for connecting to a database.

JDBC URL

The connection URL for the database. This typically includes the username and password. The exact syntax entirely depends on the database type. Please consult the documentation of the database.

Table

The name of the database table to import.

All identifiers have to be properly quoted according to the SQL syntax of the source database.

The following formats may work depending on the type of the source database:

  • TABLE_NAME

  • SCHEMA_NAME.TABLE_NAME

  • (SELECT * FROM TABLE_NAME WHERE <filter condition>) TABLE_ALIAS

    In the last example the filtering query runs on the source database, before the import. It can dramatically reduce the network traffic needed for the import operation and it makes it possible to use data source specific SQL dialects.

Key column

This column is used to partition the SQL query. The range from min(key) to max(key) will be split into a sub-range for each Spark worker, so they can each query a part of the data in parallel.

Pick a column that is uniformly distributed. Numerical identifiers will give the best performance. String (VARCHAR) columns are also supported but only work well if they mostly contain letters of the English alphabet and numbers.

If the partitioning column is left empty, only a fraction of the cluster resources will be used.

The column name has to be properly quoted according to the SQL syntax of the source database.

Number of partitions

LynxKite will perform this many SQL queries in parallel to get the data. Leave at zero to let LynxKite automatically decide. Set a specific value if the database cannot support that many queries.

Partition predicates

This advanced option provides even greater control over the partitioning. It is an alternative option to specifying the key column. Here you can specify a comma-separated list of WHERE clauses, which will be used as the partitions.

For example you could provide AGE < 30, AGE >= 30 AND AGE < 60, AGE >= 60 as the list of predicates. It would result in three partitions, each querying a different piece of the data, as specified.
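The range splitting described for the key column can be written out as explicit predicates of the same kind. The sketch below assumes an integer key and equal-width sub-ranges; the function name is hypothetical and the actual splitting done by LynxKite and Spark may differ in details such as rounding.

```python
def split_key_range(column, lo, hi, parts):
    """Split the integer range lo..hi into WHERE clauses, one per partition."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # Interior cut points; the first and last partitions are left open-ended
    # so that no row is lost at the extremes.
    cuts = [lo + (hi - lo) * i // parts for i in range(1, parts)]
    lowers = [None] + cuts
    uppers = cuts + [None]
    clauses = []
    for low, high in zip(lowers, uppers):
        conditions = []
        if low is not None:
            conditions.append(f"{column} >= {low}")
        if high is not None:
            conditions.append(f"{column} < {high}")
        clauses.append(" AND ".join(conditions) or "TRUE")
    return clauses

for clause in split_key_range("AGE", 0, 90, 3):
    print(clause)
```

If the key values are not uniformly distributed, equal-width sub-ranges hold very different numbers of rows, which is why a uniformly distributed key column is recommended.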

Columns to import

The columns to import. Leave empty to import all columns.

Limit

Number of rows to import at the most. Leave empty to import all rows.

SQL

Spark SQL query to execute before writing the imported data to storage. The input table can be referred to as this in the query. For example: SELECT * FROM this WHERE date = '2019-01-01'

Table GUID

Click this button to actually kick off the import. You can click it again later to repeat the import. (Useful if the source data has changed.)

Import JSON

JSON is a rich human-readable data format. JSON files are larger than CSV files but can represent data types. Each line of the file in this format stores one record encoded as a JSON object.
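The expected layout, one JSON object per line, can be illustrated with Python's json module. The sample records are made up, and the helper name is hypothetical.

```python
import json

def read_json_lines(text):
    """Parse text in which every non-empty line is one JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

sample = '{"name": "Adam", "age": 20.3}\n{"name": "Eve", "age": 18.2}\n'
records = read_json_lines(sample)
for record in records:
    # Unlike CSV, the number keeps its type and does not arrive as a string.
    print(record["name"], type(record["age"]).__name__)
```

A file containing a single JSON array spanning multiple lines does not follow this layout.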

File

Upload a file by clicking the button or specify a path explicitly. Wildcard (foo/*.json) and glob (foo/{bar,baz}.json) patterns are accepted. See Prefixed paths for more details on specifying paths.

Columns to import

The columns to import. Leave empty to import all columns.

Limit

Number of rows to import at the most. Leave empty to import all rows.

SQL

Spark SQL query to execute before writing the imported data to storage. The input table can be referred to as this in the query. For example: SELECT * FROM this WHERE date = '2019-01-01'

Table GUID

Click this button to actually kick off the import. You can click it again later to repeat the import. (Useful if the source data has changed.)

Import Neo4j

Import data from an existing Neo4j database. The connection can be configured through the following variables in the .kiterc file:

  • NEO4J_URI: URI to connect to Neo4j, only bolt protocol is supported. The URI has to follow the bolt://<host>:<port> structure.

  • NEO4J_PASSWORD: Password to connect to Neo4j. You can leave it empty in case no password is required.

  • NEO4J_USER: User used to connect to Neo4j.

In case you want to change the values of the variables, you will have to restart LynxKite for the changes to take effect.

Node Label

The label for the type of node that you want to import from Neo4j. All the nodes with that label will be imported as a table, with each property as a column. You can specify the properties to import using the Columns to import parameter. The id of the node (the id() function of Neo4j) will be automatically included in the import as the special variable id$. Only one of node label or relationship type can be specified.

Relationship type

The type of the relationship that you want to import from Neo4j. The relationship will be imported as a table, with each property as a column. You can specify the properties to import using the Columns to import parameter. If you want to import properties from the source or the destination (target) nodes you can do it by adding the prefix source_ or target_ to the property. The ids of both the source and the destination nodes (the id() function of Neo4j) will be automatically included in the import as the special variables source_id$ and target_id$. Only one of node label or relationship type can be specified.

Number of partitions

LynxKite will perform this many queries in parallel to get the data. Leave at zero to let LynxKite automatically decide. Set a specific value if you want to control the level of parallelism.

Infer types

Automatically tries to cast data types from Neo4j. For example a column full of numbers will become a Double. If disabled, all columns are imported as String. It is recommended to set this to false, as Neo4j types do not integrate very well with Spark (e.g., Date types from Neo4j are not supported).

Columns to import

The columns to import. Leave empty to import all columns.

Limit

Number of rows to import at the most. Leave empty to import all rows.

SQL

Spark SQL query to execute before writing the imported data to storage. The input table can be referred to as this in the query. For example: SELECT * FROM this WHERE date = '2019-01-01'

Table GUID

Click this button to actually kick off the import. You can click it again later to repeat the import. (Useful if the source data has changed.)

Import ORC

Apache ORC is a columnar data storage format.

File

The distributed file-system path of the file. See Prefixed paths for more details on specifying paths.

Columns to import

The columns to import. Leave empty to import all columns.

Limit

Number of rows to import at the most. Leave empty to import all rows.

SQL

Spark SQL query to execute before writing the imported data to storage. The input table can be referred to as this in the query. For example: SELECT * FROM this WHERE date = '2019-01-01'

Table GUID

Click this button to actually kick off the import. You can click it again later to repeat the import. (Useful if the source data has changed.)

Import Parquet

Apache Parquet is a columnar data storage format.

File

The distributed file-system path of the file. See Prefixed paths for more details on specifying paths.

Columns to import

The columns to import. Leave empty to import all columns.

Limit

Number of rows to import at the most. Leave empty to import all rows.

SQL

Spark SQL query to execute before writing the imported data to storage. The input table can be referred to as this in the query. For example: SELECT * FROM this WHERE date = '2019-01-01'

Table GUID

Click this button to actually kick off the import. You can click it again later to repeat the import. (Useful if the source data has changed.)

Import snapshot

Makes a previously saved snapshot accessible from the workspace.

Path

The full path to the snapshot in LynxKite’s virtual filesystem.

Import union of table snapshots

Makes the union of a list of previously saved table snapshots accessible from the workspace as a single table.

The union works as the UNION ALL command in SQL and does not remove duplicates.

Paths

The comma separated set of full paths to the snapshots in LynxKite’s virtual filesystem.

  • Each path has to refer to a table snapshot.

  • The tables have to have the same schema.

  • The output table will union the input tables in the same order as defined here.

Import well-known graph dataset

Gives easy access to graph datasets commonly used for benchmarks.

See the PyTorch Geometric documentation for details about the specific datasets.

Name

Which dataset to import.

Input

This special box represents an input that comes from outside of this workspace. This box will not have a valid output on its own. When this workspace is used as a custom box in another workspace, the custom box will have one input for each input box. When the inputs are connected, those input states will appear on the outputs of the input boxes.

Input boxes without a name are ignored. Each input box must have a different name.

See the section on Custom boxes on how to use this box.

Name

The name of the input, when the workspace is used as a custom box.

Link project and segmentation by fingerprint

Finds the best matching between a project and a segmentation. It considers a base vertex A and a segment B a good "match" if the neighborhood of A (including A) is well connected to the neighborhood of B (including B) according to the current connections between project and segmentation.

The result of this operation is a new edge set between the project and the segmentation, that is a one-to-one matching.

The matches are identified by calculating a similarity score between vertices and picking a matching that ensures a high total similarity score across the matched pairs.

The similarity calculation is based on the network structure: the more alike their neighborhoods are, the more similar two vertices are considered. Vertex attributes are not considered in the calculation.

Example use case

Project M is an MSISDN graph based on call data. Project F is a Facebook graph. A CSV file contains a number of MSISDN → Facebook ID mappings, a many-to-many relationship. Connect the two projects with Use other project as segmentation and Use table as segmentation links, then use the fingerprinting operation to turn the mapping into a high-quality one-to-one relationship.

Parameters

Minimum overlap

The number of common neighbors two vertices must have to be considered for matching. It must be at least 1. (If two vertices have no common neighbors their similarity would be zero anyway.)

Minimum similarity

The similarity threshold below which two vertices will not be considered a match even if there are no better matches for them. Similarity is normalized to [0, 1].

Fingerprinting algorithm additional parameters

You can use this box to further tweak how the fingerprinting operation works. Consult with a Lynx expert if you think you need this.

Lookup region

For every position vertex attribute looks up features in a Shapefile and returns a specified attribute.

  • The lookup depends on the coordinate reference system of the feature. The input position must use the same coordinate reference system as the one specified in the Shapefile.

  • If there are no matching features the output is omitted.

  • If the specified attribute does not exist for any matching feature the output is omitted.

  • If there are multiple suitable features the algorithm picks the first one.

Shapefiles can be obtained from various sources, like OpenStreetMap.

Parameters

Position

The (latitude, longitude) location tuple.

Shapefile

The Shapefile used for the lookup. The list is created from the files in the KITE_META/resources/shapefiles directory. A Shapefile consists of a .shp, .shx and .dbf file of the same name.

Attribute from the Shapefile

The attribute in the Shapefile used for the output.

Ignore unsupported shape types

If set true, silently ignores unknown shape types potentially contained by the Shapefile. Otherwise throws an error.

Output

The name of the new vertex attribute.

Make all segments empty

Throws away all segmentation links.

Map hyperbolic coordinates

Map an undirected graph to a hyperbolic surface. Vertices get two attributes called "radial" and "angular" that can be used for edge strength evaluation or link prediction. Algorithm based on paper.

The coordinates are generated by simulating hyperbolic growth. The algorithm’s results are most useful when the graph to be mapped follows a power-law degree distribution and has high clustering.

Seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.

Merge parallel edges by attribute

Multiple edges going from A to B that share the same value of the given edge attribute will be merged into a single edge. The edges going from A to B are not merged with edges going from B to A.

Merge by

The edge attribute on which the merging will be based.

The available aggregators are:

  • For Double attributes:

    • average

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • median

    • min

    • most_common

    • set (all the unique values, as a Set attribute)

    • std_deviation (standard deviation)

    • sum

    • vector (all the values, as a Vector attribute)

  • For String attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • majority_100 (the value that 100% agree on, or empty string)

    • majority_50 (the value that 50% agree on, or empty string)

    • most_common

    • set (all the unique values, as a Set attribute)

    • vector (all the values, as a Vector attribute)

  • For other attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • most_common

    • set (all the unique values, as a Set attribute)

Merge parallel edges

Multiple edges going from A to B will be merged into a single edge. The edges going from A to B are not merged with edges going from B to A.

Edge attributes can be aggregated across the merged edges.

Example use case

This operation can be used to turn a call data graph into a relationship graph. Multiple calls will be merged into one relationship. To define the strength of this relationship, you can use the count of calls, the total duration, the total cost, or some other aggregate metric.
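As a rough illustration of the call data use case, the following Python sketch merges parallel edges and aggregates with count and sum. The tuple layout and the names `merge_parallel_edges`, `count` and `total_duration` are assumptions made for the example, not LynxKite identifiers.

```python
from collections import defaultdict

def merge_parallel_edges(edges):
    """edges: list of (src, dst, duration). Returns one record per (src, dst)."""
    merged = defaultdict(lambda: {"count": 0, "total_duration": 0.0})
    for src, dst, duration in edges:
        # (A, B) and (B, A) are different keys, so opposite directions stay separate.
        record = merged[(src, dst)]
        record["count"] += 1
        record["total_duration"] += duration
    return dict(merged)

calls = [("A", "B", 60.0), ("A", "B", 30.0), ("B", "A", 10.0)]
relationships = merge_parallel_edges(calls)
```

Here the two A to B calls become one relationship, and the single B to A call remains its own edge.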

Parameters

The available aggregators are:

  • For Double attributes:

    • average

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • median

    • min

    • most_common

    • set (all the unique values, as a Set attribute)

    • std_deviation (standard deviation)

    • sum

    • vector (all the values, as a Vector attribute)

  • For String attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • majority_100 (the value that 100% agree on, or empty string)

    • majority_50 (the value that 50% agree on, or empty string)

    • most_common

    • set (all the unique values, as a Set attribute)

    • vector (all the values, as a Vector attribute)

  • For other attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • most_common

    • set (all the unique values, as a Set attribute)

Merge parallel segmentation links

Multiple segmentation links going from base vertex A to segmentation vertex B will be merged into a single link.

After performing a Merge vertices by attribute operation, there might be multiple parallel links going between some of the base project and segmentation vertices. This can cause unexpected behavior when aggregating to or from the segmentation. This operation addresses this behavior by merging parallel segmentation links.

Merge two edge attributes

An attribute may not be defined on every edge. This operation uses the secondary attribute to fill in the values where the primary attribute is undefined. If both are undefined on an edge then the result is undefined too.
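The fallback rule can be sketched in a few lines of Python. In this sketch an attribute is modeled as a dictionary from edge ID to value, and a missing key stands for "undefined"; this representation is an assumption made for illustration.

```python
def merge_attributes(primary, secondary):
    """Use the primary value where defined, otherwise the secondary value."""
    merged = dict(secondary)   # start from the fallback values
    merged.update(primary)     # primary values win wherever they exist
    return merged

primary = {1: "a"}
secondary = {1: "x", 2: "b"}
merged = merge_attributes(primary, secondary)
```

Edge 1 keeps the primary value, edge 2 falls back to the secondary value, and an edge present in neither dictionary stays undefined.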

New attribute name

The new attribute will be created under this name.

Primary attribute

If this attribute is defined on an edge, then its value will be copied to the output attribute.

Secondary attribute

If the primary attribute is not defined on an edge but the secondary attribute is, then the secondary attribute’s value will be copied to the output attribute.

Merge two vertex attributes

An attribute may not be defined on every vertex. This operation uses the secondary attribute to fill in the values where the primary attribute is undefined. If both are undefined on a vertex then the result is undefined too.

New attribute name

The new attribute will be created under this name.

Primary attribute

If this attribute is defined on a vertex, then its value will be copied to the output attribute.

Secondary attribute

If the primary attribute is not defined on a vertex but the secondary attribute is, then the secondary attribute’s value will be copied to the output attribute.

Merge vertices by attribute

Merges each set of vertices that are equal by the chosen attribute. Vertices where the chosen attribute is not defined are discarded. Aggregations can be specified for how to handle the rest of the attributes, which may be different among the merged vertices. Any edge that connected two vertices that are merged will become a loop.

Merge vertices by attribute might create parallel links between the base project and its segmentations. If it is important that there are no such parallel links (e.g. when performing aggregations to and from segmentations), make sure to run the Merge parallel segmentation links operation on the segmentations in question.

Example use case

You merge phone numbers that have the same IMEI; each vertex then represents one mobile device. You can aggregate one attribute as count to have an attribute that represents the number of phone numbers merged into one vertex.
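The IMEI example can be sketched in Python as follows. The data layout and the function name `merge_vertices` are hypothetical. Dropping edges that touch a discarded vertex is an assumption of this sketch, since the text above only states that such vertices are discarded.

```python
from collections import defaultdict

def merge_vertices(attribute, edges):
    """attribute: vertex -> value (missing key = undefined); edges: (src, dst) pairs."""
    members = defaultdict(list)
    for vertex, value in attribute.items():
        members[value].append(vertex)
    # The "count" aggregator: how many original vertices ended up in each merged vertex.
    counts = {value: len(group) for value, group in members.items()}
    # Merged vertices are identified by the attribute value in this sketch.
    new_edges = [(attribute[s], attribute[d]) for s, d in edges
                 if s in attribute and d in attribute]
    return counts, new_edges

imei = {"p1": "X", "p2": "X", "p3": "Y"}          # p4 has no IMEI
calls = [("p1", "p2"), ("p1", "p3"), ("p4", "p1")]
counts, device_edges = merge_vertices(imei, calls)
```

The call between p1 and p2 becomes a loop on device X, because both phone numbers were merged into the same vertex.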

Parameters

Match by

If a set of vertices have the same value for the selected attribute, they will all be merged into a single vertex.

The available aggregators are:

  • For Double attributes:

    • average

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • first (arbitrarily picks a value)

    • max

    • median

    • min

    • most_common

    • set (all the unique values, as a Set attribute)

    • std_deviation (standard deviation)

    • sum

    • vector (all the values, as a Vector attribute)

  • For String attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • majority_100 (the value that 100% agree on, or empty string)

    • majority_50 (the value that 50% agree on, or empty string)

    • most_common

    • set (all the unique values, as a Set attribute)

    • vector (all the values, as a Vector attribute)

  • For other attributes:

    • count_distinct (the number of distinct values)

    • count_most_common (the number of occurrences of the most common value)

    • count (number of cases where the attribute is defined)

    • most_common

    • set (all the unique values, as a Set attribute)

Output

This special box represents an output that goes outside of this workspace. When this workspace is used as a custom box in another workspace, the custom box will have one output for each output box.

Output boxes without a name are ignored. Each output box must have a different name.

See the section on Custom boxes on how to use this box.

Name

The name of the output, when the workspace is used as a custom box.

Predict attribute by viral modeling

Viral modeling tries to predict unknown values of an attribute based on the known values of the attribute on peers that belong to the same segments.

The parameters make it possible to put restrictions on which segments to consider. For each vertex the segment with the lowest standard deviation will be picked. The prediction will be the average value across this segment.

The operation repeats this procedure multiple times. Each time the predictions from the last iteration are added to the accepted "truth", so it is possible to make predictions for vertices where it was not possible previously. The coverage and error of the predictions are expected to rise with the number of iterations.
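A single iteration of this procedure might look like the Python sketch below. It is a simplification under stated assumptions: it uses the population standard deviation, it applies only the deviation and minimum count restrictions, and the names `viral_iteration`, `max_deviation` and `min_defined` are hypothetical. The actual implementation may differ.

```python
from statistics import mean, pstdev

def viral_iteration(known, segments, max_deviation, min_defined):
    """known: vertex -> value; segments: segment name -> set of member vertices."""
    usable = {}
    for name, members in segments.items():
        values = [known[v] for v in members if v in known]
        if len(values) >= max(1, min_defined) and pstdev(values) <= max_deviation:
            usable[name] = (pstdev(values), mean(values))
    best = {}
    for name, (deviation, average) in usable.items():
        for v in segments[name]:
            if v in known:
                continue
            # Keep the segment with the lowest standard deviation for each vertex.
            if v not in best or deviation < best[v][0]:
                best[v] = (deviation, average)
    return {v: average for v, (deviation, average) in best.items()}

known = {"a": 10.0, "b": 12.0, "c": 50.0}
segments = {"s1": {"a", "b", "x"}, "s2": {"a", "c", "x"}, "s3": {"b", "y"}}
predictions = viral_iteration(known, segments, max_deviation=5.0, min_defined=2)
```

To iterate, the predictions would be merged into `known` before the next call, which is how vertices such as "y" can become predictable in a later round.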

Generated name prefix

All the outputs from the operation will have this prefix.

Target attribute

The attribute you want to predict.

Test set ratio

A test set is a random sample of the vertices. This parameter gives the size of the test set as a fraction of the total vertex count.

The error of the predictions is calculated on the test set. The attribute values in the test set are not used for making predictions.

Random seed for test set selection

Random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.

Maximal segment deviation

Segments where the standard deviation of the attribute value over its members is higher than this parameter will not be used for prediction.

Minimum number of defined attributes in a segment

Segments where the number of vertices upon which the attribute is defined is less than this parameter will not be used for prediction.

Minimal ratio of defined attributes in a segment

Segments where the fraction of vertices upon which the attribute is defined is less than this parameter will not be used for prediction.

Iterations

The number of iterations to perform. Each iteration builds upon the predictions of the previous iteration, so the coverage and error are expected to rise with the number of iterations.

Predict edges with hyperbolic positions

Creates additional edges in a graph based on hyperbolic distances between vertices. Because the new edges are undirected, each predicted edge is added in both directions, so 2 * size edges will be added, where size is the Number of predictions parameter. Vertices must have two Double vertex attributes to be used as radial and angular coordinates.
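For orientation, the distance between two points given by radial and angular coordinates can be computed with the hyperbolic law of cosines. The sketch below assumes curvature -1 and is not taken from the LynxKite source, so the operation's exact formula may differ.

```python
import math

def hyperbolic_distance(r1, theta1, r2, theta2):
    """Distance between (r1, theta1) and (r2, theta2) in the hyperbolic plane."""
    delta = abs(theta1 - theta2) % (2 * math.pi)
    delta = min(delta, 2 * math.pi - delta)   # angular difference in [0, pi]
    cosh_d = (math.cosh(r1) * math.cosh(r2)
              - math.sinh(r1) * math.sinh(r2) * math.cos(delta))
    # Guard against rounding errors pushing the value slightly below 1.
    return math.acosh(max(1.0, cosh_d))

opposite = hyperbolic_distance(1.0, 0.0, 2.0, math.pi)
```

Two points on opposite sides of the origin are separated by the sum of their radial coordinates, which is a quick sanity check for the formula.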

Number of predictions

The number of edges to generate. The total number will be 2 * size because every edge is added in two directions.

External degree

The number of edges a vertex creates from itself upon addition to the growth simulation graph.

Internal degree

The average number of edges created between older vertices whenever a new vertex is added to the growth simulation graph.

Exponent

The exponent of the power-law degree distribution. Values can be 0.5 - 1, endpoints excluded.

Radial

The vertex attribute to be used as radial coordinates. Should not contain negative values.

Angular

The vertex attribute to be used as angular coordinates. Values should be 0 - 2 * Pi.

Predict vertex attribute

If an attribute is defined for some vertices but not for others, machine learning can be used to fill in the blanks. A model is built from the vertices where the attribute is defined and the model predictions are generated for all the vertices.

The prediction is created in a new attribute named after the predicted attribute, such as age_prediction.

This operation only supports Double-typed (numeric) attributes. You can come up with ways to map other types to numbers to include them in the prediction. For example mapping gender to 0.0 and 1.0 makes sense.

It is a common practice to retain a test set which is not used for training the model. The test set can be used to evaluate the accuracy of the model’s predictions. You can do this by deriving a new vertex attribute that is undefined for the test set and using this restricted attribute as the basis of the prediction.
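The idea of a restricted attribute can be sketched in Python. An attribute is modeled as a dictionary where a missing key means "undefined"; the function name `hold_out` and its parameters are hypothetical.

```python
import random

def hold_out(attribute, test_ratio, seed):
    """Return a copy of the attribute that is undefined on a random test set."""
    rng = random.Random(seed)
    vertices = sorted(attribute)
    test_set = set(rng.sample(vertices, round(len(vertices) * test_ratio)))
    restricted = {v: x for v, x in attribute.items() if v not in test_set}
    return restricted, test_set

age = {i: float(20 + i) for i in range(10)}
age_train, test_vertices = hold_out(age, test_ratio=0.3, seed=1)
```

The model would be trained on `age_train`, and its predictions on `test_vertices` could then be compared with the original `age` values.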

Attribute to predict

The partially defined attribute that you want to predict.

Predictors

The attributes that will be used as the input of the predictions. Predictions will be generated for vertices where all of the predictors are defined.

Method

  • Linear regression with no regularization.

  • Ridge regression (also known as Tikhonov regularization) with L2-regularization.

  • Lasso with L1-regularization.

  • Logistic regression for binary classification. (The predicted attribute must be 0 or 1.)

  • Naive Bayes classifier with multinomial event model.

  • Decision tree with maximum depth 5 and 32 bins for all features.

  • Random forest of 20 trees of depth 5 with 32 bins. One third of features are considered for splits at each node.

  • Gradient-boosted trees produce ensembles of decision trees with depth 5 and 32 bins.

Predict with a graph neural network

Trains a neural network using the graph’s vertex attributes and edges. Then uses this trained neural network to make a prediction on the same graph.

Currently the computation is not distributed, so please do not use it on really big graphs. It will be changed in the future.

Other significant changes are also possible in future versions. (The operation might be renamed or split into a separate training and prediction operation, some parameters might be added or removed, etc.)

Attribute to predict

The partially defined attribute that you want to predict. The current implementation only supports attributes between -1 and 1.

Save as

The prediction will be saved as an attribute created under this name.

Predictors

The attributes that will be used as the input of the prediction.

Network layout

There is a small network at every vertex. At first the input for these small networks is the label of the corresponding vertex and the sum of its neighbors' labels. Each vertex computes the output of the small network on this input. After this, the input for the small network will consist of the sum of its neighbors' outputs in the previous round and its own previous output. Here you can set the layout of the small networks.

  • MLP: The small network is simple: it contains only a single hidden layer. The layers can, however, have different weights in different rounds.

  • LSTM or GRU: In these layouts the small network is more complicated. Further information about LSTM and GRU.

Size of the network

The number of nodes in one layer of the neural network.

Iterations in prediction

The number of rounds when the vertices send information to their neighbors.

Hide own state

If it is set to true, then the vertices do not know their own label, but their neighbors can still see it.

Forget fraction

In every training iteration each vertex forgets its label with the probability given here. Neither the vertices themselves, nor their neighbors see the forgotten labels.

Weight for known labels

If the forget fraction is greater than 0, then the errors from the non-forgetting nodes are multiplied by this number. So if it is set to a small number, then the errors from non-forgetting vertices count less than the ones from the forgetting vertices.

Number of trainings

The training is performed on randomly chosen subgraphs. In the first round each node gets a small subgraph and performs a few iterations of training. After this the average of the learned weights is calculated. In the second round each node gets another small subgraph and performs a few iterations of training, starting from the average weights. After this, the average of the learned weights is calculated again, and so on. You can set here the number of these turns.

Iterations in training

The number of iterations in one round of training.

Subgraphs in training

The number of subgraphs chosen in one training round.

Minimum training subgraph size

The minimum size of subgraphs chosen for training.

Maximum training subgraph size

The maximum size of subgraphs chosen for training.

Radius for training subgraphs

If 0, the whole graph is used as one single training subgraph. Otherwise the subgraphs are chosen as follows. We choose a single vertex at random and get all the vertices whose distance from the chosen one is at most this number. If the number of these vertices is less than the minimum given above, then we choose another vertex and get its environment. We repeat this procedure until there are enough chosen vertices. If the number of these vertices is more than the maximum given above, then we drop the last few vertices.
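The selection procedure can be sketched with a breadth-first search. This is a minimal illustration under assumptions: the graph is an adjacency dictionary, and the names `ball` and `training_subgraph` are hypothetical.

```python
import random
from collections import deque

def ball(adjacency, center, radius):
    """Vertices within `radius` hops of `center`, in breadth-first order."""
    distance = {center: 0}
    order = [center]
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if distance[v] == radius:
            continue
        for n in adjacency.get(v, ()):
            if n not in distance:
                distance[n] = distance[v] + 1
                order.append(n)
                queue.append(n)
    return order

def training_subgraph(adjacency, radius, min_size, max_size, seed):
    rng = random.Random(seed)
    vertices = sorted(adjacency)
    if radius == 0:
        return vertices                      # the whole graph is one subgraph
    chosen = []
    while len(chosen) < min_size and len(chosen) < len(vertices):
        center = rng.choice(vertices)        # add another environment
        chosen += [v for v in ball(adjacency, center, radius) if v not in chosen]
    return chosen[:max_size]                 # drop the last few vertices

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
subgraph = training_subgraph(path, radius=1, min_size=3, max_size=4, seed=0)
```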

Seed

Random seed for initializing network weights and choosing subgraphs.

Learning rate

Determines the size of the steps in the gradient descent algorithm.

Predict with GCN

Uses a trained GCN to make predictions.

Save prediction as

The prediction will be saved as an attribute under this name.

Feature vector

Vector attribute containing the features to be used as inputs for the algorithm.

Attribute to predict

The attribute we want to predict. (This is used if the model was trained to use the target labels as additional inputs.)

Model

The model to use for the prediction.

Predict with model

Creates predictions from a model and vertex attributes of the graph.

Prediction vertex attribute name

The new attribute of the predictions will be created under this name.

Name and parameters of the model

The model used for the predictions and a mapping from vertex attributes to the model’s features.

Every feature of the model needs to be mapped to a vertex attribute.

Project rejoin

This operation allows the user to join (i.e., carry over) attributes from one project to another one. This is only allowed when the target of the join (where the attributes are taken to) and the source (where the attributes are taken from) are compatible. Compatibility in this context means that the source and the target have a "common ancestor", which makes it possible to perform the join. Suppose, for example, that the operation Take edges as vertices has been applied, and then some new vertex attributes have been computed on the resulting project. These new vertex attributes can now be joined back to the original project (that was the input for Take edges as vertices), because there is a correspondence between the edges of the original project and the vertices that contain the newly computed vertex attributes.

Conversely, the edges and the vertices of a project will not be compatible (even if the number of edges is the same as the number of vertices), because no such correspondence can be established between the edges and the vertices in this case.

Additionally, it is possible to join segmentations from another project. This operation has an additional requirement (besides compatibility), namely, that both the target of the join (the left side) and the source be vertices (and not edges).

Please bear in mind that joined attributes and segmentations will overwrite the original attributes and segmentations of the target in case there is a name collision.

When vertex attributes are joined, it is also possible to copy over the edges from the source graph (provided that the source graph has edges). In this case, the original edges in the target graph are dropped, and the source edges (along with their attributes) will take their place.

Attributes

Attributes that should be joined to the project. They overwrite attributes in the target project which have identical names.

Segmentations

Segmentations to join to the project. They overwrite segmentations in the target side of the project which have identical names.

Copy edges

When set, the edges of the source project (and their attributes) will replace the edges of the target project.

Project union

The resulting graph is just a disconnected graph containing the vertices and edges of the two originating projects. All vertex and edge attributes are preserved. If an attribute exists in both projects, it must have the same data type in both.

The resulting graph will have as many vertices as the sum of the vertex counts in the two source graphs. The same with the edges.
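A minimal Python sketch of the union shows why the vertex and edge counts add up and why fresh internal IDs are needed. The function name `project_union` and the ID scheme are assumptions made for illustration.

```python
def project_union(vertices_a, edges_a, vertices_b, edges_b):
    """Return a mapping (project, old ID) -> new ID and the combined edge list."""
    new_id = {}
    for v in vertices_a:
        new_id[("a", v)] = len(new_id)
    for v in vertices_b:
        new_id[("b", v)] = len(new_id)
    edges = [(new_id[("a", s)], new_id[("a", d)]) for s, d in edges_a]
    edges += [(new_id[("b", s)], new_id[("b", d)]) for s, d in edges_b]
    return new_id, edges

# Both projects use the old IDs 1 and 2, so the old IDs cannot stay unique.
ids, edges = project_union([1, 2], [(1, 2)], [1, 2, 3], [(2, 3)])
```

No edge connects the two halves, so the result is disconnected until vertices are merged in a later step.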

Segmentations are discarded.

Example use case

You have imported a call data graph in one project and a Facebook graph in another. Some, but not all, vertices have an email address associated with them. You want to merge the two graphs into a single graph that represents connections (either calls or Facebook friendships) between people.

A simple procedure for connecting the two graphs would be the following.

  1. Take the union of the two projects.

  2. Use Merge vertices by attribute to combine the vertices that can be exactly matched based on their email address.

  3. Use Fingerprint based on attributes to identify more matches based on neighborhood similarity.

Parameters

ID attribute name

The internal vertex IDs change after the union. The old ID attributes are preserved, but no longer reflect the internal IDs. The new internal IDs will be exposed through a new attribute. This parameter sets the name of this new attribute.

Pull segmentation one level up

Creates a copy of a segmentation in the parent of its parent segmentation. In the created segmentation, the set of segments will be the same as in the original. A vertex will be made member of a segment if it was transitively member of the corresponding segment in the original segmentation. The attributes, scalars and sub-segmentations of the segmentation are also copied.
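Transitive membership can be sketched as a composition of two link tables. The dictionary representation and the name `pull_up` are assumptions for this example.

```python
def pull_up(parent_links, child_links):
    """parent_links: vertex -> parent segments; child_links: parent segment -> child segments."""
    return {
        vertex: set().union(*(child_links.get(s, set()) for s in parents))
        for vertex, parents in parent_links.items()
    }

parent_links = {"v1": {"p1"}, "v2": {"p1", "p2"}}
child_links = {"p1": {"c1"}, "p2": {"c2"}}
pulled = pull_up(parent_links, child_links)
```

A vertex ends up in a child segment whenever at least one of its parent segments belongs to that child segment.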

Reduce vertex attributes to two dimensions

Transforms multiple Double attributes to a two-dimensional space (two Double attributes) by Principal Component Analysis. A pre-scaling on mean and standard deviation is performed.

Principal Component Analysis (PCA) is used to emphasize variation and bring out strong patterns in a dataset. It often makes data easy to explore and visualize.

First dimension name

The first dimension will be stored as a Double attribute using this name.

Second dimension name

The second dimension will be stored as a Double attribute using this name.

Feature attributes

Attributes to be used as inputs for the dimension reduction.

Rename edge attributes

Changes the name of edge attributes.

The new names for each attribute

If the new name is empty, the attribute will be discarded.

Rename scalar

Changes the name of a scalar.

This operation is more easily accessed from the scalar’s dropdown menu in the project view.

Old name

The scalar to rename.

New name

The new name.

Rename segmentation

Changes the name of a segmentation.

This operation is more easily accessed from the segmentation’s dropdown menu in the project view.

Old name

The segmentation to rename.

New name

The new name.

Rename vertex attributes

Changes the name of vertex attributes.

The new names for each attribute

If the new name is empty, the attribute will be discarded.

Replace edges with triadic closure

For every A → B → C triplet, creates an A → C edge. The original edges are discarded. The new A → C edge gets the attributes of the original A → B and B → C edges with prefixes "ab_" and "bc_".
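The rule can be sketched in Python as follows. Edges are modeled as (source, destination, attributes) tuples; this layout and the function name are assumptions. The sketch also keeps triplets where C equals A, which may or may not match the operation's behavior.

```python
from collections import defaultdict

def triadic_closure(edges):
    """edges: list of (src, dst, attrs). Returns the new A -> C edges only."""
    outgoing = defaultdict(list)
    for src, dst, attrs in edges:
        outgoing[src].append((dst, attrs))
    new_edges = []
    for a, b, ab_attrs in edges:
        for c, bc_attrs in outgoing.get(b, []):
            attrs = {"ab_" + k: v for k, v in ab_attrs.items()}
            attrs.update({"bc_" + k: v for k, v in bc_attrs.items()})
            new_edges.append((a, c, attrs))
    return new_edges

closed = triadic_closure([("A", "B", {"w": 1}), ("B", "C", {"w": 2})])
```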

Be aware that in dense graphs a large number of new edges can be generated.

Possible use case: we are looking for connections between vertices, such as the same subscriber using multiple devices. We have an edge metric that we think is a good indicator, or we have a model that gives predictions for edges. If we want to calculate this metric and pick the edges with high values, it is possible that the edge that would be the winner does not exist. Often we think that a transitive closure would add the missing edge. For example, I don’t call my second phone, but I call a lot of the same people from the two phones.

Replace with edge graph

Creates the edge graph (aka line graph), where each vertex corresponds to an edge in the current graph. The vertices will be connected, if one corresponding edge is the continuation of the other.
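A compact Python sketch of the construction follows. It compares every pair of edges, so it is only meant to illustrate the definition on small inputs; the function name `edge_graph` is hypothetical.

```python
def edge_graph(edges):
    """Vertex i of the result stands for edges[i]; i -> j if edges[j] continues edges[i]."""
    return [
        (i, j)
        for i, (_, dst) in enumerate(edges)
        for j, (src, _) in enumerate(edges)
        if dst == src
    ]

triangle = [("A", "B"), ("B", "C"), ("C", "A")]
line_graph = edge_graph(triangle)
```

For a directed triangle the edge graph is again a directed triangle, because each edge is continued by exactly one other edge.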

Reverse edge direction

Replaces every A → B edge with its reverse edge (B → A).

Attributes are preserved. Running this operation twice gets back the original graph.

Sample edges from co-occurrence

Connects vertices in the parent project with a given probability if they co-occur in any segments. Multiple co-occurrences will have the same chance of being selected as single ones. Loop edges are also included with the same probability.

There are edges to sample from including parallel edges.

Vertex pair selection probability

The probability of choosing a vertex pair. The expected value of the number of created edges will be probability * number of edges without parallel edges.
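One way to picture the sampling is the Python sketch below. It assumes directed vertex pairs, and the name `sample_cooccurrence` is hypothetical. Collecting the pairs in a set is what gives multiple co-occurrences the same chance as single ones.

```python
import random

def sample_cooccurrence(segments, probability, seed):
    """segments: segment name -> list of member vertices."""
    pairs = set()
    for members in segments.values():
        for a in members:
            for b in members:
                pairs.add((a, b))          # includes loops such as (a, a)
    rng = random.Random(seed)
    # Each distinct pair is kept independently with the given probability.
    return [pair for pair in sorted(pairs) if rng.random() < probability]

segments = {"s1": ["u", "v"], "s2": ["u", "v", "w"]}
all_edges = sample_cooccurrence(segments, probability=1.0, seed=0)
no_edges = sample_cooccurrence(segments, probability=0.0, seed=0)
```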

Random seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.

Sample graph by random walks

This operation realizes a random walk on the graph which can be used as a small smart sample to test your model on. The walk starts from a randomly selected vertex and at every step either aborts the current walk (with probability Walk abortion probability) and jumps back to the start point or moves to a randomly selected (directed sense) neighbor of the current vertex. After Number of walks from each start point restarts it selects a new start vertex. After Number of start points new start points were selected, it stops. The performance of this algorithm according to different metrics can be found in the following publication, https://cs.stanford.edu/people/jure/pubs/sampling-kdd06.pdf.

The output of the operation is a vertex and an edge attribute which describes which was the first step that ended at the given vertex / traversed the given edge. The attributes are not defined on vertices that were never reached or edges that were never traversed.
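The walk and its two output attributes can be sketched in Python. This is an illustration under assumptions, not the LynxKite implementation: dead ends end the current walk, aborts count as steps, and all names are hypothetical. The abort probability should be positive so that every walk ends.

```python
import random

def random_walk_sample(out_neighbors, num_starts, walks_per_start, abort_probability, seed):
    """out_neighbors: vertex -> list of vertices reachable along outgoing edges."""
    rng = random.Random(seed)
    vertices = sorted(out_neighbors)
    first_reached, first_traversed = {}, {}
    step = 0
    for _ in range(num_starts):
        start = rng.choice(vertices)
        for _ in range(walks_per_start):
            current = start
            first_reached.setdefault(current, step)
            while True:
                step += 1
                neighbors = out_neighbors.get(current, [])
                if not neighbors or rng.random() < abort_probability:
                    break                              # jump back to the start point
                nxt = rng.choice(neighbors)
                first_traversed.setdefault((current, nxt), step)
                first_reached.setdefault(nxt, step)    # only the first visit is recorded
                current = nxt
    return first_reached, first_traversed

cycle = {0: [1], 1: [2], 2: [0]}
reached, traversed = random_walk_sample(cycle, 2, 3, 0.5, seed=7)
```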

If the resulting sample is still too large, it can be quickly reduced by keeping only the low index nodes and edges. Obtaining a sample with exactly n vertices is also possible with the following procedure.

  1. Run this operation. Let us denote the computed vertex attribute by first_reached and edge attribute by first_traversed.

  2. Rank the vertices by first_reached.

  3. Filter the vertices by the rank attribute to keep only the vertex of rank n.

  4. Aggregate first_reached to a scalar on the filtered graph (use either average, first, max, min, or most_common - there is only one vertex in the filtered graph).

  5. Filter the vertices and edges of the original graph and keep the ones whose first_reached or first_traversed value is less than or equal to the value of the derived scalar.
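The steps above can be sketched in Python as a single function. The sketch assumes that ranks start at 1 and that the first_reached values are distinct; the function name `truncate_sample` is hypothetical.

```python
def truncate_sample(first_reached, first_traversed, n):
    """Keep the n vertices reached first and the edges traversed up to that point."""
    ranked = sorted(first_reached, key=first_reached.get)     # step 2: rank vertices
    threshold = first_reached[ranked[n - 1]]                  # steps 3-4: value at rank n
    vertices = {v for v, s in first_reached.items() if s <= threshold}
    edges = {e for e, s in first_traversed.items() if s <= threshold}
    return vertices, edges                                    # step 5: filtered graph

first_reached = {"a": 0, "b": 1, "c": 4, "d": 9}
first_traversed = {("a", "b"): 1, ("b", "c"): 4, ("c", "a"): 5, ("c", "d"): 9}
kept_vertices, kept_edges = truncate_sample(first_reached, first_traversed, n=3)
```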

Number of start points

The number of times a new start point is selected.

Number of walks from each start point

The number of times the random walk restarts from the same start point before selecting a new start point.

Walk abortion probability

The probability of aborting a walk instead of moving along an edge. Therefore the length of the parts of the walk between two abortions follows a geometric distribution with parameter Walk abortion probability.

Save vertex indices as

The name of the attribute which shows which step reached the given vertex first. It is not defined on vertices that were never reached.

Save edge indices as

The name of the attribute which shows which step traversed the given edge first. It is not defined on edges that were never traversed.

Seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.

Save to snapshot

Saves the input to a snapshot. The location of the snapshot has to be specified as a full path.

A full path in the LynxKite directory system has the following form: top_folder/subfolder_1/subfolder_2/…​/subfolder_n/name
Keep in mind that there is no leading slash at the beginning of the path.

Path

The full path of the target snapshot in the LynxKite directory system.

Segment by Double attribute

Segments the vertices by a Double vertex attribute.

The domain of the attribute is split into intervals of the given size and every vertex that belongs to a given interval will belong to one segment. Empty segments are not created.

Segmentation name

The new segmentation will be saved under this name.

Attribute

The Double attribute to segment by.

Interval size

The attribute’s domain will be split into intervals of this size. The splitting always starts at zero.

Overlap

If you enable overlapping intervals, then each interval will have a 50% overlap with both the previous and the next interval. As a result each vertex will belong to two segments, guaranteeing that any vertices with an attribute value difference less than half the interval size will share at least one segment.
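The interval assignment can be sketched in Python. The sketch assumes half-open intervals and, for the overlapping case, that a new interval starts at every half interval size; the function name `segments_for` is hypothetical.

```python
import math

def segments_for(value, interval_size, overlap=False):
    """Return the (start, end) intervals that contain the given attribute value."""
    if not overlap:
        k = math.floor(value / interval_size)
        return [(k * interval_size, (k + 1) * interval_size)]
    half = interval_size / 2
    k = math.floor(value / half)
    # With 50% overlap every value falls into exactly two consecutive intervals.
    return [((k - 1) * half, (k - 1) * half + interval_size),
            (k * half, k * half + interval_size)]

plain = segments_for(25.0, 10.0)
overlapping = segments_for(25.0, 10.0, overlap=True)
```

With overlap enabled, the values 24 and 26 share the interval from 20 to 30 even though they would fall on different sides of a boundary at 25.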

Segment by event sequence

Treat vertices as people attending events, and segment them by attendance of sequences of events. There are several algorithms for generating event sequences, see under Algorithm.

This operation runs on a segmentation which contains events as vertices, and it is a segmentation over a graph containing people as vertices.

Segmentation name

The new segmentation will be saved under this name.

Time attribute

The Double attribute corresponding to the time of events.

Location

A segmentation over events or an attribute corresponding to the location of events.

Algorithm

  • Take continuous event sequences: Merges subsequent events of the same location, and then takes all the continuous event sequences of length Sequence length, with maximal timespan of Time window length. For each of these sequences, a segment is created for each time bucket the starting event falls into. Time buckets are defined by Time window step and bucketing starts from 0.0 time.

  • Allow gaps in event sequences: Takes all event sequences that are no longer than Time window length and then creates a segment for each subsequence with Sequence length.

Sequence length

Number of events in each segment.

Time window step

Bucket size used for discretizing events.

Time window length

Maximum time difference between first and last event in a segment.

Segment by geographical proximity

Creates a segmentation from the features in a Shapefile. A vertex is connected to a segment if the position vertex attribute is within a specified distance from the segment’s geometry attribute. Feature attributes from the Shapefile become segmentation attributes.

  • The lookup depends on the coordinate reference system and distance metric of the feature. All inputs must use the same coordinate reference system and distance metric.

  • This algorithm creates an overlapping segmentation since one vertex can be sufficiently close to multiple GEO segments.

Shapefiles can be obtained from various sources, like OpenStreetMap.

Parameters

Name

The name of the new geographical segmentation.

Position

The (latitude, longitude) location tuple.

Shapefile

The Shapefile used for the lookup. The list is created from the files in the KITE_META/resources/shapefiles directory. A Shapefile consists of a .shp, .shx and .dbf file of the same name.

Distance

Vertices are connected to geographical segments if within this distance. The distance has to use the same metric and coordinate reference system as the features within the Shapefile.

Ignore unsupported shape types

If set true, silently ignores unknown shape types potentially contained by the Shapefile. Otherwise throws an error.

Segment by interval

Segments the vertices by a pair of Double vertex attributes representing intervals.

The domain of the attributes is split into intervals of the given size. Each of these intervals will represent a segment. Each vertex will belong to each segment whose interval intersects with the interval of the vertex. Empty segments are not created.
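Without overlap, the segments of a vertex can be sketched in Python as follows. Treating the vertex interval as closed at both ends is an assumption of this sketch, and the function name `interval_segments` is hypothetical.

```python
import math

def interval_segments(begin, end, interval_size):
    """Return every (start, end) segment interval that the vertex interval touches."""
    first = math.floor(begin / interval_size)
    last = math.floor(end / interval_size)
    return [(i * interval_size, (i + 1) * interval_size)
            for i in range(first, last + 1)]

touched = interval_segments(12.0, 31.0, 10.0)
```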

Segmentation name

The new segmentation will be saved under this name.

Begin attribute

The Double attribute corresponding to the beginning of intervals to segment by.

End attribute

The Double attribute corresponding to the end of intervals to segment by.

Interval size

The attribute’s domain will be split into intervals of this size. The splitting always starts at zero.

Overlap

If you enable overlapping intervals, then each interval will have a 50% overlap with both the previous and the next interval.

Segment by String attribute

Segments the vertices by a String vertex attribute.

Every vertex with the same attribute value will belong to one segment.

Segmentation name

The new segmentation will be saved under this name.

Attribute

The String attribute to segment by.

Segment by Vector attribute

Segments the vertices by a vector vertex attribute.

Segments are created from the values found in the vector attribute across all vertices. A vertex is connected to every segment that corresponds to an element of its vector.

Segmentation name

The new segmentation will be saved under this name.

Attribute

The vector attribute to segment by.
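
As a rough illustration, the grouping can be sketched with plain Python dictionaries. The function name and the sample data are hypothetical. Segmenting by a String attribute is the special case where every vertex carries exactly one value.

```python
from collections import defaultdict

def segment_by_vector(attribute):
    """Map each distinct vector element to the set of vertices that contain it."""
    segments = defaultdict(set)
    for vertex, values in attribute.items():
        for value in values:
            segments[value].add(vertex)
    return dict(segments)

interests = {"adam": ["chess", "golf"], "eve": ["golf"], "bob": []}
segments = segment_by_vector(interests)
```

A vertex with an empty vector, such as bob here, does not belong to any segment in this sketch.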

Set edge attribute icons

Associates icons with edge attributes. It has no effect beyond highlighting something on the user interface.

The icons are a subset of the Unicode characters in the "emoji" range, as provided by the Google Noto Font.

The icons for each attribute

Leave empty to remove the icon for the corresponding attribute or add one of the supported icon names, such as snowman_without_snow.

Set scalar icon

Associates an icon with a scalar. It has no effect beyond highlighting something on the user interface.

The icons are a subset of the Unicode characters in the "emoji" range, as provided by the Google Noto Font.

This operation is more easily accessed from the scalar’s dropdown menu in the project view.

Scalar

The scalar to highlight.

Icon

One of the supported icon names, such as snowman_without_snow. Leave empty to remove the icon.

Set segmentation icon

Associates an icon with a segmentation. It has no effect beyond highlighting something on the user interface.

The icons are a subset of the Unicode characters in the "emoji" range, as provided by the Google Noto Font.

This operation is more easily accessed from the segmentation’s dropdown menu in the project view.

Segmentation

The segmentation to highlight.

Icon

One of the supported icon names, such as snowman_without_snow. Leave empty to remove the icon.

Set vertex attribute icons

Associates icons with vertex attributes. It has no effect beyond highlighting something on the user interface.

The icons are a subset of the Unicode characters in the "emoji" range, as provided by the Google Noto Font.

The icons for each attribute

Leave empty to remove the icon for the corresponding attribute or add one of the supported icon names, such as snowman_without_snow.

Create snowball sample

This operation creates a small smart sample of a graph. First, a subset of the original vertices is chosen as start points; the ratio of the size of this subset to the size of the original vertex set is the first parameter of the operation. Then a certain neighborhood of each start point is added to the sample; the radius of this neighborhood is controlled by another parameter. The result of the operation is a subgraph of the original graph consisting of the vertices of the sample and the edges between them. This operation also creates a new attribute which shows how far the sample vertices are from the closest start point. (One vertex can be in more than one neighborhood.) This attribute can be used to decide whether a sample vertex is near a start point or not.

For example, you can create a random sample of the project’s graph to test your model on a smaller data set.

Start point ratio

The (approximate) fraction of vertices to use as starting points.

Radius

Limits the size of the neighborhoods of the start points.

Attribute name

The name of the attribute which shows how far the sample vertices are from the closest start point.

Seed

The random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.
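
The sampling procedure can be sketched as a breadth-first search from several start points. This is an illustration under assumptions, not the LynxKite implementation: edges are followed in both directions, and the function names are hypothetical.

```python
import random
from collections import defaultdict, deque

def pick_start_points(vertices, ratio, seed):
    """Keep each vertex with probability `ratio`; the same seed gives the same choice."""
    rng = random.Random(seed)
    return [v for v in vertices if rng.random() < ratio]

def snowball(edges, starts, radius):
    """Return the distance attribute and the edges of the sampled subgraph."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # assumption: edge direction is ignored
    distance = {v: 0 for v in starts}
    queue = deque(starts)
    while queue:
        v = queue.popleft()
        if distance[v] >= radius:
            continue  # the neighborhood ends at the given radius
        for n in neighbors[v]:
            if n not in distance:
                # First visit in a breadth-first search is the closest start point.
                distance[n] = distance[v] + 1
                queue.append(n)
    kept = [(s, d) for s, d in edges if s in distance and d in distance]
    return distance, kept

path = [(0, 1), (1, 2), (2, 3), (3, 4)]
distance, kept = snowball(path, starts=[0], radius=2)
```

On this five-vertex path with a single start point and radius 2, the sample contains vertices 0, 1 and 2 together with the two edges between them.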

Split edges

Split (multiply) edges in a graph. A Double edge attribute controls how many copies of the edge should exist after the operation. If this attribute is 1, the edge will be kept as it is. If this attribute is zero, the edge will be discarded entirely. Higher values (e.g., 2) will result in more identical copies of the given edge.

After the operation, all previous edge attributes will be preserved; in particular, copies of one edge will have the same values for the previous edge attributes. A new edge attribute (the so-called index attribute) will also be created so that you can differentiate between copies of the same edge. If a given edge was multiplied n times, the n new edges will have n different index attribute values running from 0 to n-1.

Repetition attribute

A Double edge attribute that specifies how many copies of the edge should exist after the operation. (The Double value is rounded to the nearest integer, so 1.8 will mean 2 copies.)

Index attribute name

The name of the attribute that will contain unique identifiers for the otherwise identical copies of the edge.
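
A minimal sketch of the behavior described above, with edges represented as dictionaries. The function name and parameter names are hypothetical, and rounding ties upward (0.5 becomes 1) is an assumption, since the text only says "nearest integer".

```python
import math

def split_edges(edges, repetition="repetition", index_name="index"):
    """Copy each edge according to its rounded repetition value and number the copies."""
    result = []
    for edge in edges:
        copies = math.floor(edge[repetition] + 0.5)  # assumption: ties round up
        for i in range(copies):
            # All previous attributes are kept; only the index differs between copies.
            result.append({**edge, index_name: i})
    return result

edges = [
    {"src": 1, "dst": 2, "repetition": 1.8},
    {"src": 2, "dst": 3, "repetition": 0.0},
]
split = split_edges(edges)
```

The first edge is kept as two copies with index values 0 and 1, and the second edge is discarded.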

Split to train and test set

Based on the source attribute, two new attributes are created: source_train and source_test. The attribute is partitioned, so every value is copied to either the training set or the test set, but never to both.

Parameters

Source attribute

The attribute you want to create train and test sets from.

Test set ratio

A test set is a random sample of the vertices. This parameter gives the size of the test set as a fraction of the total vertex count.

Random seed for test selection

Random seed.

LynxKite operations are typically deterministic. If you re-run an operation with the same random seed, you will get the same results as before. To get a truly independent random re-run, make sure you choose a different random seed.

The default value for random seed parameters is randomly picked, so only very rarely do you need to give random seeds any thought.
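
The partitioning can be sketched as follows. The function name is hypothetical and the way LynxKite draws the random sample may differ; the point is only that the two results are disjoint and that the seed makes the split repeatable.

```python
import random

def split_train_test(values, test_ratio, seed):
    """Partition an attribute (vertex -> value) into a train and a test attribute."""
    rng = random.Random(seed)
    train, test = {}, {}
    for vertex, value in values.items():
        # Each value goes to exactly one of the two new attributes.
        target = test if rng.random() < test_ratio else train
        target[vertex] = value
    return train, test

income = {vertex: 1000.0 + vertex for vertex in range(20)}
income_train, income_test = split_train_test(income, test_ratio=0.3, seed=7)
```

Running the function again with the same seed returns the same split, and a different seed gives an independent one.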

Split vertices

Split (multiply) vertices in a graph. A Double vertex attribute controls how many copies of the vertex should exist after the operation. If this attribute is 1, the vertex will be kept as it is. If this attribute is zero, the vertex will be discarded entirely. Higher values (e.g., 2) will result in more identical copies of the given vertex. All edges coming from and going to this vertex are multiplied (or discarded) appropriately.

After the operation, all previous vertex and edge attributes will be preserved; in particular, copies of one vertex will have the same values for the previous vertex attributes. A new vertex attribute (the so-called index attribute) will also be created so that you can differentiate between copies of the same vertex. If a given vertex was multiplied n times, the n new vertices will have n different index attribute values running from 0 to n-1.

This operation assigns new vertex ids to the vertices; these will be accessible via a new vertex attribute.

Repetition attribute

A Double vertex attribute that specifies how many copies of the vertex should exist after the operation. (The Double value is rounded to the nearest integer, so 1.8 will mean 2 copies.)

ID attribute name

The name of the vertex attribute that will hold the new vertex ids.

Index attribute name

The name of the attribute that will contain unique identifiers for the otherwise identical copies of the vertex.
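
The sketch below shows one plausible reading of this operation. It assumes that an edge is copied once for every pair of copies of its two endpoints, and that new ids are assigned sequentially; neither detail is confirmed by the text. The function and attribute names are hypothetical.

```python
import math

def split_vertices(vertices, edges, repetition):
    """Copy vertices by their rounded repetition value and rewire the edges."""
    new_vertices = []
    copies_of = {}
    for old_id, attributes in vertices.items():
        count = math.floor(repetition[old_id] + 0.5)  # assumption: ties round up
        copies_of[old_id] = []
        for i in range(count):
            new_id = len(new_vertices)  # assumption: sequential new ids
            new_vertices.append({**attributes, "new_id": new_id, "index": i})
            copies_of[old_id].append(new_id)
    # Assumption: one edge for every pair of endpoint copies.
    new_edges = [
        (s, d)
        for src, dst in edges
        for s in copies_of[src]
        for d in copies_of[dst]
    ]
    return new_vertices, new_edges

vertices = {"a": {"name": "a"}, "b": {"name": "b"}, "c": {"name": "c"}}
repetition = {"a": 2.0, "b": 1.0, "c": 0.0}
new_vertices, new_edges = split_vertices(vertices, [("a", "b"), ("b", "c")], repetition)
```

Vertex a is duplicated, so its edge to b appears twice. Vertex c is discarded, and the edge from b to c disappears with it.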

SQL1

Executes an SQL query on a single input, which can be either a project or a table. Outputs a table. If the input is a table, it is available in the query as input. For example:

select * from input

If the input is a project, its internal tables are available directly.

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from vertices where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from edge_attributes

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from scalars

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from edges where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `communities.belongs_to` group by segment_id

    • select base_name from `communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.
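
To get a feel for the shape of these tables outside LynxKite, the sketch below rebuilds a vertices table and an edges table in an in-memory SQLite database and runs two of the example queries. SQLite is only a stand-in for the SQL engine LynxKite uses, and the edge_list table, the view definition and the sample data are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table vertices (id integer, name text, age real);
    create table edge_list (src integer, dst integer, weight real);
    insert into vertices values (0, 'Adam', 20.3), (1, 'Eve', 18.2), (2, 'Bob', 50.3);
    insert into edge_list values (0, 1, 1.0), (1, 0, 2.0), (2, 0, 3.0);
    -- One row per edge: edge_ prefix for edge attributes,
    -- src_ and dst_ prefixes for the attributes of the two endpoints.
    create view edges as
        select e.weight as edge_weight,
               s.name as src_name, s.age as src_age,
               d.name as dst_name, d.age as dst_age
        from edge_list e
        join vertices s on s.id = e.src
        join vertices d on d.id = e.dst;
""")

young = conn.execute("select count(*) from vertices where age < 30").fetchone()[0]
heaviest = conn.execute(
    "select max(edge_weight) from edges where src_age < dst_age"
).fetchone()[0]
```

In this sample only the edge from Eve to Adam has a younger source than destination, so the second query returns that edge's weight.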

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL10

Executes an SQL query on its ten inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three, four, five, six, seven, eight, nine, ten. For example:

select * from one
union select * from two
union select * from three
union select * from four
union select * from five
union select * from six
union select * from seven
union select * from eight
union select * from nine
union select * from ten
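
Note that in standard SQL, union removes duplicate rows from the combined result, while union all keeps them. The sketch below demonstrates the difference with two small tables in an in-memory SQLite database, used here only as a stand-in; the table contents are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table one (id integer);
    create table two (id integer);
    insert into one values (1), (2);
    insert into two values (2), (3);
""")

# `union` keeps each distinct row once.
distinct_rows = conn.execute(
    "select * from one union select * from two order by id"
).fetchall()

# `union all` keeps every row from every input.
all_rows = conn.execute(
    "select * from one union all select * from two order by id"
).fetchall()
```

If the inputs may share rows and you need every one of them in the output, union all is the variant to reach for.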

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL2

Executes an SQL query on its two inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one and two. For example:

select one.*, two.*
from one
join two
on one.id = two.id
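
The join above can be tried out on toy data. The sketch uses an in-memory SQLite database as a stand-in for the two table inputs; the column names other than id and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table one (id integer, name text);
    create table two (id integer, score real);
    insert into one values (1, 'Adam'), (2, 'Eve');
    insert into two values (1, 0.5), (3, 0.9);
""")

cursor = conn.execute(
    "select one.*, two.* from one join two on one.id = two.id"
)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```

Only ids present in both inputs survive the inner join. Because both inputs contribute an id column, the output contains it twice; listing the wanted columns explicitly instead of using one.* and two.* avoids the duplicate.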

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL3

Executes an SQL query on its three inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three. For example:

select one.*, two.*, three.*
from one
join two
join three
on one.id = two.id and one.id = three.id

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL4

Executes an SQL query on its four inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three, four. For example:

select * from one
union select * from two
union select * from three
union select * from four

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL5

Executes an SQL query on its five inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three, four, five. For example:

select * from one
union select * from two
union select * from three
union select * from four
union select * from five

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL6

Executes an SQL query on its six inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three, four, five, six. For example:

select * from one
union select * from two
union select * from three
union select * from four
union select * from five
union select * from six

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL7

Executes an SQL query on its seven inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three, four, five, six, seven. For example:

select * from one
union select * from two
union select * from three
union select * from four
union select * from five
union select * from six
union select * from seven

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL8

Executes an SQL query on its eight inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three, four, five, six, seven, eight. For example:

select * from one
union select * from two
union select * from three
union select * from four
union select * from five
union select * from six
union select * from seven
union select * from eight

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

SQL9

Executes an SQL query on its nine inputs, which can be either projects or tables. Outputs a table. The inputs are available in the query as one, two, three, four, five, six, seven, eight, nine. For example:

select * from one
union select * from two
union select * from three
union select * from four
union select * from five
union select * from six
union select * from seven
union select * from eight
union select * from nine

See the SQL syntax section for more.

The following tables are available for SQL access for project inputs:

  • All the vertex attributes can be accessed in the vertices table.

    Example: select count(*) from `one.vertices` where age < 30

  • All the edge attributes can be accessed in the edge_attributes table.

    Example: select max(weight) from `one.edge_attributes`

    You cannot query the edge_attributes table if there are no edge attributes, even if the edges themselves are defined.

  • All the scalars can be accessed in the scalars table.

    Example: select `!vertex_count` from `one.scalars`

  • All the edge and vertex attributes can be accessed in the edges table. Each row of this table represents an edge. The attributes of the edge are prefixed with edge_, while the attributes of the source and destination vertices are prefixed with src_ and dst_ respectively.

    Example: select max(edge_weight) from `one.edges` where src_age < dst_age

  • The belongs_to table is defined for each segmentation of a project or a segmentation. It contains the vertex attributes for the connected pairs of base and segmentation vertices prefixed with base_ and segment_ respectively.

    Examples:

    • select count(*) from `one.communities.belongs_to` group by segment_id

    • select base_name from `one.communities.belongs_to` where segment_name = "COOKING"

Backticks (`) are used for escaping table and column names with special characters.

For single-input SQL boxes the edges, vertices, etc. tables can be accessed with or without the input name prefix.
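
As a rough illustration of how the edges table is shaped, the following Python sketch builds rows with edge_, src_ and dst_ prefixed columns from toy data. The data, the attribute names (age, weight) and the helper edges_table are all hypothetical; this is not how LynxKite builds the table internally.

```python
# Hypothetical toy data; attribute names are assumptions for illustration.
vertices = {1: {"age": 25}, 2: {"age": 40}, 3: {"age": 31}}
edges = [
    {"src": 1, "dst": 2, "weight": 0.5},
    {"src": 3, "dst": 1, "weight": 2.0},
    {"src": 2, "dst": 3, "weight": 1.5},
]

def edges_table(vertices, edges):
    """Build one row per edge with edge_, src_ and dst_ prefixed columns."""
    rows = []
    for e in edges:
        row = {f"edge_{k}": v for k, v in e.items() if k not in ("src", "dst")}
        row.update({f"src_{k}": v for k, v in vertices[e["src"]].items()})
        row.update({f"dst_{k}": v for k, v in vertices[e["dst"]].items()})
        rows.append(row)
    return rows

rows = edges_table(vertices, edges)
# Counterpart of: select max(edge_weight) from `one.edges` where src_age < dst_age
max_weight = max(r["edge_weight"] for r in rows if r["src_age"] < r["dst_age"])
```

Only the edge from vertex 1 to vertex 2 satisfies src_age < dst_age in this toy data, so the query result is that edge's weight.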

You can browse the list of available tables and columns by clicking on the button.

Summary

This summary will be displayed below the box in the workspace. Distinguishing SQL boxes this way can make the workspace easier to understand.

Input names

Comma-separated list of names used to refer to the inputs of the box.

For example, you can set it to accounts (for a single-input SQL box) and then write select count(*) from accounts as the query.

SQL query

The query. Press Ctrl-Enter to save your changes while staying in the editor.

Persist result

If enabled, the output table will be saved to disk once it is calculated. If disabled, the query will be re-executed each time its output is used. Persistence can improve performance at the cost of disk space.

Take edges as vertices

Takes a project and creates a new one where the vertices correspond to the original project’s edges. All edge attributes in the original project are converted to vertex attributes in the new project with the edge_ prefix. All vertex attributes are converted to two vertex attributes with src_ and dst_ prefixes. Scalars and segmentations of the original project are lost.

Take segmentation as base project

Takes a segmentation of a project and returns the segmentation as a base project itself.

Take segmentation links as base project

Replaces the current project with the links from its base to the selected segmentation, represented as vertices. The vertices will have base_ and segment_ prefixed attributes generated for the attributes on the base project and the segmentation respectively.

Train a decision tree classification model

Trains a decision tree classifier model using the graph’s vertex attributes. The algorithm recursively partitions the feature space into two parts. The tree predicts the same label for each bottommost (leaf) partition. Each binary partitioning is chosen from a set of possible splits in order to maximize the information gain at the corresponding tree node. For calculating the information gain the impurity of the nodes is used (read more about impurity at the description of the impurity parameter): the information gain is the difference between the parent node impurity and the weighted sum of the two child node impurities. More information about the parameters.

Model name

The model will be stored as a scalar using this name.

Label attribute

The vertex attribute the model is trained to predict.

Feature attributes

The attributes the model learns to use for making predictions.

Impurity measure

Node impurity is a measure of homogeneity of the labels at the node and is used for calculating the information gain. There are two impurity measures provided.

  • Gini: Let S denote the set of training examples in this node. Gini impurity is the probability of a randomly chosen element of S to get an incorrect label, if it was randomly labeled according to the distribution of labels in S.

  • Entropy: Let S denote the set of training examples in this node, and let p_i be the fraction of examples in S that have the i-th label. The entropy of the node is the sum of the -p_i log(p_i) values over all labels.
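
The two impurity measures and the resulting information gain can be sketched in a few lines of Python. Here p_i stands for the fraction of examples in the node that carry the i-th label. The function names are hypothetical, and the natural logarithm is an assumption, since the base is not stated above.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: chance of mislabeling a random element of S."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: sum of -p_i * log(p_i) over the labels present in S."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted sum of the child impurities."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

labels = ["a", "a", "b", "b"]
gain = information_gain(labels, ["a", "a"], ["b", "b"])
```

A node with a single label has zero impurity under both measures, so a split that separates the two labels perfectly recovers the full parent impurity as gain.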

Maximum number of bins

Number of bins used when discretizing continuous features.

Maximum depth

Maximum depth of the tree.

Minimum information gain

Minimum information gain for a split to be considered as a tree node.

Minimum instances per node

For a node to be split further, each of its children must receive at least this many training instances.

Seed

We maximize the information gain only among a subset of the possible splits. This random seed is used for selecting the set of splits we consider at a node.

Train a decision tree regression model

Trains a decision tree regression model using the graph’s vertex attributes. The algorithm recursively partitions the feature space into two parts. The tree predicts the same label for each bottommost (leaf) partition. Each binary partitioning is chosen from a set of possible splits in order to maximize the information gain at the corresponding tree node. For calculating the information gain the variance of the nodes is used: the information gain is the difference between the parent node variance and the weighted sum of the two child node variances. More information about the parameters.

Note: Once the tree is trained there is only a finite number of possible predictions. Because of this, the regression model might seem like a classification. The main difference is that these buckets ("classes") are invented by the algorithm during the training in order to minimize the variance.
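
To make the variance-based gain and the note about finite predictions concrete, here is a minimal single-feature sketch. It is not the Spark implementation: it tries every midpoint between sorted feature values instead of a binned, sampled subset of splits, and best_split is a hypothetical helper.

```python
from statistics import mean, pvariance

def best_split(xs, ys):
    """Return (threshold, gain) of the split with the largest variance reduction."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    parent = pvariance([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = len(left) / n * pvariance(left) + len(right) / n * pvariance(right)
        gain = parent - weighted
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, gain = best_split(xs, ys)
# A depth-1 tree can only ever predict the two leaf means.
leaf_predictions = sorted({
    mean(y for x, y in zip(xs, ys) if x < threshold),
    mean(y for x, y in zip(xs, ys) if x >= threshold),
})
```

The two leaf means play the role of the "classes" mentioned in the note: they are chosen by the training procedure, not given in advance.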

Model name

The model will be stored as a scalar using this name.

Label attribute

The vertex attribute the model is trained to predict.

Feature attributes

The attributes the model learns to use for making predictions.

Maximum number of bins

Number of bins used when discretizing continuous features.

Maximum depth

Maximum depth of the tree.

Minimum information gain

Minimum information gain for a split to be considered as a tree node.

Minimum instances per node

For a node to be split further, each of its children must receive at least this many training instances.

Seed

We maximize the information gain only among a subset of the possible splits. This random seed is used for selecting the set of splits we consider at a node.

Train a GCN classifier

Trains a Graph Convolutional Network using PyTorch Geometric. Applicable for classification problems.

Save model as

The resulting model will be saved as a Scalar using this name.

Iterations

Number of training iterations.

Feature vector

Vector attribute containing the features to be used as inputs for the training algorithm.

Attribute to predict

The attribute we want to predict.

Use labels as inputs

Set true to allow a vertex to see the labels of its neighbors and use them for predicting its own label.

Batch size

In each iteration of the training, we compute the error only on a subset of the vertices. Batch size specifies the size of this subset.

Learning rate

Value of the learning rate.

Hidden size

Size of the hidden layers.

Number of convolution layers

Number of convolution layers.

Convolution operator

The type of graph convolution to use. GCNConv or GatedGraphConv.

Random seed

Random seed for initializing network weights and choosing training batches.

Train a GCN regressor

Trains a Graph Convolutional Network using PyTorch Geometric. Applicable for regression problems.

Save model as

The resulting model will be saved as a Scalar using this name.

Iterations

Number of training iterations.

Feature vector

Vector attribute containing the features to be used as inputs for the training algorithm.

Attribute to predict

The attribute we want to predict.

Use labels as inputs

Set true to allow a vertex to see the labels of its neighbors and use them for predicting its own label.

Batch size

In each iteration of the training, we compute the error only on a subset of the vertices. Batch size specifies the size of this subset.

Learning rate

Value of the learning rate.

Hidden size

Size of the hidden layers.

Number of convolution layers

Number of convolution layers.

Convolution operator

The type of graph convolution to use. GCNConv or GatedGraphConv.

Random seed

Random seed for initializing network weights and choosing training batches.

Train a k-means clustering model

Trains a k-means clustering model using the graph’s vertex attributes. The algorithm stops when the maximum number of iterations is reached or when no cluster center moved in the last iteration.

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The mean serves as a prototype of the cluster.

For best results it may be necessary to scale the features before training the model.
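
The stopping rule can be illustrated with a plain-Python version of the standard (Lloyd's) k-means iteration. This is a sketch under simplifying assumptions: random initial centers drawn from the data and squared Euclidean distance. It is not the distributed implementation LynxKite runs, and the kmeans function and its parameter names are hypothetical.

```python
import random

def kmeans(points, k, max_iter, seed):
    """Stop at max_iter, or earlier once no center moved in the last iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = sorted(kmeans(points, k=2, max_iter=20, seed=0))
```

Because the distance is computed on raw values, a feature with a much larger range than the others would dominate the assignment step, which is why scaling the features first can matter.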

Model name

The model will be stored as a scalar using this name.

Feature attributes

Attributes to be used as inputs for the training algorithm. The trained model will have a list of features with the same names and semantics.

K clusters

The number of clusters to be created.

Maximum iterations

The maximum number of iterations (>=0).

Seed

The random seed.

Train a logistic regression model

Trains a logistic regression model using the graph’s vertex attributes. The algorithm stops when the maximum number of iterations is reached or when no coefficient changed in the last iteration. The threshold of the model is chosen to maximize the F-score.

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.

The current implementation of logistic regression only supports binary classes.
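
Choosing a threshold that maximizes the F-score can be sketched as follows. The sketch assumes the F1 variant of the F-score and simply tries every predicted probability as a candidate threshold. The function names and the sample data are hypothetical.

```python
def f1_score(labels, predictions):
    """F1 for binary labels encoded as 0.0 and 1.0."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1.0 and p == 1.0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0.0 and p == 1.0)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1.0 and p == 0.0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(labels, probabilities):
    """Keep the candidate threshold with the highest F1."""
    best = (None, -1.0)
    for t in sorted(set(probabilities)):
        predictions = [1.0 if p >= t else 0.0 for p in probabilities]
        score = f1_score(labels, predictions)
        if score > best[1]:
            best = (t, score)
    return best

labels = [0.0, 0.0, 1.0, 1.0, 1.0]
probabilities = [0.1, 0.6, 0.4, 0.7, 0.9]  # made-up model outputs
threshold, score = best_threshold(labels, probabilities)
```

In this toy data the best threshold accepts one false positive in exchange for catching all three positive examples.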

Model name

The model will be stored as a scalar using this name.

Label attribute

The vertex attribute that the model is trained to classify. The attribute should be a binary label of either 0.0 or 1.0.

Feature attributes

Attributes to be used as inputs for the training algorithm.

Maximum iterations

The maximum number of iterations (>=0).

Train linear regression model

Trains a linear regression model using the graph’s vertex attributes.

Model name

The model will be stored as a scalar using this name.

Label attribute

The vertex attribute for which the model is trained.

Feature attributes

Attributes to be used as inputs for the training algorithm. The trained model will have a list of features with the same names and semantics.

Training method

The algorithm used to train the linear regression model.

Transform

Transforms all columns of a table input via SQL expressions. Outputs a table.

An input parameter is generated for every table column. The parameters are SQL expressions interpreted on the input table. The default value leaves the column alone.

Use base project as segmentation

Creates a new segmentation which is a copy of the base project. Also creates segmentation links between the original vertices and their corresponding vertices in the segmentation.

For example, let’s say we have a social network and we want to make a segmentation containing a selected group of people and the segmentation links should represent the original connections between the members of this selected group and other people.

We can do this by first using this operation to copy the base project to a segmentation, then using the Grow segmentation operation to add the necessary segmentation links. Finally, using the Filter by attributes operation, we can ensure that the segmentation contains only members of the selected group.

Segmentation name

The name assigned to the new segmentation. It defaults to the project’s name.

Use metagraph as graph

Loads the relationships between LynxKite entities such as attributes and operations as a graph. This complex graph can be useful for debugging or demonstration purposes. Because it exposes data about all projects, it is only accessible to administrator users.

Current timestamp

This number will be used to identify the current state of the metagraph. If you edit the history and leave the timestamp unchanged, you will get the same metagraph as before. If you change the timestamp, you will get the latest version of the metagraph.

Use other project as segmentation

Copies another project into a new segmentation for this one. There will be no connections between the segments and the base vertices. You can import/create those via other operations. (See Use table as segmentation links and Define segmentation links from matching attributes.)

It is possible to import the project itself as segmentation. But even in this special case, there will be no connections between the segments and the base vertices. Another operation, Use base project as segmentation can be used if edges are desired.

Use table as edge attributes

Imports edge attributes for existing edges from a table. This is useful when you already have edges and just want to import one or more attributes.

There are two different use cases for this operation:

  • Import using unique edge attribute values. For example, if the edges represent relationships between people (identified by src and dst IDs), we can import the number of total calls between each two people. In this case the operation fails for duplicate attribute values, i.e. parallel edges.

  • Import using a normal edge attribute. For example, if each edge represents a call and the location of the person making the call is an edge attribute (cell tower ID), we can import latitudes and longitudes for those towers. Here the tower IDs still have to be unique in the lookup table.

Table

The table to import from.

Edge attribute

The edge attribute which is used to join with the table’s ID column.

ID column

The ID column name in the table. This should be a String column that uses the values of the chosen edge attribute as IDs.

Name prefix for the imported edge attributes

Prepend this prefix string to the new edge attribute names. This can be used to avoid accidentally overwriting existing attributes.

Assert unique edge attribute values

If set to true, asserts that the edge attribute values are unique. The values of the matching ID column in the table have to be unique in both cases.

Use table as edges

Imports edges from a table. Your vertices must have an identifying attribute, by which the edges can be attached to them.

Example use case

If you have one table for the vertices (e.g., subscribers) and another for the edges (e.g., calls), you can import the first table with the Use table as vertices operation and then use this operation to add the edges.

Parameters

Table

The table to import from.

Vertex ID attribute

The vertex attribute that holds the IDs used in the table to identify the endpoints of the edges.

Source ID column

The table column that specifies the source of the edge.

Destination ID column

The table column that specifies the destination of the edge.

Use table as graph

Imports edges from a table. Each line in the table represents one edge. Each column in the table will be accessible as an edge attribute.

Vertices will be generated for the endpoints of the edges with two vertex attributes:

  • stringId will contain the ID string that was used in the table.

  • id will contain the internal vertex ID.

This is useful when your table contains edges (e.g., calls) and there is no separate table for vertices. This operation makes it possible to load edges and use them as a graph. Note that this graph will never have zero-degree vertices.
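
The following sketch shows the idea on a small in-memory table. The column names (caller, callee, duration) and the table_to_graph helper are hypothetical, and the internal ID is modeled as a running counter, which is an assumption and not how LynxKite assigns IDs.

```python
def table_to_graph(rows, src_column, dst_column):
    """Create one vertex per distinct endpoint and one edge per table row."""
    internal = {}  # stringId -> internal id (a running counter in this sketch)
    for row in rows:
        for column in (src_column, dst_column):
            internal.setdefault(row[column], len(internal))
    vertices = [{"id": i, "stringId": s} for s, i in internal.items()]
    # Every table column is kept on the edge as an edge attribute.
    edges = [
        {"src": internal[row[src_column]], "dst": internal[row[dst_column]], **row}
        for row in rows
    ]
    return vertices, edges

calls = [
    {"caller": "alice", "callee": "bob", "duration": 12},
    {"caller": "bob", "callee": "carol", "duration": 30},
]
vertices, edges = table_to_graph(calls, "caller", "callee")
```

Since vertices are created only from edge endpoints, every vertex has at least one edge, which is why the result cannot contain zero-degree vertices.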

Table

The table to import from.

Source ID column

The table column that contains the edge source ID.

Destination ID column

The table column that contains the edge destination ID.

Use table as segmentation links

Imports the connections between the main project and this segmentation from a table. Each row in the table represents a connection between one base vertex and one segment.

Table

The table to import from.

Identifying vertex attribute in base project

The String vertex attribute that can be joined to the identifying column in the table.

Identifying column for base project

The table column that can be joined to the identifying attribute on the base project.

Identifying vertex attribute in segmentation

The String vertex attribute that can be joined to the identifying column in the table.

Identifying column for segmentation

The table column that can be joined to the identifying attribute on the segmentation.

Use table as segmentation

Imports a segmentation from a table. The table must have a column identifying an existing vertex by a String attribute and another column that specifies the segment it belongs to. Each vertex may belong to any number of segments.

The rest of the columns in the table are ignored.
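
A small sketch of the expected table shape, assuming hypothetical column names (person, club, joined) and a hypothetical table_to_segmentation helper. It groups the vertex IDs by segment and ignores every other column.

```python
from collections import defaultdict

def table_to_segmentation(rows, vertex_id_column, segment_id_column):
    """Group vertex IDs by segment; all other columns are ignored."""
    segments = defaultdict(set)
    for row in rows:
        segments[row[segment_id_column]].add(row[vertex_id_column])
    return dict(segments)

rows = [
    {"person": "Adam", "club": "chess", "joined": 2019},
    {"person": "Adam", "club": "cooking", "joined": 2021},
    {"person": "Eve", "club": "chess", "joined": 2020},
]
segmentation = table_to_segmentation(rows, "person", "club")
```

Here Adam appears in two rows and therefore ends up in two segments.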

Table

The table to import from.

Name of new segmentation

The imported segmentation will be created under this name.

Vertex ID attribute

The String vertex attribute that identifies the base vertices.

Vertex ID column

The table column that identifies vertices.

Segment ID column

The table column that identifies segments.

Use table as vertex attributes

Imports vertex attributes for existing vertices from a table. This is useful when you already have vertices and just want to import one or more attributes.

There are two different use cases for this operation:

  • Import using unique vertex attribute values. For example, if the vertices represent people, this attribute can be a personal ID. In this case the operation fails in case of duplicate attribute values (either among vertices or in the table).

  • Import using a normal vertex attribute. For example, this can be a city of residence (vertices are people) and we can import census data for those cities for each person. Here the operation allows duplications of cities among vertices (but not in the lookup table).

Table

The table to import from.

Vertex attribute

The String vertex attribute which is used to join with the table’s ID column.

ID column

The ID column name in the table. This should be a String column that uses the values of the chosen vertex attribute as IDs.

Name prefix for the imported vertex attributes

Prepend this prefix string to the new vertex attribute names. This can be used to avoid accidentally overwriting existing attributes.

Assert unique vertex attribute values

If set to true, asserts that the vertex attribute values are unique. The values of the matching ID column in the table have to be unique in both cases.
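
The two use cases and the uniqueness checks can be sketched as a simple lookup join. The function import_vertex_attributes, its parameter names and the sample data (including the population figure) are made up for illustration and do not reflect the actual implementation.

```python
def import_vertex_attributes(vertices, table, attr, id_column,
                             prefix="", assert_unique=True):
    """Join table rows onto vertices by matching vertex[attr] to row[id_column]."""
    ids = [row[id_column] for row in table]
    if len(ids) != len(set(ids)):
        raise ValueError("ID column values must be unique in the table")
    if assert_unique:
        values = [v[attr] for v in vertices]
        if len(values) != len(set(values)):
            raise ValueError("vertex attribute values must be unique")
    lookup = {row[id_column]: row for row in table}
    result = []
    for v in vertices:
        merged = dict(v)
        row = lookup.get(v[attr], {})
        # The prefix avoids overwriting attributes that already exist.
        merged.update({prefix + k: val for k, val in row.items() if k != id_column})
        result.append(merged)
    return result

people = [{"name": "Adam", "city": "Oslo"}, {"name": "Eve", "city": "Oslo"}]
census = [{"city_name": "Oslo", "population": 700000}]  # illustrative value only
enriched = import_vertex_attributes(people, census, "city", "city_name",
                                    prefix="census_", assert_unique=False)
```

With assert_unique left at True, the same call would raise an error, because two people share the same city.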

Use table as vertices

Imports vertices (no edges) from a table. Each column in the table will be accessible as a vertex attribute. An extra vertex attribute is generated to hold the internal vertex ID.

Table

The table to import from.

Save internal ID as

The name of the extra vertex attribute that is generated for the internal vertex ID. Set it to an empty string if you don’t want the internal ID exposed as an attribute.

Weighted aggregate edge attribute globally

Aggregates edge attributes across the entire graph into one scalar for each attribute. For example you could use it to calculate the total income as the sum of call durations weighted by the rates across an entire call dataset.

Generated name prefix

Save the aggregated values with this prefix.

Weight

The Double attribute to use as weight.

The available weighted aggregators are:

  • For Double attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

    • weighted_average

    • weighted_sum

  • For other attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)
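
The weighted aggregators can be written out in a few lines. This is a plain-Python sketch of what each name suggests, with made-up call durations and rates. How ties between equal weights are resolved is an assumption (the first matching value wins here).

```python
def by_max_weight(values, weights):
    """Pick a value whose weight is maximal (first one on ties)."""
    return max(zip(weights, values), key=lambda wv: wv[0])[1]

def by_min_weight(values, weights):
    """Pick a value whose weight is minimal (first one on ties)."""
    return min(zip(weights, values), key=lambda wv: wv[0])[1]

def weighted_sum(values, weights):
    return sum(v * w for v, w in zip(values, weights))

def weighted_average(values, weights):
    return weighted_sum(values, weights) / sum(weights)

durations = [60.0, 120.0, 30.0]  # hypothetical call durations
rates = [0.5, 0.25, 1.0]         # hypothetical rate per unit of duration
total_income = weighted_sum(durations, rates)
```

The total_income value corresponds to the example above: the sum of call durations weighted by the rates.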

Weighted aggregate edge attribute to vertices

Aggregates an attribute on all the edges going in or out of vertices. For example it can calculate the average cost per second of calls for each person.

Generated name prefix

Save the aggregated attributes with this prefix.

Weight

The Double attribute to use as weight.

Aggregate on

  • incoming edges: Aggregate across the edges coming in to each vertex.

  • outgoing edges: Aggregate across the edges going out of each vertex.

  • all edges: Aggregate across all the edges going in or out of each vertex.

The available weighted aggregators are:

  • For Double attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

    • weighted_average

    • weighted_sum

  • For other attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

Weighted aggregate from segmentation

Aggregates vertex attributes across all the segments that a vertex in the base project belongs to. For example, it can calculate an average over the cliques a person belongs to, weighted by the size of the cliques.

Generated name prefix

Save the aggregated attributes with this prefix.

Weight

The Double attribute to use as weight.

The available weighted aggregators are:

  • For Double attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

    • weighted_average

    • weighted_sum

  • For other attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

Weighted aggregate on neighbors

Aggregates across the vertices that are connected to each vertex. You can use the Aggregate on parameter to define how exactly this aggregation will take place: choosing one of the 'edges' settings can result in a neighboring vertex being taken into account several times (depending on the number of edges between the vertex and its neighboring vertex); whereas choosing one of the 'neighbors' settings will result in each neighboring vertex being taken into account once.

For example, it can calculate the average age per kilogram of the friends of each person.

Generated name prefix

Save the aggregated attributes with this prefix.

Weight

The Double attribute to use as weight.

Aggregate on

  • incoming edges: Aggregate across the edges coming in to each vertex.

  • outgoing edges: Aggregate across the edges going out of each vertex.

  • all edges: Aggregate across all the edges going in or out of each vertex.

  • symmetric edges: Aggregate across the 'symmetric' edges for each vertex: this means that if you have n edges going from A to B and k edges going from B to A, then min(n,k) edges will be taken into account for both A and B.

  • in-neighbors: For each vertex A, aggregate across those vertices that have an outgoing edge to A.

  • out-neighbors: For each vertex A, aggregate across those vertices that have an incoming edge from A.

  • all neighbors: For each vertex A, aggregate across those vertices that either have an outgoing edge to or an incoming edge from A.

  • symmetric neighbors: For each vertex A, aggregate across those vertices that have both an outgoing edge to and an incoming edge from A.
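
The difference between the 'edges' and 'neighbors' settings is easiest to see on a tiny multigraph. The edge list and helper names below are hypothetical.

```python
from collections import Counter

# (source, destination) pairs; A -> B appears twice on purpose.
edges = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "A")]

def in_neighbors(v):
    return {s for s, d in edges if d == v}

def out_neighbors(v):
    return {d for s, d in edges if s == v}

def all_neighbors(v):
    return in_neighbors(v) | out_neighbors(v)

def symmetric_neighbors(v):
    return in_neighbors(v) & out_neighbors(v)

def neighbors_via_all_edges(v):
    """With 'all edges', a neighbor is counted once per connecting edge."""
    return sorted([d for s, d in edges if s == v] + [s for s, d in edges if d == v])

def symmetric_edge_count(a, b):
    """n edges a -> b and k edges b -> a give min(n, k) symmetric edges."""
    counts = Counter(edges)
    return min(counts[(a, b)], counts[(b, a)])
```

For vertex A, the 'all edges' setting sees B three times and C once, while 'all neighbors' sees B and C once each.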

The available weighted aggregators are:

  • For Double attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

    • weighted_average

    • weighted_sum

  • For other attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

Weighted aggregate to segmentation

Aggregates vertex attributes across all the vertices that belong to a segment. For example, it can calculate the average age per kilogram of each clique.

Weight

The Double attribute to use as weight.

The available weighted aggregators are:

  • For Double attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

    • weighted_average

    • weighted_sum

  • For other attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

Weighted aggregate vertex attribute globally

Aggregates vertex attributes across the entire graph into one scalar for each attribute. For example you could use it to calculate the average age across an entire dataset of people weighted by their PageRank.

Generated name prefix

Save the aggregated values with this prefix.

Weight

The Double attribute to use as weight.

The available weighted aggregators are:

  • For Double attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)

    • weighted_average

    • weighted_sum

  • For other attributes:

    • by_max_weight (picks a value for which the corresponding weight value is maximal)

    • by_min_weight (picks a value for which the corresponding weight value is minimal)