Demography Prediction Tutorial

See this analysis in action! It's one of our working examples in the cloud demo.

Handling missing data in a social network database

Intro to this tutorial

Intro to the dataset

Importing and sampling the data

Build the graph

screenshot

Filter the vertices

screenshot

screenshot

screenshot

Check that you end up with 17777 vertices and 212368 edges.

Divide into train and test set

screenshot

Model description

Now that we have a clean dataset divided into train and test sets, we can start building models to predict users’ ages. In this example we will use four different models, evaluate them, and pick the best one.

We will build each model on the training set and use it to predict the ages in the test set. We then compare the predictions with the real test set ages and choose the best model to apply to the previously separated users.
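The evaluation loop described above can be sketched outside LynxKite in a few lines of plain Python. This is only an illustration: the function names, the 40% test fraction, the fixed seed, and the toy ages are assumptions made for the example and do not come from the tutorial's workspace. The error measure is the mean absolute error, the metric used later to compare the methods.

```python
import random


def split_train_test(user_ids, test_fraction=0.2, seed=42):
    """Shuffle the user ids and split them into (train, test) lists.

    test_fraction and seed are illustrative defaults, not tutorial values.
    """
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def mean_absolute_error(actual, predicted):
    """Average |actual - predicted| over users that have both values."""
    errors = [abs(actual[u] - predicted[u]) for u in predicted if u in actual]
    if not errors:
        return None  # nothing to evaluate
    return sum(errors) / len(errors)


# Toy data: user id -> real age (made up for the example).
ages = {"a": 20, "b": 30, "c": 40, "d": 50, "e": 60}
train, test = split_train_test(ages, test_fraction=0.4)

# Trivial baseline model: predict the training mean for every test user.
train_mean = sum(ages[u] for u in train) / len(train)
predicted = {u: train_mean for u in test}
mae = mean_absolute_error(ages, predicted)
```

Any of the models in the following sections can replace the baseline here: each one only has to produce a mapping from test users to predicted ages, and users without a prediction are simply left out of the error.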

Linear regression with non-graph features prediction

screenshot

screenshot

Neighbor prediction

screenshot

screenshot

The resulting average prediction error is 3.71 years. The model predicts the age for 17,556 nodes (98.75%); it leaves out users whose neighbors don’t have a defined age attribute.
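To make the idea concrete, here is a minimal sketch of neighbor-based prediction in plain Python. It assumes that edges are undirected and that the prediction is the unweighted mean of the neighbors' known ages; the function name and the toy graph are hypothetical, and the exact aggregation configured in LynxKite may differ.

```python
def predict_from_neighbors(edges, known_ages):
    """Predict each user's age as the mean age of neighbors with a known age."""
    neighbors = {}
    for u, v in edges:  # assumption: edges are treated as undirected
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    predictions = {}
    for user, nbrs in neighbors.items():
        nbr_ages = [known_ages[n] for n in nbrs if n in known_ages]
        if nbr_ages:  # no neighbor with a defined age -> no prediction
            predictions[user] = sum(nbr_ages) / len(nbr_ages)
    return predictions


# Toy graph: only "b" and "c" have a known age.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]
known_ages = {"b": 20, "c": 30}
neighbor_predictions = predict_from_neighbors(edges, known_ages)
```

In this toy graph, "d" and "e" get no prediction because neither has a neighbor with a defined age, which mirrors the small share of nodes the model leaves out.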

Linear regression with graph features prediction

Calculate attributes

screenshot

screenshot

screenshot

Estimate Age

screenshot

screenshot

The calculated error goes down to 3.60 years, which is only slightly better than the 3.71 years of the simple neighbor-based attribute propagation. Let’s try a decision tree regression.

screenshot

The decision tree error went down to 3.52 years, which is still not a big improvement.

Community prediction

Find Communities

Estimate Age

screenshot

screenshot

The error is 1.89 years, but the model doesn’t predict for outliers (users who belong only to a two-person community).

Method selection

Of the methods tried, the one with the lowest mean absolute error was the last one (viral modeling according to communities). The problem with this model is that it does not predict the age for every user: it leaves out those who are not part of a community. We can combine it with the second model (the neighbors’ average) to get a prediction for most users. For the remaining users, we can use the overall average age.
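The fallback order described here (community prediction first, then the neighbors' average, then the overall average) can be sketched as follows. The function name, argument names, and sample values are hypothetical and only illustrate the priority logic; they are not taken from the tutorial's workspace.

```python
def combine_predictions(users, community_pred, neighbor_pred, overall_mean):
    """Pick, per user, the first available prediction in priority order."""
    combined = {}
    for user in users:
        if user in community_pred:      # best model, but partial coverage
            combined[user] = community_pred[user]
        elif user in neighbor_pred:     # second choice: neighbors' average
            combined[user] = neighbor_pred[user]
        else:                           # last resort: overall average age
            combined[user] = overall_mean
    return combined


# Made-up predictions for three users.
users = ["a", "b", "c"]
community_pred = {"a": 31.0}
neighbor_pred = {"a": 28.0, "b": 25.0}
final_ages = combine_predictions(
    users, community_pred, neighbor_pred, overall_mean=33.5
)
```

Every user ends up with exactly one value, so the combined result has no gaps even though neither of the two graph-based models covers everyone on its own.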

Real age prediction

screenshot

screenshot

screenshot

screenshot

Results

We tackled a common data science problem: imputing missing data. We tried four different models and found that a combination of two of them, with the overall average as a fallback, was the best approach. The final dataset has no missing data and can be downloaded directly as a CSV file.