Using Data to Predict Meetup Group Joins
The Data team here at Meetup is working on some really neat stuff to improve our Meetup Group recommendation engine.
Currently, our algorithm for suggesting Meetup Groups is rather basic. It looks at all of the Meetups within a certain radius of you that share your topics, then removes the Meetups that don't match your age or gender. This topic-centric approach isn't really predicting anything, let alone taking advantage of the numerous other indicative factors that go into a good Meetup suggestion.
We thought it would be fun (and useful) to make our recommendation engine smarter using a model based approach.
In machine learning, this type of problem can be classified as supervised learning - specifically, binary classification. In other words, we can phrase the problem as the following: given this member X, would some Meetup Y be interesting enough for X to join - yes or no?
While there are many fancy supervised learning algorithms out there, we chose the simpler logistic regression to get our feet wet. Logistic regression is really similar to linear regression, except the weighted sum of features gets squashed by a sigmoid function, so the output conveniently falls between 0 and 1 - which makes it really useful when you only want binary results. This number can be interpreted as the probability that member X joins some Meetup Y. In our case, 1 (or anything close to 1) is yes and 0 (or close to 0) is no - perfect.
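To make that concrete, here's a minimal sketch of the scoring step in Scala. The names are ours for illustration, not production code:

```scala
// A minimal sketch of logistic regression scoring - not our production code.
object LogisticSketch {
  // The sigmoid squashes any real number into the interval (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // P(member X joins Meetup Y): a weighted sum of feature values
  // (plus an intercept), pushed through the sigmoid.
  def predict(weights: Seq[Double], features: Seq[Double], intercept: Double): Double =
    sigmoid(intercept + weights.zip(features).map { case (w, x) => w * x }.sum)
}
```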
Our New Model
For our Alpha Model, we chose to include a very limited set of features (aka quantifiable characteristics we are looking to incorporate into our predictive model). They are:
- number of topic matches (1.77)
- number of Facebook friends (1.5)
- zip match (0.66)
- city match (0.57)
- distance (-0.1)
- gender unmatch (-0.38)
- age unmatch (-1.24)
The numbers in parentheses are the feature weights, computed as a side effect of training our model using logistic regression. These weights make intuitive sense: for example, having a high number of topic matches with a Meetup contributes positively (a signal that you are more likely to be interested in the content of the Meetup), while not matching a Meetup's gender specification (aka an "unmatch") correlates negatively with the likelihood of you joining.
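Plugging those weights in shows how a single prediction comes together. The feature values below, and the implicit intercept of zero, are hypothetical - they're just here to illustrate the arithmetic:

```scala
// Hypothetical example: scoring one (member, Meetup) pair with the Alpha
// Model weights listed above. The feature values are made up for illustration.
object AlphaModelExample extends App {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Trained weights, keyed by feature name.
  val weights = Seq(
    "topic matches"    -> 1.77,
    "Facebook friends" -> 1.5,
    "zip match"        -> 0.66,
    "city match"       -> 0.57,
    "distance"         -> -0.1,
    "gender unmatch"   -> -0.38,
    "age unmatch"      -> -1.24
  )

  // A member with 2 topic matches, 1 Facebook friend in the group,
  // matching zip and city, a distance of 3, and no gender/age mismatch.
  val features = Map(
    "topic matches"    -> 2.0,
    "Facebook friends" -> 1.0,
    "zip match"        -> 1.0,
    "city match"       -> 1.0,
    "distance"         -> 3.0,
    "gender unmatch"   -> 0.0,
    "age unmatch"      -> 0.0
  )

  val score = weights.map { case (name, w) => w * features(name) }.sum
  println(f"P(join) = ${sigmoid(score)}%.3f") // ≈ 0.997 for these made-up values
}
```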
While these results look great, the real challenge comes when trying to implement this model in production, where results have to be served on the fly and at scale. Take our worst-case scenario, for example: serving recommendations in New York City, where there are approximately 700,000 members on Meetup and nearly 20,000 Meetup Groups. Since recommendations are calculated on a pairwise per-member, per-Meetup basis, that's (700,000 x 20,000) 14 billion combinations!
Even if we are calculating recommendations for one person in NYC, that’s still 20,000 Meetups to go through. If each Meetup took even 1 millisecond to compute, that’s still 20 seconds. Can you imagine waiting 20 seconds for a web page to load? Our recommendations could be 100% accurate but no one would ever see them because people would simply get impatient and browse away within those 20 seconds.
Obviously, whatever we decided to do, this had to be fast.
Given the amount of computation, we made several engineering decisions to maximize speed. Since many of the feature calculations required going to the database, and database queries are slow, we preloaded most of the data we needed into memory. And to avoid calculating a probability for every Meetup in the world, we use an R-tree (http://en.wikipedia.org/wiki/R-tree) to quickly narrow the candidates down to Meetups close to the member.
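As a sketch of that spatial pre-filter, here's how a bounding-box lookup could be built on the STRtree (a packed R-tree variant) from the open-source JTS library. The class and method names here are hypothetical, and our production index may differ:

```scala
import org.locationtech.jts.geom.Envelope
import org.locationtech.jts.index.strtree.STRtree
import scala.jdk.CollectionConverters._

case class Group(id: Long, lat: Double, lon: Double)

// An in-memory spatial index over all Meetup Groups.
class GroupIndex(groups: Seq[Group]) {
  private val tree = new STRtree()
  groups.foreach(g => tree.insert(new Envelope(g.lon, g.lon, g.lat, g.lat), g))
  tree.build() // bulk-load once; an STRtree is immutable after building

  // All groups inside a bounding box of +/- radiusDeg degrees around the member.
  def nearby(lat: Double, lon: Double, radiusDeg: Double): Seq[Group] = {
    val box = new Envelope(lon - radiusDeg, lon + radiusDeg,
                           lat - radiusDeg, lat + radiusDeg)
    tree.query(box).asScala.map(_.asInstanceOf[Group]).toSeq
  }
}
```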
The second optimization we made was to parallelize every computation we could. This worked out rather well because we are using Scala for some of the backend implementation. Scala makes parallelizing certain collection operations like map and filter ridiculously easy - simply adding ".par" to a collection will do the trick.
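Here's roughly what that looks like; the names are illustrative rather than our actual scoring pipeline:

```scala
// Illustrative names only - not our actual scoring code.
// (On Scala 2.13+, parallel collections need the scala-parallel-collections
// module and `import scala.collection.parallel.CollectionConverters._`.)
object ParScoring extends App {
  case class Group(id: Long)

  // Dummy stand-in for the logistic regression scorer.
  def predict(g: Group): Double = 1.0 / (1.0 + math.exp(-(g.id - 10000.0) / 2000.0))

  val candidateGroups = (1L to 20000L).map(Group.apply)

  val recommendations =
    candidateGroups.par                  // fan the scoring out across all cores
      .map(g => g -> predict(g))         // score each candidate Meetup
      .filter { case (_, p) => p > 0.5 } // keep the likely joins
      .seq                               // back to a sequential collection
      .sortBy { case (_, p) => -p }      // highest probability first

  println(s"${recommendations.size} likely joins")
}
```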
To our pleasant surprise, the mean fetch time is a blazing 70ms!! Not too shabby at all.
We'll be launching our new recommendations sometime in January. In the meantime, we'll be looking to improve our model by adding features such as Meetup friends, second-degree Facebook friends, and other metrics that suggest a Meetup's quality.
With the augmented infrastructure, we’ve also set ourselves up nicely to tackle other types of recommendations such as events and topics.
While the benefits of using data to power decisions and product features are certainly well-appreciated, it is easy to get caught up in the implementation details and engineering challenges that arise. What's important is taking a step back to verify whether the produced results make sense. Seemingly smaller decisions, such as how to deal with missing data or an uneven number of positive and negative labels, can ultimately have a profound impact on how the results turn out. Catching these discrepancies and properly interpreting the results requires a thorough understanding of not only the algorithms themselves, but also the nature of the data and how it is used in modeling.
Take a look for yourself here!