The Business Solutions Series is a compilation of solutions to various business challenges that I have encountered throughout my professional journey.
Context
Improve a service to detect bots that engage in online ad fraud.
Problem
This problem is, for the most part, an unsupervised ML problem (i.e. we don’t have ground truth for which users are bots). I like this project for this reason, and also because we ended up building several different solutions that complemented each other, which allowed us to improve the overall system over time.
Objective
Identify online bots. This is valuable for internet companies because it can substantially reduce the money that brands lose to ad fraud and also protect their services from malicious players without affecting their real users.
Solution
The first model our team developed was very simple and elegant. It relies on two observations: bots tend to be programmed to visit the same set of fraudulent websites (to commit ad fraud), and these fraudulent websites are usually not visited by real human beings. This creates an interesting conditional probability distribution for those fraudulent websites.
First, we computed independent scores for all websites by calculating the probability that a random user will visit the site at any point in time. Popular websites visited by many users (e.g. facebook.com) would get a high score (a random user is likely to visit them) and obscure websites would get a very low score (a random user is very unlikely to ever visit them). Then, for every pair of websites, we also computed the conditional probability of a random user visiting one website given that they have already visited the other website in that pair.
Even though fraudulent websites will have very low probability of being visited by a random user, their conditional probability will be very high if those users have already visited other fraudulent websites before. These kinds of bots tend to visit mostly fraudulent websites so their browsing history will be heavily represented by website pairs like this one (pairs where each site has low scores independently but very high conditional probability when seen together).
For each user, we looked at all the websites they had visited and also at all the pairs formed by those websites. If a user had mostly visited websites with low independent probability scores but high conditional probability scores across most of their pairs, there was a very high chance that the user was actually a fraudulent bot. We encoded this logic into the data warehouse, which made it easy to flag these users going forward.
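The scoring logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the warehouse implementation: the function names, the `low` and `high` thresholds, and the toy browsing histories are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def site_probabilities(histories):
    """histories maps each user to the set of websites they visited."""
    n_users = len(histories)
    site_counts = Counter()
    pair_counts = Counter()
    for sites in histories.values():
        site_counts.update(sites)
        # Ordered pairs (a, b): the user visited both a and b.
        pair_counts.update(permutations(sorted(sites), 2))
    # P(site): share of users who visited the site at all.
    marginal = {s: c / n_users for s, c in site_counts.items()}
    # P(b | a): users who visited both, divided by users who visited a.
    conditional = {(a, b): c / site_counts[a] for (a, b), c in pair_counts.items()}
    return marginal, conditional


def suspicion_score(sites, marginal, conditional, low=0.05, high=0.8):
    """Share of a user's site pairs that look like fraud pairs.

    A pair counts as suspicious when both sites are rarely visited on their
    own but one is very likely to be visited given the other.
    The thresholds are assumptions chosen for the toy data below.
    """
    pairs = list(permutations(sorted(sites), 2))
    if not pairs:
        return 0.0
    suspicious = sum(
        1
        for a, b in pairs
        if marginal.get(a, 0.0) < low
        and marginal.get(b, 0.0) < low
        and conditional.get((a, b), 0.0) > high
    )
    return suspicious / len(pairs)


# Toy data: 97 humans on popular sites, 3 bots on the same fraudulent sites.
histories = {f"human-{i}": {"facebook.com", "news.example"} for i in range(97)}
histories.update(
    {f"bot-{i}": {"fraud-a.example", "fraud-b.example", "fraud-c.example"} for i in range(3)}
)

marginal, conditional = site_probabilities(histories)
scores = {user: suspicion_score(s, marginal, conditional) for user, s in histories.items()}
```

In this toy data every bot pair is suspicious and no human pair is, so the score separates the two groups cleanly; real traffic would need thresholds tuned on observed distributions.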
Initially, another easy way to catch bots was to compare browser type and version distributions between known legit websites and known fraudulent websites. Many bots actually operate inside infected machines and use browsers built into the malware to do their fraudulent browsing under the hood, without the machine's user even noticing it is happening. By monitoring known fraudulent websites we could flag browser types and versions that looked suspicious compared to the browser types and versions seen on legit websites. However, fraudsters rapidly realized this was giving them away, so they started faking (spoofing) their browser user agents to avoid being flagged so easily.
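One way to picture the distribution comparison is as a ratio between a browser's share of traffic on known fraudulent sites and its share on known legit sites. The sketch below is an assumption about how such a check could look; the `min_ratio` cutoff, the additive smoothing, and the browser names are made up for illustration.

```python
from collections import Counter


def suspicious_browsers(legit_agents, fraud_agents, min_ratio=5.0, smoothing=1.0):
    """Flag browsers that are over-represented on fraudulent sites.

    Each argument is a list of "browser version" strings, one per visit.
    Additive smoothing keeps the ratio finite for browsers never seen
    on legit sites.
    """
    legit = Counter(legit_agents)
    fraud = Counter(fraud_agents)
    browsers = set(legit) | set(fraud)
    legit_total = sum(legit.values()) + smoothing * len(browsers)
    fraud_total = sum(fraud.values()) + smoothing * len(browsers)

    flagged = {}
    for browser in browsers:
        legit_share = (legit[browser] + smoothing) / legit_total
        fraud_share = (fraud[browser] + smoothing) / fraud_total
        ratio = fraud_share / legit_share
        if ratio >= min_ratio:
            flagged[browser] = ratio
    return flagged


# Hypothetical traffic samples.
legit_visits = ["Chrome 120"] * 80 + ["Firefox 121"] * 20
fraud_visits = ["Chrome 120"] * 10 + ["EmbeddedKit 1.0"] * 90

flagged = suspicious_browsers(legit_visits, fraud_visits)
```

Here only the hypothetical `EmbeddedKit 1.0` browser is flagged, because it dominates the fraudulent sample and never appears in the legit one.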
We then decided to collect a lot of information (hundreds of signals) using custom JavaScript code running on those browsers, so that we could detect browsers that were pretending to be something they were not. We initially started this project by hard-coding rules on those signals, but we quickly realized it was a maintenance nightmare: we needed to keep up with new rules for new browser versions all the time. So, instead, we replaced that process with a machine learning model. We pre-filtered browsing examples that we were very certain were clean and trained a model to classify the browser type and version based on the hundreds of signals (features) that we were able to collect from each browser using our JavaScript probe. We used an AutoML managed service to train this model. Given the amount and diversity of signals we were able to collect from browsers running our code, the ML model did an amazing job at predicting the browser type and version - with 100% accuracy - on the “clean browsing” hold-out set. We then used the same model to predict the browser type and version in the wild and started flagging browsers whose user agent didn’t match the predicted browser type or version.
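The flagging step (compare the claimed user agent with the browser predicted from the probe signals) can be illustrated as follows. To keep the example self-contained, a deliberately tiny nearest-neighbour classifier stands in for the AutoML model, and the four binary signals and all browser labels are hypothetical; the real system used hundreds of signals.

```python
def hamming(a, b):
    """Number of positions where two signal tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class NearestNeighborBrowserModel:
    """Toy stand-in for the trained classifier (not the AutoML model)."""

    def fit(self, signals, labels):
        self._examples = list(zip(signals, labels))
        return self

    def predict(self, signal):
        # Return the label of the closest clean training example.
        _, label = min(self._examples, key=lambda ex: hamming(ex[0], signal))
        return label


def flag_spoofed(model, visits):
    """Return ids of visits whose claimed browser differs from the prediction."""
    return [
        v["id"]
        for v in visits
        if model.predict(v["signals"]) != v["claimed_browser"]
    ]


# Hypothetical clean training set: one signal tuple per browser type/version.
clean_signals = [(1, 0, 1, 0), (1, 1, 0, 0), (0, 0, 0, 1)]
clean_labels = ["Chrome 120", "Firefox 121", "Safari 17"]
model = NearestNeighborBrowserModel().fit(clean_signals, clean_labels)

# Traffic in the wild: the second visit claims Chrome but its signals look like Firefox.
visits = [
    {"id": "v1", "signals": (1, 0, 1, 0), "claimed_browser": "Chrome 120"},
    {"id": "v2", "signals": (1, 1, 0, 0), "claimed_browser": "Chrome 120"},
]
spoofed = flag_spoofed(model, visits)
```

The point of the design is that the model only ever learns what genuine browsers look like, so no per-version rules have to be maintained; a mismatch between prediction and user agent is the fraud signal.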
Lastly, we complemented the solution with one more model that helped us separate cloud computers that were being used legitimately to browse the Internet (e.g. a VPN service) from those being used fraudulently (e.g. bots using cloud VMs). For this use case, we got our hands on some labeled data to train our model. We found a big list of cloud IPs that we knew for sure were being used by VPN-like services, and another big list of cloud IPs on which we were very confident no real human traffic should appear (yet we were still seeing traffic from them).
Both kinds of cloud IPs would show a high volume of browsing, but we were able to find a few features that helped us train an almost perfect model. We realized that cloud IPs for which no human traffic was expected would have many short-lived, non-overlapping cookie sessions (bots were likely clearing all cookies after every website visit), while VPN-like IPs would show many concurrent cookie sessions coming from the same IP, with a distinctive distribution of how long different cookie sessions lasted (as real people were actually behind those sessions). By computing counts of total cookies, the maximum number of concurrent cookies during some time window, and some metrics about the distribution of cookie lifetimes, we got an AutoML managed service to learn to differentiate these two kinds of cloud IPs beyond the smaller sample we had labels for.
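The cookie features described above could be computed per IP roughly as shown below. This is a sketch under assumptions: sessions are represented as hypothetical `(start, end)` timestamps in seconds, the caller has already restricted them to the time window of interest, the list is non-empty, and mean and standard deviation stand in for whatever lifetime metrics were actually used.

```python
from statistics import mean, pstdev


def cookie_features(sessions):
    """sessions: non-empty list of (start, end) timestamps for one IP."""
    # Sweep over start/end events to find the peak number of live cookies.
    events = []
    for start, end in sessions:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back sessions are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    concurrent = 0
    max_concurrent = 0
    for _, delta in events:
        concurrent += delta
        max_concurrent = max(max_concurrent, concurrent)

    lifetimes = [end - start for start, end in sessions]
    return {
        "total_cookies": len(sessions),
        "max_concurrent": max_concurrent,
        "mean_lifetime": mean(lifetimes),
        "lifetime_stdev": pstdev(lifetimes),
    }


# Hypothetical bot-like IP: short, back-to-back sessions that never overlap.
bot_ip = [(0, 5), (5, 10), (10, 15), (15, 20)]
# Hypothetical VPN-like IP: overlapping sessions with varied lifetimes.
vpn_ip = [(0, 600), (30, 90), (60, 1800), (100, 400)]

bot_features = cookie_features(bot_ip)
vpn_features = cookie_features(vpn_ip)
```

On these toy inputs the bot-like IP never has more than one live cookie and shows no variation in lifetime, while the VPN-like IP peaks at three concurrent cookies with widely spread lifetimes. Feature rows like these, one per labeled IP, are what a classifier would be trained on.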
Impact
By combining many kinds of models to tackle this problem we were able to build one of the most effective bot detection solutions in the industry.