As a Data Scientist or Machine Learning Engineer, you rely on metrics like the AUC of the ROC, the partial AUC, and the F score as everyday tools for evaluating the performance of your models. While you know how these metrics reflect the value of your models, explaining that value to the organization at large can be a challenge.
Communicating your machine learning work to teammates is a vital part of a data scientist's job because your work impacts many areas of your organization. That said, the meaning of your work to teams outside of Data Science can get lost in translation, since each function has its own terminology. For example, increasing the recall of the fraud blocking model from 50% to 60% resonates with Data Scientists, but that metric doesn't convey the financial value to a CFO. In this post, I'll walk you through how to translate your machine learning performance metrics into tangible insights your coworkers can appreciate.
A meeting of the minds
At Patreon, data scientists report within a centralized organization but are systematically embedded in cross-functional teams to develop close working relationships with coworkers across various disciplines. This lets us approach our work through a holistic lens. When one of our Data Scientists thinks about improving our anti-fraud model, they think about how it'll affect the Trust & Safety team, what Engineering might think of its execution time in production, and how it'll impact the plan Finance put together. We know that our partners' clear understanding of our work is essential to our collective success.
The Three Key Principles
When designing a metric to evaluate a machine learning model and communicate its performance to your teammates:
- The metric must take into account the operating thresholds of your model when it is in production.
- The metric must be true in the real world, including the effects of systems and rules outside of your model.
- The metric should reflect empathy for your colleagues, cast in a language they use on a day-to-day basis.
1. Configurations like thresholds matter
Consider a fraud model that puts large, suspicious transactions into a queue for manual review by Trust & Safety experts. Suppose that model gives a good user’s transaction a score of 0.93 — this specific value is not meaningful to the user. They care about whether their order will go through. The Trust & Safety expert cares about whether they’ll need to review the transaction. And your CFO cares about whether the transaction will lead to revenue or not.
If the score is 0.93 and the threshold for review is ≥0.92, then the user is blocked, the T&S expert has more work to do, and the CFO doesn’t see the money. But if the score is 0.93 and the threshold for review is ≥0.94, it’s very different: the user completes their task, the T&S expert can work on more important things, and the money is added to the bottom line. Taking the time to understand your coworkers’ business goals will help you share your findings in a way that resonates with them, so everyone can benefit from the numbers.
When we put a model into production and integrate it with other systems, we must choose a threshold to operate at, and the only thing that matters is how your model performs at that threshold. If the production system your model connects to flags a transaction when your model scores it ≥0.92, then the only thing that matters is how your model performs at a threshold of 0.92.
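To make that concrete, here is a minimal sketch (with hypothetical scores and a hypothetical threshold of 0.92) of the only output the rest of the world ever sees: whether a score clears the production threshold.

```python
import numpy as np

# Hypothetical model scores for a batch of transactions.
scores = np.array([0.10, 0.93, 0.55, 0.97, 0.20])

# The (hypothetical) operating threshold used by the downstream review system.
REVIEW_THRESHOLD = 0.92

# This binary decision, not the raw score, is what reaches users,
# Trust & Safety reviewers, and ultimately the bottom line.
flagged_for_review = scores >= REVIEW_THRESHOLD
print(flagged_for_review)  # [False  True False  True False]
```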
This principle shows why the AUC doesn’t reflect the reality of model performance. A fraud model would never run at a false positive rate of 60% (your company wouldn’t make any money!). At least in a fraud context, it’s a flaw that the integral used to compute AUC takes into account a model’s performance at every possible false positive rate.
What should you use instead? The standard menu of confusion-matrix-based metrics all take the threshold into account, because every confusion matrix is calculated at a specific threshold. Precision, recall, false positive rate: all good choices.
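For illustration, here is a small sketch of computing these metrics at a fixed operating point, using scikit-learn and made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = fraud) and model scores for a validation set.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.05, 0.93, 0.97, 0.20, 0.88, 0.95, 0.40, 0.99, 0.10, 0.60])

REVIEW_THRESHOLD = 0.92  # the operating point the production system will use
y_pred = (y_score >= REVIEW_THRESHOLD).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # of the transactions we flag, how many are fraud
recall = tp / (tp + fn)      # of all the fraud, how much we catch
fpr = fp / (fp + tn)         # good transactions we incorrectly flag
print(f"precision={precision:.2f}  recall={recall:.2f}  fpr={fpr:.2f}")
```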
You might object: when you're deep in the trenches of model development, feature engineering, and hyperparameter tuning, you won't know what the final threshold will be! That's when you can borrow the spirit of this principle and use the partial AUC. By integrating the ROC curve from 0 up to a maximum false positive rate, it gives sensitivity to the region of a model's performance that will actually matter, without locking you into a specific threshold. The generic AUC can show two models performing equally well, while the partial AUC reveals that one is the better choice for a low-FPR environment and the other for a high-recall environment.
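If you want to compute it, scikit-learn's roc_auc_score accepts a max_fpr argument that restricts the integral to a low-FPR region (returning a standardized partial AUC). Here is a sketch with synthetic data, where the 0.05 cap is just an example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data: fraud (label 1) tends to score higher than legitimate traffic.
y_true = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])
y_score = np.concatenate([rng.beta(2, 8, size=950), rng.beta(8, 2, size=50)])

# Full AUC integrates over every false positive rate, including regions
# (say, FPR = 60%) a fraud system would never operate in.
full_auc = roc_auc_score(y_true, y_score)

# Partial AUC restricted to the low-FPR region that will actually matter.
partial_auc = roc_auc_score(y_true, y_score, max_fpr=0.05)

print(f"full AUC = {full_auc:.3f}, partial AUC (FPR <= 0.05) = {partial_auc:.3f}")
```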
2. The real world affects your model’s results; it should affect your metric too
It’s rare for a machine learning model to run out in production all alone, sending its output directly to users. Think about a recommendation algorithm: does it simply send its top five picks to the viewer, displayed in order? No, what’s displayed is probably mixed in with some business logic first. Maybe your company doesn’t want to recommend certain controversial content, or it wants to include ads, or the in-house product is getting boosted.
Your system probably doesn't look like a model sending its output straight to users. It looks more like a model whose output passes through business rules and other production systems before anything reaches a user.
If you ignore those real-world effects, then the performance metrics you’re sharing will be wrong. While you’re building the best model you can, it can make sense to narrow your scope to just its direct output. But your customers don’t care about what your model did when you ran it offline in your Jupyter Notebook; your customers care about customer-facing content. And your colleagues on other teams focus on what your customers care about.
The solution is to treat your model plus the surrounding business rules as a single package, make that whole package the object of analysis, and compute all the important metrics on its output.
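One way to do that is to wrap the model call and the business rules in a single function and evaluate its output. In the sketch below, the rules (a minimum-amount exemption and a blocklist override) and the field names are hypothetical stand-ins for whatever actually runs in your production path:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def model_scores(transactions):
    """Stand-in for the trained model's scoring call (hypothetical)."""
    return np.array([t["score"] for t in transactions])

def production_decisions(transactions, threshold=0.92):
    """The whole package: model scores plus the business rules that run in prod."""
    flagged = model_scores(transactions) >= threshold
    for i, t in enumerate(transactions):
        if t["amount"] < 5:      # hypothetical rule: tiny transactions are never reviewed
            flagged[i] = False
        if t["on_blocklist"]:    # hypothetical rule: blocklisted accounts are always flagged
            flagged[i] = True
    return flagged.astype(int)

# Evaluate the output of the whole package, not the raw model.
transactions = [
    {"score": 0.95, "amount": 3,   "on_blocklist": False, "is_fraud": 1},
    {"score": 0.40, "amount": 120, "on_blocklist": True,  "is_fraud": 1},
    {"score": 0.93, "amount": 80,  "on_blocklist": False, "is_fraud": 0},
    {"score": 0.10, "amount": 25,  "on_blocklist": False, "is_fraud": 0},
]
y_true = [t["is_fraud"] for t in transactions]
y_pred = production_decisions(transactions)
print("precision:", precision_score(y_true, y_pred), "recall:", recall_score(y_true, y_pred))
```

Computed this way, precision and recall describe what customers and reviewers actually experience, not what the model did alone in a notebook.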
3. Use a metric relevant to what your audience is already an expert in
We like it when people speak to us in a language we understand and about topics we care about. So frame the conversation about your model in your audience's language and in terms of what they care about.
Here are four ways you might describe four models that stop fraudsters from withdrawing money:
- “The AUC on the OOT test set is 0.902.”
- “The insult rate is 0.13%.”
- “The precision after review is 44%.”
- “The loss directly prevented each month is $29,000.”
Plot twist: they're all describing the same model! Double twist: each one is the best description of that model for a particular audience.
To another data scientist, "the AUC is 0.902" succinctly summarizes the overall performance of the model. They know what AUC is, they have a sense for what a "good" or "bad" value might be, and they've used that measure themselves.
To a member of the Customer Support Team, "the insult rate is 0.13%" tells them how many inbound complaints they can expect to hear from good users who've been incorrectly blocked. Notice this might actually be harder for some data scientists to understand: what's an insult rate? It's another name for the false positive rate, favored in domains where being identified as positive could be literally "insulting." Tailoring the conversation to your audience creates shared understanding.
To a member of the Trust & Safety team, "the precision after review is 44%" tells them what they care most about in words they use all the time. They're the ones doing the review, and they know that if the precision is really low they'll be wasting their time.
To a member of the Finance team, "the loss directly prevented each month is $29,000" instantly gives them the bottom line on their top concern: how much money we can save each month. It's not that they don't care about the potentially insulting experiences of good users, but their role in the company means that the information they need from you is the information they can plug into a financial forecast spreadsheet.
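These translations are mostly arithmetic on the same confusion matrix. The sketch below uses hypothetical monthly counts, chosen to roughly reproduce the illustrative figures above, to derive each team's number:

```python
# Hypothetical monthly counts at the production threshold (chosen so the
# derived numbers roughly match the illustrative figures quoted above).
tp = 210       # fraudulent transactions correctly flagged and blocked
fp = 270       # good transactions incorrectly flagged (and then reviewed)
fn = 90        # fraudulent transactions that slipped through
tn = 205_000   # good transactions left alone
avg_fraud_amount = 138.10  # hypothetical average loss per fraudulent charge

insult_rate = fp / (fp + tn)              # what Customer Support hears about
precision_after_review = tp / (tp + fp)   # what Trust & Safety experiences in the queue
loss_prevented = tp * avg_fraud_amount    # what Finance plugs into the forecast

print(f"insult rate: {insult_rate:.2%}")                        # ~0.13%
print(f"precision after review: {precision_after_review:.0%}")  # ~44%
print(f"loss prevented per month: ${loss_prevented:,.0f}")      # ~$29,000
```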
So if you’ve just got one sentence to explain how your model’s doing to a colleague, carefully choose which aspect of the model to convey so that they can instantly see how it relates to their work. And, when you can, choose language they use in their day-to-day.
If this is a challenge, ask your coworkers for candid feedback on your machine learning updates: are they useful to them? How do they want to think about the relation between their work and your work?
Putting it all together
The final report we generate at Patreon when retraining our anti-fraud models pulls these metrics together into a single summary (the numbers in it are for illustration purposes only).
This brings together all three principles. All the metrics are computed at the recommended threshold. Behind the scenes, the offline script estimates the effects of production code and business logic. And there’s a metric for each of our key stakeholder teams, showing precisely the way the model relates to their expertise.
At Patreon, we work hard to build products and systems that help creators and patrons. In this specific example of understanding and improving the accuracy of our anti-fraud ML, these systems are helping protect creators from bad actors on the platform. While these ML models protect creators from hundreds of thousands of dollars of fraudulent charges throughout the year, they also provide the opportunity for technical teams like Data Science to forge deeper working relationships with other teams. Through these collaborations, we translate our language of ML into the languages of business, Trust & Safety operations, and user experience. In doing so, we're strengthening our Data Science empathy muscle and ensuring that the value of our models is articulated in the world outside of data and code.
Are you a data science enthusiast who wants to impact the next era of the creator economy? We’re hiring!