Jan 31, 2023
We regularly invite ML practitioners and industry leaders to share their experiences with our Community. Want to ask our next guest a question? Join the BentoML Community Slack.
We recently invited Alessya Visnjic. Alessya is the CEO and co-founder of WhyLabs, the AI Observability company on a mission to build the interface between AI and human operators. Prior to WhyLabs, Alessya was a CTO-in-residence at the Allen Institute for AI (AI2), where she evaluated the commercial potential for the latest advancements in AI research. Earlier in her career, Alessya spent 9 years at Amazon leading Machine Learning adoption and tooling efforts. She was a founding member of Amazon’s first ML research center in Berlin, Germany. Alessya is also the founder of Rsqrd AI, a global community of 1,000+ AI practitioners who are committed to making AI technology Robust & Responsible.
• The biggest challenge in ML observability is cost-effectiveness, which WhyLabs solves by scanning data where it already is, avoiding data duplication, and increasing accuracy and cost-effectiveness.
• Home-rolled observability solutions tend to break down because of the ongoing maintenance they require and the continuous need to add features to support growing use cases. It is often better to use purpose-built software like WhyLabs.
How did I get into ML and MLOps? Well, as a little girl, I was always dreaming about… jk!
Early in my career, I spent 9 years working at Amazon, deploying some of Amazon’s first ML applications to production, carrying a pager to respond to failures, and building tools to make my pager days less painful. At that time, we didn’t have a concept of MLOps, just “tools that help me sleep more during on-call”. Building those tools was my gateway to what we call MLOps today. Primarily, I focused on building tools for reproducible ML development, testing, and monitoring. Monitoring was my favorite because of its complexity and its utility during my on-call shifts.
Reason 1: While at AI2 (allenai.org) I had the opportunity to talk to hundreds of ML teams at enterprises. I called them “complaining sessions” because we’d complain about the challenges of taking ML out of experimentation and operating it in production. One of the questions I always asked was “tell me about a recent production model failure”. Everybody had a story or two (some people had a dozen). There were a lot of patterns across these stories: the inability to figure out what caused a failure, when exactly an issue started, and whether it was caused by a code change or a natural change in the data. Many of the problems pointed to a lack of observability. That was a clear sign of a need in the market.
Reason 2: The complexity of the solution - monitoring is a really hard problem, the kind of problem you can dedicate your career to as an engineer and keep finding new challenges. Monitoring has to be more accurate, scalable, and reliable than your actual production system =) So it’s the kind of problem I am thrilled to dedicate myself to!
So observability is composed of two activities: (1) gathering telemetry, and (2) identifying changes in the telemetry. Our approach to gathering telemetry is unique - we use the whylogs agent that runs in the same environment where the data is flowing (i.e. an inference API or Spark pipeline). It computes all the statistics needed for observability, and then these statistics (we call them profiles) get centralized in our platform. This is different from other solutions, which require you to send all the data that flows through inference to a data store (i.e. a Druid cluster) that centralizes the raw data and then computes telemetry. Our approach gives huge benefits to our customers, as I mentioned in this answer (https://bentoml.slack.com/archives/C03U2PT7UUQ/p1668720563105909)
ML Observability is in its early stages of development today - I think the challenges are ever-evolving. But today’s biggest challenge is how to do observability in a cost-effective way. If you think about it, one of the key features is detecting drift, which means you need to figure out the distribution of the data flowing through inference, which typically means duplicating that data into a data store to calculate the distribution. So observability could require as much storage and compute as your inference feature pipeline… That’s insane! We at WhyLabs came up with a way to capture data distributions (and all key observability statistics) w/o moving data. We built an agent (https://github.com/whylabs/whylogs) that scans data wherever it’s already flowing, which has a few huge benefits:
• Privacy: No data duplication needed
• Accuracy: No sampling; data stats are always based on 100% of the data, so there are no sampling biases. (The agent uses streaming algorithms to calculate the stats, so it handles data very efficiently and never needs to retain the raw data to build them.)
• Cost effectiveness: No need to move, store, or process data in a separate environment.
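The “streaming algorithms” point above is the key to seeing 100% of the data in constant memory. As a toy illustration of the principle (not WhyLabs’ actual implementation), here is Welford’s one-pass algorithm for mean and variance: every record is observed, none are stored. Real profilers add sketch structures for quantiles and cardinality on top of this idea.

```python
class StreamingStats:
    """One-pass (Welford) mean/variance: constant memory, sees every record."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n else 0.0

stats = StreamingStats()
for value in [3.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # memory stays constant regardless of stream length

print(stats.n, stats.mean, stats.variance)  # 4 6.0 5.0
```

Because these summaries are mergeable, profiles computed next to each data stream can be combined centrally without ever moving the raw records.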
Typically the need for a monitoring solution arises in the ML Platform team or in a ML/Data Science team that builds models. These teams are composed of very talented engineers who love building tools. So the first instinct of the team is to build their own monitoring solution, because they believe only the home-built solution can accommodate their unique infrastructure and requirements. There are two challenges that come up once the barebones solution is built:
• Ongoing maintenance: your monitoring system has to be more reliable than your production system. Monitoring has to catch issues in production, so it can’t “fail” along with production, or it’s useless. This is where teams begin to need a dedicated group to support the monitoring solution, which gets expensive very fast.
• Features: Observability needs are evolving fast, driven by both infra changes and types of ML models that the organization is launching. So you find yourself needing to add features continuously to support the growing use cases. Again, that requires the internal team to grow fast.
Some of our community members implement response monitoring using Prometheus as a first step toward model observability. How far do you think this solution will scale before they have to integrate with more complete AI observability tools like whylogs?
Prometheus is a great first step! It is built for monitoring traditional software applications, so it doesn’t support things like:
• Data distributions for continuous features: support is limited and relies on static bucketing, so you have to pick the right bucketing scheme up front
• Data distributions for discrete features: there is no support for tracking discrete distributions
• Performance metrics: things like confusion matrices are not supported, which is a common use case
I would say, once you have demonstrated the value of monitoring through Prometheus, it’s a good time to look into dedicated ML monitoring =)
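For readers taking that first step, here is a minimal sketch of response monitoring with the official Prometheus Python client (`pip install prometheus-client`). The metric name and bucket edges are illustrative; note that the histogram’s buckets are static and must be chosen up front, which is exactly the limitation for continuous data distributions mentioned above.

```python
from prometheus_client import REGISTRY, Histogram

# Hypothetical metric for a model's predicted score; bucket edges are fixed
# at startup and cannot adapt if the score distribution drifts.
PREDICTION_SCORE = Histogram(
    "model_prediction_score",
    "Distribution of model prediction scores",
    buckets=(0.1, 0.25, 0.5, 0.75, 0.9, 1.0),
)

def record_prediction(score: float) -> None:
    # Called on every inference response; Prometheus scrapes the aggregate.
    PREDICTION_SCORE.observe(score)

for s in (0.12, 0.48, 0.93):
    record_prediction(s)

# In a real service you would call prometheus_client.start_http_server(8000)
# so Prometheus can scrape /metrics; here we just read the count back.
print(REGISTRY.get_sample_value("model_prediction_score_count"))  # 3.0
```

This gives solid alerting on latency-style metrics, but as the answer above notes, discrete distributions and model-quality metrics like confusion matrices need a dedicated ML monitoring tool.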
* The discussion was lightly edited for better readability.