Can a bought solution help you deliver better models faster than ever?
My team at Etsy recently made a bold decision to overhaul the machine learning systems that power ad ranking and bidding. Ads is a staple product for Etsy, and as engineers, our ability to keep delivering a high-quality product hinges on continually improving its underlying machine learning models.
A machine learning platform comprises the software and infrastructure that support end-to-end machine learning development, including feature engineering, model training, evaluation, and inference. Our company has historically spent much of its machine learning engineering investment building custom tooling for these tasks. Yet as our product matured and our team grew, we found ourselves wanting more out of our platform. We made one goal the centerpiece of our technical strategy in 2020: abandon our internal platform in favor of third-party libraries and tooling. In this article, I will share lessons from our journey in the hope that they will help you in your decision-making when your product has outgrown its platform.
What makes an ideal machine learning platform?
Before taking the big leap to a third-party platform, it is important to reflect carefully on what constitutes the platform of your dreams, which will depend on the stage of your product and the scale of your team. In our case, we had already explored standard model architectures but saw that the ads industry was trending toward sophisticated custom models, such as deep neural networks with advanced embedding features. We were excited to explore these ideas, but we realized that doing so required a highly flexible, scalable platform to build on. Here are some of the factors we considered that may help you decide whether a third-party platform is right for you.
Model architecture flexibility. Do you need the ability to quickly benchmark a wide variety of architectures? In our case, we had already explored linear regression and tree-based models in the early stages of the product but wanted to add custom neural network architectures to our repertoire. For comparability and speed of development, you may want a single library and set of tooling that will let you prototype any model you can think of.
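To make that concrete: in a library like Keras (part of TensorFlow, which we ultimately adopted), switching between a linear baseline and a small neural network can sit behind a single flag while the training loop stays the same. The sketch below is illustrative only; the feature width and layer sizes are made up.

```python
import tensorflow as tf

def build_model(architecture: str = "linear") -> tf.keras.Model:
    """Build either a linear baseline or a small DNN on the same inputs."""
    inputs = tf.keras.Input(shape=(32,), name="features")  # hypothetical feature width
    if architecture == "linear":
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(inputs)
    else:
        # A deeper architecture, without touching the rest of the pipeline.
        hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
        hidden = tf.keras.layers.Dense(32, activation="relu")(hidden)
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# The same training and evaluation code serves every candidate architecture.
for arch in ("linear", "dnn"):
    model = build_model(arch)
    # model.fit(train_ds, validation_data=eval_ds)  # datasets omitted in this sketch
```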
Closeness of featurization and training code. If you wish to consider new features, there has to be low friction between developing a new feature or transformation and trying it out in a model. For us, it was important that our model featurization and training code be in the same language and easily compatible.
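For instance, TensorFlow's Keras preprocessing layers let feature transformations live alongside the model itself, so featurization and training code share one language and one graph. The feature names, vocabulary, and statistics below are placeholders, not our production features.

```python
import tensorflow as tf

# Hypothetical raw inputs: one numeric and one categorical ad feature.
price = tf.keras.Input(shape=(1,), name="price")
category = tf.keras.Input(shape=(1,), dtype=tf.string, name="category")

# Featurization expressed in the same library, and the same graph, as the model.
# In practice the normalization statistics come from adapt() on training data.
normalize = tf.keras.layers.Normalization(mean=25.0, variance=100.0)
lookup = tf.keras.layers.StringLookup(vocabulary=["home", "jewelry", "art"])  # strings -> ids
embed = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=4)

category_vector = tf.keras.layers.Flatten()(embed(lookup(category)))
features = tf.keras.layers.Concatenate()([normalize(price), category_vector])
output = tf.keras.layers.Dense(1, activation="sigmoid")(features)

model = tf.keras.Model(inputs=[price, category], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```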
Distributed model training. In order to get a fast read on the accuracy of new models when the scale of your training data is large, you need distributed feature processing and model training to speed up the training and evaluation feedback loop.
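As a minimal sketch of what this looks like with TensorFlow's tf.distribute API, assuming a single machine with multiple GPUs (a multi-machine cluster would use MultiWorkerMirroredStrategy and additional setup):

```python
import tensorflow as tf

# Data-parallel training across the GPUs on one machine. Variables must be
# created inside the strategy's scope so they are mirrored across replicas.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# model.fit(train_ds, epochs=3)  # tf.data input pipeline omitted in this sketch
```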
Low onboarding time for new team members. If your company is in a stage of growth, you'll want new engineers joining your team to be able to rely on tooling they are already familiar with, so that they can get models to production quickly. In our hiring, we saw that candidates frequently had experience with standard libraries like TensorFlow and PyTorch.
Mechanism for experiment tracking. As your team scales, you may want to leverage your people resources by having multiple engineers test out different ideas on the same model. It may be helpful to have a centralized place to coordinate canonical datasets, standardize evaluation metrics, and compare model experiment results.
Automatic experiment orchestration. Trying out a single modeling idea offline requires multiple steps: sampling data, performing feature transformations, training, and evaluation. Automatic orchestration ensures engineers are able to spend more time modeling and less time manually coordinating each of these steps.
With the lengthy list of features on our wishlist, the idea of adapting or overhauling our existing platform was daunting. We decided that the best path forward for our product was to adopt a pre-existing third-party platform that already addressed most of our needs. Given Etsy's commitment to the Google Cloud Platform, we chose the TensorFlow and TensorFlow Extended (TFX) libraries running on the fully managed services Dataflow and AI Platform. To find the best path forward for your organization, consider whether there are any discounts you can leverage with existing third-party partners like Google or AWS, and whether there's a platform your engineers are already familiar with.
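To give a flavor of that stack, here is a minimal pipeline sketch built from TFX's standard components. The paths, step counts, and trainer module are placeholders, and a real pipeline would add transform, evaluation, and pusher components.

```python
from tfx import v1 as tfx

def create_pipeline(pipeline_name: str, pipeline_root: str,
                    data_root: str, module_file: str) -> tfx.dsl.Pipeline:
    # Ingest training data (CSV here for simplicity).
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Train a TensorFlow model defined in a user-provided module file.
    trainer = tfx.components.Trainer(
        module_file=module_file,
        examples=example_gen.outputs["examples"],
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100),
    )

    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, trainer],
    )

# Run locally with the LocalDagRunner; the same pipeline definition can be
# handed to Beam (backed by Dataflow) or to AI Platform for managed execution.
# tfx.orchestration.LocalDagRunner().run(
#     create_pipeline("ads-ranking", "/tmp/pipeline_root", "/tmp/data", "trainer_module.py")
# )
```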
Making our dream platform a reality
While our team felt confident in adopting TensorFlow, we had a number of hurdles to address before we could begin building on it. If you're a product team like ours rather than a platform team, you may have more leeway to make a bold decision, since its immediate impact is likely to be limited to your own product. However, if your product has high visibility, you must be especially diligent about identifying and mitigating risks proactively.
The risks of adopting a third-party machine learning platform
As the tech lead of this ambitious project, it was my responsibility to be transparent about its risks. Beyond leveraging the expertise of engineers on our team, I cast a wide net internally by sharing our proposal with other teams in our domain. We also reached out to external teams who had undergone a similar journey, one of which was particularly generous in calling out unexpected gotchas and calibrating our expectations. Through this process, some key areas of risk came to light:
- Model accuracy. The accuracy of models built on the new platform may not immediately match or exceed that of those on the internal platform, delaying the impact of the project on product KPIs.
- Model serving performance. An external platform may not be exactly suited to the scale and latency requirements of the product, limiting the business and customer impact of even the most accurate models.
- Cost. The resource cost of fully managed services for model featurization and training could wipe out the gains from the modeling improvements they yield.
Risk mitigation strategies
Given these risks, my team and I devised a project plan with the following strategies top of mind.
Aim for neutral model accuracy initially and set expectations accordingly. Changing your machine learning platform means changing both the key infrastructure and the algorithms that power your production system at the same time. If you also add lots of new features and complicate your model architecture in pursuit of huge modeling gains, the path to production becomes exceedingly difficult, and the impact of each of the many factors that changed can be impossible to untangle. We limited our feature set to what was already available in production, aimed for neutral model performance, and communicated that to stakeholders from the start.
Build end-to-end as quickly as possible. To retire the risks around serving performance and cost early, it's necessary to timebox (or better yet, parallelize) model development. Once you have a reasonably good trained model, wire it up to your system, load test for serving feasibility, and carefully measure the cost of feature generation, training, evaluation, and serving.
Identify pivot opportunities proactively. To justify the project to your stakeholders in light of the potential setbacks, identify ways you could extract value from it even if your worst fears around model accuracy and serving performance are realized. For example, if serving performance is shaky, you could first migrate the models whose predictions are generated offline in batch, and thereby reap all of the benefits of the new platform, just on a smaller set of models.
Be willing to accept compromises and transition incrementally. Even if getting to your dream platform requires a total overhaul, it would be time-consuming and risky to change your libraries, training pipelines, model architecture, serving application, and orchestration system all at once. You can’t tackle everything, so take a hard look at the highest priority improvements on your wishlist and where the biggest risks lie. In our case, we cared most about model architecture flexibility and iteration speed, so we first replaced our modeling library and training pipelines and followed up later with featurization and serving.
How it’s going
Influencing platform direction as a product team
After securing buy-in from leadership, we set out on the first phase of our plan to replace our modeling library and training pipelines with TensorFlow and Google Cloud services, leaving our feature set and serving application intact. Building end-to-end quickly paid off: in just a few months, we eliminated our biggest concerns around serving performance and cloud cost. In fact, our first online experiment with the new platform delivered both KPI and performance wins. Even better, our model iteration velocity skyrocketed, as we were able to explore the impact of a wide range of model hyperparameters with a single command.
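As a purely illustrative sketch of what "a single command" can amount to, a small driver can fan out many configurations of the same training pipeline. The helper launch_training_pipeline below is a hypothetical stand-in, not our actual tooling, and the hyperparameter values are made up.

```python
import itertools

learning_rates = [1e-4, 1e-3, 1e-2]
hidden_units = [[64], [128, 64]]

for lr, units in itertools.product(learning_rates, hidden_units):
    run_name = f"ads-ranking-lr{lr}-u{'x'.join(map(str, units))}"
    # launch_training_pipeline is a hypothetical helper that would submit a
    # training run (e.g. a TFX pipeline execution) with these hyperparameters.
    # launch_training_pipeline(run_name, learning_rate=lr, hidden_units=units)
    print("would launch:", run_name)
```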
The success of the project helped influence our company’s machine learning platform vision. As other teams learned about the new types of models we explored and how our biggest concerns proved to be non-issues, they began to share our vision of a third-party platform. The growing excitement for TensorFlow has encouraged our machine learning platform team to consider making TensorFlow/TFX a first-class supported platform going forward.
Don’t underestimate migration fatigue
The six months since our initial experiment win haven't been all sunshine. The machine learning world moves quickly, and one thing we hadn't planned on was that, just a few months after we built our first models on TensorFlow, the version recommended for high-scale production use would jump by a major version and several minor versions. As we explored new model architectures, we encountered library and platform bugs that were only addressed in later releases, some of which brought subtle changes to model learning algorithms or evaluation metrics that drastically increased the scope of upgrading. The compromises we had made by upgrading our platform incrementally, combined with the unplanned library upgrades, left the team feeling like we were constantly in migration mode.
Conclusion
Moving from an in-house to a third-party machine learning platform carries substantial risks and can incur migration fatigue, but it can also let your modeling capabilities and iteration speed skyrocket, as ours did. Before taking such a big leap, carefully consider whether a third-party platform is appropriate for your product's maturity and your team's size, and how you plan to mitigate the risks around serving feasibility and cost. Making such a decision a reality requires significant investment and organizational support, but with it, you may be able to deliver better models faster than ever.