Published: Thu 11 May 2023

A few caveats with online inference

Introduction

Online inference, i.e. serving model predictions on demand behind a request/response API, is a common way to deploy machine learning models in production. However, there are several caveats to keep in mind, especially when serving with FastAPI/gunicorn together with libraries such as NumPy and PyTorch. This article highlights a few of them.

Model Loading

When you serve the model with multiple gunicorn workers and load it into memory in an on_event("startup") hook, each worker ends up loading its own copy of the model, which wastes memory. To avoid this, load the model at module import time (globally within the app) instead of in an on_event hook, and pass the --preload flag to gunicorn: the model is then loaded once in the master process before the workers are forked, and the workers share those memory pages via copy-on-write.

This optimization is particularly beneficial for large models, especially when multiple workers are used and the service is I/O bound, i.e. it fetches the features as well as running the compute.

# app.py
from fastapi import FastAPI
from model import load_my_model  # your own model-loading helper

app = FastAPI()
# Loaded at import time so that gunicorn's --preload flag loads the model
# once in the master process and the forked workers share it.
model = load_my_model()

@app.get("/predict")
def predict():
    return model.predict()

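With the model loaded at import time as above, the matching gunicorn invocation is roughly the following; the module path, worker count, and worker class are illustrative and should be adjusted to your setup:

# assumes the FastAPI app object lives in app.py as shown above
gunicorn app:app --workers 2 --worker-class uvicorn.workers.UvicornWorker --preload
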
Multiple gunicorn Workers: A Bane?

Consider whether you really need multiple workers for your model. If your service is CPU-bound, meaning it only runs computations on features passed in via a proxy app and makes no additional API calls, multiple workers can result in CPU contention with libraries like NumPy and PyTorch.

What Happens Underneath When Using NumPy/PyTorch

Libraries such as NumPy and PyTorch use OpenMP to parallelize computations by spawning multiple threads. By default, the number of threads is determined by the number of physical cores on the host machine, not by the number of cores reserved for your container. This can lead to problems, especially if you have reserved fewer CPU cores than the total number of threads spawned by the workers. The issue gets worse if multiple containers, running different or identical models, are deployed on the same host.
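
One quick way to see this from inside the container is to print the defaults at import time; a minimal sketch (os.cpu_count() reports the CPUs of the host visible to the process, and torch.get_num_threads() the intra-op thread count PyTorch picked):

# inspect_threads.py -- run inside the container to see the defaults
import os
import torch

print("CPUs visible to the process:", os.cpu_count())        # host cores, not the container reservation
print("PyTorch intra-op threads:", torch.get_num_threads())  # derived from the host's core count by default

For the BLAS/OpenMP pools that NumPy relies on, the third-party threadpoolctl package exposes a similar threadpool_info() helper.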

Let's look at a concrete example: the host machine has 8 physical cores, you have reserved a single CPU for your container, and you are running 3 gunicorn workers.

  • each worker will spawn 8 threads (one per physical core), so you end up with 24 threads against a reservation of 1 CPU, which you can already see is problematic, and
  • this tends to get worse when multiple containers, each running the same or different models, are deployed on the same host.

What should you do?

To avoid CPU contention and ensure stable latencies, consider the following recommendations:

  • Use a single worker.
  • Start with reserving a single core and set OMP_NUM_THREADS to 1. This configuration ensures that NumPy and PyTorch only use a single thread for computation.
  • Benchmark the latency using the above configuration and assess its acceptability.
  • Increase the core reservation and OMP_NUM_THREADS if the observed latency is not acceptable.

In practice, you can experiment with higher values of OMP_NUM_THREADS without increasing the core reservation, which can improve utilization. However, this approach may still lead to CPU contention, so it should only be pursued if latency is not a major concern or if you are constrained in the number of cores you can reserve for your container.
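
As a concrete starting point, here is a minimal sketch of pinning the libraries to a single thread. The environment variable can just as well be set in your Dockerfile or gunicorn config, as long as it is in place before NumPy/PyTorch are imported; depending on your BLAS build you may also need OPENBLAS_NUM_THREADS or MKL_NUM_THREADS:

# single_thread.py -- cap the OpenMP/PyTorch thread pools at one thread
import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before numpy/torch are imported

import numpy as np
import torch

torch.set_num_threads(1)             # explicit cap for PyTorch intra-op parallelism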
