The Port Was Wrong: A Docker Migration Story

April 30, 2026

$ curl -k --noproxy '*' https://[2001:db8::1]:30076/health
{"status":"healthy","model_loaded":true,"version":"0.1.0"}

That response took three days and seven deploys to produce.

I'm an engineer at Meta working on a video understanding pipeline. One of the services, a ResNet-based camera motion classifier that categorizes video clips into 16 motion types like "pan left" or "zoom in", runs on an internal ML serving framework. The framework works, but it's tightly coupled to internal infrastructure, making it hard to iterate on. My task: migrate it to a standalone Docker container with a FastAPI server, deploy it to GPU hardware, and verify it end-to-end.

The model code was straightforward. The FastAPI server was maybe 130 lines. The Dockerfile was 47 lines. Everything between "it works locally" and "it works in production" had a surprise waiting.


The Registry That Didn't Exist

Step one: build the image, push it to a container registry, point the deployment config at it. Simple.

I built the image on my laptop. It ran fine. I pushed it to the company's internal registry. Then I tried to build on a cloud development server — the kind of beefy remote machine you SSH into for heavy work. The registry wasn't reachable. DNS couldn't resolve it. I added a network proxy. Still nothing. The internal registry was only accessible from certain network segments, and my dev server wasn't on one of them.

Fine. I'd pull the base NVIDIA CUDA image from Docker Hub instead and push my built image to AWS ECR, which the deployment platform could reach.

But ECR has its own authentication dance. The company's AWS credential CLI wasn't installed on my dev server. I installed it. Then podman, the container runtime, wasn't installed either. I installed that. Then the ECR login command, copied from a teammate's instructions, was a multi-line shell pipe with trailing whitespace after one of the backslashes. The backslash escaped the space instead of the newline, so the line continuation never happened: the command split into two fragments, both failing silently. I collapsed it to one line. Login succeeded.
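The shape of the bug, with the standard AWS CLI standing in for the internal credential tool (the account ID and region are invented):

# Broken: an invisible trailing space after the first backslash. The
# backslash escapes the space rather than the newline, so the pipeline
# splits at the line break into two fragments.
aws ecr get-login-password --region us-east-1 \ 
    | podman login --username AWS --password-stdin \
    123456789012.dkr.ecr.us-east-1.amazonaws.com

# Fixed: collapsed to one line, nothing left for the shell to misread.
aws ecr get-login-password --region us-east-1 | podman login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com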

Total time from "push to registry" to actually pushing to a registry: about two hours and four tool installations. Production infrastructure is never one tool. It's a chain of tools, each with its own auth model, network assumptions, and failure modes. The skill isn't knowing Docker. It's knowing how to methodically unblock yourself when the toolchain fights back.


The Port Was Wrong

The deployment orchestrator spun up my container on a machine with an H100 GPU. I watched the logs scroll by. CUDA available: true. GPU count: 1. Model checkpoint downloaded. Model loaded in 0.84 seconds. Application startup complete.

The health check timed out.

The service was running. The model was loaded. But the platform couldn't reach it. I pulled up the full container logs and started comparing what I expected with what actually happened.

The deployment platform assigns a dynamic HTTPS port (say, 30751) and passes it to the container as a command-line argument: --bind=[::]:30751. My entrypoint script ignored command-line arguments entirely. It constructed its own startup command using an environment variable, which defaulted to port 8000. Gunicorn happily bound to 8000. The platform's health check hit 30751. Nobody was listening.
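The offending part of my entrypoint amounted to something like this (a simplified sketch; the module path and the SERVICE_PORT variable are invented for illustration):

#!/bin/sh
# The platform invokes the container as: entrypoint.sh --bind=[::]:30751
# But "$@" is never read. The port comes from an env var nobody sets.
PORT="${SERVICE_PORT:-8000}"
exec gunicorn app.main:app \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind "[::]:${PORT}"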

While staring at the logs, I noticed a second problem: "Starting without TLS (HTTP mode)." The platform expected HTTPS. It had mounted a TLS certificate at a well-known filesystem path inside the container. But my server only enabled TLS if specific environment variables were set — TLS_CERTFILE and TLS_KEYFILE. The platform didn't set those variables. It just mounted the file and expected the application to find it.

Two fixes. First, parse --bind from the arguments the container receives at startup. Second, instead of waiting for environment variables to enable TLS, check whether the certificate file exists at the expected path. If it's there, use it.
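Sketched as a repaired entrypoint (the module path, variable names, and certificate path are illustrative, not the platform's actual contract):

#!/bin/sh
# Fix 1: honor the --bind argument the platform passes at startup,
# keeping the old env var as a fallback for local runs.
BIND="[::]:${SERVICE_PORT:-8000}"
for arg in "$@"; do
    case "$arg" in
        --bind=*) BIND="${arg#--bind=}" ;;
    esac
done

# Fix 2: enable TLS by probing for the mounted certificate,
# not by waiting for env vars the platform never sets.
CERT="/mnt/certs/service.pem"   # illustrative well-known path, cert+key bundle
TLS_ARGS=""
if [ -f "$CERT" ]; then
    TLS_ARGS="--certfile=$CERT --keyfile=$CERT"
fi

exec gunicorn app.main:app \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind "$BIND" \
    $TLS_ARGS   # intentionally unquoted: expands to nothing when TLS is off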

Rebuild. Push. Redeploy. This time: TLS enabled, correct port, health check passes.

The lesson here stayed with me. When your service runs fine locally but fails in production, the answer is almost always in the contract between your code and the platform: the assumptions about how configuration gets passed. I assumed environment variables. The platform assumed command-line arguments and file mounts. Neither assumption was documented anywhere. The logs told me everything, but only after I knew what to look for.


The Proxy That Blocked Everything

The third deploy succeeded. The logs confirmed the model loaded. I wanted to verify with a quick curl from my dev server.

$ curl -k https://[the-service-address]:30076/health
Received HTTP code 403 from proxy after CONNECT

My dev server routes outbound HTTPS traffic through a corporate forward proxy. The proxy wouldn't relay connections to internal service addresses. The fix was adding --noproxy '*' to bypass it. Two words. Maybe four seconds to type.
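Curl picks the proxy up from the https_proxy / HTTPS_PROXY environment variables, so either of these gets a one-off request past it (same placeholder address as above):

$ curl -k --noproxy '*' https://[the-service-address]:30076/health
$ env -u https_proxy -u HTTPS_PROXY curl -k https://[the-service-address]:30076/health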

But I didn't know that immediately. My first instinct was that TLS was misconfigured — that was the last thing that burned me. Then I wondered if the port was wrong again. I was halfway through re-reading the deployment logs before I noticed the error said "from proxy," not "connection refused" or "certificate invalid."

When something fails, your brain jumps to the last thing that broke. The port was wrong last time, so the port must be wrong this time. The hardest part of debugging isn't finding the answer. It's not anchoring on the previous answer.


The Response

{"status":"healthy","model_loaded":true,"version":"0.1.0"}

One ResNet model. Forty-four megabytes. Loads in under a second. The entire migration touched maybe fifteen files. And yet it took three days of navigating registry access, shell quoting, port binding contracts, TLS detection, and proxy configuration.

The interesting engineering wasn't the model or the FastAPI server; anyone can wrap a PyTorch model in an HTTP endpoint. It was the integration surface: the dozens of implicit contracts between my code and the deployment platform, each one undocumented, each one a potential deploy failure. This is the work that doesn't show up in architecture diagrams but determines whether your service actually runs in production.

This was the first of several services to migrate. The second one took four hours.