Use custom Docker images and GPUs
Run Burla workers with custom images, native tools, and GPUs.
Use this when your worker needs CUDA, native binaries, pinned system packages, large model weights, or a private runtime.
Do not build an image for a small pure-Python job unless package install time is already the problem.
The unit of work stays the same: one file, batch, tile, sample, or shard per input.
Each worker runs your function inside the image you pass to remote_parallel_map.
The output should be small metadata or a path to files written by the worker.
An image changes the worker environment. It should not change the shape of the job.
When to use an image
Use a custom image when the worker needs:
- native tools such as bwa, samtools, gdal, ffmpeg, or OCR libraries
- CUDA libraries for PyTorch, TensorFlow, CLIP, YOLO, or embedding models
- large model weights that should not download on every worker startup
- system packages that pip install cannot provide
- a pinned Python environment that must match production
For ordinary Python packages, start without a custom image. Burla can install many Python dependencies at runtime.
Build the smallest image that contains the slow parts
For native command-line tools, start from an image that already has the system packages you need.
FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        bwa \
        samtools \
    && rm -rf /var/lib/apt/lists/*
RUN pip install boto3 awscli
Build and push it to a registry your Burla workers can pull from.
docker build -t us-docker.pkg.dev/my-project/burla/bio-worker:latest .
docker push us-docker.pkg.dev/my-project/burla/bio-worker:latest
Run native tools from a worker
Plan one sample per input.
with open("manifest.tsv") as f:
    samples = [line.strip().split("\t") for line in f if line.strip()]
sample_jobs = [
    {"sample_id": sample_id, "fq1": fq1, "fq2": fq2}
    for sample_id, fq1, fq2 in samples
]
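A malformed manifest row would otherwise surface as an unpacking error on the client, or as a failed download on a worker. The sketch below shows one way to validate the manifest before any work is submitted. The helper name load_sample_jobs and the example paths are hypothetical, and it assumes the same three-column, tab-separated layout used above.

```python
import csv
import io


def load_sample_jobs(manifest_file):
    """Parse a tab-separated manifest into one job dict per sample.

    Hypothetical helper: assumes columns are sample_id, fq1, fq2.
    """
    jobs = []
    seen_ids = set()
    reader = csv.reader(manifest_file, delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        # Skip blank or whitespace-only lines.
        if not any(cell.strip() for cell in row):
            continue
        if len(row) != 3:
            raise ValueError(
                f"manifest line {line_no}: expected 3 columns, got {len(row)}"
            )
        sample_id, fq1, fq2 = (cell.strip() for cell in row)
        # Duplicate IDs would overwrite each other's output files.
        if sample_id in seen_ids:
            raise ValueError(f"manifest line {line_no}: duplicate sample_id {sample_id!r}")
        seen_ids.add(sample_id)
        jobs.append({"sample_id": sample_id, "fq1": fq1, "fq2": fq2})
    return jobs


# Usage with an in-memory manifest; the bucket path is a placeholder.
example = io.StringIO("S1\ts3://example-bucket/S1_R1.fastq.gz\ts3://example-bucket/S1_R2.fastq.gz\n\n")
sample_jobs_checked = load_sample_jobs(example)
```

Failing on the client is cheaper than discovering a bad row after workers have started.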
The worker can call the tools directly.
def align_sample(job):
    import os
    import subprocess
    import time
    sample_id = job["sample_id"]
    work_dir = f"/tmp/{sample_id}"
    os.makedirs(work_dir, exist_ok=True)
    start = time.time()
    subprocess.run(f"aws s3 cp {job['fq1']} {work_dir}/R1.fastq.gz", shell=True, check=True)
    subprocess.run(f"aws s3 cp {job['fq2']} {work_dir}/R2.fastq.gz", shell=True, check=True)
    subprocess.run(
        f"bwa mem -t 4 /refs/hg38.fa {work_dir}/R1.fastq.gz {work_dir}/R2.fastq.gz "
        f"| samtools sort -@ 4 -o {work_dir}/{sample_id}.bam -",
        shell=True,
        check=True,
        executable="/bin/bash",
    )
    subprocess.run(f"aws s3 cp {work_dir}/{sample_id}.bam s3://my-bam-bucket/bams/{sample_id}.bam", shell=True, check=True)
    return {"sample_id": sample_id, "elapsed_s": round(time.time() - start, 1)}
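The worker above interpolates sample IDs and paths straight into shell strings, which breaks if a value contains spaces or shell metacharacters. As a minimal sketch, assuming you want to keep the shell pipeline, the command can be built with shlex.quote from the standard library. The helper name build_align_command is hypothetical. The reference path and thread count are the ones used above.

```python
import shlex


def build_align_command(work_dir, sample_id, reference="/refs/hg38.fa", threads=4):
    """Build the bwa | samtools pipeline with every path shell-quoted.

    Hypothetical helper; it only builds the string and does not run it.
    """
    r1 = shlex.quote(f"{work_dir}/R1.fastq.gz")
    r2 = shlex.quote(f"{work_dir}/R2.fastq.gz")
    bam = shlex.quote(f"{work_dir}/{sample_id}.bam")
    ref = shlex.quote(reference)
    threads = int(threads)  # reject non-numeric thread counts early
    return (
        f"bwa mem -t {threads} {ref} {r1} {r2} "
        f"| samtools sort -@ {threads} -o {bam} -"
    )


command = build_align_command("/tmp/S1", "S1")
print(command)
```

The resulting string can be passed to subprocess.run with shell=True and executable="/bin/bash" exactly as in the worker above. For the plain aws s3 cp calls, which have no pipe, passing a list of arguments without shell=True avoids quoting altogether.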
Run the job with the image and the resources one sample needs.
from burla import remote_parallel_map
IMAGE = "us-docker.pkg.dev/my-project/burla/bio-worker:latest"
reports = remote_parallel_map(
    align_sample,
    sample_jobs,
    image=IMAGE,
    func_cpu=4,
    func_ram=16,
    grow=True,
)
The output is a report. The BAM files are written to object storage by the worker.
Use a CUDA image for GPU work
For GPU jobs, start from a CUDA runtime image or a framework image with CUDA already installed.
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
RUN pip install sentence-transformers pyarrow numpy
RUN python - <<'PY'
from sentence_transformers import SentenceTransformer
SentenceTransformer("BAAI/bge-large-en-v1.5")
PY
Baking model weights into the image makes the image larger and slower to build, but workers no longer download the model at startup. That is usually the right trade when many GPU workers load the same model.
Keep heavy imports inside the worker
Plan text shards or document batches on the client.
from pathlib import Path
shard_paths = [str(path) for path in Path("/workspace/shared/texts").glob("*.jsonl")]
Load the model inside the worker. Cache it on the worker process so later inputs on the same worker do not reload it.
def embed_shard(shard_path):
    import json
    from pathlib import Path
    import numpy as np
    from sentence_transformers import SentenceTransformer
    if not hasattr(embed_shard, "_model"):
        embed_shard._model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
    rows = [json.loads(line) for line in Path(shard_path).read_text().splitlines()]
    texts = [row["text"] for row in rows]
    vectors = embed_shard._model.encode(texts, batch_size=64, normalize_embeddings=True).astype("float32")
    output_path = Path("/workspace/shared/embeddings") / f"{Path(shard_path).stem}.npy"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(output_path, vectors)
    return {"shard": shard_path, "rows": len(rows), "output_path": str(output_path)}
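The caching pattern used in embed_shard can be seen in isolation with a stand-in loader. This is only an illustration of the function-attribute cache: load_model and process_shard are hypothetical names, and the fake model replaces the real SentenceTransformer so the sketch runs without a GPU. It shows reuse within a single process. Reuse across inputs on a worker depends on the worker keeping the same function object alive.

```python
def load_model():
    """Stand-in for an expensive load such as SentenceTransformer(...)."""
    # Count how many times the expensive load actually happens.
    load_model.calls = getattr(load_model, "calls", 0) + 1
    # The fake model maps each text to its length.
    return lambda texts: [len(text) for text in texts]


def process_shard(texts):
    # Load once per process, then reuse the cached model on later calls.
    if not hasattr(process_shard, "_model"):
        process_shard._model = load_model()
    return process_shard._model(texts)


process_shard(["a", "bb"])
process_shard(["ccc"])
print(load_model.calls)  # 1: the second call reused the cached model
```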
Ask for a GPU and cap parallelism to your GPU quota.
GPU_IMAGE = "us-docker.pkg.dev/my-project/burla/embedder:latest"
embedding_reports = remote_parallel_map(
    embed_shard,
    shard_paths,
    image=GPU_IMAGE,
    func_gpu="A100",
    func_cpu=4,
    func_ram=32,
    max_parallelism=8,
    grow=True,
)
Then reduce paths, not arrays.
embedding_paths = [row["output_path"] for row in embedding_reports]
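Because each report is small metadata, the client can check the whole run without loading a single vector. The following is a rough sketch of such a reduction. The helper name summarize_reports is hypothetical, and it assumes each report has the shard, rows, and output_path keys returned by embed_shard above.

```python
def summarize_reports(reports):
    """Reduce per-shard reports to run-level metadata (hypothetical helper)."""
    reports = list(reports)  # accept a list or any other iterable of reports
    return {
        "shards": len(reports),
        "total_rows": sum(report["rows"] for report in reports),
        # Shards that produced no rows are worth a second look.
        "empty_shards": [r["shard"] for r in reports if r["rows"] == 0],
        "paths": sorted(report["output_path"] for report in reports),
    }


# Usage with made-up reports shaped like the worker's return value.
example_reports = [
    {"shard": "b.jsonl", "rows": 0, "output_path": "/workspace/shared/embeddings/b.npy"},
    {"shard": "a.jsonl", "rows": 2, "output_path": "/workspace/shared/embeddings/a.npy"},
]
summary = summarize_reports(example_reports)
```

Downstream steps can then open the listed files one at a time instead of holding every array in client memory.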
Match Python versions
The Python version in your client and the image should match.
If your image runs Python 3.12, run your local script with Python 3.12. Version drift can look like a Burla or Docker problem when it is really a serialization problem.
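One way to catch version drift before submitting work is a small guard at the top of the client script. This is a sketch only: check_python_version is a hypothetical helper, and the expected version is whatever your image actually runs.

```python
import sys


def check_python_version(expected, actual=None):
    """Raise if the running interpreter's (major, minor) differs from the image's.

    Hypothetical helper; `actual` exists only to make the check easy to test.
    """
    actual = tuple(sys.version_info[:2]) if actual is None else tuple(actual)
    expected = tuple(expected)
    if actual != expected:
        raise RuntimeError(
            f"client runs Python {actual[0]}.{actual[1]}, "
            f"but the image expects {expected[0]}.{expected[1]}"
        )
    return actual


# In a client script targeting a Python 3.12 image:
# check_python_version((3, 12))
```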
Keep credentials out of the image
Do not bake API keys, database passwords, or cloud credentials into an image.
Use runtime environment variables, workload identity, service accounts, or the cloud permissions already attached to the worker.
The image should contain code dependencies. Runtime credentials should stay runtime credentials.
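As a sketch of what reading a credential at runtime can look like inside a worker function: the variable name EXAMPLE_API_TOKEN and both function names are hypothetical. How the variable reaches the worker environment depends on your deployment and is not shown here.

```python
import os


def get_required_env(name):
    """Return an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set in the worker environment")
    return value


def call_api(item):
    # Read the secret when the task runs; it is never written into the image.
    token = get_required_env("EXAMPLE_API_TOKEN")  # hypothetical variable name
    # ... use `token` to authenticate a request here ...
    # Return metadata only; never return or log the secret itself.
    return {"item": item, "token_present": bool(token)}
```

Failing fast with the variable's name makes a missing credential easy to tell apart from an image or network problem.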
Choose resources from the worker
Set resources from what one worker does:
- func_cpu: threads used by one task
- func_ram: peak memory for one task
- func_gpu: GPU type needed by one task
- max_parallelism: quota or external bottleneck
- image: environment needed by one task
Do not ask for a GPU because the whole pipeline has a GPU step. Ask for a GPU only on the call whose worker uses CUDA.
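One way to keep resources tied to individual steps is to store them per step and unpack them into each call. The sketch below reuses the argument names and values from the two examples on this page. The STEP_RESOURCES table and map_kwargs helper are hypothetical conveniences, not part of Burla.

```python
# Per-step resources, taken from the alignment and embedding examples above.
STEP_RESOURCES = {
    "align": {"func_cpu": 4, "func_ram": 16},
    # Only the step whose worker uses CUDA asks for a GPU.
    "embed": {"func_cpu": 4, "func_ram": 32, "func_gpu": "A100", "max_parallelism": 8},
}


def map_kwargs(step, image):
    """Return keyword arguments for one step's call (hypothetical helper)."""
    kwargs = dict(STEP_RESOURCES[step])  # copy so the table is never mutated
    kwargs["image"] = image
    return kwargs


# Intended use, assuming the functions and images defined earlier:
# remote_parallel_map(align_sample, sample_jobs, **map_kwargs("align", IMAGE))
# remote_parallel_map(embed_shard, shard_paths, **map_kwargs("embed", GPU_IMAGE))
align_kwargs = map_kwargs("align", "example-image:latest")
```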