Millions of image resizes

Resize the whole image corpus before training on it

In this example we:

  • List 5,000,000 source images from S3.
  • Resize each image into three variants that fit within 256, 512, and 1024 pixels on the longest side.
  • Stream a manifest while workers write outputs back to S3.

A preview folder always looks fine. The full corpus is where the EXIF rotations, corrupt PNGs, CMYK JPEGs, and odd aspect ratios live.

Step 1: Chunk the image keys

The client lists source keys and batches them into 1,000-image chunks.

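import boto3

# Extensions treated as resizable images (compared case-insensitively).
IMAGE_EXTS = (".jpg", ".jpeg", ".png")

keys = []
paginator = boto3.client("s3").get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-photos", Prefix="originals/"):
    # "Contents" is absent on an empty page, so default to an empty list.
    for obj in page.get("Contents", []):
        if obj["Key"].lower().endswith(IMAGE_EXTS):
            keys.append(obj["Key"])

# Split the flat key list into chunks of up to 1,000 keys, one chunk per worker call.
chunks = [keys[i:i + 1000] for i in range(0, len(keys), 1000)]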
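As a rough check on the numbers, the chunking step can be wrapped in a small helper and run on synthetic keys. This is a minimal sketch: `chunk_keys` is a hypothetical name (not part of boto3 or Burla), and the demo keys stand in for a real S3 listing.

```python
def chunk_keys(keys: list[str], chunk_size: int = 1000) -> list[list[str]]:
    """Split keys into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]


# Synthetic keys standing in for a real S3 listing.
demo_keys = [f"originals/img_{i:07d}.jpg" for i in range(2500)]
demo_chunks = chunk_keys(demo_keys)

# 2,500 keys give two full chunks and one 500-key remainder.
chunk_sizes = [len(c) for c in demo_chunks]

# At the scale of this example: 5,000,000 images at 1,000 per chunk.
full_run_chunks = -(-5_000_000 // 1000)  # ceiling division
full_run_outputs = 5_000_000 * 3         # three sizes per image
```

At the scale in this example that works out to 5,000 chunks and 15,000,000 output objects.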

Step 2: Resize inside the worker

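The worker opens each image, fixes EXIF orientation, writes every target size, and returns one report row per image. A failure on one image is caught and recorded rather than failing the whole chunk.

def resize_chunk(image_keys: list[str]) -> list[dict]:
    import io, os, boto3
    from PIL import Image, ImageOps

    s3 = boto3.client("s3")
    out = []
    for key in image_keys:
        try:
            body = s3.get_object(Bucket="my-photos", Key=key)["Body"].read()
            # Apply the EXIF orientation, then normalize CMYK, palette, and alpha modes to RGB.
            img = ImageOps.exif_transpose(Image.open(io.BytesIO(body))).convert("RGB")
            # Note: the stem drops the folder path, so same-named files in different folders collide.
            stem = os.path.splitext(os.path.basename(key))[0]
            for size in [256, 512, 1024]:
                resized = img.copy()
                # thumbnail() keeps the aspect ratio and never upscales.
                resized.thumbnail((size, size), Image.Resampling.LANCZOS)
                buf = io.BytesIO()
                resized.save(buf, format="JPEG", quality=85, optimize=True, progressive=True)
                s3.put_object(Bucket="my-photos-resized", Key=f"resized/{size}/{stem}.jpg", Body=buf.getvalue())
            out.append({"key": key, "orig_w": img.size[0], "orig_h": img.size[1], "ok": True})
        except Exception as exc:
            # One bad image should not fail the whole chunk; record it for a retry.
            out.append({"key": key, "ok": False, "error": repr(exc)})
    return out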
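One thing to watch in the worker: the output name is built from the file's base name only, so `originals/2021/cat.jpg` and `originals/2022/cat.jpg` would write to the same output key. A minimal sketch of one way around that, assuming you want to keep the sub-folder path in the output key. `resized_key` and its prefix defaults are hypothetical, chosen to match the bucket layout used in this example.

```python
import posixpath


def resized_key(source_key: str, size: int,
                source_prefix: str = "originals/",
                dest_prefix: str = "resized/") -> str:
    """Map a source key to an output key that keeps the sub-folder path."""
    # Drop the source prefix if present so it is not repeated in the output.
    if source_key.startswith(source_prefix):
        relative = source_key[len(source_prefix):]
    else:
        relative = source_key
    # Swap the extension for .jpg, since every variant is saved as JPEG.
    stem, _ext = posixpath.splitext(relative)
    return f"{dest_prefix}{size}/{stem}.jpg"


key_2021 = resized_key("originals/2021/cat.jpg", 256)
key_2022 = resized_key("originals/2022/cat.jpg", 256)
```

Two files that differ only by extension in the same folder (`cat.jpg` and `cat.png`) would still map to the same key, so a corpus with both would need the original extension folded into the name as well.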

Step 3: Stream the manifest

Workers write images directly to S3. The client writes the report as chunks finish.

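import json

from burla import remote_parallel_map

# "manifest.jsonl" is an example path; write the manifest wherever you keep run artifacts.
with open("manifest.jsonl", "w") as f:
    # generator=True yields each chunk's result as it finishes instead of waiting for all of them.
    for chunk_result in remote_parallel_map(
        resize_chunk,
        chunks,
        func_cpu=1,
        func_ram=4,
        generator=True,
        grow=True,
    ):
        for row in chunk_result:
            f.write(json.dumps(row) + "\n")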
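Once the manifest exists, it can drive a retry pass. The sketch below assumes the manifest is JSON Lines and that failed images appear as rows with "ok" set to false. `failed_keys` is a hypothetical helper, and the demo rows are synthetic rather than real output.

```python
import json


def failed_keys(manifest_lines) -> list[str]:
    """Return the keys of manifest rows whose "ok" field is not true."""
    failed = []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if not row.get("ok", False):
            failed.append(row["key"])
    return failed


# Synthetic manifest rows; a real run would iterate over the open manifest file.
demo_manifest = [
    '{"key": "originals/a.jpg", "orig_w": 4000, "orig_h": 3000, "ok": true}',
    '{"key": "originals/b.png", "ok": false}',
    '',
    '{"key": "originals/c.jpg", "orig_w": 800, "orig_h": 600, "ok": true}',
]
retry = failed_keys(demo_manifest)

# Re-chunk the failures the same way as the original key list.
retry_chunks = [retry[i:i + 1000] for i in range(0, len(retry), 1000)]
```

The retry chunks have the same shape as the original chunks, so they can be fed through the same worker function.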

What's the point?

The resized images are only half the result. The manifest tells you which files worked, what dimensions they had, and which ones need a retry.

If I were about to train on this dataset, I would want that manifest before training starts. Otherwise the pipeline can silently drop the weird slice of the corpus, and you only find out later that the training data was cleaner than reality.