How To

Choose how to split your work

Pick the input unit for a Burla job.

Choose how to split your work

Use this when you know what code should run, but not what to pass as the input list. Do not use this to tune a job whose input unit is already obvious. The unit of work is one item from the list you pass to remote_parallel_map. Each worker should own enough work to be worth starting, but not so much that one failure wastes an hour. The output should be small, or a path to a file written by the worker.

The main decision in a Burla job is not the cluster size. It is the shape of the input list.

The rule

Pick an input unit that is:

independent from the other inputs
large enough to amortize startup and setup
small enough to fit in worker memory
cheap enough to retry
aligned with the output you need

If the input unit is wrong, more machines usually make the wrong thing happen faster.

Common split patterns

Use one file per input when files are already the natural boundary.

from pathlib import Path

file_paths = [str(path) for path in Path("/workspace/shared/raw").glob("*.parquet")]

Use several files per input when each file is tiny and startup would dominate.

def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


file_groups = chunks(file_paths, 100)

Use row ranges when a database table has an indexed numeric column.

def build_id_ranges(start_id, end_id, rows_per_range):
    return [
        (range_start, min(range_start + rows_per_range - 1, end_id))
        for range_start in range(start_id, end_id + 1, rows_per_range)
    ]


id_ranges = build_id_ranges(1, 10_000_000, 50_000)

Use byte ranges when one line-oriented file is too large to read on one machine.

def build_byte_ranges(total_bytes, bytes_per_range):
    return [
        (start, min(start + bytes_per_range, total_bytes))
        for start in range(0, total_bytes, bytes_per_range)
    ]


ranges = build_byte_ranges(total_bytes=275_000_000_000, bytes_per_range=500_000_000)

Use chunks of URLs, IDs, or prompts when the bottleneck is an API, website, database, or model provider.

with open("user_ids.txt") as f:
    user_ids = [line.strip() for line in f if line.strip()]

id_chunks = chunks(user_ids, 1000)

Use one tile, scene, sample, or shard when the source system already has useful boundaries.

with open("sentinel_tiles.txt") as f:
    tile_ids = [line.strip() for line in f if line.strip()]

Write the worker around one input

The worker should make one input useful. Keep source reads, local work, and output writes together.

def scan_one_parquet_file(path):
    import pyarrow.parquet as pq

    table = pq.read_table(path)
    return {
        "path": path,
        "rows": table.num_rows,
        "columns": len(table.column_names),
    }

Then run that worker over the input list.

from burla import remote_parallel_map

reports = remote_parallel_map(
    scan_one_parquet_file,
    file_paths,
    func_cpu=1,
    func_ram=4,
    grow=True,
)

Return small outputs

Returning small dicts, numbers, or short strings is fine.

import pandas as pd

report = pd.DataFrame(reports)
report.to_csv("scan-report.csv", index=False)

For large outputs, write a file from inside the worker and return the path.

def transform_one_file(path):
    from pathlib import Path
    import pyarrow.parquet as pq

    table = pq.read_table(path)
    output_path = Path("/workspace/shared/processed") / Path(path).name
    output_path.parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(table, output_path)
    return str(output_path)

This avoids sending large dataframes or arrays back through the client.

Pick chunk size from the bottleneck

If startup or model load time is expensive, make each input bigger.

If failures are common, make each input smaller.

If memory is tight, make each input smaller or ask for more func_ram.

If the output is large, write files to /workspace/shared and reduce paths later.

If the bottleneck is outside Burla, start with the external limit. API quotas, website politeness, database connections, object storage bandwidth, and GPU memory matter more than CPU count.

When to reduce

Reduce when the final answer needs a global view.

Examples:

many file reports into one CSV
many JSONL shards into one table
many heaps into one top-K list
many image manifests into one training manifest

For small outputs, reduce locally after remote_parallel_map returns.

For large outputs, run a second Burla call over output paths.

def combine_reports(paths):
    import pandas as pd

    frames = [pd.read_parquet(path) for path in paths]
    out_path = "/workspace/shared/final/report.parquet"
    pd.concat(frames, ignore_index=True).to_parquet(out_path)
    return out_path


[final_report_path] = remote_parallel_map(
    combine_reports,
    [processed_paths],
    func_cpu=8,
    func_ram=32,
)

Start with this checklist

Before you run the full job, write down:

What is one input?
What does one worker read?
What does one worker write or return?
What is the external bottleneck?
What resource does one worker need?
What happens if one input fails?
Does the final answer need a reduce step?

That checklist catches most bad Burla job shapes before you spend money on them.

Choose how to split your work

#Choose how to split your work

#The rule

#Common split patterns

#Write the worker around one input

#Return small outputs

#Pick chunk size from the bottleneck

#When to reduce

#Start with this checklist

#Examples that use this pattern

Choose how to split your work

The rule

Common split patterns

Write the worker around one input

Return small outputs

Pick chunk size from the bottleneck

When to reduce

Start with this checklist

Examples that use this pattern