Combine many results/files into one. (Map-Reduce)
A beginner-friendly map-reduce pattern for combining many outputs into one file.
Map-reduce means:
- map: run many function calls in parallel
- reduce: combine their outputs into one result
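The two steps can be sketched in plain Python before involving a cluster. This is a local illustration only: it uses a thread pool from the standard library rather than Burla, and the function names `square`, `add`, and `map_reduce` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def square(number):
    # Map function: called once per input.
    return number * number


def add(left, right):
    # Reduce function: folds two values into one.
    return left + right


def map_reduce(inputs):
    # Map: run the calls concurrently and collect one output per input.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(square, inputs))
    # Reduce: combine all outputs into a single result.
    return reduce(add, mapped, 0)


total = map_reduce(range(5))
print(total)  # 0 + 1 + 4 + 9 + 16 = 30
```

The shape is the same as the Burla example on this page: many independent calls first, then one combining call.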
Why you might need this
Use this when you want to do lots of work in parallel, but end with one final output.
- Many files → one report
- Many small results → one total (this example)
Map writes outputs to /workspace/shared. Reduce reads them back and combines them.
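The file handoff can be tried locally first. The sketch below is a stand-in, not Burla code: a temporary directory plays the role of /workspace/shared, and the helper names `write_part` and `sum_parts` are hypothetical.

```python
import tempfile
from pathlib import Path


def write_part(shared_dir, number):
    # Map side: each call writes its own small file.
    part_path = Path(shared_dir) / "parts" / f"number-{number}.txt"
    part_path.parent.mkdir(parents=True, exist_ok=True)
    part_path.write_text(f"{number}\n")
    return part_path


def sum_parts(part_paths):
    # Reduce side: read every file back and combine the values.
    return sum(int(Path(path).read_text().strip()) for path in part_paths)


with tempfile.TemporaryDirectory() as shared_dir:
    part_paths = [write_part(shared_dir, number) for number in range(5)]
    local_total = sum_parts(part_paths)

print(local_total)  # 0 + 1 + 2 + 3 + 4 = 10
```

Because each map call writes to its own file name, the calls never overwrite each other's output.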
Before you start
Make sure you have already:
- installed Burla:
  pip install burla
- connected your machine:
  burla login
- started your cluster in the Burla dashboard
If you’re new to /workspace/shared, start with Read and Write GCS Files. If you’re new to func_cpu and func_ram, start with Run code on one big cloud machine.
Step 1 (Map): Write one file per input
from pathlib import Path

from burla import remote_parallel_map


def write_part_file(number):
    # Each call writes its own file, so parallel calls never collide.
    part_file_path = f"/workspace/shared/map-reduce-demo/parts/number-{number}.txt"
    Path(part_file_path).parent.mkdir(parents=True, exist_ok=True)
    Path(part_file_path).write_text(f"{number}\n")
    return part_file_path


# One function call per input: 0, 1, 2, 3, 4.
inputs = list(range(5))
part_file_paths = remote_parallel_map(write_part_file, inputs)
print(part_file_paths)
This creates five files, number-0.txt through number-4.txt, in /workspace/shared/map-reduce-demo/parts/, and part_file_paths holds their paths.
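Before moving on to the reduce step, it can help to confirm that every expected part file exists. The following is a minimal sketch with a hypothetical helper, `find_missing_parts`; it uses only `pathlib` and assumes it runs somewhere the listed paths are readable. The demonstration uses a temporary directory so it can be run anywhere.

```python
import tempfile
from pathlib import Path


def find_missing_parts(expected_paths):
    # Return the expected paths that do not exist as regular files.
    return [str(path) for path in expected_paths if not Path(path).is_file()]


# Local demonstration: one file is written, one is deliberately left out.
with tempfile.TemporaryDirectory() as demo_dir:
    present = Path(demo_dir) / "number-0.txt"
    present.write_text("0\n")
    absent = Path(demo_dir) / "number-1.txt"
    missing = find_missing_parts([present, absent])

print(missing)  # a one-item list containing the path to number-1.txt
```

An empty list means every part is in place and the reduce step has everything it needs.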
Step 2 (Reduce): Combine all files into one file
The reduce step runs once, so it is a common place to use a bigger machine (more CPU / RAM).
from pathlib import Path

from burla import remote_parallel_map


def combine_part_files(part_paths):
    # Read every part file and add up the numbers.
    total = 0
    for path in part_paths:
        total += int(Path(path).read_text().strip())

    # Write the combined result to a single output file.
    output_file_path = "/workspace/shared/map-reduce-demo/final/total.txt"
    Path(output_file_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_file_path).write_text(f"{total}\n")
    return output_file_path


# The input list has one item (the whole list of part paths),
# so combine_part_files is called exactly once.
output_file_paths = remote_parallel_map(
    combine_part_files,
    [part_file_paths],
    func_cpu=8,
    func_ram=32,
)
print(output_file_paths[0])
The final combined file path is:
/workspace/shared/map-reduce-demo/final/total.txt
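The reduce call passes `[part_file_paths]`, a list with one element, so the function is called once and receives the whole list of paths. That is also why the result is read with `output_file_paths[0]`. The sketch below illustrates the difference using Python's built-in `map` as a local stand-in for `remote_parallel_map` (an assumption made only for illustration); `count_items` is a hypothetical function.

```python
def count_items(paths):
    # Hypothetical reduce-style function: expects the whole list at once.
    return len(paths)


paths = ["a.txt", "b.txt", "c.txt"]

# One-element input list: one call, which receives all three paths.
wrapped_results = list(map(count_items, [paths]))

# Unwrapped input list: three calls, each receiving a single string.
unwrapped_results = list(map(count_items, paths))

print(wrapped_results)    # [3]
print(unwrapped_results)  # [5, 5, 5] (the length of each file name)
```

Forgetting the extra brackets would run the reduce function once per path instead of once overall, which is rarely what a reduce step intends.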