Use Cases
Process thousands of files quickly.
A pattern for processing many files in parallel.
Process thousands of files quickly.
If you have a lot of files, the fastest pattern is usually:
- give each file to one function call
- run many function calls in parallel
- save each output to
/workspace/shared
This keeps your script simple and makes it easy to scale up.
Before you start
Make sure you have already:
- installed Burla:
pip install burla - connected your machine:
burla login - started your cluster in the Burla dashboard
For shared filesystem details, read Read and Write GCS Files.
Step 1: Build a list of file paths
Start with a list of input files.
from pathlib import Path
input_file_paths = [str(path) for path in Path("/workspace/shared/logs/raw").glob("*.txt")]
Each path in the list becomes one parallel function call.
Step 2: Write one file-processing function
This function reads one input file and writes one output file.
from pathlib import Path
def count_error_lines(input_file_path):
input_path = Path(input_file_path)
output_path = Path("/workspace/shared/logs/processed") / f"{input_path.stem}.txt"
output_path.parent.mkdir(parents=True, exist_ok=True)
error_count = sum("ERROR" in line for line in input_path.read_text().splitlines())
output_path.write_text(f"{error_count}\n")
return str(output_path)
Step 3: Run all files in parallel
Use remote_parallel_map with your function and file path list.
from burla import remote_parallel_map
processed_file_paths = remote_parallel_map(count_error_lines, input_file_paths)
Step 4: Combine the per-file outputs into one report
You can keep this final combine step simple.
from pathlib import Path
total_error_count = 0
for processed_file_path in processed_file_paths:
total_error_count += int(Path(processed_file_path).read_text().strip())
final_report_path = Path("/workspace/shared/logs/final/error-report.txt")
final_report_path.parent.mkdir(parents=True, exist_ok=True)
final_report_path.write_text(f"total_error_count={total_error_count}\n")
print(final_report_path)
Step 5: Scale up safely
Before you run thousands of files, test with a small subset first.
from burla import remote_parallel_map
remote_parallel_map(count_error_lines, input_file_paths[:20])
When the small test works, run the full list.
What to do next
If you have one very large file instead of many small files, continue with Process one giant file quickly.