Reproducibility
How do I define a common entrypoint for my workload?
What is the main entrypoint to your project? Is it a bash script, a notebook, a Python file, or something else? A clearly defined entrypoint gives users and collaborators a quick way to understand and run your project - often more effectively than a potentially outdated README, since an entrypoint you use yourself is far more likely to stay up to date.
With the release of PEP 621, it became customary to define common entrypoints in the `pyproject.toml` file under the `[project.scripts]` section.
```toml
[project.scripts]
# Single entrypoint ($ workload --help to see available subcommands)
workload = "workload.main:main"

# Multiple scripts/entrypoints (module:function)
train = "workload.main:train"
download = "workload.main:download"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```
Note: We must include the `[build-system]` section to specify how the project should be built, ensuring that tools know how to process its metadata and source files.
Namespace Organization
Consider building your workload as a package with a CLI entrypoint: it signals intent and gives collaborators a single, well-known way to run the code. As an additional step, you can organize your workload commands under a single namespace with subcommands:
```bash
# Single namespace with subcommands
workload download    # Download the data
workload preprocess  # Preprocess the data
workload train       # Train the model

# Alternative: multiple separate scripts
workload_download
workload_preprocess
workload_train
```
Using subcommands often avoids redundant boilerplate and keeps the commands easier to maintain and collaborate on, as shown in the sketch below.
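A minimal sketch of this pattern, assuming the package is called `workload` and the entrypoint is `workload.main:main` as above (the subcommand functions and arguments are illustrative), using `argparse` subparsers:

```python
# workload/main.py (illustrative; function names are assumptions)
import argparse


def download(args: argparse.Namespace) -> None:
    print(f"Downloading data to {args.output}")


def train(args: argparse.Namespace) -> None:
    print(f"Training for {args.epochs} epochs")


def main() -> None:
    parser = argparse.ArgumentParser(prog="workload")
    subparsers = parser.add_subparsers(dest="command", required=True)

    p_download = subparsers.add_parser("download", help="Download the data")
    p_download.add_argument("--output", default="data/raw")
    p_download.set_defaults(func=download)

    p_train = subparsers.add_parser("train", help="Train the model")
    p_train.add_argument("--epochs", type=int, default=10)
    p_train.set_defaults(func=train)

    # Dispatch to the selected subcommand
    args = parser.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
```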
How do I manage complex configurations?
In research and experimental codebases, it’s common to manage a large number of parameters. To keep things organized, it’s helpful to separate configuration into two broad categories:
- Application Configuration: Static settings that define the overall behavior of the project. These are often tied to the environment (e.g., file paths, hardware options, logging preferences) and typically don’t change between runs.
- Task Configuration: Dynamic settings that are specific to a single experiment or run - such as model architecture, training hyperparameters, or evaluation strategies. A popular and useful package for this is `hydra`.
Application Configuration
When loading application configuration, it’s useful to establish a clear order of precedence and load configuration consistently:
- Environment variables with a prefix `WORKLOAD_`
- Command-line arguments
This order of precedence is very common, and you can define it in a simple `settings.py` or `config.py` module that can be imported across your codebase.
Make sure to assign reasonable defaults to your configuration so users can see which settings exist and how to override them.
```python
import os
from pathlib import Path

# XDG Base Directory fallbacks for portability across operating systems
XDG_CONFIG_HOME = os.getenv("XDG_CONFIG_HOME", str(Path.home() / ".config"))
XDG_CACHE_HOME = os.getenv("XDG_CACHE_HOME", str(Path.home() / ".cache"))
XDG_DATA_HOME = os.getenv("XDG_DATA_HOME", str(Path.home() / ".local/share"))
XDG_STATE_HOME = os.getenv("XDG_STATE_HOME", str(Path.home() / ".local/state"))

# Workload-specific directories, overridable via WORKLOAD_* environment variables
WORKLOAD_CONFIG_DIR = os.getenv("WORKLOAD_CONFIG_DIR", str(Path(XDG_CONFIG_HOME) / "workload_name"))
WORKLOAD_CACHE_DIR = os.getenv("WORKLOAD_CACHE_DIR", str(Path(XDG_CACHE_HOME) / "workload_name"))
WORKLOAD_STATE_DIR = os.getenv("WORKLOAD_STATE_DIR", str(Path(XDG_STATE_HOME) / "workload_name"))
WORKLOAD_DATA_DIR = os.getenv("WORKLOAD_DATA_DIR", str(Path(XDG_DATA_HOME) / "workload_name"))

# Logging and feature flags
WORKLOAD_LOG_LEVEL = os.getenv("WORKLOAD_LOG_LEVEL", "INFO")
WORKLOAD_EXPERIMENT_TRACKER = os.getenv("WORKLOAD_EXPERIMENT_TRACKER", "csv")
WORKLOAD_MULTITHREADING = os.getenv("WORKLOAD_MULTITHREADING", "true")
WORKLOAD_MULTIPROCESSING = os.getenv("WORKLOAD_MULTIPROCESSING", "true")
```
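Assuming the module lives at `workload/settings.py` (the module and package names here are illustrative), the rest of the codebase can then import these values directly:

```python
# Hypothetical usage elsewhere in the codebase
from pathlib import Path

from workload.settings import WORKLOAD_DATA_DIR, WORKLOAD_LOG_LEVEL

raw_data_dir = Path(WORKLOAD_DATA_DIR) / "raw"  # honors WORKLOAD_DATA_DIR if set, XDG fallback otherwise
print(f"Reading raw data from {raw_data_dir} (log level: {WORKLOAD_LOG_LEVEL})")
```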
Note: Here we illustrate the use of the XDG Base Directory Specification as a fallback to make your project more portable across operating systems. However, be aware that HPC environments may not set these variables by default, or may change them unpredictably during job submission, so it’s fine to fall back to standard paths.
This approach is also directly compatible with `.env` files, where you namespace the project’s variables with a prefix like `WORKLOAD_*`. When you then source the `.env` file, you can easily override the configuration as part of your shell setup - especially useful on HPC systems, where you may want to tell the application to store large files in another location.
```bash
# Directory structure
WORKLOAD_CONFIG_DIR=/dtu/p1/${USER}/workload_name/config  # Configuration files
WORKLOAD_CACHE_DIR=/dtu/p1/${USER}/workload_name/cache    # Temporary files; avoids filling home directories with limited capacity on HPC systems
WORKLOAD_STATE_DIR=/dtu/p1/${USER}/workload_name/state    # Model checkpoints, intermediate processing, etc.
WORKLOAD_DATA_DIR=/dtu/p1/${USER}/workload_name/data      # On HPC systems, this often points to shared or fast storage

# Feature flags
WORKLOAD_LOG_LEVEL=INFO             # DEBUG, INFO, WARNING, ERROR, CRITICAL
WORKLOAD_EXPERIMENT_TRACKER=mlflow  # csv, wandb, etc.
WORKLOAD_MULTITHREADING=true
WORKLOAD_MULTIPROCESSING=true
```
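One way to source such a file so the variables are exported to the application (a minimal sketch for POSIX-like shells, assuming the file is named `.env`):

```bash
# Export every variable defined in .env into the current shell session
set -a          # automatically export all variables assigned from here on
source .env
set +a          # stop auto-exporting

workload train  # now picks up the WORKLOAD_* overrides
```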
Task Configuration
For constructs with many variable parameters - such as PyTorch modules, data loaders, training loops, or a scheduler wrapper - it’s better to treat these as task-specific configuration. These values often change per experiment.
Here, command-line configuration tools like `hydra` shine, letting you override settings dynamically while keeping defaults in version-controlled YAML files that can be shared with the team.
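A minimal sketch of what such an entrypoint could look like with `hydra` (the config layout and parameter names are illustrative assumptions):

```python
# workload/main.py (sketch)
# Assumed config layout: conf/config.yaml containing e.g.
#   model:
#     hidden_dim: 128
#   trainer:
#     lr: 3.0e-4
#     epochs: 10
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="conf", config_name="config")
def train(cfg: DictConfig) -> None:
    # Defaults come from the version-controlled YAML files; any value can be
    # overridden on the command line, e.g.:
    #   workload train trainer.lr=1e-3 model.hidden_dim=256
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    train()
```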
Example: Wrapping the scheduler
Batch job scripts with embedded scheduler directives (e.g., `#BSUB`) are the standard way to submit workloads.
While this approach works well for fixed, repeatable jobs, it tends to be rigid: any change in resources or runtime configuration often requires editing the script directly. This can lead to a mismatch between the scheduler’s resource requests and the parameters used in the execution script.
A more maintainable and flexible approach is to wrap the scheduler submission logic in a reusable interface.
Instead of hardcoding directives, you can encapsulate the scheduler command (e.g., `bsub` for LSF) and its arguments inside a small Python utility, which helps keep settings consistent across jobs.
Here’s an example of how to map the scheduler options to sensible defaults with a lightweight wrapper for LSF submissions.
```python
import os
from dataclasses import dataclass, field


@dataclass
class SchedulerOptions:
    id: str = field(default_factory=lambda: os.getenv("SCHEDULER_ID", "workload_name"))
    cores: int = field(default_factory=lambda: int(os.getenv("SCHEDULER_CORES", "4")))
    walltime: str = field(default_factory=lambda: os.getenv("SCHEDULER_WALLTIME", "7:00"))
    queue: str = field(default_factory=lambda: os.getenv("SCHEDULER_QUEUE", "p1"))
    memory: str = field(default_factory=lambda: os.getenv("SCHEDULER_MEM", "16GB"))
    gpus: int = field(default_factory=lambda: int(os.getenv("SCHEDULER_GPUS", "2")))
    email: str = field(default_factory=lambda: os.getenv("SCHEDULER_EMAIL", "username@institution.dk"))
```
Here we allow overrides via environment variables, while also providing default fallback values.
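For example (illustrative usage), explicit keyword arguments take precedence over the environment variables, which in turn take precedence over the hard-coded defaults:

```python
# Illustrative usage of SchedulerOptions
opts = SchedulerOptions(cores=8, walltime="24:00")  # explicit values win
print(opts.cores)  # 8
print(opts.queue)  # "p1" unless SCHEDULER_QUEUE is set in the environment
```

The wrapper below then uses these options to build the `bsub` command.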
```python
import shutil
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Optional


class LSFScheduler:
    def __init__(self, options: Optional[SchedulerOptions] = None):
        self.options = options or SchedulerOptions()

    def submit(self, cmd: list[str], logs_dir: Optional[Path] = None) -> subprocess.CompletedProcess:
        if shutil.which("bsub") is None:
            raise RuntimeError("bsub command not found. LSF scheduler is not available.")

        logs_dir = logs_dir or Path("logs")
        logs_dir.mkdir(parents=True, exist_ok=True)

        # Unique job name: workload id plus submission timestamp
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        job_id = f"{self.options.id.replace(' ', '_')}_{timestamp}"

        bsub_cmd = [
            "bsub",
            "-J", job_id,
            "-n", str(self.options.cores),
            "-W", self.options.walltime,
            "-q", self.options.queue,
            "-R", f"rusage[mem={self.options.memory}]",
            "-R", "span[hosts=1]",
            "-gpu", f"num={self.options.gpus}:mode=exclusive_process",
            "-u", self.options.email,
            "-B", "-N",
            "-o", str(logs_dir / "%J.out"),
            "-e", str(logs_dir / "%J.err"),
        ]
        bsub_cmd.extend(cmd)
        return subprocess.run(bsub_cmd, capture_output=True, text=True, check=False)
```
With a simple conditional check, you can add a CLI flag like `workload train --submit` to trigger job submission through the wrapper. If you use `hydra`, you can register the scheduler options as a structured config and override them on the command line, e.g. `workload train --submit scheduler.cores=8 scheduler.gpus=1`.
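A minimal sketch of that conditional check (the flag handling and resubmission command are illustrative assumptions, reusing the `SchedulerOptions` and `LSFScheduler` classes above):

```python
import sys
from pathlib import Path


def maybe_submit() -> bool:
    """If --submit is present, resubmit this command to LSF and return True."""
    if "--submit" not in sys.argv:
        return False
    # Re-run the same command on the cluster, minus the --submit flag itself
    cmd = [arg for arg in sys.argv if arg != "--submit"]
    result = LSFScheduler().submit(cmd, logs_dir=Path("logs"))
    print(result.stdout or result.stderr)
    return True


def train() -> None:
    if maybe_submit():
        return
    ...  # run the actual training inside the allocated job
```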