Reproducibility
How do I define a common entrypoint for my workload?
What is the main entrypoint to your project? Is it a bash script, a notebook, a Python file, or something else? A clearly defined entrypoint gives users and collaborators a quick way to understand and run your project - often more effectively than a potentially outdated README, since an entrypoint you use yourself is far more likely to stay up to date.
With the release of PEP 621, it became customary to define common entrypoints in the `pyproject.toml` file under the `[project.scripts]` section.
```toml
[project.scripts]
# Single entrypoint ($ workload --help to see available subcommands)
workload = "workload.main:main"

# Multiple scripts/entrypoints (module:function)
train = "workload.main:train"
download = "workload.main:download"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```
Note: We must include the `[build-system]` section to specify how the project should be built, ensuring that tools know how to process its metadata and source files.
Namespace Organization
Consider building your workload as a package with a CLI entrypoint: it signals intent and gives collaborators a single, well-known way to run the code. As an additional step, you can organize your workload commands under a single namespace with subcommands:
```bash
# Single namespace with subcommands
workload download    # Download the data
workload preprocess  # Preprocess the data
workload train       # Train the model

# Alternative: multiple separate scripts
workload_download
workload_preprocess
workload_train
```
Using subcommands often avoids redundant boilerplate and keeps the commands easier to maintain and collaborate on, as shown in the sketch below.
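A minimal sketch of this pattern, assuming the package is called `workload` and the entrypoint is `workload.main:main` as above (the subcommand functions and arguments are illustrative), using `argparse` subparsers:

```python
# workload/main.py (illustrative; function names are assumptions)
import argparse


def download(args: argparse.Namespace) -> None:
    print(f"Downloading data to {args.output}")


def train(args: argparse.Namespace) -> None:
    print(f"Training for {args.epochs} epochs")


def main() -> None:
    parser = argparse.ArgumentParser(prog="workload")
    subparsers = parser.add_subparsers(dest="command", required=True)

    p_download = subparsers.add_parser("download", help="Download the data")
    p_download.add_argument("--output", default="data/raw")
    p_download.set_defaults(func=download)

    p_train = subparsers.add_parser("train", help="Train the model")
    p_train.add_argument("--epochs", type=int, default=10)
    p_train.set_defaults(func=train)

    # Dispatch to the selected subcommand
    args = parser.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
```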
How do I manage complex configurations?
In research and experimental codebases, it’s common to manage a large number of parameters. To keep things organized, it’s helpful to separate configuration into two broad categories:
- Application Configuration: Static settings that define the overall behavior of the project. These are often tied to the environment (e.g., file paths, hardware options, logging preferences) and typically don’t change between runs.
- Task Configuration: Dynamic settings that are specific to a single experiment or run - such as model architecture, training hyperparameters, or evaluation strategies. A popular and useful package for this is `hydra`.
Application Configuration
When loading application configuration, it’s useful to establish a clear order of precedence and load configuration consistently:
- Environment variables with a prefix `WORKLOAD_`
- Command-line arguments
This order of precedence is very common, and you can define it in a simple `settings.py` or `config.py` module that can be imported across your codebase.
Make sure to assign reasonable defaults to your configuration so users can see which settings exist and how to override them.
```python
import os
from pathlib import Path

# XDG Base Directory fallbacks for portability across operating systems
XDG_CONFIG_HOME = os.getenv("XDG_CONFIG_HOME", str(Path.home() / ".config"))
XDG_CACHE_HOME = os.getenv("XDG_CACHE_HOME", str(Path.home() / ".cache"))
XDG_DATA_HOME = os.getenv("XDG_DATA_HOME", str(Path.home() / ".local/share"))
XDG_STATE_HOME = os.getenv("XDG_STATE_HOME", str(Path.home() / ".local/state"))

# Workload-specific directories, overridable via WORKLOAD_* environment variables
WORKLOAD_CONFIG_DIR = os.getenv("WORKLOAD_CONFIG_DIR", str(Path(XDG_CONFIG_HOME) / "workload_name"))
WORKLOAD_CACHE_DIR = os.getenv("WORKLOAD_CACHE_DIR", str(Path(XDG_CACHE_HOME) / "workload_name"))
WORKLOAD_STATE_DIR = os.getenv("WORKLOAD_STATE_DIR", str(Path(XDG_STATE_HOME) / "workload_name"))
WORKLOAD_DATA_DIR = os.getenv("WORKLOAD_DATA_DIR", str(Path(XDG_DATA_HOME) / "workload_name"))

# Logging and feature flags
WORKLOAD_LOG_LEVEL = os.getenv("WORKLOAD_LOG_LEVEL", "INFO")
WORKLOAD_EXPERIMENT_TRACKER = os.getenv("WORKLOAD_EXPERIMENT_TRACKER", "csv")
WORKLOAD_MULTITHREADING = os.getenv("WORKLOAD_MULTITHREADING", "true")
WORKLOAD_MULTIPROCESSING = os.getenv("WORKLOAD_MULTIPROCESSING", "true")
```
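Assuming the module lives at `workload/settings.py` (the module and package names here are illustrative), the rest of the codebase can then import these values directly:

```python
# Hypothetical usage elsewhere in the codebase
from pathlib import Path

from workload.settings import WORKLOAD_DATA_DIR, WORKLOAD_LOG_LEVEL

raw_data_dir = Path(WORKLOAD_DATA_DIR) / "raw"  # honors WORKLOAD_DATA_DIR if set, XDG fallback otherwise
print(f"Reading raw data from {raw_data_dir} (log level: {WORKLOAD_LOG_LEVEL})")
```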
Note: Here we illustrate the use of the XDG Base Directory Specification as a fallback to make your project more portable across operating systems. However, be aware that HPC environments may not set these variables by default, or may change them unpredictably during job submission, so it’s fine to fall back to standard paths.
This approach is also directly compatible with `.env` files, where you namespace the project’s variables with a prefix like `WORKLOAD_*`. When you then source the `.env` file, you can easily override the configuration as part of your shell setup - especially useful on HPC systems, where you may want to tell the application to store large files in another location.
```bash
# Directory structure
WORKLOAD_CONFIG_DIR=/dtu/p1/${USER}/workload_name/config  # Configuration files
WORKLOAD_CACHE_DIR=/dtu/p1/${USER}/workload_name/cache    # Temporary files; avoids filling home directories with limited capacity on HPC systems
WORKLOAD_STATE_DIR=/dtu/p1/${USER}/workload_name/state    # Model checkpoints, intermediate processing, etc.
WORKLOAD_DATA_DIR=/dtu/p1/${USER}/workload_name/data      # On HPC systems, this often points to shared or fast storage

# Feature flags
WORKLOAD_LOG_LEVEL=INFO             # DEBUG, INFO, WARNING, ERROR, CRITICAL
WORKLOAD_EXPERIMENT_TRACKER=mlflow  # csv, wandb, etc.
WORKLOAD_MULTITHREADING=true
WORKLOAD_MULTIPROCESSING=true
```
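One way to source such a file so the variables are exported to the application (a minimal sketch for POSIX-like shells, assuming the file is named `.env`):

```bash
# Export every variable defined in .env into the current shell session
set -a          # automatically export all variables assigned from here on
source .env
set +a          # stop auto-exporting

workload train  # now picks up the WORKLOAD_* overrides
```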
Task Configuration
For constructs with many variable parameters - such as PyTorch modules, data loaders, training loops, or a scheduler wrapper - it’s better to treat these as task-specific configuration. These values often change per experiment.
Here, command-line configuration tools like `hydra` shine, letting you override settings dynamically while keeping defaults in version-controlled YAML files that can be shared with the team.
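A minimal sketch of what such an entrypoint could look like with `hydra` (the config layout and parameter names are illustrative assumptions):

```python
# workload/main.py (sketch)
# Assumed config layout: conf/config.yaml containing e.g.
#   model:
#     hidden_dim: 128
#   trainer:
#     lr: 3.0e-4
#     epochs: 10
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="conf", config_name="config")
def train(cfg: DictConfig) -> None:
    # Defaults come from the version-controlled YAML files; any value can be
    # overridden on the command line, e.g.:
    #   workload train trainer.lr=1e-3 model.hidden_dim=256
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    train()
```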
Example: Wrapping the scheduler
Batch job scripts with embedded scheduler directives (e.g., `#BSUB`) are the standard way to submit workloads.
While this approach works well for fixed, repeatable jobs, it tends to be rigid: any change in resources or runtime configuration often requires editing the script directly. This can lead to a mismatch between the scheduler’s resource requests and the parameters used in the execution script.
A more maintainable and flexible approach is to wrap the scheduler submission logic in a reusable interface.
Instead of hardcoding directives, you can encapsulate the scheduler command (e.g., `bsub` for LSF) and its arguments inside a small Python utility, which helps keep settings consistent across jobs.
Here’s an example of how to map the scheduler options to sensible defaults with a lightweight wrapper for LSF submissions.
```python
import os
from dataclasses import dataclass, field


@dataclass
class SchedulerOptions:
    id: str = field(default_factory=lambda: os.getenv("SCHEDULER_ID", "workload_name"))
    cores: int = field(default_factory=lambda: int(os.getenv("SCHEDULER_CORES", "4")))
    walltime: str = field(default_factory=lambda: os.getenv("SCHEDULER_WALLTIME", "7:00"))
    queue: str = field(default_factory=lambda: os.getenv("SCHEDULER_QUEUE", "p1"))
    memory: str = field(default_factory=lambda: os.getenv("SCHEDULER_MEM", "16GB"))
    gpus: int = field(default_factory=lambda: int(os.getenv("SCHEDULER_GPUS", "2")))
    email: str = field(default_factory=lambda: os.getenv("SCHEDULER_EMAIL", "username@institution.dk"))
```
Here we allow overrides via environment variables, while also providing default fallback values.
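For example (illustrative usage), explicit keyword arguments take precedence over the environment variables, which in turn take precedence over the hard-coded defaults:

```python
# Illustrative usage of SchedulerOptions
opts = SchedulerOptions(cores=8, walltime="24:00")  # explicit values win
print(opts.cores)  # 8
print(opts.queue)  # "p1" unless SCHEDULER_QUEUE is set in the environment
```

The wrapper below then uses these options to build the `bsub` command.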
```python
import shutil
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Optional


class LSFScheduler:
    def __init__(self, options: Optional[SchedulerOptions] = None):
        self.options = options or SchedulerOptions()

    def submit(self, cmd: list[str], logs_dir: Optional[Path] = None) -> subprocess.CompletedProcess:
        if shutil.which("bsub") is None:
            raise RuntimeError("bsub command not found. LSF scheduler is not available.")

        logs_dir = logs_dir or Path("logs")
        logs_dir.mkdir(parents=True, exist_ok=True)

        # Unique job name: workload id plus submission timestamp
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        job_id = f"{self.options.id.replace(' ', '_')}_{timestamp}"

        bsub_cmd = [
            "bsub",
            "-J", job_id,
            "-n", str(self.options.cores),
            "-W", self.options.walltime,
            "-q", self.options.queue,
            "-R", f"rusage[mem={self.options.memory}]",
            "-R", "span[hosts=1]",
            "-gpu", f"num={self.options.gpus}:mode=exclusive_process",
            "-u", self.options.email,
            "-B", "-N",
            "-o", str(logs_dir / "%J.out"),
            "-e", str(logs_dir / "%J.err"),
        ]
        bsub_cmd.extend(cmd)
        return subprocess.run(bsub_cmd, capture_output=True, text=True, check=False)
```
With a simple conditional check, you can add a CLI flag like `workload train --submit` to trigger job submission through the wrapper. If you use `hydra`, you can register the scheduler options as a structured config and override them on the command line, e.g. `workload train --submit scheduler.cores=8 scheduler.gpus=1`.
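A minimal sketch of that conditional check (the flag handling and resubmission command are illustrative assumptions, reusing the `SchedulerOptions` and `LSFScheduler` classes above):

```python
import sys
from pathlib import Path


def maybe_submit() -> bool:
    """If --submit is present, resubmit this command to LSF and return True."""
    if "--submit" not in sys.argv:
        return False
    # Re-run the same command on the cluster, minus the --submit flag itself
    cmd = [arg for arg in sys.argv if arg != "--submit"]
    result = LSFScheduler().submit(cmd, logs_dir=Path("logs"))
    print(result.stdout or result.stderr)
    return True


def train() -> None:
    if maybe_submit():
        return
    ...  # run the actual training inside the allocated job
```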