coastal-calibration: Design Documentation#
Overview#
The coastal-calibration Python package is a complete redesign and rewrite of the
original bash-based SCHISM model calibration workflow. This document details the
architectural improvements, design decisions, and substantial enhancements made over the
original implementation.
Table of Contents#
- Executive Summary
- Original Implementation Analysis
- New Architecture
- Key Design Decisions
- Substantial Improvements
- API Reference
- Potential Future Developments
Executive Summary#
The coastal-calibration package provides a modern Python interface for running SCHISM
and SFINCS coastal model calibration workflows on HPC clusters. It wraps the existing
operational workflow scripts with a clean, type-safe API while establishing the
foundation for incremental improvements.
Design Goals#
The primary objectives of this rewrite are to create a workflow that is:
- Intuitive and user-friendly - Simple YAML configuration, clear CLI commands, helpful error messages
- Less prone to errors - Type-safe configuration, comprehensive validation, structured logging
- Extensible - Polymorphic model architecture that supports SCHISM, SFINCS, and future models via a common ModelConfig ABC
Architectural Strategy#
The package is designed with a stable public API that shields users from internal changes. This enables:
- Immediate usability - Users get a clean interface today, even while internals are being improved
- Incremental rewriting - Embedded bash scripts can be replaced with pure Python one stage at a time
- Safe evolution - Internal rewrites don't break user-facing code or configurations
The long-term goal is to completely rewrite all embedded bash scripts in Python, but doing so incrementally allows the package to be useful immediately while that work proceeds.
Key Features#
- Type-safe configuration via dataclasses with runtime validation
- Modular stage-based architecture for maintainability and extensibility
- Native Python datetime handling replacing fragile shell date arithmetic
- Async data downloading with built-in source validation
- CLI and programmatic APIs for both interactive and automated use
- Progress tracking and structured logging
- Configuration inheritance for DRY multi-run setups
- Smart default paths with variable interpolation
Original Implementation Analysis#
File Structure (20+ scripts)#
calib_org/
├── sing_run.bash # Main entry point (258 lines)
├── schism_calib.cfg # Configuration file
├── pre_nwm_forcing_coastal.bash # Forcing preparation
├── post_nwm_forcing_coastal.bash # Forcing post-processing
├── make_tpxo_ocean.bash # TPXO boundary conditions
├── pre_regrid_stofs.bash # STOFS pre-processing
├── post_regrid_stofs.bash # STOFS post-processing
├── update_param.bash # Parameter file updates (249 lines)
├── pre_schism.bash # SCHISM input preparation
├── post_schism.bash # SCHISM output processing
├── merge_source_sink.bash # Discharge file merging
├── initial_discharge.bash # Initial discharge creation
├── combine_sink_source.bash # Sink/source combination
└── run_sing_coastal_workflow_*.bash # 8+ Singularity wrappers
Critical Issues in Original Implementation#
1. Fragile Date Arithmetic#
The original workflow relied on external scripts for date calculations:
# Original: External script calls for every date operation
export FORCING_END_DATE=$(${USHnwm}/utils/advance_time.sh $PDY$cyc $LENGTH_HRS)'00'
pdycyc=$(${USHnwm}/utils/advance_time.sh $PDY$cyc $hr)
This approach had several problems:
- Required external advance_time.sh and advance_cymdh.pl scripts
- Shell spawning overhead for each date operation
- Inconsistent handling of edge cases (leap years, month boundaries)
- No error handling for invalid dates
2. Environment Variable Pitfalls#
The original scripts passed dozens of environment variables between scripts:
# Original configuration (schism_calib.cfg)
export STARTPDY=20230611
export STARTCYC=00
export FCST_LENGTH_HRS=3.0
export HOT_START_FILE=''
export USE_TPXO="NO"
export COASTAL_DOMAIN=pacific
export METEO_SOURCE=NWM_RETRO
export COASTAL_WORK_DIR=/efs/schism_use_case/...
# Plus 40+ more in sing_run.bash
export NGWPC_COASTAL_PARM_DIR=/ngen-test/coastal/ngwpc-coastal
export NGEN_APP_DIR=/ngen-app
export FCST_TIMESTEP_LENGTH_SECS=3600
export OTPSDIR=$NGEN_APP_DIR/OTPSnc
# ... etc
Problems:
- No validation of variable values
- Easy to have typos that fail silently
- Difficult to track variable dependencies
- No documentation of which variables are required vs optional
3. String-Based Domain Mapping#
# Original: Repeated in multiple files
declare -A coastal_domain_to_inland_domain=( \
[prvi]="domain_puertorico" \
[hawaii]="domain_hawaii" \
[atlgulf]="domain" \
[pacific]="domain" )
declare -A coastal_domain_to_nwm_domain=( \
[prvi]="prvi" \
[hawaii]="hawaii" \
[atlgulf]="conus" \
[pacific]="conus" )
declare -A coastal_domain_to_geo_grid=( \
[prvi]="geo_em_PRVI.nc" \
[hawaii]="geo_em_HI.nc" \
[atlgulf]="geo_em_CONUS.nc" \
[pacific]="geo_em_CONUS.nc" )
Problems:
- Duplicated across multiple scripts
- No static type checking
- Silent failures on unknown domains
4. No Data Download Integration#
The original workflow required manual data downloading via a separate workflow. That workflow had no date validation, no source awareness, and no progress tracking.
5. Minimal Error Handling#
# Original: Scripts would continue on failure
singularity exec -B $BINDINGS --pwd ${work_dir} $SIF_PATH \
./run_sing_coastal_workflow_pre_forcing_coastal.bash
# No error check here
${MPICOMMAND3} singularity exec -B $BINDINGS \
--pwd ${work_dir} \
$SIF_PATH \
$CONDA_ENVS_PATH/$CONDA_ENV_NAME/bin/python \
$USHnwm/wrf_hydro_workflow_dev/forcings/WrfHydroFECPP/workflow_driver.py
# No error check here either
New Architecture#
Package Structure#
src/coastal_calibration/
├── __init__.py # Package exports
├── cli.py # Command-line interface
├── runner.py # Main workflow orchestrator
├── downloader.py # Async data downloading
│
├── config/
│ ├── __init__.py
│ ├── schema.py # YAML config dataclasses + ModelConfig ABC
│ └── create_schema.py # SFINCS creation config schema
│
├── stages/ # Workflow stages
│ ├── __init__.py
│ ├── base.py # Abstract WorkflowStage base class
│ ├── download.py # Data download stage
│ ├── forcing.py # NWM forcing stages
│ ├── boundary.py # Boundary condition stages
│ ├── schism.py # SCHISM execution stages
│ ├── sfincs.py # SFINCS data catalog & symlinks
│ ├── sfincs_build.py # SFINCS model build stages (HydroMT)
│ ├── sfincs_create.py # SFINCS model creation stages
│ └── _hydromt_compat.py # Compatibility patches for hydromt bugs
│
├── schism_prep.py # Pure-Python SCHISM preparation functions
├── sflux.py # Atmospheric forcing generation
├── tides/ # TPXO boundary utilities + pytides
├── regridding/ # ESMF-based regridding (STOFS, NWM forcing)
│
└── utils/
├── __init__.py
├── logging.py # Workflow monitoring
├── time.py # Datetime utilities
├── streamflow.py # NWM streamflow read utilities
├── floodmap.py # Flood depth map generation
└── workflow.py # Workflow helper functions
Core Components#
1. Configuration System (config/schema.py)#
The new configuration system uses Python dataclasses with full type hints:
from dataclasses import dataclass
from datetime import datetime
from typing import ClassVar, Literal

CoastalDomain = Literal["prvi", "hawaii", "atlgulf", "pacific"]
MeteoSource = Literal["nwm_retro", "nwm_ana"]
BoundarySource = Literal["tpxo", "stofs"]


@dataclass
class SimulationConfig:
    """Simulation time and domain configuration."""

    start_date: datetime
    duration_hours: int
    coastal_domain: CoastalDomain
    meteo_source: MeteoSource
    timestep_seconds: int = 3600

    # Domain mappings as class variables
    _INLAND_DOMAIN: ClassVar[dict[str, str]] = {
        "prvi": "domain_puertorico",
        "hawaii": "domain_hawaii",
        "atlgulf": "domain",
        "pacific": "domain",
    }

    @property
    def start_pdy(self) -> str:
        """Return start date as YYYYMMDD string."""
        return self.start_date.strftime("%Y%m%d")

    @property
    def inland_domain(self) -> str:
        """Inland domain directory name for this coastal domain."""
        return self._INLAND_DOMAIN[self.coastal_domain]
Benefits:
- Type safety: IDE autocompletion, static analysis with pyright
- Self-documenting: Property names and docstrings explain purpose
- Validation: Runtime checks with helpful error messages
- DRY: Domain mappings defined once
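Because Literal types are only checked by static analysers, the runtime validation mentioned above needs an explicit check. The following is a minimal sketch of one way to do that; SimulationSketch is a hypothetical, reduced stand-in for SimulationConfig, and the package's actual validation code may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal, get_args

CoastalDomain = Literal["prvi", "hawaii", "atlgulf", "pacific"]


@dataclass
class SimulationSketch:
    """Hypothetical stand-in for SimulationConfig, reduced to two fields."""

    start_date: datetime
    coastal_domain: CoastalDomain

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so check explicitly.
        allowed = get_args(CoastalDomain)
        if self.coastal_domain not in allowed:
            raise ValueError(
                f"coastal_domain must be one of {allowed}, got {self.coastal_domain!r}"
            )


# A misspelled domain fails immediately instead of silently propagating.
try:
    SimulationSketch(datetime(2023, 6, 11), "pacfic")  # typo on purpose
except ValueError as exc:
    message = str(exc)
```

Raising in `__post_init__` surfaces the mistake at configuration load time, which is the behaviour the bash associative arrays could not provide.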
2. YAML Configuration with Inheritance#
# base.yaml - Shared defaults
paths:
  nfs_mount: /ngen-test
---
# hawaii_run.yaml - Inherits from base
_base: base.yaml
simulation:
  start_date: '2023-06-11T00:00:00'
  duration_hours: 24
  coastal_domain: hawaii
  meteo_source: nwm_retro
paths:
  work_dir: /ngen-test/coastal_runs/${simulation.coastal_domain}
Features:
- Variable interpolation: ${section.key} syntax
- Inheritance: _base field for configuration reuse
- Deep merging: Override only what changes
- Smart defaults: Minimal configuration required
When paths are not specified, they are automatically generated using templates that
include the ${model} variable for model-aware directory naming:
DEFAULT_WORK_DIR_TEMPLATE = (
"/ngen-test/coastal/${user}/"
"${model}_${simulation.coastal_domain}_${boundary.source}_${simulation.meteo_source}/"
"${model}_${simulation.start_date}"
)
DEFAULT_RAW_DOWNLOAD_DIR_TEMPLATE = (
"/ngen-test/coastal/${user}/"
"${model}_${simulation.coastal_domain}_${boundary.source}_${simulation.meteo_source}/"
"raw_data"
)
flowchart TD
base[base.yaml] --> hawaii[hawaii_run.yaml]
base --> pacific[pacific_run.yaml]
base --> prvi[prvi_run.yaml]
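Deep merging and ${section.key} interpolation can be illustrated with plain dictionaries. This is a sketch under assumptions: deep_merge and interpolate are hypothetical helper names, and the package's real loader may resolve variables differently (for example ${user} and ${model}, which are not handled here).

```python
import re
from typing import Any

_VAR_RE = re.compile(r"\$\{([^}]+)\}")


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where nested keys in override replace those in base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def interpolate(value: str, config: dict[str, Any]) -> str:
    """Replace each ${section.key} with the value found at that dotted path."""

    def lookup(match: re.Match) -> str:
        node: Any = config
        for part in match.group(1).split("."):
            node = node[part]  # raises KeyError on an unknown path
        return str(node)

    return _VAR_RE.sub(lookup, value)


# Mirrors base.yaml and hawaii_run.yaml from the example above.
base = {"paths": {"nfs_mount": "/ngen-test"}}
child = {
    "simulation": {"coastal_domain": "hawaii", "duration_hours": 24},
    "paths": {"work_dir": "/ngen-test/coastal_runs/${simulation.coastal_domain}"},
}
config = deep_merge(base, child)
work_dir = interpolate(config["paths"]["work_dir"], config)
```

The child only states what differs; paths.nfs_mount survives from the base while paths.work_dir is added and then resolved.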
3. Stage-Based Workflow Architecture#
The stage pipeline is model-specific. Each ModelConfig subclass defines its own
stage_order and create_stages().
SCHISM pipeline:
flowchart TD
A[download] --> B[pre_forcing]
B --> C[nwm_forcing]
C --> D[post_forcing]
D --> E[update_params]
E --> F[schism_obs]
F --> G[boundary_conditions]
G --> H[pre_schism]
H --> I[schism_run]
I --> J[post_schism]
J --> K[schism_plot]
SFINCS pipeline:
flowchart TD
A[download] --> B[sfincs_symlinks]
B --> C[sfincs_data_catalog]
C --> D[sfincs_init]
D --> E[sfincs_timing]
E --> F[sfincs_forcing]
F --> G[sfincs_obs]
G --> H[sfincs_discharge]
H --> I[sfincs_precip]
I --> J[sfincs_wind]
J --> K[sfincs_pressure]
K --> L[sfincs_write]
L --> M[sfincs_run]
Each stage is a Python class inheriting from WorkflowStage:
classDiagram
class WorkflowStage {
<<abstract>>
+run() dict
+validate() list
}
WorkflowStage <|-- DownloadStage
WorkflowStage <|-- ForcingStage
WorkflowStage <|-- BoundaryStage
WorkflowStage <|-- SCHISMStage
WorkflowStage <|-- SFINCSBuildStage
The base class implementation:
class WorkflowStage(ABC):
    """Abstract base class for workflow stages."""

    name: str = "base"
    description: str = "Base workflow stage"

    def __init__(self, config: CoastalCalibConfig, monitor: WorkflowMonitor | None):
        self.config = config
        self.monitor = monitor

    def build_environment(self) -> dict[str, str]:
        """Build environment variables for the stage."""
        # Converts config to env vars for bash scripts
        env = os.environ.copy()
        env["STARTPDY"] = self.config.simulation.start_pdy
        env["STARTCYC"] = self.config.simulation.start_cyc
        # ... all precomputed, no shell date arithmetic needed
        return env

    def run_singularity_command(
        self,
        command: list[str],
        use_mpi: bool = False,
        mpi_tasks: int | None = None,
    ) -> subprocess.CompletedProcess[str]:
        """Run a command inside the Singularity container."""
        # Handles all Singularity setup, bindings, error checking
        pass

    @abstractmethod
    def run(self) -> dict[str, Any]:
        """Execute the stage and return results."""
        pass

    def validate(self) -> list[str]:
        """Validate stage prerequisites. Return list of errors."""
        return []
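To show how a concrete stage plugs into this interface, here is a minimal sketch. StageSketch is a reduced stand-in for WorkflowStage (no config or monitor), and CheckInputsStage is a hypothetical stage invented for illustration; neither name comes from the package.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class StageSketch(ABC):
    """Reduced stand-in for WorkflowStage (no config or monitor)."""

    name: str = "base"

    @abstractmethod
    def run(self) -> dict[str, Any]:
        """Execute the stage and return results."""

    def validate(self) -> list[str]:
        return []


class CheckInputsStage(StageSketch):
    """Hypothetical stage that only verifies required input files exist."""

    name = "check_inputs"

    def __init__(self, work_dir: Path, required: list[str]) -> None:
        self.work_dir = work_dir
        self.required = required

    def validate(self) -> list[str]:
        # Report every missing file instead of stopping at the first one.
        return [
            f"missing input: {self.work_dir / name}"
            for name in self.required
            if not (self.work_dir / name).exists()
        ]

    def run(self) -> dict[str, Any]:
        errors = self.validate()
        if errors:
            raise FileNotFoundError("; ".join(errors))
        return {"checked": len(self.required)}


stage = CheckInputsStage(Path("/nonexistent-work-dir"), ["hgrid.gr3", "param.nml"])
errors = stage.validate()
```

Returning a list from validate() rather than raising lets a runner collect problems from all stages before anything executes.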
4. Workflow Runner Orchestration#
class CoastalCalibRunner:
    """Main workflow runner for coastal model calibration."""

    @property
    def STAGE_ORDER(self) -> list[str]:
        """Stage order is delegated to the model config."""
        return self.config.model_config.stage_order

    def run(
        self,
        start_from: str | None = None,
        stop_after: str | None = None,
        dry_run: bool = False,
    ) -> WorkflowResult:
        """Execute the calibration workflow."""
        # Validation, stage sequencing, error handling, result collection
        pass
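The start_from and stop_after arguments imply selecting a contiguous slice of the stage order. The sketch below shows one plausible way to compute it; select_stages is a hypothetical helper, and treating both bounds as inclusive is an assumption based on the argument names.

```python
from typing import Optional


def select_stages(
    stage_order: list[str],
    start_from: Optional[str] = None,
    stop_after: Optional[str] = None,
) -> list[str]:
    """Return the inclusive slice of stage_order between the two stage names."""
    for name in (start_from, stop_after):
        if name is not None and name not in stage_order:
            raise ValueError(f"unknown stage: {name!r}")
    first = stage_order.index(start_from) if start_from else 0
    last = stage_order.index(stop_after) if stop_after else len(stage_order) - 1
    if first > last:
        raise ValueError(f"{start_from!r} comes after {stop_after!r}")
    return stage_order[first : last + 1]


# First five stages of the SCHISM pipeline shown earlier.
order = ["download", "pre_forcing", "nwm_forcing", "post_forcing", "update_params"]
selected = select_stages(order, start_from="pre_forcing", stop_after="post_forcing")
```

Rejecting unknown or reversed stage names up front avoids silently running an empty workflow.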
Key Design Decisions#
1. Python-Native Date Arithmetic#
Decision: Replace all bash/Perl date scripts with Python datetime.
Rationale:
- Python's datetime and timedelta handle all edge cases correctly
- No external dependencies or shell spawning
- Type-safe with IDE support
Implementation (utils/time.py):
import re
from datetime import datetime, timedelta

_DATE_RE = re.compile(r"^\d{10}$")


def _parse_date(date_string: str) -> datetime:
    """Parse a YYYYMMDDHH string into a datetime, with strict validation."""
    if not isinstance(date_string, str) or not _DATE_RE.match(date_string):
        raise ValueError(
            f"date_string must be exactly 10 digits in YYYYMMDDHH format, got {date_string!r}"
        )
    return datetime.strptime(date_string, "%Y%m%d%H")


def advance_time(date_string: str, hours: int) -> str:
    """Advance a date string by a specified number of hours.

    Replaces advance_time.sh and advance_cymdh.pl with native Python.
    Handles leap years and month/year boundaries. Dates are parsed as
    naive datetimes, so no DST shifts are applied.
    """
    dt = _parse_date(date_string) + timedelta(hours=hours)
    return dt.strftime("%Y%m%d%H")
The module also consolidates parse_datetime() (flexible datetime parsing, previously
duplicated in config.schema and downloader) and iter_hours() (hour-range
iteration, previously in downloader).
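The signature of iter_hours() is not shown in this document, so the following is only a sketch of what hour-range iteration might look like. The name iter_hours_sketch, its parameters, and the inclusive end bound are all assumptions.

```python
from collections.abc import Iterator
from datetime import datetime, timedelta


def iter_hours_sketch(start: datetime, end: datetime) -> Iterator[datetime]:
    """Yield hourly datetimes from start up to and including end (assumed)."""
    current = start
    while current <= end:
        yield current
        current += timedelta(hours=1)


# Crossing a leap day: 2024-02-28 23:00 through 2024-03-01 00:00.
stamps = [
    dt.strftime("%Y%m%d%H")
    for dt in iter_hours_sketch(datetime(2024, 2, 28, 23), datetime(2024, 3, 1, 0))
]
```

The range covers 26 hourly stamps because 2024-02-29 exists; the datetime arithmetic handles that without any special-casing.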
Impact: The build_environment() method precomputes shared date-derived values,
then delegates model-specific env vars to model_config.build_environment():
# Shared dates computed once in Python, passed to bash scripts
env["FORCING_BEGIN_DATE"] = f"{pdycyc}00"
env["FORCING_END_DATE"] = forcing_end_dt.strftime("%Y%m%d%H00")
env["END_DATETIME"] = forcing_end_dt.strftime("%Y%m%d%H")
# Model-specific env vars (e.g., SCHISM_BEGIN_DATE, OMP_NUM_THREADS)
env = self.config.model_config.build_environment(env, self.config)
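As a self-contained illustration of the precomputation, the sketch below derives the shared date variables from a start time and duration. forcing_date_env is a hypothetical function, and taking the forcing end as start plus duration_hours is an assumption modelled on the original advance_time.sh call.

```python
from datetime import datetime, timedelta


def forcing_date_env(start: datetime, duration_hours: int) -> dict[str, str]:
    """Derive the date-related environment variables from the simulation window."""
    # Assumption: forcing ends exactly duration_hours after the start.
    end = start + timedelta(hours=duration_hours)
    pdycyc = start.strftime("%Y%m%d%H")
    return {
        "STARTPDY": start.strftime("%Y%m%d"),
        "STARTCYC": start.strftime("%H"),
        "FORCING_BEGIN_DATE": f"{pdycyc}00",
        "FORCING_END_DATE": end.strftime("%Y%m%d%H00"),
        "END_DATETIME": end.strftime("%Y%m%d%H"),
    }


env = forcing_date_env(datetime(2023, 6, 11, 0), duration_hours=24)
```

Every value is a plain string by the time it reaches a bash script, so the scripts never need to perform date arithmetic themselves.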
2. Integrated Data Downloading with Validation#
Decision: Build a comprehensive downloader with source awareness and date range validation.
Rationale:
- Different data sources have different availability windows
- Users shouldn't waste time on downloads that will fail
- Async downloading is faster than sequential
Implementation (downloader.py):
DATA_SOURCE_DATE_RANGES: dict[str, dict[str, DateRange]] = {
    "nwm_retro": {
        "conus": DateRange(
            start=datetime(1979, 2, 1),
            end=datetime(2023, 1, 31),
            description="NWM Retrospective 3.0 (CONUS)",
        ),
        "hawaii": DateRange(
            start=datetime(1994, 1, 1),
            end=datetime(2013, 12, 31),
            description="NWM Retrospective 3.0 (Hawaii)",
        ),
        # ...
    },
    "stofs": {
        "_default": DateRange(
            start=datetime(2020, 12, 30),
            end=None,  # operational, no end date
            description="STOFS (operational)",
        ),
    },
}


def download_data(
    start_time: datetime,
    end_time: datetime,
    output_dir: Path,
    domain: Domain,
    meteo_source: MeteoSource = "nwm_retro",
    coastal_source: CoastalSource = "stofs",
) -> DownloadResults:
    """Download with validation and progress tracking."""
    # Validates dates before downloading
    errors = _validate_date_ranges(start_time, end_time, meteo_source, coastal_source, domain)
    if errors:
        raise ValueError("Date range validation failed:\n" + "\n".join(errors))
    # Uses tiny_retriever for async parallel downloads
    download(urls, paths, timeout=timeout)
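The body of _validate_date_ranges is not shown above. The sketch below illustrates the kind of check it presumably performs for a single source; check_window is a hypothetical helper, and the DateRange fields simply mirror the keyword arguments used in the excerpt.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    start: datetime
    end: Optional[datetime]  # None means the source is still operational
    description: str


def check_window(start: datetime, end: datetime, available: DateRange) -> list[str]:
    """Return one error per bound of the request that falls outside availability."""
    errors: list[str] = []
    if start < available.start:
        errors.append(
            f"{available.description}: data begins {available.start:%Y-%m-%d}, "
            f"requested start is {start:%Y-%m-%d}"
        )
    if available.end is not None and end > available.end:
        errors.append(
            f"{available.description}: data ends {available.end:%Y-%m-%d}, "
            f"requested end is {end:%Y-%m-%d}"
        )
    return errors


hawaii_retro = DateRange(
    datetime(1994, 1, 1), datetime(2013, 12, 31), "NWM Retrospective 3.0 (Hawaii)"
)
too_late = check_window(datetime(2015, 1, 1), datetime(2015, 1, 2), hawaii_retro)
in_range = check_window(datetime(2000, 1, 1), datetime(2000, 1, 2), hawaii_retro)
```

Collecting messages instead of raising on the first problem lets the caller report every out-of-range source in a single ValueError.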
3. Configuration Over Convention#
Decision: Use explicit YAML configuration with sensible defaults.
Rationale:
- Original relied on implicit conventions (file locations, naming patterns)
- Explicit configuration is self-documenting
- Easier to version control and share
Example SCHISM configuration:
simulation:
  start_date: '2023-06-11T00:00:00'
  duration_hours: 24
  coastal_domain: pacific
  meteo_source: nwm_retro
boundary:
  source: tpxo # or: source: stofs
paths:
  work_dir: /ngen-test/coastal_runs/my_run
  raw_download_dir: /ngen-test/data/downloads
# SCHISM compute parameters (model_config defaults to SchismModelConfig)
model_config:
  nodes: 2
  ntasks_per_node: 18
  nscribes: 2
  omp_num_threads: 2
download:
  enabled: true
  skip_existing: true
Example SFINCS configuration:
model: sfincs
simulation:
  start_date: 2025-06-01
  duration_hours: 168
  coastal_domain: atlgulf
  meteo_source: nwm_ana
boundary:
  source: stofs
model_config:
  prebuilt_dir: /path/to/texas/model
  include_noaa_gages: true
  forcing_to_mesh_offset_m: 0.0 # STOFS already in mesh datum
  vdatum_mesh_to_msl_m: 0.171 # mesh datum → MSL for obs comparison
  omp_num_threads: 36
download:
  enabled: true
  skip_existing: true
4. Direct Execution Inside SLURM Jobs (run Command)#
Decision: Provide a run command for direct, in-process execution inside
user-written sbatch scripts.
Rationale:
Users need full control over SLURM resource allocation, for example when using
non-default partitions, requesting specific hardware, or embedding the workflow in a
larger pipeline. The run command executes all stages locally on whatever resources are
already allocated, making it ideal for use inside manually written sbatch scripts.
Usage pattern (preferred on clusters):
The recommended approach on clusters is to write an sbatch script that creates a YAML
configuration inline using a heredoc and passes it to coastal-calibration run. This is
the preferred method because:
- The SLURM directives in the sbatch script control resource allocation, while the YAML controls workflow configuration
- Everything is contained in a single file that can be submitted with sbatch
- No separate YAML file needs to be managed or kept in sync with SLURM settings
- The heredoc is self-documenting: reviewers can see the exact configuration used
#!/usr/bin/env bash
#SBATCH --job-name=coastal_schism
#SBATCH --partition=c5n-18xlarge
#SBATCH -N 2
#SBATCH --ntasks-per-node=18
#SBATCH --exclusive
#SBATCH --output=slurm-%j.out
CONFIG_FILE="/tmp/coastal_config_${SLURM_JOB_ID}.yaml"
cat > "${CONFIG_FILE}" <<'EOF'
model: schism
simulation:
  start_date: 2021-01-01
  duration_hours: 12
  coastal_domain: hawaii
  meteo_source: nwm_retro
boundary:
  source: tpxo
model_config:
  include_noaa_gages: true
EOF
/ngen-test/coastal-calibration/coastal-calibration run "${CONFIG_FILE}"
rm -f "${CONFIG_FILE}"
Design choices:
- The config filename includes $SLURM_JOB_ID to avoid collisions when multiple jobs run concurrently
- Single-quoted heredoc (<<'EOF') prevents accidental shell variable expansion inside the YAML
- Complete examples for both SCHISM and SFINCS are provided in docs/examples/
5. Stable Public API with Incremental Internal Rewrite#
Decision: Establish a clean, stable public API while embedding existing scripts as a transitional measure.
Rationale:
The primary goal of this rewrite is to create an intuitive, user-friendly, and extensible workflow system. The existing bash and Python scripts are difficult to maintain and not performant. However, rewriting everything at once would:
- Delay delivery of a usable tool to users
- Risk introducing regressions without a baseline
- Require extensive testing before any release
Strategy:
The architecture deliberately separates public API from private implementation:
| Layer | Components | Stability |
|---|---|---|
| Public API | CoastalCalibConfig, CoastalCalibRunner, CLI | Stable |
| Stage Interface | WorkflowStage.run(), .validate(), .build_environment() | Stable |
| Private Implementation | Bash scripts → Pure Python | Evolving |
This allows:
- Users get a stable interface today - The CLI and Python API won't change as internals evolve
- Incremental rewriting - Each stage can be rewritten independently without affecting others
- Testing baseline first - Establish test coverage against current behavior before changes
- Performance optimization - Replace bash subprocess calls with native Python as needed
Current State:
- Package includes scripts/ directory with embedded bash scripts
- WorkflowStage.run_singularity_command() provides abstraction layer
- Python precomputes all environment variables, minimizing bash complexity
Future Direction:
- Add comprehensive integration tests capturing current behavior
- Incrementally rewrite stages in pure Python (starting with simpler stages)
- Deprecate bash scripts as Python replacements are validated
- Optimize performance-critical paths (file I/O, data processing)
6. Strict Type Checking with pyright#
Decision: Use strict pyright mode for static type analysis.
Rationale:
- Catches errors before runtime
- Enables IDE features (autocomplete, refactoring)
- Self-documents function signatures
Configuration (pyproject.toml):
Substantial Improvements#
1. Error Handling and Validation#
| Aspect | Original | New |
|---|---|---|
| Configuration validation | None | 12+ checks in CoastalCalibConfig.validate() |
| Stage validation | None | Each stage has validate() method |
| Error messages | Exit codes only | Detailed, actionable messages |
| Recovery | Manual restart | Partial workflow execution with --start-from |
Validation examples:
def validate(self) -> list[str]:
    errors = []
    # Shared validation
    if self.simulation.duration_hours <= 0:
        errors.append("simulation.duration_hours must be positive")
    if (
        self.boundary.source == "stofs"
        and not self.boundary.stofs_file
        and not self.download.enabled
    ):
        errors.append(
            "boundary.stofs_file required when using STOFS source and download is disabled"
        )
    # Model-specific validation (delegated to ModelConfig subclass)
    errors.extend(self.model_config.validate(self))
    return errors
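The delegation to the model config can be sketched in isolation as follows. ModelConfigSketch and SchismSketch are reduced stand-ins, and the nscribes rule is a hypothetical check invented for illustration, not a rule taken from the package.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class ModelConfigSketch(ABC):
    """Reduced stand-in for the ModelConfig ABC."""

    @abstractmethod
    def validate(self) -> list[str]:
        """Return model-specific validation errors."""


@dataclass
class SchismSketch(ModelConfigSketch):
    nodes: int = 2
    ntasks_per_node: int = 18
    nscribes: int = 2

    def validate(self) -> list[str]:
        errors = []
        # Hypothetical rule: scribes must leave at least one compute rank.
        if self.nscribes >= self.nodes * self.ntasks_per_node:
            errors.append("model_config.nscribes must be less than total MPI tasks")
        return errors


def validate_all(duration_hours: int, model: ModelConfigSketch) -> list[str]:
    """Run the shared checks first, then append the model-specific ones."""
    errors = []
    if duration_hours <= 0:
        errors.append("simulation.duration_hours must be positive")
    errors.extend(model.validate())
    return errors


problems = validate_all(0, SchismSketch(nodes=1, ntasks_per_node=2, nscribes=2))
```

Adding a new model then means implementing validate() on its config class; the shared checks stay untouched.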
2. Progress Tracking and Monitoring#
Original: No progress tracking, just log messages scattered in bash scripts.
New: Structured monitoring with stage context:
class WorkflowMonitor:
    """Monitors and logs workflow execution progress."""

    def register_stages(self, stages: list[str]) -> None:
        """Register stages for progress tracking."""

    @contextmanager
    def stage_context(self, stage_name: str, description: str):
        """Context manager for stage execution with timing."""
        self.info(f"Starting stage: {stage_name} - {description}")
        start = time.perf_counter()
        try:
            yield
            duration = time.perf_counter() - start
            self.info(f"Completed stage: {stage_name} in {duration:.1f}s")
            self.progress[stage_name] = "completed"
        except Exception:
            self.progress[stage_name] = "failed"
            raise

    def save_progress(self, path: Path) -> None:
        """Save progress to JSON for resumption."""
3. CLI with Multiple Entry Points#
# Initialize configuration for a domain
coastal-calibration init config.yaml --domain hawaii
# Validate configuration
coastal-calibration validate config.yaml
# Run workflow (inside an sbatch script or locally)
coastal-calibration run config.yaml
# Dry-run to validate without executing
coastal-calibration run config.yaml --dry-run
# Run partial workflow
coastal-calibration run config.yaml --start-from update_params --stop-after boundary_conditions
# List available stages
coastal-calibration stages
4. Dual API: CLI and Programmatic#
# Python API
from coastal_calibration import CoastalCalibConfig, CoastalCalibRunner

config = CoastalCalibConfig.from_yaml("config.yaml")
runner = CoastalCalibRunner(config)

# Validate first
errors = runner.validate()
if errors:
    print("Validation failed:", errors)
else:
    result = runner.run()
    print(f"Success: {result.success}")
5. Comprehensive Downloader#
| Feature | Original | New |
|---|---|---|
| Data sources | Manual AWS CLI | NWM Retro, NWM Ana, STOFS, GLOFS |
| Date validation | None | Checks against known availability |
| Parallel download | None | Async with tiny_retriever |
| Skip existing | None | skip_existing=True option |
| Progress tracking | None | Success/failure counts |
| Domain awareness | Manual | Automatic URL building |
6. Results Serialization#
@dataclass
class WorkflowResult:
    success: bool
    job_id: str | None
    start_time: datetime
    end_time: datetime | None
    stages_completed: list[str]
    stages_failed: list[str]
    outputs: dict[str, Any]
    errors: list[str]

    @property
    def duration_seconds(self) -> float | None:
        if self.end_time:
            return (self.end_time - self.start_time).total_seconds()
        return None

    def save(self, path: Path) -> None:
        """Save result to JSON for post-processing."""
API Reference#
Configuration Classes#
| Class | Purpose |
|---|---|
| CoastalCalibConfig | Root configuration container |
| SimulationConfig | Time, domain, and source settings |
| BoundaryConfig | TPXO vs STOFS selection |
| PathConfig | All file and directory paths |
| ModelConfig | ABC for model-specific configuration |
| SchismModelConfig | SCHISM compute, MPI, and stage settings |
| SfincsModelConfig | SFINCS model paths, OpenMP, and stage settings |
| MonitoringConfig | Logging and progress tracking |
| DownloadConfig | Data download settings |
SCHISM Workflow Stages#
| Stage | Class | Description |
|---|---|---|
| download | DownloadStage | Download NWM/STOFS/GLOFS data |
| pre_forcing | PreForcingStage | Prepare forcing directories and symlinks |
| nwm_forcing | NWMForcingStage | Run WRF-Hydro forcing engine (MPI) |
| post_forcing | PostForcingStage | Post-process forcing files |
| update_params | UpdateParamsStage | Generate SCHISM param.nml |
| schism_obs | SchismObsStage | Discover NOAA stations and write station.in |
| boundary_conditions | BoundaryConditionStage | TPXO or STOFS boundary generation |
| pre_schism | PreSCHISMStage | Prepare SCHISM inputs |
| schism_run | SCHISMRunStage | Execute pschism binary (MPI) |
| post_schism | PostSCHISMStage | Validate and post-process outputs |
| schism_plot | SchismPlotStage | Plot simulated vs observed water levels (with datum conversion) |
SFINCS Workflow Stages#
| Stage | Class | Description |
|---|---|---|
| download | DownloadStage | Download NWM/STOFS data |
| sfincs_symlinks | SFINCSSymlinksStage | Create .nc symlinks for NWM data |
| sfincs_data_catalog | SFINCSDataCatalogStage | Generate HydroMT data catalog |
| sfincs_init | SfincsInitStage | Initialize SFINCS model + clean stale files |
| sfincs_timing | SfincsTimingStage | Set SFINCS timing |
| sfincs_forcing | SfincsForcingStage | Add water level forcing (IDW interpolation) |
| sfincs_obs | SfincsObsStage | Add observation points |
| sfincs_discharge | SfincsDischargeStage | Add discharge sources (active-cell filter) |
| sfincs_precip | SfincsPrecipitationStage | Add precipitation forcing + clip meteo grid |
| sfincs_wind | SfincsWindStage | Add wind forcing + clip meteo grid |
| sfincs_pressure | SfincsPressureStage | Add pressure forcing + clip meteo grid |
| sfincs_write | SfincsWriteStage | Write SFINCS model |
| sfincs_run | SfincsRunStage | Run SFINCS (native binary, OpenMP) |
| sfincs_plot | SfincsPlotStage | Plot simulated vs observed water levels (with datum conversion) |
Potential Future Developments#
Vision: Unified Workflow Architecture#
The overarching goal is to make the SCHISM and SFINCS workflows architecturally consistent. Every coastal model workflow is conceptually the same four-phase pipeline:
Model Creation ──► Model Preparation ──► Model Execution ──► Evaluation
(mesh, config) (forcing, boundaries) (run the solver) (obs vs sim)
The SFINCS workflow already follows this pattern cleanly: users provide a pre-built
model (prebuilt_dir), the Python pipeline adds forcing/boundaries/observations, then a
single native binary call runs the solver. The SCHISM workflow, by contrast, conflates
model creation and preparation inside monolithic bash scripts with hardcoded paths to a
pre-built model on the cluster. The future direction is to bring SCHISM in line with
SFINCS.
The end state is two purpose-built containers (SCHISM, ESMF) plus a natively compiled SFINCS binary:
| Component | Purpose | Invocation |
|---|---|---|
| SFINCS | SFINCS solver (OpenMP, single-node) | Native binary (compiled via pixi) |
| SCHISM | SCHISM solver + mesh partitioning (MPI, multi-node) | singularity exec (single call) |
| ESMF regridding | NWM forcing + STOFS boundary regridding (MPI Python + ESMPy) | singularity exec (single call) |
Current State: SCHISM vs SFINCS Architectural Gap#
| Aspect | SFINCS (target pattern) | SCHISM (current) |
|---|---|---|
| Model input | prebuilt_dir (user-provided) | Hardcoded paths in /ngwpc-coastal/parm/ |
| Model manipulation | Python (HydroMT-SFINCS library) | Bash scripts inside Singularity |
| Forcing generation | Pure Python (xarray, rasterio) | MPI Python + bash wrappers (container) |
| Boundary conditions | Pure Python (IDW interpolation) | Fortran binary (predict_tide) or MPI Python |
| Configuration | sfincs.inp read/written by Python | param.nml generated by 230-line bash |
| Execution | Single native binary call | 9 separate singularity exec calls |
| Pre-run stages | 12 stages, 11 pure Python | 9 stages, only 2 pure Python |
| Bash dependency | 0 bash scripts | 15 bash scripts (~1,000 lines) |
| Embedded Python | 0 (all in package proper) | 6 scripts (~1,100 lines) in scripts/ |
What the current monolithic Singularity container bundles:
- pschism binary (the SCHISM solver, compiled with MPI + ParMETIS)
- metis_prep and gpmetis binaries (mesh partitioning for parallel execution)
- combine_hotstart7 binary (hot-start file post-processing)
- predict_tide (OTPS Fortran binary for TPXO tidal prediction)
- Conda environments with ESMF-based Python scripts
- All of the above run via 9 separate singularity exec calls per workflow
Phase 1: Pre-Built SCHISM Model (prebuilt_dir)#
Goal: Accept a pre-built SCHISM model directory, just like SFINCS.
A pre-built SCHISM model directory would contain the mesh and static configuration:
prebuilt_dir/
hgrid.gr3 # Unstructured mesh (required)
hgrid.nc # Same mesh in NetCDF (required for ESMF regridding)
vgrid.in # Vertical grid specification
param.nml.template # Namelist template (dates/paths filled at runtime)
bctides.in # Tidal boundary setup (optional, for TPXO)
station.in # Observation stations (optional, auto-generated if absent)
Currently, these files live at hardcoded cluster paths
(${parm_dir}/parm/coastal/{domain}/) and are symlinked into the work directory by
update_param.bash. The refactoring moves them into a user-provided directory.
Changes required:
- Add prebuilt_dir: Path to SchismModelConfig (mirrors SfincsModelConfig)
- Add validation that prebuilt_dir contains required files
- Replace the symlink logic in update_params bash with a Python init stage that copies/symlinks from prebuilt_dir to work_dir
- Replace param.nml generation (currently 230 lines of bash in update_param.bash) with Python using f90nml to read the template and fill runtime values
- Remove the parm_dir and nwm_dir path dependencies from PathConfig
Backward compatibility: The current parm_dir-based paths can be preserved as a
fallback: if prebuilt_dir is not set, construct it from
${parm_dir}/parm/coastal/{domain}/ to maintain cluster compatibility during
transition.
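The fallback and the required-file validation could look roughly like the sketch below. resolve_prebuilt_dir and missing_files are hypothetical helper names, and REQUIRED_FILES lists only the two files marked "required" in the directory layout above.

```python
from pathlib import Path
from typing import Optional

# Only the files marked "required" in the prebuilt_dir layout above.
REQUIRED_FILES = ("hgrid.gr3", "hgrid.nc")


def resolve_prebuilt_dir(
    prebuilt_dir: Optional[Path], parm_dir: Path, domain: str
) -> Path:
    """Prefer an explicit prebuilt_dir; otherwise fall back to the parm_dir layout."""
    if prebuilt_dir is not None:
        return prebuilt_dir
    return parm_dir / "parm" / "coastal" / domain


def missing_files(prebuilt_dir: Path) -> list[str]:
    """Return the names of required files that are absent from prebuilt_dir."""
    return [name for name in REQUIRED_FILES if not (prebuilt_dir / name).is_file()]


fallback = resolve_prebuilt_dir(None, Path("/ngen-test/coastal/ngwpc-coastal"), "hawaii")
explicit = resolve_prebuilt_dir(Path("/data/my_model"), Path("/unused"), "hawaii")
```

Keeping the fallback in one function means the rest of the pipeline only ever sees a resolved directory, whichever way it was configured.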
Phase 2: Pure-Python Model Preparation#
Goal: Rewrite all pre-run SCHISM stages in pure Python, eliminating the bash scripts and the need to run preparation stages inside the Singularity container.
Stage-by-stage rewrite plan:
| Stage | Current | Replacement |
|---|---|---|
| update_params | update_param.bash (230 LOC) | f90nml template fill |
| pre_forcing | pre_nwm_forcing_coastal.bash | pathlib + shutil (symlinks) |
| post_forcing | makeAtmo.py (232 LOC) | Absorb into package as module |
| boundary (TPXO) | predict_tide Fortran binary | pure Python TPXO |
| pre_schism | pre_schism.bash (56 LOC) | makeDischarge.py already Python |
| post_schism | post_schism.bash (37 LOC) | pathlib + NetCDF4 checks |
Scripts that can be absorbed immediately (already Python, just need to move out of
scripts/ into proper package modules):
- makeAtmo.py (232 LOC) - atmospheric post-processing
- makeDischarge.py (139 LOC) - discharge source generation
- merge_source_sink.py (166 LOC) - source/sink merging
- correct_elevation.py (32 LOC) - elevation correction
- otps_to_open_bnds_hgrid.py (108 LOC) - TPXO output parsing
These 5 Python files total ~677 lines and are already functional Python code that just needs to be brought under the package's type checking, testing, and import system.
Phase 3: Purpose-Built Containers#
Goal: Replace the current monolithic container with two focused containers (SCHISM solver and ESMF regridding), each with a single responsibility, alongside the natively compiled SFINCS binary.
SFINCS Binary (done)#
SFINCS no longer uses a container. The solver is compiled natively from source (via pixi activation script or manual build) and runs directly as a native binary with OpenMP parallelism.
SCHISM Container (new, solver only)#
A minimal container with only the binaries needed to run and partition a SCHISM model:
- pschism (the SCHISM solver, compiled with MPI + NetCDF)
- metis_prep + gpmetis (mesh partitioning for parallel execution)
- combine_hotstart7 (combines distributed hot-start files after a run)
- OpenMPI runtime and InfiniBand/network libraries for multi-node MPI
- HDF5/NetCDF4 Fortran libraries
This container is invoked twice: once for mesh partitioning (metis_prep + gpmetis)
and once for the solver (mpiexec pschism). Both are simple singularity exec calls.
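A host-side helper for these calls could look like the sketch below. It only builds the argument list and does not run anything. The helper name singularity_exec, the image path, and the choice to place the MPI launcher in front of singularity exec are assumptions for illustration; the launcher could equally run inside the container, depending on how MPI is set up on the cluster.

```python
from pathlib import Path
from typing import Sequence


def singularity_exec(
    image: Path,
    binds: Sequence[Path],
    command: Sequence[str],
    launcher: Sequence[str] = (),
) -> list[str]:
    """Build an argv list for running `command` inside a container image.

    Hypothetical helper: `launcher` is an optional prefix such as
    ["mpiexec", "-n", "4"] placed before `singularity exec`.
    """
    cmd = [*launcher, "singularity", "exec"]
    for path in binds:
        # Bind each host path at the same location inside the container.
        cmd += ["--bind", str(path)]
    cmd.append(str(image))
    cmd += list(command)
    return cmd


# Example: solver invocation with a placeholder image and work directory.
solver_cmd = singularity_exec(
    Path("schism.sif"),
    [Path("/scratch/run")],
    ["pschism"],
    launcher=["mpiexec", "-n", "4"],
)
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems that the bash wrappers had to handle manually.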
Removed from container (moved to host-side Python):
- All bash wrapper scripts
- Conda environments
- predict_tide (replaced by pure Python TPXO)
- NWM USH/EXEC scripts
ESMF Regridding Container (new, MPI Python + ESMPy)#
The NWM forcing engine (workflow_driver.py) and STOFS boundary regridding
(regrid_estofs.py) both depend on ESMPy (import ESMF), which is the Python interface
to the ESMF (Earth System Modeling Framework) regridding library. ESMPy itself requires
MPI and performs parallel regridding of NWM meteorological fields (wind, pressure,
precipitation) and STOFS water levels onto the SCHISM unstructured mesh.
These ESMF dependencies are heavyweight (MPI-aware C/Fortran libraries + Python bindings) and do not belong in either the SCHISM solver container or the host Python environment. A dedicated ESMF container isolates this concern:
- ESMPy (ESMF Python module) with MPI support
- workflow_driver.py - regrids NWM forcing fields to SCHISM mesh via ESMF
- regrid_estofs.py - regrids STOFS water levels to SCHISM open boundaries via ESMF
- Python scientific stack (numpy, netCDF4, xarray)
- OpenMPI runtime matching the cluster
This container is invoked via singularity exec with mpiexec for the two ESMF-based
stages (nwm_forcing and STOFS boundary_conditions). This is the last container to
address since both scripts are already functional MPI Python programs.
Target Architecture#
Host (Python) Containers
────────────── ──────────
download ─── pure Python (no container)
pre_forcing ─── pure Python (no container)
nwm_forcing ─── ESMF container (mpiexec + workflow_driver.py)
post_forcing ─── pure Python (no container)
update_params ─── pure Python (no container)
schism_obs ─── pure Python (no container)
boundary (TPXO) ─── pure Python (no container)
boundary (STOFS) ─── ESMF container (mpiexec + regrid_estofs.py)
pre_schism ─── SCHISM container (metis_prep + gpmetis)
schism_run ─── SCHISM container (mpiexec + pschism)
post_schism ─── pure Python (no container)
schism_plot ─── pure Python (no container)
Bind mounts simplified from 15+ to 2-3 per container (work_dir + MPI libs).
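The stage-to-runtime mapping above could be encoded as plain data so the runner can decide whether a stage needs a container. The sketch below is one possible encoding: the Runtime enum, the STAGE_RUNTIME table, and the keys boundary_tpxo and boundary_stofs are names invented for this example, while the assignments themselves follow the target architecture listed above.

```python
from enum import Enum


class Runtime(Enum):
    HOST = "host"      # pure Python, no container
    ESMF = "esmf"      # ESMF regridding container
    SCHISM = "schism"  # SCHISM solver container


# Hypothetical lookup table mirroring the target architecture.
STAGE_RUNTIME = {
    "download": Runtime.HOST,
    "pre_forcing": Runtime.HOST,
    "nwm_forcing": Runtime.ESMF,
    "post_forcing": Runtime.HOST,
    "update_params": Runtime.HOST,
    "schism_obs": Runtime.HOST,
    "boundary_tpxo": Runtime.HOST,
    "boundary_stofs": Runtime.ESMF,
    "pre_schism": Runtime.SCHISM,
    "schism_run": Runtime.SCHISM,
    "post_schism": Runtime.HOST,
    "schism_plot": Runtime.HOST,
}


def stages_for(runtime: Runtime) -> list[str]:
    """Return stage names that execute in the given runtime, in order."""
    return [name for name, rt in STAGE_RUNTIME.items() if rt is runtime]
```

With this table, only four of the twelve stages would touch a container at all, which is what makes the reduced bind-mount list feasible.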
Phase 4: Evaluation and Visualization#
Goal: Expand the existing schism_plot and sfincs_plot stages into a unified
evaluation framework.
Both SchismPlotStage and SfincsPlotStage are already pure Python, query NOAA CO-OPS
observations, and generate comparison plots. Future enhancements:
- Unified EvaluationStage base class for both models
- Statistical metrics (RMSE, bias, correlation, skill scores)
- Multi-station summary dashboards
- Time series export (CSV/Parquet) for downstream analysis
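To make the statistical metrics concrete, the sketch below computes bias, RMSE, and Pearson correlation for one station using only the standard library. The function name evaluate and the returned dictionary keys are assumptions for illustration; skill scores are omitted, and a real implementation would likely operate on aligned, gap-filtered time series.

```python
import math
from typing import Sequence


def evaluate(
    simulated: Sequence[float], observed: Sequence[float]
) -> dict[str, float]:
    """Compute bias, RMSE, and Pearson correlation for paired samples."""
    n = len(simulated)
    if n != len(observed) or n < 2:
        raise ValueError("Need two equal-length series with at least 2 points")
    errors = [s - o for s, o in zip(simulated, observed)]
    bias = sum(errors) / n  # mean of (simulated - observed)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_s = sum(simulated) / n
    mean_o = sum(observed) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(simulated, observed))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    denom = math.sqrt(var_s * var_o)
    # Correlation is undefined when either series is constant.
    corr = cov / denom if denom > 0 else float("nan")
    return {"bias": bias, "rmse": rmse, "correlation": corr}
```

A simulated series that is uniformly 0.5 m below the observations would yield a bias of -0.5, an RMSE of 0.5, and a correlation of 1.0, which shows why bias and correlation need to be reported together.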
Near-Term Priorities#
- Pure-Python TPXO - Replaces the predict_tide Fortran binary (currently run inside the SCHISM Singularity container) with a pure-Python implementation.
- Absorb embedded Python scripts - The 5 Python files in scripts/ (677 LOC) are already functional Python. Moving them into the package proper brings them under type checking, testing, and import hygiene with minimal risk.
- f90nml-based param.nml generation - Replaces the largest bash script (update_param.bash, 230 LOC) and unblocks the prebuilt_dir pattern.
Feature Expansion#
- Hot Start Chain Automation
  - Automatic hot-start file discovery
  - Multi-run chaining for long simulations
- Ensemble Runs
  - Multiple configurations from single base
  - Parallel SLURM array jobs
- Cloud-Native Deployment
  - AWS Batch support
  - Container-native execution (no Singularity)
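The ensemble idea (multiple configurations from a single base, one per SLURM array task) can be sketched as follows. This is a hedged illustration, not a planned API: expand_ensemble, set_nested, select_member, and the dotted-key convention are all assumptions for this example. SLURM_ARRAY_TASK_ID is the environment variable SLURM sets for each task of an array job.

```python
import copy
import itertools
import os
from typing import Any, Mapping, Sequence


def set_nested(config: dict, dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"] = value for dotted_key "a.b"."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def expand_ensemble(
    base: Mapping[str, Any], sweeps: Mapping[str, Sequence[Any]]
) -> list[dict]:
    """Return one config per combination of swept parameter values."""
    names = list(sweeps)
    members = []
    for combo in itertools.product(*(sweeps[name] for name in names)):
        member = copy.deepcopy(dict(base))  # never mutate the base config
        for name, value in zip(names, combo):
            set_nested(member, name, value)
        members.append(member)
    return members


def select_member(
    members: Sequence[dict], env: Mapping[str, str] = os.environ
) -> dict:
    """Pick the member for this array task (defaults to the first)."""
    return members[int(env.get("SLURM_ARRAY_TASK_ID", "0"))]
```

Each array task would regenerate the same deterministic member list and pick its own entry, so no per-member configuration files need to be written ahead of time.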
Conclusion#
The coastal-calibration package represents a substantial modernization of the original
bash-based workflow:
| Metric | Original | New | Improvement |
|---|---|---|---|
| Lines of bash | ~2,500 | ~500 (embedded) | 80% reduction |
| Lines of Python | ~200 (scattered) | ~4,000 (structured) | Full rewrite |
| Configuration | Environment variables | Typed YAML | Type-safe |
| Error handling | Exit codes | Exceptions + validation | Comprehensive |
| Testing | None | pytest + pyright | CI-ready |
| Documentation | Comments only | Docstrings + types | Self-documenting |
| Extensibility | Copy & modify scripts | Inherit WorkflowStage | Object-oriented |
| Model support | SCHISM only | SCHISM + SFINCS | Polymorphic |
The architecture is designed for maintainability, extensibility, and correctness while
supporting multiple coastal models (SCHISM and SFINCS) through a polymorphic
ModelConfig ABC and preserving compatibility with the existing HPC infrastructure.