We routed 21 training jobs to the cleanest available region. Norway won every time, reducing emissions 93% vs. the Ohio baseline. But the real lesson is simpler than we expected.
The Story So Far
In Part 1, we showed that shifting ML training jobs to cleaner time windows could reduce carbon emissions by ~30%. In Part 2, we scaled up to a real hyperparameter search: 21 LoRA fine-tuning experiments orchestrated with Ray, achieving a 40% reduction in carbon emissions with a tiered scheduling strategy.
Both experiments used Compute Gardener's scheduler to make temporal decisions: not whether to run a job, but when. And both ran on hardware in our lab — GPUs we own, in a single location, drawing from a single grid region.
Part 3 asks a different question: what if we could choose where to run?
Why Spatial Shifting Matters
Temporal shifting works because grid carbon intensity varies throughout the day. But it also varies across grids — dramatically. At any given moment, Oregon's hydro-heavy grid might be running at 80 gCO2/kWh while Virginia's gas-heavy grid sits at 350. That's more than a 4x difference, available instantly, no waiting required.
For most ML scientists and engineers, this is probably more accessible than temporal shifting. You may not have a dedicated lab with GPUs and a Kubernetes scheduler you control. But you can (sometimes) choose a cleaner cloud region to spin up your training job in. The question is whether anyone actually does — and whether the savings justify the effort.
Why Another Experiment?
The academic literature is full of simulations showing 20-40% potential savings from spatial shifting. Could a simulation predict that clean grids beat dirty ones? Sure. But simulations don't tell you whether Vast.ai actually maintains GPU availability across regions, whether Tailscale works reliably across continents, whether the Electricity Maps API is stable enough for real-time routing, or whether a prototype can be set up in an afternoon.
Part 3 extends our real-world testing to cloud infrastructure, where most ML training actually happens. The experiment validates operational feasibility, not just theoretical carbon math.
An Experiment in Routing
Compute Gardener's scheduler currently handles temporal shifting within a single cluster. Spatial shifting is on the roadmap, but rather than wait for the perfect design, we experimented with the simplest approach: a routing script that checks carbon intensity across three regions and sends each job to the cleanest one.
This isn't what a production spatial shifting system would look like. (For anyone curious, we'll probably build on top of a gitops or hub+spoke multi-cluster strategy.) But it's small enough to build quickly and honest enough to teach us where the complexity actually lives.
The Setup
We kept the workload familiar: LoRA fine-tuning of Qwen2.5-Coder-7B, the same model from Part 2. But we stripped the infrastructure down to essentials: no Kubernetes, no Ray orchestration. Just three GPU instances, SSH and a routing script.
Three regions, three grid profiles:
- Quebec: Hydro-Quebec grid, one of the cleanest grids in North America. A viable clean option that's geographically close for US-based teams.
- Ohio: PJM grid, gas/coal-heavy, typically the highest carbon intensity of our three options. Representative of the "default" US East/Virginia/Midwest data centers where most workloads already run.
- Norway: Nordic grid, among the cleanest in the world due to hydro dominance. The "what if we could route to fundamentally different energy infrastructure" option.
We chose these regions to represent a practical range: the status quo (Ohio/Virginia), a clean North American alternative (Quebec) and an aspirational clean option (Norway). This creates a stark contrast — that's intentional. The question isn't whether clean grids exist, but whether dynamic routing into such a pool is accessible or necessary.
Hardware: NVIDIA RTX 4090 GPUs (24GB VRAM) rented via Vast.ai in Quebec, Ohio and Norway.
Network: Tailscale VPN connecting all instances to the router script running on a local machine, simplifying SSH access without exposing instances to the public internet.
Routing logic: At each scheduled interval, the script queries Electricity Maps for current carbon intensity in each region, picks the lowest, SSHs into that instance and kicks off training. No complex orchestration — just python train_lora_cloud.py --r X --alpha Y --lr Z executed remotely.
The script waits for each job to complete (~45 minutes), then waits 2 hours before the next submission. This gives roughly 2.75-hour intervals between job starts, allowing us to sample different times of day across all three grids.
The Infrastructure
Router script (runs locally in our California lab):
import subprocess

import requests

# REGIONS, ELECTRICITY_MAPS_TOKEN and SSH_KEY_PATH come from the script's
# config section (instance IPs, grid zone keys, API token, SSH key path).

def pick_cleanest_region():
    """Query Electricity Maps, return cleanest region."""
    intensities = {}
    for region, config in REGIONS.items():
        response = requests.get(
            "https://api.electricitymap.org/v3/carbon-intensity/latest",
            params={"zone": config["zone"]},
            headers={"auth-token": ELECTRICITY_MAPS_TOKEN},
        )
        intensities[region] = response.json()["carbonIntensity"]
    # Lowest current gCO2/kWh wins; return the full dict too for logging.
    return min(intensities, key=intensities.get), intensities

def submit_job_ssh(region, config, intensities):
    """SSH into region and kick off training."""
    ip = REGIONS[region]["ip"]  # Tailscale IP
    remote_cmd = (
        f"CARBON_REGION={region} "
        f"CARBON_INTENSITY={intensities[region]} "
        f"python3 train_lora_cloud.py "
        f"--r {config['r']} --alpha {config['alpha']} "
        f"--lr {config['lr']} --dropout {config['dropout']}"
    )
    subprocess.run([
        "ssh", "-i", SSH_KEY_PATH, f"root@{ip}", remote_cmd
    ])
The entire "routing layer" is a few hundred lines of Python.
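For completeness, here is a minimal sketch of the driver loop that ties those two functions together, following the cadence described above (block on the ~45-minute job, then hold for 2 hours before the next submission). The names run_experiment and sweep_configs are illustrative, not the exact code from the repo.

import time

def run_experiment(sweep_configs, cooldown_hours=2):
    """Illustrative driver: route each sweep config to the cleanest region."""
    for i, config in enumerate(sweep_configs, start=1):
        region, intensities = pick_cleanest_region()
        print(f"[{i}/{len(sweep_configs)}] routing to {region} "
              f"({intensities[region]} gCO2/kWh)")
        submit_job_ssh(region, config, intensities)  # blocks ~45 min
        if i < len(sweep_configs):
            time.sleep(cooldown_hours * 3600)  # ~2.75 h between job starts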
The Submission Schedule
We ran 21 jobs over approximately 2.5 days. Each job trains the same Qwen2.5-Coder-7B model with LoRA, but the jobs weren't identical: we varied hyperparameters across a predefined sweep (rank, alpha, learning rate and dropout) so the training was genuinely useful:
| Parameter | Values Tested |
|---|---|
| Rank (r) | 16, 32 |
| Alpha (α) | 32, 64, 128 |
| Learning Rate | 1e-4, 5e-5, 1e-5 |
| Dropout | 0.05, 0.1, 0.15 |
Each experiment trains Qwen2.5-Coder-7B on 5,000 samples from the HelpSteer2 dataset. Training time: roughly 0.76 hours per job on 4090 GPUs.
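To make the sweep concrete, here is one way to enumerate it in Python. This is an illustration, not the repo's code: the full grid below has 54 combinations, and we ran a 21-job subset of it.

from itertools import product

# Values match the sweep table above; the experiment used 21 of these combos.
RANKS = [16, 32]
ALPHAS = [32, 64, 128]
LEARNING_RATES = [1e-4, 5e-5, 1e-5]
DROPOUTS = [0.05, 0.1, 0.15]

SWEEP_CONFIGS = [
    {"r": r, "alpha": a, "lr": lr, "dropout": d}
    for r, a, lr, d in product(RANKS, ALPHAS, LEARNING_RATES, DROPOUTS)
]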
Results
Where Did Jobs Actually Run?
Of the 21 jobs submitted over ~55 hours:
- Norway: 21 jobs (100%)
- Quebec: 0 jobs (0%)
- Ohio: 0 jobs (0%)
(We planned a pie chart here, but Norway ate the whole pie. Picture a single solid circle.)
Every. Single. Job. went to Norway.
This wasn't quite what we expected. We anticipated some distribution — Quebec catching jobs during Norway's peaks, maybe Ohio winning the odd mid-day window. Instead, Norway's hydro-dominated grid was so consistently clean (33-39 gCO2/kWh) that it won every routing decision by a wide margin.
Quebec's hydro grid is objectively clean (40-63 gCO2/kWh, avg 51), but Norway's was cleaner. Ohio's fossil-heavy grid (472-580 gCO2/kWh, avg 523) never came close to Norway's baseline.
What this teaches us: Spatial shifting doesn't require sophisticated runtime optimization. The cleanest region is often so much cleaner that dynamic balancing barely matters. Just pick Norway (or Quebec, or the Pacific Northwest) instead of Ohio/Virginia and you capture at least 80% of the carbon savings.
Carbon Intensity Comparison
| Metric | Carbon-Routed (Norway) | Ohio Baseline | Difference |
|---|---|---|---|
| Avg Intensity | 35 gCO2/kWh | 523 gCO2/kWh | -93% |
| Total Emissions | 0.21 kgCO2 | 3.09 kgCO2 | -93% |
| Total Energy | 5.9 kWh | 5.9 kWh | ~same |
The energy consumption is essentially identical — we're doing the same computational work either way. The difference is purely in where that energy came from.
The Counterfactual: What If We'd Just Used Ohio?
The real-world baseline is Ohio/Virginia, because that's "where my team runs things." To establish it, we calculated what emissions would have been if all 21 jobs had run in Ohio at the times they actually ran in Norway.
Ohio's grid averaged 523 gCO2/kWh during our experiment window, while our routed jobs (all Norway) averaged 35 gCO2/kWh — a 93% reduction from a single routing decision: "use Norway instead of Ohio."
To put it another way: Ohio's grid was nearly 15x dirtier than Norway's throughout our experiment. Even Ohio's best hour (472 gCO2/kWh) was over 12x worse than Norway's worst (39 gCO2/kWh).
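The counterfactual math is deliberately simple; a minimal sketch using the averages from the table above:

# Same computational work either way, so energy is held constant.
energy_kwh = 5.9              # total across all 21 jobs (estimated)
norway_avg = 35               # gCO2/kWh, average for our routed (Norway) jobs
ohio_avg = 523                # gCO2/kWh, Ohio average over the same window

routed_kg = energy_kwh * norway_avg / 1000    # ~0.21 kgCO2
baseline_kg = energy_kwh * ohio_avg / 1000    # ~3.09 kgCO2
reduction = 1 - routed_kg / baseline_kg       # ~0.93
print(f"{routed_kg:.2f} kg vs {baseline_kg:.2f} kg -> {reduction:.0%} reduction")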
Time-of-Day Patterns
The pattern was remarkably stable:
- Norway: Consistently 33-39 gCO2/kWh across all hours. Hydro-dominated grids don't vary much with time of day.
- Quebec: Ranged 40-63 gCO2/kWh. Also hydro-heavy, also stable, just slightly higher baseline.
- Ohio: Ranged 472-580 gCO2/kWh. Even Ohio's "best" hours couldn't compete with Norway's "worst" hours.
The key insight: When one grid is fundamentally cleaner by 15x, temporal variations don't matter.
What We Learned
The Simplicity Thesis: The 80/20 of Carbon Reduction
The headline result — 93% carbon reduction — exceeds what many simulations predicted. But the more important finding is how we achieved it: not through sophisticated dynamic optimization, but through a single decision: "use Norway instead of Ohio."
Norway won all 21 routing decisions because it's fundamentally cleaner (33-39 gCO2/kWh) than the alternatives. Quebec is clean (~51), but not as clean. Ohio is dirty (~523) and no amount of temporal variation brings it close to Norway's baseline.
The 93% number isn't a discovery; anyone who's looked at Electricity Maps for 10 minutes could predict Norway beats Ohio. The finding is that 93% is achievable with very simple tools and minimal setup friction. The experiment proved you can actually route to Norway, not just that you theoretically should.
This is an 80/20 leverage point. The actionable advice isn't "build a complex multi-region orchestration system." It's "check Electricity Maps for your available cloud regions, pick the cleanest one as your new default, done."
What Actually May Require Complexity
Region-to-grid mapping can take a bit of homework. Cloud regions and GPU rental locations don't always map cleanly to electrical grid zones. In Ohio, exact location determines whether you're on PJM or MISO. "Norway" on Vast.ai could technically mean different parts of the Norwegian grid, though Norway's grid is remarkably homogeneous due to hydro dominance. Ultimately, we spent as much time verifying our instances were in the right grid zones as we did building the routing logic.
When using public cloud providers, the Electricity Maps API will even do that mapping for you.
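For reference, the mapping in our setup boils down to a small config dict consumed by the router script above. The zone keys and Tailscale IPs below are placeholders: verify the zone keys against the Electricity Maps zone list for your actual instance locations.

# Region -> Electricity Maps zone + Tailscale IP of the GPU instance there.
# Zone keys are illustrative (Norway has several NO-NOx zones, and an Ohio
# instance may sit on PJM or MISO depending on its exact location).
REGIONS = {
    "norway": {"zone": "NO-NO2", "ip": "100.64.0.11"},
    "quebec": {"zone": "CA-QC", "ip": "100.64.0.12"},
    "ohio":   {"zone": "US-MIDA-PJM", "ip": "100.64.0.13"},
}

ELECTRICITY_MAPS_TOKEN = "your-api-token"     # from your Electricity Maps account
SSH_KEY_PATH = "/home/you/.ssh/id_ed25519"    # key used for the rented instances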
Spatial vs. Temporal: Different Problems, Different Solutions
Spatial shifting can be about defaults. For many workloads, "just always use Norway" is enough. The clean region wins so consistently that dynamic routing adds complexity without much benefit.
But spatial shifting has thorny constraints:
- Data residency and sovereignty (GDPR, healthcare, financial regulations)
- Data transfer costs and latency (moving TB-scale datasets between regions)
- Regional availability and quota limits
Temporal shifting avoids most of these. Running the same job at 2pm instead of 2am doesn't move data across borders or incur egress fees. For many organizations, temporal shifting within a single region may be more practical than spatial shifting across borders.
The layers stack. The real opportunity is combining a few:
- Pick the cleanest region you can use (spatial)
- Within that region, run jobs during cleanest hours (temporal; ex: Compute Gardener)
- Use infrastructure sleep strategies (ex: KubeGreen) to shut down idle resources
- Use carbon-aware autoscaling (ex: KEDA with carbon signals) for variable loads
- Optimize the workload itself at the application layer
Each layer compounds the benefit.
Versus the Simulations
Academic papers often report 20-40% potential savings from spatial shifting. Our 93% result significantly exceeds that range because we deliberately chose a stark contrast: one of the cleanest grids on Earth (Norway hydro) versus a dirty, fossil-heavy grid (Ohio coal/gas). Simulations typically model more moderate contrasts or average across diverse workloads.
We're not claiming 93% is universally achievable — it's not. But using an extreme example makes the point obvious: the leverage exists and the workflow to access it is surprisingly simple. The carbon cost of proving this? About 0.21 kgCO2 for the Norway runs — less than driving 2 miles. The difference from running this experiment versus just simulating it is negligible; the difference between running in Norway versus Ohio for real workloads at scale is enormous.
Where Compute Gardener Fits
This experiment taught us that routing and scheduling are different problems:
Routing (which region): Can often be a pre-deployment decision. "Should this job run in Ohio or Norway?" This can be as simple as a policy: "all deferrable ML training goes to the generally cleanest grid."
Scheduling (when to run): Can be a runtime decision within a cluster. This is where Compute Gardener excels. For deferrable batch workloads in Kubernetes, Compute Gardener (CG) provides carbon intensity thresholds per pod, maximum delay enforcement, energy budget tracking and price-aware scheduling.
Why temporal shifting still matters: Even in a dirty grid like Ohio, intensity varies 472-580 gCO2/kWh throughout the day. A job scheduled to run during midday (472) instead of overnight (580) achieves ~19% reduction — without crossing borders, moving data or dealing with compliance headaches.
The complete picture for carbon-aware ML training:
- Spatial layer: Pick the cleanest region your constraints allow
- Temporal layer: Within that region, defer jobs to cleaner hours
- Infrastructure efficiency: Sleep idle resources, scale based on carbon signals
- Application efficiency: Make the work itself more efficient
For organizations that can't use Norway's grid due to data residency, temporal shifting becomes even more critical.
What About US-Only Spatial Shifting?
An interesting middle ground we didn't test: routing between US regions only. This avoids most data residency concerns while still leveraging grid variation (Pacific Northwest: 80-150 gCO2/kWh, Ohio/Virginia: 400-600, Texas: 200-500).
Does spatial shifting within America behave the same way, set-it-and-forget-it? Or is that an environment where the "dynamic optimization matters" thesis from academic papers rings true? Our hypothesis: somewhere in between, but likely still favoring "just default to Quebec or the Pacific Northwest" for most jobs.
Limitations and Honest Caveats
Our experiment showed dramatic results (93% carbon reduction), but it's important to understand what this doesn't capture and when the "just use Norway" heuristic breaks down.
Data Transfer Costs and Latency
We assume training data magically appears wherever we need it. In reality, moving large datasets between regions costs money and adds latency. For our 5,000-sample dataset (~few hundred MB) and ~15GB model, this was negligible. For TB-scale datasets, it's a real constraint that fundamentally changes the economics.
Data Sovereignty and Compliance
Many organizations face legal requirements: GDPR restricts where EU citizens' data can be processed, healthcare data often can't leave national boundaries, and financial services may prohibit cross-border processing. Our "just use Norway" advice only works if you legally can use Norway. For many orgs, US-only or EU-only spatial shifting is the only option.
No Temporal Shifting in This Experiment
We deliberately excluded temporal shifting to isolate spatial effects. Jobs ran immediately in whichever region was selected. Combining spatial + temporal could compound savings (95%+ total reduction), but adds coordination complexity.
Measurement Uncertainty
Unlike Parts 1 and 2, we don't have actual power meter data. We're using observed GPU utilization (~320W at P2 power level) plus an estimate for system overhead (~50W). Vast.ai instance locations aren't usually as precisely defined as hyperscaler regions. Our carbon intensity numbers are real (from Electricity Maps API), but total emissions calculations have ~10-20% uncertainty.
That said, even with 20% measurement error in both directions, the fundamental result holds: Norway's grid (33-39 gCO2/kWh) is so much cleaner than Ohio's (472-580 gCO2/kWh) that the magnitude of the difference isn't in question, only the precise percentage.
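For transparency, here is the estimate behind those numbers, using the power figures quoted above with a +/-20% band applied. The point the sketch makes: the error shifts absolute kgCO2, not the relative reduction.

jobs, hours_per_job = 21, 0.76          # ~45 minutes of training per job
gpu_watts, system_watts = 320, 50       # observed GPU draw + estimated overhead

energy_kwh = jobs * hours_per_job * (gpu_watts + system_watts) / 1000  # ~5.9

# A +/-20% error in the power estimate scales both scenarios equally.
for err in (0.8, 1.0, 1.2):
    e = energy_kwh * err
    print(f"{e:.1f} kWh -> norway {e * 35 / 1000:.2f} kg, "
          f"ohio {e * 523 / 1000:.2f} kg")
print(f"reduction in every case: {1 - 35 / 523:.0%}")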
What's Next
For Compute Gardener's roadmap:
- Temporal optimization within regions: CG already handles this for Kubernetes workloads
- Integration with spatial routing policies: Organizations can set "use Pacific Northwest for batch jobs" policies, CG optimizes when within that region
- Hybrid on-prem + cloud: If you have local GPUs and cloud access, CG can help you weigh more nuanced trade-offs (ex: run now locally or defer and burst to cloud during clean hours).
We're not waiting for perfect solutions to start reducing emissions. The necessary tools exist today.
Reproducibility
All code, logs and data from this experiment are available in our GitHub repo.
To run your own spatial shifting experiment:
- Rent GPU instances in multiple regions (we used Vast.ai RTX 4090s)
- Set up Tailscale or similar VPN for easy instance access
- Install Python ML dependencies on each instance
- Get an Electricity Maps API key (free tier works for experimentation)
- Configure the routing script with your instance IPs and grid zones
- Run:
python spatial_router.py --run-experiment
No Kubernetes required. Total compute cost: approximately $50 over 2.5 days. Could be much less with a more active provisioning strategy (we had all nodes reserved).
Is This Right for Your Workloads?
Every discussion of carbon-aware computing comes with the same caveat: results depend heavily on your specific usage profile. Your default region, your job duration, your flexibility constraints: they all matter.
The experiments in this series demonstrate what's possible under specific conditions. Of course, your conditions are different. Maybe you're already multi-region and spatial routing is low-hanging fruit. Maybe your workloads are latency-sensitive and can't move. The only way to know is to look at your actual usage patterns and do the analysis.
Whether you're exploring carbon-aware scheduling for the first time or trying to quantify the opportunity for your specific infrastructure, we offer consulting engagements designed to answer: what would this look like for us and our wacky X stack?
Let's figure out your carbon optimization opportunity →
Compute Gardener is an open-source project focused on making carbon-aware computing practical, not just theoretical. Join us in moving sustainable computing from white papers to production.
