Making ML Training Carbon-Aware with Compute Gardener (Part 2) - Hyperparameter Search with Ray
How we reduced emissions by 40% during a LoRA hyperparameter search with a flexible, tiered, carbon-aware strategy
Continuing from Part 1
In Part 1, we demonstrated how Compute Gardener (CG) could reduce carbon emissions by ~30% for simple, recurring ML training jobs. The setup was straightforward: daily ResNet50 training jobs, shifted to cleaner energy windows.
But that raised important questions that have come up in related discussion:
- "If everyone delays to 2pm, haven't we just moved the problem?"
- "This works for toy examples, but what about real ML workflows?"
These are fair questions. So for Part 2, let's build a more complex and realistic scenario: hyperparameter optimization for LLM fine-tuning using Ray on Kubernetes. And let's do it with intelligent, carbon-aware scheduling that distributes load rather than creating new stampedes.
Real-World Scenario: LLM Fine-Tuning Hyperparameter Search
Qwen2.5-Coder was one of the more pleasantly surprising open-weights models I've worked with heavily over the past year. It made me realize that open models are no more than a year (probably closer to six months) behind the capabilities of closed frontier models of similar size.
Today, Qwen2.5-Coder is not the "latest and greatest" among open models, but it does hold a soft spot in my heart. Perhaps more importantly, the newer Qwen3-Coder models are all too large to train on the GPUs we had immediately available (even with some quantization).
So, we decided to fine-tune Qwen2.5-Coder-7B to create a specialized coding assistant. Like most ML projects, we didn't know the optimal training configuration upfront, so we needed to test multiple LoRA (Low-Rank Adaptation) hyperparameter combinations:
- Rank (r): [16, 32] — size of adapter (r=32 is the practical limit for 7B models at bfloat16 on a 24GB GPU)
- Alpha (a): [32, 64, 128] — scaling factor for LoRA weights
- Learning rate (lr): [1e-4, 5e-5, 1e-5] — how aggressively to update weights
- Dropout (d): [0.05, 0.1, 0.15] — regularization to prevent overfitting
A full factorial sweep would be 54 combinations (2 × 3 × 3 × 3), too many for this experiment. Instead, we designed a targeted sample of 21 experiments that systematically explore the hyperparameter space:
- 1 baseline and 7 related variations (Tier 1, r=16): Testing learning rates, alpha values and dropout values around the proven baseline rank
- 4 promising directions (Tier 2, r=32): Stepping up model capacity with standard hyperparameters
- 9 experimental long shots (Tier 3, r=32): Aggressive/unconventional hyperparameter combinations (high alpha, extreme dropout, very conservative learning rates)
This is a hyperparameter sweep: training the same model architecture multiple times with different configuration settings, then comparing the results to find the best combination. It's one of the most common workflows in ML research and production model development.
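To make a single sweep point concrete, here's a minimal sketch of how one (rank, alpha, learning rate, dropout) combination maps onto a LoRA training setup, assuming the Hugging Face peft and transformers libraries. The batch-size settings and target modules are placeholders; the actual training script in our repo differs in detail:

# Illustrative only: one sweep point expressed as a LoRA + trainer configuration.
from peft import LoraConfig
from transformers import TrainingArguments

def build_config(r: int, alpha: int, lr: float, dropout: float):
    lora = LoraConfig(
        r=r,                    # adapter rank: 16 or 32 in this sweep
        lora_alpha=alpha,       # scaling factor: 32, 64, or 128
        lora_dropout=dropout,   # regularization: 0.05, 0.1, or 0.15
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    )
    training_args = TrainingArguments(
        output_dir=f"runs/r{r}-a{alpha}-lr{lr}-d{dropout}",
        learning_rate=lr,       # 1e-4, 5e-5, or 1e-5
        per_device_train_batch_size=1,       # placeholder values
        gradient_accumulation_steps=8,
        bf16=True,              # bfloat16 keeps a 7B model + r=32 adapters within 24GB
        num_train_epochs=1,
    )
    return lora, training_args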
Note on scope: The following approach works best for batch/deferrable workloads (e.g., hyperparameter searches, model evaluations, periodic retraining, research experiments). Production inference requires different carbon-awareness strategies, such as geographic load balancing and efficient model architectures.
Training Dataset
We're fine-tuning on nvidia/HelpSteer2, an open-source dataset of 21k prompt-response pairs with human annotations for helpfulness, correctness, coherence, complexity, and verbosity. It's NVIDIA's recommended dataset for training helpful assistants.
For this experiment, we use 5,000 samples split 80/20 into training (4,000) and validation (1,000). This keeps training times reasonable while providing enough data to demonstrate the carbon-aware scheduling approach and compare hyperparameter configurations.
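A minimal sketch of that preparation step, assuming the Hugging Face datasets library (the seed and sampling details are illustrative):

# Load HelpSteer2 and carve out the 5,000-sample subset with an 80/20 split.
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer2", split="train")      # ~21k prompt-response pairs
subset = ds.shuffle(seed=42).select(range(5_000))           # keep training times reasonable
split = subset.train_test_split(test_size=0.2, seed=42)     # 80/20 split
train_ds, val_ds = split["train"], split["test"]            # 4,000 train / 1,000 validation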
Setup and Timeline
Training times vary by rank because higher ranks add more trainable LoRA parameters:
- r=16: ~1.4 hours (8 experiments)
- r=32: ~1.6 hours (13 experiments)
That's ~32 GPU hours in total (8 × 1.4 h + 13 × 1.6 h).
We used two 24GB GPUs (one NVIDIA RTX 3090, one NVIDIA RTX A5000) on separate nodes in our Kubernetes cluster. The experiment ran for just over two days (51.3 hours), October 28-30, 2025. The consumer RTX 3090 is ~5% faster at completing these jobs but uses ~80% more energy than the enterprise A5000. This presents a trade-off in carbon-aware scheduling: slightly faster completion versus much lower energy consumption.
The Traditional Approach: Random Submission
The typical workflow looks like this:
# Monday morning: Try a baseline
kubectl apply -f rayjob-r16-a32-lr1e4.yaml
# Monday afternoon: Results look good, try variations
kubectl apply -f rayjob-r16-a64-lr5e5.yaml
kubectl apply -f rayjob-r32-a32-lr1e4.yaml
# Tuesday morning: More experiments based on what worked
kubectl apply -f rayjob-r32-a64-lr5e5.yaml
# ... and so on
Predictable Result: Jobs run whenever submitted and resources are available, scattered somewhat randomly across the carbon intensity spectrum. Some at 350 gCO2eq/kWh (peak intensity hours), others at 120 gCO2eq/kWh (solar valleys).
Note: For the remainder of this article, we will simplify the unit as gCO2/kWh. In all places, the technically accurate units are (k)gCO2eq and gCO2eq/kWh.
The Naive Carbon-Aware Approach (And Why It Fails)
Our first instinct was simple: "Just delay everything until carbon intensity is low!"
The problem: With limited GPU resources (2 GPUs) and many experiments, forcing everything to wait for the absolute lowest carbon window creates a stampede. When conditions are finally met (or max-delay expires), all jobs compete for just 2 GPUs. The first few might run during optimal windows, but the tail-end jobs likely don't start until carbon intensity climbs back to 250+ gCO2/kWh.
The lesson: We need to distribute load across multiple "good enough" windows rather than clustering everything at the absolute minimum.
Better Solution: A Tiered Strategy
Not all experiments are created equal. Some configurations need fast feedback (baseline-adjacent validations), while others are speculative and can wait for optimal conditions (long shots).
Our strategy: Distribute experiments across multiple "good enough" carbon thresholds using three tiers, each with different urgency profiles:
- Tier 1 (225 gCO2/kWh, 24h max delay): Baseline-adjacent variations with r=16 for fast feedback
- Tier 2 (175 gCO2/kWh, 48h max delay): Promising r=32 configurations with standard hyperparameters
- Tier 3 (125 gCO2/kWh, 96h max delay): Experimental r=32 long shots with aggressive hyperparameters
This is essentially a priority queue pattern with carbon constraints as the primary gate.
California's grid (CalISO) has a clear daily pattern: carbon intensity is highest overnight and early morning (300-400 gCO2/kWh), then drops dramatically during the midday solar peak (typically 100-200 gCO2/kWh around noon to 2pm), before climbing again in the evening as solar generation fades.
Our three-tier strategy doesn't try to hit the single "perfect" window throughout the day. Instead, it solves two practical problems when you have limited GPU resources (2 GPUs) and many experiments (21 jobs):
Preventing the Stampede
If all 21 experiments had the same low threshold (say, 100 gCO2/kWh), they would all wait behind CG's carbon gate until either:
- Carbon intensity drops below 100 OR
- They hit their max delay (say, 24h) and CG stops blocking them
But after CG releases the jobs, they compete for just 2 GPUs through normal Kubernetes resource scheduling. The initial few jobs can run immediately during optimal conditions. However, the deeply queued jobs likely won't be able to execute until a less optimal time. In the worst case, such a stampede can make carbon intensity worse for many of the jobs.
By using 225/175/125 gCO2/kWh thresholds instead, we create a natural flow:
- Tier 1 jobs (threshold 225, 8 jobs) start running in early morning as intensity drops below 225
- Tier 2 jobs (threshold 175, 4 jobs) join in late morning when it dips below 175
- Tier 3 jobs (threshold 125, 9 jobs) run during the deepest part of the solar valley in the early afternoon
This spreads the 21-job workload across a handful of low-carbon periods rather than clustering everything at the absolute minimum.
Intelligent GPU Allocation
This is the priority queue pattern with carbon constraints. When a GPU finishes one experiment and becomes available, the scheduler picks the next job that:
- Meets its carbon threshold - current intensity is low enough
- Has the highest priority - Tier 1 > Tier 2 > Tier 3 within acceptable jobs
- Has been waiting longest - within that tier
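In code, that selection rule looks roughly like the sketch below (illustrative only, not Compute Gardener's actual implementation):

# Illustrative sketch of the selection rule above; not CG's actual code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueuedJob:
    name: str
    tier: int              # 1 = highest priority
    threshold: float       # carbon gate in gCO2/kWh
    waited_hours: float
    max_delay_hours: float

def pick_next_job(queue: list[QueuedJob], intensity: float) -> Optional[QueuedJob]:
    """Choose the job to place on a freed GPU, or None if everything should keep waiting."""
    eligible = [
        j for j in queue
        if intensity <= j.threshold              # carbon gate satisfied...
        or j.waited_hours >= j.max_delay_hours   # ...or the job has waited long enough
    ]
    if not eligible:
        return None
    # Highest-priority tier first; within a tier, the longest-waiting job.
    return min(eligible, key=lambda j: (j.tier, -j.waited_hours))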
The separation of concerns makes this clean:
- Ray: Manages experiment orchestration and dependencies
- Kubernetes: Allocates GPU resources and handles pod lifecycle
- Compute Gardener: Gates scheduling based on carbon intensity threshold
- Result: Each layer handles what it's best at
The key insight: We're not optimizing to "run everything at minimum carbon intensity." Instead, we're optimizing for a blended goal: "distribute the queue intelligently so experiments run when carbon is acceptably low AND GPU resources are available."
Implementation: Automating Tier Assignment
Rather than manually categorizing each experiment, we automated tier assignment based on hyperparameter characteristics using a Python script.
Note: Please check out this post's supporting materials in our GitHub repo.
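For orientation, the assignment rule amounts to something like the following (a simplified, illustrative sketch; the exact rules, and the annotations written into each manifest, are in the repo's generate_sweep.py):

# Simplified tier assignment: rank drives urgency, and hyperparameter "aggressiveness"
# pushes r=32 runs into the long-shot tier. Thresholds/delays match the strategy above.
TIERS = {
    1: {"max_carbon_intensity": 225, "max_delay": "24h"},   # baseline-adjacent r=16 runs
    2: {"max_carbon_intensity": 175, "max_delay": "48h"},   # promising r=32 runs
    3: {"max_carbon_intensity": 125, "max_delay": "96h"},   # experimental r=32 long shots
}

def assign_tier(r: int, alpha: int, lr: float, dropout: float) -> int:
    if r == 16:
        return 1                                 # fast feedback around the baseline
    standard = alpha <= 64 and lr >= 5e-5 and dropout == 0.1
    return 2 if standard else 3                  # anything aggressive waits for cleaner power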
The workflow
# 1. Generate targeted sample of 21 manifests with carbon tiers
python generate_sweep.py --output-dir sweep_manifests --core-subset
# Output: 21 YAML files with intelligent tier assignments and annotations
# 2. Submit to Kubernetes (can submit all at once, individually or by tier)
kubectl apply -f sweep_manifests/
# 3. Ray coordinates job execution, CG handles carbon-aware scheduling
# No further intervention needed - jobs run when conditions are favorable
What happens under the hood:
- Ray creates pods ("head" and GPU workers) for each training job
- Compute Gardener scheduler evaluates each pod:
  - Checks current carbon intensity via Electricity Maps API
  - Compares against pod's threshold annotation
  - If intensity is acceptable, schedules immediately
  - Otherwise, delays and re-evaluates periodically
- Jobs complete, then save models and metrics to persistent storage
- Repeat until all experiments finish
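Conceptually, the per-pod evaluation boils down to the sketch below. This is not Compute Gardener's actual code; the endpoint and zone follow the Electricity Maps API documentation, and the token environment variable name is a placeholder:

# Conceptual gate: fetch current grid intensity and compare it to the pod's threshold.
import os
import requests

def current_intensity(zone: str = "US-CAL-CISO") -> float:
    resp = requests.get(
        "https://api.electricitymap.org/v3/carbon-intensity/latest",
        params={"zone": zone},
        headers={"auth-token": os.environ["ELECTRICITY_MAPS_TOKEN"]},  # placeholder env var
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["carbonIntensity"]          # gCO2eq/kWh

def should_schedule(threshold: float, waited_hours: float, max_delay_hours: float) -> bool:
    """Schedule now if the grid is clean enough, or if the pod has hit its max delay."""
    return current_intensity() <= threshold or waited_hours >= max_delay_hours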
Carbon-Intelligent Scheduling vs. Simple Batch Submission
We ran all 21 experiments from October 28-30, 2025, capturing actual carbon intensity data for the California grid (CalISO) via Electricity Maps API. Here's what we found:
Three-Tier Strategy (Our Approach)
- Average carbon intensity: 142 gCO2/kWh
- Total emissions: 1.81 kgCO2
- Total energy consumption: 12.2 kWh across all 21 experiments
- Wall-clock completion time: ~50.3 hours
- Distribution: Naturally spread across low-carbon periods over 2 days
- Research velocity: Maintained — Tier 1 baseline results available within the first day
The Alternative Universe: No Carbon Awareness
To estimate relative impact, we calculated what would have happened if the same 21 experiments had run immediately upon submission, executing two at a time continuously through the first night:
- Average carbon intensity: 226 gCO2/kWh (62.7% higher than our approach)
- Total emissions: 3.03 kgCO2 (67% higher than our approach)
- Total energy consumption: 12.2 kWh (same — same work done either way)
- Wall-clock completion time: ~16.2 hours (2 GPUs running continuously)
- Distribution: Heavy overnight execution during fossil-fuel heavy grid hours
Key insight: We achieved more than 40% carbon reduction by intelligently distributing experiments across cleaner time windows. The same computational work (12.2 kWh) produced dramatically different emissions based purely on when it ran.
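The accounting behind those totals is simple: each job's measured energy multiplied by the grid intensity observed while it ran, summed across jobs (a sketch with illustrative field names):

# Emissions = sum over jobs of energy (kWh) x carbon intensity (gCO2/kWh), reported in kgCO2.
def total_emissions_kg(jobs: list[dict]) -> float:
    return sum(j["energy_kwh"] * j["intensity_gco2_per_kwh"] for j in jobs) / 1000.0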
Assessing Training Quality
After all experiments completed, we needed to determine which hyperparameter configuration produced the best model. This requires both quantitative metrics and qualitative assessment.
Quantitative Metrics
Each training run logged both the final training loss and the average validation loss.
Validation loss (not training loss) is the primary quality metric. It measures how well the model generalizes to unseen data. Lower is better, but only to a point (too low can indicate overfitting).
We established rank=16, alpha=32, learning rate=1e-4, dropout=0.1 as our baseline configuration (validation loss: 0.7899). Here are the top performers relative to that baseline:
Top 5 configurations by validation loss:
| Top Finishers | rank | alpha | learning rate | dropout | Val Loss | Δ vs Baseline | Carbon cost (gCO2) | Tier |
|---|---|---|---|---|---|---|---|---|
| 1 | 32 | 128 | 1e-4 | 0.05 | 0.6970 | -0.0929 | 51.4 | 3 |
| 2 | 32 | 128 | 1e-4 | 0.1 | 0.7010 | -0.0889 | 94.5 | 3 |
| 3 | 32 | 128 | 1e-4 | 0.15 | 0.7037 | -0.0862 | 56.1 | 3 |
| 4 | 16 | 128 | 1e-4 | 0.1 | 0.7059 | -0.0840 | 90.8 | 1 |
| 5 | 16 | 64 | 1e-4 | 0.1 | 0.7490 | -0.0409 | 70.3 | 1 |
| Baseline | 16 | 32 | 1e-4 | 0.1 | 0.7899 | 0.0000 | 52.1 | 1 |
Interesting findings:
- The top 3 models all used alpha=128 with r=32, achieving an 11-12% improvement over baseline
- Tier 3 experiments (the "long shots" designed to wait for optimal carbon windows) dominated the top results
- Higher alpha values (128 vs 32) had more impact on performance than rank increases (r=32 vs r=16)
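The ranking above comes down to sorting runs by validation loss and reporting each delta against the baseline, roughly like this (an illustrative sketch; the field names are assumptions):

# Rank completed runs by validation loss and compute the delta versus the baseline config.
BASELINE_VAL_LOSS = 0.7899   # r=16, alpha=32, lr=1e-4, dropout=0.1

def rank_runs(runs: list[dict]) -> list[dict]:
    """runs: [{'name': 'r32-a128-lr1e4-d005', 'val_loss': 0.6970}, ...]"""
    ranked = sorted(runs, key=lambda r: r["val_loss"])
    return [
        {**r, "delta_vs_baseline": round(r["val_loss"] - BASELINE_VAL_LOSS, 4)}
        for r in ranked
    ]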
Qualitative Testing
Metrics only tell part of the story. We also deployed the top 3 candidates with vLLM and tested them on real coding tasks:
# Deploy top candidates for interactive testing
kubectl apply -f vllm-deployment-r32-a128.yaml
kubectl port-forward svc/vllm-lora 8000:8000
# Test with prompts
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct:qwen-lora-r32-a128",
    "messages": [{
      "role": "user",
      "content": "Write a Python function for binary search with edge case handling"
    }]
  }'
What we evaluated:
- Code correctness and handling of edge cases
- Code style and readability
- Ability to follow instructions and successfully use tools
- Response to follow-up questions (e.g., "What else can be improved in this codebase?")
The winner: rank=32, alpha=128, learning rate=1e-4, dropout=0.05; this configuration combined the best quantitative performance (lowest validation loss) with seemingly strong code and assistance quality. This was a Tier 3 experiment which waited for near-optimal carbon conditions.
Recap
- 40% carbon reduction is achievable with intelligent scheduling: By shifting the same computational work (12.2 kWh) from high-carbon overnight hours to solar-powered daytime windows, we reduced emissions from 3.03 kgCO2 to 1.81 kgCO2, saving 1.22 kgCO2 across the 21 experiments.
- Three-tier distribution prevents the stampede: By using multiple thresholds (225/175/125 gCO2/kWh), we distributed load across the solar valley rather than clustering everything at the absolute minimum. All 21 jobs completed successfully without hitting max-delay timeouts.
- The trade-off is real but manageable: Carbon-aware scheduling stretched wall-clock time from 16 hours to 51 hours (3.2× longer). For batch workloads like hyperparameter sweeps, this is often acceptable; we don't need results in 16 hours if we can get them in 2 days while emitting 40% less carbon along the way.
- A priority queue driven by the carbon intensity signal works: Not all experiments are equal. Baseline-adjacent (Tier 1) configurations got fast feedback, while the long shots (Tier 3) waited a bit longer. Tier 1 results were available within the first day, maintaining research velocity.
- The best model came from a Tier 3 experiment: Our top performer (r=32, alpha=128, lr=1e-4, dropout=0.05) was a "long shot" that waited for optimal carbon conditions (123 gCO2/kWh). It achieved 12% better validation loss than our baseline.
What's Next
Future work, and a potential part 3 in this series, might explore intelligent scheduling with carbon intensity forecasts as well as scaling up to cloud GPU providers where carbon data varies by region.
Interested in implementing something similar in your organization?
If automated carbon awareness for your ML infrastructure sounds appealing, we're available to help. Whether you're looking to deploy Compute Gardener, customize it for your use case or explore new or advanced carbon optimization strategies, don't hesitate to reach out. Let's work together to advance sustainable computing practices.
Compute Gardener is an open-source Kubernetes scheduler that enables carbon-aware workload scheduling across hybrid clouds. Join us in making carbon-aware computing the default, not the exception.
