Kubernetes and cloud infrastructure have revolutionized how we deploy and manage applications and services. But with data centers now likely responsible for about 4% of U.S. electricity consumption, a share projected to roughly triple to as much as 12% within the next three years[1], now is the time to build energy and emissions awareness into our orchestration systems. In this post, I'll walk through the technical implementation of the Compute Gardener Scheduler, a carbon- and price-aware Kubernetes scheduler plugin we've developed, building on recent advances in energy-aware computing.[2]
The Challenge
Standard Kubernetes scheduling doesn't consider energy implications - pods are placed to optimize for resource utilization, spread, affinity, and other traditional metrics. But what if we want to:
- Delay non-urgent workloads until the grid's carbon intensity is lower[3]
- Run batch jobs during off-peak electricity pricing hours
- Track energy consumption for workloads and enforce budgets
- Optimize for energy-efficient hardware when available[4]
Overview
We built our solution using Kubernetes' scheduling framework, which allows custom plugins to influence scheduling decisions.[7] The core components include:
# Main components of the scheduler
api/ # Various data providers' API clients
carbon/ # Carbon-aware scheduling implementations (currently Electricity Map API)
config/ # Configuration types and validation
metrics/ # Prometheus metrics and hardware profiling
price/ # Price-aware scheduling implementations (currently time-of-use)
Our scheduler implements two key extension points in the framework (plugin registration is sketched just below):
- PreFilter: Initial validation of scheduling constraints and annotations related to carbon intensity and price
- Filter: Main decision logic that evaluates the efficiency and suitability of candidate nodes
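Out-of-tree plugins like ours are typically compiled into their own scheduler binary and registered via app.NewSchedulerCommand, then run alongside the default kube-scheduler. Here's a minimal sketch; the import path and the computegardener.Name/New symbols are illustrative assumptions, not necessarily the repo's actual layout:

package main

import (
    "os"

    "k8s.io/kubernetes/cmd/kube-scheduler/app"

    // Hypothetical import path for the plugin package
    "github.com/elevated-systems/compute-gardener-scheduler/pkg/computegardener"
)

func main() {
    // Build a kube-scheduler command with our plugin registered; the
    // resulting binary runs as a secondary scheduler in the cluster.
    cmd := app.NewSchedulerCommand(
        app.WithPlugin(computegardener.Name, computegardener.New),
    )
    if err := cmd.Execute(); err != nil {
        os.Exit(1)
    }
}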
Decision Flow
PreFilter Stage
The PreFilter stage handles initial validation of scheduling constraints and annotations related to carbon intensity and pricing.
// PreFilter implements the PreFilter interface (simplified)
func (cs *ComputeGardenerScheduler) PreFilter(ctx context.Context, state *framework.CycleState,
    pod *v1.Pod) *framework.Status {
    // Check pricing constraints, if enabled
    if cs.config.Pricing.Enabled && cs.pricingImpl != nil {
        if status := cs.pricingImpl.CheckPriceConstraints(pod, cs.clock.Now()); !status.IsSuccess() {
            return status
        }
    }

    // Check carbon intensity constraints, if enabled
    if cs.config.Carbon.Enabled && cs.carbonImpl != nil {
        // Start from the cluster-wide configured threshold
        threshold := cs.config.Carbon.IntensityThreshold

        // Override with the pod-annotated value, if present
        ...

        // Get current carbon intensity
        intensity, err := cs.carbonImpl.GetCurrentIntensity(ctx)
        if err != nil {
            // If we can't determine carbon intensity, allow the pod through
            return framework.NewStatus(framework.Success, "")
        }

        // Make the scheduling decision based on the threshold
        if intensity > threshold {
            msg := fmt.Sprintf("Current carbon intensity (%.2f) exceeds threshold (%.2f)",
                intensity, threshold)
            return framework.NewStatus(framework.Unschedulable, msg)
        }
    }

    // Default to allowing the pod through
    return framework.NewStatus(framework.Success, "")
}
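The elided override step reads the pod annotation documented later in this post. A minimal sketch of what it might look like (assumed code, requiring the strconv import; the actual implementation may differ):

// Sketch: override the threshold from a pod annotation, if present
if v, ok := pod.Annotations["compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold"]; ok {
    if parsed, err := strconv.ParseFloat(v, 64); err == nil {
        threshold = parsed
    }
}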
Filter Stage
PreFilter occurs without any node awareness. Then, in the Filter stage, we gain node information and can bring our knowledge of node energy use and efficiency into scheduling decisions.
// Filter implements the Filter interface (simplified)
func (cs *ComputeGardenerScheduler) Filter(ctx context.Context, state *framework.CycleState,
    pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // Logic for hardware profiling and node selection
    if cs.config.HardwareProfiling.Enabled && cs.hardwareProfileImpl != nil {
        profile := cs.hardwareProfileImpl.GetNodeProfile(nodeInfo.Node().Name)
        if profile == nil {
            return framework.NewStatus(framework.Unschedulable, "No hardware profile available")
        }

        // Apply node selection based on hardware capabilities
        if !cs.nodeSelectionImpl.IsNodeSuitable(profile, pod) {
            return framework.NewStatus(framework.Unschedulable, "Node not suitable for workload")
        }
    }

    // Default to allowing the pod through
    return framework.NewStatus(framework.Success, "")
}
This implementation follows the Kubernetes scheduler pattern of returning an Unschedulable status when conditions aren't met, which causes the pod to wait in the scheduling queue until conditions change.
Real-Time Carbon Intensity Data
The scheduler consumes carbon intensity data from external APIs and makes scheduling decisions based on configurable thresholds. Our solution:
- Supports pod-level configuration through annotations, allowing different thresholds per workload
- Uses caching and failure backoff to tolerate API failures and maintain scheduler reliability (sketched below)
- Lets operators dynamically override scheduling decisions when needed
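To illustrate the caching and backoff behavior, here's a minimal sketch of a cached intensity lookup. All type and field names here (CarbonClient, IntensityAPI, cacheTTL, backoffPeriod) are illustrative assumptions rather than the actual implementation:

package carbon

import (
    "context"
    "sync"
    "time"
)

// IntensityAPI abstracts an external provider (e.g. Electricity Maps).
type IntensityAPI interface {
    FetchIntensity(ctx context.Context, region string) (float64, error)
}

// CarbonClient caches intensity readings and backs off after failures.
type CarbonClient struct {
    mu            sync.Mutex
    api           IntensityAPI
    region        string
    lastValue     float64
    lastFetch     time.Time
    cacheTTL      time.Duration
    backoffUntil  time.Time
    backoffPeriod time.Duration
}

func (c *CarbonClient) GetCurrentIntensity(ctx context.Context) (float64, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    // Serve the cached value while it's fresh, or while backing off
    // after a recent API failure.
    now := time.Now()
    if now.Sub(c.lastFetch) < c.cacheTTL || now.Before(c.backoffUntil) {
        return c.lastValue, nil
    }

    value, err := c.api.FetchIntensity(ctx, c.region)
    if err != nil {
        // Back off and fall back to the last known value so a flaky
        // API never blocks a scheduling decision.
        c.backoffUntil = now.Add(c.backoffPeriod)
        return c.lastValue, err
    }

    c.lastValue, c.lastFetch = value, now
    return value, nil
}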
Hardware Profiling and Estimating Node Energy Use
Accurately modeling power consumption, estimating energy usage, and selecting suitable nodes based on hardware capabilities enables efficient scheduling. It seems like it SHOULD be easy, but most computing hardware wasn't designed with real-time power monitoring as a primary feature.
Fortunately, for Graphics Processing Units (GPUs), we can leverage direct power metrics through NVIDIA's DCGM, which provides reasonably accurate real-time power consumption data to which we can apply a PUE (Power Usage Effectiveness) factor. However, for CPUs and memory, we must rely on a combination of heuristics, frequency scaling information, and utilization patterns to estimate power consumption.
Our approach follows and extends the Green Software Foundation's Software Carbon Intensity (SCI) methodology[8], adding enhancements for CPU frequency scaling and dynamic power profiling. We believe our estimates are within 5% of actual consumption for calibrated nodes - likely sufficient accuracy for carbon accounting and potential carbon credit validation.
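For reference, the SCI specification expresses emissions as a rate: SCI = (E × I) + M per functional unit R, where E is energy consumed (kWh), I is location-based grid carbon intensity (gCO2eq/kWh), and M is embodied emissions. The estimates below feed the E term. The power model itself draws on a per-node hardware profile; here's a minimal sketch of the shape assumed by the code that follows (field names match the snippet, the struct itself is illustrative):

// NodeHardwareProfile captures the calibration data the power model
// needs (illustrative sketch; actual fields may differ).
type NodeHardwareProfile struct {
    IdlePower     float64 // Watts drawn at idle
    MaxPower      float64 // Watts drawn at full CPU utilization
    BaseFrequency float64 // Nominal CPU frequency, GHz
    GPUPUE        float64 // PUE factor applied to GPU power draw
}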
// calculatePodPower estimates power consumption (simplified; memory
// power modeling elided in this version)
func calculatePodPower(nodeName string, cpu float64, memory float64, gpuPower float64) float64 {
    // Get hardware profile for the node
    profile := getNodeHardwareProfile(nodeName)

    // Apply frequency scaling adjustment (to a local copy, so the
    // shared profile isn't mutated)
    maxPower := profile.MaxPower
    cpuFreq := getCurrentCPUFrequency(nodeName)
    if cpuFreq > 0 && profile.BaseFrequency > 0 {
        freqRatio := cpuFreq / profile.BaseFrequency
        // Power typically scales with roughly the square of frequency
        maxPower *= freqRatio * freqRatio
    }

    // Power-law model for CPU power as a function of utilization
    cpuPower := profile.IdlePower +
        (maxPower-profile.IdlePower)*
            math.Pow(cpu, 1.4) // common exponent in datacenter power modeling

    // Add GPU power with datacenter overhead (PUE)
    totalPower := cpuPower
    if gpuPower > 0 {
        totalPower += gpuPower * profile.GPUPUE
    }
    return totalPower
}
// CalculateTotalEnergy computes energy usage via numerical integration
func CalculateTotalEnergy(records []PodMetricsRecord) float64 {
    totalEnergy := 0.0
    for i := 1; i < len(records); i++ {
        // Integrate power over time, accumulating into totalEnergy
        ...
    }
    return totalEnergy
}
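The elided integration step is a standard trapezoidal rule. A sketch, assuming PodMetricsRecord carries a Timestamp and an instantaneous PowerWatts reading (assumed field names):

// Trapezoidal integration between consecutive samples
dt := records[i].Timestamp.Sub(records[i-1].Timestamp).Hours()
avgPower := (records[i].PowerWatts + records[i-1].PowerWatts) / 2
totalEnergy += avgPower * dt / 1000 // Watts x hours -> kWh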
This allows us to model power consumption accurately with dynamic frequencies, and then calculate energy usage over time using numerical integration.
Performance Considerations
A key concern with any scheduler is adding latency to scheduling decisions. We implemented several optimizations:
- Caching: Both carbon API responses and hardware profiles are cached, ensuring API hiccups never derail a scheduling decision
- Parallel Processing: API requests and metrics collection happen asynchronously with respect to the core scheduling flow
- Bounded Memory: If capturing metrics, records for long-running pods are downsampled to limit memory usage (see the sketch after this list)
- Prometheus Integration: Leverage existing monitoring infrastructure; don't reinvent or duplicate
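As an example of the bounded-memory point, a downsampling pass might simply halve the sample density once a pod's history grows too large. A sketch under assumed types, not the scheduler's exact policy:

// downsample keeps every other record once the history exceeds the
// cap, bounding memory for long-running pods (illustrative sketch).
func downsample(records []PodMetricsRecord, maxRecords int) []PodMetricsRecord {
    if len(records) <= maxRecords {
        return records
    }
    out := make([]PodMetricsRecord, 0, (len(records)+1)/2)
    for i := 0; i < len(records); i += 2 {
        out = append(out, records[i])
    }
    return out
}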
Most importantly, all of this runs as a secondary scheduler, which means:
- System-critical components continue using the default scheduler; your 24/7 HA services will never even know there's another scheduler in the cluster
- Only workloads that opt in are subject to carbon/price decisions; opt-in can happen at the pod level or the namespace level via a webhook admission controller
- Scheduling delays (capped at a configurable max) implement temporal workload shifting, a strategy that has shown promising results in reducing carbon emissions[5]
Pod Configuration
Workload creators can opt into carbon/price-aware scheduling by setting schedulerName: compute-gardener-scheduler in the pod spec, and can then apply individual job/pod overrides with annotations:
metadata:
  annotations:
    # Set custom carbon intensity threshold (gCO2eq/kWh)
    compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold: "200.0"
    # Enable price awareness using ConfigMap'd schedules
    compute-gardener-scheduler.kubernetes.io/price-enabled: "true"
    # Set energy budget (kWh)
    compute-gardener-scheduler.kubernetes.io/energy-budget-kwh: "5.0"
    # Limit any potential carbon/price delays to a maximum of 12 hrs
    compute-gardener-scheduler.kubernetes.io/max-scheduling-delay: "12h"
spec:
  schedulerName: compute-gardener-scheduler
Projected Results and Validation Needs
Based on our simulations and initial measurements, the scheduler shows promising potential benefits:
- Carbon reduction: Testing suggests potential for 50%+ reduction in carbon emissions for time-deferrable workloads (measured 3-17 March 2025, with intensity data in the CAISO region cycling between common daily lows of ~50 and highs of 200+ gCO2eq/kWh)
- Cost savings: Testing indicates 20%+ reduction in electricity costs when leveraging time-of-use (TOU) rates (measured 3-17 March 2025 against PG&E's E-TOU-B pricing schedule)
- Energy tracking: Previously invisible energy consumption becomes visible and actionable (priceless, right? ;D)
- Performance impact: Testing shows minimal increase in scheduling latency (< 100ms per decision; shouldn't be significant considering k8s orchestration timescales)
For a medium-large cloud data center with ML training workloads, our simulations project approximately 18 metric tons of CO2 equivalent emissions that could be avoided per month. These estimates align with recent research showing substantial potential for carbon reduction in ML workloads through intelligent scheduling.[6]
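To put that figure in context with illustrative numbers (ours, not measured values): a sustained 500 kW of ML training load consumes roughly 360 MWh per month, so shifting that load to reduce its average carbon intensity by about 50 gCO2eq/kWh avoids 360,000 kWh × 50 g ≈ 18 tCO2e.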
We're actively seeking partners to help validate these projections in production environments. If you're interested in deploying carbon and/or price-aware scheduling in your Kubernetes clusters, we'd love to collaborate to measure and refine real-world impact. Please reach out via our website or GitHub repository.
Future Developments
As the field of carbon-aware computing continues to evolve, several exciting developments are on the horizon:
- Integration with Advanced Forecasting: Leveraging ML, along with weather and various other signals, to predict future carbon intensity and optimize scheduling decisions in real time.
- Enhanced Hardware Profiling: Developing more accurate models for power consumption across different common cloud hardware configurations, including CPUs, GPUs, and memory.
- Global Carbon Markets: Integrating the scheduler with global carbon markets so that workloads which mitigate carbon emissions, in a calibrated and validated way, can receive carbon credits.
- Cross-Cluster Scheduling: Extending the scheduler's capabilities to manage carbon emissions across multiple Kubernetes clusters, shifting workloads across space as well as time.
These developments will further enhance the ability of the Compute Gardener to reduce carbon footprints and optimize costs in Kubernetes environments.
Conclusion
Building a carbon- and price-aware Kubernetes scheduler requires careful integration with Kubernetes core components, external APIs, and hardware-specific optimizations. By building it as a secondary scheduler with an opt-in model, we've made it easy to deploy without disrupting critical workloads while still providing substantial, and immediate, environmental and cost benefits.
Try it out today with your next training or build job!
Compute Gardener Scheduler is open source and available on GitHub.
Related Reading
Our work builds on the thoughts and patterns from those in the community who have pioneered research in carbon-aware computing, in particular with a k8s perspective. We're grateful for the foundation they've established, which has made our approach practical and possible. Beyond the sources cited above, [9], [10], and [11] have each influenced our designs, as well.
[1] Reuters. Data center build-out stokes fears of overburdening biggest US grid. March 13, 2025.
[2] James, A., & Schien, D. (2019). A Low Carbon Kubernetes Scheduler. In Proceedings of the 6th International Conference on ICT for Sustainability (ICT4S 2019), Vol. 2382. CEUR Workshop Proceedings.
[3] Philipp Wiesner, Ilja Behnke, Dominik Scheinert, Kordian Gontarska, Lauritz Thamsen. Let's Wait Awhile: How Temporal Workload Shifting Can Reduce Carbon Emissions in the Cloud
[4] Zhenyu Wen, Renyu Yang, Albert Zomaya. Energy-Aware Kubernetes Scheduler: Opportunities and Challenges
[5] Randall Ross, Matt Rutherford, David Grunwald. Carbon-Aware Computing And Scheduling for CPU-Intensive Workloads in Heterogeneous Clusters
[6] James Downton, Antonios Katsarakis, Juliana Franco, et al. Scheduling machine learning training jobs on heterogeneous compute with Carbon signal
[7] Piontek, T., Haghshenas, K. & Aiello, M. Carbon emission-aware job scheduling for Kubernetes deployments. J Supercomput 80, 549–569 (2024).
[8] Green Software Foundation. Software Carbon Intensity (SCI) Specification
[9] UMass Solar. Carbon-aware-dag. GitHub repository.
[10] C3Lab. K8s-carbon-aware-scheduler. GitHub repository.
[11] Zemdom. Carbon-aware-scheduling. GitHub repository.