Kubernetes and cloud infrastructure have revolutionized how we deploy and manage applications and services. But with data centers now likely responsible for about 4% of U.S. electricity consumption, a share projected to roughly triple to as much as 12% within the next three years[1], now is the time to build energy and emissions awareness into our orchestration systems. In this post, I'll walk through the technical implementation of the Compute Gardener Scheduler, a carbon- and price-aware Kubernetes scheduler plugin we've developed, building on recent advances in energy-aware computing.[2]
The Challenge
Standard Kubernetes scheduling doesn't consider energy implications - pods are placed to optimize for resource utilization, spread, affinity, and other traditional metrics. But what if we want to:
- Delay non-urgent workloads until the grid's carbon intensity is lower[3]
- Run batch jobs during off-peak electricity pricing hours
- Track energy consumption for workloads and enforce budgets
- Optimize for energy-efficient hardware when available[4]
Overview
We built our solution using Kubernetes' scheduling framework, which allows custom plugins to influence scheduling decisions.[7] The core components include:
# Main components of the scheduler
api/ # Various data providers' API clients
carbon/ # Carbon-aware scheduling implementations (currently Electricity Map API)
config/ # Configuration types and validation
metrics/ # Prometheus metrics and hardware profiling
price/ # Price-aware scheduling implementations (currently time-of-use)
Our scheduler implements two key extension points in the framework (plugin registration is sketched just below):
- PreFilter: Initial validation of scheduling constraints and annotations related to carbon intensity and price
- Filter: Main decision logic that evaluates the efficiency and suitability of candidate nodes
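Out-of-tree plugins like ours are typically compiled into their own scheduler binary and registered via app.NewSchedulerCommand, then run alongside the default kube-scheduler. Here's a minimal sketch; the import path and the computegardener.Name/New symbols are illustrative assumptions, not necessarily the repo's actual layout:

package main

import (
    "os"

    "k8s.io/kubernetes/cmd/kube-scheduler/app"

    // Hypothetical import path for the plugin package
    "github.com/elevated-systems/compute-gardener-scheduler/pkg/computegardener"
)

func main() {
    // Build a kube-scheduler command with our plugin registered; the
    // resulting binary runs as a secondary scheduler in the cluster.
    cmd := app.NewSchedulerCommand(
        app.WithPlugin(computegardener.Name, computegardener.New),
    )
    if err := cmd.Execute(); err != nil {
        os.Exit(1)
    }
}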
Decision Flow
PreFilter Stage
The PreFilter stage handles initial validation of scheduling constraints and annotations related to carbon intensity and pricing.
// PreFilter implements the PreFilter interface (simplified)
func (cs *ComputeGardenerScheduler) PreFilter(ctx context.Context, state *framework.CycleState,
    pod *v1.Pod) *framework.Status {
    // Check pricing constraints, if enabled
    if cs.config.Pricing.Enabled && cs.pricingImpl != nil {
        if status := cs.pricingImpl.CheckPriceConstraints(pod, cs.clock.Now()); !status.IsSuccess() {
            return status
        }
    }

    // Check carbon intensity constraints, if enabled
    if cs.config.Carbon.Enabled && cs.carbonImpl != nil {
        // Start from the cluster-wide configured threshold
        threshold := cs.config.Carbon.IntensityThreshold

        // Override with the pod-annotated value, if present
        ...

        // Get current carbon intensity
        intensity, err := cs.carbonImpl.GetCurrentIntensity(ctx)
        if err != nil {
            // If we can't determine carbon intensity, allow the pod through
            return framework.NewStatus(framework.Success, "")
        }

        // Make the scheduling decision based on the threshold
        if intensity > threshold {
            msg := fmt.Sprintf("Current carbon intensity (%.2f) exceeds threshold (%.2f)",
                intensity, threshold)
            return framework.NewStatus(framework.Unschedulable, msg)
        }
    }

    // Default to allowing the pod through
    return framework.NewStatus(framework.Success, "")
}
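The elided override step reads the pod annotation documented later in this post. A minimal sketch of what it might look like (assumed code, requiring the strconv import; the actual implementation may differ):

// Sketch: override the threshold from a pod annotation, if present
if v, ok := pod.Annotations["compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold"]; ok {
    if parsed, err := strconv.ParseFloat(v, 64); err == nil {
        threshold = parsed
    }
}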
Filter Stage
PreFilter occurs without any node awareness. Then, in the Filter stage, we gain node information and can bring our knowledge of node energy use and efficiency into scheduling decisions.
// Filter implements the Filter interface (simplified)
func (cs *ComputeGardenerScheduler) Filter(ctx context.Context, state *framework.CycleState,
    pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // Logic for hardware profiling and node selection
    if cs.config.HardwareProfiling.Enabled && cs.hardwareProfileImpl != nil {
        profile := cs.hardwareProfileImpl.GetNodeProfile(nodeInfo.Node().Name)
        if profile == nil {
            return framework.NewStatus(framework.Unschedulable, "No hardware profile available")
        }

        // Apply node selection based on hardware capabilities
        if !cs.nodeSelectionImpl.IsNodeSuitable(profile, pod) {
            return framework.NewStatus(framework.Unschedulable, "Node not suitable for workload")
        }
    }

    // Default to allowing the pod through
    return framework.NewStatus(framework.Success, "")
}
This implementation follows the Kubernetes scheduler pattern of returning an Unschedulable status when conditions aren't met, which causes the pod to wait in the scheduling queue until conditions change.
Real-Time Carbon Intensity Data
The scheduler consumes carbon intensity data from external APIs and makes scheduling decisions based on configurable thresholds. Our solution:
- Supports pod-level configuration through annotations, allowing different thresholds per workload
- Uses caching and failure backoff to tolerate API failures and maintain scheduler reliability (sketched below)
- Lets operators dynamically override scheduling decisions when needed
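To illustrate the caching and backoff behavior, here's a minimal sketch of a cached intensity lookup. All type and field names here (CarbonClient, IntensityAPI, cacheTTL, backoffPeriod) are illustrative assumptions rather than the actual implementation:

package carbon

import (
    "context"
    "sync"
    "time"
)

// IntensityAPI abstracts an external provider (e.g. Electricity Maps).
type IntensityAPI interface {
    FetchIntensity(ctx context.Context, region string) (float64, error)
}

// CarbonClient caches intensity readings and backs off after failures.
type CarbonClient struct {
    mu            sync.Mutex
    api           IntensityAPI
    region        string
    lastValue     float64
    lastFetch     time.Time
    cacheTTL      time.Duration
    backoffUntil  time.Time
    backoffPeriod time.Duration
}

func (c *CarbonClient) GetCurrentIntensity(ctx context.Context) (float64, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    // Serve the cached value while it's fresh, or while backing off
    // after a recent API failure.
    now := time.Now()
    if now.Sub(c.lastFetch) < c.cacheTTL || now.Before(c.backoffUntil) {
        return c.lastValue, nil
    }

    value, err := c.api.FetchIntensity(ctx, c.region)
    if err != nil {
        // Back off and fall back to the last known value so a flaky
        // API never blocks a scheduling decision.
        c.backoffUntil = now.Add(c.backoffPeriod)
        return c.lastValue, err
    }

    c.lastValue, c.lastFetch = value, now
    return value, nil
}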
Hardware Profiling and Estimating Node Energy Use
Accurately modeling power consumption, estimating energy usage, and selecting suitable nodes based on hardware capabilities enables efficient scheduling. It seems like it SHOULD be easy, but most computing hardware wasn't designed with real-time power monitoring as a primary feature.
Fortunately, for Graphics Processing Units (GPUs), we can leverage direct power metrics through NVIDIA's DCGM, which provides reasonably accurate real-time power consumption data to which we can apply a PUE (Power Usage Effectiveness) factor. However, for CPUs and memory, we must rely on a combination of heuristics, frequency scaling information, and utilization patterns to estimate power consumption.
Our approach follows and extends the Green Software Foundation's Software Carbon Intensity (SCI) methodology[8], adding enhancements for CPU frequency scaling and dynamic power profiling. We believe our estimates are within 5% of actual consumption for calibrated nodes - likely sufficient accuracy for carbon accounting and potential carbon credit validation.
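For reference, the SCI specification expresses emissions as a rate: SCI = (E × I) + M per functional unit R, where E is energy consumed (kWh), I is location-based grid carbon intensity (gCO2eq/kWh), and M is embodied emissions. The estimates below feed the E term. The power model itself draws on a per-node hardware profile; here's a minimal sketch of the shape assumed by the code that follows (field names match the snippet, the struct itself is illustrative):

// NodeHardwareProfile captures the calibration data the power model
// needs (illustrative sketch; actual fields may differ).
type NodeHardwareProfile struct {
    IdlePower     float64 // Watts drawn at idle
    MaxPower      float64 // Watts drawn at full CPU utilization
    BaseFrequency float64 // Nominal CPU frequency, GHz
    GPUPUE        float64 // PUE factor applied to GPU power draw
}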
// calculatePodPower estimates power consumption (simplified; memory
// power modeling elided in this version)
func calculatePodPower(nodeName string, cpu float64, memory float64, gpuPower float64) float64 {
    // Get hardware profile for the node
    profile := getNodeHardwareProfile(nodeName)

    // Apply frequency scaling adjustment (to a local copy, so the
    // shared profile isn't mutated)
    maxPower := profile.MaxPower
    cpuFreq := getCurrentCPUFrequency(nodeName)
    if cpuFreq > 0 && profile.BaseFrequency > 0 {
        freqRatio := cpuFreq / profile.BaseFrequency
        // Power typically scales with roughly the square of frequency
        maxPower *= freqRatio * freqRatio
    }

    // Power-law model for CPU power as a function of utilization
    cpuPower := profile.IdlePower +
        (maxPower-profile.IdlePower)*
            math.Pow(cpu, 1.4) // common exponent in datacenter power modeling

    // Add GPU power with datacenter overhead (PUE)
    totalPower := cpuPower
    if gpuPower > 0 {
        totalPower += gpuPower * profile.GPUPUE
    }
    return totalPower
}
// CalculateTotalEnergy computes energy usage via numerical integration
func CalculateTotalEnergy(records []PodMetricsRecord) float64 {
    totalEnergy := 0.0
    for i := 1; i < len(records); i++ {
        // Integrate power over time, accumulating into totalEnergy
        ...
    }
    return totalEnergy
}
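The elided integration step is a standard trapezoidal rule. A sketch, assuming PodMetricsRecord carries a Timestamp and an instantaneous PowerWatts reading (assumed field names):

// Trapezoidal integration between consecutive samples
dt := records[i].Timestamp.Sub(records[i-1].Timestamp).Hours()
avgPower := (records[i].PowerWatts + records[i-1].PowerWatts) / 2
totalEnergy += avgPower * dt / 1000 // Watts x hours -> kWh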
This allows us to model power consumption accurately with dynamic frequencies, and then calculate energy usage over time using numerical integration.
Performance Considerations
A key concern with any scheduler is adding latency to scheduling decisions. We implemented several optimizations:
- Caching: Both carbon API responses and hardware profiles are cached, ensuring API hiccups never derail a scheduling decision
- Parallel Processing: API requests and metrics collection happen asynchronously with respect to the core scheduling flow
- Bounded Memory: If capturing metrics, records for long-running pods are downsampled to limit memory usage (see the sketch after this list)
- Prometheus Integration: Leverage existing monitoring infrastructure; don't reinvent or duplicate
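As an example of the bounded-memory point, a downsampling pass might simply halve the sample density once a pod's history grows too large. A sketch under assumed types, not the scheduler's exact policy:

// downsample keeps every other record once the history exceeds the
// cap, bounding memory for long-running pods (illustrative sketch).
func downsample(records []PodMetricsRecord, maxRecords int) []PodMetricsRecord {
    if len(records) <= maxRecords {
        return records
    }
    out := make([]PodMetricsRecord, 0, (len(records)+1)/2)
    for i := 0; i < len(records); i += 2 {
        out = append(out, records[i])
    }
    return out
}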
Most importantly, all of this runs as a secondary scheduler, which means:
- System-critical components continue using the default scheduler; your 24/7 HA services will never even know there's another scheduler in the cluster
- Only workloads that opt in are subject to carbon/price decisions; opt-in can happen at the pod level or the namespace level via a webhook admission controller
- Scheduling delays (capped at a configurable max) implement temporal workload shifting, a strategy that has shown promising results in reducing carbon emissions[5]
Pod Configuration
Workload creators can opt into carbon/price-aware scheduling by setting schedulerName: compute-gardener-scheduler in the pod spec, and can then apply individual job/pod overrides with annotations:
metadata:
  annotations:
    # Set custom carbon intensity threshold (gCO2eq/kWh)
    compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold: "200.0"
    # Enable price awareness using ConfigMap'd schedules
    compute-gardener-scheduler.kubernetes.io/price-enabled: "true"
    # Set energy budget (kWh)
    compute-gardener-scheduler.kubernetes.io/energy-budget-kwh: "5.0"
    # Limit any potential carbon/price delays to a maximum of 12 hrs
    compute-gardener-scheduler.kubernetes.io/max-scheduling-delay: "12h"
spec:
  schedulerName: compute-gardener-scheduler
Projected Results and Validation Needs
Based on our simulations and initial measurements, the scheduler shows promising potential benefits:
- Carbon reduction: Testing suggests potential for 50%+ reduction in carbon emissions for time-deferrable workloads (measured 3-17 March 2025, with intensity data in the CAISO region cycling between common daily lows of ~50 and highs of 200+ gCO2eq/kWh)
- Cost savings: Testing indicates 20%+ reduction in electricity costs when leveraging time-of-use (TOU) rates (measured 3-17 March 2025 against PG&E's E-TOU-B pricing schedule)
- Energy tracking: Previously invisible energy consumption becomes visible and actionable (priceless, right? ;D)
- Performance impact: Testing shows minimal increase in scheduling latency (< 100ms per decision; shouldn't be significant considering k8s orchestration timescales)
For a medium-large cloud data center with ML training workloads, our simulations project approximately 18 metric tons of CO2 equivalent emissions that could be avoided per month. These estimates align with recent research showing substantial potential for carbon reduction in ML workloads through intelligent scheduling.[6]
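To put that figure in context with illustrative numbers (ours, not measured values): a sustained 500 kW of ML training load consumes roughly 360 MWh per month, so shifting that load to reduce its average carbon intensity by about 50 gCO2eq/kWh avoids 360,000 kWh × 50 g ≈ 18 tCO2e.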
We're actively seeking partners to help validate these projections in production environments. If you're interested in deploying carbon and/or price-aware scheduling in your Kubernetes clusters, we'd love to collaborate to measure and refine real-world impact. Please reach out via our website or GitHub repository.
Future Developments
As the field of carbon-aware computing continues to evolve, several exciting developments are on the horizon:
- Integration with Advanced Forecasting: Leveraging ML, along with weather and various other signals, to predict future carbon intensity and optimize scheduling decisions in real time.
- Enhanced Hardware Profiling: Developing more accurate models for power consumption across different common cloud hardware configurations, including CPUs, GPUs, and memory.
- Global Carbon Markets: Integrating the scheduler with global carbon markets so that workloads which mitigate carbon emissions, in a calibrated and validated way, can receive carbon credits.
- Cross-Cluster Scheduling: Extending the scheduler's capabilities to manage carbon emissions across multiple Kubernetes clusters, shifting workloads across space as well as time.
These developments will further enhance the ability of the Compute Gardener to reduce carbon footprints and optimize costs in Kubernetes environments.
Conclusion
Building a carbon- and price-aware Kubernetes scheduler requires careful integration with Kubernetes core components, external APIs, and hardware-specific optimizations. By building it as a secondary scheduler with an opt-in model, we've made it easy to deploy without disrupting critical workloads while still providing substantial, and immediate, environmental and cost benefits.
Try it out today with your next training or build job!
Compute Gardener Scheduler is open source and available on GitHub.
Related Reading
Our work builds on the thoughts and patterns from those in the community who have pioneered research in carbon-aware computing, in particular with a k8s perspective. We're grateful for the foundation they've established, which has made our approach practical and possible. Beyond the sources cited above, [9], [10], and [11] have each influenced our designs, as well.
[1] Reuters. Data center build-out stokes fears of overburdening biggest US grid. March 13, 2025.
[2] James, A., & Schien, D. (2019). A Low Carbon Kubernetes Scheduler. In Proceedings of the 6th International Conference on ICT for Sustainability (ICT4S 2019), Vol. 2382. CEUR Workshop Proceedings.
[3] Philipp Wiesner, Ilja Behnke, Dominik Scheinert, Kordian Gontarska, Lauritz Thamsen. Let's Wait Awhile: How Temporal Workload Shifting Can Reduce Carbon Emissions in the Cloud
[4] Zhenyu Wen, Renyu Yang, Albert Zomaya. Energy-Aware Kubernetes Scheduler: Opportunities and Challenges
[5] Randall Ross, Matt Rutherford, David Grunwald. Carbon-Aware Computing And Scheduling for CPU-Intensive Workloads in Heterogeneous Clusters
[6] James Downton, Antonios Katsarakis, Juliana Franco, et al. Scheduling machine learning training jobs on heterogeneous compute with Carbon signal
[7] Piontek, T., Haghshenas, K. & Aiello, M. Carbon emission-aware job scheduling for Kubernetes deployments. J Supercomput 80, 549–569 (2024).
[8] Green Software Foundation. Software Carbon Intensity (SCI) Specification
[9] UMass Solar. Carbon-aware-dag. GitHub repository.
[10] C3Lab. K8s-carbon-aware-scheduler. GitHub repository.
[11] Zemdom. Carbon-aware-scheduling. GitHub repository.