Documentation
The Compute Gardener Scheduler is a Kubernetes scheduler plugin that enables carbon-aware scheduling based on real-time carbon intensity data. It provides robust pod-level energy tracking with comprehensive metrics, helping you accurately measure your compute's carbon footprint. Additionally, the scheduler offers configurable capabilities like price-aware scheduling and hardware power profiling to support your sustainability journey.
Key Features
- Carbon-Aware Scheduling (Optional): Schedule pods based on real-time carbon intensity data from the Electricity Map API, or implement your own intensity source
- Price-Aware Scheduling (Optional): Schedule pods based on time-of-use electricity pricing schedules, or implement your own pricing source
- Hardware Power Profiling: Accurate power modeling with datacenter PUE consideration and hardware-specific profiles
- GPU Workload Classification: Optimize power modeling based on workload type - inference, training, or rendering
- Flexible Configuration: Extensive configuration options for fine-tuning scheduler behavior
- Energy Budget Tracking: Define and monitor energy usage limits for workloads, with configurable actions when limits are exceeded
- Namespace-Level Policies: Define energy policies at the namespace level, with workload-specific overrides for batch/service types
- Comprehensive Metrics: Detailed metrics for monitoring energy usage, carbon intensity, and cost savings
Installation
Using Helm
The recommended way to deploy the Compute Gardener Scheduler is using Helm:
# Add the Helm repository
helm repo add compute-gardener https://elevated-systems.github.io/compute-gardener-scheduler
helm repo update

# Install the chart
helm install compute-gardener-scheduler compute-gardener/compute-gardener-scheduler \
  --namespace kube-system \
  --set carbonAware.electricityMap.apiKey=YOUR_API_KEY
Using YAML Manifests
Alternatively, you can deploy using the provided YAML manifests:
# First, update the API key in the manifest
# Then apply the manifest
kubectl apply -f manifests/compute-gardener-scheduler/compute-gardener-scheduler.yaml
Recommended Components
Metrics Server: Highly recommended but not strictly required. Without Metrics Server, the scheduler won't be able to collect real-time node utilization data, resulting in less accurate energy usage estimates. Core carbon-aware and price-aware scheduling will still function using requested resources rather than actual usage.
Prometheus: Highly recommended but not strictly required. Without Prometheus, you won't be able to visualize scheduler performance metrics or validate carbon/cost savings. The scheduler will continue to function, but you'll miss valuable insights into its operation and won't have visibility into the actual emissions and cost reductions achieved.
Configuration
Environment Variables
API Configuration
# Required: Your API key for Electricity Map API
ELECTRICITY_MAP_API_KEY=<your-api-key>
# Optional: Default is https://api.electricitymap.org/v3/carbon-intensity/latest?zone=
ELECTRICITY_MAP_API_URL=<api-url>
# Optional: Default is US-CAL-CISO
ELECTRICITY_MAP_API_REGION=<region>
# Optional: API request timeout
API_TIMEOUT=10s
# Optional: Maximum API retry attempts
API_MAX_RETRIES=3
# Optional: Delay between retries
API_RETRY_DELAY=1s
# Optional: API rate limit per minute
API_RATE_LIMIT=10
# Optional: Cache TTL for API responses
CACHE_TTL=5m
# Optional: Maximum age of cached data
MAX_CACHE_AGE=1h
# Optional: Enable pod priority-based scheduling
ENABLE_POD_PRIORITIES=false
Carbon Configuration
# Optional: Enable carbon-aware scheduling (default: true)
CARBON_ENABLED=true
# Optional: Base carbon intensity threshold (gCO2/kWh)
CARBON_INTENSITY_THRESHOLD=200.0
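Conceptually, carbon-aware scheduling compares the current grid intensity against the threshold and keeps a pod pending until the intensity drops below it or the pod's maximum scheduling delay elapses. A minimal sketch of that decision; the function name and signature are illustrative, not the scheduler's actual API:

```python
def should_delay(current_intensity_gco2, threshold_gco2, elapsed_s, max_delay_s):
    """Decide whether to keep a pod pending, waiting for cleaner energy.

    Illustrative sketch only; the real scheduler applies this logic
    within the Kubernetes scheduling framework.
    """
    if elapsed_s >= max_delay_s:
        # Never exceed the configured maximum scheduling delay
        return False
    # Delay only while grid intensity exceeds the threshold
    return current_intensity_gco2 > threshold_gco2
```

Pods with a custom `carbon-intensity-threshold` or `max-scheduling-delay` annotation would feed per-pod values into this check instead of the global defaults.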
Time-of-Use Pricing Configuration
# Optional: Enable TOU pricing
PRICING_ENABLED=false
# Optional: Default is 'tou'
PRICING_PROVIDER=tou
# Optional: Path to pricing schedules
PRICING_SCHEDULES_PATH=/path/to/schedules.yaml
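The schedules file referenced by PRICING_SCHEDULES_PATH defines when electricity is billed at peak versus off-peak rates. The exact schema is defined by the project; the fragment below is a hypothetical illustration only, and every field name in it is an assumption:

```yaml
# Hypothetical sketch; consult the project's sample schedules.yaml
# for the actual schema.
schedules:
  - dayOfWeek: "1-5"      # Monday through Friday
    startTime: "16:00"    # peak window start
    endTime: "21:00"      # peak window end
    peakRate: 0.30        # $/kWh during the window
    offPeakRate: 0.10     # $/kWh otherwise
```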
Node Power Configuration
# Default idle power consumption in watts
NODE_DEFAULT_IDLE_POWER=100.0
# Default maximum power consumption in watts
NODE_DEFAULT_MAX_POWER=400.0
# Node-specific power settings
NODE_POWER_CONFIG_worker1=idle:50,max:300
# Path to hardware profiles ConfigMap
HARDWARE_PROFILES_PATH=/path/to/hardware-profiles.yaml
Metrics Collection Configuration
# Interval for collecting pod metrics (e.g. "30s", "1m")
METRICS_SAMPLING_INTERVAL=30s
# Maximum number of metrics samples to store per pod
MAX_SAMPLES_PER_POD=500
# How long to keep metrics for completed pods
COMPLETED_POD_RETENTION=1h
# Strategy for downsampling metrics (lttb, timeBased, minMax)
DOWNSAMPLING_STRATEGY=timeBased
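When a pod accumulates more than MAX_SAMPLES_PER_POD samples, the configured strategy downsamples the series. The lttb option refers to Largest-Triangle-Three-Buckets, which keeps the points that best preserve the shape of the series. A minimal sketch of the algorithm; the scheduler's internal implementation may differ in detail:

```python
def lttb(points, threshold):
    """Largest-Triangle-Three-Buckets downsampling.

    points: list of (timestamp, value) tuples; threshold: target count.
    Illustrative sketch of the 'lttb' strategy named above.
    """
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)
    sampled = [points[0]]              # always keep the first point
    every = (n - 2) / (threshold - 2)  # bucket width
    a = 0                              # index of last selected point
    for i in range(threshold - 2):
        # Average point of the *next* bucket (third triangle vertex)
        nxt_lo = int((i + 1) * every) + 1
        nxt_hi = min(int((i + 2) * every) + 1, n)
        avg_t = sum(p[0] for p in points[nxt_lo:nxt_hi]) / (nxt_hi - nxt_lo)
        avg_v = sum(p[1] for p in points[nxt_lo:nxt_hi]) / (nxt_hi - nxt_lo)
        # Pick the point in the current bucket forming the largest triangle
        lo = int(i * every) + 1
        hi = int((i + 1) * every) + 1
        best, best_area = lo, -1.0
        ax, ay = points[a]
        for j in range(lo, hi):
            t, v = points[j]
            area = abs((ax - avg_t) * (v - ay) - (ax - t) * (avg_v - ay)) / 2
            if area > best_area:
                best, best_area = j, area
        sampled.append(points[best])
        a = best
    sampled.append(points[-1])         # always keep the last point
    return sampled
```

Time-based and min/max strategies are simpler: they keep one (or the min and max) sample per fixed time window.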
Observability Configuration
# Optional: Logging level
LOG_LEVEL=info
# Optional: Enable tracing
ENABLE_TRACING=false
Pod Annotations
# Basic scheduling controls
# Opt out of compute-gardener scheduling
compute-gardener-scheduler.kubernetes.io/skip: "true"
# Disable carbon-aware scheduling for this pod only
compute-gardener-scheduler.kubernetes.io/carbon-enabled: "false"
# Set custom carbon intensity threshold
compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold: "250.0"
# Set custom price threshold
compute-gardener-scheduler.kubernetes.io/price-threshold: "0.12"
# Set custom maximum scheduling delay (e.g. "12h", "30m", "1h30m")
compute-gardener-scheduler.kubernetes.io/max-scheduling-delay: "12h"

# Energy budget controls
# Set energy budget in kilowatt-hours
compute-gardener-scheduler.kubernetes.io/energy-budget-kwh: "5.0"
# Action when budget exceeded: log, notify, annotate, label
compute-gardener-scheduler.kubernetes.io/energy-budget-action: "notify"

# Hardware efficiency controls
# GPU workload type (inference, training, rendering)
compute-gardener-scheduler.kubernetes.io/gpu-workload-type: "inference"

# PUE configuration
# Power Usage Effectiveness for datacenter
compute-gardener-scheduler.kubernetes.io/pue: "1.2"

# Node hardware labels (for improved energy profiles)
node.kubernetes.io/cpu-model: "Intel(R) Xeon(R) Platinum 8275CL"
node.kubernetes.io/gpu-model: "NVIDIA A100"
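Putting a few of these together, a batch pod that tolerates delayed scheduling in exchange for cleaner energy might look like the sketch below. The pod name is arbitrary, and the schedulerName value is an assumption; check your deployment for the registered scheduler name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job    # example name
  annotations:
    compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold: "250.0"
    compute-gardener-scheduler.kubernetes.io/max-scheduling-delay: "6h"
    compute-gardener-scheduler.kubernetes.io/energy-budget-kwh: "5.0"
spec:
  schedulerName: compute-gardener-scheduler   # assumed registered name
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "echo work"]
```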
Hardware Power Profiles
The scheduler uses hardware-specific power profiles to accurately estimate and optimize energy consumption:
# Global PUE defaults
# Default datacenter PUE (typical range: 1.1-1.6)
defaultPUE: 1.1
# Default GPU-specific PUE for power conversion losses
defaultGPUPUE: 1.15
# CPU power profiles
cpuProfiles:
  "Intel(R) Xeon(R) Platinum 8275CL":
    # Idle power in watts
    idlePower: 10.5
    # Max power in watts
    maxPower: 120.0
  "Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz":
    idlePower: 5.0
    maxPower: 65.0
# GPU power profiles with workload type coefficients
gpuProfiles:
  "NVIDIA A100":
    # Idle power in watts
    idlePower: 25.0
    # Max power in watts at 100% utilization
    maxPower: 400.0
    # Power coefficients for different workload types
    workloadTypes:
      # Inference typically uses ~60% of max power at 100% utilization
      inference: 0.6
      # Training uses full power
      training: 1.0
      # Rendering uses ~90% of max power at 100% utilization
      rendering: 0.9
  "NVIDIA GeForce GTX 1660":
    idlePower: 7.0
    maxPower: 125.0
    workloadTypes:
      inference: 0.5
      training: 0.9
      rendering: 0.8
# Memory power profiles
memProfiles:
  "DDR4-2666 ECC":
    # Idle power per GB in watts
    idlePowerPerGB: 0.125
    # Max power per GB in watts at full utilization
    maxPowerPerGB: 0.375
    # Base power overhead in watts
    baseIdlePower: 1.0
# Cloud instance mappings to hardware components
cloudInstanceMapping:
  aws:
    "m5.large":
      cpuModel: "Intel(R) Xeon(R) Platinum 8175M"
      memoryType: "DDR4-2666 ECC"
      numCPUs: 2
      # in MB
      totalMemory: 8192

The hardware detection system works through multiple layers:
- Hardware Profile Database: A ConfigMap containing power profiles for CPUs, GPUs, and memory types
- Cloud Instance Detection: Automatically maps cloud instance types to hardware components
- Hybrid Cloud Hardware Detection: Uses node labels or runtime detection to identify hardware
- Auto Labeling: Scripts provided to automatically label nodes with hardware information
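The profiles above feed a power model that, in the common linear approach, interpolates between idle and max power by utilization, scales GPU draw by the workload coefficient, and applies PUE. A minimal sketch; the function name and signature are illustrative, not the scheduler's API, and the real model may differ:

```python
def estimate_node_power(cpu_util, cpu_idle_w, cpu_max_w,
                        gpu_util=0.0, gpu_idle_w=0.0, gpu_max_w=0.0,
                        workload_coef=1.0, pue=1.1):
    """Linear idle-to-max power model, in watts.

    cpu_util / gpu_util are in [0, 1]; workload_coef is the GPU
    workload-type coefficient (e.g. 0.6 for inference on an A100);
    pue is the datacenter Power Usage Effectiveness multiplier.
    """
    cpu_w = cpu_idle_w + (cpu_max_w - cpu_idle_w) * cpu_util
    gpu_w = gpu_idle_w + (gpu_max_w - gpu_idle_w) * gpu_util * workload_coef
    return (cpu_w + gpu_w) * pue
```

Using the sample profiles above: a Xeon Platinum 8275CL at 50% utilization plus an A100 running inference at full utilization, with the default PUE of 1.1, comes out to roughly 347 W.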
Metrics
The scheduler exports the following Prometheus metrics:
- scheduler_compute_gardener_carbon_intensity: Current carbon intensity (gCO2eq/kWh) for a given region
- scheduler_compute_gardener_electricity_rate: Current electricity rate ($/kWh) for a given location
- scheduler_compute_gardener_scheduling_attempt_total: Number of attempts to schedule pods by result
- scheduler_compute_gardener_pod_scheduling_duration_seconds: Latency for scheduling attempts
- scheduler_compute_gardener_estimated_savings: Estimated savings from scheduling (carbon, cost)
- scheduler_compute_gardener_price_delay_total: Number of scheduling delays due to price thresholds
- scheduler_compute_gardener_carbon_delay_total: Number of scheduling delays due to carbon intensity thresholds
- scheduler_compute_gardener_node_cpu_usage_cores: CPU usage on nodes (baseline, current, final)
- scheduler_compute_gardener_node_memory_usage_bytes: Memory usage on nodes (baseline, current, final)
- scheduler_compute_gardener_node_gpu_usage: GPU utilization on nodes (baseline, current, final)
- scheduler_compute_gardener_node_power_estimate_watts: Estimated node power consumption based on detected hardware profiles (baseline, current, final)
- scheduler_compute_gardener_metrics_samples_stored: Number of pod metrics samples currently stored in cache
- scheduler_compute_gardener_metrics_cache_size: Number of pods being tracked in metrics cache
- scheduler_compute_gardener_job_energy_usage_kwh: Estimated energy usage for completed jobs
- scheduler_compute_gardener_job_carbon_emissions_grams: Estimated carbon emissions for completed jobs
- scheduler_compute_gardener_scheduling_efficiency: Scheduling efficiency metrics (carbon/cost improvements)
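Once these metrics are scraped into Prometheus, a few illustrative queries (the metric names come from the list above; label sets are assumptions) can surface the scheduler's impact:

```promql
# Current carbon intensity for the configured region
scheduler_compute_gardener_carbon_intensity

# Scheduling delays caused by carbon thresholds over the last day
increase(scheduler_compute_gardener_carbon_delay_total[24h])

# Total estimated energy used by completed jobs
sum(scheduler_compute_gardener_job_energy_usage_kwh)
```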
Metrics Collection
The deployment includes both a Service and ServiceMonitor for Prometheus integration:
# Service exposes the metrics endpoint
apiVersion: v1
kind: Service
metadata:
  name: compute-gardener-scheduler-metrics
  namespace: kube-system
  labels:
    component: scheduler
    tier: control-plane
spec:
  ports:
  - name: https
    port: 10259
    targetPort: 10259
    protocol: TCP
  selector:
    component: scheduler
    tier: control-plane
---
# ServiceMonitor configures Prometheus to scrape metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: compute-gardener-scheduler-monitor
  # Adjust to your Prometheus namespace
  namespace: cattle-monitoring-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  endpoints:
  - port: https
    scheme: https
    path: /metrics
    interval: 30s

Additionally, the Pod template includes Prometheus annotations for environments that use annotation-based discovery:
annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '10259'
  prometheus.io/scheme: 'https'
  prometheus.io/path: '/metrics'
Need Help?
Can't find what you're looking for? Check out our GitHub repository or reach out to our team.