
enable cache for energy-balance statistics #52

@maxnutz

Description

Goal

Enable a caching mode for n.statistics.energy_balance that allows easy use in the statistics functions.

Architecture Overview

  • Decorator-based approach (@cache_energy_balance) that wraps function calls
  • In-memory cache stored globally (not on network object, as per your point 5)
  • Cache key = sorted tuple of all energy_balance call parameters (kwargs + positional args)
  • Auto-enabled by default, with a disable flag for testing/debugging
  • Minimal changes to existing statistics_functions.py functions

How It Works

# In a new module: pypsa_validation_processing/caching.py
import functools

import pypsa

# Global cache state
_energy_balance_cache = {}
_caching_enabled = True

def disable_caching():
    """Disable caching temporarily (e.g., for testing)"""
    global _caching_enabled
    _caching_enabled = False

def enable_caching():
    """Re-enable caching"""
    global _caching_enabled
    _caching_enabled = True

def clear_cache():
    """Clear the cache (e.g., between network changes)"""
    global _energy_balance_cache
    _energy_balance_cache.clear()

def cache_energy_balance(func):
    """Decorator: caches n.statistics.energy_balance calls within a function"""
    @functools.wraps(func)
    def wrapper(n: pypsa.Network, *args, **kwargs):
        # Inject cached_energy_balance into the network temporarily
        original_energy_balance = n.statistics.energy_balance
        
        def cached_energy_balance(*eb_args, **eb_kwargs):
            cache_key = _make_cache_key(*eb_args, **eb_kwargs)
            
            if _caching_enabled and cache_key in _energy_balance_cache:
                return _energy_balance_cache[cache_key]
            
            result = original_energy_balance(*eb_args, **eb_kwargs)
            
            if _caching_enabled:
                _energy_balance_cache[cache_key] = result
            
            return result
        
        # Temporarily replace
        n.statistics.energy_balance = cached_energy_balance
        try:
            return func(n, *args, **kwargs)
        finally:
            # Restore original
            n.statistics.energy_balance = original_energy_balance
    
    return wrapper
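
The helper _make_cache_key is referenced above but not defined yet. A minimal sketch of what it could look like, following the "sorted tuple of all call parameters" idea (the name and the exact freezing of list arguments are assumptions, not a final API):

def _make_cache_key(*args, **kwargs):
    """Build a hashable cache key from the energy_balance call parameters (sketch)."""
    def _freeze(value):
        # Lists (e.g. groupby=["carrier", "location"]) are not hashable,
        # so turn them into tuples; order is preserved.
        if isinstance(value, list):
            return tuple(value)
        return value

    frozen_args = tuple(_freeze(a) for a in args)
    frozen_kwargs = tuple(sorted((k, _freeze(v)) for k, v in kwargs.items()))
    return (frozen_args, frozen_kwargs)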

Usage (minimal code change)

# In statistics_functions.py - just add one decorator
from pypsa_validation_processing.caching import cache_energy_balance

@cache_energy_balance
def Final_Energy_by_Carrier__Electricity(
    n: pypsa.Network,
    aggregate_per_year: bool = True,
) -> pd.Series | pd.DataFrame:
    """Extract electricity final energy from a PyPSA Network."""
    res = abs(
        n.statistics.energy_balance(  # This call is now cached!
            bus_carrier="AC",
            groupby=["carrier", "location", "unit"],
            groupby_time=aggregate_per_year,
        )
    )
    # ... rest of function unchanged
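
For testing or debugging, the cache can be bypassed or reset explicitly. An illustrative test sketch using the helpers proposed above (the test name and the network fixture are made up, and it assumes the function returns a Series):

# e.g. in a test module
import pandas as pd

from pypsa_validation_processing.caching import clear_cache, disable_caching, enable_caching


def test_caching_does_not_change_results(n):  # n: a solved pypsa.Network fixture
    clear_cache()
    cached = Final_Energy_by_Carrier__Electricity(n)        # first call fills the cache
    cached_again = Final_Energy_by_Carrier__Electricity(n)  # second call is served from it

    disable_caching()
    try:
        uncached = Final_Energy_by_Carrier__Electricity(n)  # recomputed without the cache
    finally:
        enable_caching()

    pd.testing.assert_series_equal(cached, cached_again)
    pd.testing.assert_series_equal(cached, uncached)
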
Full idea-finding process with Copilot

# Copilot Chat Conversation Export: Caching strategy for energy balance calculations

User: @maxnutz
Thread URL: https://github.com/copilot/c/a14ba2a6-954f-4039-8de5-a1abc0c863f3

@maxnutz

Setting: Take a look at the code in statistics_functions.py. All functions in this script are executed within one workflow. It is important that the structure of one variable = one function remains the same! All of these functions use the pypsa statistics; most of them use n.statistics.energy_balance.
Problem description: The execution of n.statistics.energy_balance is quite expensive. What possibilities do you see to cache n.statistics.energy_balance in general without having to change a lot in the structure of the functions in statistics_functions.py?
Goal: Create ideas for a framework for caching n.statistics.energy_balance, so that it can be used easily in the functions. Get information and discuss the ideas afterwards with me! Ask me questions to get a better idea of what I need in the code.

Copilot

I'll analyze the current caching situation and explore possibilities for optimizing n.statistics.energy_balance calls. Let me first examine the code structure.

Perfect! I've reviewed the codebase. Now I have a clear understanding of the structure and the problem. Let me ask you some clarifying questions to design the best caching framework for n.statistics.energy_balance:

Key Questions to Define the Caching Strategy

1. Caching Scope & Invalidation

  • Should the cache be valid for the entire workflow execution (one PyPSA network collection), or do you expect the network to change between function calls?
  • Do you ever modify the network between statistics.energy_balance calls in the same workflow run?

2. What Gets Cached?

  • Should we cache all calls to n.statistics.energy_balance, or only specific parameter combinations?
  • Looking at statistics_functions.py, I see multiple calls with different parameters (e.g., bus_carrier, carrier, components, at_port). Should each unique parameter set get its own cache entry?
  • Should we cache intermediate results before grouping/filtering, or after?

3. Cache Storage & Performance

  • In-memory cache (stored on the Network object or a global cache dict) or persistent cache (pickle files)?
  • How large do you expect the cached DataFrames to be? (affects memory vs. disk trade-off)
  • Are you running this with one network at a time or a NetworkCollection with multiple networks? (affects multi-network caching strategy)

4. API Design - How Should Functions Access the Cache?

Looking at your constraint "structure one variable=one function remains the same", which approach fits best?

Option A: Decorator approach (minimal code change)

@cache_energy_balance
def Final_Energy_by_Carrier__Electricity(n: pypsa.Network, ...) -> pd.Series:
    # Call n.statistics.energy_balance(...) inside

Option B: Context manager (slightly more explicit)

def Final_Energy_by_Carrier__Electricity(n: pypsa.Network, ...) -> pd.Series:
    with energy_balance_cache(n) as cached_stat:
        res = cached_stat(bus_carrier="AC", ...)

Option C: Cached accessor wrapper (minimal changes to existing code)

# In workflow/class_definitions.py, wrap the network:
n = CachedNetworkWrapper(n)
# Then call functions normally; cache is transparent

Option D: Manual cache context in workflow (centralized, explicit)

# In Network_Processor.calculate_variables_values():
for n in network_collection:
    with enable_energy_balance_cache(n):
        for variable in variables:
            result = execute_function(variable, n)

5. Cache Key Strategy

  • Should the cache key include the network object itself (by id), or assume there's only one active network at a time?
  • Should we hash the parameters to energy_balance, or create a unique signature (e.g., tuple of all kwargs)?

6. Monitoring & Debugging

  • Do you want cache hit/miss statistics logged during execution?
  • Should there be a way to disable caching temporarily for testing or debugging?

My initial recommendation (pending your answers):

  • Option D with a context manager + in-memory cache attached to the Network object
  • Cache keyed by a sorted tuple of all kwargs passed to n.statistics.energy_balance
  • Auto-invalidates when exiting the context or when moving to a new network
  • Minimal changes to statistics_functions.py

Please share your thoughts on these questions so I can propose a concrete framework design!

@maxnutz

(1) For one network is good! There is no change of the network itself during the entire workflow!
(2) Good question! I am not yet completely sure about that! I think an approach would be to include all index levels, so that one can pick the index levels needed. It is important that there are not too many changes needed in the currently implemented function calls.
(3) I think an in-memory cache is enough, but no strong opinion on that!
(4) Option A would be best, so that everyone can use this without having to think about it explicitly.
(5) No, the cache just includes the energy balance, nothing more, not even a network!
(6) An option for disabling caching is a good idea!

Copilot

Perfect! Your answers give me a clear direction. Let me summarize the framework design and ask a few final clarifications:

Proposed Caching Framework Design

Based on your inputs, here's what I'm proposing:

Architecture Overview

  • Decorator-based approach (@cache_energy_balance) that wraps function calls
  • In-memory cache stored globally (not on network object, as per your point 5)
  • Cache key = sorted tuple of all energy_balance call parameters (kwargs + positional args)
  • Auto-enabled by default, with a disable flag for testing/debugging
  • Minimal changes to existing statistics_functions.py functions

How It Works

# In a new module: pypsa_validation_processing/caching.py

# Global cache state
_energy_balance_cache = {}
_caching_enabled = True

def disable_caching():
    """Disable caching temporarily (e.g., for testing)"""
    global _caching_enabled
    _caching_enabled = False

def enable_caching():
    """Re-enable caching"""
    global _caching_enabled
    _caching_enabled = True

def clear_cache():
    """Clear the cache (e.g., between network changes)"""
    global _energy_balance_cache
    _energy_balance_cache.clear()

def cache_energy_balance(func):
    """Decorator: caches n.statistics.energy_balance calls within a function"""
    @functools.wraps(func)
    def wrapper(n: pypsa.Network, *args, **kwargs):
        # Inject cached_energy_balance into the network temporarily
        original_energy_balance = n.statistics.energy_balance
        
        def cached_energy_balance(*eb_args, **eb_kwargs):
            cache_key = _make_cache_key(*eb_args, **eb_kwargs)
            
            if _caching_enabled and cache_key in _energy_balance_cache:
                return _energy_balance_cache[cache_key]
            
            result = original_energy_balance(*eb_args, **eb_kwargs)
            
            if _caching_enabled:
                _energy_balance_cache[cache_key] = result
            
            return result
        
        # Temporarily replace
        n.statistics.energy_balance = cached_energy_balance
        try:
            return func(n, *args, **kwargs)
        finally:
            # Restore original
            n.statistics.energy_balance = original_energy_balance
    
    return wrapper

Usage (minimal code change)

# In statistics_functions.py - just add one decorator

@cache_energy_balance
def Final_Energy_by_Carrier__Electricity(
    n: pypsa.Network,
    aggregate_per_year: bool = True,
) -> pd.Series | pd.DataFrame:
    """Extract electricity final energy from a PyPSA Network."""
    res = abs(
        n.statistics.energy_balance(  # This call is now cached!
            bus_carrier="AC",
            groupby=["carrier", "location", "unit"],
            groupby_time=aggregate_per_year,
        )
    )
    # ... rest of function unchanged

Final Clarifying Questions

Q1: Cache Key Generation

For _make_cache_key(), should we:

  • Include all parameters (even groupby_time=False)?
  • Exclude certain params that don't affect the result (if any)?
  • Handle mutable parameters like lists specially (e.g., groupby=["carrier", "location"])?

My recommendation: Include all parameters, convert lists to sorted tuples for consistency.
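
For illustration, the key for the call used in Final_Energy_by_Carrier__Electricity above could then look roughly like this (the exact shape depends on how _make_cache_key is finally implemented):

# n.statistics.energy_balance(bus_carrier="AC",
#                             groupby=["carrier", "location", "unit"],
#                             groupby_time=True)
# -> kwargs sorted by name, the groupby list frozen to a tuple:
(
    ("bus_carrier", "AC"),
    ("groupby", ("carrier", "location", "unit")),
    ("groupby_time", True),
)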

Q2: Cache Lifecycle per Network

Should the cache:

  • Persist across the entire workflow (i.e., multiple networks in the collection)?
  • Clear automatically when switching to a new network?
  • Be manually cleared via clear_cache() call in Network_Processor.calculate_variables_values()?

My recommendation: Clear at the start of each network processing (in the loop in Network_Processor.calculate_variables_values(), lines 644-672).
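
A sketch of that integration point, reusing the loop shape from Option D above (the actual code in Network_Processor.calculate_variables_values() may differ):

# In Network_Processor.calculate_variables_values()
from pypsa_validation_processing.caching import clear_cache

for n in network_collection:
    clear_cache()  # entries from the previous network must not leak into this one
    for variable in variables:
        result = execute_function(variable, n)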

Q3: Caching at Which Level?

Should we cache:

  • Every call to energy_balance (including intermediate calculations)?
  • Only the first/main call per function (to avoid over-caching)?

My recommendation: Cache every call—if functions call energy_balance multiple times with the same params, it should be cached.

Q4: Return Value Handling

Since energy_balance returns pandas objects (Series/DataFrame), should we:

  • Cache directly (reference to same object, fast but risky if modified)?
  • Cache copies (.copy(), safer but slightly slower)?

My recommendation: Cache with .copy() to prevent external modifications affecting cached results.
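
If copies are chosen, the inner cached_energy_balance from the decorator above would change to roughly the following (a sketch, not final code):

def cached_energy_balance(*eb_args, **eb_kwargs):
    cache_key = _make_cache_key(*eb_args, **eb_kwargs)

    if _caching_enabled and cache_key in _energy_balance_cache:
        # Hand out a copy so callers cannot mutate the cached object
        return _energy_balance_cache[cache_key].copy()

    result = original_energy_balance(*eb_args, **eb_kwargs)

    if _caching_enabled:
        # Store a copy so later changes to `result` do not leak into the cache
        _energy_balance_cache[cache_key] = result.copy()

    return result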

Q5: Stats/Logging

Should we provide:

  • Silent caching (no output)?
  • Optional debug logging (cache hits/misses per function call)?
  • Summary stats printed at end of workflow?

My recommendation: Optional logging via a parameter or environment variable (e.g., PYPSA_CACHE_DEBUG=1).
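
A minimal sketch of the environment-variable variant (the variable name PYPSA_CACHE_DEBUG is taken from the recommendation above; the logger setup is an assumption):

# In pypsa_validation_processing/caching.py
import logging
import os

logger = logging.getLogger(__name__)
_cache_debug = os.environ.get("PYPSA_CACHE_DEBUG", "0") == "1"


def _log_cache_event(hit: bool, cache_key) -> None:
    """Log a cache hit or miss when PYPSA_CACHE_DEBUG=1 (sketch)."""
    if _cache_debug:
        logger.info("energy_balance cache %s for %s", "hit" if hit else "miss", cache_key)

# inside cached_energy_balance, around the lookup:
#     _log_cache_event(hit=cache_key in _energy_balance_cache, cache_key=cache_key)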


Once you confirm these, I'll create:

  1. ✅ New caching.py module with the complete framework
  2. ✅ Updated statistics_functions.py with decorators
  3. ✅ Integration point in Network_Processor to clear cache per network
  4. ✅ Optional: helper utilities for cache inspection

What are your preferences on Q1-Q5?

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
