Full idea-finding process with Copilot
# Copilot Chat Conversation Export: Caching strategy for energy balance calculations
User: @maxnutz
Thread URL: https://github.com/copilot/c/a14ba2a6-954f-4039-8de5-a1abc0c863f3
Setting: Take a look at the code in `statistics_functions.py`. All functions in this script are executed within one workflow. It is important that the structure "one variable = one function" remains the same! All of these functions use the PyPSA statistics module; most of them use `n.statistics.energy_balance`.
Problem description: The execution of `n.statistics.energy_balance` is quite expensive. What possibilities do you see to cache `n.statistics.energy_balance` in general, without having to change a lot in the structure of the functions in `statistics_functions.py`?
Goal: Create ideas for a framework for caching `n.statistics.energy_balance`, so that it can be used easily in the functions. Gather information and discuss the ideas with me afterwards! Ask me questions to get a better idea of what I need in the code.
Copilot
I'll analyze the current caching situation and explore possibilities for optimizing `n.statistics.energy_balance` calls. Let me first examine the code structure.

Perfect! I've reviewed the codebase. Now I have a clear understanding of the structure and the problem. Let me ask you some clarifying questions to design the best caching framework for `n.statistics.energy_balance`:
Key Questions to Define the Caching Strategy
1. Caching Scope & Invalidation
- Should the cache be valid for the entire workflow execution (one PyPSA network collection), or do you expect the network to change between function calls?
- Do you ever modify the network between `n.statistics.energy_balance` calls in the same workflow run?
2. What Gets Cached?
- Should we cache all calls to `n.statistics.energy_balance`, or only specific parameter combinations?
- Looking at `statistics_functions.py`, I see multiple calls with different parameters (e.g., `bus_carrier`, `carrier`, `components`, `at_port`). Should each unique parameter set get its own cache entry?
- Should we cache intermediate results before grouping/filtering, or after?
3. Cache Storage & Performance
- In-memory cache (stored on the Network object or a global cache dict) or persistent cache (pickle files)?
- How large do you expect the cached DataFrames to be? (affects memory vs. disk trade-off)
- Are you running this with one network at a time or a NetworkCollection with multiple networks? (affects multi-network caching strategy)
4. API Design - How Should Functions Access the Cache?
Looking at your constraint "structure one variable=one function remains the same", which approach fits best?
Option A: Decorator approach (minimal code change)

```python
@cache_energy_balance
def Final_Energy_by_Carrier__Electricity(n: pypsa.Network, ...) -> pd.Series:
    # Call n.statistics.energy_balance(...) inside
    ...
```
Option B: Context manager (slightly more explicit)

```python
def Final_Energy_by_Carrier__Electricity(n: pypsa.Network, ...) -> pd.Series:
    with energy_balance_cache(n) as cached_stat:
        res = cached_stat(bus_carrier="AC", ...)
```
Option C: Cached accessor wrapper (minimal changes to existing code)

```python
# In workflow/class_definitions.py, wrap the network:
n = CachedNetworkWrapper(n)
# Then call functions normally; cache is transparent
```
Option D: Manual cache context in workflow (centralized, explicit)

```python
# In Network_Processor.calculate_variables_values():
for n in network_collection:
    with enable_energy_balance_cache(n):
        for variable in variables:
            result = execute_function(variable, n)
```
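For illustration, `enable_energy_balance_cache` from the Option D snippet could be a small context manager along these lines. This is a hedged sketch, not part of the proposal: the name comes from the snippet above, and it assumes the `n.statistics` accessor allows its `energy_balance` attribute to be reassigned (the argument values are also assumed hashable; list-valued kwargs such as `groupby` would need to be converted to tuples first):

```python
from contextlib import contextmanager

@contextmanager
def enable_energy_balance_cache(n):
    """Swap n.statistics.energy_balance for a caching wrapper inside the block."""
    cache = {}
    original = n.statistics.energy_balance

    def cached(*args, **kwargs):
        # Assumes all argument values are hashable; lists would need to be
        # normalised into tuples before being used in the key.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = original(*args, **kwargs)
        return cache[key]

    n.statistics.energy_balance = cached
    try:
        yield cache
    finally:
        # Always restore the original method, even if a function raised.
        n.statistics.energy_balance = original
```

The try/finally guarantees the monkeypatch never leaks past the workflow loop, which is the main appeal of the centralized option.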
5. Cache Key Strategy
- Should the cache key include the network object itself (by id), or assume there's only one active network at a time?
- Should we hash the parameters to `energy_balance`, or create a unique signature (e.g., a tuple of all kwargs)?
6. Monitoring & Debugging
- Do you want cache hit/miss statistics logged during execution?
- Should there be a way to disable caching temporarily for testing or debugging?
My initial recommendation (pending your answers):
- Option D with a context manager + in-memory cache attached to the Network object
- Cache keyed by a sorted tuple of all kwargs passed to `n.statistics.energy_balance`
- Auto-invalidates when exiting the context or when moving to a new network
- Minimal changes to `statistics_functions.py`
Please share your thoughts on these questions so I can propose a concrete framework design!
@maxnutz
(1) For one network is good! There is no change of the network itself during the entire workflow!
(2) Good question! I am not yet completely sure about that! I think an approach would be to include all index levels, so that one can select the index levels needed. It is important that there are not too many changes needed in the currently implemented function calls.
(3) I think an in-memory cache is enough, but no strong opinion on that!
(4) Option A would be best, so that everyone can use this without having to think about it explicitly.
(5) No, the cache just includes the energy balance, nothing more, not even a network!
(6) An option for disabling caching is a good idea!
Copilot
Perfect! Your answers give me a clear direction. Let me summarize the framework design and ask a few final clarifications:
Proposed Caching Framework Design
Based on your inputs, here's what I'm proposing:
Architecture Overview
- Decorator-based approach (`@cache_energy_balance`) that wraps function calls
- In-memory cache stored globally (not on the network object, as per your point 5)
- Cache key = sorted tuple of all `energy_balance` call parameters (kwargs + positional args)
- Auto-enabled by default, with a disable flag for testing/debugging
- Minimal changes to the existing `statistics_functions.py` functions
How It Works
```python
# In a new module: pypsa_validation_processing/caching.py
import functools

import pypsa

# Global cache state
_energy_balance_cache = {}
_caching_enabled = True


def disable_caching():
    """Disable caching temporarily (e.g., for testing)."""
    global _caching_enabled
    _caching_enabled = False


def enable_caching():
    """Re-enable caching."""
    global _caching_enabled
    _caching_enabled = True


def clear_cache():
    """Clear the cache (e.g., between network changes)."""
    _energy_balance_cache.clear()


def cache_energy_balance(func):
    """Decorator: caches n.statistics.energy_balance calls within a function."""

    @functools.wraps(func)
    def wrapper(n: pypsa.Network, *args, **kwargs):
        # Keep a reference to the real method so we can restore it later
        original_energy_balance = n.statistics.energy_balance

        def cached_energy_balance(*eb_args, **eb_kwargs):
            # _make_cache_key is discussed below (Q1)
            cache_key = _make_cache_key(*eb_args, **eb_kwargs)
            if _caching_enabled and cache_key in _energy_balance_cache:
                return _energy_balance_cache[cache_key]
            result = original_energy_balance(*eb_args, **eb_kwargs)
            if _caching_enabled:
                _energy_balance_cache[cache_key] = result
            return result

        # Temporarily replace the bound method with the caching wrapper
        n.statistics.energy_balance = cached_energy_balance
        try:
            return func(n, *args, **kwargs)
        finally:
            # Restore the original method
            n.statistics.energy_balance = original_energy_balance

    return wrapper
```
Usage (minimal code change)
```python
# In statistics_functions.py - just add one decorator
@cache_energy_balance
def Final_Energy_by_Carrier__Electricity(
    n: pypsa.Network,
    aggregate_per_year: bool = True,
) -> pd.Series | pd.DataFrame:
    """Extract electricity final energy from a PyPSA Network."""
    res = abs(
        n.statistics.energy_balance(  # This call is now cached!
            bus_carrier="AC",
            groupby=["carrier", "location", "unit"],
            groupby_time=aggregate_per_year,
        )
    )
    # ... rest of function unchanged
```
Final Clarifying Questions
Q1: Cache Key Generation
For `_make_cache_key()`, should we:
- Include all parameters (even `groupby_time=False`)?
- Exclude certain params that don't affect the result (if any)?
- Handle mutable parameters like lists specially (e.g., `groupby=["carrier", "location"]`)?
My recommendation: Include all parameters; convert list values to tuples and sort the kwargs by name for consistency.
Q2: Cache Lifecycle per Network
Should the cache:
- Persist across the entire workflow (i.e., multiple networks in the collection)?
- Clear automatically when switching to a new network?
- Be manually cleared via a `clear_cache()` call in `Network_Processor.calculate_variables_values()`?
My recommendation: Clear at the start of each network processing step (in the loop in `Network_Processor.calculate_variables_values()`, lines 644-672).
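The per-network reset can be shown as a standalone sketch. `clear_cache` mirrors the proposal above; `execute_function`, the variable names, and the network labels are stand-ins so the loop runs on its own:

```python
# Global cache, as in the proposed caching module
_cache = {}

def clear_cache():
    _cache.clear()

def execute_function(variable, n):
    # The cache key deliberately omits the network, matching the constraint
    # that the cache holds only the energy balance, not the network. Without
    # the per-network clear below, the second network would be served stale
    # values computed for the first one.
    if variable not in _cache:
        _cache[variable] = f"{variable}@{n}"
    return _cache[variable]

results = {}
for n in ["network_2030", "network_2040"]:
    clear_cache()  # reset before processing each network
    for variable in ["Final_Energy", "Emissions"]:
        results[(n, variable)] = execute_function(variable, n)
```

This is exactly why clearing per network matters when the key excludes the network itself: correctness depends on the cache never outliving one network.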
Q3: Caching at Which Level?
Should we cache:
- Every call to `energy_balance` (including intermediate calculations)?
- Only the first/main call per function (to avoid over-caching)?
My recommendation: Cache every call: if functions call `energy_balance` multiple times with the same params, the repeat calls should hit the cache.
Q4: Return Value Handling
Since `energy_balance` returns pandas objects (Series/DataFrame), should we:
- Cache directly (a reference to the same object, fast but risky if the caller modifies it)?
- Cache copies (`.copy()`, safer but slightly slower)?
My recommendation: Cache with `.copy()` to prevent external modifications from affecting cached results.
Q5: Stats/Logging
Should we provide:
- Silent caching (no output)?
- Optional debug logging (cache hits/misses per function call)?
- Summary stats printed at end of workflow?
My recommendation: Optional logging via a parameter or environment variable (e.g., PYPSA_CACHE_DEBUG=1).
Once you confirm these, I'll create:
- ✅ New `caching.py` module with the complete framework
- ✅ Updated `statistics_functions.py` with decorators
- ✅ Integration point in `Network_Processor` to clear the cache per network
- ✅ Optional: helper utilities for cache inspection
What are your preferences on Q1-Q5?
Goal
Enable a caching mode for `n.statistics.energy_balance` that allows easy usage in the statistics functions.