Implementation of the paper "Free(): Learning to Forget in Malloc-Only Reasoning Models".
Reasoning models enhance problem-solving by scaling test-time compute, yet they face a critical paradox: excessive thinking tokens often degrade performance rather than improve it. We attribute this to a fundamental architectural flaw: standard LLMs operate as "malloc-only" engines, continuously accumulating valid and redundant steps alike without a mechanism to prune obsolete information. To break this cycle, we propose Free()LM, a model that introduces an intrinsic self-forgetting capability via the Free-Module, a plug-and-play LoRA adapter. By iteratively switching between reasoning and cleaning modes, Free()LM dynamically identifies and prunes useless context chunks, maintaining a compact and noise-free state.
Extensive experiments show that Free()LM provides consistent improvements across all model scales (8B to 685B). It achieves a 3.3% average improvement over top-tier reasoning baselines, even establishing a new SOTA on IMOanswerBench using DeepSeek V3.2-Speciale. Most notably, in long-horizon tasks where the standard Qwen3-235B-A22B model suffers a total collapse (0% accuracy), Free()LM restores performance to ~50%. Our findings suggest that sustainable intelligence requires the freedom to forget as much as the power to think.
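The alternation between reasoning and cleaning described above can be pictured with a toy control loop. This is only an illustrative sketch, not the paper's algorithm or this repository's code: `reason_and_clean`, `generate_chunk`, `is_useless`, and `clean_every` are hypothetical names, and the real Free-Module is a LoRA adapter rather than a Python callback.

```python
from typing import Callable, List


def reason_and_clean(
    generate_chunk: Callable[[List[str]], str],
    is_useless: Callable[[str, List[str]], bool],
    max_steps: int,
    clean_every: int,
) -> List[str]:
    """Alternate between appending reasoning chunks and pruning useless ones.

    All arguments are hypothetical stand-ins: `generate_chunk` plays the role
    of the reasoning mode, `is_useless` plays the role of the cleaning mode.
    """
    context: List[str] = []
    for step in range(1, max_steps + 1):
        # Reasoning mode: extend the context with one new chunk.
        context.append(generate_chunk(context))
        # Cleaning mode: every `clean_every` steps, drop chunks judged useless.
        if step % clean_every == 0:
            context = [c for c in context if not is_useless(c, context)]
    return context


# Toy run: a scripted "model" and a rule that discards abandoned attempts.
scripted = iter(["try x=1", "dead end: x=1 fails", "try x=2", "x=2 works"])
pruned = reason_and_clean(
    generate_chunk=lambda ctx: next(scripted),
    is_useless=lambda chunk, ctx: chunk.startswith("dead end"),
    max_steps=4,
    clean_every=2,
)
print(pruned)
```

In this toy run the abandoned attempt is removed at the first cleaning pass, so later steps see a shorter context.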
This repository contains the official implementation, data, and model checkpoints for the paper "Free(): Learning to Forget in Malloc-Only Reasoning Models".
We provide LoRA checkpoints for various base models. You can download them directly from Hugging Face.
| Base Model | Method | Checkpoint |
|---|---|---|
| Qwen3-8B | Free()LM | 🤗 ldsjmdy/Qwen3-8B-FreeLM-LoRA |
| Qwen3-30B-A3B-Thinking-2507 | Free()LM | 🤗 ldsjmdy/Qwen3-30B-A3B-Thinking-2507-FreeLM-LoRA |
| Qwen3-235B-A22B-Thinking-2507 | Free()LM | 🤗 ldsjmdy/Qwen3-235B-A3B-Thinking-2507-FreeLM-LoRA |
The training data used in our experiments can be downloaded here: 🤗 ldsjmdy/FreeLM
The data follows the JSON format below:
{
  "prompt": "Instruction here...",
  "completion": "Desired model response..."
}

We utilize datasets such as AIME 24/25 for evaluation. The processed evaluation sets are available here: 🤗 ldsjmdy/FreeLM.
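Records in the prompt/completion training format shown earlier can be sanity-checked before training. The following is a minimal sketch that assumes the records are stored one JSON object per line (JSON Lines); that storage layout is an assumption, and `parse_training_records` is a hypothetical helper, not part of this repository.

```python
import json
from typing import Dict, List

# Field names taken from the training data format in this README.
REQUIRED_KEYS = ("prompt", "completion")


def parse_training_records(jsonl_text: str) -> List[Dict[str, str]]:
    """Parse JSON Lines text and check that each record has string fields."""
    records: List[Dict[str, str]] = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        for key in REQUIRED_KEYS:
            if not isinstance(record.get(key), str):
                raise ValueError(f"line {line_no}: field {key!r} must be a string")
        records.append(record)
    return records


sample = '{"prompt": "Instruction here...", "completion": "Desired model response..."}\n'
print(len(parse_training_records(sample)))
```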
The evaluation data format is as follows:
{
  "prompt": "Question text...",
  "answer": "Ground truth answer...",
  "id": 101,
  "source": "aime24"
}

Clone the repository and install the required dependencies.
To ensure environment stability and avoid conflicts, we have separated the dependencies for Inference/Evaluation, SGLang Deployment, and Training. We recommend using uv for package management.
Dependency Files:
- Inference & Eval: requirements.eval.txt
- SGLang Deployment: requirements.sglang.txt
- Training: requirements.train.txt
git clone https://github.com/TemporaryLoRA/FreeLM.git
cd FreeLM
# Install dependencies based on your needs:
# For Inference and Evaluation
uv pip install -r requirements.eval.txt
# For SGLang Deployment
# uv pip install -r requirements.sglang.txt
# For Training
# uv pip install -r requirements.train.txt

We used the megatron-core and megatron-bridge versions bundled with the NeMo Framework container nvcr.io/nvidia/nemo:25.11.01. The corresponding code is available in the megatron directory of this repository.
Our training pipeline is built upon Megatron-Bridge. We provide a comprehensive training script for large-scale models (e.g., Qwen3-235B).
1. Convert Checkpoints

First, convert the Hugging Face checkpoint to the Megatron format:

python3 scripts/convert_hf_to_mbridge.py \
  --input Qwen/Qwen3-235B-A22B-Thinking-2507 \
  --output <path_to_save_megatron_checkpoint>

2. Run Training
We recommend using 8 nodes (64 GPUs) for training large-scale models.
Before running scripts/train_qwen3_235b.sh, please define the following variables within the script or export them as environment variables:
model_path=""
nemo_path="${model_path}-nemo"
train_fp=""
output_dir=""
run_name=""
# Adjust based on your dataset size and saving strategy
train_iters=-1
save_interval=-1

Then execute the training script:

bash scripts/train_qwen3_235b.sh

We support efficient inference using SGLang.
1. Launch the Model Service

Deploy the model with LoRA adapters enabled:

sglang serve --model-path Qwen/Qwen3-8B \
  --host 0.0.0.0 \
  --port 30000 \
  --tensor-parallel-size 1 \
  --context-length 32768 \
  --enable-lora \
  --lora-path lora=ldsjmdy/Qwen3-8B-FreeLM-LoRA

2. Run Inference Client
Use runner.py to send requests to the deployed service.
Note: Please configure the service URL in runner.py before running. If you have deployed multiple services for parallel processing, add them to the list:

# runner.py
...
if __name__ == '__main__':
    # ...
    service_ips = [
        ("127.0.0.1", "30000"),
        ("127.0.0.1", "30001")  # Add more workers if available
    ]
Run the inference script:
python3 runner.py --help  # View available arguments

The runner.py script supports concurrent calls to multiple service endpoints to maximize inference throughput.
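To make the idea of spreading requests over several endpoints concrete, here is a minimal sketch of a round-robin client. It is not the code in runner.py and does not implement the reasoning/cleaning switching. It assumes SGLang's native `/generate` endpoint with a `lora_path` field naming the adapter (`lora`, matching the launch command above); the helper names `build_request`, `assign_endpoints`, `send`, and `run_all` and the default `max_new_tokens` value are hypothetical.

```python
import itertools
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple


def build_request(prompt: str, lora_name: str = "lora", max_new_tokens: int = 1024) -> Dict:
    """Build a JSON payload for SGLang's /generate endpoint (assumed schema)."""
    return {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens},
        "lora_path": lora_name,  # adapter name registered at launch time
    }


def assign_endpoints(
    prompts: Sequence[str], service_ips: Sequence[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Pair each prompt with a service URL in round-robin order."""
    if not service_ips:
        raise ValueError("service_ips must not be empty")
    urls = itertools.cycle([f"http://{host}:{port}/generate" for host, port in service_ips])
    return [(next(urls), prompt) for prompt in prompts]


def send(url: str, payload: Dict, timeout: float = 600.0) -> Dict:
    """POST one payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_all(prompts: Sequence[str], service_ips: Sequence[Tuple[str, str]], max_workers: int = 8) -> List[Dict]:
    """Send all prompts concurrently; results keep the order of `prompts`."""
    jobs = assign_endpoints(prompts, service_ips)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: send(job[0], build_request(job[1])), jobs))
```

Calling `run_all(["What is 1+1?"], [("127.0.0.1", "30000")])` would require a running service; the two pure helpers can be checked without one.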
We employ openmathinst for mathematical reasoning evaluation.
- Standard Evaluation: Please refer to eval_passk.py for Pass@K calculation.
- LLM-as-a-Judge: For DeepSeek models or open-ended generation, we provide an LLM judge script located at llm_judge.py.
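For readers unfamiliar with Pass@K, the sketch below shows the widely used unbiased estimator (introduced with the Codex evaluation by Chen et al., 2021): with n samples per problem, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether eval_passk.py uses exactly this estimator is an assumption; `pass_at_k` and `mean_pass_at_k` are hypothetical names.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c correct, budget k."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


print(pass_at_k(n=4, c=1, k=1))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.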
If you find this repository or our paper useful for your research, please cite:
@misc{zheng2026freelearningforgetmalloconly,
title={Free(): Learning to Forget in Malloc-Only Reasoning Models},
author={Yilun Zheng and Dongyang Ma and Tian Liang and Jiahao Xu and Xinting Huang and Lijie Chen and Haitao Mi and Yan Wang},
year={2026},
eprint={2602.08030},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.08030},
}