Optimum Intel

🤗 Optimum Intel is the interface between the 🤗 Transformers, Diffusers, Sentence Transformers and timm libraries and the tools and libraries provided by OpenVINO to accelerate end-to-end pipelines on Intel architectures.

OpenVINO is an open-source toolkit that enables high-performance inference on Intel CPUs, GPUs and dedicated DL inference accelerators (see the full list of supported devices). It comes with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your model, convert it to the OpenVINO Intermediate Representation (IR) format and run inference with OpenVINO Runtime.

Installation

To install the latest release of 🤗 Optimum Intel with the corresponding required dependencies, you can use pip as follows:

python -m pip install -U "optimum-intel[openvino]"

Optimum Intel is a fast-moving project with regular additions of new model support, so you may want to install from source with the following command:

python -m pip install "optimum-intel"@git+https://github.com/huggingface/optimum-intel.git

Deprecation Notice: The extras for openvino (e.g., pip install optimum-intel[openvino,nncf]), nncf, neural-compressor and ipex are deprecated and will be removed in a future release.

Export:

To export your model to the OpenVINO IR format, use the optimum-cli tool. Below is an example of exporting the TinyLlama/TinyLlama_v1.1 model:

optimum-cli export openvino --model TinyLlama/TinyLlama_v1.1 ov_TinyLlama_v1_1
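
You can also compress the model weights at export time with the --weight-format option of optimum-cli export openvino. As a minimal sketch (the output directory name is illustrative), the same model exported with 8-bit weight compression:

optimum-cli export openvino --model TinyLlama/TinyLlama_v1.1 --weight-format int8 ov_TinyLlama_v1_1_int8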

To export a model hosted on the Hub, you can use our space. After conversion, a repository will be pushed under your namespace; this repository can be either public or private.

Additional information on exporting models is available in the documentation.
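
Models can also be converted on the fly when loading them with an OVModelForXxx class by passing export=True. A minimal sketch, assuming the same TinyLlama checkpoint as above:

from optimum.intel import OVModelForCausalLM

# Convert the Transformers checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained("TinyLlama/TinyLlama_v1.1", export=True)

# Save the converted model so it can be reloaded without re-exporting
model.save_pretrained("ov_TinyLlama_v1_1")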

Inference:

To load an exported model and run inference using Optimum Intel, use the corresponding OVModelForXxx class instead of AutoModelForXxx:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "ov_TinyLlama_v1_1"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("Hey, how are you doing today?", max_new_tokens=100)

For more details on Optimum Intel inference, refer to the documentation.

Note: Alternatively, an exported model can also be run with the OpenVINO GenAI framework, which provides optimized execution methods for highly performant generative AI.
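
A minimal sketch of this path, assuming the openvino-genai package is installed and that the export step above also produced the OpenVINO tokenizer files in ov_TinyLlama_v1_1:

import openvino_genai

# Load the exported OpenVINO IR model and generate text on CPU
pipe = openvino_genai.LLMPipeline("ov_TinyLlama_v1_1", "CPU")
print(pipe.generate("Hey, how are you doing today?", max_new_tokens=100))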

Quantization:

Post-training static quantization can also be applied. Here is an example of how to apply static quantization to a Whisper model, using the LibriSpeech dataset for the calibration step.

from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

model_id = "openai/whisper-tiny"
q_config = OVQuantizationConfig(dtype="int8", dataset="librispeech", num_samples=50)
q_model = OVModelForSpeechSeq2Seq.from_pretrained(model_id, quantization_config=q_config)

# The directory where the quantized model will be saved
save_dir = "nncf_results"
q_model.save_pretrained(save_dir)
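
The quantized model can then be reloaded from the save directory like any other exported model; a minimal sketch:

from optimum.intel import OVModelForSpeechSeq2Seq

# Reload the statically quantized Whisper model from disk
q_model = OVModelForSpeechSeq2Seq.from_pretrained("nncf_results")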

You can find more information in the documentation.

Running the examples

Check out the notebooks directory to see how 🤗 Optimum Intel can be used to optimize models and accelerate inference.

Do not forget to install the requirements for each example:

cd <example-folder>
pip install -r requirements.txt