TensorRT Execution Provider
With the TensorRT execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic GPU acceleration.
The TensorRT execution provider in the ONNX Runtime makes use of NVIDIA’s TensorRT Deep Learning inferencing engine to accelerate ONNX models on NVIDIA GPUs. Microsoft and NVIDIA worked closely to integrate the TensorRT execution provider with ONNX Runtime.
Install
Pre-built packages and Docker images are available for Jetpack in the Jetson Zoo.
Requirements
ONNX Runtime | TensorRT | CUDA |
---|---|---|
main | 8.5 | 11.6 |
1.14 | 8.5 | 11.6 |
1.12-1.13 | 8.4 | 11.4 |
1.11 | 8.2 | 11.4 |
1.10 | 8.0 | 11.4 |
1.9 | 8.0 | 11.4 |
1.7-1.8 | 7.2 | 11.0.3 |
1.5-1.6 | 7.1 | 10.2 |
1.2-1.4 | 7.0 | 10.1 |
1.0-1.1 | 6.0 | 10.0 |
For more details on CUDA/cuDNN versions, please see CUDA EP requirements.
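The table above can also be consulted programmatically. The following is a minimal sketch, not part of ONNX Runtime: the `REQUIREMENTS` list and the `required_versions` helper are hypothetical names, and the data is copied from the released-version rows of the table (the `main` row is left out because it is not a version number).

```python
# Version pairs copied from the requirements table; each entry covers a
# range of ONNX Runtime 1.x minor versions.
REQUIREMENTS = [
    # (lowest minor, highest minor, TensorRT, CUDA)
    (14, 14, "8.5", "11.6"),
    (12, 13, "8.4", "11.4"),
    (11, 11, "8.2", "11.4"),
    (9, 10, "8.0", "11.4"),
    (7, 8, "7.2", "11.0.3"),
    (5, 6, "7.1", "10.2"),
    (2, 4, "7.0", "10.1"),
    (0, 1, "6.0", "10.0"),
]


def required_versions(ort_version):
    """Return (TensorRT, CUDA) for a version string such as '1.13.1'.

    Returns None when the version is not covered by the table.
    """
    parts = ort_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    if major != 1:
        return None
    for low, high, tensorrt, cuda in REQUIREMENTS:
        if low <= minor <= high:
            return tensorrt, cuda
    return None


print(required_versions("1.13.1"))
```

A version newer than the table, such as `1.15.0`, yields `None` rather than a guess.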
Build
See Build instructions.
The TensorRT execution provider for ONNX Runtime is built and tested with TensorRT 8.5.
Usage
C/C++
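// Create the environment and the session options.
Ort::Env env = Ort::Env{ORT_LOGGING_LEVEL_ERROR, "Default"};
Ort::SessionOptions sf;
int device_id = 0;
// Register TensorRT first so it has the higher priority; CUDA is registered second to run nodes that TensorRT does not support.
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_Tensorrt(sf, device_id));
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_CUDA(sf, device_id));
// model_path is the path to the ONNX model file.
Ort::Session session(env, model_path, sf);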
The C API details are here.
Shape Inference for TensorRT Subgraphs
If some operators in the model are not supported by TensorRT, ONNX Runtime will partition the graph and send only the supported subgraphs to the TensorRT execution provider. Because TensorRT requires that all inputs of a subgraph have their shapes specified, ONNX Runtime will throw an error if input shape information is missing. In that case, first run shape inference on the entire model using the script here (see the Samples section below for an example).
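To make the partitioning idea concrete, here is a deliberately simplified toy in Python. It is not ONNX Runtime's partitioner: it treats the model as a linear sequence of op names rather than a graph, the `SUPPORTED` set is made up for illustration, and `partition` is a hypothetical helper. The `min_subgraph_size` parameter plays a role analogous to ORT_TENSORRT_MIN_SUBGRAPH_SIZE described under Configurations.

```python
from itertools import groupby

# Hypothetical data: op names in execution order and a made-up supported set.
OPS = ["Conv", "Relu", "CustomOp", "Conv", "Add", "Relu"]
SUPPORTED = {"Conv", "Relu", "Add"}


def partition(ops, supported, min_subgraph_size=1):
    """Split a linear op sequence into (provider, ops) runs.

    Contiguous supported ops form a TensorRT candidate subgraph. Runs that
    are unsupported, or shorter than min_subgraph_size, fall back to another
    execution provider. Adjacent fallback runs are merged.
    """
    runs = []
    for is_supported, group in groupby(ops, key=lambda op: op in supported):
        group = list(group)
        use_trt = is_supported and len(group) >= min_subgraph_size
        provider = "TensorRT" if use_trt else "fallback"
        if runs and runs[-1][0] == provider == "fallback":
            runs[-1][1].extend(group)
        else:
            runs.append((provider, group))
    return runs


for provider, group in partition(OPS, SUPPORTED):
    print(provider, group)
```

Raising `min_subgraph_size` to 3 in this toy pushes the leading two-op run to the fallback provider, which is the trade-off the threshold controls: fewer, larger TensorRT subgraphs.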
Python
To use the TensorRT execution provider, you must explicitly register it when instantiating the InferenceSession. It is recommended that you also register CUDAExecutionProvider so that ONNX Runtime can assign nodes that TensorRT does not support to the CUDA execution provider.
import onnxruntime as ort
# set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'] with TensorrtExecutionProvider having the higher priority.
sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])
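Whether TensorRT can be registered depends on how the installed ONNX Runtime package was built. The sketch below filters a preferred provider list against `ort.get_available_providers()` before creating the session. `PREFERRED`, `select_providers`, and `create_session` are hypothetical names introduced here, and adding CPUExecutionProvider as a last resort is an assumption of this example, not a requirement stated above.

```python
# Preferred order: TensorRT first, then CUDA, then CPU as a last resort.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def select_providers(preferred, available):
    """Keep the preferred order but drop providers this build does not offer."""
    return [name for name in preferred if name in available]


def create_session(model_path):
    """Create a session using whichever preferred providers are available."""
    # Imported lazily so select_providers can be used without onnxruntime.
    import onnxruntime as ort

    providers = select_providers(PREFERRED, ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

After creating a session, `sess.get_providers()` reports the providers that were actually registered, which is a quick way to confirm that TensorRT is in use.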
Configurations
There are two ways to configure TensorRT settings, either by environment variables or by execution provider option APIs.
Environment Variables
The following environment variables can be set for the TensorRT execution provider.
- ORT_TENSORRT_MAX_WORKSPACE_SIZE: maximum workspace size for the TensorRT engine. Default value: 1073741824 (1GB).
- ORT_TENSORRT_MAX_PARTITION_ITERATIONS: maximum number of iterations allowed in model partitioning for TensorRT. If the target model can’t be successfully partitioned when the maximum number of iterations is reached, the whole model will fall back to other execution providers such as CUDA or CPU. Default value: 1000.
- ORT_TENSORRT_MIN_SUBGRAPH_SIZE: minimum number of nodes in a subgraph after partitioning. Smaller subgraphs will fall back to other execution providers. Default value: 1.
- ORT_TENSORRT_FP16_ENABLE: Enable FP16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note that not all NVIDIA GPUs support FP16 precision.
- ORT_TENSORRT_INT8_ENABLE: Enable INT8 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note that not all NVIDIA GPUs support INT8 precision.
- ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME: Specify the INT8 calibration table file for non-QDQ models in INT8 mode. Note that a calibration table should not be provided for a QDQ model, because TensorRT doesn’t allow a calibration table to be loaded if there is any Q/DQ node in the model. By default the name is empty.
- ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE: Select which calibration table is used for non-QDQ models in INT8 mode. If 1, the native TensorRT generated calibration table is used; if 0, the ONNX Runtime tool generated calibration table is used. Default value: 0.
  - Note: Please copy the up-to-date calibration table file to ORT_TENSORRT_CACHE_PATH before inference. A calibration table is specific to a model and a calibration data set. Whenever a new calibration table is generated, the old file in the path should be cleaned up or replaced.
- ORT_TENSORRT_DLA_ENABLE: Enable DLA (Deep Learning Accelerator). 1: enabled, 0: disabled. Default value: 0. Note that not all NVIDIA GPUs support DLA.
- ORT_TENSORRT_DLA_CORE: Specify the DLA core to execute on. Default value: 0.
- ORT_TENSORRT_ENGINE_CACHE_ENABLE: Enable TensorRT engine caching. Engine caching saves engine build time in cases where TensorRT takes a long time to optimize and build an engine. The engine is cached when it is built for the first time, so the next time a new inference session is created the engine can be loaded directly from the cache. To validate that the loaded engine is usable for the current inference, the engine profile is also cached and loaded along with the engine. If the current input shapes are within the range of the engine profile, the loaded engine can be safely used. If the input shapes are out of range, the profile cache is updated to cover the new shape and the engine is recreated based on the new profile (and also refreshed in the engine cache). Note that each engine is created for specific settings such as model path/name, precision (FP32/FP16/INT8 etc.), workspace, and profiles, and for a specific GPU, and it is not portable. It is essential to make sure those settings do not change; otherwise the engine needs to be rebuilt and cached again. 1: enabled, 0: disabled. Default value: 0.
  - Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:
    - Model changes (if there are any changes to the model topology, opset version, operators etc.)
    - ORT version changes (e.g. moving from ORT version 1.8 to 1.9)
    - TensorRT version changes (e.g. moving from TensorRT 7.0 to 8.0)
    - Hardware changes. (Engine and profile files are not portable and are optimized for specific NVIDIA hardware)
- ORT_TENSORRT_CACHE_PATH: Specify the path for TensorRT engine and profile files if ORT_TENSORRT_ENGINE_CACHE_ENABLE is 1, or the path for the INT8 calibration table file if ORT_TENSORRT_INT8_ENABLE is 1.
- ORT_TENSORRT_DUMP_SUBGRAPHS: Dumps the subgraphs that are transformed into TRT engines in onnx format to the filesystem. This can help with debugging subgraphs, e.g. by using trtexec --onnx my_model.onnx and checking the outputs of the parser. 1: enabled, 0: disabled. Default value: 0.
- ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD: Sequentially build TensorRT engines across provider instances in a multi-GPU environment. 1: enabled, 0: disabled. Default value: 0.
- ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE: Share execution context memory between TensorRT subgraphs. Default 0 = false, nonzero = true.
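The engine-cache validation described under ORT_TENSORRT_ENGINE_CACHE_ENABLE can be pictured with a small Python model. This is only a sketch of the described behavior, not ONNX Runtime's implementation: it assumes a profile is a list of `(min, max)` pairs, one per input dimension, and `fits_profile`, `widen_profile`, and `load_or_rebuild` are hypothetical names.

```python
def fits_profile(shape, profile):
    """True if every dimension lies within the cached [min, max] range."""
    return len(shape) == len(profile) and all(
        low <= dim <= high for dim, (low, high) in zip(shape, profile)
    )


def widen_profile(shape, profile):
    """Return a profile extended just enough to cover the new shape."""
    return [
        (min(low, dim), max(high, dim)) for dim, (low, high) in zip(shape, profile)
    ]


def load_or_rebuild(shape, profile):
    """Return (action, profile) for a new input shape."""
    if fits_profile(shape, profile):
        return "reuse cached engine", profile
    # Out of range: the profile grows and the engine must be built again.
    return "rebuild engine", widen_profile(shape, profile)


cached = [(1, 4), (3, 3), (224, 224), (224, 224)]
print(load_or_rebuild((2, 3, 224, 224), cached))
print(load_or_rebuild((8, 3, 224, 224), cached))
```

In this model, a batch size of 2 falls inside the cached range and reuses the engine, while a batch size of 8 widens the first dimension and triggers a rebuild, which is why frequently changing input shapes can cancel out the benefit of caching.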
Default values can be overridden by setting environment variables, e.g. on Linux:
# Override default max workspace size to 2GB
export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648
# Override default maximum number of iterations to 10
export ORT_TENSORRT_MAX_PARTITION_ITERATIONS=10
# Override default minimum subgraph node size to 5
export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5
# Enable FP16 mode in TensorRT
export ORT_TENSORRT_FP16_ENABLE=1
# Enable INT8 mode in TensorRT
export ORT_TENSORRT_INT8_ENABLE=1
# Use native TensorRT calibration table
export ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE=1
# Enable TensorRT engine caching
export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
# Please Note warning above. This feature is experimental.
# Engine cache files must be invalidated if there are any changes to the model, ORT version, TensorRT version or if the underlying hardware changes. Engine files are not portable across devices.
# Specify TensorRT cache path
export ORT_TENSORRT_CACHE_PATH="/path/to/cache"
# Dump out subgraphs to run on TensorRT
export ORT_TENSORRT_DUMP_SUBGRAPHS=1
# Enable context memory sharing between TensorRT subgraphs. Default 0 = false, nonzero = true
export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1
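The same variables can be set from Python through `os.environ` instead of the shell. This is a minimal sketch that assumes the variables are read when the execution provider is created, so they are set before any session exists. `GIB` and `trt_env` are hypothetical names, and only a few of the variables are covered.

```python
import os

GIB = 1 << 30  # 1073741824 bytes, the documented default workspace size


def trt_env(workspace_gib=1, fp16=False, cache_path=None):
    """Build TensorRT environment variable settings as a dict of strings."""
    env = {
        "ORT_TENSORRT_MAX_WORKSPACE_SIZE": str(workspace_gib * GIB),
        "ORT_TENSORRT_FP16_ENABLE": "1" if fp16 else "0",
    }
    if cache_path is not None:
        # Engine caching needs both the enable flag and a cache location.
        env["ORT_TENSORRT_ENGINE_CACHE_ENABLE"] = "1"
        env["ORT_TENSORRT_CACHE_PATH"] = cache_path
    return env


# Apply the settings to the current process before creating any session.
os.environ.update(trt_env(workspace_gib=2, fp16=True, cache_path="/path/to/cache"))
```

Environment variable values are always strings, so the integer workspace size is converted explicitly.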
Execution Provider Options
TensorRT configurations can also be set through the execution provider option APIs. This is useful when each model and inference session has its own configuration. In this case, execution provider option settings override any environment variable settings. Set all configurations explicitly; any option left unset takes its default value.
Environment variables map one-to-one to execution provider options, as shown below:
Note: for bool type options, assign them with True/False in python, or 1/0 in C++.
environment variables | execution provider option APIs | type |
---|---|---|
ORT_TENSORRT_MAX_WORKSPACE_SIZE | trt_max_workspace_size | int |
ORT_TENSORRT_MAX_PARTITION_ITERATIONS | trt_max_partition_iterations | int |
ORT_TENSORRT_MIN_SUBGRAPH_SIZE | trt_min_subgraph_size | int |
ORT_TENSORRT_FP16_ENABLE | trt_fp16_enable | bool |
ORT_TENSORRT_INT8_ENABLE | trt_int8_enable | bool |
ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME | trt_int8_calibration_table_name | string |
ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE | trt_int8_use_native_calibration_table | bool |
ORT_TENSORRT_DLA_ENABLE | trt_dla_enable | bool |
ORT_TENSORRT_DLA_CORE | trt_dla_core | int |
ORT_TENSORRT_ENGINE_CACHE_ENABLE | trt_engine_cache_enable | bool |
ORT_TENSORRT_CACHE_PATH | trt_engine_cache_path | string |
ORT_TENSORRT_DUMP_SUBGRAPHS | trt_dump_subgraphs | bool |
ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD | trt_force_sequential_engine_build | bool |
ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE | trt_context_memory_sharing_enable | bool |
In addition, device_id can also be set as an execution provider option.
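Because the mapping is one-to-one, existing environment variable settings can be translated into a provider options dictionary mechanically. The sketch below copies the names and types from the table above; `ENV_TO_OPTION` and `options_from_env` are hypothetical helpers, and treating any nonzero integer as true for bool options is an assumption of this example.

```python
# Option name and type for each environment variable, copied from the table.
ENV_TO_OPTION = {
    "ORT_TENSORRT_MAX_WORKSPACE_SIZE": ("trt_max_workspace_size", int),
    "ORT_TENSORRT_MAX_PARTITION_ITERATIONS": ("trt_max_partition_iterations", int),
    "ORT_TENSORRT_MIN_SUBGRAPH_SIZE": ("trt_min_subgraph_size", int),
    "ORT_TENSORRT_FP16_ENABLE": ("trt_fp16_enable", bool),
    "ORT_TENSORRT_INT8_ENABLE": ("trt_int8_enable", bool),
    "ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME": ("trt_int8_calibration_table_name", str),
    "ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE": ("trt_int8_use_native_calibration_table", bool),
    "ORT_TENSORRT_DLA_ENABLE": ("trt_dla_enable", bool),
    "ORT_TENSORRT_DLA_CORE": ("trt_dla_core", int),
    "ORT_TENSORRT_ENGINE_CACHE_ENABLE": ("trt_engine_cache_enable", bool),
    "ORT_TENSORRT_CACHE_PATH": ("trt_engine_cache_path", str),
    "ORT_TENSORRT_DUMP_SUBGRAPHS": ("trt_dump_subgraphs", bool),
    "ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD": ("trt_force_sequential_engine_build", bool),
    "ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE": ("trt_context_memory_sharing_enable", bool),
}


def options_from_env(env):
    """Translate TensorRT environment variables into provider options."""
    options = {}
    for variable, (name, kind) in ENV_TO_OPTION.items():
        if variable not in env:
            continue  # unset variables are left to their defaults
        raw = env[variable]
        if kind is bool:
            options[name] = int(raw) != 0  # assumption: nonzero means True
        elif kind is int:
            options[name] = int(raw)
        else:
            options[name] = raw
    return options
```

Passing `os.environ` as `env` yields a dictionary that can be used as the options part of a `('TensorrtExecutionProvider', {...})` tuple in Python, with `device_id` added separately.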
C++ API example
Ort::SessionOptions session_options;
OrtTensorRTProviderOptions trt_options{};
// note: for bool type options in c++ API, set them as 0/1
trt_options.device_id = 1;
trt_options.trt_max_workspace_size = 2147483648;
trt_options.trt_max_partition_iterations = 10;
trt_options.trt_min_subgraph_size = 5;
trt_options.trt_fp16_enable = 1;
trt_options.trt_int8_enable = 1;
trt_options.trt_int8_use_native_calibration_table = 1;
trt_options.trt_engine_cache_enable = 1;
trt_options.trt_engine_cache_path = "/path/to/cache";
trt_options.trt_dump_subgraphs = 1;
session_options.AppendExecutionProvider_TensorRT(trt_options);
Python API example
import onnxruntime as ort
model_path = '<path to model>'
# note: for bool type options in python API, set them as False/True
providers = [
    ('TensorrtExecutionProvider', {
        'device_id': 0,
        'trt_max_workspace_size': 2147483648,
        'trt_fp16_enable': True,
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
        'do_copy_in_default_stream': True,
    })
]
sess_opt = ort.SessionOptions()
sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers)
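The engine cache warning under Environment Variables lists four things that invalidate cached engines: the model, the ORT version, the TensorRT version, and the hardware. One possible way to avoid loading a stale engine, shown here only as a sketch, is to derive the cache directory from a fingerprint of those four inputs so that any change lands in a fresh directory. `cache_dir` and `trt_provider` are hypothetical helpers, and the caller is assumed to supply the version strings and GPU name.

```python
import hashlib
import os


def cache_dir(root, model_bytes, ort_version, trt_version, gpu_name):
    """Derive a cache directory that changes whenever any input changes."""
    digest = hashlib.sha256()
    parts = (model_bytes, ort_version.encode(), trt_version.encode(), gpu_name.encode())
    for part in parts:
        # A length prefix keeps adjacent fields from running together.
        digest.update(len(part).to_bytes(8, "little"))
        digest.update(part)
    return os.path.join(root, digest.hexdigest()[:16])


def trt_provider(root, model_bytes, ort_version, trt_version, gpu_name):
    """Return a provider tuple with engine caching pointed at that directory."""
    path = cache_dir(root, model_bytes, ort_version, trt_version, gpu_name)
    return (
        "TensorrtExecutionProvider",
        {"trt_engine_cache_enable": True, "trt_engine_cache_path": path},
    )
```

The returned tuple can take the place of the TensorRT entry in a `providers` list. Old directories still need to be removed eventually; this approach only prevents a mismatched engine from being picked up.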
Performance Tuning
For performance tuning, please see guidance on this page: ONNX Runtime Perf Tuning
When using onnxruntime_perf_test, use the flag -e tensorrt. See the sample below.
Samples
This example shows how to run the Faster R-CNN model on TensorRT execution provider.
- Download the Faster R-CNN onnx model from the ONNX model zoo here.
- Infer shapes in the model by running the shape inference script:
  python symbolic_shape_infer.py --input /path/to/onnx/model/model.onnx --output /path/to/onnx/model/new_model.onnx --auto_merge
- Replace the original model with the new model and run the onnx_test_runner tool under the ONNX Runtime build directory:
  ./onnx_test_runner -e tensorrt /path/to/onnx/model/
- Run onnxruntime_perf_test on your shape-inferred Faster-RCNN model. Download the sample test data with the model from the model zoo, and put the test_data_set folder next to your inferred model:
  # e.g.
  # -r: set up test repeat time
  # -e: set up execution provider
  # -i: set up params for execution provider options
  ./onnxruntime_perf_test -r 1 -e tensorrt -i "trt_fp16_enable|true" /path/to/onnx/your_inferred_model.onnx
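When the perf test is driven from a script, the command line can be assembled in Python. The sketch below reproduces the invocation shown above. `perf_test_command` is a hypothetical helper, and joining several `key|value` pairs with spaces inside the `-i` argument is an assumption here, since the sample only shows a single pair.

```python
import shlex


def perf_test_command(model_path, options, repeat=1, binary="./onnxruntime_perf_test"):
    """Build the argument list for an onnxruntime_perf_test run on TensorRT."""
    # Assumption: multiple provider options are space-separated key|value pairs.
    pairs = " ".join(f"{key}|{value}" for key, value in options.items())
    return [binary, "-r", str(repeat), "-e", "tensorrt", "-i", pairs, model_path]


command = perf_test_command(
    "/path/to/onnx/your_inferred_model.onnx",
    {"trt_fp16_enable": "true"},
)
# shlex.join quotes the -i argument so the shell does not treat | as a pipe.
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting issues altogether.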
Please see this Notebook for an example of running a model on GPU using ONNX Runtime through Azure Machine Learning Services.