Study Notes (05): Intel® OpenVINO™ Toolkit Intermediate Course -- (Chapter 5) Performance Evaluation and Hardware Selection


Course link: https://edu.csdn.net/course/play/28807/427189?utm_source=blogtoedu

Performance Evaluation and Hardware Selection

Performance metrics:
- Throughput: FPS (frames per second), InferPS (inferences per second)
- Latency
- Efficiency: frame/sec/watt, frame/sec/$

Factors that affect performance:
- Neural-network topology and parameters: e.g. ResNet-50 vs. SqueezeNet
- Target device architecture: CPU, GPU, FPGA, AI accelerators
- Precision (data format: FP16, FP32); Xeon AVX-512, VNNI, DL Boost (up to 3x)
- Batching
- Synchronous vs. asynchronous execution (sync, async)
- CPU throughput mode: streams, threads, number of infer requests (#ireq)

Using OpenVINO to evaluate performance (a minimal sketch of the metric arithmetic follows below).
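To make these metrics concrete, here is a minimal Python sketch of how throughput, latency, and the two efficiency figures can be derived from a timed loop. The stub inference function, the 65 W power draw, and the $400 device price are all hypothetical placeholders, not values from the course:

import time

NUM_FRAMES = 100
POWER_WATTS = 65.0        # hypothetical device power draw
DEVICE_COST_USD = 400.0   # hypothetical device price

def run_inference_on_one_frame():
    # placeholder: substitute a real inference call here
    time.sleep(0.01)

start = time.time()
for _ in range(NUM_FRAMES):
    run_inference_on_one_frame()
elapsed = time.time() - start

throughput = NUM_FRAMES / elapsed            # FPS
latency_ms = elapsed / NUM_FRAMES * 1000.0   # average ms per frame (synchronous case)
print('Throughput: %.2f FPS' % throughput)
print('Latency:    %.2f ms' % latency_ms)
print('Efficiency: %.3f frame/sec/watt, %.4f frame/sec/$' %
      (throughput / POWER_WATTS, throughput / DEVICE_COST_USD))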

OpenVINO performance-evaluation example

Recording time with the time() function

Add timestamps before and after the inference call with time.time(), subtract the two to get the elapsed time, and print it to the screen with the print function (or sys.stdout.write).

import sys
import time

start_time = time.time()
exec_net.start_async(request_id=next_request_id, inputs=feed_dict)  # submit the request
end_time = time.time()
exetime = end_time - start_time
sys.stdout.write('%.5s' % exetime)  # print the first five characters of the elapsed time

python3 add-perf-object-detection.py

Frame  0       0.002
Frame  1       0.0063     3     3     3     3     3     3     3     3     3     3     3
Frame  2       0.0011     3     3     3     3     3     3     3     3     3     3     3     3
Frame  3       0.0013     3     3     3     3     3     3     3     3     3     3
Frame  4       0.0011     3     3     3     3     3     3     3     3     3
Frame  5       0.0011     3     3     3     3     3     3     3     3     3     3     3     3     3
Frame  6       0.0003     3     3     3     3     3     3     3     3     3     3     3
Frame  7       0.0011     3     3     3     3     3     3     3     3     3     3     3     3     3     3
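The per-frame times above are tiny because start_async() returns as soon as the request is queued: the two timestamps bracket only the submission call, not the inference itself. A minimal sketch that times a complete inference (assuming the same exec_net and feed_dict as in the course script, and request id 0) waits on the request before taking the second timestamp:

start_time = time.time()
exec_net.start_async(request_id=0, inputs=feed_dict)
exec_net.requests[0].wait(-1)  # block until this request has actually finished
end_time = time.time()
print('Full inference time: %.5f s' % (end_time - start_time))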

Testing with benchmark_app.py

First, benchmark the ssd-mobilenet model to get its performance figures on the local device.

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20
[Step 1/11] Parsing and validating input arguments
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.3.0-3467-15f2c61a-releases/2020/3
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.3.0-3467-15f2c61a-releases/2020/3

[Step 3/11] Reading the Intermediate Representation network
[ INFO ] Read network took 1177.84 ms
[Step 4/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 5/11] Configuring input of the model
[Step 6/11] Setting device configuration
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 613.86 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'image_tensor' precision U8, dimensions (NCHW): 1 3 300 300
[ WARNING ] Some image input files will be ignored: only 1 files are required from 3
[ INFO ] Infer Request 0 filling
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/boy-computer.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((300, 300))
[Step 10/11] Measuring performance (Start inference asyncronously, 1 inference requests using 1 streams for CPU, limits: 20000 ms duration)
[Step 11/11] Dumping statistics report
Count:      633 iterations
Duration:   20041.34 ms
Latency:    29.83 ms
Throughput: 31.58 FPS
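The reported figures are self-consistent: throughput is simply the iteration count divided by the wall-clock duration. A quick check in Python:

# 633 iterations in 20041.34 ms with batch size 1:
print(633 / (20041.34 / 1000.0))  # ~31.58, matching the reported FPS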

Applying synchronous and asynchronous inference to the SSD model. Synchronous inference (single-threaded):

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -api sync

Now, with the same parameters, use asynchronous inference:

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -api async
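At the API level, the difference between the two modes is whether each call blocks. Below is a minimal sketch using the Inference Engine Python API from the 2020.3 release shown in the logs above; the .bin weights path, the dummy zero-filled input, and the choice of two infer requests are illustrative assumptions, not course code:

from openvino.inference_engine import IECore
import numpy as np

ie = IECore()
net = ie.read_network(model='models/ssd-mobilenet.xml',
                      weights='models/ssd-mobilenet.bin')
exec_net = ie.load_network(network=net, device_name='CPU', num_requests=2)

input_name = next(iter(net.input_info))
frame = np.zeros((1, 3, 300, 300), dtype=np.float32)  # dummy input matching the 300x300 SSD input

# Synchronous API: each call blocks until its result is ready.
result = exec_net.infer(inputs={input_name: frame})

# Asynchronous API: submit several requests so they overlap, then wait.
exec_net.start_async(request_id=0, inputs={input_name: frame})
exec_net.start_async(request_id=1, inputs={input_name: frame})
exec_net.requests[0].wait(-1)
exec_net.requests[1].wait(-1)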

Challenge tasks:

Try running benchmark_app with different parameters.

Example: python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -nstreams 16 -nireq 8

Task 1: benchmark the ssd-mobilenet model with 2 infer requests and 2 streams:

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -nstreams 2 -nireq 2

Task 2: benchmark the resnet-50 model with batch size 4 and asynchronous inference, enabling the perf_counts and progress options so that per-layer counters and a progress bar are displayed.

python3 benchmark_app.py -m models/resnet-50.xml -i images/ -t 20  -b 4 -api async -pc -progress

[Step 1/11] Parsing and validating input arguments

[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.3.0-3467-15f2c61a-releases/2020/3
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.3.0-3467-15f2c61a-releases/2020/3

[Step 3/11] Reading the Intermediate Representation network
[ INFO ] Read network took 767.85 ms
[Step 4/11] Resizing network to match image sizes and given batch
[ INFO ] Resizing network to batch = 4
[ INFO ] Network batch size: 4
[Step 5/11] Configuring input of the model
[Step 6/11] Setting device configuration
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 554.24 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'data' precision U8, dimensions (NCHW): 4 3 224 224
[ WARNING ] Some image input files will be duplicated: 4 files are required, but only 3 were provided
[ INFO ] Infer Request 0 filling
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/boy-computer.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((224, 224))
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/burger.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((224, 224))
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/car.png
[ WARNING ] Image is resized from ((259, 787)) to ((224, 224))
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/boy-computer.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((224, 224))
[Step 10/11] Measuring performance (Start inference asyncronously, 1 inference requests using 1 streams for CPU, limits: 20000 ms duration)
Progress: |................................| 100%

[Step 11/11] Dumping statistics report
[ INFO ] Performance counts for 0-th infer request
data_U8_FP32_conv1/Fused_Add_ EXECUTED       layerType: Reorder            realTime: 360       cpu: 360            execType: jit_uni_I8
conv1/Fused_Add_              EXECUTED       layerType: Convolution        realTime: 6152      cpu: 6152           execType: jit_avx512_FP32
conv1_relu                    NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0              execType: undef
pool1                         EXECUTED       layerType: Pooling            realTime: 1102      cpu: 1102           execType: jit_avx512_FP32
bn2a_branch2a/variance/Fus... EXECUTED       layerType: Convolution        realTime: 732       cpu: 732            execType: jit_avx512_1x1_FP32
............
res5c                         NOT_RUN        layerType: Eltwise            realTime: 0         cpu: 0              execType: undef
res5c_relu                    NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0              execType: undef
pool5                         EXECUTED       layerType: Pooling            realTime: 93        cpu: 93             execType: jit_avx512_FP32
fc1000                        EXECUTED       layerType: FullyConnected     realTime: 3242      cpu: 3242           execType: jit_gemm_FP32
prob                          EXECUTED       layerType: SoftMax            realTime: 43        cpu: 43             execType: jit_avx512_FP32
out_prob                      EXECUTED       layerType: Output             realTime: 1         cpu: 1              execType: unknown_FP32
Total time:     178105 microseconds
Total CPU time: 178105 microseconds

Count:      114 iterations
Duration:   20252.82 ms
Latency:    168.09 ms
Throughput: 22.52 FPS
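Once batch size is taken into account, these figures are consistent too: each iteration processes 4 images, so throughput is iterations × batch / duration:

# 114 iterations × batch 4 = 456 images in 20252.82 ms:
print(114 * 4 / (20252.82 / 1000.0))  # ~22.52, matching the reported FPS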

benchmark_app usage help

Run python3 benchmark_app.py -h to print the tool's help text.

usage: benchmark_app.py [-h [HELP]] [-i PATHS_TO_INPUT [PATHS_TO_INPUT ...]]
                        -m PATH_TO_MODEL [-d TARGET_DEVICE]
                        [-l PATH_TO_EXTENSION] [-c PATH_TO_CLDNN_CONFIG]
                        [-api {sync,async}] [-niter NUMBER_ITERATIONS]
                        [-nireq NUMBER_INFER_REQUESTS] [-b BATCH_SIZE]
                        [-stream_output [STREAM_OUTPUT]] [-t TIME]
                        [-progress [PROGRESS]] [-nstreams NUMBER_STREAMS]
                        [-nthreads NUMBER_THREADS] [-pin {YES,NO,NUMA}]
                        [--exec_graph_path EXEC_GRAPH_PATH]
                        [-pc [PERF_COUNTS]]
                        [--report_type {no_counters,average_counters,detailed_counters}]
                        [--report_folder REPORT_FOLDER]

Options:
  -h [HELP], --help [HELP]
                        Show this help message and exit.
  -i PATHS_TO_INPUT [PATHS_TO_INPUT ...], --paths_to_input PATHS_TO_INPUT [PATHS_TO_INPUT ...]
                        Optional. Path to a folder with images and/or binaries
                        or to specific image or binary file.
  -m PATH_TO_MODEL, --path_to_model PATH_TO_MODEL
                        Required. Path to an .xml file with a trained model.
  -d TARGET_DEVICE, --target_device TARGET_DEVICE
                        Optional. Specify a target device to infer on (the
                        list of available devices is shown below). Default
                        value is CPU. Use '-d HETERO:<comma separated devices
                        list>' format to specify HETERO plugin. Use '-d
                        MULTI:<comma separated devices list>' format to
                        specify MULTI plugin. The application looks for a
                        suitable plugin for the specified device.
  -l PATH_TO_EXTENSION, --path_to_extension PATH_TO_EXTENSION
                        Optional. Required for CPU custom layers. Absolute
                        path to a shared library with the kernels
                        implementations.
  -c PATH_TO_CLDNN_CONFIG, --path_to_cldnn_config PATH_TO_CLDNN_CONFIG
                        Optional. Required for GPU custom kernels. Absolute
                        path to an .xml file with the kernels description.
  -api {sync,async}, --api_type {sync,async}
                        Optional. Enable using sync/async API. Default value
                        is async.
  -niter NUMBER_ITERATIONS, --number_iterations NUMBER_ITERATIONS
                        Optional. Number of iterations. If not specified, the
                        number of iterations is calculated depending on a
                        device.
  -nireq NUMBER_INFER_REQUESTS, --number_infer_requests NUMBER_INFER_REQUESTS
                        Optional. Number of infer requests. Default value is
                        determined automatically for device.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Optional. Batch size value. If not specified, the
                        batch size value is determined from Intermediate
                        Representation
  -stream_output [STREAM_OUTPUT]
                        Optional. Print progress as a plain text. When
                        specified, an interactive progress bar is replaced
                        with a multi-line output.
  -t TIME, --time TIME  Optional. Time in seconds to execute topology.
  -progress [PROGRESS]  Optional. Show progress bar (can affect performance
                        measurement). Default values is 'False'.
  -nstreams NUMBER_STREAMS, --number_streams NUMBER_STREAMS
                        Optional. Number of streams to use for inference on
                        the CPU/GPU in throughput mode (for HETERO and MULTI
                        device cases use format
                        <device1>:<nstreams1>,<device2>:<nstreams2> or just
                        <nstreams>). Default value is determined automatically
                        for a device. Please note that although the automatic
                        selection usually provides a reasonable performance,
                        it still may be non - optimal for some cases,
                        especially for very small networks. See samples README
                        for more details.
  -nthreads NUMBER_THREADS, --number_threads NUMBER_THREADS
                        Number of threads to use for inference on the CPU
                        (including HETERO and MULTI cases).
  -pin {YES,NO,NUMA}, --infer_threads_pinning {YES,NO,NUMA}
                        Optional. Enable threads->cores ('YES' is default
                        value), threads->(NUMA)nodes ('NUMA') or completely
                        disable ('NO')CPU threads pinning for CPU-involved
                        inference.
  --exec_graph_path EXEC_GRAPH_PATH
                        Optional. Path to a file where to store executable
                        graph information serialized.
  -pc [PERF_COUNTS], --perf_counts [PERF_COUNTS]
                        Optional. Report performance counters.
  --report_type {no_counters,average_counters,detailed_counters}
                        Optional. Enable collecting statistics report.
                        "no_counters" report contains configuration options
                        specified, resulting FPS and latency.
                        "average_counters" report extends "no_counters" report
                        and additionally includes average PM counters values
                        for each layer from the network. "detailed_counters"
                        report extends "average_counters" report and
                        additionally includes per-layer PM counters and
                        latency for each executed infer request.
  --report_folder REPORT_FOLDER
                        Optional. Path to a folder where statistics report is
                        stored.
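As a usage example of the report options above, a command like the following (the reports/ folder name is an arbitrary choice) would write average per-layer counters to a statistics report file instead of dumping them to the console:

python3 benchmark_app.py -m models/resnet-50.xml -i images/ -t 20 -b 4 --report_type average_counters --report_folder reports/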

 
