Study Notes (05): Intel® OpenVINO™ Toolkit Intermediate Course -- (Chapter 5) Performance Evaluation and Hardware Selection


Course link: https://edu.csdn.net/course/play/28807/427189?utm_source=blogtoedu

Performance Evaluation and Hardware Selection

Performance metrics:
- Throughput: FPS (frames per second), InferPS (inferences per second)
- Latency
- Efficiency: frame/sec/watt, frame/sec/$

Factors that affect performance:
- Neural-network topology and parameters: e.g. ResNet-50 vs. SqueezeNet
- Target device architecture: CPU, GPU, FPGA, AI accelerators
- Precision (data format: FP16, FP32); Xeon AVX-512, VNNI, DL Boost (up to 3x)
- Batching
- Synchronous vs. asynchronous execution (sync, async)
- CPU throughput mode: streams, threads, number of infer requests (#ireq)

Using OpenVINO to evaluate performance (a minimal sketch of the metric arithmetic follows below).
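To make these metrics concrete, here is a minimal Python sketch of how throughput, latency, and the two efficiency figures can be derived from a timed loop. The stub inference function, the 65 W power draw, and the $400 device price are all hypothetical placeholders, not values from the course:

import time

NUM_FRAMES = 100
POWER_WATTS = 65.0        # hypothetical device power draw
DEVICE_COST_USD = 400.0   # hypothetical device price

def run_inference_on_one_frame():
    # placeholder: substitute a real inference call here
    time.sleep(0.01)

start = time.time()
for _ in range(NUM_FRAMES):
    run_inference_on_one_frame()
elapsed = time.time() - start

throughput = NUM_FRAMES / elapsed            # FPS
latency_ms = elapsed / NUM_FRAMES * 1000.0   # average ms per frame (synchronous case)
print('Throughput: %.2f FPS' % throughput)
print('Latency:    %.2f ms' % latency_ms)
print('Efficiency: %.3f frame/sec/watt, %.4f frame/sec/$' %
      (throughput / POWER_WATTS, throughput / DEVICE_COST_USD))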

OpenVINO performance-evaluation example

Recording time with the time() function

Add timestamps before and after the inference call with time.time(), subtract the two to get the elapsed time, and print it to the screen with the print function (or sys.stdout.write).

import sys
import time

start_time = time.time()
exec_net.start_async(request_id=next_request_id, inputs=feed_dict)  # submit the request
end_time = time.time()
exetime = end_time - start_time
sys.stdout.write('%.5s' % exetime)  # print the first five characters of the elapsed time

python3 add-perf-object-detection.py

Frame  0       0.002
Frame  1       0.0063     3     3     3     3     3     3     3     3     3     3     3
Frame  2       0.0011     3     3     3     3     3     3     3     3     3     3     3     3
Frame  3       0.0013     3     3     3     3     3     3     3     3     3     3
Frame  4       0.0011     3     3     3     3     3     3     3     3     3
Frame  5       0.0011     3     3     3     3     3     3     3     3     3     3     3     3     3
Frame  6       0.0003     3     3     3     3     3     3     3     3     3     3     3
Frame  7       0.0011     3     3     3     3     3     3     3     3     3     3     3     3     3     3
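The per-frame times above are tiny because start_async() returns as soon as the request is queued: the two timestamps bracket only the submission call, not the inference itself. A minimal sketch that times a complete inference (assuming the same exec_net and feed_dict as in the course script, and request id 0) waits on the request before taking the second timestamp:

start_time = time.time()
exec_net.start_async(request_id=0, inputs=feed_dict)
exec_net.requests[0].wait(-1)  # block until this request has actually finished
end_time = time.time()
print('Full inference time: %.5f s' % (end_time - start_time))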

Testing with benchmark_app.py

First, benchmark the ssd-mobilenet model to get its performance figures on the local device.

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20
[Step 1/11] Parsing and validating input arguments
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.3.0-3467-15f2c61a-releases/2020/3
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.3.0-3467-15f2c61a-releases/2020/3

[Step 3/11] Reading the Intermediate Representation network
[ INFO ] Read network took 1177.84 ms
[Step 4/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 5/11] Configuring input of the model
[Step 6/11] Setting device configuration
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 613.86 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'image_tensor' precision U8, dimensions (NCHW): 1 3 300 300
[ WARNING ] Some image input files will be ignored: only 1 files are required from 3
[ INFO ] Infer Request 0 filling
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/boy-computer.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((300, 300))
[Step 10/11] Measuring performance (Start inference asyncronously, 1 inference requests using 1 streams for CPU, limits: 20000 ms duration)
[Step 11/11] Dumping statistics report
Count:      633 iterations
Duration:   20041.34 ms
Latency:    29.83 ms
Throughput: 31.58 FPS
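The reported figures are self-consistent: throughput is simply the iteration count divided by the wall-clock duration. A quick check in Python:

# 633 iterations in 20041.34 ms with batch size 1:
print(633 / (20041.34 / 1000.0))  # ~31.58, matching the reported FPS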

Applying synchronous and asynchronous inference to the SSD model. Synchronous inference (single-threaded):

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -api sync

Now, with the same parameters, use asynchronous inference:

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -api async
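At the API level, the difference between the two modes is whether each call blocks. Below is a minimal sketch using the Inference Engine Python API from the 2020.3 release shown in the logs above; the .bin weights path, the dummy zero-filled input, and the choice of two infer requests are illustrative assumptions, not course code:

from openvino.inference_engine import IECore
import numpy as np

ie = IECore()
net = ie.read_network(model='models/ssd-mobilenet.xml',
                      weights='models/ssd-mobilenet.bin')
exec_net = ie.load_network(network=net, device_name='CPU', num_requests=2)

input_name = next(iter(net.input_info))
frame = np.zeros((1, 3, 300, 300), dtype=np.float32)  # dummy input matching the 300x300 SSD input

# Synchronous API: each call blocks until its result is ready.
result = exec_net.infer(inputs={input_name: frame})

# Asynchronous API: submit several requests so they overlap, then wait.
exec_net.start_async(request_id=0, inputs={input_name: frame})
exec_net.start_async(request_id=1, inputs={input_name: frame})
exec_net.requests[0].wait(-1)
exec_net.requests[1].wait(-1)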

Challenge tasks:

Try running benchmark_app with different parameters.

Example: python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -nstreams 16 -nireq 8

Task 1: benchmark the ssd-mobilenet model with 2 infer requests and 2 streams:

python3 benchmark_app.py -m models/ssd-mobilenet.xml -i images/ -t 20 -nstreams 2 -nireq 2

Task 2: benchmark the resnet-50 model with batch size 4 and asynchronous inference, enabling the perf_counts and progress options so that per-layer counters and a progress bar are displayed.

python3 benchmark_app.py -m models/resnet-50.xml -i images/ -t 20  -b 4 -api async -pc -progress

[Step 1/11] Parsing and validating input arguments

[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.3.0-3467-15f2c61a-releases/2020/3
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.3.0-3467-15f2c61a-releases/2020/3

[Step 3/11] Reading the Intermediate Representation network
[ INFO ] Read network took 767.85 ms
[Step 4/11] Resizing network to match image sizes and given batch
[ INFO ] Resizing network to batch = 4
[ INFO ] Network batch size: 4
[Step 5/11] Configuring input of the model
[Step 6/11] Setting device configuration
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 554.24 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'data' precision U8, dimensions (NCHW): 4 3 224 224
[ WARNING ] Some image input files will be duplicated: 4 files are required, but only 3 were provided
[ INFO ] Infer Request 0 filling
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/boy-computer.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((224, 224))
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/burger.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((224, 224))
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/car.png
[ WARNING ] Image is resized from ((259, 787)) to ((224, 224))
[ INFO ] Prepare image /home/dc2-user/200/05/exercise-2/images/boy-computer.jpg
[ WARNING ] Image is resized from ((1080, 1920)) to ((224, 224))
[Step 10/11] Measuring performance (Start inference asyncronously, 1 inference requests using 1 streams for CPU, limits: 20000 ms duration)
Progress: |................................| 100%

[Step 11/11] Dumping statistics report
[ INFO ] Performance counts for 0-th infer request
data_U8_FP32_conv1/Fused_Add_ EXECUTED       layerType: Reorder            realTime: 360       cpu: 360            execType: jit_uni_I8
conv1/Fused_Add_              EXECUTED       layerType: Convolution        realTime: 6152      cpu: 6152           execType: jit_avx512_FP32
conv1_relu                    NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0              execType: undef
pool1                         EXECUTED       layerType: Pooling            realTime: 1102      cpu: 1102           execType: jit_avx512_FP32
bn2a_branch2a/variance/Fus... EXECUTED       layerType: Convolution        realTime: 732       cpu: 732            execType: jit_avx512_1x1_FP32
............
res5c                         NOT_RUN        layerType: Eltwise            realTime: 0         cpu: 0              execType: undef
res5c_relu                    NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0              execType: undef
pool5                         EXECUTED       layerType: Pooling            realTime: 93        cpu: 93             execType: jit_avx512_FP32
fc1000                        EXECUTED       layerType: FullyConnected     realTime: 3242      cpu: 3242           execType: jit_gemm_FP32
prob                          EXECUTED       layerType: SoftMax            realTime: 43        cpu: 43             execType: jit_avx512_FP32
out_prob                      EXECUTED       layerType: Output             realTime: 1         cpu: 1              execType: unknown_FP32
Total time:     178105 microseconds
Total CPU time: 178105 microseconds

Count:      114 iterations
Duration:   20252.82 ms
Latency:    168.09 ms
Throughput: 22.52 FPS
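Once batch size is taken into account, these figures are consistent too: each iteration processes 4 images, so throughput is iterations × batch / duration:

# 114 iterations × batch 4 = 456 images in 20252.82 ms:
print(114 * 4 / (20252.82 / 1000.0))  # ~22.52, matching the reported FPS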

benchmark_app usage help

Run python3 benchmark_app.py -h to print the tool's help text.

usage: benchmark_app.py [-h [HELP]] [-i PATHS_TO_INPUT [PATHS_TO_INPUT ...]]
                        -m PATH_TO_MODEL [-d TARGET_DEVICE]
                        [-l PATH_TO_EXTENSION] [-c PATH_TO_CLDNN_CONFIG]
                        [-api {sync,async}] [-niter NUMBER_ITERATIONS]
                        [-nireq NUMBER_INFER_REQUESTS] [-b BATCH_SIZE]
                        [-stream_output [STREAM_OUTPUT]] [-t TIME]
                        [-progress [PROGRESS]] [-nstreams NUMBER_STREAMS]
                        [-nthreads NUMBER_THREADS] [-pin {YES,NO,NUMA}]
                        [--exec_graph_path EXEC_GRAPH_PATH]
                        [-pc [PERF_COUNTS]]
                        [--report_type {no_counters,average_counters,detailed_counters}]
                        [--report_folder REPORT_FOLDER]

Options:
  -h [HELP], --help [HELP]
                        Show this help message and exit.
  -i PATHS_TO_INPUT [PATHS_TO_INPUT ...], --paths_to_input PATHS_TO_INPUT [PATHS_TO_INPUT ...]
                        Optional. Path to a folder with images and/or binaries
                        or to specific image or binary file.
  -m PATH_TO_MODEL, --path_to_model PATH_TO_MODEL
                        Required. Path to an .xml file with a trained model.
  -d TARGET_DEVICE, --target_device TARGET_DEVICE
                        Optional. Specify a target device to infer on (the
                        list of available devices is shown below). Default
                        value is CPU. Use '-d HETERO:<comma separated devices
                        list>' format to specify HETERO plugin. Use '-d
                        MULTI:<comma separated devices list>' format to
                        specify MULTI plugin. The application looks for a
                        suitable plugin for the specified device.
  -l PATH_TO_EXTENSION, --path_to_extension PATH_TO_EXTENSION
                        Optional. Required for CPU custom layers. Absolute
                        path to a shared library with the kernels
                        implementations.
  -c PATH_TO_CLDNN_CONFIG, --path_to_cldnn_config PATH_TO_CLDNN_CONFIG
                        Optional. Required for GPU custom kernels. Absolute
                        path to an .xml file with the kernels description.
  -api {sync,async}, --api_type {sync,async}
                        Optional. Enable using sync/async API. Default value
                        is async.
  -niter NUMBER_ITERATIONS, --number_iterations NUMBER_ITERATIONS
                        Optional. Number of iterations. If not specified, the
                        number of iterations is calculated depending on a
                        device.
  -nireq NUMBER_INFER_REQUESTS, --number_infer_requests NUMBER_INFER_REQUESTS
                        Optional. Number of infer requests. Default value is
                        determined automatically for device.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Optional. Batch size value. If not specified, the
                        batch size value is determined from Intermediate
                        Representation
  -stream_output [STREAM_OUTPUT]
                        Optional. Print progress as a plain text. When
                        specified, an interactive progress bar is replaced
                        with a multi-line output.
  -t TIME, --time TIME  Optional. Time in seconds to execute topology.
  -progress [PROGRESS]  Optional. Show progress bar (can affect performance
                        measurement). Default values is 'False'.
  -nstreams NUMBER_STREAMS, --number_streams NUMBER_STREAMS
                        Optional. Number of streams to use for inference on
                        the CPU/GPU in throughput mode (for HETERO and MULTI
                        device cases use format
                        <device1>:<nstreams1>,<device2>:<nstreams2> or just
                        <nstreams>). Default value is determined automatically
                        for a device. Please note that although the automatic
                        selection usually provides a reasonable performance,
                        it still may be non - optimal for some cases,
                        especially for very small networks. See samples README
                        for more details.
  -nthreads NUMBER_THREADS, --number_threads NUMBER_THREADS
                        Number of threads to use for inference on the CPU
                        (including HETERO and MULTI cases).
  -pin {YES,NO,NUMA}, --infer_threads_pinning {YES,NO,NUMA}
                        Optional. Enable threads->cores ('YES' is default
                        value), threads->(NUMA)nodes ('NUMA') or completely
                        disable ('NO')CPU threads pinning for CPU-involved
                        inference.
  --exec_graph_path EXEC_GRAPH_PATH
                        Optional. Path to a file where to store executable
                        graph information serialized.
  -pc [PERF_COUNTS], --perf_counts [PERF_COUNTS]
                        Optional. Report performance counters.
  --report_type {no_counters,average_counters,detailed_counters}
                        Optional. Enable collecting statistics report.
                        "no_counters" report contains configuration options
                        specified, resulting FPS and latency.
                        "average_counters" report extends "no_counters" report
                        and additionally includes average PM counters values
                        for each layer from the network. "detailed_counters"
                        report extends "average_counters" report and
                        additionally includes per-layer PM counters and
                        latency for each executed infer request.
  --report_folder REPORT_FOLDER
                        Optional. Path to a folder where statistics report is
                        stored.
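As a usage example of the report options above, a command like the following (the reports/ folder name is an arbitrary choice) would write average per-layer counters to a statistics report file instead of dumping them to the console:

python3 benchmark_app.py -m models/resnet-50.xml -i images/ -t 20 -b 4 --report_type average_counters --report_folder reports/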

 
