
  • Mar 03, 2020

  • Mar 04, 2020

    This is challenging for me, as I had never worked at this level of speech recognition (the Google Cloud Speech-to-Text API hides all the machine learning steps from you :p)

  • Mar 06, 2020

    Realized I will not complete any code by the deadline :/

    It could have been easier for me if I had chosen something related to image processing or computer vision (one idea I had was to integrate OpenVINO on a small robot I'm assembling, so I could detect my cats and/or my son around the house), but I decided to try out something different to maximize what I learn from the course ¯\_(ツ)_/¯

  • Mar 07, 2020

    • Continuing the study of speech recognition, I just found Mozilla's DeepSpeech, which looks very interesting! After looking at the examples, I decided to try to convert its model (it's a TensorFlow implementation) to IR, and found a page in the OpenVINO documentation about it.

    • Successfully converted the DeepSpeech model to IR using the Model Optimizer. Although the documentation mentions DeepSpeech v0.3.0, I managed to convert v0.5.1 (I tried the latest version, v0.6.1, with no luck) with just some adjustments:

        python3 /opt/intel/openvino/deployment_tools/model_optimizer/ \
            --input_model deepspeech-0.5.1-models/output_graph.pb \
            --freeze_placeholder_with_value "input_lengths->[16]" \
            --input input_node,previous_state_h/read,previous_state_c/read \
            --input_shape [1,16,19,26],[1,2048],[1,2048] \
            --output raw_logits,lstm_fused_cell/GatherNd,lstm_fused_cell/GatherNd_1 \


        Model Optimizer arguments:
        Common parameters:
                - Path to the Input Model:      /opt/src/deepspeech-0.5.1-models/output_graph.pb
                - Path for generated IR:        /opt/src/.
                - IR output name:       output_graph
                - Log level:    ERROR
                - Batch:        Not specified, inherited from the model
                - Input layers:         input_node,previous_state_h/read,previous_state_c/read
                - Output layers:        raw_logits,lstm_fused_cell/GatherNd,lstm_fused_cell/GatherNd_1
                - Input shapes:         [1,16,19,26],[1,2048],[1,2048]
                - Mean values:  Not specified
                - Scale values:         Not specified
                - Scale factor:         Not specified
                - Precision of IR:      FP32
                - Enable fusing:        True
                - Enable grouped convolutions fusing:   True
                - Move mean values to preprocess section:       False
                - Reverse input channels:       False
        TensorFlow specific parameters:
                - Input model in text protobuf format:  False
                - Path to model dump for TensorBoard:   None
                - List of shared libraries with TensorFlow custom layers implementation:        None
                - Update the configuration file with input/output node names:   None
                - Use configuration file used to generate the model with Object Detection API:  None
                - Operations to offload:        None
                - Patterns to offload:  None
                - Use the config file:  None
        Model Optimizer version:        2020.1.0-61-gd349c3ba4a
        [ SUCCESS ] Generated IR version 10 model.
        [ SUCCESS ] XML file: /opt/src/./output_graph.xml
        [ SUCCESS ] BIN file: /opt/src/./output_graph.bin
        [ SUCCESS ] Total execution time: 5.50 seconds.
        [ SUCCESS ] Memory consumed: 1130 MB.

      NOTE: the above command was run in my OpenVINO container, after installing the requirements. This step will be detailed later, and a proper Docker container will be created as well.
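
The numbers passed to --input_shape and --freeze_placeholder_with_value can be reconstructed from a few model parameters. The following is only a sketch of where they likely come from, assuming the usual DeepSpeech 0.5.1 export settings (16 time steps per chunk, 9 context frames on each side of the current frame, 26 MFCC features per frame, 2048 LSTM units). These values and every name below are assumptions for illustration, not taken from the project's scripts:

```shell
#!/bin/sh
# Hypothetical sketch: parameter values and names are assumptions.

N_STEPS=16      # time steps per chunk (matches input_lengths->[16])
N_CONTEXT=9     # context frames on each side of the current frame
N_INPUT=26      # MFCC features per frame
N_CELL=2048     # size of each LSTM state vector

# Print the shapes of input_node and the two LSTM state inputs,
# in the comma-separated form expected by --input_shape.
ds_input_shape() {
    window=$((2 * ${N_CONTEXT} + 1))   # past frames + current frame + future frames
    printf '[1,%s,%s,%s],[1,%s],[1,%s]\n' \
        "$N_STEPS" "$window" "$N_INPUT" "$N_CELL" "$N_CELL"
}
```

When passing such a value on a command line it is safer to quote it, for example --input_shape "$(ds_input_shape)", so the shell never treats the square brackets as a glob pattern.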

    • The next step is to integrate this Intermediate Representation into an application and develop the project showcase :)

  • Mar 08, 2020

    • Implemented two scripts to help get things done:


        Downloads the DeepSpeech models and audio samples (version 0.5.1 by default, but this can be changed when running the script) and extracts them to /tmp. The script also copies the audio samples to the current directory, under the audio folder. Just run:

        $ ./

        To change which version of DeepSpeech to download, run it as follows:

        $ DEEPSPEECHVERSION=0.6.1 ./

        NOTE: v0.6.1 is not supported right now, only v0.5.1.
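
An environment-variable override like this is commonly implemented with a default-value expansion. The following is a minimal sketch, not the project's actual script: the function names are hypothetical, and the release URL layout is an assumption about how the DeepSpeech archives are published on GitHub.

```shell
#!/bin/sh
# Hypothetical sketch: function names and URL layout are assumptions.

# Use DEEPSPEECHVERSION from the environment, falling back to 0.5.1.
ds_version() {
    printf '%s\n' "${DEEPSPEECHVERSION:-0.5.1}"
}

# Build the models archive URL for the selected version
# (assumed GitHub release layout).
ds_models_url() {
    v="$(ds_version)"
    printf 'https://github.com/mozilla/DeepSpeech/releases/download/v%s/deepspeech-%s-models.tar.gz\n' "$v" "$v"
}

# Only v0.5.1 is known to convert, so warn about anything else.
ds_check_version() {
    if [ "$(ds_version)" != "0.5.1" ]; then
        echo "warning: only v0.5.1 is supported right now" >&2
        return 1
    fi
    return 0
}
```

With these helpers, a download step could call ds_check_version first and then fetch "$(ds_models_url)" with a tool such as curl or wget before extracting the archive to /tmp.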


        Uses the OpenVINO Model Optimizer to convert the model into the Intermediate Representation. The resulting IR files are copied to the current directory, under the model folder.

        $ ./

        Again, it is possible to set which DeepSpeech version to use:

        $ DEEPSPEECHVERSION=0.6.1 ./

        This conversion is based on this document, with a slight modification to the --output flag: lstm_fused_cell/Gather -> lstm_fused_cell/GatherNd and lstm_fused_cell/Gather_1 -> lstm_fused_cell/GatherNd_1.
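
The conversion step can be sketched as a small wrapper around the Model Optimizer call shown in the Mar 07 entry, with the renamed output nodes kept in one place. This is a sketch under assumptions, not the project's script: MO_SCRIPT, MODEL_DIR, DRY_RUN, the function names, and the default model directory under /tmp are all hypothetical, and the path to the Model Optimizer entry script has to be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical wrapper: variable and function names are assumptions.

# Output nodes for the v0.5.1 graph: GatherNd/GatherNd_1 instead of the
# Gather/Gather_1 names used in the referenced document.
mo_output_nodes() {
    printf '%s\n' "raw_logits,lstm_fused_cell/GatherNd,lstm_fused_cell/GatherNd_1"
}

# Run (or, with DRY_RUN=1, only print) the Model Optimizer command.
convert_model() {
    # Abort with a message if the caller did not provide the script path.
    : "${MO_SCRIPT:?set MO_SCRIPT to the Model Optimizer script path}"
    v="${DEEPSPEECHVERSION:-0.5.1}"
    model_dir="${MODEL_DIR:-/tmp/deepspeech-${v}-models}"
    if [ "${DRY_RUN:-0}" = "1" ]; then runner="echo"; else runner=""; fi
    # Arguments are quoted so the square brackets are never globbed.
    $runner python3 "$MO_SCRIPT" \
        --input_model "${model_dir}/output_graph.pb" \
        --freeze_placeholder_with_value "input_lengths->[16]" \
        --input "input_node,previous_state_h/read,previous_state_c/read" \
        --input_shape "[1,16,19,26],[1,2048],[1,2048]" \
        --output "$(mo_output_nodes)"
}
```

Setting DRY_RUN=1 before calling convert_model prints the assembled command without running it, which is a cheap way to check the arguments before a real conversion.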

    • Added a specialized Dockerfile to this project. It includes OpenVINO 2020.1 with all requirements already installed (see requirements.txt). The container depends on my OpenVINO Docker image, and all instructions are available in docker/

    • The actual use of the model has not been developed yet. As the project showcase submission deadline is today, I will finish writing the documentation and work on the implementation after the submission.

  • Mar 09, 2020

    • Fixed a typo in the script (a missing character in the disable_nhwc_to_nchw parameter)

    • Added documentation about what has been done so far and prepared for submission! \o/