python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte #1279

KJ7LNW · 2023-02-23T20:29:31Z

So far I have not been able to reproduce this problem, but while using nerd-dictation we hit a Vosk decoding issue that appears to be rooted in the Vosk Python API code. I am running Python 3.6 on CentOS 7 (which gets updates from Red Hat until 2024) with the vosk-model-en-us-0.42-gigaspeech model.

You can see the backtrace below. Notice that the last frame raises the error inside the Vosk API at "vosk/__init__.py", line 194, in FinalResult:

Traceback (most recent call last):
  File "./nerd-dictation", line 1962, in <module>
    main()
  File "./nerd-dictation", line 1958, in main
    args.func(args)
  File "./nerd-dictation", line 1845, in <lambda>
    vosk_grammar_file=args.vosk_grammar_file,
  File "./nerd-dictation", line 1440, in main_begin
    vosk_grammar_file=vosk_grammar_file,
  File "./nerd-dictation", line 1215, in text_from_vosk_pipe
    json_text = rec_handle_fn_wrapper_from_final_result()
  File "./nerd-dictation", line 1054, in rec_handle_fn_wrapper_from_final_result
    json_text = rec.FinalResult()
  File "/usr/src/nerd-dictation/lib64/python3.6/site-packages/vosk/__init__.py", line 194, in FinalResult
    return _ffi.string(_c.vosk_recognizer_final_result(self._handle)).decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

@ideasman42, the developer of nerd-dictation, suggests that this could be fixed in Vosk by adding errors='ignore'. For example:

>>> b'A\xaeB'.decode('utf-8', errors='ignore')
'AB'
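To compare the options, here is a minimal sketch of how the standard `bytes.decode` error handlers treat a stray 0xa0 byte like the one in the traceback. The `decode_result` helper and the sample bytes are made up for illustration and are not part of the Vosk API:

```python
def decode_result(raw: bytes, errors: str = "strict") -> str:
    """Hypothetical helper: decode recognizer output with a chosen error handler."""
    return raw.decode("utf-8", errors=errors)


# Made-up sample: a lone 0xa0 byte followed by ordinary ASCII text.
sample = b"\xa0hello"

# "strict" is the default and raises, as in the traceback above.
try:
    decode_result(sample)
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = str(exc)

# "ignore" silently drops the undecodable byte.
ignored = decode_result(sample, "ignore")

# "replace" substitutes U+FFFD, so the problem stays visible in the text.
replaced = decode_result(sample, "replace")

print(strict_error)
print(repr(ignored))
print(repr(replaced))
```

Note that 'ignore' only hides the symptom. If the bad byte comes from corrupted memory rather than from the model's text, 'replace' at least leaves a marker that something went wrong.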

There are 4 different locations where text is decoded from UTF-8, so perhaps they need to be fixed up as well:


nshmyrev · 2023-02-23T20:34:24Z

Do you use the original gigaspeech model, or did you modify it? I can't see a way for the original model to return a non-UTF-8 character.

KJ7LNW · 2023-02-23T20:39:17Z

Original, unmodified.

nshmyrev · 2023-02-25T22:19:12Z

We need to reproduce it somehow. The 0xa0 output is very strange, to be honest; it feels more like memory corruption. How often do you see this issue?

KJ7LNW · 2023-02-26T23:24:12Z

I've only seen it once. If it happens again I'll let you know.

nshmyrev · 2023-02-26T23:32:44Z

OK, let's keep it open. I'll think about how to catch it better.

KJ7LNW · 2023-02-28T19:40:33Z

There is a possibility that this was triggered because the Vosk object was reset (rec.reset()) from a signal context while the API was executing. Nerd-dictation supports suspend through SIGTSTP/SIGSTOP, so when it gets a stop signal it issues a reset on the Vosk API object. If Vosk happened to be executing at that moment, then the reset may create an inconsistency in the library. (Note that this is not multi-threading, just interruption from a signal.)
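One way to rule this scenario out would be to have the signal handler only record the request and to do the actual reset from the main loop, between recognizer calls. The sketch below illustrates that pattern; `FakeRecognizer` and `DeferredReset` are hypothetical names, not nerd-dictation or Vosk code, and the real reset method name may differ:

```python
import signal


class FakeRecognizer:
    """Hypothetical stand-in for the Vosk recognizer object."""

    def __init__(self):
        self.reset_count = 0

    def reset(self):
        # The real recognizer's reset method may be named differently.
        self.reset_count += 1


class DeferredReset:
    """Record a reset request in the handler; apply it later from the main loop."""

    def __init__(self, rec):
        self._rec = rec
        self._pending = False

    def handle_signal(self, signum, frame):
        # Do not touch the recognizer here, only set a flag.
        self._pending = True

    def apply_if_pending(self):
        # Call this from the main loop, between recognizer calls.
        if not self._pending:
            return False
        self._pending = False
        self._rec.reset()
        return True


rec = FakeRecognizer()
deferred = DeferredReset(rec)

# In a real program the handler would be registered for the chosen signal, e.g.:
#   signal.signal(signal.SIGUSR1, deferred.handle_signal)
# Here the handler is called directly to simulate a signal arriving.
deferred.handle_signal(signal.SIGINT, None)
count_after_signal = rec.reset_count

applied = deferred.apply_if_pending()
count_after_apply = rec.reset_count
```

With this arrangement the recognizer is only ever touched from one place in the main loop, so a reset cannot land in the middle of another recognizer call.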

This is only speculation, but I wanted to point it out in case it's a problem being caused external to your API library.

In terms of troubleshooting, are there any 0xa0 characters in the text generated by the vosk-model-en-us-0.42-gigaspeech model, even if some of them are part of a Unicode sequence? If it is actually a character representation issue in the model and not an issue related to suspending the process and issuing a reset, then we can find all text examples that contain 0xa0 and try triggering it with those words.
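A search like that could be sketched as follows, assuming the model's vocabulary can be read as raw byte lines. Where the word list lives inside the model directory is not stated in this thread, so the file-reading step is only shown as a comment with a hypothetical path; `find_lines_with_byte` is an illustrative helper:

```python
def find_lines_with_byte(lines, needle=0xA0):
    """Return (line_number, raw_line, is_valid_utf8) for lines containing the byte."""
    hits = []
    for lineno, raw in enumerate(lines, start=1):
        if needle not in raw:
            continue
        try:
            raw.decode("utf-8")
            valid = True
        except UnicodeDecodeError:
            valid = False
        hits.append((lineno, raw, valid))
    return hits


# Made-up sample data standing in for a vocabulary file:
sample_lines = [
    b"hello 1\n",
    "voil\u00e0 2\n".encode("utf-8"),  # U+00E0 encodes as C3 A0, so 0xa0 is a valid continuation byte
    b"\xa0bad 3\n",                    # lone 0xa0 at the start, not valid UTF-8
]

hits = find_lines_with_byte(sample_lines)
for lineno, raw, valid in hits:
    print(lineno, raw, "valid UTF-8" if valid else "INVALID UTF-8")

# With a real file (hypothetical path), read it in binary mode:
#   with open("path/to/model/words.txt", "rb") as fh:
#       hits = find_lines_with_byte(fh)
```

Entries flagged as valid contain 0xa0 only as part of a multi-byte sequence. Those would be the words to try dictating, since a truncated or misaligned copy of such a sequence could begin with 0xa0.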

KJ7LNW mentioned this issue Feb 23, 2023

Ignore unicode error within Vosk ideasman42/nerd-dictation#91

Open

KJ7LNW closed this as completed Feb 26, 2023

nshmyrev reopened this Feb 26, 2023
