python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte #1279

Open
KJ7LNW opened this issue Feb 23, 2023 · 6 comments

Comments

@KJ7LNW

KJ7LNW commented Feb 23, 2023

So far I have not been able to reproduce this problem, but while using nerd-dictation we hit a Vosk decoding issue that appears to be rooted in the Vosk Python API code. I am running Python 3.6 on CentOS 7 (which gets updates from Red Hat until 2024) with the vosk-model-en-us-0.42-gigaspeech model.

You can see the backtrace below. Notice that the last frame raises the error inside the Vosk API at "vosk/__init__.py", line 194, in FinalResult:

Traceback (most recent call last):
  File "./nerd-dictation", line 1962, in <module>
    main()
  File "./nerd-dictation", line 1958, in main
    args.func(args)
  File "./nerd-dictation", line 1845, in <lambda>
    vosk_grammar_file=args.vosk_grammar_file,
  File "./nerd-dictation", line 1440, in main_begin
    vosk_grammar_file=vosk_grammar_file,
  File "./nerd-dictation", line 1215, in text_from_vosk_pipe
    json_text = rec_handle_fn_wrapper_from_final_result()
  File "./nerd-dictation", line 1054, in rec_handle_fn_wrapper_from_final_result
    json_text = rec.FinalResult()
  File "/usr/src/nerd-dictation/lib64/python3.6/site-packages/vosk/__init__.py", line 194, in FinalResult
    return _ffi.string(_c.vosk_recognizer_final_result(self._handle)).decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

@ideasman42, the developer of nerd-dictation, suggests that this could be fixed in Vosk by passing errors='ignore' to decode(). For example:

>>> b'A\xaeB'.decode('utf-8', errors='ignore')
'AB'
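
For comparison, here is a minimal sketch of how the standard error handlers treat a buffer that starts with 0xa0, the byte from the traceback. The sample bytes and the helper name describe_strict_failure are made up for illustration and are not taken from Vosk. Note that 'ignore' silently drops the bad byte, while 'replace' and 'backslashreplace' leave a visible trace of it.

```python
# Hypothetical sample: 0xa0 followed by ordinary ASCII text.
raw = b"\xa0hello"


def describe_strict_failure(data: bytes) -> str:
    """Return a short description of why strict UTF-8 decoding fails."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"byte 0x{data[exc.start]:02x} at position {exc.start}: {exc.reason}"
    return "decoded cleanly"


print(describe_strict_failure(raw))

# The three lenient handlers, applied to the same bytes.
ignored = raw.decode("utf-8", errors="ignore")            # bad byte dropped
replaced = raw.decode("utf-8", errors="replace")          # bad byte becomes U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")  # bad byte becomes \xa0 text
print(repr(ignored), repr(replaced), repr(escaped))
```

If the goal is to keep evidence of the bad output while avoiding the crash, 'replace' or 'backslashreplace' may be a better fit than 'ignore'.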

There are four different locations where text is decoded as UTF-8, so perhaps they need to be fixed as well:

  1. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L188
  2. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L191
  3. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L194
  4. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L267
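
One way the four call sites could share a single policy is a small decoding helper. The following is only a sketch of that idea, not Vosk's code: the function name decode_result, the logger name, and the context argument are all hypothetical. It uses 'replace' instead of 'ignore' and logs the raw bytes, on the assumption that a record of the bad output would help with reproducing the problem.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("vosk_decode_sketch")


def decode_result(raw: bytes, context: str = "result") -> str:
    """Decode UTF-8 bytes; on failure, log the bytes and decode leniently."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Keep a hex dump of the first bytes so the bad output can be inspected later.
        log.warning(
            "invalid UTF-8 in %s at byte %d: %s",
            context,
            exc.start,
            raw[:64].hex(),
        )
        return raw.decode("utf-8", errors="replace")


print(repr(decode_result(b'{"text": "ok"}')))
print(repr(decode_result(b'\xa0{"text": ""}', "FinalResult")))
```

A caller that parses the result as JSON would still need to handle a string that begins with U+FFFD, so this only moves the failure to a place where it can be reported more usefully.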
@nshmyrev
Collaborator

Do you use the original gigaspeech model, or did you modify it? I can't see a way for the original model to return a non-UTF-8 character.

@KJ7LNW
Author

KJ7LNW commented Feb 23, 2023

Original, unmodified.

@nshmyrev
Collaborator

We need to reproduce it somehow. The 0xa0 output is very strange, to be honest; it feels more like memory corruption. How often do you see this issue?

@KJ7LNW
Author

KJ7LNW commented Feb 26, 2023

I've only seen it once. If it happens again I'll let you know.

@KJ7LNW KJ7LNW closed this as completed Feb 26, 2023
@nshmyrev
Collaborator

OK, let's keep it open. I'll think about how to catch it better.

@nshmyrev nshmyrev reopened this Feb 26, 2023
@KJ7LNW
Author

KJ7LNW commented Feb 28, 2023

There is a possibility that this was triggered because the Vosk object was reset (rec.reset()) from a signal context while the API was executing. Nerd-dictation supports suspend through SIGTSTP/SIGSTOP, so when it gets a stop signal it issues a reset on the Vosk API object. If Vosk happened to be executing at that moment, then the reset may have left the library in an inconsistent state. (Note that this is not multi-threading, just interruption from a signal.)
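
If the reset really can race with a recognizer call, one common way to rule it out is to have the signal handler only record the request and to perform the reset from the main loop, between recognizer calls. The following is a minimal sketch of that pattern. ResetGate and its method names are hypothetical and are not part of nerd-dictation or Vosk; the caller passes in whatever callable performs the actual reset.

```python
class ResetGate:
    """Defer a reset requested from a signal handler to a safe point."""

    def __init__(self, reset_fn):
        self._reset_fn = reset_fn  # callable that performs the real reset
        self._pending = False

    def request(self, signum, frame):
        # Signal-handler signature; only sets a flag, never touches the recognizer.
        self._pending = True

    def apply_if_pending(self) -> bool:
        # Call this from the main loop, between recognizer calls.
        if not self._pending:
            return False
        self._pending = False
        self._reset_fn()
        return True


# Stand-in for the real reset call, so the sketch runs without Vosk.
calls = []
gate = ResetGate(lambda: calls.append("reset"))

# In a real program the handler would be registered with signal.signal(...).
# Here the handler is invoked directly to simulate a delivered signal.
gate.request(None, None)

first = gate.apply_if_pending()   # performs the deferred reset
second = gate.apply_if_pending()  # nothing pending any more
print(first, second, calls)
```

With this arrangement the reset can only run at the point where the main loop calls apply_if_pending(), so it cannot overlap with a recognizer call made from that same loop.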

This is only speculation, but I wanted to point it out in case it's a problem being caused external to your API library.

In terms of troubleshooting, are there any 0xa0 bytes in the text generated by the vosk-model-en-us-0.42-gigaspeech model, even if some of them are part of a multi-byte Unicode sequence? If this is actually a character representation issue in the model, and not an issue related to suspending the process and issuing a reset, then by finding all text examples that contain 0xa0 we can try triggering it with those words.
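
The search for such words could be sketched as follows. In UTF-8, 0xa0 appears as the second byte of characters such as "à" (U+00E0, encoded as C3 A0) and the no-break space (U+00A0, encoded as C2 A0), so any vocabulary entry containing them would match. The function name and the sample word list are made up for illustration; where the model stores its vocabulary is not assumed here, so the caller supplies the words.

```python
from typing import Iterable, List

TARGET_BYTE = 0xA0


def words_containing_byte(words: Iterable[str], byte: int = TARGET_BYTE) -> List[str]:
    """Return the words whose UTF-8 encoding contains the given byte value."""
    return [word for word in words if byte in word.encode("utf-8")]


# Hypothetical sample vocabulary, written with escapes to keep the source ASCII.
sample = ["hello", "voil\u00e0", "caf\u00e9", "a\u00a0b"]
matches = words_containing_byte(sample)
print([word.encode("utf-8") for word in matches])
```

In valid UTF-8, 0xa0 is always a continuation byte and never the first byte of a character, so an error at position 0 would suggest that the returned buffer did not begin at a character boundary, rather than that a particular word was encoded badly.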

2 participants