
Commit 774ef7e

authored Sep 2, 2022
Merge pull request #291 from cmusphinx/pocketsphinx_main_again
Update pocketsphinx command-line program
2 parents 96615a8 + 74d7254 commit 774ef7e

9 files changed, +228 -139 lines changed
 

‎README.md

+7 -6
@@ -65,14 +65,15 @@ which defaults to `live`. The commands are as follows:
 contains a JSON object with these fields, which have short names
 to make the lines more readable:
 
-- `a`: Start time in seconds, from the beginning of the stream
-- `e`: End time in seconds, from the beginning of the stream
-- `p`: Posterior probability of utterance
-- `t`: Full text of output
+- `b`: Start time in seconds, from the beginning of the stream
+- `d`: Duration in seconds
+- `p`: Estimated probability of the recognition result, i.e. a
+number between 0 and 1 which may be used as a confidence score
+- `t`: Full text of recognition result
 - `w`: List of segments (usually words), each of which in turn
-contains the `a`, `e`, `p`, and `t` fields, for start, end,
+contains the `b`, `d`, `p`, and `t` fields, for start, end,
 probability, and the text of the word. In the future we may
-also support hierarchical outputs in which case `w` could be
+also support hierarchical results in which case `w` could be
 present.
 
 - `single`: Recognize the input as a single utterance, and write a
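
Reviewer's illustration, not part of the commit: the README hunk above renames the start field from `a` to `b` and replaces the end time `e` with a duration `d`, so any consumer that wants an end time now has to compute `b + d`. A minimal Python sketch of reading one output line under the new field names follows. The sample line is made up for illustration, and the helper name `parse_result_line` is hypothetical.

```python
import json


def parse_result_line(line):
    """Parse one line of line-delimited JSON using the b/d/p/t/w fields."""
    obj = json.loads(line)
    words = [
        {
            "text": seg["t"],
            "start": seg["b"],
            # The new format carries a duration, so the end time is derived.
            "end": seg["b"] + seg["d"],
            "prob": seg["p"],
        }
        # Read `w` defensively in case a result carries no segment list.
        for seg in obj.get("w", [])
    ]
    return {
        "text": obj["t"],
        "start": obj["b"],
        "end": obj["b"] + obj["d"],
        "prob": obj["p"],
        "words": words,
    }


# Made-up sample line; real output values will differ.
SAMPLE = (
    '{"b": 0.5, "d": 1.5, "p": 0.75, "t": "hello world", "w": ['
    '{"b": 0.5, "d": 0.5, "p": 0.75, "t": "hello"}, '
    '{"b": 1.0, "d": 1.0, "p": 0.5, "t": "world"}]}'
)

result = parse_result_line(SAMPLE)
```

Code written against the old `a`/`e` fields would raise a `KeyError` on this output, so the rename is a breaking change for existing consumers.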

‎doxygen/CMakeLists.txt

+2 -1
@@ -1,6 +1,6 @@
 find_package(Doxygen)
 if(DOXYGEN_FOUND)
-set(DOXYGEN_PROJECT_NUMBER 5.0.0rc1)
+set(DOXYGEN_PROJECT_NUMBER 5.0.0rc2)
 set(DOXYGEN_EXAMPLE_PATH ${CMAKE_SOURCE_DIR}/examples)
 set(DOXYGEN_EXCLUDE_PATTERNS
 *export.h
@@ -10,6 +10,7 @@ if(DOXYGEN_FOUND)
 WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}/include)
 endif()
 install(FILES
+pocketsphinx.1
 pocketsphinx_batch.1
 pocketsphinx_mdef_convert.1
 sphinx_lm_convert.1

‎doxygen/pocketsphinx_continuous.1 ‎doxygen/pocketsphinx.1

+70 -50
@@ -1,64 +1,87 @@
-.TH POCKETSPHINX_CONTINUOUS 1 "2016-04-01"
+.TH POCKETSPHINX 1 "2016-04-01"
 .SH NAME
-pocketsphinx_continuous \- Run speech recognition in continuous listening mode
+pocketsphinx \- Run speech recognition on audio data
 .SH SYNOPSIS
-.B pocketsphinx_continuous
-.RI [ \fB\-infile\fR
-\fIfilename.wav\fR ]
-[ \fB\-inmic yes\fR ]
-[ \fIoptions\fR ]...
+.B pocketsphinx
+[ \fIoptions\fR... ]
+[ \fBlive\fR |
+\fBsingle\fR |
+\fBsoxflags\fR ]
 .SH DESCRIPTION
 .PP
-This program opens the audio device or a file and waits for speech. When it
-detects an utterance, it performs speech recognition on it.
+The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel
+16-bit PCM audio from standard input and attemps to recognize speech
+in it using the default acoustic and language model. It accepts a
+large number of options which you probably don't care about, and a
+\fIcommand\fP which defaults to ‘\f[CR]live\fP’. The commands are as
+follows:
+.TP
+.B live
+Detect speech segments in standard input, run
+recognition on them (using those options you don't care about), and
+write the results to standard output in line-delimited JSON. I
+realize this isn't the prettiest format, but it sure beats XML. Each
+line contains a JSON object with these fields, which have short names
+to make the lines more readable:
+.IP
+"b": Start time in seconds, from the beginning of the stream
+.IP
+"d": Duration in seconds
+.IP
+"p": Estimated probability of the recognition result, i.e. a number between
+0 and 1 which may be used as a confidence score
+.IP
+"t": Full text of recognition result
+.IP
+"w": List of segments (usually words), each of which in turn contains the
+\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for
+start, end, probability, and the text of the word. In the future we
+may also support hierarchical results in which case ‘\f[CR]w\fP’ could
+be present.
+.TP
+.B single
+Recognize the input as a single utterance, and write a JSON object in the same format described above.
+.TP
+.B soxflags
+Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate
+input format. Note that because the ‘\f[CR]sox\fP’ command-line is
+slightly quirky these must always come \fIafter\fP the filename or
+\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the
+microphone). You can run live recognition like this:
+.EX
+sox -d $(pocketsphinx soxflags) | pocketsphinx
+.EE
+or decode from a file named "audio.mp3" like this:
+.EX
+sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx
+.EE
 .PP
-To record from microphone and decode use
-.TP
-.B \-inmic yes
-.PP
-To decode a 16kHz 16-bit mono WAV file use
-.TP
-.B \-infile \fIfilename.wav\fR
-.PP
-You can also specify
-.B \-lm
-or
-.B \-fsg
-or
-.B \-kws
-depending on whether you are using a statistical language
-model or a finite-state grammar or look for a keyphase.
+By default only errors are printed to standard error, but if you want more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial results are not printed, maybe they will be in the future, but don't hold your breath. Force-alignment is likely to be supported soon, however.
 .SH OPTIONS
 .TP
-.B \-adcdev
-of audio device to use for input.
-.TP
 .B \-agc
 Automatic gain control for c0 ('max', 'emax', 'noise', or 'none')
 .TP
 .B \-agcthresh
 Initial threshold for automatic gain control
 .TP
 .B \-allphone
-phoneme decoding with phonetic lm
+phoneme decoding with phonetic lm (given here)
 .TP
 .B \-allphone_ci
 Perform phoneme decoding with phonetic lm and context-independent units only
 .TP
 .B \-alpha
 Preemphasis parameter
 .TP
-.B \-argfile
-file giving extra arguments.
-.TP
 .B \-ascale
 Inverse of acoustic model scale for confidence score calculation
 .TP
 .B \-aw
 Inverse weight applied to acoustic scores.
 .TP
 .B \-backtrace
-Print results and backtraces to log file.
+Print results and backtraces to log.
 .TP
 .B \-beam
 Beam width applied to every frame in Viterbi search (smaller values mean wider beam)
@@ -73,17 +96,14 @@ Language model probability weight for bestpath search
 Number of components in the input feature vector
 .TP
 .B \-cmn
-Cepstral mean normalization scheme ('current', 'prior', or 'none')
+Cepstral mean normalization scheme ('live', 'batch', or 'none')
 .TP
 .B \-cmninit
-Initial values (comma-separated) for cepstral mean when 'prior' is used
+Initial values (comma-separated) for cepstral mean when 'live' is used
 .TP
 .B \-compallsen
 Compute all senone scores in every frame (can be faster when there are many senones)
 .TP
-.B \-debug
-level for debugging messages
-.TP
 .B \-dict
 pronunciation dictionary (lexicon) input file
 .TP
@@ -117,6 +137,12 @@ Frame rate
 .B \-fsg
 format finite state grammar file
 .TP
+.B \-fsgdir
+directory for FSG files
+.TP
+.B \-fsgext
+extension for FSG files (including leading dot)
+.TP
 .B \-fsgusealtpron
 Add alternate pronunciations to FSG
 .TP
@@ -147,12 +173,6 @@ Run forward lexicon-tree search (1st pass)
 .B \-hmm
 containing acoustic model files.
 .TP
-.B \-infile
-file to transcribe.
-.TP
-.B \-inmic
-Transcribe audio from microphone.
-.TP
 .B \-input_endian
 Endianness of input data, big or little, ignored if NIST or MS Wav
 .TP
@@ -169,7 +189,7 @@ file with keyphrases to spot, one per line
 Delay to wait for best detection score
 .TP
 .B \-kws_plp
-Phone loop probability for keyword spotting
+Phone loop probability for keyphrase spotting
 .TP
 .B \-kws_threshold
 Threshold for p(hyp)/p(alternatives) ratio
@@ -201,6 +221,9 @@ Base in which all log-likelihoods calculated
 .B \-logfn
 to write log messages in
 .TP
+.B \-loglevel
+Minimum level of log messages (DEBUG, INFO, WARN, ERROR)
+.TP
 .B \-logspec
 Write out logspectral files instead of cepstra
 .TP
@@ -250,7 +273,7 @@ Use memory-mapped I/O (if possible) for model files
 Number of cep coefficients
 .TP
 .B \-nfft
-Size of FFT
+Size of FFT, or 0 to set automatically (recommended)
 .TP
 .B \-nfilt
 Number of filter banks
@@ -286,7 +309,7 @@ to log raw audio files to
 Remove DC offset from each frame
 .TP
 .B \-remove_noise
-Remove noise with spectral subtraction in mel-energies
+Remove noise using spectral subtraction
 .TP
 .B \-round_filters
 Round mel filter frequencies to DFT points
@@ -315,9 +338,6 @@ Write out cepstral-smoothed logspectral files
 .B \-svspec
 specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38)
 .TP
-.B \-time
-Print word times in file transcription.
-.TP
 .B \-tmat
 state transition matrix input file
 .TP

‎doxygen/pocketsphinx.1.in

+72
@@ -0,0 +1,72 @@
+.TH POCKETSPHINX 1 "2016-04-01"
+.SH NAME
+pocketsphinx \- Run speech recognition on audio data
+.SH SYNOPSIS
+.B pocketsphinx
+[ \fIoptions\fR... ]
+[ \fBlive\fR |
+\fBsingle\fR |
+\fBsoxflags\fR ]
+.SH DESCRIPTION
+.PP
+The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel
+16-bit PCM audio from standard input and attemps to recognize speech
+in it using the default acoustic and language model. It accepts a
+large number of options which you probably don't care about, and a
+\fIcommand\fP which defaults to ‘\f[CR]live\fP’. The commands are as
+follows:
+.TP
+.B live
+Detect speech segments in standard input, run
+recognition on them (using those options you don't care about), and
+write the results to standard output in line-delimited JSON. I
+realize this isn't the prettiest format, but it sure beats XML. Each
+line contains a JSON object with these fields, which have short names
+to make the lines more readable:
+.IP
+"b": Start time in seconds, from the beginning of the stream
+.IP
+"d": Duration in seconds
+.IP
+"p": Estimated probability of the recognition result, i.e. a number between
+0 and 1 which may be used as a confidence score
+.IP
+"t": Full text of recognition result
+.IP
+"w": List of segments (usually words), each of which in turn contains the
+\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for
+start, end, probability, and the text of the word. In the future we
+may also support hierarchical results in which case ‘\f[CR]w\fP’ could
+be present.
+.TP
+.B single
+Recognize the input as a single utterance, and write a JSON object in the same format described above.
+.TP
+.B soxflags
+Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate
+input format. Note that because the ‘\f[CR]sox\fP’ command-line is
+slightly quirky these must always come \fIafter\fP the filename or
+\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the
+microphone). You can run live recognition like this:
+.EX
+sox -d $(pocketsphinx soxflags) | pocketsphinx
+.EE
+or decode from a file named "audio.mp3" like this:
+.EX
+sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx
+.EE
+.PP
+By default only errors are printed to standard error, but if you want more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial results are not printed, maybe they will be in the future, but don't hold your breath. Force-alignment is likely to be supported soon, however.
+.SH OPTIONS
+.\" ### ARGUMENTS ###
+.SH AUTHOR
+Written by numerous people at CMU from 1994 onwards. This manual page
+by David Huggins-Daines <dhdaines@gmail.com>
+.SH COPYRIGHT
+Copyright \(co 1994-2016 Carnegie Mellon University. See the file
+\fILICENSE\fR included with this package for more information.
+.br
+.SH "SEE ALSO"
+.BR pocketsphinx_batch (1),
+.BR sphinx_fe (1).
+.br
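
Reviewer's illustration, not part of the commit: the man page above documents the shell pipeline `sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx`. A rough Python equivalent might look like the sketch below, assuming both `sox` and `pocketsphinx` are on `PATH`. The helper names `build_sox_command` and `recognize_file` are hypothetical, and the exact text printed by `pocketsphinx soxflags` is not shown in this diff, so the sketch simply passes it through unchanged.

```python
import shlex
import subprocess


def build_sox_command(source, soxflags_output):
    """Place the soxflags arguments after the input, as the man page requires."""
    return ["sox", source, *shlex.split(soxflags_output)]


def recognize_file(source):
    """Pipe `sox <source> <soxflags>` into `pocketsphinx`; return output lines."""
    # Ask pocketsphinx which sox arguments produce the input format it expects.
    flags = subprocess.run(
        ["pocketsphinx", "soxflags"],
        capture_output=True, text=True, check=True,
    ).stdout
    sox = subprocess.Popen(build_sox_command(source, flags), stdout=subprocess.PIPE)
    try:
        # With no command given, pocketsphinx defaults to `live`.
        decoded = subprocess.run(
            ["pocketsphinx"],
            stdin=sox.stdout, capture_output=True, text=True, check=True,
        )
    finally:
        sox.stdout.close()
        sox.wait()
    # Each line is one JSON object, per the description of `live` above.
    return decoded.stdout.splitlines()
```

Passing `-d` as `source` would read from the microphone, as in the first man page example, but the call would then block until the audio stream ends.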

‎doxygen/pocketsphinx_continuous.1.in

-43
This file was deleted.

‎include/pocketsphinx.h

+6
@@ -118,6 +118,12 @@ typedef struct ps_seg_s ps_seg_t;
 POCKETSPHINX_EXPORT
 void ps_default_search_args(ps_config_t *);
 
+/**
+* Sets default file paths and parameters based on configuration.
+*/
+POCKETSPHINX_EXPORT
+void ps_expand_model_config(ps_config_t *config);
+
 /**
 * Gets the system default model directory, if any exists.
 *
