vosk expansion_rules don't work #43
Comments
You may need to quote expansion_rules:
artikel: "[der|die|das]" |
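For context on why the quotes matter: in YAML, an unquoted value that starts with `[` is a flow sequence, so `artikel: [der|die|das]` loads as a one-element list instead of a template string. Below is a minimal sketch of a check that could flag such rules after loading. `find_unquoted_rules` is a hypothetical helper, not part of the add-on, and the dicts stand in for what a YAML loader would return.

```python
def find_unquoted_rules(expansion_rules: dict) -> list[str]:
    """Return the names of rules whose value is not a plain string."""
    return [
        name
        for name, value in expansion_rules.items()
        if not isinstance(value, str)
    ]


# What a YAML loader yields for:  artikel: [der|die|das]
unquoted = {"artikel": ["der|die|das"]}
# What it yields for:             artikel: "[der|die|das]"
quoted = {"artikel": "[der|die|das]"}

print(find_unquoted_rules(unquoted))  # ['artikel']
print(find_unquoted_rules(quoted))    # []
```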
Ah, nice. Works now. Thank you very much. You might want to add that to the example in "vosk/DOCS.md" here and in the "README.md" of wyoming-vosk. I tried it exactly as it's written there and that didn't work either. |
Thanks! I'll update the example and the docs. Any thoughts on the add-on itself? Can you share what your use case is maybe? I haven't promoted it at all yet. I'm thinking of making a tutorial video. |
I'm a fan of your work. I've been using Rhasspy and Romkabouter's ESP32 Satellite before. Nothing properly productive, mainly tinkering around. I'm currently digging into how this one works.
My general thoughts are: wow, there is a lot of work to do on the microcontroller side... Silence detection and VAD only kind of work with the ADF (which unfortunately isn't open source, which isn't great at all), and the media_player component and some others don't work with the ESP-IDF requirement. openWakeWord works, but it tears down the pipeline and fires on_start and on_end every few seconds. And I'm still debugging stuff, so my hard disk gets filled with recordings of silence and random stuff. I'd like this to be easier for someone who is new to the stuff. But... I even managed to train my own wake word. That's awesome.
In the long run I'd like something like all the cool signal processing that's available in the big voice assistants: being able to play music and subtract the output from the microphone so we can simultaneously listen to music and instruct it to stop, and having microphone arrays, far-field voice control, beamforming and speaker recognition available. I suppose you've at one point seen how the Amazon bugging devices work; the signal processing really adds to the real-world usability. But we're still missing the absolute basics here. (I always preferred projects like ESP8266Audio to the ESP-ADF because it's free software. But there is no signal processing available and it's mainly for outputting sound.)
Whisper is a bit slow on my old server. And I really liked the idea of constraining the STT to the predefined sentences. (I'm currently porting the stuff from my Rhasspy add-on config. But I still struggle with ESPHome instead.) It immediately makes it blazing fast and does away with problems like a preposition not being transcribed correctly. For a wider audience it would be great if the sentences came (automatically) from what HA is able to understand. But it doesn't seem like this was our main concern at this point. (And I've played with Vosk before. It's really easy to write a few shell scripts or small Python scripts to integrate it into your own small projects. I had tied it into an Asterisk telephony server at some point.)
I think I'm going to file more bug reports once I get to dig into the Vosk add-on. Currently the in/out replace doesn't work for me. It always gives me the fixed sentence back, but without the replacement being done. And I'd need that for words which aren't in the keywords file (like the 'loo mo ss' example), only with German compound words that get blanks inserted in between.
My main use case would be a voice assistant for the kitchen that can play music, set timers, tell you a joke and announce the weather in the morning, the delay on public transport, and the birthdays and appointments of the day. And add things to the shopping list. I take it for granted that I can also turn some lights in the house on and off. (I'd scatter around a few more ESP32s to announce things in other rooms and play music, once it becomes useful.)
And the last thing: I'm fooling around with LLMs (artificial intelligence). An AI agent could give the house a proper personality and be tied into HA to control everything like the ship's computer on Star Trek does. That's maybe something to consider for the year after the Year of the Voice. |
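The in/out replacement for compound words that come back split by blanks can be pictured as a phrase lookup. This is a minimal sketch, assuming the replacement is a plain text substitution on the recognized sentence. `apply_replacements` is a hypothetical helper and not how the add-on implements it. The values are taken from the device list posted further down in this thread.

```python
import re


def apply_replacements(text, values):
    """Replace each spoken `in` phrase with its `out` text."""
    mapping = {v["in"]: str(v["out"]) for v in values}
    # Try longer phrases first so "wohnzimmer licht" wins over plain "licht".
    phrases = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, phrases)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


device_values = [
    {"in": "licht", "out": "Deckenlicht Wohnzimmer"},
    {"in": "wohnzimmer licht", "out": "Wohnzimmerlicht"},
    {"in": "flur licht", "out": "Flurlicht"},
]

print(apply_replacements("schalte wohnzimmer licht ein", device_values))
print(apply_replacements("schalte licht ein", device_values))
```

Sorting by length is the design choice that matters here: without it, the short entry "licht" could match inside a longer spoken phrase and produce the wrong device name.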
Thanks for the feedback @h3ndrik! I've updated the Vosk add-on to (hopefully) fix the in/out replace issue.
Agreed. Hardware is so varied and moving so fast that it's hard to make progress. With Espressif especially, they keep deprecating boards by the time I get something working on them 😄
I got an Echo Dot for testing and wow, it can hear you through just about anything. I don't know that we'll ever get there, honestly. Maybe if the big players give up fully on voice and sell their tech to someone willing to make chips that the rest of us can use.
Generating the sentences automatically from what HA understands is the plan, actually. I need an API on the Home Assistant side to get the entities and areas that have been exposed to Assist. With that, I can just plug those lists into the default intents and generate the possible sentences.
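Plugging entity and area lists into sentence templates could look roughly like this. It is only a sketch: the templates, names and areas are made up for illustration, `generate_sentences` is a hypothetical helper, and the Home Assistant API mentioned above is not shown because it isn't described in this thread.

```python
from itertools import product


def generate_sentences(templates, names, areas):
    """Fill every template with every combination of entity name and area."""
    return [
        template.format(name=name, area=area)
        for template, name, area in product(templates, names, areas)
    ]


# Hypothetical data; in practice these lists would come from Home Assistant.
templates = ["schalte {name} im {area} ein", "schalte {name} im {area} aus"]
names = ["das licht", "den fernseher"]
areas = ["wohnzimmer", "flur"]

sentences = generate_sentences(templates, names, areas)
print(len(sentences))  # 2 * 2 * 2 = 8
print(sentences[0])    # schalte das licht im wohnzimmer ein
```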
LLMs are getting faster and faster, so I'm hopeful that next year we'll be able to run a local LLM and use it with Home Assistant. I'm seeing more experiments where they constrain the LLM to produce JSON, for example. That would let you interface it to HA much more easily, and still produce interesting responses (inside the JSON). Thanks again for testing and following my work! |
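To illustrate the JSON idea: if the model is constrained to emit JSON, the receiving side can validate the reply before acting on it. This is a sketch under assumptions. The reply shape (`service`, `entity_id`, `response`), the allowed-services set and `parse_llm_reply` are all hypothetical and not an existing Home Assistant interface.

```python
import json

ALLOWED_SERVICES = {"light.turn_on", "light.turn_off"}  # hypothetical whitelist


def parse_llm_reply(reply_text):
    """Parse a JSON reply and reject anything outside the expected shape."""
    try:
        reply = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(reply, dict):
        return None
    service = reply.get("service")
    if not isinstance(service, str) or service not in ALLOWED_SERVICES:
        return None
    if not isinstance(reply.get("entity_id"), str):
        return None
    return reply


good = (
    '{"service": "light.turn_on", "entity_id": "light.wohnzimmer", '
    '"response": "Es werde Licht!"}'
)
print(parse_llm_reply(good)["response"])
print(parse_llm_reply("Sure, turning on the light!"))  # None: not JSON
```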
Thank you very much. Can confirm it works and I've closed that issue.
Hehe. I still have some older ESP32 boards (not by Espressif) in my drawer. Mainly because I like to start hobby projects and don't finish them. But sometimes I pull out something like the old TAudio board, which I'm currently testing this on.
Sadly I don't know much about signal processing. I've searched the internet for libraries and algorithms for noise suppression, echo cancellation and voice stuff. Seems there isn't anything good available to tinkerers like me. Mostly companies selling their proprietary solutions and DSPs. I'd like to get some microphone array board, but it would need to come with the signal processing already implemented. (And in a way that allows me to poke around.)
Things are still moving crazy fast. I run Home Assistant on an (old) server, so I'm not as constrained as someone with a single-board computer would be. The server doesn't have a GPU, but I can run llama.cpp in a different virtual machine and I'm willing to connect it to the smart home at some point. I'm aware of llama.cpp's feature to constrain it to some grammar, like outputting JSON. In my opinion smaller models like Mistral 7B are surprisingly capable and still fast on a regular computer. And it knows a lot of things. Probably enough to be able to interact with me. I think with models the size of Microsoft's phi-1 (but tuned for this use case) we could have it run on a single-board computer. I'm still not completely sold on the idea of having LLMs and smart assistants in my life. They're nice, but on the other hand I can already do lots of stuff the way it is. |
Wow, the expansion rules expand fast. I've added the HA intent sentences for turning devices and lights on and off and for setting brightness and color to the Vosk sentences, with optional articles, prepositions and areas. Now it says "Loading /share/vosk/sentences/de.yaml" for a minute and then the Vosk add-on kills the async event handler ;-) It stopped displaying the list when it got to a 4- or 5-digit length... Neither limiting sentences nor correcting them copes with that amount. I don't know enough about Vosk to make any recommendations here. But it seems ingesting that sentences file at runtime doesn't scale anywhere close to real-world usage. I've gone back to Faster-Whisper, which gets most of it right but always has one character or word wrong. ("Schalte das Wohnzimmerlicht ein" -> "Schalte das Wohnzimmer nicht ein" ("Don't turn on the living room")) Meh. |
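The "Wohnzimmerlicht" vs. "Wohnzimmer nicht" slip is the kind of error that correcting against a list of known sentences can absorb. Here is a minimal sketch using `difflib` from the Python standard library as a stand-in. The thread doesn't say which matcher the add-on uses, and `correct_transcript`, the sentence list and the 0.8 cutoff are assumptions for illustration.

```python
import difflib


def correct_transcript(transcript, known_sentences, cutoff=0.8):
    """Return the closest known sentence, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        transcript.lower(), known_sentences, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


known = [
    "schalte das wohnzimmerlicht ein",
    "schalte das wohnzimmerlicht aus",
    "schalte das flurlicht ein",
]

# The mis-transcription from the comment above snaps back to the intended sentence.
print(correct_transcript("Schalte das Wohnzimmer nicht ein", known))
# Something unrelated stays unmatched.
print(correct_transcript("Wie wird das Wetter", known))
```

A linear scan like this is fine for a few hundred sentences but would itself become slow with the five- and six-digit sentence counts discussed here.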
Can you post the YAML here so I can benchmark it? |
Update: I've switched to using an sqlite database to store the sentences, and only giving vosk the available words. On a Raspberry Pi 4, it only takes 1.34 seconds to generate 22,786 sentences, and 0.01 seconds to load the recognizer. |
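The idea of keeping the sentences in SQLite and handing Vosk only the vocabulary can be sketched as follows. The table layout and the `build_sentence_db` and `lookup` helpers are hypothetical, since the add-on's actual schema isn't shown in this thread. The final grammar string follows the JSON phrase-list format that Vosk's `KaldiRecognizer` accepts as its optional grammar argument.

```python
import json
import sqlite3


def build_sentence_db(pairs):
    """Store (spoken, output) pairs in SQLite and collect the vocabulary."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sentences (spoken TEXT PRIMARY KEY, output TEXT)")
    conn.executemany("INSERT OR REPLACE INTO sentences VALUES (?, ?)", pairs)
    conn.commit()
    vocabulary = sorted({word for spoken, _ in pairs for word in spoken.split()})
    return conn, vocabulary


def lookup(conn, spoken):
    """Return the output text for an exact spoken sentence, or None."""
    row = conn.execute(
        "SELECT output FROM sentences WHERE spoken = ?", (spoken,)
    ).fetchone()
    return row[0] if row else None


pairs = [
    ("schalte das licht ein", "Schalte das Deckenlicht Wohnzimmer ein"),
    ("schalte das licht aus", "Schalte das Deckenlicht Wohnzimmer aus"),
]
conn, vocabulary = build_sentence_db(pairs)

# Only the distinct words go to the recognizer, not every full sentence.
grammar = json.dumps([" ".join(vocabulary), "[unk]"], ensure_ascii=False)
print(grammar)
print(lookup(conn, "schalte das licht ein"))
```

The vocabulary grows with the number of distinct words rather than with the number of sentence combinations, which would explain why loading the recognizer becomes cheap.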
Well, I can still make it hang for a few minutes if I try something like the following (setting brightness in percent). After that, some async worker generates an error message, but at least it seems to generate the SQLite database for the next pipeline run.
sentences:
  # light_HassLightSet
  - "<setzen> [<artikel>] Helligkeit von <name> auf {brightness} [Prozent] [ein]"
  - "[<artikel>] Helligkeit von <name> auf {brightness} [Prozent] <setzen>"
  - "dimme [[<artikel>] Helligkeit [von|vom] [<artikel>]] <name> [auf|zu] {brightness} [Prozent]"
  - "<name> [auf|zu] {brightness} [Prozent] dimmen"
  # - in: "dimme <name>"
  #   out: "Setze Helligkeit von <name> auf 25"
lists:
  device:
    values:
      - in: fernseher
        out: Wohnzimmer TV
      - in: licht
        out: Deckenlicht Wohnzimmer
      - in: wohnzimmer licht
        out: Wohnzimmerlicht
      - in: deko licht
        out: Dekolicht
      - in: flur licht
        out: Flurlicht
      - in: licht am esstisch
        out: Esstischbeleuchtung
      - in: küchen beleuchtung
        out: Küchenbeleuchtung
      - in: licht in der küche
        out: Küchenbeleuchtung
  brightness:
    values:
      - in: ein
        out: 1
      - in: eins
        out: 1
      - in: fünf
        out: 5
      - in: zehn
        out: 10
      - in: fünfzehn
        out: 15
      - in: zwanzig
        out: 20
      - in: fünfundzwanzig
        out: 25
      - in: dreißig
        out: 30
      - in: vierzig
        out: 40
      - in: fünfzig
        out: 50
      - in: sechzig
        out: 60
      - in: siebzig
        out: 70
      - in: fünfundsiebzig
        out: 75
      - in: achtzig
        out: 80
      - in: fünfundachtzig
        out: 85
      - in: neunzig
        out: 90
      - in: fünfundneunzig
        out: 95
      - in: neunundneunzig
        out: 99
      - in: hundert
        out: 100
  color:
    values:
      - in: "wei(ß|ss)"
        out: "white"
      - in: "schwarz"
        out: "black"
      - in: "rot"
        out: "red"
      - in: "orange"
        out: "orange"
      - in: "gelb"
        out: "yellow"
      - in: "grün"
        out: "green"
      - in: "blau"
        out: "blue"
      - in: "violett"
        out: "purple"
      - in: "lila"
        out: "purple"
      - in: "braun"
        out: "brown"
expansion_rules:
  artikel_bestimmt: "(der|die|das|dem|der|den|des)"
  artikel_unbestimmt: "(ein|eine|eines|einer|einem|einen)"
  artikel: "(<artikel_bestimmt>|<artikel_unbestimmt>)"
  name: "[<artikel>] {device}"
  setzen: "(setz[e|en]|stell[e|en]|einstellen|änder[e|n]|veränder[e|n])"
  licht: "[<artikel>] (Licht|Lampe|Beleuchtung)"
  brightness: "{brightness} [Prozent]"
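For a sense of scale, the number of sentences a template expands to can be counted without generating them. The sketch below assumes every optional part and every alternative combines independently, so it gives an upper bound (the duplicated "der" is counted twice). `count_expansions` is a hypothetical helper and not the add-on's code. The rules and list sizes are the ones from the configuration above.

```python
import re

# Tokens: brackets and "|", <rule> references, {list} slots, plain words.
TOKEN = re.compile(r"[()\[\]|]|<[^>]+>|\{[^}]+\}|[^\s()\[\]|<>{}]+")


def count_expansions(template, rules, list_sizes):
    """Count the sentences a template expands to (upper bound)."""
    tokens = TOKEN.findall(template)
    pos = 0

    def alternatives(closer):
        nonlocal pos
        total, product = 0, 1  # sum over "|" branches, product within a branch
        while pos < len(tokens) and tokens[pos] != closer:
            tok = tokens[pos]
            pos += 1
            if tok == "|":
                total, product = total + product, 1
            elif tok == "(":
                product *= alternatives(")")
            elif tok == "[":
                product *= alternatives("]") + 1  # +1 for leaving it out
            elif tok.startswith("<"):
                product *= count_expansions(rules[tok[1:-1]], rules, list_sizes)
            elif tok.startswith("{"):
                product *= list_sizes[tok[1:-1]]
            # a plain word contributes a factor of 1
        pos += 1  # step over the closing bracket
        return total + product

    return alternatives(None)


rules = {
    "artikel_bestimmt": "(der|die|das|dem|der|den|des)",
    "artikel_unbestimmt": "(ein|eine|eines|einer|einem|einen)",
    "artikel": "(<artikel_bestimmt>|<artikel_unbestimmt>)",
    "name": "[<artikel>] {device}",
    "setzen": "(setz[e|en]|stell[e|en]|einstellen|änder[e|n]|veränder[e|n])",
}
list_sizes = {"device": 8, "brightness": 19}  # entries per list in the config

first = "<setzen> [<artikel>] Helligkeit von <name> auf {brightness} [Prozent] [ein]"
print(f"{count_expansions(first, rules, list_sizes):,}")  # 1,549,184
```

The factors are 13 (setzen) x 14 (optional article) x 112 (name) x 19 (brightness) x 2 x 2, so this single template already accounts for about 1.5 million combinations under these assumptions.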
Once I add an expansion_rules section to my de.yaml, the Vosk add-on crashes on use (when speech gets sent). It works fine if I delete the expansion_rules paragraph.
Debug Log of the VOSK Add-on: