
Speech Recognition & Response in Web or Mobile Directly, Without Alexa/Google Home Dependency

In the era of voice-enabled devices like Google Assistant and Amazon Alexa, it is clear that in the near future voice-enabled services will touch almost every aspect of our daily lives.

Because voice offers better interactivity and easier accessibility, it will be a game-changer for the next generation. There are already smart homes where every single thing in the house can talk to you and respond to your commands. Voice-enabled devices need neither a GUI nor visual content; the main concern is speed, and a well-built voice interface can respond faster than most other interfaces.

There are many libraries and APIs out there that you can use to get started with your voice bot, such as:

  1. Microsoft Bing Speech
  2. Google Web Speech API
  3. Google Cloud Speech
  4. IBM Speech to Text
  5. Wit.ai

We are going to use the Google Web Speech API through the SpeechRecognition library. It is easy to use because a default API key is hard-coded into the SpeechRecognition library, so you can get started without any configuration or authentication. Of course, like every other free tier, it has a daily limit (about 50 requests per day on the default key), and that limit cannot be raised. This makes it a good API for experimentation; for production or live scenarios, you will have to purchase a paid plan from one of the services mentioned above.
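
For example, a quick check from a Python shell (sample.wav here is just a hypothetical test recording) looks roughly like this:

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:  # hypothetical test recording
    audio = recognizer.record(source)

# Uses the default, rate-limited key baked into the library
print(recognizer.recognize_google(audio))

# A registered key, if you have one, can be passed explicitly:
# print(recognizer.recognize_google(audio, key="YOUR_API_KEY"))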

There will be a three-step process for every voice-enabled device (a minimal end-to-end sketch follows this list):

  1. Speech to Text: In this phase, we let our bot understand what we are saying. We provide either an audio file or a direct stream from our mic, and the bot converts this sound signal into text using the Google speech recognition API.
  2. Processing: After converting your voice into text, the bot processes that text and responds just as a text-based bot would. The processing can be anything from searching for a song on the web to setting an alarm or reminder.
  3. Text to speech: Once the bot has completed its processing and has your output data ready, the last step is to give the user that processed response in voice form, which can be achieved using the Google Text-to-Speech (gTTS) library.
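
To make these phases concrete, here is a minimal end-to-end sketch (assuming the dependencies listed below are installed and a working microphone is available); the processing step is just an echo, which is where your own bot logic would go:

import speech_recognition as sr
from gtts import gTTS

recognizer = sr.Recognizer()

# 1. Speech to text: capture one utterance from the microphone
with sr.Microphone() as source:
    audio = recognizer.listen(source)
text = recognizer.recognize_google(audio)

# 2. Processing: replace this echo with your own bot logic
reply = "Did you mean, " + text + " ?"

# 3. Text to speech: synthesize the reply and save it as an mp3
gTTS(text=reply, lang='en').save("reply.mp3")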

So, let's get started with developing your first voice-enabled bot.

Dependencies:
  1. SpeechRecognition library
    pip install SpeechRecognition
  2. Pyaudio
    pip install pyaudio
  3. Flask
    pip install Flask
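
Note: on some systems pip install pyaudio fails until the PortAudio headers are installed; on Debian/Ubuntu, for example, you would typically run
    sudo apt-get install portaudio19-dev
first (or brew install portaudio on macOS) before installing PyAudio.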
script.py


import json
import os
import io

from flask import Flask, Response, jsonify, request, redirect
from flask_socketio import SocketIO
from flask_cors import CORS
from gtts import gTTS
import speech_recognition as sr

import ss  # local helper module, see ss.py below

app = Flask(__name__)
socketio = SocketIO(app)
CORS(app)


# Redirect http to https on CloudFoundry
@app.before_request
def before_request():
    fwd = request.headers.get('x-forwarded-proto')
    if fwd is None:
        return None
    elif fwd == "https":
        return None
    elif fwd == "http":
        url = request.url.replace('http://', 'https://', 1)
        code = 301
        return redirect(url, code=code)


@app.route('/')
def Welcome():
    return app.send_static_file('index.html')


@app.route('/api/conversation', methods=['POST', 'GET'])
def getConvResponse():
    convText = request.form.get('convText')
    convContext = request.form.get('context', "{}")
    jsonContext = json.loads(convContext)
    if convText:
        response = "Did you mean, " + convText + " ?"
    else:
        response = "Hello There"
    responseDetails = {'responseText': response,
                       'context': response}
    return jsonify(results=responseDetails)


@app.route('/api/text-to-speech', methods=['POST'])
def getSpeechFromText():
    inputText = request.form.get('text')

    def generate():
        if inputText:
            # Synthesize the text with gTTS and stream the saved mp3 back
            audioOut = gTTS(text=inputText, lang='en', slow=False)
            audioOut.save("welcome.mp3")
            with open("welcome.mp3", 'rb') as f:
                data = f.read()
        else:
            print("Empty response")
            data = b"I have no response to that."
        yield data

    return Response(response=generate(), mimetype="audio/mpeg")


@app.route('/api/speech-to-text', methods=['POST'])
def getTextFromSpeech():
    recognizer = sr.Recognizer()
    f = request.files['audio_data']
    print(f, type(f))
    # Copy the uploaded recording into an in-memory buffer so that
    # SpeechRecognition can treat it as an audio file
    file_obj = io.BytesIO()
    file_obj.write(f.read())
    file_obj.seek(0)
    mic = sr.AudioFile(file_obj)
    response = ss.recognize_speech_from_mic(recognizer, mic)
    print('\nSuccess : {}\nError : {}\n\nText from Speech\n{}\n\n{}'
          .format(response['success'],
                  response['error'],
                  '-' * 17,
                  response['transcription']))
    return Response(response=response['transcription'], mimetype='text/plain')


port = 5000
if __name__ == "__main__":
    socketio.run(app, host='0.0.0.0', port=int(port))

ss.py

import speech_recognition as sr


def recognize_speech_from_mic(recognizer, microphone):
    # Read the whole audio source and try to transcribe it, returning a
    # dict that reports success, any error, and the transcription itself
    with microphone as source:
        audio = recognizer.record(source)

    response = {
        "success": True,
        "error": None,
        "transcription": None
    }

    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API unavailable/unresponsive"
    except sr.UnknownValueError:
        # speech was unintelligible
        response["error"] = "Unable to recognize speech"

    return response
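
Despite its name, this helper accepts any SpeechRecognition audio source, which is why script.py can hand it an in-memory WAV buffer. A quick standalone test with a local recording (sample.wav is a hypothetical file) would look like:

import speech_recognition as sr
from ss import recognize_speech_from_mic

recognizer = sr.Recognizer()
source = sr.AudioFile("sample.wav")  # hypothetical test recording
result = recognize_speech_from_mic(recognizer, source)
print(result["transcription"])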

Run the script.py file and it will start your server on port 5000. You'll need to call all the defined endpoints from your front end, i.e. HTML and JavaScript.
Let's understand the code first; a small test client follows the list below.

  1. getConvResponse: This function is responsible for storing the context of the conversation and returning the output to your HTML front end.
  2. getSpeechFromText: This function is responsible for converting your processed text output to voice output.
  3. getTextFromSpeech: This is the most important function, where we take voice input from the web recorder and convert it to text using the SpeechRecognition API. The resulting text is then passed to getConvResponse to save the context and process it.
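
The blog's front end calls these endpoints from HTML/JavaScript, but while developing you can also exercise them from a small Python client; a rough sketch using the requests package (pip install requests; sample.wav again stands in for whatever your recorder uploads) might look like:

import requests

BASE = "http://localhost:5000"

# 1. Speech to text: upload a WAV recording as the 'audio_data' field
with open("sample.wav", "rb") as f:
    text = requests.post(BASE + "/api/speech-to-text",
                         files={"audio_data": f}).text

# 2. Conversation: send the transcript and read the bot's reply
results = requests.post(BASE + "/api/conversation",
                        data={"convText": text, "context": "{}"}).json()
answer = results["results"]["responseText"]

# 3. Text to speech: fetch the spoken reply and save it locally
audio = requests.post(BASE + "/api/text-to-speech",
                      data={"text": answer}).content
with open("reply.mp3", "wb") as out:
    out.write(audio)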

In this tutorial, we developed a fairly simple example of a voice bot to help you understand how voice recognition works. You can use it in your live project by adding more functionality. Feel free to contact us with any queries and to learn more about the other services we provide in voice assistant app development.
