The purpose of this research is to demonstrate how a Voice Command System is implemented. The primary input to this system is the user's voice. The voice input is converted to text with a speech-to-text engine, and the resulting text is used to process queries and collect relevant information. After the information is collected, it is converted back to speech using a text-to-speech engine and the appropriate output is delivered to the user. This research realizes a public-API-based speech recognition system with a Raspberry Pi and a USB microphone as hardware and the Python programming language as software. Osun State Polytechnic Iree has many different offices and lecture halls, and directions are currently given by signposts and by asking people around the campus; this has become a significant problem as some signposts have fallen off. The signposts can only direct visitors to buildings, not to offices inside a building, and some visitors are embarrassed by the way they are answered, given wrong directions, or not answered at all. This prompted the design of a system that gives visitors directions to offices and lecture rooms in the Engineering building. The research covers the hardware specification of the speech recognition system, maps an application of public speech APIs, and shows a realization using appropriate hardware.
A Voice Command System is a device that processes voice as an input, decodes or understands the meaning of that input, and provides a suitable voice output. Any voice command system requires three basic components: speech-to-text conversion, query processing, and text-to-speech conversion. Speech is also an essential aspect of communication. Since sound and voices are processed faster than written text, vocal control systems in computer devices are becoming omnipresent. There have been notable advances in the field of speech recognition, among them the wide use of big data and deep learning in this area (Van Der Aalst, 1998). Such advances are attributed to the deep learning methods used in the development of several speech recognition systems in the technology industry. Using large-scale speech data, Google lowered its word error rate by 6% to 10% relative, for a system with a word error rate of 17% to 52% (Almomani et al., 2018).
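The three components can be sketched as a minimal Python pipeline. This is only an illustration, assuming the third-party speech_recognition and pyttsx3 packages and a public web speech API backend; the keyword table and responses below are hypothetical, not the system's actual directory.

# Minimal three-stage pipeline sketch: speech-to-text, query processing, text-to-speech.
# Assumes the speech_recognition and pyttsx3 packages are installed;
# the keyword/response mapping is purely illustrative.
import speech_recognition as sr
import pyttsx3

DIRECTIVES = {
    "library": "The library is on the ground floor, to the right of the main entrance.",
    "dean": "The Dean's office is on the first floor, room 105.",
}

def listen() -> str:
    """Capture audio from the default microphone and return the recognized text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)  # public web speech API

def answer(query: str) -> str:
    """Scan the query for known keywords and return the matching directive."""
    query = query.lower()
    for keyword, response in DIRECTIVES.items():
        if keyword in query:
            return response
    return "Sorry, I do not have directions for that request."

def speak(text: str) -> None:
    """Convert the response text back to speech on the local speaker."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

if __name__ == "__main__":
    speak(answer(listen()))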
This study aims to develop a system that gives directions to visitors within the SET building. Towards the achievement of this goal, the study addresses the following objectives:
Traditional phonetic-based approaches, for example HMM-based language approaches as well as acoustic and pronunciation-based approaches, involve a variety of components and preparations. Streamlining this planning and delivery process is therefore necessary. Such models are prone to spelling errors and need the transcripts to be cleaned up using a specific language model. Newer models are based on the growing CTC approach (World Health Organization, 2018).
The first component translates numbers and abbreviations into their fully written-out form; this is also known as text normalization. The second component involves the transformation of the signal into a comprehensible one (Amusat, 2009).
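A minimal sketch of this normalization step, using only the Python standard library, is given below. The abbreviation table and the digit-by-digit reading of numbers are illustrative assumptions, not the system's actual rules.

# Minimal text-normalization sketch: expand abbreviations and spell out digits
# before the text is passed to the text-to-speech stage.
ABBREVIATIONS = {"dr.": "doctor", "prof.": "professor", "rm": "room", "engr.": "engineer"}
DIGITS = {str(i): w for i, w in enumerate(
    ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"])}

def normalize(text: str) -> str:
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            # Spell out numbers digit by digit, e.g. "105" -> "one zero five".
            words.append(" ".join(DIGITS[d] for d in token))
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Prof. Ade is in Rm 105"))
# -> "professor ade is in room one zero five"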
Speech recognition is the ability of a machine to understand spoken words and phrases (Anusuya & Katti, 2009). Such words or phrases are then translated into a format that the computer can understand. Speech recognition systems are commonly characterized by their vocabulary (Panwar et al., 2017), for example as small-vocabulary systems or large-vocabulary, small-user systems (Olupohunda, 2012).
Voice or speech recognition is the ability of a computer or program to receive and interpret dictation, or to recognize and execute spoken commands. For use with computers, analog audio has to be converted into digital signals. To decode the signal, a machine needs a digital dictionary of terms or syllables, that is a vocabulary, against which the incoming information can easily be compared. When the program runs, the stored speech patterns are loaded from the hard disk into memory, and a comparator tests the stored patterns against the A/D converter output. The attention-based model called Listen, Attend and Spell (LAS) "attends" to different parts of the signal. Compared to CTC models, attention-based models are not conditionally independent and can learn all the components of speech recognition jointly. This means, for example, that a separate language model is not needed during deployment, which makes them practical for low-memory applications (Javale et al., 2013). By the end of 2016, CTC models (with or without an external language model) were already performing well.
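The idea of comparing an incoming utterance against stored patterns can be sketched as follows. This is only an illustration, assuming the librosa package and hypothetical WAV files, and using MFCC features with dynamic time warping as one possible comparison measure; the system described in this paper does not necessarily use this method.

# Illustrative template-matching comparator: compare a new utterance against
# stored reference recordings using MFCC features and dynamic time warping (DTW).
import librosa

def mfcc_features(path: str):
    """Load a WAV file and return its MFCC feature matrix."""
    signal, rate = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=13)

def dtw_cost(path_a: str, path_b: str) -> float:
    """Accumulated DTW alignment cost between two utterances (lower = more similar)."""
    cost_matrix, _ = librosa.sequence.dtw(X=mfcc_features(path_a), Y=mfcc_features(path_b))
    return float(cost_matrix[-1, -1])

# Hypothetical stored patterns and query recording.
templates = {"library": "templates/library.wav", "dean": "templates/dean.wav"}
query = "query.wav"
best = min(templates, key=lambda name: dtw_cost(query, templates[name]))
print("Closest stored pattern:", best)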
Many changes have been suggested since the original LAS model. Carnegie Mellon University, MIT, and Google Brain introduced latent sequence decompositions (LSDs), and lip reading was handled for the first time at a level approaching human performance (Nan et al., 2017). Oxford University and Google DeepMind extended LAS to "Watch, Listen, Attend and Spell". In addition to acoustics-focused home automation that addresses physical challenges, related work has focused on the design and implementation of sound intelligence within the broader Raspberry Pi framework for operating and controlling remote sensors, motion detectors, and video cameras. The inability to access and track equipment from a remote location is one of the main causes of energy loss. Users drive these systems through iOS or Android applications, and the networks use a variety of communication mechanisms including WLAN, GSM, Bluetooth, and ZigBee. Existing systems offer various control mechanisms and configurations and have already been deployed in many locations for a range of applications; this work includes an analysis of both of these structures (Karan et al., 2017). Some researchers have outlined the use of wireless technologies such as ZigBee to minimize cabling costs and ensure stable and safe communication in a home automation network. ZigBee is a wireless network technology with low data-rate requirements and features such as low cost, low power consumption, and fast response, making it best suited for small-area networks such as homes. One such system applies the Microsoft Speech API (SAPI) to enable voice recognition when a user issues a command. The system consists of three main components: i) a ZigBee-based smart home server, ii) smart environment sensor modules, and iii) a voice command control module.
To successfully implement the visitor directive system, the following procedure was used for data acquisition.
The new system suffers from the limitation of storing only pre-defined and restricted voice commands. Therefore, the user cannot obtain all the details reliably.
The hardware setup of the voice command system is shown in Figure 1.
Figure 1. The Hardware setup of the Voice Command System
This research work eliminates the hectic process of tiring configurations and setups as well as the overheating of the system, which ultimately affects performance. Conversing with the system in natural language, and hearing its responses, makes the user feel as though they are talking to another human and lets them forget that a system performs all the tasks.
Figure 2 shows the architecture of the proposed system. The work initially seemed difficult as a whole. It was divided into modules to make the work easier and also for debugging. The three modules are as follows:
Figure 2. Architecture of the Proposed System
For consistency and accuracy, a data collection protocol was designed. The protocol defines the speaker selection process, keywords, recording parameters (sampling frequency and quantization rate), and the format of the output audio files. The specific recording requirements and materials are enumerated below; a short recording sketch illustrating such parameters is also given.
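The following is a minimal sketch of how such recording parameters might be applied in practice. It assumes the sounddevice and scipy packages; the 16 kHz sampling rate, 16-bit quantization, duration, and output file name are illustrative values, not the study's actual settings.

# Minimal sketch of a recording step consistent with a fixed data collection protocol.
import sounddevice as sd
from scipy.io import wavfile

SAMPLE_RATE = 16000      # sampling frequency in Hz (illustrative)
DURATION_SECONDS = 3     # length of each keyword recording (illustrative)
OUTPUT_FILE = "speaker01_keyword01.wav"  # hypothetical naming format

# Record a single mono utterance with 16-bit quantization.
recording = sd.rec(int(DURATION_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="int16")
sd.wait()  # block until the recording is finished

wavfile.write(OUTPUT_FILE, SAMPLE_RATE, recording)
print("Saved", OUTPUT_FILE)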
First, when the user starts the program, they submit input through a microphone (Figure 3). Essentially, the user's sound data is captured and fed to the machine for further processing. The sound input is then transmitted to a speech-to-text converter, which translates the audio input into text output that can be recognized and processed by the computer. The text is then evaluated and scanned for keywords. Our voice command system is built around this keyword method: the text is scanned for the appropriate keywords and, when keywords are matched, the corresponding output is produced. The result is in text form. This is then transformed into speech output by a text-to-speech converter, which parses the text and renders it as audio through the speech engine. The signal is transmitted via the speakers connected to the Raspberry Pi audio jack.
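The keyword-scanning step can also tolerate small recognition errors through approximate string matching. The sketch below uses only the standard-library difflib module; the lecturer names and directives in the directory are hypothetical.

# Keyword matching for the directive lookup, tolerating minor speech-to-text errors.
from difflib import get_close_matches

DIRECTORY = {
    "adeyemi": "Engr. Adeyemi's office is on the first floor, room 112.",
    "okafor": "Dr. Okafor's office is on the ground floor, beside the main laboratory.",
    "lecture hall b": "Lecture Hall B is at the end of the corridor on the second floor.",
}

def find_directive(recognized_text: str) -> str:
    """Return the stored directive whose keyword best matches the recognized text."""
    text = recognized_text.lower()
    # Exact substring match first (also covers multi-word keys).
    for keyword, directive in DIRECTORY.items():
        if keyword in text:
            return directive
    # Fuzzy match on individual words to tolerate recognizer errors such as "okafer".
    for word in text.split():
        matches = get_close_matches(word, DIRECTORY.keys(), n=1, cutoff=0.75)
        if matches:
            return DIRECTORY[matches[0]]
    return "Sorry, that name or room was not found in the directory."

print(find_directive("please where is doctor okafor"))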
Figure 3. Flow of Events in Voice Command System
The above-mentioned modules were systematically implemented using the necessary hardware and software and then tested. The results were promising; snapshots of some of them are presented below.
Figure 4 shows the desktop environment of the Raspbian operating system. The environment indicates that the developed program is ready to start operation.
Figure 4. Raspbian Desktop Environment
Figure 5 illustrates the voice directive environment. The user calls the name of the lecturer, and the directive is displayed on the panel while the speaker voices out the response.
Figure 5. Voice Directive Environment
The Raspberry Pi needs a 5 V/2 A power supply to function properly. Any change in the voltage rating affects the board, causing frequent reboots, inefficient performance, and similar issues, so it is necessary to give it a proper power supply. The peripherals used in this prototype do not need a separate power supply, as they draw their power from the USB ports of the Raspberry Pi.
Machine learning should be implemented fully into this system so that it can learn new processes by itself and adapt to the user based on past experience, making it easier for the user to interact with the assistant. Memory can be improved so that the system stores new information gathered from the user and uses it in the future if required. Many more modules can be created as needed. Another important improvement would be the addition of native languages. It is a demanding process, yet it would be useful to all the people who do not understand English and would make the device usable by almost everyone. With the addition of native languages, the device becomes much more user-friendly and easily accessible.