Exploring the Teaching Mode of English Audiovisual Speaking in Multimedia Network Environment
Abstract
Introducing multimedia network tools in English audiovisual teaching and building a new model of network-based multimedia teaching can make English audiovisual teaching more in line with students’ cognitive thinking characteristics and processes. This can improve the overall efficiency of English teaching in schools. Computers have been widely used in language evaluation and speech recognition for language learning, and speech recognition technology is an important reflection of the level of language learning. The large amount of language signal data, complex pronunciation changes, and high dimensionality of pronunciation feature parameters in the language learning process make it difficult to identify pronunciation features. The computational volume of pronunciation evaluation and recognition is too large, which requires high hardware resources and software resources to realize high-speed processing of massive pronunciation signals. To address the problem of low recognition rate of English pronunciation, this study proposes a sound recognition algorithm based on adaptive particle swarm optimization (PSO) matching pursuit (MP) sparse decomposition. The algorithm firstly improves the parameter adaptive setting of PSO based on the particle and population evolution rate, establishes parameter adaptive PSO, and realizes the optimization of adaptive PSO optimized MP sparse decomposition. The continuous Gabor super-complete atomic set is constructed based on the continuous space search property of PSO to improve the optimal atomic matching of the evolutionary process. Finally, the recognition of English pronunciation is realized by the support vector machine (SVM) algorithm. The test results show that the misjudgement rate for different mispronunciations is less than 1% when the system is used to evaluate the English pronunciation level. It proves that the method can effectively detect the mispronunciation and has high evaluation accuracy.
1. Introduction
As the postindustrial age gradually recedes, the knowledge economy is advancing rapidly and receiving unprecedented attention. In particular, the development and application of network technology and multimedia technology have gradually popularized online education in China [1]. This popularization is changing the traditional learning time, learning methods, and learning environment of contemporary Chinese students. At present, many schools carry out English audiovisual teaching and training directly over a LAN or campus network, so network technology is spreading rapidly in English teaching.
Multimedia technology relies on the Internet and applies it to English audiovisual teaching [2]. It can also simulate the audiovisual content that is difficult to be expressed by traditional teaching methods through images, sounds, videos, and other forms. This makes the expression of English teaching more visual and intuitive.
Multimedia audiovisual teaching is characterized by informationization, diversification, and autonomy of the learning environment, which fully embodies the “input hypothesis” put forward by the American applied linguist Krashen. According to Krashen, ideal learning input should have four characteristics: it should be comprehensible, interesting and relevant, not grammatically sequenced, and sufficient in quantity (i + 1). By increasing the amount of language input and lowering the affective filter, students’ listening and speaking ability can be improved based on high-quality language output (speaking). In the multimedia network environment, dynamic and real corpus and scene input cultivate students’ interest, create a relaxed language atmosphere, effectively promote the diversification and individuation of students’ learning behavior, fully mobilize students’ enthusiasm and initiative to acquire knowledge, and effectively improve students’ English listening and speaking ability.
As an international language, English is valued by many countries. Enthusiasm for English learning in China is constantly rising, and English learning software and platforms keep emerging. However, because the learning process lacks evaluation and corrective feedback on spoken pronunciation, most learners’ listening and speaking ability is weak, and it is difficult for them to achieve standard spoken communication. Speech recognition technology can assist English pronunciation learning by effectively correcting learners’ wrong pronunciation [3]; relatively mature examples include the FLUENCY foreign language pronunciation system, the EduSpeak voice system, and the PLASER pronunciation training system. Different pronunciation systems provide recognition and capture of speech signals, classification based on English pronunciation, feedback scoring based on language popularity and duration, etc., but all of these platforms have certain defects. For example, dynamic time warping (DTW) is used to train and identify English words and sentences, which effectively reduces the matching computation. However, the similarity between words is not taken into account in pronunciation, which makes it difficult to compare pronunciation evaluations. The PLASER speech system is based on the confidence of the factor score of English words, which makes the speech signal very fuzzy and difficult to match accurately. These systems mainly target computer platforms, and it is difficult for them to meet current requirements for portable, timely training.
Referring to research results in the field of speech signal detection, most existing algorithms extract signal characteristics such as MFCC, short-term energy spectrum, correlation coefficient, acoustic spectrum, ESMD permutation entropy, multiband energy, and sparse synthesis NMF. α-distribution analysis and T distribution can be used to reduce the interference of the external environment. Then, traditional classifiers such as K-nearest neighbor, Gaussian mixture model, SVM, DNN, and their combinations can be used to detect and recognize English pronunciation. Although good recognition results are achieved, these algorithms often need large sample sizes to support training, and they suffer from problems such as difficulty in setting the order reasonably and multi-hidden-layer error. In addition, the strong background noise of public environments interferes heavily with their feature extraction, detection, and recognition.
With the rise of deep learning, deep neural networks have also been applied to English pronunciation recognition and classification [4]. Based on an improved deep convolutional network, Lin et al. [5] extracted multi-scale normalized local features with stacked decreasing convolution kernels, improved the convergence speed and stability of the algorithm with a dynamic learning rate, and achieved a better recognition rate. Jia et al. [6] used a deep neural network with three hidden layers to identify MFCC features of acoustic signals and achieved better recognition results than SVM and GMM. Compared with traditional classifiers, deep learning networks improve the detection accuracy of pronunciation, but their huge parameter counts, complex parameter settings, and computational demands require further optimization and improvement in practical applications [7].
Matching pursuit (MP) achieves sparse signal decomposition and noise reduction without prior information and has a good adaptability to the acoustic signal under the interference of external environment. Zhou et al. [8] extracted acoustic signal features with the help of over-MP sparse decomposition and detected abnormal acoustic signals with DBN. Wang et al. [9] used secondary sparse decomposition and reconstruction of sound signals to eliminate background noise interference and then extracted features of reconstructed signals for recognition. Wang and Ding [10] used PCA and LDA to extract the features of acoustic signals for acoustic signal detection and recognition after MP sparse decomposition of acoustic signals.
Building on these studies, this study proposes a sound recognition algorithm based on adaptive particle swarm optimization MP sparse decomposition. Firstly, the algorithm improved the parameter adaptive setting of PSO based on the particle and population evolution rate, established the parameter adaptive PSO, and realized the optimization of MP sparse decomposition of the adaptive PSO optimization. Based on the continuous space search feature of PSO, the continuous Gabor super-complete atom set was constructed to improve the optimal atom matching degree in the evolutionary process. Finally, the recognition of English pronunciation was realized by the SVM algorithm. The results demonstrate the effectiveness and robustness of the proposed algorithm.
- (1) The adaptive setting of particle swarm parameters is improved, and the objective function of sparse decomposition based on the evolution rate of particle and population is optimized.
- (2) The continuous super-complete Gabor atom set is established based on the continuous set search characteristic of the adaptive particle swarm optimization algorithm to improve the matching degree of the best matched atom and the acoustic signal and speed up the atom matching search.
- (3) An SVM classifier is used to realize compound feature recognition of English pronunciation.
This study consists of five main parts: the first part is the introduction, the second part is state of the art, the third part is methodology, the fourth part is result analysis and discussion, and the fifth part is the conclusion.
2. State of the Art
As most English teachers grow up under the background of examination-oriented education, their teaching philosophy is largely influenced by the examination-oriented education philosophy, which leads them to follow the traditional teacher-centered teaching mode to carry out English audiovisual teaching [11]. In the phonic class, the teacher constantly explains, plays the recording to the students, and demonstrates and leads the students to pronounce. In this teaching mode of repeated playback, mechanical imitation, and parroting, there is no effective interaction between teachers and students. Students can only passively accept phonetic knowledge and lack flexible practice opportunities. The way is boring, and it is difficult for students to develop interest in learning.
With the continuous advancement of education informationization in China, most schools are equipped with multimedia phonetic teaching equipment, which provides convenience for English audiovisual speaking teaching [12]. Although most teachers can use multimedia technology, their use is limited to presenting phonetic knowledge. Because the technology is not put to good use, its advantages cannot be fully reflected, and multimedia teaching resources are wasted.
Students’ listening and speaking ability is generally low. The factors causing this phenomenon are diverse, such as inappropriate learning methods, the influence of the mother tongue, and the lack of a language environment [13]. As a result, a considerable number of students in the pronunciation course cannot pronounce accurately after listening to the recording; as time goes by, they gradually lose interest in English pronunciation learning. In the pronunciation class, these students show low enthusiasm, are even afraid of being called on by the teacher to speak English, and try to avoid all kinds of English communication activities. This problem seriously affects students’ learning confidence and is not conducive to the improvement of their English listening and speaking ability.
In the present situation of audiovisual English teaching in schools, students’ subjectivity has not been fully reflected. Phonetic courses mainly focus on teachers’ explanation of basic phonetics knowledge and test students’ pronunciation learning effect by means of written knowledge assessment. However, students lack the opportunity to practice phonetic skills. This teacher-centered explanation replaces the student-centered learning mode, which has a negative impact on students’ English listening and speaking ability.
Students at school have high self-esteem. Although they crave approval and praise from others, they are also afraid of making a fool of themselves in front of others, so many students dare not speak English in public for fear of being laughed at by other students. Therefore, few students take the initiative to speak in school English pronunciation class, and those who are asked to speak are more likely to make mistakes because of greater psychological pressure. In addition, some students with weak psychological quality become more self-conscious after making mistakes in class and dare not speak English in class. This makes speech errors more difficult to correct. In this vicious cycle, their English listening and speaking ability is more difficult to improve.
3. Methodology
3.1. English Pronunciation Recognition Algorithm
Let the element ax in the set D = {ax, x = 1, 2, …, V} be a unit vector of the space B = R^T. A signal is written as f = g · a, where g is the expansion coefficient and a = {a1, a2, …, aw} is the sparse decomposed atom set. Among all such expressions, the one with the minimum value of m is the sparse decomposition of f ∈ B.
In practical application, the discreteness of atomic set and the redundancy of super-complete set are contradictory to some extent. PSO has the property of continuous space search. If it is introduced into MP sparse decomposition process, it can improve the influence of atom set discretization.
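To make the idea of MP over a discrete atom set concrete, the following minimal sketch runs a greedy matching pursuit over a tiny hand-made dictionary. The function names, the toy dictionary, and the fixed iteration count are illustrative assumptions and are not taken from the original study.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, n_iter):
    """Greedy MP: at each step pick the unit-norm atom most correlated
    with the current residual and subtract its projection."""
    residual = list(signal)
    picks = []  # list of (atom index, expansion coefficient)
    for _ in range(n_iter):
        coeffs = [dot(residual, a) for a in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(coeffs[k]))
        g = coeffs[best]
        picks.append((best, g))
        residual = [r - g * a for r, a in zip(residual, atoms[best])]
    return picks, residual

# Toy overcomplete dictionary: three unit vectors in R^2 (hypothetical data).
s = 1 / math.sqrt(2)
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]
picks, residual = matching_pursuit([2.0, 2.0], atoms, n_iter=1)
```

With a discrete dictionary like this one, the search is limited to the atoms that happen to be listed, which is the limitation that a continuous search such as PSO is meant to relax.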
To avoid this problem, the particle needs a higher speed inheritance in the initial iteration to maintain the global search capability and a higher local search capability in the later iteration to maintain the stable solution. Based on this, a parameter adaptive adjustment strategy is proposed.
According to formulas (2) and (3), a larger population evolution rate indicates that after iteration, some of its particles can obtain better solutions than the current iteration; that is, they have better exploration ability. In subsequent iterations, these particles should conduct global optimization with a larger inertia factor. Otherwise, a local search is performed with a smaller value. When the of particle x is large, its evolution rate will be affected and decreased, and the corresponding inertia factor will be affected by the evolution rate and inherit more information of the particle in the last iteration. On the contrary, when it is small, the particle is less affected by the information of the previous iteration. In formula (3), the parameter values are minit = 0.9 and mend = 0.4.
It can be seen that the particle inertia factor of the next iteration is affected by the particle and population evolution ability of the previous iteration. The global optimization ability of a particle affects its search range, and the particle itself determines the setting of its independent inertia factor.
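Since formulas (2) and (3) are not reproduced in this text, the following is only a rough sketch of how an evolution-rate-driven inertia factor could behave. The linear mapping, the averaging of the two rates, and the function names are assumptions; only the end values 0.9 and 0.4 come from the text.

```python
M_INIT, M_END = 0.9, 0.4  # end values of the inertia factor given in the text

def evolution_rate(prev_fitness, curr_fitness):
    """Hypothetical evolution rate in [0, 1]: relative improvement of a
    fitness value that is being maximized (0 if there is no improvement)."""
    if curr_fitness <= prev_fitness or curr_fitness == 0:
        return 0.0
    return min(1.0, (curr_fitness - prev_fitness) / abs(curr_fitness))

def adaptive_inertia(particle_rate, population_rate):
    """Assumed rule: a larger combined evolution rate gives a larger inertia
    factor (more global search); a smaller rate favors local search."""
    rate = 0.5 * (particle_rate + population_rate)
    return M_END + (M_INIT - M_END) * rate

m_exploring = adaptive_inertia(1.0, 1.0)  # strongly improving particle
m_settled = adaptive_inertia(0.0, 0.0)    # stagnating particle
```

The sketch only captures the qualitative behavior described above: each particle gets its own inertia factor, and that factor depends on both its own progress and the progress of the population.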
After recombination, the particles enter the iterative process together with the particles with strong evolutionary ability. The recombination strategy not only widens the search range of particles effectively but also ensures the speed and search accuracy of the algorithm.
Here, λ is the normalized parameter. The parameter set γx = {s, p, q, ω} is used to describe the characteristics of the atom, and its parameter set constitutes the spectral characteristics of the signal to be sparsely decomposed. The continuous Gabor set makes its atom number far exceed that of the discrete set, which ensures the redundancy of the atom set and the matching program of the optimally matching atom to the original signal structure.
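As an illustration of an atom described by the parameter set γx = {s, p, q, ω}, the sketch below generates a discrete Gabor atom. The roles assigned to the parameters (s as scale, p as time shift, q as phase, ω as frequency) and the Gaussian-times-cosine form are assumptions based on the usual Gabor atom definition, since the original formula is not reproduced here; the normalization step plays the role of λ.

```python
import math

def gabor_atom(n, s, p, q, omega):
    """Discrete Gabor atom of length n: a Gaussian window of scale s centered
    at p, modulated by a cosine of frequency omega and phase q, then scaled
    to unit energy (the normalization factor corresponds to lambda)."""
    raw = [math.exp(-math.pi * ((t - p) / s) ** 2) * math.cos(omega * t + q)
           for t in range(n)]
    norm = math.sqrt(sum(x * x for x in raw))
    if norm == 0:
        raise ValueError("degenerate atom: all samples are zero")
    return [x / norm for x in raw]

# Hypothetical parameter values, chosen only for illustration.
atom = gabor_atom(n=64, s=8.0, p=32.5, q=0.0, omega=0.6)
```

Because s, p, q, and ω are real-valued here rather than drawn from a fixed grid, every point in the parameter space is a valid atom, which is what makes the set continuous.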
- (1) Initialize the relevant parameters of the improved particle algorithm. The boundary conditions are set as [imin, imax] and [qmin, qmax], the initial position and velocity of the particles are randomly generated, and the fitness value f[ix(z)] is calculated.
- (2) Update the velocity and position of the particles according to formula (9), and limit any boundary violation according to the boundary values uhx and ahx. Here, r(·) is a random function with a value in (0, 1], and m is the inertia factor of the adaptive value. If m is too large, the particle will overspeed and jump out of the iteration, while if m is too small, it is not conducive to algorithm convergence. Therefore, based on the adaptive value, the inertia factor is further adjusted according to formula (10).
- (3) Judge whether the particle velocity and position are out of bounds. If so, the boundary value is used instead. The population and individual optima are then updated. Let z = z + 1 and return to Step (2), iterating until z ≥ zmax; record abest and the corresponding γbest, and update the residual of the sparse decomposition according to formula (11).
- (4) Acoustic signals are reconstructed from formula (9) for subsequent detection and recognition.
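The steps above can be sketched as a small PSO search for the best-matching Gabor atom, followed by one residual update. This is a simplified illustration under stated assumptions: the Gabor form, the parameter bounds, the acceleration constants c1 and c2, the velocity limit, and the linearly decreasing inertia (a stand-in for the adaptive rule of the study) are all hypothetical.

```python
import math
import random

def gabor_atom(n, s, p, q, omega):
    """Unit-norm discrete Gabor atom (assumed roles: scale, shift, phase, frequency)."""
    raw = [math.exp(-math.pi * ((t - p) / s) ** 2) * math.cos(omega * t + q)
           for t in range(n)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm > 0 else raw

def fitness(signal, params):
    """Matching degree: absolute inner product of the signal and the atom."""
    atom = gabor_atom(len(signal), *params)
    return abs(sum(x * a for x, a in zip(signal, atom)))

def pso_best_atom(signal, bounds, n_particles=20, z_max=40, seed=0):
    rng = random.Random(seed)
    m_init, m_end, c1, c2 = 0.9, 0.4, 2.0, 2.0
    # Step (1): random positions and velocities inside the boundaries.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.1 * rng.uniform(lo - hi, hi - lo) for lo, hi in bounds]
           for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(signal, p) for p in pos]
    k_best = max(range(n_particles), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = pbest[k_best][:], pbest_fit[k_best]
    for z in range(z_max):
        # Stand-in schedule: inertia decreases linearly from m_init to m_end.
        m = m_init - (m_init - m_end) * z / z_max
        for k in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                # Step (2): velocity and position update.
                v = (m * vel[k][d]
                     + c1 * rng.random() * (pbest[k][d] - pos[k][d])
                     + c2 * rng.random() * (gbest[d] - pos[k][d]))
                # Step (3): replace out-of-bounds values by the boundary value.
                v_max = 0.5 * (hi - lo)
                vel[k][d] = max(-v_max, min(v_max, v))
                pos[k][d] = max(lo, min(hi, pos[k][d] + vel[k][d]))
            fit = fitness(signal, pos[k])
            if fit > pbest_fit[k]:
                pbest[k], pbest_fit[k] = pos[k][:], fit
            if fit > gbest_fit:
                gbest, gbest_fit = pos[k][:], fit
    return gbest, gbest_fit

def mp_step(signal, params):
    """Subtract the projection onto the selected atom to get the new residual."""
    atom = gabor_atom(len(signal), *params)
    coeff = sum(x * a for x, a in zip(signal, atom))
    return [x - coeff * a for x, a in zip(signal, atom)]

# Hypothetical bounds for (s, p, q, omega) and a synthetic test signal.
bounds = [(2.0, 32.0), (0.0, 63.0), (0.0, 2.0 * math.pi), (0.0, math.pi)]
target = gabor_atom(64, 8.0, 30.0, 0.0, 0.5)
best_params, best_fit = pso_best_atom(target, bounds)
residual = mp_step(target, best_params)
```

Repeating `pso_best_atom` and `mp_step` on successive residuals would give a full decomposition; the sum of the selected coefficient-weighted atoms is then the reconstructed signal.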
Here, P and P1 represent the total number of evaluation samples and the number of accurately classified samples, respectively.
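The formula that these two symbols belong to is not reproduced in this text. Assuming it is the usual classification accuracy, it would take the following form, where the name "Accuracy" is an assumption:

```latex
\[
  \text{Accuracy} = \frac{P_1}{P} \times 100\%
\]
```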
3.2. System Design
The architecture diagram of the system is shown in Figure 1. The voice recognition engine is invoked to provide English learning services for users through the Apache server. The database mainly includes user management database, basic words, and grammar, which were used to manage the system user information (basic information, learning, curriculum information, etc.), basic word information (spelling, polysemy, and other information), and grammatical information (common syntax information and correlation information). By running the speech recognition engine and the intelligent processing middleware on the server, the accuracy of the user’s English sentences can be judged according to the grammar rules.

Apache uses the URL of a request to locate the corresponding resource. According to the user request, the server runs the corresponding recognition algorithm of the program and returns the resources found to the client, which completes one request; it then waits for the next request.
The software is divided into a background (backend) part and a foreground (frontend) part, and the function modules of each are designed according to actual needs. The background module mainly handles user management, data management, and system operation and maintenance. The foreground module is mainly the user operation module, including user login, English listening and speaking, and other functions. The functional composition structure is shown in Figure 2.

3.2.1. Background Functions
The user management module mainly handles administrator operations, including the system administrator’s account, password, email, and other information, as well as adding and deleting administrators. A superadministrator logged in to the system background can perform all of these functions, while a common administrator can only manage some common basic data.
The data management module has two main functions: data recording and data download. The data recording function is used to enter the basic data required by the system, such as commonly used words and grammar rule information, and mainly includes textbook management, article management, and sentence management units. The data download function responds to the user’s URL request, allocates the resource download on the Apache server, and returns the requested information.
The system operation and maintenance module lets the administrator carry out performance optimization and other work, including the maintenance of the system foreground and background interfaces (see Table 1).
Parameter | Description |
---|---|
Id | Statement id |
English | Original English |
Chinese | Chinese translation |
Voice URL | Audio files |
Users can obtain the required data through the above interfaces.
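The four parameters in Table 1 can be pictured as a small record type. The sketch below is only illustrative: the field types, the JSON serialization, and the sample values (including the URL) are assumptions, since the text does not specify the wire format of the interface.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Statement:
    """One sentence record with the fields listed in Table 1."""
    id: int          # statement id
    english: str     # original English sentence
    chinese: str     # Chinese translation
    voice_url: str   # location of the audio file

def to_json(stmt):
    """Serialize a record; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(asdict(stmt), ensure_ascii=False)

# Hypothetical sample record.
sample = Statement(1, "Good morning.", "早上好。", "http://example.com/audio/1.mp3")
payload = to_json(sample)
```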
3.2.2. Foreground Functions
After the user enters the user name and password, the request message can be defined as <iq type="get" id="reg1"><query xmlns="jabber:iq:register"/></iq>. After parsing the message, the server returns to the client whether the login succeeded.
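A request of this shape can be built and checked with Python's standard XML library. This is a minimal sketch: the function name and the default stanza id are assumptions, and the code only constructs the message rather than sending it to a server.

```python
import xml.etree.ElementTree as ET

def build_register_query(stanza_id="reg1"):
    """Build an <iq type="get"> stanza carrying a jabber:iq:register query."""
    iq = ET.Element("iq", {"type": "get", "id": stanza_id})
    ET.SubElement(iq, "query", {"xmlns": "jabber:iq:register"})
    return ET.tostring(iq, encoding="unicode")

request_xml = build_register_query()
# Parsing the string back shows the namespace attached to the query element.
parsed = ET.fromstring(request_xml)
```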
The English listening and speaking module includes text statement selection, native statement playback, recording, voice playback, and other functions. Users select different function buttons based on their own needs, and the server responds to the user’s request in combination with the voice recognition engine. As the main function of the software, this module accounts for 80% of the system functional requirements.
3.2.3. User Usage Process
The user usage flow is shown in Figure 3.

4. Result Analysis and Discussion
4.1. English Pronunciation Error Detection
In the process of spectrogram extraction, FFT window size, frameshift, maximum frequency, and jump size were set to 20 ms, 10 ms, 16000 Hz, and 160.2 d, respectively. The dimensions of convolution kernel and pooling layer are 3 × 3 and 2 × 2. The dropout value during convolution is 0.5. Because the model contains CTC loss function, it is implemented by TensorFlow and Keras. Softmax of the input tag (annotated tag sequence), tag length, input length, and model output is passed to the CTC loss function to calculate the loss.
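To show how the window and shift settings translate into frames, the sketch below splits a signal into overlapping frames. It assumes a 16 kHz sampling rate, under which a 10 ms shift equals 160 samples; the sampling rate and the function name are assumptions, not values confirmed by the text.

```python
def frame_signal(samples, sample_rate=16000, win_ms=20, shift_ms=10):
    """Split a signal into overlapping frames of win_ms milliseconds,
    advancing by shift_ms milliseconds each time."""
    win = sample_rate * win_ms // 1000    # samples per window
    hop = sample_rate * shift_ms // 1000  # samples per shift
    frames = [samples[i:i + win]
              for i in range(0, len(samples) - win + 1, hop)]
    return frames, win, hop

# One second of silence as a stand-in signal.
frames, win, hop = frame_signal([0.0] * 16000)
```

Each frame would then be passed through an FFT to form one column of the spectrogram.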
Experimental results | Explanation |
---|---|
True acceptance (AT) | Number of correct pronunciations judged as correct |
True rejection (RT) | Number of incorrect pronunciations judged as incorrect |
False rejection (RF) | Number of correct pronunciations judged as incorrect |
False acceptance (AF) | Number of incorrect pronunciations judged as correct |
In the above three evaluation indexes, it is hoped to reduce the error rate of the other two types as much as possible while ensuring high diagnostic accuracy. The key is to avoid undermining learners’ learning confidence by judging their correct pronunciation as incorrect pronunciation, so the experiment aims at a high diagnosis rate and a low false rejection rate. Experimental results of different models for English pronunciation error detection are listed as follows (see Table 3 and Figure 4).
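The formulas for the three indexes are not reproduced in this text. The sketch below uses commonly used definitions as an assumption: false rejection rate over all correct pronunciations, false acceptance rate over all incorrect pronunciations, and overall detection accuracy as a stand-in for DA. The counts in the example are hypothetical.

```python
def detection_metrics(ta, tr, fr, fa):
    """Compute (FRR, FAR, accuracy) from the four outcome counts.
    ta: true acceptances, tr: true rejections,
    fr: false rejections, fa: false acceptances."""
    frr = fr / (ta + fr)                   # correct speech wrongly rejected
    far = fa / (fa + tr)                   # incorrect speech wrongly accepted
    da = (ta + tr) / (ta + tr + fr + fa)   # share of correct decisions
    return frr, far, da

# Hypothetical counts, not results from the study.
frr, far, da = detection_metrics(ta=90, tr=45, fr=10, fa=5)
```

Under these definitions, keeping FRR low corresponds directly to the goal stated above of not judging learners' correct pronunciation as incorrect.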

By comparing the proposed model with other 5 models, the results show that the proposed model achieves better results in false rejection rate (FRR) and diagnostic accuracy (DA). Compared with the model in the literature [16–20], the model in this study has achieved better results in false acceptance rate (FAR).
In English pronunciation errors, the 64 errors are divided into three types: pronunciation errors, intonation errors, and speed errors. Statistical results of these three types of errors are shown in Figure 5.

- (1) American English has a distinct “r” sound, while British English does not. For example, the word worker is pronounced as |’wə:rkə| in American English and |’wə:kə| in British English.
- (2) The sound pronounced |a:| in British English is pronounced |æ| in American English. For example, the word pass is pronounced |pa:s| in British English and |pæs| in American English; similar words include ask.
- (3) Where British English has the sound |ɔ|, American English has |a:|. For example, the word box is pronounced |bɔks| in British English and |ba:ks| in American English; watch is similar.
- (4) British English tends to reduce or drop syllables, while American English pronounces each syllable in full. For example, the word interesting is pronounced as |’intristiŋ| in British style and |’intəristiŋ| in American style.
- (5) Where British English pronounces the |i| sound, American English pronounces |ə|; e.g., the word system is pronounced as |’sistim| in British style and |’sistəm| in American style.
- (6) Some words are pronounced completely differently in British English and American English. For example, leisure is |’leʒə| in British English and |’li:ʒər| in American English.
Figure 6 shows the pronunciation bias of English students for words with 5 middle syllables, and the results show that they perceive the r syllable poorly.

4.2. Evaluation of English Pronunciation Quality
To test the practical validity of the English pronunciation evaluation system based on the model of this study, the following experiment was designed.
The experimental setup is as follows: a student majoring in business English at a foreign language school was selected as the experimental subject, and the system was programmed in MATLAB. Eight linear FM signals of the student’s spoken English pronunciation were collected. The time width and relative bandwidth of the collected speech samples were 1.5 s and 0.5 dB, respectively, and the collected frequencies of the spoken English pronunciation signals with different vocal cords and baseband signals were 1024 kHz and 3∼9 kHz, respectively.
The good classification performance of the support vector machine helps the system to evaluate the English pronunciation level accurately. Based on this, the system was used to evaluate the English pronunciation level of the collected 8 speech segments (see Table 4).
Voice sequence number | Tone/points | Speed/points | Intonation/points | Rhythm/points | Emotion/points | Final score/points |
---|---|---|---|---|---|---|
1 | 8.6 | 8.5 | 8.4 | 8.5 | 6.9 | 8.8 |
2 | 8.6 | 8.6 | 8.5 | 8.6 | 7.6 | 8.6 |
3 | 8.7 | 8.2 | 6.7 | 6.7 | 7.5 | 7.8 |
4 | 9.3 | 4.9 | 7.4 | 5.9 | 6.2 | 6.9 |
5 | 8.7 | 8 | 5.2 | 4.3 | 8 | 6.8 |
6 | 7.2 | 8.5 | 8.7 | 8 | 6.8 | 8 |
7 | 7.6 | 5.9 | 5 | 10.6 | 5.2 | 6.9 |
8 | 8.9 | 8.7 | 9.6 | 8.5 | 8.9 | 9.3 |
Table 4 shows that the system can effectively evaluate five indicators of English pronunciation, tone, speed, intonation, rhythm, and emotion, and use the evaluation results of each indicator to make the final evaluation of English pronunciation level. It shows that this system can evaluate learners’ English pronunciation level from different directions and has high evaluation validity.
Six systems were used to evaluate the pronunciation level of the student’s 8 segments of spoken English, and the comparison of the accurate agreement rate of the six systems is shown in Figure 7.

5. Conclusion
Multimedia network technology integrates text, pictures, animation, video, sound, and other media forms into one. It combines the visual and auditory senses, and its application to English audiovisual teaching can provide a broader platform and more choices for students’ English audiovisual learning. The American linguist Krashen holds that language learning is mainly accomplished through language input. The same is true for English audiovisual teaching, which requires students to receive phonetic input continuously. To address inaccurate pronunciation and the lack of pronunciation evaluation and error correction guidance in traditional English pronunciation learning, this study proposes an adaptive MP sparse decomposition algorithm for abnormal acoustic signal recognition. Firstly, the adaptive setting of PSO parameters was improved based on the evolution rate of particle and population, and a new objective function was constructed to realize the adaptive MP sparse decomposition. Then, the feature matching degree between the optimal atom and the acoustic signal was improved by the continuous super-complete set. Finally, SVM was used to realize the accurate recognition of English pronunciation. The results show that, compared with the existing algorithms, this algorithm has the best recognition rate of English pronunciation and has better recognition robustness for different pronunciation systems. In subsequent research, more complex characteristic parameters will be used to further improve the detection accuracy of English pronunciation recognition [21].
Conflicts of Interest
The author declares no conflicts of interest.
Acknowledgments
This work was supported by the Nanning College of Technology.
Open Research
Data Availability
The labeled dataset used to support the findings of this study is available from the author upon request.