Page 124 - Proceedings of the International Scientific Conference - Applications of New Technology in Green Buildings, the 8th
THE 8TH ATiGB INTERNATIONAL SCIENTIFIC CONFERENCE - The 8th ATiGB 2023 107
study are limited to Vietnamese and English, international universities require support for diverse languages reflecting their global student bodies. By training OCR models on datasets encompassing languages such as English and Vietnamese, this project aims to create an inclusive identification system accessible to all students. Both the accuracy of multi-lingual OCR and its optimized deployment on mobile devices will be evaluated. If successful, the system will significantly improve the convenience and accessibility of student services, registration, access control, and other functions. Our study provides a strong basis for the techniques required, including deep learning for text recognition, mobile model deployment, and user-friendly interfaces. By expanding these capabilities to new languages, this research can break down informational barriers and streamline administration for international students from all backgrounds. The development and evaluation of the multi-lingual OCR system will assess the feasibility of this approach and provide direction for further improvements.
II. LITERATURE REVIEW
Existing research has focused on OCR for languages like English and Vietnamese. One study by Chinh Ngo et al. introduces MTet, a large parallel corpus for English-Vietnamese translation, and releases the first pretrained model, EnViT5, for these languages. Their model outperforms previous state-of-the-art results in translation BLEU score. Another paper by H. V. T. Chi et al. proposes a method based on MT-DNN to detect similarities between English and Vietnamese sentences for paraphrase identification. They achieve improved accuracy and F1 scores by changing the shared layers of the original MT-DNN. Thi-Vinh Ngo et al. address the rare word issue in multilingual MT systems for French-Vietnamese and English-Vietnamese pairs. They propose strategies to learn word similarity and augment the translation ability of rare words, resulting in significant improvements in BLEU points. Duc Toan Truong et al. explore context-aware models for English-Vietnamese translation tasks, aiming to improve translation quality and human readability by considering contextual information from consecutive sentences [6]–[9].
Machine translation techniques to support additional language translation include Statistical Machine Translation (SMT), Rule-based Machine Translation (RBMT), Example-based Machine Translation (EBMT), and Neural Machine Translation (NMT) [10]. Multilingual NMT models leverage information from multiple languages to improve translation performance [11]. Data augmentation techniques can further enhance the performance of multilingual NMT models, achieving higher BLEU scores [12]. Additionally, techniques used in multilingual text translation, such as increasing the similarity of semantically similar sentences in different languages, can be applied to speech translation to improve few-shot speech translation using limited data [13]. These approaches aim to overcome the challenges of data scarcity and improve the efficiency and accuracy of machine translation for additional languages [14].
Multi-lingual OCR systems face several challenges. One of the main difficulties is the language barrier, which can lead to inconsistent and incomplete requirements in the elicitation process [15]. Another challenge is the growing diversity of internet users, with different languages and cultural preferences, which requires OCR systems to handle a wide range of languages [16]. Additionally, multi-lingual OCR systems need to consider competing objectives, such as recommendation quality at the individual and aggregate level, stakeholder objectives, and long-term vs. short-term objectives [17]. These competing goals make it necessary to develop multi-objective recommender systems that can optimize multiple objectives simultaneously. Overall, the challenges of multi-lingual OCR systems include language barriers, diversity of languages and cultural preferences, and the need to balance competing objectives.
III. METHODOLOGY
A. Technologies Used
• Firebase ML;
• TensorFlow Lite Model trained by BERT;
• React Native Framework;
• Android Virtual Device of Android Studio;
• Actual Android Device;
• Visual Studio Code.
B. Data Collection and Augmentation
To create the multi-lingual training dataset, student ID card images (Figure 1) will be collected for languages including Vietnamese and English. Samples will be obtained through coordination with international student groups and generated synthetically using graphical editing. Data augmentation techniques like rotation, resizing, and noise injection will expand the training data diversity.
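The three augmentation operations named for the ID card dataset (rotation, resizing, and noise injection) can be illustrated with a minimal, dependency-free Python sketch. It is not the pipeline used in this study: the grayscale list-of-rows image representation, the function names (`rotate`, `resize`, `add_noise`, `augment`), and the parameter ranges are all assumptions chosen for illustration, and a real pipeline would more likely rely on an image-processing library.

```python
import math
import random

# Assumption: a grayscale image is a list of rows of 0-255 integer pixels.
Image = list[list[int]]


def rotate(img: Image, degrees: float, fill: int = 255) -> Image:
    """Rotate about the image centre (nearest neighbour, same output size)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_a = math.cos(math.radians(degrees))
    sin_a = math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for output pixel (x, y).
            dx, dy = x - cx, y - cy
            sx = round(cx + cos_a * dx + sin_a * dy)
            sy = round(cy - sin_a * dx + cos_a * dy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def resize(img: Image, new_h: int, new_w: int) -> Image:
    """Resize with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise and clip the result to 0-255."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]


def augment(img: Image, rng: random.Random, max_angle: float = 5.0,
            scale_range: tuple[float, float] = (0.9, 1.1),
            sigma: float = 8.0) -> Image:
    """Apply one random rotation, rescale, and noise pass (illustrative ranges)."""
    h, w = len(img), len(img[0])
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)
    out = rotate(img, angle)
    out = resize(out, max(1, round(h * scale)), max(1, round(w * scale)))
    return add_noise(out, sigma, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Synthetic stand-in for a card image: white background, one dark text line.
    card = [[255] * 32 for _ in range(20)]
    for x in range(4, 28):
        card[10][x] = 0
    for sample in (augment(card, rng) for _ in range(4)):
        print(len(sample), "x", len(sample[0]))
```

Small rotation angles and scale factors close to 1 are used here on the assumption that ID cards are photographed roughly upright; wider ranges would need to be validated against real capture conditions.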
ISBN: 978-604-80-9122-4