Zaman Fadhil Jabr

| Date posted: 1404/6/29 |

Zaman Jabr, a PhD student supervised by Dr. Nasser Mozayani, will defend the doctoral thesis titled "Arabic Speech Recognition from Visual Cue Using Deep Learning" on 1404/06/30, from 14:00 to 17:00.

Presenter:

Zaman Fadhil Jabr

Supervisor:

Dr. Nasser Mozayani
Advisor: Dr. Etemadi

Examination committee:
Dr. Minaei
Dr. Mohammadi
Dr. Sameti
Dr. Zeinali

Date: 30 Shahrivar 1404

Time: 14:00 to 17:00

Location: Defense room, second floor

Abstract

Visual speech recognition (VSR), or lip-reading, plays an important role in human communication and speech understanding. It is a challenging task for which deep learning models are needed to achieve high accuracy. Researchers have introduced many Deep Neural Network (DNN) models that recognize letters, digits, words, and sentences in other languages, but few for Arabic. The main reason for the small number of Arabic lip-reading studies is the lack of a large-scale dataset suitable for training a DNN.
This thesis contributes to automatic Arabic lip-reading at the word and sentence levels using DNNs with visual cues only. We address the lack of a large-scale Arabic dataset for training a DNN model. To this end, we propose an end-to-end Arabic lip-reading model that can be trained on a limited dataset. It combines a Visual module, consisting of a multi-layer Convolutional Neural Network (CNN), with a Temporal module, comprising Gated Recurrent Unit (GRU) and softmax layers, and balances the number of model parameters against the size of the dataset. To train this model, we created a limited Arabic dataset comprising 20 words spoken by 40 native Arabic speakers. At the word level, the proposed method is evaluated on (1) our dataset, where it achieves an accuracy of 83.02%, and (2) the Dweik et al. dataset, where it improves on the result they reported by approximately 3%. In addition, we employed the Visual module for person identification from viseme images and obtained high performance.
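The balance between dataset size and parameter count mentioned above can be made concrete with a rough parameter budget. The sketch below is not the thesis's architecture: the layer sizes are hypothetical, the helper names are ours, and the GRU count assumes a single bias vector per gate (some libraries store two). Only the 20 output classes come from the abstract.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus one bias per output channel for a square 2D convolution."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def gru_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one GRU layer, assuming one bias vector per gate.

    A GRU has three gates (update, reset, candidate); each has an input
    matrix, a recurrent matrix, and a bias.
    """
    per_gate = input_size * hidden_size + hidden_size * hidden_size + hidden_size
    return 3 * per_gate


def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return out_features * (in_features + 1)


# Hypothetical layer sizes, chosen only for illustration.
visual = conv2d_params(1, 16, 3) + conv2d_params(16, 32, 3)
temporal = gru_params(128, 64)      # 128 = assumed per-frame feature size
classifier = dense_params(64, 20)   # 20 word classes, as in the word-level dataset
total = visual + temporal + classifier
print(f"visual={visual} temporal={temporal} classifier={classifier} total={total}")
```

Comparing `total` with the number of training clips gives a quick sanity check on whether a candidate configuration is plausibly trainable on a small dataset.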
At the sentence level, we modified the same end-to-end model to address the problem from two perspectives: first as a classification problem, and second as a sequence prediction problem. The modification applies only to the Temporal module; the Visual module remains unchanged. For classification, the Temporal module consists of a stack of GRUs and a fully connected layer. For sequence prediction, the Temporal module is an encoder-decoder network: the encoder consists of three GRU layers, and the decoder consists of two GRU layers with an attention mechanism. To train the end-to-end model, we collected a sentence-level Arabic dataset comprising 55 sentences with 139 unique words, uttered by 40 individuals: 28 declarative sentences, 20 interrogative sentences, and 7 request sentences. This dataset is the largest sentence-level Arabic dataset for lip-reading. We designed it to cover all 28 Arabic phonemes, a property missing from all previous Arabic lip-reading datasets.
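The abstract does not say which attention variant the decoder uses. As a minimal sketch, assuming plain dot-product attention, one decoder step could weight the encoder outputs as follows; the function names and toy vectors are illustrative only.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: list[float], encoder_states: list[list[float]]):
    """Dot-product attention for a single decoder step.

    Returns the attention weights (one per encoder time step) and the
    context vector, i.e. the weighted sum of the encoder states.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)
    ]
    return weights, context


# Toy example: two encoder time steps with 2-dimensional states.
weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

In a full model the context vector would be combined with the decoder GRU state before predicting the next word; that step is omitted here.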
For sentence classification, the end-to-end model was first applied to our dataset, yielding recognition accuracies of 90.45% in person-dependent and 71.53% in person-independent experiments. It was then applied to the BlidAVS10 dataset, where it achieved an accuracy of 83.09% in the person-independent experiment. For sequence prediction, the end-to-end model was applied to our dataset, yielding a Word Error Rate (WER) of 80.51%.
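For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference sentence and the predicted sentence, divided by the number of reference words. The sketch below follows that standard definition; it is not taken from the thesis, and the example sentences are placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.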
