Atefeh Pakzad

Ms. Atefeh Pakzad, a doctoral student of Dr. Morteza Analoui, will defend her doctoral dissertation, titled "Representation of Sentences in Semantic Space Using Parameter Estimation Methods", on Sunday, 1400/09/21, at 10:30.

Presenter:
Atefeh Pakzad
Supervisor:
Dr. Morteza Analoui
Examining committee:

Dr. Kambiz Badie; Dr. Mehrnoush Shamsfard;
Dr. Mohammad Reza Jahed Motlagh; Dr. Behrouz Minaei-Bidgoli

Date: 21 Azar 1400

Time: 10:30

Venue: http://meeting.iust.ac.ir/

Abstract:
Distributional semantic models represent the meaning of words as vectors. There are two families of models for obtaining semantic word vectors: count-based and prediction-based. Word vectors derived from count-based models have many dimensions, so dimension reduction methods are usually applied to them. Prediction-based models use deep learning methods to produce compact, low-dimensional word embeddings, which perform well in NLP applications. The components of a word embedding are real numbers, and its basis vectors have no conceptual equivalent. In word vectors obtained by count-based models, each dimension has a lexical equivalent, but dimension reduction methods transform these vectors into implicit vectors.
In this study, motivated by the field of explainable artificial intelligence, we propose a hybrid approach for building low-dimensional explicit word vectors in which each basis vector of the semantic space corresponds to one basis word. The hybrid approach reduces the dimensions of the word vectors in such a way that each dimension has a conceptual equivalent and the performance of the word vectors on the word similarity task does not degrade. To count the co-occurrences of target words and context words, we propose a localization method that, instead of a fixed-length window, uses an exponential function of the distance between the target word and the context word. In addition to word frequency, we introduce two criteria, word similarity and the number of zero components, as features for the target words. We then extract rules for obtaining the initial basis words from a decision tree built on these three features. The hybrid approach uses a word selection method to learn a vector space in which each dimension is a natural word. The word selection method starts from the most frequent words and selects the subset with the best performance; we use it to obtain 1,000 basis words. We also select golden words of the corpus using a binary weighting method based on the binary particle swarm optimization algorithm, and add them as golden context words to the 1,000 basis words. We use the ukWaC corpus to construct the word vectors. We evaluate the low-dimensional explicit word vectors on the word similarity task, and we evaluate their interpretability both qualitatively and quantitatively.
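
The localization idea described above can be illustrated with a minimal sketch. The abstract only states that an exponential function of the distance replaces the fixed-length window, so the exact form `exp(-decay * distance)`, the `decay` value, the function name, and the choice to process one sentence at a time are all assumptions made for illustration, not the thesis's actual formulation.

```python
import math
from collections import defaultdict

def weighted_cooccurrence(tokens, basis_words, decay=0.5):
    """Distance-weighted co-occurrence counts for one tokenized sentence.

    Every (target, context) pair contributes exp(-decay * distance) instead
    of a hard 0/1 count inside a fixed window. The weighting form and the
    decay value are assumptions; the abstract does not specify them.
    """
    basis = set(basis_words)
    counts = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        for j, context in enumerate(tokens):
            # Skip the target position itself and non-basis context words.
            if i == j or context not in basis:
                continue
            counts[target][context] += math.exp(-decay * abs(i - j))
    return counts

# Toy sentence: each dimension of a word's vector is a basis word.
tokens = "the cat sat on the mat".split()
counts = weighted_cooccurrence(tokens, basis_words=["cat", "mat"])
print(dict(counts["sat"]))
```

In this toy sentence, "cat" is one position away from "sat" and "mat" is three positions away, so "cat" receives the larger weight. Every dimension of the resulting vector stays tied to a readable basis word.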
In the experiments, the word similarity results of the word vectors are compared with those of a count-based baseline model that uses the 5,000 most frequent context words and a fixed window instead of the localization method. Compared with the baseline model, the resulting low-dimensional explicit word vectors increase the Spearman correlation coefficient on the MEN, RG-65, and SimLex-999 test sets by 4.66%, 14.73%, and 1.08%, respectively. The interpretability of the resulting word vectors also increases, both qualitatively and quantitatively, compared with prediction-based models.
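
As a rough illustration of how a word similarity evaluation of this kind is commonly scored, the sketch below computes the cosine similarity of word pairs and then the Spearman correlation between the model's scores and human ratings. The vectors, basis words, and ratings are made-up toy values, and the helper names are hypothetical; this is not the thesis's evaluation code or data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy explicit vectors (each key is a basis word) and invented human ratings.
vectors = {
    "cat": {"animal": 3.0, "pet": 2.0},
    "dog": {"animal": 3.0, "pet": 2.5},
    "car": {"vehicle": 4.0, "road": 1.0},
    "bus": {"vehicle": 3.0, "road": 2.0, "animal": 0.1},
}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0)]
model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
rho = spearman(model_scores, human_scores)
print(rho)
```

Because Spearman correlation compares rankings only, the model is rewarded for ordering the pairs the way human raters do, regardless of the absolute similarity values.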

Venue: held virtually
School of Computer Engineering, Graduate Studies Office
