DSpace Repository

A Topic Extraction Approach for Clustering Documents

dc.contributor.author Muhammad Irfan, 01-243171-022
dc.date.accessioned 2022-01-17T08:00:02Z
dc.date.available 2022-01-17T08:00:02Z
dc.date.issued 2019
dc.identifier.uri http://hdl.handle.net/123456789/11621
dc.description Supervised by Dr. Arif Ur Rahman en_US
dc.description.abstract Topic extraction is an active research area in natural language processing and information retrieval. A topic model is often trained on raw textual data for a specific goal such as summarization, topic extraction, or translation, and the trained model is then used in real scenarios. It is an application of data mining in which each document is represented as a data point. Latent Dirichlet Allocation (LDA) is often used by researchers to extract hidden topics from unstructured data, which are then used for analysis and indexing. However, selecting the best model is subjective and difficult because ground truth is typically not available in advance, so analytical skill is required to determine the best number of topics for a given corpus. At present, researchers have not produced an easy approach for discovering an appropriate number of topics. The proposed model takes the state of the art a step further by adding an off-topic document detection step to the set of documents that are initially considered relevant to a topic. Off-topic documents are filtered on the basis of a similarity score, i.e., the cosine similarity between topics and documents computed using a Word2Vec model. On the basis of the off-topic documents, we calculate the F-score for different numbers of topics. The F-score is then used to find the appropriate number of topics. The model was implemented in Python and evaluated on the Blog Authorship Corpus. en_US
dc.language.iso en en_US
dc.publisher Computer Sciences BUIC en_US
dc.relation.ispartofseries MS (CS);T-9655
dc.subject Topic Extraction Approach en_US
dc.subject Clustering Documents en_US
dc.title A Topic Extraction Approach for Clustering Documents en_US
dc.type MS Thesis en_US
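
The off-topic filtering step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy two-dimensional vectors stand in for Word2Vec embeddings, the 0.5 threshold is arbitrary, and the names mean_vector, off_topic_docs, f_score, EMBEDDINGS, and DOCS are all hypothetical. The abstract does not say how precision and recall are obtained, so the sketch assumes a reference set of off-topic documents is available for scoring.

```python
import math


def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def off_topic_docs(topic_words, docs, embeddings, threshold=0.5):
    """Flag documents whose similarity to the topic falls below the threshold."""
    topic_vec = mean_vector(topic_words, embeddings)
    flagged = set()
    for doc_id, words in docs.items():
        doc_vec = mean_vector(words, embeddings)
        if topic_vec is None or doc_vec is None:
            flagged.add(doc_id)  # nothing to compare, so treat as off-topic
        elif cosine(topic_vec, doc_vec) < threshold:
            flagged.add(doc_id)
    return flagged


def f_score(predicted, actual):
    """F1 score of the flagged set against an assumed reference set."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for Word2Vec vectors (assumption, not real embeddings).
EMBEDDINGS = {
    "travel": [1.0, 0.0], "flight": [0.9, 0.1], "hotel": [0.8, 0.2],
    "recipe": [0.0, 1.0], "bake": [0.1, 0.9],
}
# Documents initially assigned to a hypothetical "travel" topic.
DOCS = {
    "d1": ["flight", "hotel"],
    "d2": ["recipe", "bake"],
    "d3": ["travel", "hotel"],
}

flagged = off_topic_docs(["travel", "flight"], DOCS, EMBEDDINGS, threshold=0.5)
score = f_score(flagged, {"d2"})  # {"d2"} is the assumed reference set
```

Under these assumptions, the same scoring would be repeated for each candidate number of topics, and the candidate with the highest F-score would be kept.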

