| dc.contributor.author | Muhammad Irfan, 01-243171-022 | |
| dc.date.accessioned | 2022-01-17T08:00:02Z | |
| dc.date.available | 2022-01-17T08:00:02Z | |
| dc.date.issued | 2019 | |
| dc.identifier.uri | http://hdl.handle.net/123456789/11621 | |
| dc.description | Supervised by Dr. Arif Ur Rahman | en_US |
| dc.description.abstract | Topic extraction is an active research area in natural language processing and information retrieval. A topic model is often trained on raw textual data for a specific goal such as summarization, topic extraction, or translation, and the trained model is then used in real scenarios. It is an application of data mining in which each document is represented as a data point. Latent Dirichlet Allocation (LDA) is often used by researchers to extract hidden topics from unstructured data, which are then used for analysis and indexing purposes. However, selecting the best model is subjective and difficult because ground truth is typically not available in advance. Analytical skill is therefore required to determine the best number of topics for a given corpus, and researchers have not yet produced an easy approach for discovering an appropriate number of topics. The proposed model takes the state of the art a step further by adding a step that detects off-topic documents in the set of documents initially considered relevant to a topic. Off-topic documents are filtered on the basis of a similarity score, i.e., the cosine similarity between topics and documents computed using a Word2Vec model. On the basis of the off-topic documents, we calculate the F-score for different numbers of topics. The F-score is then used to find the appropriate number of topics. The model was implemented in Python, and the Blog Authorship Corpus was used as the dataset. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | Computer Sciences BUIC | en_US |
| dc.relation.ispartofseries | MS (CS);T-9655 | |
| dc.subject | Topic Extraction Approach | en_US |
| dc.subject | Clustering Documents | en_US |
| dc.title | A Topic Extraction Approach for Clustering Documents | en_US |
| dc.type | MS Thesis | en_US |
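
The off-topic filtering step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the toy word vectors stand in for a trained Word2Vec model, and the names `TOY_VECTORS`, `mean_vector`, `split_off_topic`, the averaging of word vectors, and the 0.5 threshold are all assumptions made for illustration. The `f_score` helper is the standard harmonic mean of precision and recall; how the thesis derives precision and recall from the off-topic documents is not stated in the abstract.

```python
import math

# Toy 3-dimensional word vectors standing in for a trained Word2Vec model
# (hypothetical values, for illustration only).
TOY_VECTORS = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "team": [0.7, 0.3, 0.0],
    "recipe": [0.0, 0.1, 0.9],
    "oven": [0.1, 0.0, 0.8],
}


def mean_vector(words):
    """Average the vectors of the known words; return None if none are known."""
    vecs = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_off_topic(topic_words, docs, threshold=0.5):
    """Split documents into on-topic and off-topic by cosine similarity.

    The threshold is an assumed value, not one taken from the thesis.
    """
    topic_vec = mean_vector(topic_words)
    on_topic, off_topic = [], []
    for doc_id, words in docs.items():
        doc_vec = mean_vector(words)
        if topic_vec is None or doc_vec is None:
            score = 0.0
        else:
            score = cosine(topic_vec, doc_vec)
        if score >= threshold:
            on_topic.append(doc_id)
        else:
            off_topic.append(doc_id)
    return on_topic, off_topic


def f_score(precision, recall):
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


docs = {"d1": ["goal", "match"], "d2": ["recipe", "oven"]}
on_topic, off_topic = split_off_topic(["team", "goal"], docs)
print(on_topic, off_topic)
```

In a setting like the one the abstract describes, such a filter would be run for each candidate number of topics and the resulting F-scores compared to pick the number of topics.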