Abstract:
Orion Intelligence needs advanced capabilities to automate the classification of hidden-network sites and the extraction of actionable intelligence from them, since leak sites contain enormous volumes of unstructured and inconsistent HTML. Existing web scraping, web classification, and NLP tools developed for the clear web are unsuitable because hidden-network sites have a non-standardized, fragmented structure and actively mask their content. To overcome this, we built a specialized pipeline for classifying onion leak sites and extracting structured data from them. The pipeline comprises data gathering, cleaning, and normalization, followed by fine-tuning of a model specific to dark web leak sites.
This project describes the use of machine-learning-based onion site classification and transformer-based NLP models for HTML-to-structured-information extraction, chosen to balance input capacity, model size, and extraction accuracy. Sites classified as leak sites were cleaned, the cleaned HTML was converted into a specialized JSONL format, and the data were normalized to make them compatible with NLP model training. The model was fine-tuned to detect and extract key entities such as titles, dates, web links, records, and other relevant metadata from various dark web resources.
The system converts unstructured HTML from hidden networks into structured, analyzable formats with high accuracy. This module improves OSINT capabilities, reduces manual work, and assists downstream processes such as threat prediction and analysis. The framework is modular and scalable: it can be extended as Orion Intelligence's requirements evolve and adapted to emerging challenges in dark web monitoring.
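
To illustrate the cleaning and JSONL conversion step summarized above, the following is a minimal sketch using only the Python standard library. It is not the project's actual implementation: the class `LeakPageCleaner`, the function `html_to_jsonl_record`, and the record fields (`source`, `title`, `links`, `text`) are hypothetical names chosen for illustration, and the real JSONL schema may differ.

```python
import json
import re
from html.parser import HTMLParser


class LeakPageCleaner(HTMLParser):
    """Collects visible text, the <title>, and href links; skips script/style.

    Hypothetical helper for illustration only.
    """

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self.links = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script/style content
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def html_to_jsonl_record(html, source_url):
    """Return one JSONL line (a JSON object without newlines) for a page."""
    parser = LeakPageCleaner()
    parser.feed(html)
    parser.close()
    record = {
        "source": source_url,  # assumed field name
        "title": normalize(" ".join(parser.title_parts)),
        "links": parser.links,
        "text": normalize(" ".join(parser.text_parts)),
    }
    return json.dumps(record, ensure_ascii=False)
```

Each call returns one line that can be appended to a `.jsonl` file. In a pipeline like the one described, such records would presumably be paired with labeled entities before model fine-tuning, but that step is outside this sketch.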