Blacklist filtering for security research: bridging the gap between domain blacklists and malicious web content
Wang, Kaishen
Permalink
https://hdl.handle.net/2142/105269
Description
- Title
- Blacklist filtering for security research: bridging the gap between domain blacklists and malicious web content
- Author(s)
- Wang, Kaishen
- Issue Date
- 2019-04-26
- Director of Research (if dissertation) or Advisor (if thesis)
- Bailey, Michael Donald
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Blacklist, Filter
- Abstract
- Blacklists are collections of origins (URLs, domains, or IP addresses) associated with malicious activities such as the dissemination of malware, facilitation of command-and-control (C&C) communications, and delivery of spam messages. Blacklists are a simple and convenient way to protect users from these known malicious websites. Although blacklists are not designed for security research, they are commonly used in security research projects as scan target lists for retrieving malicious web content. However, domain blacklist scans are noisy, as a large portion of the websites scanned do not perform malicious activities at the time they are visited. Many blacklisted websites are offline or parked, even though some of them may have hosted malicious content before. Consequently, blacklists cannot be used out of the box for security research. Kuhrer et al. [1] evaluated the effectiveness of major blacklists in 2014 and proposed a heuristic mechanism to collect training data for a machine learning classifier that identifies parked domains in blacklists; they found that up to 10.9% of entries are parked. In this work, we reproduced their approach and found that most of the heuristics and features they used have become stale after five years. We modernized the prior approach to identify offline domains and parked domains in blacklists. First, we built and open-sourced an efficient blacklist filter, MGRAB, that filters domains at several layers: domains with no DNS resolution, closed TCP ports, or HTTP error response codes (a minimal sketch of this layered filtering appears below). Using MGRAB, we found that only 43% of all domains in our blacklist (aggregated from 27 domain blacklists) have a valid IPv4 address and only 40% of the total domains can be successfully grabbed. Second, we implemented an updated mechanism to detect parked domains using new heuristic strings and trained a new random forest classifier. Using the updated mechanism, we found that around 4% of the successfully grabbed domains are parked. Overall, only 33% of the total domains contain meaningful content. Researchers can use MGRAB and the parked-domain detection methodology to filter blacklisted website scans in future security research.
- Graduation Semester
- 2019-05
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/105269
- Copyright and License Information
- Copyright 2019 Kaishen Wang
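The record itself contains no code, but the abstract describes a three-layer filter: drop domains with no DNS resolution, then domains whose TCP port is closed, then domains that return an HTTP error response. The following Python sketch illustrates that decision logic for a single domain. The function names (`resolve_ipv4`, `passes_filter`), the port choice, and the timeouts are illustrative assumptions, not MGRAB's actual interface; the real tool is a high-throughput scanner, while this sketch only shows the per-domain checks.

```python
# Illustrative sketch (not MGRAB itself): apply the three filtering layers
# described in the abstract to a single domain.
import socket
from http.client import HTTPConnection
from typing import Optional


def resolve_ipv4(domain: str) -> Optional[str]:
    """Layer 1: return the domain's IPv4 address, or None if it does not resolve."""
    try:
        return socket.gethostbyname(domain)
    except socket.gaierror:
        return None


def tcp_port_open(ip: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Layer 2: check whether a TCP connection to the given port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(domain: str, timeout: float = 5.0) -> Optional[int]:
    """Layer 3: request '/' and return the HTTP status code, or None on failure."""
    try:
        conn = HTTPConnection(domain, 80, timeout=timeout)
        conn.request("GET", "/", headers={"Host": domain})
        status = conn.getresponse().status
        conn.close()
        return status
    except OSError:
        return None


def passes_filter(domain: str) -> bool:
    """Keep a domain only if it resolves, accepts TCP, and returns a non-error status."""
    ip = resolve_ipv4(domain)
    if ip is None:
        return False
    if not tcp_port_open(ip):
        return False
    status = http_status(domain)
    return status is not None and status < 400


if __name__ == "__main__":
    for d in ["example.com", "nonexistent-domain-abc123.invalid"]:
        print(d, "kept" if passes_filter(d) else "filtered out")
```

In the workflow the abstract outlines, domains that survive these layers would then be passed to the parked-domain detector (heuristic strings plus a random forest classifier) before being counted as meaningful content.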
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science