Blacklist filtering for security research: bridging the gap between domain blacklists and malicious web content
Wang, Kaishen
Permalink
https://hdl.handle.net/2142/105269
Description
- Title
- Blacklist filtering for security research: bridging the gap between domain blacklists and malicious web content
- Author(s)
- Wang, Kaishen
- Issue Date
- 2019-04-26
- Director of Research (if dissertation) or Advisor (if thesis)
- Bailey, Michael Donald
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Blacklist, Filter
- Abstract
- Blacklists are collections of origins (URLs, domains, or IP addresses) associated with malicious activities such as the dissemination of malware, facilitation of command-and-control (C&C) communications, and delivery of spam messages. Blacklists are a simple and convenient way to protect users from these known malicious websites. Although blacklists are not designed for security research, they are commonly used in security research projects as scan target lists for retrieving malicious web content. However, domain blacklist scans are noisy, as a large portion of the websites scanned do not perform malicious activities at the time they are visited. Many blacklisted websites are offline or parked, even though some of them may have hosted malicious content before. Consequently, blacklists cannot be used out of the box for security research. Kuhrer et al. [1] evaluated the effectiveness of major blacklists in 2014 and proposed a heuristic mechanism to collect training data for a machine learning classifier that identifies parked domains in blacklists; they found that up to 10.9% of entries are parked. In this work, we reproduced their approach and found that most of the heuristics and features they used have become stale after five years. We modernized the prior approach to identify offline domains and parked domains in blacklists. First, we built and open-sourced an efficient blacklist filter, MGRAB, that filters domains at several layers: domains with no DNS resolution, closed TCP ports, or HTTP error response codes (a minimal sketch of this layered filtering appears below). Using MGRAB, we found that only 43% of all domains in our blacklist (aggregated from 27 domain blacklists) have a valid IPv4 address and only 40% of the total domains can be successfully grabbed. Second, we implemented an updated mechanism to detect parked domains using new heuristic strings and trained a new random forest classifier. Using the updated mechanism, we found that around 4% of the successfully grabbed domains are parked. Overall, only 33% of the total domains contain meaningful content. Researchers can use MGRAB and the parked-domain detection methodology to filter blacklisted website scans in future security research.
- Graduation Semester
- 2019-05
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/105269
- Copyright and License Information
- Copyright 2019 Kaishen Wang
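The record itself contains no code, but the abstract describes a three-layer filter: drop domains with no DNS resolution, then domains whose TCP port is closed, then domains that return an HTTP error response. The following Python sketch illustrates that decision logic for a single domain. The function names (`resolve_ipv4`, `passes_filter`), the port choice, and the timeouts are illustrative assumptions, not MGRAB's actual interface; the real tool is a high-throughput scanner, while this sketch only shows the per-domain checks.

```python
# Illustrative sketch (not MGRAB itself): apply the three filtering layers
# described in the abstract to a single domain.
import socket
from http.client import HTTPConnection
from typing import Optional


def resolve_ipv4(domain: str) -> Optional[str]:
    """Layer 1: return the domain's IPv4 address, or None if it does not resolve."""
    try:
        return socket.gethostbyname(domain)
    except socket.gaierror:
        return None


def tcp_port_open(ip: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Layer 2: check whether a TCP connection to the given port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(domain: str, timeout: float = 5.0) -> Optional[int]:
    """Layer 3: request '/' and return the HTTP status code, or None on failure."""
    try:
        conn = HTTPConnection(domain, 80, timeout=timeout)
        conn.request("GET", "/", headers={"Host": domain})
        status = conn.getresponse().status
        conn.close()
        return status
    except OSError:
        return None


def passes_filter(domain: str) -> bool:
    """Keep a domain only if it resolves, accepts TCP, and returns a non-error status."""
    ip = resolve_ipv4(domain)
    if ip is None:
        return False
    if not tcp_port_open(ip):
        return False
    status = http_status(domain)
    return status is not None and status < 400


if __name__ == "__main__":
    for d in ["example.com", "nonexistent-domain-abc123.invalid"]:
        print(d, "kept" if passes_filter(d) else "filtered out")
```

In the workflow the abstract outlines, domains that survive these layers would then be passed to the parked-domain detector (heuristic strings plus a random forest classifier) before being counted as meaningful content.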
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science