Multi-Campus Cyber-Security Data Curation for Research and Education

Funding Agency: National Science Foundation
Award: Planning Grant
Dates: 2020-2021
PI: Jack W. Davidson
Co-PIs: H. Howie Huang (George Washington University), Von S. Welch (Indiana University)

The overall objective of this project is to create a multi-campus data collection and sharing infrastructure for use by researchers who apply machine learning to cybersecurity and privacy. Specifically, such a federated infrastructure will be invaluable for detecting zero-day (new, previously unseen) attacks and large-scale attacks with complex kill chains, e.g., the WannaCry ransomware attack, Mirai Distributed Denial of Service (DDoS) attacks, and Advanced Persistent Threat (APT) attacks. At the University of Virginia (UVA), as part of a DARPA-funded project in the Cyber Hunting at Scale (CHASE) program, the PI has led the development and implementation of such an infrastructure, but for a single campus. Critical components of the infrastructure include (i) data collection, (ii) data anonymization for privacy preservation without loss of fidelity, and (iii) engagement with UVA IT divisions, especially the Security Operations Center (SOC). The anonymized datasets have been made available to both UVA and non-UVA CHASE cybersecurity researchers for testing novel machine learning algorithms. These data users include the George Washington University (GWU) co-PI. The Indiana University (IU) co-PI leads the ResearchSOC and works closely with OmniSOC, and brings both experiences to this planning grant.

Our motivation for this project is to extend the UVA model to a larger-scale academic consortium, which would include IU as a data provider and would support the UVA and GWU efforts past 2022, when the CHASE program ends. Under this planning grant, we will:

  1. organize a workshop to engage the community (starting with the ten organizations that have provided us with letters of collaboration) to formulate a vision and roadmap for this infrastructure and to discuss legal, ethical, privacy, organizational, and sustainability considerations; and
  2. create a team of multiple data providers and data users and submit a Grand Ensemble CCRI proposal to build and use the data infrastructure.