Lifecycle Inference on Unreliable Event Data

Spark Open Source Community / Published 5 years ago / 8134 views
A common motif in data science tasks is the inference of a latent, time-variant property from observations of singular timestamped events. This task is particularly prevalent in the cybersecurity domain, where a substantial portion of analysis effort is dedicated to system and network event logs. While such an inference can be straightforward when data provenance is known and controlled, consider the situation where erroneous events exist or the events represent an unknown fraction of a total population. Such confounding factors complicate inference on third-party datasets. Determining the lifetime over which an organization operates a digital asset (e.g., an IP address) based on observations of the asset is one manifestation of this inference task. It is critical to security ratings services, which require an asset inventory of rated organizations. Leveraging third-party data sources can improve the coverage of an asset inventory; however, false positives and unknown sampling rates of observation events in external, unmanaged data sources can degrade the veracity of inferred asset attributions. This talk will describe Spark DataFrames tradecraft for addressing these challenges by generating attribution lifetime windows from asset observations that are reinforced and extended by future observations. While the concept of event “refresh” is not traditionally considered within the MapReduce paradigm, it is trivial to implement within Spark through creative application of multiple Window functions. The described approach provides multiple desirable functionalities, including a parameter-tunable observation reinforcement threshold to exclude false positives or outlying observations, event deduplication through time box partitioning, and a natural mechanism to age off asset assignments that are no longer valid.
This technique, entirely encapsulated in Spark, enables BitSight to run digital asset attribution simulations across billions of records and months of collection to evaluate, improve, and
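The three mechanisms named in the abstract (time box deduplication, a reinforcement threshold, and age-off) can be illustrated with a small sketch. This is plain Python, not Spark code and not BitSight's implementation; the function name `infer_lifetimes`, the `Window` tuple, the parameters `box_days`, `ttl_days`, and `min_obs`, and the use of integer day numbers as timestamps are all assumptions made for illustration. In Spark, the same steps would presumably map onto Window functions partitioned by asset and ordered by time.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Window(NamedTuple):
    asset: str
    org: str
    start_day: int  # first observation in the window
    end_day: int    # last observation + ttl_days, i.e. when attribution ages off
    n_obs: int      # deduplicated observations supporting the window


def infer_lifetimes(
    events: Iterable[Tuple[str, str, int]],  # (asset, org, day number); hypothetical schema
    box_days: int = 1,    # assumed size of the deduplication time box
    ttl_days: int = 30,   # assumed time an observation keeps an attribution alive
    min_obs: int = 2,     # assumed reinforcement threshold
) -> List[Window]:
    # Step 1: deduplicate by time box, keeping one observation per (asset, org, box).
    boxes = defaultdict(set)
    for asset, org, day in events:
        boxes[(asset, org)].add(day // box_days)

    windows: List[Window] = []
    for (asset, org), box_set in sorted(boxes.items()):
        # Each surviving observation is represented by the first day of its box.
        days = sorted(b * box_days for b in box_set)
        start = last = days[0]
        count = 1
        for day in days[1:]:
            if day - last <= ttl_days:
                # Step 2: a new observation inside the TTL refreshes and extends the window.
                last = day
                count += 1
            else:
                # Step 3: the gap exceeded the TTL, so the old window has aged off.
                if count >= min_obs:
                    windows.append(Window(asset, org, start, last + ttl_days, count))
                start = last = day
                count = 1
        # Emit the final window only if it was reinforced often enough.
        if count >= min_obs:
            windows.append(Window(asset, org, start, last + ttl_days, count))
    return windows
```

With `ttl_days=30` and `min_obs=2`, observations of one asset on days 0, 0, 10, and 35 collapse into a single window covering days 0 to 65 backed by three deduplicated observations, while a lone observation on day 200 is dropped as unreinforced.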