The ability to scale, manage and tune your detections.
First and foremost, detection engineering is an operational process that should be defined and managed by your security operations team. As explained in Wikipedia, "Operationalization thus defines a fuzzy concept so as to make it clearly distinguishable, measurable, and understandable by empirical observation".
For this purpose we can define the following three drivers:
- Detection Management:
  - Define a detection strategy.
  - Assess your detection capabilities, coverage and maturity.
- Detection Scalability:
- Detection Tuning:
  - Precision & Recall
  - Detection Impact
Within the scope of your organization's governance, detection engineering starts by defining a strategy to focus the efforts of your security operations, because we simply cannot detect everything. We can choose one of these approaches to drive our strategy:
- Threat Actor-Driven: focusing our detection efforts on the attacks, techniques and tools used by specific threat actors.
- Business-Driven: focusing on attacks that target a specific business sector, such as banking or energy.
- Compliance-Driven: focusing on covering regulatory requirements like PCI DSS or HIPAA.
- Behavior-Driven: focusing on specific tactics, techniques and procedures to detect suspicious behaviors, whether through machine learning or baselining.
The above are just examples; you can decide what your SOC needs to focus on the most and define your own strategy and goals.
By creating and maintaining a detection knowledge base, you can increase your SOC's efficacy with reliable insights and improvements driven by measuring detection coverage and security posture.
You can use Rob van Os's free tools: the SOC-CMM framework to assess and measure your SOC capabilities, and the MaGMa framework to document and measure your use case coverage. Palantir also provides a great framework called ADS (Alerting and Detection Strategies) which you can adopt to document and manage your detections.
SOC-CMM and MaGMa Tools
If you're interested in defining and measuring your SOC metrics, Expel's @jhencinski writes great blog articles about measuring your SOC's performance. You can also listen to his great Blueprint podcast interview below:
In this context, scalability is the ability of a detection to operate as the number of variables increases: variables such as the number of monitored assets, the different OS types and versions, and the different telemetry sources like endpoints, firewalls, cloud workloads, containers, etc.
Scalability is a big challenge in detection, and it is where the engineering aspect kicks in: you don't just need your detections to be accurate, resilient and low-noise for every event format you collect; you also need them to be all of those things across all of your assets, regardless of their functions and types.
A detection logic developed in a lab might be inefficient once deployed across your environment: the logic might collide with legitimate third-party software behaviors, for example, or your telemetry might simply not support it. Remember that you're using different solutions from different vendors that send you different events in different formats, and if you're an MSSP/MDR, multiply that by the number of distinct solutions you're managing, because you can't write every detection in AQL, Lucene, SPL, KQL, EQL... you get the picture.
Detection scalability is strongly related to your ability to execute, which we will discuss in the next dimension, "Execution". However, a newer concept known as Detection-as-Code can help address this challenge: a code-driven approach to developing, reviewing and implementing detections.
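A minimal Detection-as-Code sketch, assuming detections are stored as structured data in version control and validated by a CI job before deployment; the field names and severity values below are illustrative, not tied to any specific product:

```python
# Detection-as-Code sketch: rules live in version control as structured
# data, and a CI pipeline validates them before they reach production.
# Field names below are illustrative assumptions, not a real product schema.

REQUIRED_FIELDS = {"title", "logic", "severity", "mitre_technique"}

def validate_rule(rule: dict) -> list:
    """Return a list of problems; an empty list means the rule passes review."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - rule.keys()]
    if rule.get("severity") not in {"low", "medium", "high", "critical"}:
        problems.append("severity must be low/medium/high/critical")
    return problems

rule = {
    "title": "Incoming Remote Command Execution via WinRM",
    "logic": 'process.parent.name == "wsmprovhost.exe"',
    "severity": "high",
    "mitre_technique": "T1021.006",
}
```

The same pipeline that runs this validation can also run the converted queries against test data and deploy the rules, giving you peer review and rollback for free.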
Detection scalability and logic definition can make it challenging to measure the effectiveness of your detections and tune them to an acceptable level. Two of the best metrics we can rely on during the detection tuning process are precision and recall. I recommend watching the talk below by @Cyb3rWard0g and @_devonkerr_ to better understand the difference between precision and recall in detection and threat hunting.
Quantify Your Hunt Not Your Parents' Red Team Talk
However, the short story is that precision and recall are statistical measurements that we can use to quantify and evaluate our detections. From the Wikipedia figure below we can deduce that

Precision = TP / (TP + FP)

which tells us how much we tolerate false positives. But this is not good enough on its own, because it doesn't take into consideration the things we miss, the false negatives (FN). For that we can use

Recall = TP / (TP + FN)
Precision and recall
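Precision and recall can be computed directly from triage counts; a minimal Python sketch (the example counts are made up):

```python
# Precision and recall from alert-triage counts.
# TP = true positives, FP = false positives, FN = false negatives (missed attacks).

def precision(tp: int, fp: int) -> float:
    # Of everything we alerted on, how much was a real attack?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all real attacks that occurred, how many did we catch?
    return tp / (tp + fn)

# Hypothetical quarter: 40 true alerts, 10 false alerts, 10 missed attacks.
p = precision(40, 10)  # 0.8
r = recall(40, 10)     # 0.8
```

Note the asymmetry: you can read FP straight out of your ticketing system, but FN only becomes measurable when something (or someone) attacks you on purpose, which is why the methods below matter.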
You may wonder how we can determine false negatives. The talk goes through this question by suggesting methods such as:
- Adversary emulation tools.
- Red team engagements.
- Third-party offensive security engagements.
Another great talk, more focused on detection engineering, was given by @jaredcatkinson during SO-CON last year, where he went much deeper into these concepts.
Not all detections are created equal and their impact can be defined on many levels depending on your environment and team structure. Within the realm of a security operations center, we can find two main types of detection impacts:
- Impact on the analysts: a detection is just the start; it needs to be qualified, investigated and responded to. If your front line suffers from alert fatigue due to a lack of automation, a lack of enrichment or noisy rule logic, trust me, you will have a hard time both maintaining good service capabilities and convincing top management that you're doing a good job.
- Impact on the toolset: behind every detection there is a query, and whether you're paying your SIEM/EDR/XDR provider by EPS, by MPM, by data size or by agent, you know that queries are expensive. Why? Because your data quality matters: running a query every 5 minutes that looks for a string in raw event payloads across unindexed data is slow and resource-consuming, and this can be caused by:
- Sizing Issues: before you start a SOC or a monitoring project, detail your current needs in terms of the number of data sources, the raw, parsed and enriched event sizes, the number of indexed fields, the maximum query search time, etc., and plan for the future.
- Network Latency: network congestion can blind you; you will miss important telemetry if your log sources don't have enough network bandwidth or the necessary firewall configurations. This is bad because your correlation engine will receive logs seconds or minutes after the queries that should have matched them have already run, which can break your correlation logic. The second part of this blog, "Execution", will cover related concepts.
- Low Data Quality: unparsed and non-normalized telemetry limits your detection engineering capabilities by forcing you to develop heavy queries that run on unindexed raw data.
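To make the sizing point concrete, here is a back-of-the-envelope ingest calculation in Python; all the numbers (EPS, event sizes, enrichment factor) are hypothetical placeholders for your own measurements:

```python
# Back-of-the-envelope SIEM sizing from events-per-second and event size.
# Every constant here is a hypothetical example, not a recommendation.

def daily_ingest_gb(eps: float, avg_event_bytes: float) -> float:
    """Raw daily ingest in GB: events/sec * bytes/event * seconds/day."""
    return eps * avg_event_bytes * 86_400 / 1e9

def retention_gb(eps: float, avg_event_bytes: float, days: int,
                 enrichment_factor: float = 1.5) -> float:
    """Stored volume over a retention window; enrichment_factor models
    the growth of each event after parsing/enrichment (assumed 1.5x here)."""
    return daily_ingest_gb(eps, avg_event_bytes * enrichment_factor) * days

# Example: 1,000 EPS of 500-byte events ~= 43.2 GB/day raw,
# ~5.8 TB stored over 90 days once enriched.
```

Running this exercise per data source before onboarding it is much cheaper than discovering mid-project that your license or storage can't keep up.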
Here are some tips to keep in mind to minimize the impact of your detection queries:
- 1. Create a log collection strategy (I talk more about this in the next part).
- 2. Avoid resource-consuming queries like `event body contains "*Invoke-Mimikatz*"`.
- 3. Opt for reusable logic in building blocks that don't trigger an alert on their own, to help with advanced correlations. For example, an "Incoming Remote Command Execution via WinRM" detection can be used as a building block and correlated with other evidence-of-execution detections to detect, for instance, a lateral movement attempt.
- 4. Review detection queries before implementing them. For example, if you wish to use a project like Sigma, where sigmac can be really useful for converting Sigma rules into your favorite SIEM solution's query language, it is better to review and customize the output beforehand, since it can be really resource-consuming. On uncoder.io you may find plenty of examples, like "Detects Suspicious Commands on Linux systems", where the converted default QRadar query uses the AQL function `Payload Contains`, which is very resource- and time-demanding. This is of course easily customizable using the sigmac config file.
- 5. Use historical correlations to look back days or months for potentially lost or late data. This scenario is typical when you discover that one of your log collectors was unreachable or down for a period of time.
- 6. Add a look-back period to your detection queries. Instead of running a detection every 5 minutes on a 5-minute search period, you might want to add a couple of minutes to the search period per rule execution in case your logs are late to the party.
- 7. Push intelligence to the edge. This concept, known and used by multiple emerging technologies and network routing protocols, simply means pushing data processing to the endpoints instead of letting your SIEM do all the heavy lifting; the following diagram illustrates it with some examples. At the endpoint you can generate custom telemetry aimed at improving detection capabilities using ETW or open source tools such as AutorunsToWinEventLog by Palantir or LogCampaign by HAsecuritySolutions. Sysmon configuration files are very powerful here, letting you start correlating and tuning your events at the endpoint; sysmon-modular by @olafhartong can be a good start. Your endpoint agent should give you flexible control over which events to collect, the collection period and which parsing schema to use; Winlogbeat is very powerful in this respect. Before reaching your SIEM, your data can pass through a second layer where an event processor like Logstash (the lord of the stack) can transform and tune it even further.
Pushing intelligence to the endpoint
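The look-back idea from tip 6 can be sketched as a small Python helper that widens each scheduled query's search window; the interval and grace values are illustrative:

```python
# Tip 6 sketch: widen each scheduled query's search window so late-arriving
# events are still caught. The 5-minute interval and 3-minute grace period
# are example values, not recommendations.
from datetime import datetime, timedelta

def search_window(now: datetime, interval_min: int = 5,
                  lookback_min: int = 3) -> tuple:
    """Return (start, end) for a rule that runs every `interval_min` minutes
    but searches `lookback_min` extra minutes to absorb delayed logs."""
    start = now - timedelta(minutes=interval_min + lookback_min)
    return start, now
```

The trade-off is that overlapping windows can match the same event twice, so pair this with deduplication on a stable event identifier.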
In the context of detection engineering, IOC-based rules can be very effective at recognizing known threats that might target your environment, as well as in historical correlations to identify previously compromised assets. However, a detection based on an IOC feed is better suited to a reactive approach; it tends to be perishable due to the threat-aging effect, so continually tuning and updating these detections becomes a necessity. In this case, automation is the answer.
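As a sketch of that automation, aging out stale indicators can be as simple as filtering a feed by last-seen date; the feed structure, field names and 30-day cutoff below are assumptions, not any particular vendor's schema:

```python
# IOC aging sketch: automatically retire indicators older than a cutoff so
# IOC-based detections stay tuned. Feed format and max age are assumptions.
from datetime import datetime, timedelta

def active_iocs(feed: list, now: datetime, max_age_days: int = 30) -> list:
    """Keep only indicators seen within the last `max_age_days` days."""
    cutoff = now - timedelta(days=max_age_days)
    return [ioc for ioc in feed if ioc["last_seen"] >= cutoff]

feed = [
    {"value": "198.51.100.7", "last_seen": datetime(2024, 1, 2)},
    {"value": "evil.example", "last_seen": datetime(2023, 6, 1)},
]
```

Run on a schedule, this keeps your match lists small and your false positive rate from drifting upward as indicators get recycled by legitimate infrastructure.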
If we take a look at the Pyramid of Pain by @DavidJBianco, we can see a natural relationship between the pyramid's layers and detection durability: the more difficult an indicator is for the attacker to change, the longer a detection based on it will live.
TTP-based detections are durable because they are based on what the attackers use, IOCs are perishable because they are based on what the attackers leave behind. They know we will get the hash values, IPs and domain names, they're just hoping by then it would be too late.
This is not to say that detections based on IOCs are bad, just that if you choose a detection strategy leaning more towards IOCs or IOAs, your detection tuning process will be much harder to manage. In this case, adopting automation becomes essential.