Operationalization

The ability to scale, manage and tune your detections.

Introduction:

First and foremost, detection engineering is an operational process that should be defined and managed by your security operations team. As Wikipedia explains, "Operationalization thus defines a fuzzy concept so as to make it clearly distinguishable, measurable, and understandable by empirical observation".

For this purpose we can define the following three drivers:

  • Detection Management:

    • Define a detection strategy.

    • Assess your detection capabilities, coverage and maturity.

  • Detection Scalability:

    • Detection-as-Code

  • Detection Tuning:

    • Precision & Recall

    • Detection Impact

    • Durability

Detection Management:

Detection development strategy:

Within the scope of your organization's governance, detection engineering starts by defining a strategy to focus your security operations effort, because we simply cannot detect everything. We can choose one of these approaches to drive our strategy:

  • Threat Actor-Driven: For example, focusing our detection efforts on the attacks, techniques and tools used by specific threat actors.

  • Business-Driven: Focusing on attacks that target a specific business sector, such as banking or energy.

  • Compliance-Driven: Focusing on covering regulatory requirements like PCI DSS or HIPAA.

  • Behavior-Driven: Focusing on specific tactics, techniques and procedures to detect suspicious behaviors, whether through machine learning or baselining.

You can decide what your SOC needs to focus on the most and define your own strategy and goals; the approaches above are just examples.

Assessment:

By creating and maintaining a detection knowledge base, you can increase your SOC's efficacy with reliable insights and improvements based on measuring detection coverage and security posture.

You can use Rob van Os's free tools, such as the SOC-CMM framework to assess and measure your SOC capabilities, and the MaGMa framework to document and measure your use case coverage. Palantir also provides a great framework called ADS (Alerting and Detection Strategy) which you can adopt to document and manage your detections.
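To make coverage measurement easier to automate, it also helps to keep that knowledge base machine-readable. Below is a minimal, hypothetical sketch of what such a record could look like and how coverage per ATT&CK tactic might be summarized; the field names and example rules are illustrative assumptions, not part of SOC-CMM, MaGMa or ADS:

```python
from collections import Counter

# Hypothetical detection knowledge base entries; the fields loosely mirror
# what frameworks like MaGMa or Palantir's ADS ask you to document.
detections = [
    {
        "name": "Incoming Remote Command Execution via WinRM",
        "attack_technique": "T1021.006",   # MITRE ATT&CK technique ID
        "attack_tactic": "Lateral Movement",
        "data_sources": ["Windows Security", "Sysmon"],
        "status": "production",            # e.g. draft / testing / production
    },
    {
        "name": "Suspicious PowerShell Encoded Command",
        "attack_technique": "T1059.001",
        "attack_tactic": "Execution",
        "data_sources": ["Sysmon"],
        "status": "testing",
    },
]

# Summarize coverage: how many production detections exist per ATT&CK tactic.
coverage = Counter(
    d["attack_tactic"] for d in detections if d["status"] == "production"
)

for tactic, count in coverage.items():
    print(f"{tactic}: {count} production detection(s)")
```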

If you're interested in defining and measuring your SOC metrics, Expel's @jhencinski writes great blog posts about measuring your SOC performance. You can also listen to his Blueprint podcast interview below:

Detection Scalability:

In this context, scalability is the ability of a detection to keep working as the number of variables increases. By variables I mean things such as the number of monitored assets, different OS types, different OS versions and different telemetry sources like endpoints, firewalls, cloud, Docker, etc.

Scalability is a big challenge in detection and is where the engineering aspect kicks in: you don't only need your detections to be accurate and resilient, with low-noise alerts, for all the different event formats you're collecting; you also need them to be all of those things across all of your assets, regardless of their functions and types.

Detection logic developed in a lab might be inefficient once it is deployed across your environment, because it might collide with legitimate third-party software behavior, for example, or your telemetry might not be able to support it. Remember that you're using different solutions from different vendors that send you different events in different formats, and if you're an MSSP/MDR, multiply that by the number of distinct solutions you're managing, because you can't write every detection in AQL, Lucene, SPL, KQL, EQL... you get the picture.

Detection scalability is strongly related to your ability to execute, which we will talk about in the next dimension, "Execution". However, there is a concept known as Detection-as-Code that can help address this challenge: a code-driven approach to developing, reviewing and implementing detections.

You can read more about it in this article by @anton_chuvakin:
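To make the idea more concrete, here is a minimal sketch of the Detection-as-Code workflow: a rule expressed as data that lives in version control, plus unit tests that CI runs against sample events before the rule is deployed. The rule format, field names and test harness are illustrative assumptions, not a specific product's API:

```python
# A detection expressed as data that lives in version control and goes
# through code review and CI before it reaches the SIEM.
RULE = {
    "name": "Suspicious PowerShell Encoded Command",
    "query_fields": {"process_name": "powershell.exe"},
    "command_line_contains": "-enc",
    "severity": "medium",
}

def matches(rule: dict, event: dict) -> bool:
    """Return True if the event satisfies the rule's conditions."""
    for field, expected in rule["query_fields"].items():
        if event.get(field, "").lower() != expected:
            return False
    return rule["command_line_contains"] in event.get("command_line", "").lower()

# Unit tests that CI would run on every change to the rule.
def test_rule_detects_encoded_powershell():
    event = {
        "process_name": "PowerShell.exe",
        "command_line": "powershell.exe -enc SQBFAFgA...",
    }
    assert matches(RULE, event)

def test_rule_ignores_benign_powershell():
    event = {
        "process_name": "powershell.exe",
        "command_line": "powershell.exe Get-Process",
    }
    assert not matches(RULE, event)

test_rule_detects_encoded_powershell()
test_rule_ignores_benign_powershell()
print("rule tests passed")
```

In a real pipeline, the same repository would also hold the conversion step that turns the reviewed rule into your SIEM's query language and a job that deploys it.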

Detection Tuning:

Precision & Recall:

Detection scalability and logic definition can make it challenging to measure the effectiveness of your detections and tune them to an acceptable level. Two of the best metrics we can rely on during the detection tuning process are precision and recall. I recommend watching the talk below by @Cyb3rWard0g and @_devonkerr_ to better understand the differences between precision and recall in detection and threat hunting.
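As a quick reminder of the math, precision tells you how many of the alerts a rule raised were actually malicious, while recall tells you how much of the malicious activity the rule caught. A minimal sketch, with made-up counts for illustration:

```python
# Outcomes for a single detection rule over a test period (illustrative numbers).
true_positives = 40    # alerts that were confirmed malicious
false_positives = 10   # alerts that turned out to be benign
false_negatives = 5    # malicious events the rule missed
                       # (e.g. surfaced by emulation or a red team engagement)

precision = true_positives / (true_positives + false_positives)  # 0.80
recall = true_positives / (true_positives + false_negatives)     # ~0.89

print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")
```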

You may wonder how we can determine false negatives. The talk goes through this question and suggests methods like:

  • Adversary emulation tools.

  • Red team engagements.

  • Third-party offensive security engagements.

Another great talk, more focused on detection engineering, was given by @jaredcatkinson during SO-CON last year, where he went much deeper into these concepts.

Detection Impact:

Not all detections are created equal, and their impact can be felt on many levels depending on your environment and team structure. Within the realm of a security operations center, we can find two main types of detection impact:

  • Impact on the analysts: A detection is just the start; it needs to be qualified, investigated and responded to. If your front line suffers from alert fatigue due to a lack of automation, a lack of enrichment or noisy rule logic, trust me, you will have a hard time both maintaining good service capabilities and convincing top management that you're doing a good job.

  • Impact on the toolset: Behind every detection there is a query, and whether you're paying your SIEM/EDR/XDR provider by EPS, by MPM, by data volume or by agent, you know that queries are expensive. Why? Because data quality matters: running a query every 5 minutes that looks for a string inside raw event payloads across unindexed data is slow and resource consuming, and this situation can be caused by:

    • Sizing Issues: Before you start a SOC or a monitoring project, detail your current needs in terms of number of data sources, raw event size, parsed event size, enriched event size, number of indexed fields, EPS, peak EPS, maximum query search time, etc., and plan for the future (a rough sizing sketch follows this list).

    • Network Latency: Network congestion can blind you, making you miss important telemetry if your log sources don't have enough network bandwidth or the necessary firewall configurations. This is bad because your correlation engine will receive logs seconds or minutes after the queries that should have matched them have already run, which can break your correlation logic. The second part of this blog, "Execution", will cover related concepts.

    • Low Data Quality: Unparsed and non-normalized telemetry limits your detection engineering capabilities by pushing you to develop heavy queries that run on unindexed raw data.
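To illustrate the sizing point above, here is a rough back-of-the-envelope estimate; every figure is a made-up assumption that you should replace with your own measurements:

```python
# Rough SIEM sizing estimate (all figures are illustrative assumptions).
average_eps = 5_000          # sustained events per second
peak_factor = 3              # expect peaks of roughly 3x the average
raw_event_size = 800         # bytes per raw event
parse_overhead = 1.5         # parsed + enriched events grow ~50% on disk
retention_days = 90

peak_eps = average_eps * peak_factor
daily_ingest_gb = average_eps * raw_event_size * 86_400 / 1e9
stored_gb = daily_ingest_gb * parse_overhead * retention_days

print(f"Peak EPS to size and license for: {peak_eps}")
print(f"Daily raw ingest: {daily_ingest_gb:.1f} GB")
print(f"Storage for {retention_days} days: {stored_gb:.0f} GB")
```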

Here are some tips to keep in mind to minimize the impact of your detection queries:

  1. Create a Log Collection Strategy (I talk more about this in the next part).

  2. Avoid resource-consuming queries like event body contains "*Invoke-Mimikatz*".

  3. Opt for reusable logic in building blocks that don't trigger an alert on their own but help with advanced correlations. For example, an "Incoming Remote Command Execution via WinRM" building block can be correlated with other evidence-of-execution detections to detect, for instance, a lateral movement attempt.

  4. Review detection queries before implementing them. For example, if you wish to use a project like Sigma, sigmac can be really useful to convert Sigma rules into your favorite SIEM solution's query language, but it is better to review and customize the output beforehand since it can be really resource consuming. If you go to uncoder.io you will find plenty of examples, like "Detects Suspicious Commands on Linux Systems", where the converted default QRadar query uses the AQL function Payload Contains, which is very resource and time demanding. This is of course easily customizable using a sigmac configuration file.

  5. Use historical correlations to look back over the past days or months for potentially lost or late data. This scenario is typical when you discover that one of your log collectors was unreachable or down for a period of time.

  6. Add a look-back period to your detection queries. Instead of running a detection every 5 minutes over a 5-minute search window, you might want to add a couple of minutes to the search window on each rule execution in case your logs are late to the party (a minimal sketch of this appears after the list).

  7. Push intelligence to the edge. This is a concept known and used by multiple emerging technologies and network routing protocols, and it simply means pushing data processing to the endpoints instead of letting your SIEM do all the heavy work. The following diagram illustrates this concept with some examples. At the endpoint, you can generate custom telemetry aimed at improving detection capabilities by using ETW or open source tools such as AutorunsToWinEventLog by Palantir or LogCampaign by HAsecuritySolutions. Sysmon configuration files are very powerful, letting you start correlating and tuning your events at the endpoint; sysmon-modular by @olafhartong can be a good start. Your endpoint agent should give you the flexibility to choose which events to collect, the collection period and which parsing schema to use; Winlogbeat is very powerful in this respect. Before reaching your SIEM, your data can pass through a second layer where you can transform and tune it even further with an event processor like Logstash (the lord of the stack).
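For the look-back tip (point 6), here is a minimal sketch of how a rule scheduler could widen each search window with an overlap buffer; the interval, buffer and deduplication key are illustrative assumptions. The deduplication step matters because overlapping windows will re-match the same events:

```python
from datetime import datetime, timedelta, timezone

RUN_INTERVAL = timedelta(minutes=5)     # how often the rule is scheduled
LOOKBACK_BUFFER = timedelta(minutes=2)  # extra window for late-arriving logs

def search_window(now: datetime) -> tuple[datetime, datetime]:
    """Window to query on this run: the last interval plus a look-back buffer."""
    start = now - RUN_INTERVAL - LOOKBACK_BUFFER
    return start, now

# Because consecutive windows overlap, the same event can match twice,
# so deduplicate alerts on a stable key (e.g. an event ID) before notifying.
seen_event_ids: set[str] = set()

def emit_alert(event: dict) -> None:
    if event["event_id"] in seen_event_ids:
        return  # already alerted on this event in a previous run
    seen_event_ids.add(event["event_id"])
    print(f"ALERT: {event['rule']} matched event {event['event_id']}")

start, end = search_window(datetime.now(timezone.utc))
print(f"Querying events from {start.isoformat()} to {end.isoformat()}")
```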

Durability:

In the context of detection engineering, IOC-based rules can be very effective in recognizing known threats that might target your environment, and also in historical correlations to identify previously compromised assets. However, a detection based on an IOC feed is more suited to a reactive approach; it tends to be perishable due to the threat aging effect, so continually tuning and updating this type of detection becomes a necessity. In this case, automation is the answer.
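As a concrete example of that automation, below is a minimal, hypothetical sketch of aging out stale indicators from a watchlist so IOC-based rules don't keep firing on long-dead infrastructure; the feed format, field names and 30-day threshold are assumptions to adapt to your own feeds:

```python
from datetime import datetime, timedelta, timezone

MAX_IOC_AGE = timedelta(days=30)  # illustrative threat-aging threshold

# Hypothetical IOC feed entries: value, type and when each was last reported.
iocs = [
    {"value": "198.51.100.23", "type": "ip", "last_seen": "2024-01-10T08:00:00+00:00"},
    {"value": "bad-domain.example", "type": "domain", "last_seen": "2024-03-01T12:00:00+00:00"},
]

def still_fresh(ioc: dict, now: datetime) -> bool:
    """Keep only indicators reported within the aging threshold."""
    last_seen = datetime.fromisoformat(ioc["last_seen"])
    return now - last_seen <= MAX_IOC_AGE

now = datetime.now(timezone.utc)
active_watchlist = [ioc for ioc in iocs if still_fresh(ioc, now)]
expired = [ioc for ioc in iocs if not still_fresh(ioc, now)]

print(f"{len(active_watchlist)} indicators kept, {len(expired)} aged out")
# The resulting active_watchlist would then be pushed to the SIEM reference
# set or watchlist that the IOC-based rules query.
```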

If we take a look at the Pyramid of Pain by @DavidJBianco, we can see a natural relationship between the pyramid layers and detection durability: the more difficult an indicator is for the adversary to change, the longer a detection based on it will live.

TTP-based detections are durable because they are based on what attackers do; IOC-based detections are perishable because they are based on what attackers leave behind. Attackers know we will get the hash values, IPs and domain names; they're just hoping that by then it will be too late.

This is not to say that detections based on IOCs are bad; it is just that if you choose a detection strategy leaning more towards IOCs or IOAs, your detection tuning process will be much harder to manage. In this case, adopting automation becomes essential.
