The ability to collect, quantify, evaluate and enrich your data.
A SOC's ability to develop good detections for different techniques and tactics relies heavily on its ability to execute them. The short story: logging is the dark side of detection.
Evaluating your data quality and visibility to measure your detection execution capabilities is not an easy task. In this part we will talk about the two main drivers of this second dimension:
- Event Visibility: Data prioritization, collection and processing.
  - Identify critical data sources.
  - Define your collection strategy.
  - Define your log storage approach.
  - Data observability.
- Event Traceability: Data source quality and richness.
  - Evaluate the quality of your data.
Data is the air detection breathes; without it, detection is dead. Ensuring you have good visibility over the data sources that matter to you is crucial. From a SOC perspective, there are four main things to keep in mind, derived from four questions, when you evaluate your visibility:
- What do you need to collect?
- How do you want to collect it?
- How are you planning on storing it?
- How can you know when you're not collecting it?
Prioritize your data source types at a high level first, then go deeper. In the next part we will talk about a much deeper level of evaluation, "Event Traceability". You can adopt @cyb3rops' tweet below as a first step, then adapt it to your environment's needs. This will give you a first overview of your visibility. You can then add more criteria: the number of instances per log source type (to get an initial list of the solutions integrated in your SIEM, or to include high-availability instances and replication servers), and the event types per asset, since each asset might produce several (application logs, service logs, system logs, etc.).
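As a first pass, that overview can be kept as simple as a small table of source types and ratings. Below is a minimal Python sketch; the source types, criticality scores and integration flags are illustrative assumptions, not a reference list:

```python
# Illustrative first-pass visibility matrix: rate each log source type,
# count instances, and flag whether it is integrated in the SIEM yet.
log_sources = [
    {"type": "Domain Controller", "criticality": 5, "instances": 4,  "in_siem": True},
    {"type": "Firewall",          "criticality": 4, "instances": 2,  "in_siem": True},
    {"type": "Web Server",        "criticality": 3, "instances": 10, "in_siem": False},
]

# Surface high-criticality sources that are not yet integrated in the SIEM.
gaps = [s["type"]
        for s in sorted(log_sources, key=lambda s: -s["criticality"])
        if not s["in_siem"]]
print(gaps)  # ['Web Server']
```

Extending each row with extra criteria (event types, HA instances, replication servers) is just a matter of adding columns to this table.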
Other useful resources like NIST SP 800-92 (Guide to Computer Security Log Management) or the MITRE ATT&CK framework can help you at the beginning. You can also use the Attack-Python-Project Jupyter notebooks by @Cyb3rWard0g to interact with the MITRE ATT&CK framework and identify the most relevant telemetry and the data sources needed to cover the techniques of interest to you.
AttackCon2018 Presentation by Roberto and Jose Rodriguez
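The idea of mapping techniques to telemetry can be sketched offline as well: rank data sources by how many of your techniques of interest they cover. The technique-to-data-source mapping below is a tiny illustrative subset (in practice you would pull it from ATT&CK, e.g. via the notebooks above):

```python
from collections import Counter

# Hypothetical subset of techniques of interest and their ATT&CK data sources.
techniques = {
    "T1059 Command and Scripting Interpreter": ["Process Creation", "Command Execution"],
    "T1021 Remote Services":                   ["Logon Session", "Network Traffic"],
    "T1547 Boot or Logon Autostart Execution": ["Windows Registry", "Process Creation"],
}

# Count how many techniques each data source helps cover, highest first.
coverage = Counter(ds for sources in techniques.values() for ds in sources)
for source, count in coverage.most_common():
    print(source, count)
```

Sorting by coverage like this gives a defensible order in which to onboard data sources.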
Your collection strategy can be driven by security operations (e.g. compliance requirements) or by threat detection (e.g. detecting post-exploitation techniques used in Windows environments). It can also help with tuning: reducing alert fatigue, false positives and data storage costs, and optimizing EPS licensing costs. Here are some concepts to take into consideration.
- Volume vs Relevance: Depending on what drives your log collection strategy (SecOps, threat detection or both), if you did the previous step, "Identify Critical Data Sources", you now know what you need to collect, and you can build a balanced, hybrid list of when to collect everything and when to keep only the most relevant events for your use cases.
- Log Retention Policy: Your retention duration can be influenced by regulatory requirements and by your detection needs, such as how far back your analysts look when investigating alerts or hunting threats; the use of historical correlations is also impacted by your retention policy. We will come back to this in Log Storage.
- Agent vs Agentless: Technically there is no agentless approach in log collection; there are built-in agents and third-party agents. Going with an agent or not is simply a matter of scale: the bigger the environment, the harder it is to manage. Windows environments have a built-in mechanism to automatically forward logs called Windows Event Forwarding (WEF). Here is a video from SANS by @SecurityMapper and @packetengineer explaining the advantages and disadvantages of WEF/WEC in depth; I summarized it in the figure below.
WEF/WEC Pros and Cons from the SANS talk
If you're looking for how to configure Windows Event Forwarding, here is a great blog post from Elastic on configuring WEF/WEC:
Your SIEM's database type and log storage approach matter and affect your execution capabilities if you care about query speed and data availability.
There are two main types of SIEM databases for a logging use case: Schema-on-Write and Schema-on-Read. Schema-on-Write databases define the schema of your data (fields, structure and mappings) at ingestion time, while Schema-on-Read applies the schema at search time. On a Schema-on-Write database, searches are fast but ingestion is slower, because the data is already indexed and mapped: the heavy work was done at ingestion time. Schema-on-Read, on the other hand, prioritizes ingesting data quickly to avoid data loss, at the cost of slower queries.
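The trade-off can be illustrated with a toy example: the same key=value log lines, parsed once at ingestion (Schema-on-Write) versus parsed at every query (Schema-on-Read). A minimal sketch, with made-up log lines:

```python
import re

RAW = [
    "2024-01-01T00:00:00 user=alice action=login",
    "2024-01-01T00:00:05 user=bob action=logout",
]
FIELD = re.compile(r"(\w+)=(\S+)")

# Schema-on-Write: pay the parsing/indexing cost once, at ingestion time.
indexed = [dict(FIELD.findall(line)) for line in RAW]  # heavy work up front

def query_write(user):
    return [e for e in indexed if e["user"] == user]   # cheap search

# Schema-on-Read: store raw lines, pay the parsing cost on every search.
def query_read(user):
    events = (dict(FIELD.findall(line)) for line in RAW)  # parsed per query
    return [e for e in events if e.get("user") == user]

print(query_write("alice") == query_read("alice"))  # True
```

Both return the same events; the difference is purely where the CPU time is spent, which is exactly what moves between ingestion and search in the two database types.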
Several SIEMs claim they can do both; trust, but verify. A hybrid approach is recommended for a security monitoring use case. Here are some pros and cons of the Schema-on-Write approach:
Pros:
- Faster search.
- Fewer computing resources needed at search time.
- Easier correlations.
Cons:
- Disk write speed can be affected, and data loss is a risk (which may impact forensic evidence compliance for a court case).
- Requires knowledge of your data models.
- Difficult to handle unstructured data.
Each SIEM vendor has its own methods for sizing a security monitoring solution's storage capacity, but most of them adopt the same approach internally: a Hot, Warm and Cold architecture.
- Hot: The most recent and active logs to monitor. These nodes are known for fast disk writes (SSDs) and low storage capacity. Most analysts and threat hunters query a window from the last 15 minutes to the last 7 days; depending on your query rate and look-back time, retention on this type of node should preferably be set to 30 days or more.
- Warm: Once past the time frame of heaviest use, logs can be moved from SSDs to slower but larger media like hard disks or tape. These are typically kept for at least 90 days.
- Cold: Beyond the first 90 days, the chances of needing a particular log file are slim, but not zero. Cold storage is a cheap long-term solution, but it will take a long time to spool back up for use if needed.
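A back-of-the-envelope sizing for these tiers can be derived from your event rate and average event size. The numbers below (EPS, event size, retention windows) are illustrative assumptions; plug in your own:

```python
# Rough tier sizing from event rate and average event size.
EPS = 5000             # events per second (illustrative)
AVG_EVENT_BYTES = 500  # average stored event size (illustrative)
GB = 1024 ** 3

daily_gb = EPS * AVG_EVENT_BYTES * 86400 / GB  # ~200 GB/day at these numbers

tiers = {"hot": 30, "warm": 90, "cold": 365}   # retention days per tier
for tier, days in tiers.items():
    print(f"{tier}: ~{daily_gb * days:,.0f} GB")
```

Real sizing also needs compression ratios, replication factors and indexing overhead, which vary by vendor, but this order-of-magnitude view is usually enough to sanity-check a quote.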
Do you get notified when a data source is down or a new one is integrated? Can you tell when last week's data stream is much smaller than the week or the month before? Do you know when an important event is no longer collected from a specific asset? Do you notice when an event field is no longer populated?
Observability means maintaining a data pipeline with minimal downtime and maximum reliability by running regular health checks. Data observability matters for your detection engineering and can be applied at many levels:
- Index level: when an index stores much less data than usual.
- Log source type level: when a data source type, like a firewall cluster or your web servers, stops sending events.
- Asset level: when a single log source stops sending data.
- Event level: when, for example, an Event ID stops being recorded from a data source.
- Field level: when, for example, the Process Command Line field stops being populated.
If your SIEM has an API, you can use it for daily automated health checks and measure specific metrics at the observability levels you care about. Combining Jupyter notebooks and Vega visualizations like Trends, as demonstrated here, can be very effective. The following tweet by @nas_bench started a great conversation related to data observability.
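A minimal health check at the log source type level could compare this week's volume to a baseline. The counts and threshold below are illustrative; in practice they would come from your SIEM's API (e.g. a date-histogram aggregation):

```python
# Weekly event counts per log source type (illustrative numbers).
baseline = {"firewall": 1_200_000, "dns": 800_000, "av": 50_000}  # last week
current  = {"firewall": 1_150_000, "dns": 120_000, "av": 52_000}  # this week

DROP_THRESHOLD = 0.5  # alert when volume falls below 50% of baseline

alerts = [src for src, count in current.items()
          if count < baseline[src] * DROP_THRESHOLD]
print(alerts)  # ['dns'] — the DNS stream dropped sharply week over week
```

The same comparison works at the index, asset, event or field level; only the aggregation key changes.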
After approaching Event Visibility and defining your relevant data sources, Event Traceability helps you estimate the reliability of your data at a much deeper level, so that your detection implementations are simpler and trusted. The following are some use cases for evaluating the event traceability of a few data sources.
I started by defining the event types I need for my security operations and threat detection development, regardless of the vendor. For example, I need to be informed when a virus is detected, when an asset's license expires, or when a virus deletion fails, etc. After that, I listed the event fields that must be present for each event type and evaluated each one on the following color scale:
- GREEN: Collected, Parsed and Normalized
- YELLOW: Collected, Parsed but Not Normalized
- ORANGE: Only collected
- RED: Not Collected
- GREY: Not Applicable
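This scale also lends itself to a simple numeric score per event type. A sketch, with a hypothetical field list for an AV "virus detected" event (the fields and statuses are illustrative, not a vendor reference):

```python
# Map the color scale to points; GREY (Not Applicable) is excluded entirely.
SCALE = {"GREEN": 4, "YELLOW": 3, "ORANGE": 2, "RED": 1}

# Illustrative field statuses for one event type.
fields = {
    "Event Time": "GREEN",
    "Host Name":  "GREEN",
    "Virus Name": "YELLOW",
    "File Path":  "ORANGE",
    "File Hash":  "RED",
    "User SID":   "GREY",
}

scored = {f: s for f, s in fields.items() if s != "GREY"}  # ignore N/A fields
score = sum(SCALE[s] for s in scored.values()) / (4 * len(scored))
print(f"traceability: {score:.0%}")  # fraction of a perfect all-GREEN score
```

A score like this makes it easy to compare data sources and track improvement over time, while the color matrix itself stays the artifact you review with the team.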
Example of AV Event Traceability Evaluation
Example of a Perfect Score: Windows Authentication Events Traceability Evaluation
As you can see in the following figures, doing such an exercise lets you know which events you can rely on for relevant data during your detection engineering process. For example, Process Command Line is not audited by default in EID 4688, and process data is very helpful for correlations, as it can be used by SIEM platforms like Elastic to build a process tree context of execution.
Process Creation Event Traceability Evaluation
This example highlights the differences between Windows DNS debug logs, which are written to a file on the Windows DNS server, and Sysmon's EID 22, which is recorded in the Windows Event Log and generated on the client side. It is important to know these differences, since your agent can only collect Windows Server DNS logs if it can read and parse them from the dnslog.txt file written to disk. If you're using WEF/WEC to collect logs, you won't be able to collect them from a file; likewise, an agent like Winlogbeat won't do it unless you pair it with Filebeat (yet another agent). QRadar's WinCollect, for example, can collect both Windows Event Logs and DNS debug logs, but custom filtering can be more complex.
We should also take OS version limitations into consideration. As per the Sysmon documentation, the telemetry for EID 22 was added in Windows 8.1, so it is not available on Windows 7 and earlier; on the other hand, DNS debug logging is available on Windows Server 2003, 2008, 2012, 2016 and 2019.
DNS Request Event Traceability Evaluation