A general introduction and simplified guide to installing the Elastic Stack
What is Elastic Stack?
The Elastic Stack is a set of tools for collecting, ingesting, and searching data. In a SIEM context, the Elastic Stack can be a powerful solution for monitoring purposes.
Formerly called the ELK stack, it now contains four major components:
Elasticsearch: The database where your events will be indexed, stored, and queried for search.
Kibana: The user interface for interacting with Elasticsearch (i.e., searching, creating visualizations and dashboards, and managing your stack).
Logstash: My favorite tool of the stack; it is the equivalent of what a SIEM architecture may call a Data Collector or Processor. Logstash works in an ETL workflow and doesn't store any data itself, but it supports many destinations to which it can send data.
Beats: Agents and event collectors. Elastic offers a set of agents that support different datasets from different systems, such as Windows, Linux, metrics, uptime, cloud, etc.
Quick notes about Elasticsearch for a SIEM use case
Elasticsearch is fast, faster than most SIEMs you will see out there. One reason for its search speed is that Elasticsearch is a schema-on-write kind of database: the schema (the fields, structure, and mappings of the data) is defined up front for the specific purpose the database will serve.
Elastic describes the way it stores logs as both schema-on-write and schema-on-read, which they call minimal schema (by default, Elasticsearch stores data with minimal fields like @timestamp and message, plus some metadata like tags and host). In my view, though, a logging use case is best served by a schema-on-write approach: you want to map and parse your IPs as the ip data type and your port numbers as long, rather than adding a lot of scripted fields with Painless scripts at search time, which may hurt the performance of your cluster.
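As a sketch of what schema-on-write looks like in practice, the mapping below types an IP field as ip and a port field as long, so the parsing work happens at ingest rather than at search time. The index name and field names are illustrative, not from any real deployment; in practice you would send this body to the cluster's mapping API (e.g., via Kibana Dev Tools).

```python
# Illustrative schema-on-write mapping for a hypothetical "logs-firewall" index.
# Shown as a plain Python dict of the JSON body you would PUT to the cluster.
mapping = {
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "message": {"type": "text"},
            "source": {
                "properties": {
                    "ip": {"type": "ip"},      # typed as ip: enables CIDR-style queries
                    "port": {"type": "long"},  # numeric: enables range queries
                }
            },
        }
    }
}
```

Because the types are fixed at write time, queries like range filters on ports or subnet matches on IPs need no scripted fields at search time.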
This approach comes with pros and cons:
Faster searches, since the parsing work is done at write time
Good for event-centric correlations
Writing data to disk (indexing) can be slower
Time-based correlations need more work at ingestion time.
The logging use case needs appropriate sizing of your Elasticsearch nodes for both data indexing and searching, as well as a proper understanding of your data.
Indexing is the method by which search engines organize data for quick retrieval. The resulting structure is called an index. An analogy could be made between an index and a table in a relational database (not exactly accurate, but you get the idea).
An index is a logical grouping of documents that often share a similar structure; it is used to store and read documents, which in a SIEM context are our log events. A document is similar to a row in a table in a relational database.
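To make the document concept concrete, a log event is just a JSON object. The sketch below shows what one might look like with the minimal default fields mentioned earlier (@timestamp, message, tags, host); all field values are invented for illustration.

```python
# A single document (log event) as it might be stored in an index.
# Field names follow the minimal-schema fields mentioned above; values are made up.
document = {
    "@timestamp": "2021-06-01T12:00:00Z",
    "message": "Accepted password for root from 10.0.0.5 port 22",
    "source": {"ip": "10.0.0.5", "port": 22},
    "host": {"name": "web-01"},
    "tags": ["ssh", "auth"],
}
```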
A shard is a collection of documents, similar to a smaller data partition. An index is stored across multiple distributed shards. There are two types of shards: primary shards and their copies, which are called replicas. Thanks to its horizontal scalability and cluster design, Elasticsearch is resilient to node failure: replicating your data across multiple nodes makes it highly available and able to survive a single-node failure.
A cluster is a set of nodes, each running an instance of Elasticsearch. Distinct roles can be defined for each node (see more):
Data node (hot, warm, and cold)
Machine Learning node
The diagram below puts all of these Elasticsearch data-structure concepts together:
Elasticsearch Data Structure
Searching data in Elasticsearch uses inverted indices: when we look for data, we don't scan the JSON documents stored in our database; we consult an inverted index instead.
Using an inverted index is a lot like finding a book page that contains a certain keyword by scanning the index at the end of the book instead of reading each page from start to finish. Or think of when you start typing a phone number on your cell phone and contact names start showing up as you type.
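The book-index analogy can be sketched in a few lines: at write time, build a map from each term to the IDs of the documents containing it; at search time, a lookup becomes a single dictionary access instead of a scan over every document. The documents and terms below are made up, and real Elasticsearch analyzers do much more (lowercasing, tokenization rules, etc.):

```python
from collections import defaultdict

# Toy corpus: document ID -> log message (contents invented for illustration).
docs = {
    1: "failed login from 10.0.0.5",
    2: "successful login from 10.0.0.9",
    3: "failed password attempt",
}

# Build the inverted index at "write time": term -> set of document IDs.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():  # a real analyzer also lowercases, strips tokens, etc.
        inverted[term].add(doc_id)

# At "search time", a lookup is a direct dictionary access, not a full scan:
print(sorted(inverted["failed"]))  # → [1, 3]
```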