Introduction
The goal of this series is to put readers in the right mindset when thinking about SIEM and describe how to set themselves up for success. While I'm not a Data Scientist and don't claim to be, I can confidently say that expecting results in security analytics without first having "good data" to work with is folly. This is why I always say that "security analytics is, first and foremost, a data collection problem" and why part 1 of the SIEM Fundamentals blog is focused on how to approach data collection.
(Image from https://thelead.io/data-science/5-steps-to-a-data-science-project-lifecycle)
This image is a visualization of the OSEMN framework that many data scientists utilize when planning a project, and it is effectively how teams turn data into information. The entire purpose of using a SIEM is not to store data but to create new and useful information that can be used to improve security.
Obtaining data and scrubbing it is no small feat. In fact, many Data Scientists view each of these phases as distinct specialties or domains within the practice. It's not uncommon to find individuals dedicated to a single phase. Not appreciating this is why many teams fail to derive value from a SIEM; all of the marketing hype makes it easy to overlook just how much effort is required of the end-user at every phase of the process. Good teams continually iterate through these phases as their environment and the threat landscape change.
Let's spend a moment talking about ETL, as this helps describe some of the next sections.
- Extract – Actually getting the data source to output logs somewhere, in some format. Important to note that this is the only phase in which new original data can be introduced to a data set via configuration.
  - E.g. Configuring the log source to report fields "X, Y, and Z" using remote syslog with a specific format such as key-value pairs or comma-delimited lists.
- Transform – Modifying the format and structure of the data to better suit your needs.
  - E.g. Parsing a JSON log file into distinct plain-text events with a parsing file, mapping file, custom script, etc. (a brief sketch of this step follows the list below).
- Load – Writing the data to the database.
  - E.g. Using software that interprets the plain-text events and sends them into the database with INSERT statements or other public APIs.
- Post-Load Transform – Not an official part of the ETL process, but a very real component of SIEM.
  - E.g. Using data modeling, field extractions, and field aliases.
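To make the Transform step a little more concrete, here is a minimal sketch in Python (purely illustrative; the log fields and the key=value output format are assumptions, not any vendor's schema) of turning raw JSON log lines into flat plain-text events:

```python
import json

# Illustrative raw log lines as they might arrive from the Extract phase.
raw_lines = [
    '{"ts": 1579854825, "src_ip": "192.168.0.1", "url": "https://www.example.com/index", "action": "Allowed"}',
]

for line in raw_lines:
    record = json.loads(line)  # parse the JSON document into a dict
    # Flatten the dict into a single plain-text, key=value event string.
    event = "|".join(f"{key}={value}" for key, value in record.items())
    print(event)  # ts=1579854825|src_ip=192.168.0.1|url=https://www.example.com/index|action=Allowed
```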
Obtain
Collecting data is, at a small scale, simple. However, SIEM is not small scale, and figuring out how to reliably obtain relevant data is critical.
Reliable Delivery
For this section, we're going to focus on Extraction and Loading.
- Extraction
  - What is my data source capable of outputting?
  - What fields and formats can be used?
  - What transport methods are available?
  - Is this a device that we can "push" data from to a listener, or do we have to "pull" it via requests?
- Loading
  - How do we ensure that data is delivered in a timely and reliable manner?
  - What happens if a listener goes down? Will we miss data that gets pushed during an outage?
  - How do we ensure that pull requests complete successfully?
In the SIEM world, it's often the case that "extraction" functionality is provided alongside "loading" functionality, especially in cases where additional software (connectors, beats, and agents) is used.
However, there is a hidden space between these two where "event brokers" fit in. Because data must be delivered over a network, event brokers – technologies like Kafka and Redis – exist to handle load balancing, event caching, and queuing. Sometimes event brokers can write data directly into the target storage, but they may also output to a traditional "loader" in a sort of daisy-chain fashion.
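As a rough sketch of how an event broker slots in between extraction and loading, the snippet below pushes an already-parsed event onto a Kafka topic using the kafka-python client; the broker address and topic name are placeholders, and a separate "loader" would consume the topic and write into the SIEM:

```python
from kafka import KafkaProducer  # assumes the kafka-python package is available

# Placeholder broker address and topic; in practice these come from your pipeline config.
producer = KafkaProducer(
    bootstrap_servers="broker.example.internal:9092",
    value_serializer=lambda v: v.encode("utf-8"),  # events are plain strings in this sketch
)

event = "ts=1579854825|src_ip=192.168.0.1|action=Allowed"
producer.send("siem-raw-events", value=event)  # a downstream loader consumes this topic
producer.flush()  # block until the broker has acknowledged the event
```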
There's not really a right or wrong way to build your pipeline with respect to these factors. Most of this will be dictated by the SIEM technology that you use. However, it is important to be aware of how these things work and to be prepared to address the unique challenges of each through engineering solutions.
Choosing Log Sources
Don't go out collecting data from everything just because your SIEM vendor told you to; always have a plan and always have good justification for collecting the data you've chosen. At this point in the process, we should be asking ourselves the following questions when choosing relevant data:
- Understanding Information
  - What activity does this provide visibility of?
  - How authoritative is this data source?
  - Does this data source provide all of the visibility needed, or are there additional sources that are required?
  - Can this data be used to enrich other data sets?
- Determining Relevance
  - How can this data help meet the policies, standards, or objectives set by our security program?
  - How can this data enhance detection of a specific threat or threat actor?
  - How can this data be used to generate novel insight into existing operations?
- Measuring Completeness
  - Does the device already provide the data we need in the format we want?
    - If not, can it be configured to?
  - Is the data source configured for maximum verbosity?
  - Will additional enrichment be required to make this data source useful?
- Analyzing Structure
  - Is the data in a human-readable format?
    - What makes this data easy to read, and should we adopt a similar format for data of the same type?
  - Is the data in a machine-readable format?
    - What file type is the data, and how will a machine interpret this file type?
  - How is the data presented?
    - Is it a key-value format, comma delimited, or something else?
    - Do we have good documentation for this format?
Common problems in understanding your information stem from poor internal documentation and limited expertise with the log source itself and the network architecture. Determining relevance requires input from security specialists and policy owners. In all cases, having experienced and knowledgeable personnel participating early is a boon for the entire operation.
Scrub
Now onto the more interesting topic of data scrubbing. Unless you're familiar with this, you may end up asking yourself the following questions:
- Hello, shouldn't it just work?
- Why is the data not already clean? Did an engineer spill coffee on it?
- Hygiene sounds like a personal issue – is this something we need HR to weigh in on?
The reality is that, as awesome as machines are, they aren't that smart (*yet). The only knowledge they have is whatever knowledge we build them with.
For example, a human can look at the text "Rob3rt" and understand it to mean "Robert". However, a machine doesn't know that the number "3" can often represent the letter "e" in the English language unless it has been pre-programmed with such knowledge. A more real-world example would be handling differences in format like "3000" vs "3,000" vs "3K". As humans, we know that these all mean the same thing, but a machine gets tripped up by the "," in "3,000" and doesn't know to interpret "K" as "000".
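To make that concrete, here is a minimal Python sketch of the kind of normalization logic a machine has to be given explicitly (the function name and inputs are purely illustrative):

```python
def normalize_count(value: str) -> int:
    """Normalize '3000', '3,000', and '3K' to the integer 3000."""
    value = value.strip().replace(",", "")   # strip the thousands separator: "3,000" -> "3000"
    if value.upper().endswith("K"):          # expand the shorthand multiplier: "3K" -> 3 * 1000
        return int(float(value[:-1]) * 1000)
    return int(value)

print(normalize_count("3000"), normalize_count("3,000"), normalize_count("3K"))  # 3000 3000 3000
```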
For SIEM, this is important when analyzing data across log sources.
Example 1 – Exhibit A
Device | Timestamp | Source IP Address | Source Host Name | Request URL | Traffic
--- | --- | --- | --- | --- | ---
Web Proxy | 1579854825 | 192.168.0.1 | myworkstation.domain.com | https://www.example.com/index | Allowed

Device | Date | SRC_IP | SRC_HST | RQ_URL | Action
--- | --- | --- | --- | --- | ---
NGFW | 2020-01-24T08:34:14+00:00 | ::FFFF:192.168.0.59 | webproxy | www.example.com | Permitted
In this example, you can see that the "Field Name" and "Field Data" are different between log sources "Web Proxy" and "NGFW". Attempting to build complex use cases with this format is extremely challenging. Here's a breakdown of problematic differences:
- Timestamp: Web Proxy is in Epoch (Unix) format while NGFW is in Zulu (ISO 8601) format.
- Source IP: Web Proxy has an IPv4 address while NGFW has an IPv4-mapped IPv6 address.
- Source Host: Web Proxy uses an FQDN while NGFW does not.
- Request URL: Proxy uses the full request while NGFW only uses the domain.
- Traffic/Action: Proxy uses "allowed" and NGFW uses "permitted".
This is in addition to the actual field names being different. In a NoSQL database with poor scrubbing, this means that the query terms used to find Web Proxy logs will differ significantly from those used to find NGFW logs.
If I haven't already driven this point home hard enough yet, let's take a look at a sample detection use case:
- Use Case: Detect users successfully visiting known malicious websites.
- Environment: The Web Proxy sits in front of the NGFW and is the first device to see web traffic.
- Caveats
  - The Web Proxy and NGFW do not have identical block lists. A web request could make it through the Web Proxy only to be later denied by the NGFW.
  - Requests are forwarded from the proxy to the NGFW in a non-transparent manner. I.e. the Source IP and Host Name are replaced with the Web Proxy's IP and Host Name, and analyzing only the NGFW logs will not show you the true source of the request.
- Explanation:
  - In this example, let's assume that "Malicious" is some type of variable which compares the URL against a lookup table of known malicious URLs stored in the SIEM.
Our queries would look like this:
- SELECT RQ_URL, SRC_IP, SRC_HST
  WHERE Device == NGFW AND RQ_URL == Malicious AND Action == Permitted
- SELECT Request URL, Source IP, Source Host
  WHERE Device == Web Proxy AND Request URL == Malicious AND Traffic == Allowed
However, given the known caveats, analyzing the results of a single query would only tell us the following:
- NGFW – The ultimate block/deny status is known. The true source is unknown.
- Web Proxy – The ultimate block/deny status is unknown. The true source is known.
We have two related pieces of information that now have to be joined using some fuzzy timestamp logic that is really just a "best guess" according to two events that happened around the same time period (yikes).
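For illustration, a naive version of that "best guess" correlation might look like the Python sketch below; the events, field names, and the 5-second window are all assumptions made up for this example:

```python
# Pair each NGFW verdict with any proxy request seen within a small time window.
proxy_events = [{"ts": 1579854825, "src_ip": "192.168.0.1", "url": "https://www.example.com/index"}]
ngfw_events = [{"ts": 1579854827, "src_host": "webproxy", "action": "Permitted"}]

WINDOW_SECONDS = 5
for ngfw in ngfw_events:
    candidates = [p for p in proxy_events if abs(p["ts"] - ngfw["ts"]) <= WINDOW_SECONDS]
    # If several proxy requests landed in the same window, there is no way to tell
    # which one this NGFW verdict actually belongs to.
    print(ngfw["action"], "probably belongs to", [c["src_ip"] for c in candidates])
```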
How2Fix?
Remember these from earlier in the article?
- Transform – Modifying the format and structure of the data to better suit your needs.
  - E.g. Parsing a JSON log file into distinct plain-text events with a parsing file, mapping file, custom script, etc.
- Post-Load Transform – Not an official part of the ETL process, but a very real component of SIEM.
  - E.g. Using data modeling, field extractions, and field aliases.
There are entirely too many technologies and options for me to explain every one, but I'll cover some basic vocabulary for understanding what the transformation techniques are:
- Configuration – Not technically a transformation technique, but typically the best way to address data structure and format problems. Fix the problem at the source and skip everything else.
- Parsing/Field Extractions – A transform operation (pre-ingestion) that utilizes regular expressions (regex) to slice a string into characters (or groups of strings) based on patterns. Handles dynamic values well provided that the overall structure is static, but can be performance prohibitive with too many wildcards. (A brief sketch of parsing and mapping follows this list.)
- Mapping – A transform operation that uses a library of static inputs and outputs. Can be used to assign field names and values. Does not handle dynamic input well. However, it can be considered more efficient than parsing if the mapping table is small.
- Field Aliasing – Similar to mapping, but occurs post-load and doesn't necessarily change the actual data stored in the SIEM.
- Data Models – Similar to field aliasing; occurs at search-time.
- Field Extractions – Similar to parsing and can occur pre- or post-ingestion depending on the platform.
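As a rough sketch of parsing and mapping working together (not the syntax of any particular SIEM; the log format, field names, and mapping table are made up for illustration), consider:

```python
import re

# Parsing: a regex with named groups slices a space-delimited NGFW line into fields.
NGFW_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<src_ip>\S+)\s+(?P<src_host>\S+)\s+(?P<url>\S+)\s+(?P<action>\S+)$"
)

# Mapping: a small static table normalizes vendor-specific values to a common one.
ACTION_MAP = {"Allowed": "Permitted", "Accepted": "Permitted", "Permitted": "Permitted"}

raw = "2020-01-24T08:34:14+00:00 ::FFFF:192.168.0.59 webproxy www.example.com Permitted"
match = NGFW_PATTERN.match(raw)
if match:
    event = match.groupdict()  # named groups become our common field names
    event["action"] = ACTION_MAP.get(event["action"], event["action"])
    print(event)
```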
Let's say that we created a bunch of parsers to enforce a common field schema, mapped the field values for traffic from "allowed" to "permitted", configured our web proxy to forward the original source IP and host, configured our NGFW to log host names with their FQDN, and utilized functions to convert timestamps and extract IPv4 addresses. Our data now looks like this:
Example 1 – Exhibit B
Device | Time | Source IPv4 | Source FQDN | Request URL | Traffic
--- | --- | --- | --- | --- | ---
Web Proxy | January 24, 2020 – 8:00 AM | 192.168.0.1 | myworkstation.domain.com | https://www.example.com/index | Permitted

Device | Time | Source IPv4 | Source FQDN | Request URL | Traffic
--- | --- | --- | --- | --- | ---
NGFW | January 24, 2020 – 8:01 AM | 192.168.0.1 | myworkstation.domain.com | www.example.com | Permitted
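The timestamp conversion and IPv4 extraction mentioned above lend themselves to small helper functions. A rough Python sketch, with illustrative helper names rather than any particular SIEM's function syntax, might look like this:

```python
from datetime import datetime, timezone

def epoch_to_readable(epoch: int) -> str:
    """Convert an Epoch (Unix) timestamp into a human-readable UTC string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%B %d, %Y - %I:%M %p")

def extract_ipv4(address: str) -> str:
    """Pull the IPv4 portion out of an IPv4-mapped IPv6 address like ::FFFF:192.168.0.59."""
    return address.rsplit(":", 1)[-1] if ":" in address else address

print(epoch_to_readable(1579854825))        # January 24, 2020 - 08:33 AM (UTC)
print(extract_ipv4("::FFFF:192.168.0.59"))  # 192.168.0.59
```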
Let's also assume that the NGFW simply couldn't give us the full request URL because this information is enriched at the source through DNS resolution. Our "ideal pseudo-logic" now looks like this:
- SELECT Source IPv4, Source FQDN, Request URL
WHERE Device == NGFW AND Request URL == Malicious AND Traffic == Permitted
Because we've configured our proxy to forward the source information, we no longer have to rely on two data sources and fuzzy timestamp logic to attribute activity to a particular source. If we use lookup tables and some fancy logic, we can also easily figure out the full request URLs associated with the traffic by using the commonly formatted source information as inputs.
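As a toy illustration of that "lookup tables and fancy logic" idea (the data and field names simply mirror Exhibit B and are assumptions), the common Source IPv4 field lets us enrich an NGFW event with the full URL the Web Proxy recorded:

```python
# Lookup table built from Web Proxy events: source IPv4 -> full request URL.
proxy_urls_by_source = {"192.168.0.1": "https://www.example.com/index"}

ngfw_event = {"source_ipv4": "192.168.0.1", "request_url": "www.example.com", "traffic": "Permitted"}
full_url = proxy_urls_by_source.get(ngfw_event["source_ipv4"], ngfw_event["request_url"])
print(full_url)  # https://www.example.com/index
```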
Example 2
As one final example, let's say we wanted to build a report which shows us all permitted malicious web traffic across our network; but only Segment A's traffic goes through the Web Proxy and only Segment B's traffic goes through the NGFW.
Our query would look like this with bad data scrubbing:
- SELECT Request URL, RQ_URL, Source IP, SRC_IP, Source Host, SRC_HST
  WHERE (Request URL == Malicious AND Traffic == Allowed) OR (RQ_URL == Malicious AND Action == Permitted)
And like this with good data scrubbing:
- SELECT Request URL, Source IPv4, Source FQDN
  WHERE Request URL == Malicious AND Traffic == Permitted
The common formats, schema, and value types give us better query performance and make searching and building content much easier. There's a limited set of field names to remember, and the field values will, for the most part, look identical with the exception of the Request URL for the NGFW.
I can't stress enough how much more elegant and effective this is for quick analysis and content development.
Conclusion
This has been a very long-winded way of saying that effective SIEM usage requires (a) a plan, (b) strong cross-functional collaboration, and (c) a clear intent to structure data early on. Investing in these early phases sets you up for quick wins down the road.
If you liked this article, please share it with others and keep an eye out for "SIEM Fundamentals (Part 2): Using Alerts, Dashboards, and Reports Effectively". If you really liked this article and want to show some support, you can check out our Threat Detection Marketplace (free SIEM content), our ECS Premium Log Source Pack (data scrubbers for Elastic), and Predictive Maintenance (solves the data collection problems discussed here).