SIEM Fundamentals (Part 1): First and Foremost, A Data Collection Problem

January 30, 2020 · 13 min read

Introduction

The goal of this series is to put readers in the right mindset when thinking about SIEM and to show how to set themselves up for success. While I’m not a Data Scientist and don’t claim to be, I can confidently say that expecting results in security analytics without first having “good data” to work with is folly. This is why I always say that “security analytics is, first and foremost, a data collection problem” and why part 1 of the SIEM Fundamentals blog is focused on how to approach data collection.

(Image from https://thelead.io/data-science/5-steps-to-a-data-science-project-lifecycle)

This image is a visualization of the OSEMN framework that many data scientists use when planning a project, and it is effectively how teams turn data into information. The entire purpose of using a SIEM is not to store data but to create new, useful information that can be used to improve security.

Obtaining data and scrubbing it is no small feat. In fact, many Data Scientists view each of these phases as a distinct specialty or domain within the practice, and it’s not uncommon to find individuals dedicated to a single phase. Not appreciating this is why many teams fail to derive value from a SIEM: the marketing hype makes it easy to overlook just how much effort is required from the end user at every phase of the process. These phases are also not one-time tasks; good teams iterate through them continually as their environment and the threat landscape change.

Let’s spend a moment talking about ETL, as it helps frame the next sections (a minimal code sketch follows the list).

  1. Extract – Getting the data source to output logs somewhere, in some format. Note that this is the only phase in which new, original data can be introduced to a data set via configuration.
    • E.g., configuring the log source to report fields “X, Y, and Z” via remote syslog in a specific format such as key-value pairs or comma-delimited lists.
  2. Transform – Modifying the format and structure of the data to better suit your needs.
    • E.g., parsing a JSON log file into distinct plain-text events with a parsing file, mapping file, custom script, etc.
  3. Load – Writing the data to the database.
    • E.g., using software that interprets the plain-text events and sends them into the database with INSERT statements or other public APIs.
  4. Post-Load Transform – Not an official part of the ETL process, but a very real component of SIEM.
    • E.g., using data modeling, field extractions, and field aliases.
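To make the phases concrete, here is a minimal, purely illustrative Python sketch of the Extract → Transform → Load flow. The key-value payload, field names, and the SQLite table are assumptions; a real pipeline would read from a syslog listener and load through the SIEM’s own ingestion API.

```python
import sqlite3

# Hypothetical raw payload as it might arrive over remote syslog from the log
# source configured in the "Extract" phase (field names are assumptions).
raw_event = 'ts=1579854825 src=192.168.0.1 url=https://www.example.com/index action=Allowed'

def transform(line: str) -> dict:
    """Transform: parse key-value pairs into a structured event with a known schema."""
    fields = dict(pair.split('=', 1) for pair in line.split())
    return {
        'timestamp': int(fields['ts']),
        'source_ip': fields['src'],
        'request_url': fields['url'],
        'action': fields['action'].lower(),
    }

# Load: write the structured event to a database (SQLite stands in here for the
# SIEM's own ingestion API or INSERT-style loader).
conn = sqlite3.connect('events.db')
conn.execute('CREATE TABLE IF NOT EXISTS events '
             '(timestamp INTEGER, source_ip TEXT, request_url TEXT, action TEXT)')
event = transform(raw_event)
conn.execute('INSERT INTO events VALUES (?, ?, ?, ?)',
             (event['timestamp'], event['source_ip'], event['request_url'], event['action']))
conn.commit()
conn.close()
```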

Obtain

Collecting data is simple at a small scale. However, SIEM is not small scale, and figuring out how to reliably obtain relevant data is critical.

Reliable Delivery

For this section, weā€™re going to focus on Extraction and Loading.

  • Extraction
    • What is my data source capable of outputting?
    • What fields and formats can be used?
    • What transport methods are available?
    • Is this a device that we can ā€œpushā€ data from to a listener or do we have to ā€œpullā€ it via requests?
  • Loading
    • How do we ensure that data is delivered in a timely and reliable manner?
    • What happens if a listener goes down? Will we miss data that gets pushed during an outage?
    • How do we ensure that pull requests complete successfully?

In the SIEM world, it’s often the case that “extraction” functionality is provided alongside “loading” functionality, especially where additional software (connectors, beats, and agents) is used.

However, there is a hidden space between these two where “event brokers” fit in. Because data must be delivered over a network, event brokers (technologies like Kafka and Redis) are used to handle load balancing, event caching, and queuing. Sometimes event brokers write data directly into the target storage, but they may also output to a traditional “loader” in a daisy-chain fashion.
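If your pipeline includes a broker, the hand-off can be as simple as publishing parsed events to a topic and letting the loader consume from it. Below is a hedged sketch using the kafka-python client; the broker addresses and the siem-raw-events topic name are assumptions, not a prescription for any particular SIEM.

```python
import json
from kafka import KafkaProducer  # kafka-python client

# Publish parsed events to a broker topic; a separate loader/connector consumes
# from the same topic and writes into SIEM storage.
producer = KafkaProducer(
    bootstrap_servers=['broker1:9092', 'broker2:9092'],       # assumed broker addresses
    value_serializer=lambda e: json.dumps(e).encode('utf-8'),
    acks='all',  # wait for the in-sync replicas so a single broker outage doesn't drop events
)

event = {'timestamp': 1579854825, 'source_ip': '192.168.0.1', 'action': 'allowed'}
producer.send('siem-raw-events', value=event)  # 'siem-raw-events' is a hypothetical topic
producer.flush()
```

The acks='all' setting trades a little latency for stronger delivery guarantees, which matters when the loader downstream is what your detections depend on.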

There’s not really a right or wrong way to build your pipeline with respect to these factors; most of this will be dictated by the SIEM technology you use. However, it is important to be aware of how these pieces work and to be prepared to address the unique challenges of each through engineering solutions.
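For “pull” sources, one common engineering pattern is to persist a checkpoint and retry with backoff, so an outage pauses collection instead of silently dropping a window of events. A minimal sketch, assuming a hypothetical REST endpoint that accepts a since offset:

```python
import json
import time
import requests

CHECKPOINT_FILE = 'last_offset.json'                     # persisted so an outage doesn't lose our place
LOG_API = 'https://logsource.example.com/api/events'     # hypothetical pull endpoint

def load_checkpoint() -> int:
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)['offset']
    except FileNotFoundError:
        return 0

def save_checkpoint(offset: int) -> None:
    with open(CHECKPOINT_FILE, 'w') as f:
        json.dump({'offset': offset}, f)

def pull_events() -> None:
    offset = load_checkpoint()
    for attempt in range(5):                             # bounded retries instead of silently skipping a window
        try:
            resp = requests.get(LOG_API, params={'since': offset}, timeout=30)
            resp.raise_for_status()
            events = resp.json()
            # ... forward events to the loader or broker here ...
            if events:
                save_checkpoint(events[-1]['offset'])
            return
        except requests.RequestException:
            time.sleep(2 ** attempt)                     # exponential backoff before retrying
    raise RuntimeError('log source unreachable; checkpoint left untouched for the next run')
```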

Choosing Log Sources

Don’t go out collecting data from everything just because your SIEM vendor told you to; always have a plan and always have good justification for collecting the data you’ve chosen. At this point in the process, we should be asking ourselves the following questions when choosing relevant data:

  1. Understanding Information
    • What activity does this provide visibility of?
    • How authoritative is this data source?
    • Does this data source provide all of the visibility needed or are there additional sources that are required?
    • Can this data be used to enrich other data sets?
  2. Determining Relevance
    • How can this data help meet the standards, requirements, or objectives set by security policy?
    • How can this data enhance detection of a specific threat or threat actor?
    • How can this data be used to generate novel insight into existing operations?
  3. Measuring Completeness
    • Does the device already provide the data we need in the format we want?
    • If not, can it be configured to?
    • Is the data source configured for maximum verbosity?
    • Will additional enrichment be required to make this data source useful?
  4. Analyzing Structure
    • Is the data in a human readable format?
      • What makes this data easy to read and should we adopt a similar format for data of the same type?
    • Is the data in a machine readable format?
      • What file-type is the data and how will a machine interpret this file type?
    • How is the data presented?
      • Is it a key-value format, comma delimited, or something else?
    • Do we have good documentation for this format?

Common problems in understanding your information stem from poor internal documentation of (or limited expertise with) the log source itself and the network architecture. Determining relevance requires input from security specialists and policy owners. In all cases, having experienced and knowledgeable personnel participate early is a boon for the entire operation.

Scrub

Now onto the more interesting topic of data scrubbing. Unless you’re familiar with this, you may end up asking yourself the following questions:

  • Hello, shouldn’t it just work?
  • Why is the data not already clean? Did an engineer spill coffee on it?
  • Hygiene sounds like a personal issue; is this something we need HR to weigh in on?

The reality is that, as awesome as machines are, they aren’t that smart (yet). The only knowledge they have is whatever knowledge we build into them.

For example, a human can look at the text “Rob3rt” and understand it to mean “Robert”. However, a machine doesn’t know that the number “3” can often represent the letter “e” in the English language unless it has been pre-programmed with that knowledge. A more real-world example is handling differences in format like “3000” vs “3,000” vs “3K”. As humans, we know that these all mean the same thing, but a machine gets tripped up by the “,” in “3,000” and doesn’t know to interpret “K” as “000”.
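As a tiny illustration, this is the kind of knowledge a machine has to be given explicitly before those values can be treated as equal; a minimal Python sketch:

```python
def normalize_count(value: str) -> int:
    """Collapse human-friendly variants such as '3000', '3,000', and '3K' into one integer."""
    value = value.strip().replace(',', '')        # '3,000' -> '3000'
    if value[-1].upper() == 'K':
        return int(float(value[:-1]) * 1000)      # '3K' -> 3000
    return int(value)

assert normalize_count('3000') == normalize_count('3,000') == normalize_count('3K') == 3000
```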

For SIEM, this is important when analyzing data across log sources.

Example 1 – Exhibit A

| Device    | Timestamp  | Source IP Address | Source Host Name         | Request URL                   | Traffic |
|-----------|------------|-------------------|--------------------------|-------------------------------|---------|
| Web Proxy | 1579854825 | 192.168.0.1       | myworkstation.domain.com | https://www.example.com/index | Allowed |

| Device | Date                      | SRC_IP              | SRC_HST  | RQ_URL          | Action    |
|--------|---------------------------|---------------------|----------|-----------------|-----------|
| NGFW   | 2020-01-24T08:34:14+00:00 | ::FFFF:192.168.0.59 | webproxy | www.example.com | Permitted |

In this example, you can see that both the field names and the field values differ between the “Web Proxy” and “NGFW” log sources. Attempting to build complex use cases with this format is extremely challenging. Here’s a breakdown of the problematic differences:

  1. Timestamp: Web Proxy is in Epoch (Unix) format while NGFW is in Zulu (ISO 8601) format.
  2. Source IP: Web Proxy has an IPv4 address while NGFW has an IPv4-mapped IPv6 address.
  3. Source Host: Web Proxy uses an FQDN while NGFW does not.
  4. Request URL: Proxy uses the full request while NGFW only uses the domain.
  5. Traffic/Action: Proxy uses “allowed” and NGFW uses “permitted”.

This is in addition to the actual field names being different. In a NoSQL database with poor scrubbing, this means that the query terms used to find Web Proxy logs will differ significantly from those used to find NGFW logs.

If I haven’t already driven this point home hard enough, let’s take a look at a sample detection use case:

  • Use Case: Detect users successfully visiting known malicious websites.
  • Environment: The Web Proxy sits in front of the NGFW and is the first device to see web traffic.
  • Caveats
    • The Web Proxy and NGFW do not have identical block lists. A web request could make it through the Web Proxy only to be later denied by the NGFW.
    • Requests are forwarded from the proxy to the NGFW in a non-transparent manner; i.e., the Source IP and Host Name are replaced with the Web Proxy’s, so analyzing only the NGFW logs will not show the true source of the request.
  • Explanation:
    • In this example, let’s assume that “Malicious” is some type of variable that compares the URL against a lookup table of known malicious URLs stored in the SIEM.

Our query would look like this:

  • SELECT RQ_URL, SRC_IP, SRC_HST
    WHERE Device == NGFW AND RQ_URL == Malicious AND Action == Permitted
  • SELECT Request URL, Source IP, Source Host
    WHERE Device == Web Proxy AND Request URL == Malicious AND Traffic == Allowed

However, given the known caveats, analyzing the results of a single query would only tell us the following:

  • NGFW – The ultimate block/deny status is known. The true source is unknown.
  • Web Proxy – The ultimate block/deny status is unknown. The true source is known.

We have two related pieces of information that now have to be joined using fuzzy timestamp logic that amounts to a “best guess” based on two events that happened around the same time (yikes).

How2Fix?

Remember these from earlier in the article?

  • Transform – Modifying the format and structure of the data to better suit your needs.
    • E.g., parsing a JSON log file into distinct plain-text events with a parsing file, mapping file, custom script, etc.
  • Post-Load Transform – Not an official part of the ETL process, but a very real component of SIEM.
    • E.g., using data modeling, field extractions, and field aliases.

There are entirely too many technologies and options for me to explain every one, but I’ll cover some basic vocabulary for understanding what the transformation techniques are (a small parsing sketch follows the list):

  • Configuration – Not technically a transformation technique, but typically the best way to address data structure and format problems. Fix the problem at the source and skip everything else.
  • Parsing/Field Extractions – A transform operation (pre-ingestion) that uses regular expressions (regex) to slice a string into substrings (or groups of strings) based on patterns. Handles dynamic values well provided that the overall structure is static, but can be performance-prohibitive with too many wildcards.
  • Mapping – A transform operation that uses a library of static inputs and outputs. Can be used to assign field names and values. Does not handle dynamic input well; however, it can be more efficient than parsing if the mapping table is small.
  • Field Aliasing – Similar to mapping, but occurs post-load and doesn’t necessarily change the actual data stored in the SIEM.
  • Data Models – Similar to field aliasing; occurs at search-time.
  • Field Extractions – Similar to parsing; can occur pre- or post-ingestion depending on the platform.
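As a small example of parsing/field extraction, the sketch below uses regex named capture groups to slice a raw line into fields at ingest time. The line format and field names are assumed purely for illustration:

```python
import re

# Named capture groups act as the "parsing rule" that turns a raw NGFW-style line
# (assumed format) into distinct fields.
NGFW_PATTERN = re.compile(
    r'^(?P<time>\S+)\s+'
    r'(?P<src_ip>\S+)\s+'
    r'(?P<src_host>\S+)\s+'
    r'(?P<request_url>\S+)\s+'
    r'(?P<action>\w+)$'
)

raw = '2020-01-24T08:34:14+00:00 ::FFFF:192.168.0.59 webproxy www.example.com Permitted'
match = NGFW_PATTERN.match(raw)
if match:
    print(match.groupdict())
    # {'time': '2020-01-24T08:34:14+00:00', 'src_ip': '::FFFF:192.168.0.59', ...}
```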

Let’s say that we created a bunch of parsers to enforce a common field schema, mapped the traffic field values from “allowed” to “permitted”, configured our web proxy to forward the original source IP and host, configured our NGFW to log host names as FQDNs, and used functions to convert timestamps and extract IPv4 addresses. Our data now looks like this:

Example 1 – Exhibit B

| Device    | Time                      | Source IPv4 | Source FQDN              | Request URL                   | Traffic   |
|-----------|---------------------------|-------------|--------------------------|-------------------------------|-----------|
| Web Proxy | January 24, 2020 – 8:00 AM | 192.168.0.1 | myworkstation.domain.com | https://www.example.com/index | Permitted |
| NGFW      | January 24, 2020 – 8:01 AM | 192.168.0.1 | myworkstation.domain.com | www.example.com               | Permitted |
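For illustration, here is a hedged sketch of the kind of pre-ingestion normalization that could turn the Exhibit A NGFW event into the Exhibit B shape. The field map, value map, and the choice of ISO 8601 as the single timestamp format are assumptions; real connectors and parsers ship their own schema definitions.

```python
from datetime import datetime, timezone

# Hypothetical field and value mappings used only for this example.
FIELD_MAP = {'Date': 'Time', 'SRC_IP': 'Source IPv4', 'SRC_HST': 'Source FQDN',
             'RQ_URL': 'Request URL', 'Action': 'Traffic'}
VALUE_MAP = {'Allowed': 'Permitted', 'Accepted': 'Permitted'}

def epoch_to_iso(epoch: int) -> str:
    """Convert the proxy's epoch seconds to the ISO 8601 format the NGFW already emits."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat()

def normalize_ngfw(event: dict) -> dict:
    out = {FIELD_MAP.get(k, k): v for k, v in event.items()}   # rename fields to the common schema
    ip = out['Source IPv4']
    if ip.upper().startswith('::FFFF:'):                       # IPv4-mapped IPv6 -> plain IPv4
        out['Source IPv4'] = ip[len('::FFFF:'):]
    out['Traffic'] = VALUE_MAP.get(out['Traffic'], out['Traffic'])
    return out

ngfw = {'Date': '2020-01-24T08:34:14+00:00', 'SRC_IP': '::FFFF:192.168.0.59',
        'SRC_HST': 'webproxy', 'RQ_URL': 'www.example.com', 'Action': 'Permitted'}
print(epoch_to_iso(1579854825))   # '2020-01-24T08:33:45+00:00'
print(normalize_ngfw(ngfw))
```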

Let’s also assume that the NGFW simply couldn’t give us the full request URL because this information is enriched at the source through DNS resolution. Our “ideal pseudo-logic” now looks like this:

  • SELECT Source IPv4, Source FQDN, Request URL
    WHERE Device == NGFW AND Request URL == Malicious AND Traffic == Permitted

Because we’ve configured our proxy to forward the source information, we no longer have to rely on two data sources and fuzzy timestamp logic to attribute activity to a particular source. With lookup tables and a little extra logic, we can also recover the full request URLs associated with the traffic by using the commonly formatted source information as join keys (sketched below).
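A hedged sketch of that idea: with both sources normalized to the same schema, a lookup table of known-malicious URLs plus a join on the common source field recovers the full URL for traffic the NGFW permitted. All data here is illustrative.

```python
# Hypothetical lookup table and normalized events, purely for illustration.
malicious_urls = {'https://www.example.com/index', 'http://evil.example.net/payload'}

ngfw_hits = [
    {'Source FQDN': 'myworkstation.domain.com', 'Request URL': 'www.example.com', 'Traffic': 'Permitted'},
]
proxy_events = [
    {'Source FQDN': 'myworkstation.domain.com', 'Request URL': 'https://www.example.com/index', 'Traffic': 'Permitted'},
]

# Join on the now-consistent source field to recover the full URL the proxy saw
# for traffic the NGFW ultimately permitted.
for hit in ngfw_hits:
    full_urls = [
        p['Request URL'] for p in proxy_events
        if p['Source FQDN'] == hit['Source FQDN'] and p['Request URL'] in malicious_urls
    ]
    print(hit['Source FQDN'], '->', full_urls)
```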

Example 2

As one final example, let’s say we wanted to build a report showing all permitted malicious web traffic across our network, but only Segment A’s traffic goes through the Web Proxy and only Segment B’s traffic goes through the NGFW.

Our query would look like this with bad data scrubbing:

  • SELECT Request URL, RQ_URL, Source IP, SRC_IP, Source Host, SRC_HST
    WHERE (Request URL == Malicious AND Traffic == Allowed) OR (RQ_URL == Malicious AND Action == Permitted)

And like this with good data scrubbing:

  • SELECT Request URL, Source IPv4, Source FQDN
    WHERE Request URL == Malicious AND Traffic == Permitted

The common formats, schema, and value types give us better query performance and make searching and building content much easier. There’s a limited set of field names to remember, and the field values will, for the most part, look identical, with the exception of the Request URL for the NGFW.

I can’t stress enough how much more elegant and effective this is for quick analysis and content development.

Conclusion

This has been a very long-winded way of saying that effective SIEM usage requires (a) a plan, (b) strong cross-functional collaboration, and (c) a clear intent to structure data early on. Investing in these early phases sets you up for quick wins down the road.

If you liked this article, please share it with others and keep an eye out for “SIEM Fundamentals (Part 2): Using Alerts, Dashboards, and Reports Effectively”. If you really liked this article and want to show some support, you can check out our Threat Detection Marketplace (free SIEM content), our ECS Premium Log Source Pack (data scrubbers for Elastic), and Predictive Maintenance (solves the data collection problems discussed here).
