A SIEM developer goes fishing in the data lake. What happens next?

TLDR: he misses a flow-based functional processing language.

Before getting into the topic, let me provide some background and motivation, especially if you are about to jump on the 'Data Lake' bandwagon aiming at security use cases (SIEM).

The SIEM is not the default gateway for log data

This year I had a chance to work for a large American corporation where the scale of operations was so massive I was not able to focus on anything but log telemetry analysis and other cost/efficiency related work.

While that's less specialized and fun than log-based Threat Detection Engineering, it's a nice challenge to have, since most big customers face a similar problem: high data collection and storage costs.

How to approach that?

After many discussions, we came up with the following research actions:

  1. Consumer Analysis
    Given the data already in Splunk, which (scheduled/user) queries or knowledge objects (KOs) are actually consuming the logs?
  2. Ingestion pipeline management
    How could low-value, useless data be discarded or filtered out before hitting the Splunk indexers? Can it be routed to multiple platforms?
  3. Data Platform alternatives
    Which platforms fit use cases not requiring full analytics capabilities (simple log retention, post-mortem, etc.)? How much effort is required ($/time)?

For #1, besides the _audit logs, it's quite simple to tap into Splunk's REST endpoints and extract every index/sourcetype referenced in the contents of its KOs. In the end, you can build an insightful metrics dashboard:

  • Number of scheduled searches per index/sourcetype/datamodel
  • Absolute and relative amounts of data completely untouched or unconsumed (don't be surprised to get values between 30–40% here!)
  • How much potential license & storage can be saved or migrated
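As a rough sketch of the consumer analysis above: assuming the search strings of the knowledge objects have already been exported (for instance via the REST endpoints) into a plain list, a few lines of Python can count how often each index/sourcetype is referenced. The sample searches, the regex, and the function names are illustrative assumptions; real SPL (macros, eventtypes, wildcards, data models) needs more careful parsing.

```python
import re
from collections import Counter

# Matches index=foo, index="foo", sourcetype=bar, sourcetype = "bar:baz".
# Simplifying assumption: macros, eventtypes and IN (...) lists are ignored.
REF_PATTERN = re.compile(
    r'\b(index|sourcetype)\s*=\s*"?([\w:\-\*\.]+)"?', re.IGNORECASE
)

def extract_refs(search):
    """Return the set of (kind, value) pairs referenced by one search string."""
    return {(kind.lower(), value.lower()) for kind, value in REF_PATTERN.findall(search)}

def count_consumers(searches):
    """Count how many searches reference each index/sourcetype."""
    counts = Counter()
    for search in searches:
        counts.update(extract_refs(search))
    return counts

def untouched(ingested, counts, kind="sourcetype"):
    """List ingested sources of the given kind that no search references."""
    used = {value for (k, value) in counts if k == kind}
    return sorted(set(ingested) - used)

# Hypothetical sample data standing in for exported KO contents
saved_searches = [
    'index=wineventlog sourcetype="WinEventLog:Security" EventCode=4625 | stats count by user',
    'index=proxy sourcetype=squid | top limit=10 dest',
    'index=wineventlog EventCode=4688 | table _time, host, process',
]
consumers = count_consumers(saved_searches)
```

Comparing `consumers` against the list of sourcetypes actually ingested (the `untouched` helper) gives a first approximation of the unconsumed share mentioned above.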

#2 involves evaluating Cribl and other 'Observability Pipeline' technologies such as Vector (now part of Datadog). The idea is to avoid storing known-useless events or pieces of them (e.g., the description text in Windows event logs).
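To make the idea concrete, here is a minimal sketch of that kind of pre-indexing logic in plain Python. It is not Cribl or Vector configuration; the event shape (dicts with `sourcetype`, `EventCode`, and `Message` keys), the drop list, and the choice to keep only the first line of the message are all assumptions for illustration.

```python
# Hypothetical drop list: (sourcetype, EventCode) pairs that nobody consumes
DROP = {("WinEventLog:Security", 5156), ("WinEventLog:Security", 4658)}

def reduce_event(event):
    """Return a trimmed copy of the event, or None if it should be discarded."""
    key = (event.get("sourcetype"), event.get("EventCode"))
    if key in DROP:
        return None
    trimmed = dict(event)  # copy, so the caller's event is left untouched
    message = trimmed.get("Message")
    if message:
        # Keep only the first line; the long static description is dropped
        trimmed["Message"] = message.splitlines()[0]
    return trimmed

def reduce_stream(events):
    """Lazily yield only the events worth forwarding to the indexers."""
    for event in events:
        reduced = reduce_event(event)
        if reduced is not None:
            yield reduced

# Hypothetical sample events
events = [
    {"sourcetype": "WinEventLog:Security", "EventCode": 5156,
     "Message": "The Windows Filtering Platform has permitted a connection."},
    {"sourcetype": "WinEventLog:Security", "EventCode": 4624,
     "Message": "An account was successfully logged on.\nThis event is generated when a logon session is created."},
]
kept = list(reduce_stream(events))
```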

The last proposed research action is about evaluating lower cost Splunk alternatives, from traditional/cloud SIEMs to emerging Data Lake platforms such as Databricks and Snowflake.

And here's where things can get tricky.

Every SIEM problem is a use case design problem

Splunk is known for its focus on the SIEM market while still delivering well on application monitoring and other use cases.

However, the price to pay for log collection, storing, analytics and reporting capabilities is sometimes quite high. So the use case value should justify the investment. And here's one way to face that situation:

Data Engineering leaders should have a very good understanding of use cases and how they plan to leverage log data -before- it is collected.

This is not new; here's the same idea seen through a different lens:

There is little to no value in collected log data until someone starts to continuously consume an alert or a report based on it, generating fruitful outcomes or insights.

In a SIEM project, writing a query or crafting a report/dashboard is the very final engineering step before generating value from log data.

If we assume that must be the primary goal once data is ready, we can easily come to the conclusion that the query language or the developer interface/experience is as important as the other platform attributes.

No wonder MS Excel is still one of the most used interfaces today when it comes to easy and fast data analysis and reporting.

So how to query on a Data Lake?

Structured Query Language (SQL) is basically what I saw in most demos during that period. I immediately asked myself: how old is SQL?

Among some interesting references, I found this article particularly appealing, written by Paul Dix (InfluxDB author):

I don’t want to live in a world where the best language humans could think of for working with data was invented in the 70's.

That matches my frustration after attending multiple vendor demos. I-just-could-not-believe-it. Maybe I have been immersed in SPL for too long, but I was expecting to see some innovation in that regard. Nope. None.

I guess the last SQL statement I wrote was more than 10 years ago using PHP + MySQL to build a Syslog web front-end…

It's almost impossible to write any advanced query nowadays without getting into long nested, not to say nasty, join operations.

What about general purpose languages? That is, Python.

One of the demos I got was from a vendor using Snowflake in the backend. The product and the idea seem pretty cool! But wait… my team needs to write Python to implement every single detection?

Python developers will call that home, of course! However, I can barely find, enable and retain Splunk developers, let alone Python (and SQL) developers with security background.

Similarly, consider how hard it would be to hire a team of specialized Jupyter Notebook developers to work on SIEM use cases. Perhaps not the best strategy for big enterprises…

What makes a great SIEM/Hunting platform?

There are many attributes of course, but fostering a passionate user community is one of them. How to enable that? Making the platform freely available is definitely a good start; providing great UX is a must.

Go with the flow!

Splunk, Microsoft Sentinel (Log Analytics), Sumo Logic, CrowdStrike’s Humio/LogScale (for now, still Splunk under the hood!): they all leverage a flow-based functional processing language to query log data (time series).

That programming paradigm (dataflow) treats processing as a chain of functions, each consuming the previous one's output. SQL, in contrast, is based on relational algebra and operates on tables (sets of rows).

For instance, here’s a moving average calculated in SQL assuming each record holds the date (day) and its corresponding stock price:

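SELECT date, stock_price,  -- column names assumed from the description above
       ROUND(AVG(stock_price) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2)
       AS moving_average
FROM stock_values;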

And the same logic written in Splunk’s SPL (one of the ways):

| table _time, stock_price
| trendline sma3(stock_price) AS moving_average
| eval moving_average=round(moving_average, 2)

Only executing the first 2 lines would already generate some output.

The data (flow) piped to the trendline command generates a 3rd field (column) which is later rounded by the last command (eval).
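For readers more at home in a general-purpose language, the same flow can be imitated with chained Python generators, each stage consuming the previous one's output much like a '|' in SPL. This is only an analogy, not how Splunk executes a search; the function names and sample prices are made up, and emitting no average until 3 values have been seen is an assumption about how a 3-period simple moving average should behave.

```python
from collections import deque

def table(rows, *fields):
    """Keep only the named fields, loosely like SPL's `table`."""
    for row in rows:
        yield {f: row.get(f) for f in fields}

def trendline_sma(rows, field, period, out):
    """Add a simple moving average over the last `period` values of `field`."""
    window = deque(maxlen=period)
    for row in rows:
        window.append(row[field])
        # No value until the window is full (assumed behavior)
        avg = sum(window) / period if len(window) == period else None
        yield {**row, out: avg}

def eval_round(rows, field, digits):
    """Round `field` when it has a value, like `eval x=round(x, n)`."""
    for row in rows:
        value = row[field]
        yield {**row, field: None if value is None else round(value, digits)}

# Hypothetical sample data: one record per day
prices = [
    {"_time": "2022-01-03", "stock_price": 10.0, "volume": 100},
    {"_time": "2022-01-04", "stock_price": 11.0, "volume": 120},
    {"_time": "2022-01-05", "stock_price": 13.0, "volume": 90},
    {"_time": "2022-01-06", "stock_price": 12.0, "volume": 110},
]

# Each assignment adds one stage to the flow; nothing runs until list() pulls rows
flow = table(prices, "_time", "stock_price")
flow = trendline_sma(flow, "stock_price", 3, "moving_average")
flow = eval_round(flow, "moving_average", 2)
result = list(flow)
```

Stopping after the first stage (or the first two) still yields usable rows, which is the property that makes the piped style pleasant for interactive exploration.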

SPL's syntax was originally based upon the Unix pipeline ('|' char) and SQL, and is optimized for time series data.

The fact is that, besides its ability to collect and store data, Splunk is super powerful as an Analytics and Reporting engine alone, hence some comparisons to Tableau and Grafana.

It's super easy to generate killer charts and dashboards. The latter is now being improved with the Dashboard Studio (ReactJS).

Microsoft got that!

Last year I started playing with MS Sentinel and the experience was great — thanks almost entirely to KQL (Kusto Query Language) and Azure Data Explorer, not Sentinel itself.

Turns out it’s super easy to migrate as a Splunk developer, and it should be similar when it comes to other technologies using a similar approach.

Be sure to check this quick SPL-to-KQL Cheatsheet for Splunkers:

So my recommendation for Product Managers out there: be sure to consider a nice query language for your product! ✨

A community colleague is working on a (server-less) data lake project and has asked me to share which SPL commands could potentially be implemented in the product. This might come in the next article…

So to Data Lake or not?

Before anything: Data Lakes are addressing a bigger problem. It's sort of naive to compare such platforms to SIEMs.

Nevertheless, we seem to be going in the same direction we went with SIEMs in the past: collect it all and extract the value later.

Perhaps Anton Chuvakin is right again?


While it makes sense to have a common data platform for multiple consumers sharing the same data source (data duplication issue), collecting and storing the data is just part of the challenge.

Yes, it needs to be fast but also easy to use.

The following is an oversimplified version, assuming pipeline management is in place and data is selectively routed to the SIEM (Splunk):

Dotted line implies a potential 'default target route' for log data

The idea is to avoid Splunk for use cases NOT requiring full analytics capabilities (advanced reports, interactive dashboards, scoring models, etc) and simply store those logs in a lower cost platform/storage (S3).
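A minimal sketch of that routing decision, assuming a hypothetical mapping from sourcetype to destinations. The sourcetype names, the destination labels, and the choice of low-cost storage as the default route are illustrative, not a recommendation for any specific pipeline product.

```python
# Hypothetical routing table: sourcetypes that back real analytics use cases
ANALYTICS_ROUTES = {
    "WinEventLog:Security": ["splunk", "s3"],  # detections plus a cheap retention copy
    "proxy": ["splunk"],
}
# Assumed 'default target route': anything else goes to low-cost storage only
DEFAULT_ROUTE = ["s3"]

def route(event):
    """Return the list of destinations for one event."""
    return ANALYTICS_ROUTES.get(event.get("sourcetype"), DEFAULT_ROUTE)

def dispatch(events):
    """Group events per destination; a real pipeline would ship them instead."""
    outbox = {}
    for event in events:
        for dest in route(event):
            outbox.setdefault(dest, []).append(event)
    return outbox

# Hypothetical sample events
sample = [
    {"sourcetype": "WinEventLog:Security", "EventCode": 4625},
    {"sourcetype": "proxy", "url": "example.com"},
    {"sourcetype": "dhcp", "lease": "10.0.0.5"},
]
outbox = dispatch(sample)
```

With this table, only sourcetypes explicitly tied to a use case consume Splunk license; everything else lands in storage by default.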

Another approach would be to query the Data Lake directly from the SIEM interfaces. Is that feasible/scalable? Please share your thoughts!


