The Shift-Left strategy applied to Threat Detection
This is a quick one to share a recent win that might be applicable, or at least inspirational (why not?), to some, while I also share my unasked-for opinion on SOAR, from a detection engineer's perspective.
Before exploring the topic deeper, check if you agree on the following needs:
- We all need a proper, investigation driven case/ticket/incident management system. Extra points if it makes collaboration easier!
- We all need to provide reports on alerts or incidents handled. Bonus points if we can have custom and digestible metrics and charts in there.
- We all need to perform sometimes labor-intensive, post-alert actions such as analyzing email headers or checking for a domain's reputation.
- We all LOVE automation!
- We all HATE Alert Fatigue! (#1 hit in sales presos)
Now, how many of those problems does SOAR solve? Does it depend on the product? Does it depend on who implements or runs the product (skillset)? It depends on many factors, and I'm not arguing SOAR cannot help here.
It's hard to say, but I've seen quite a few, including Swimlane, Siemplify, and other enterprise-grade options. There are a few observations I can share:
- SIEM overlap: where's the limit of applicability? It comes as no surprise that many SIEM vendors have built or acquired/incorporated SOAR as part of their portfolio.
- It creates lots of expectations. Sometimes to the point one throws SOAR at every single new use case that pops up (same happened to SIEM BTW).
- Detection delivery cycles just grow, not to mention the troubleshooting of existing detections when complicated playbook logic is in place.
Oh the playbooks…
If you are into SIEM and not following Anton Chuvakin's work, you probably haven't been into SIEM for long.
It's pretty common for thought leaders like him to collect signals and input via polls. I once t̶r̶o̶l̶l̶e̶d̶ answered one of his Twitter polls like this:
At the end of the day, isn't that what many expect from their investment in SOAR, to reduce noise (and alert fatigue)?
So why not work on crafting better detections, then? That also seems to resonate well with the poll results.
The Shift Left Strategy
This year I came across an interesting concept widely used in SDLC called Shift Left Testing.
Yes, again! Just like an epiphany I had in 2017 when a Software Tester told me about Agile/Jira and how that could help our team better manage the detection use case lifecycle.
What is this concept about?
Shift Left is a practice intended to find and prevent defects early in the software delivery process. The idea is to improve quality by moving tasks to the left as early in the lifecycle as possible. Shift Left testing means testing earlier in the software development process.
I keep saying every single SIEM problem is a "Use Case Design" problem. This seems (again) to be applicable to the Software Development domain as well.
A CTI Use Case
That shift-left strategy isn't limited to testing; it applies to pretty much anything we can push towards the very beginning of the process or lifecycle, preventing issues down the line (noise, triage/SOAR overload, etc.).
Without stressing how many use cases we have in Threat Detection that could benefit from this, I'm going to share just one I've successfully delivered in a project I was leading (talking the talk versus walking the walk).
VT Hits as a strong indicator for alerting
This detection leverages the VirusTotal (VT) API, which allows us to check file hashes against its massive hash database.
Why would that fit into this strategy? Now, just assume an ordinary alert containing a file hash (an email attachment, for instance).
Usually, hash checks are performed AFTER an alert is generated, at the other (right) end. Also, most SOC analysts will only engage in an investigation or escalate it once they find enough positive hits in VT.
And don't forget this one: if the SOAR playbook does not observe enough hits, it sometimes auto-closes the alert — which is well received most of the time!
So why can't we anticipate that and save many cycles along the way?
Before you say "Hey, but my zero-day does not have hits!", let me tell you: Threat Detection is another engineering challenge. As defense specialists, we need to maximize high-fidelity alerts — given the resources we have.
Therefore, if a hash in an environment is linked to 30+ hits in VT, the likelihood of that being linked to a threat (or a preventive control gap) is high and should be assessed, regardless of an APT or a commodity threat case.
Here's the recipe:
1 - Define the scenarios in which you want to collect hashes from
This can be done at the EDR/endpoint level or in the SIEM itself, depending on the telemetry at your disposal. What does that mean in concrete terms?
Write queries that output unique hashes observed within the following scenarios, annotating the first/last time seen and the reason (scenario):
- Child processes and files dropped by scripts (PowerShell, WMI, VBS, etc)
- Child processes and files dropped by MS Office
- Processes establishing network connections (initially excluding connections towards known domains)
- Any low-prevalence file executed (track via hash, build the baseline with simple stats such as # of endpoints running a given hash value)
Of course, that list is endless, and depending on the scale of your environment and your SIEM resources, you can pretty much apply this to all hashes.
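To make the low-prevalence scenario concrete, here's a minimal Python sketch of such a baseline built from endpoint telemetry. All names and the `max_endpoints` threshold are illustrative assumptions, not part of the original project; in practice this would be a scheduled SIEM query feeding a summary/lookup.

```python
def low_prevalence_hashes(events, max_endpoints=3):
    """Build a simple prevalence baseline from (endpoint, sha256, timestamp) events.

    Returns hashes executed on few endpoints, with first/last seen and a reason tag,
    mirroring the summary/lookup schema described in the text.
    """
    endpoints_per_hash = {}
    first_seen = {}
    last_seen = {}
    for endpoint, sha256, ts in events:
        endpoints_per_hash.setdefault(sha256, set()).add(endpoint)
        first_seen[sha256] = min(first_seen.get(sha256, ts), ts)
        last_seen[sha256] = max(last_seen.get(sha256, ts), ts)
    return [
        {
            "sha256": h,
            "endpoints": len(eps),            # of distinct endpoints running this hash
            "first_seen": first_seen[h],
            "last_seen": last_seen[h],
            "reason": "low-prevalence",        # the scenario that collected the hash
        }
        for h, eps in endpoints_per_hash.items()
        if len(eps) <= max_endpoints
    ]
```

The same counting idea is trivial to express as a `stats dc(host)` style aggregation directly in the SIEM.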
2 - Build a SIEM query that consumes the output from #1
Now that you have such a list (summary/lookup), the challenge is de-duplicating the hash values and checking them against the VT database. There are many ways to do that, and the API docs are pretty easy to understand.
In my case, I simply leveraged a custom Splunk command, which is a Python script under the hood that fires POST requests against VT.
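Stripped of the Splunk plumbing, the core of such a script can be sketched with the standard library only. This is a hedged sketch, not the author's actual command: it assumes the legacy VT v2 file-report endpoint (consistent with the POST approach mentioned above; newer integrations would use API v3), and the lookup function needs a real API key to run.

```python
import json
import urllib.parse
import urllib.request

# Legacy VirusTotal v2 endpoint (assumption: the POST-based approach described above).
VT_URL = "https://www.virustotal.com/vtapi/v2/file/report"


def dedupe_hashes(hashes):
    """De-duplicate hash values (case-insensitive), preserving first-seen order."""
    seen = set()
    out = []
    for h in hashes:
        key = h.lower()
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out


def vt_file_report(file_hash, api_key):
    """Query VT for one hash; returns (positives, total, permalink)."""
    data = urllib.parse.urlencode({"apikey": api_key, "resource": file_hash}).encode()
    req = urllib.request.Request(VT_URL, data=data)  # POST because data is set
    with urllib.request.urlopen(req, timeout=30) as resp:
        report = json.load(resp)
    if report.get("response_code") != 1:  # hash unknown to VT
        return 0, 0, None
    return report["positives"], report["total"], report["permalink"]
```

In Splunk, the equivalent logic would live inside a custom search command iterating over the de-duplicated lookup rows.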
3 - Build a results summary with cache capability
The tricky part here is dumping those results to another summary/lookup and making sure you prevent recurring requests from happening by implementing a local cache mechanism.
That alleviates the performance impact while keeping you within the API request limits.
This is definitely the most challenging part and there are many approaches here. Again, the size or time scope of that cache will be limited to your environment resources.
A good metric to have here is the number of unique hashes seen from those scenarios per day, and when they appear for the first time within a given time window (say, 30d). This will help you define the cache size.
The contents of those results/cache entries should include, of course, the number of hits, the AV engine names, and the VT URL (alert context).
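One way to sketch that cache in Python is a small file-backed store with a TTL, so each hash triggers at most one API request per window. This is a minimal illustration under assumed names (a real deployment would likely use a Splunk KV store or lookup instead of a JSON file):

```python
import json
import time
from pathlib import Path


class VTCache:
    """Tiny file-backed cache: each hash hits the VT API at most once per TTL window."""

    def __init__(self, path="vt_cache.json", ttl_seconds=30 * 86400):
        self.path = Path(path)
        self.ttl = ttl_seconds
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, sha256):
        """Return the cached entry if still fresh, else None (meaning: query VT)."""
        entry = self.data.get(sha256)
        if entry and time.time() - entry["cached_at"] < self.ttl:
            return entry
        return None

    def put(self, sha256, positives, engines, permalink):
        """Store a VT result: hit count, flagging AV engines, and the VT URL (alert context)."""
        self.data[sha256] = {
            "positives": positives,
            "engines": engines,
            "permalink": permalink,
            "cached_at": time.time(),
        }
        self.path.write_text(json.dumps(self.data))
```

The TTL value is exactly where the unique-hashes-per-day metric mentioned above comes in: it tells you how big the cache grows for a given window.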
4 - Generate alerts based on VT results (# of hits, AV engines seen)
Build a rule or saved search (in Splunk or your SIEM) that consumes those results and alerts based on a predefined threshold or criteria.
It's pretty common to have some legit tools flagged in VT, as well as some minor or low-relevancy AV engines flagging unwanted cases. I cannot define this one for you, but I'm happy to share some ideas offline.
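One possible shape for that criteria, sketched in Python: require both a hit-count threshold (the 30+ figure from earlier) and agreement from at least one major engine, to filter out the low-relevancy-engine noise. The engine list here is purely illustrative; tune both it and the threshold to your environment.

```python
# Illustrative "major engine" list — an assumption, not a recommendation.
RELEVANT_ENGINES = {"Microsoft", "Kaspersky", "ESET-NOD32", "BitDefender", "Sophos"}


def should_alert(positives, flagged_engines, min_hits=30):
    """Alert only when the total VT hit count crosses the threshold AND
    at least one major engine is among those flagging the hash."""
    return positives >= min_hits and bool(RELEVANT_ENGINES & set(flagged_engines))
```

The same logic translates directly into a `where` clause on the results lookup in the saved search.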
Expanding it further
There's definitely space for SOAR when it comes to automation. Nevertheless, as with any new product introduced into your security arsenal, its use cases should be carefully scoped and designed.
If you made it this far, I suggest you also explore another concept I am working on called Hyper Query. In short, the output of the detection described above is simply treated as another indicator (input) to be processed as part of a scoring framework.
Another SUPER important point to highlight here, one that will push this concept even further, is Ingestion Pipeline Management, another key component of data engineering.
Think of it like this: what if we could flag those hashes upon ingestion? 😉
Hope you have enjoyed and happy (automated) hunting!