Githubify the SOC
Full disclosure: this post sat in draft for months. I recently stumbled upon Anton’s blog post and Augusto Paes de Barros' answer, so I decided to release my original draft without modification. It is not an answer to anyone, peace ✌️ :)
The few SOCs I know are suffering from their growth: expanding teams, legacy detection rules, backward-incompatible changes in engine updates, or uncontrolled performance. Meanwhile, the ruleset keeps growing in pure “fire and forget” mode.
Worse, SOCs need an easy way to accept contributions from various parts of the organization without endangering their detection pipeline: in the best of worlds, anybody could submit and deploy a detection rule and hand it over to the RUN team. Yet the current situation is extremely fragile: a bad query can overload your SIEM; one wrong filter, and here comes a storm of false positives to bulk-close.
In other words, they face a non-scalable lifecycle, poor quality assurance, regressions, and extreme fragility to change. These are exactly the kinds of problems that state-of-the-art developers have fixed over the last decade by adopting Agile principles: continuous integration and deployment, end-to-end accountability, unit testing, and a strong focus on the end user.
What is the solution for SOCs? Take the book The Phoenix Project, replace every reference to “developers” with “SOC”, and voilà, you have your roadmap.
Needs
To make this change, we first need to githubify the SOC, and we are far from it: we need to redesign the whole detection pipeline into a consistent process that takes a detection rule from idea to deployment in the SIEM.
Individual components exist:
- Sigma for writing the SIEM query
- ADS to document the alert and give the rationale behind the detection
- Elastic’s detection-rules or Splunk stories, which merge the query and its context into one file directly ingestible by their engine.
- GitHub, for closing the feedback loop and iterating faster: collaborative editing, peer review, rollback, continuous integration and deployment, bug reporting.
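To make the GitHub part more concrete, here is a minimal sketch of a check that a CI job could run on every pull request before a rule ever reaches the SIEM. It assumes Sigma rules have already been parsed into Python dicts (for example with a YAML loader, not shown) and only enforces Sigma's mandatory `title`, `logsource`, and `detection` fields, a `detection.condition`, and the allowed `level` values. The function name `lint_sigma_rule` and the `MAX_SELECTIONS` budget are hypothetical; the budget is only a crude stand-in for the performance guardrails a real pipeline would need.

```python
from typing import Any

# Fields the Sigma specification lists as mandatory for every rule.
REQUIRED_FIELDS = ("title", "logsource", "detection")
LOGSOURCE_KEYS = ("category", "product", "service")
VALID_LEVELS = {"informational", "low", "medium", "high", "critical"}
# Hypothetical performance guardrail: cap the number of selections per rule.
MAX_SELECTIONS = 10


def lint_sigma_rule(rule: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the rule may be merged."""
    problems: list[str] = []
    for field in REQUIRED_FIELDS:
        if field not in rule:
            problems.append(f"missing required field '{field}'")

    logsource = rule.get("logsource")
    if logsource is not None and (
        not isinstance(logsource, dict) or not any(k in logsource for k in LOGSOURCE_KEYS)
    ):
        problems.append("logsource needs at least one of category/product/service")

    detection = rule.get("detection")
    if detection is not None:
        if not isinstance(detection, dict):
            problems.append("detection must be a mapping")
        else:
            if "condition" not in detection:
                problems.append("detection has no 'condition'")
            selections = [k for k in detection if k != "condition"]
            if not selections:
                problems.append("detection defines no selection")
            elif len(selections) > MAX_SELECTIONS:
                problems.append(f"too many selections ({len(selections)} > {MAX_SELECTIONS})")

    level = rule.get("level")
    if level is not None and level not in VALID_LEVELS:
        problems.append(f"unknown level '{level}'")
    return problems


example_rule = {
    "title": "Security audit log cleared",
    "logsource": {"product": "windows", "service": "security"},
    "detection": {"selection": {"EventID": 1102}, "condition": "selection"},
    "level": "high",
}
print(lint_sigma_rule(example_rule))  # → []
```

In a CI job, a small wrapper would load every rule file in the repository, run this check, and exit non-zero as soon as one list is non-empty, so a broken rule cannot be merged, let alone deployed.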
Nonetheless, we are still missing the little thing that glues these components together. Elastic’s detection engine looks promising, but it lacks the GitHub workflow and is, obviously, limited to Elastic’s stack (full disclosure: that is not really a con, since I don’t believe in “universal” solutions anyway).
Detection as code
If I had a magic wand, we would realize Donald Knuth’s dream: literate programming, where the detection logic is embedded in the document itself (in ADS format). And actually, it looks like Red Canary has been doing exactly that for years (see this screenshot), gg! I also think I saw something similar from Expel.io in one of their presentations.
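As a rough illustration of the literate-programming idea, the sketch below treats an ADS-style Markdown document as the single source of truth: it checks that the document carries a few ADS section headings, then "tangles" out its Python code fences so the detection logic can be tested and deployed from the very same file. The heading subset, the helpers `missing_sections` and `tangle`, and the `detect` function inside the document are assumptions made for illustration, not how Red Canary or Expel actually do it.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside Markdown
# A subset of the ADS framework headings (assumption: trimmed for brevity).
REQUIRED_SECTIONS = ("Goal", "Strategy Abstract", "False Positives", "Response")
HEADING = re.compile(r"^#+[ \t]+(.+)$", re.MULTILINE)
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def missing_sections(doc: str) -> list[str]:
    """Return the required ADS headings absent from the document."""
    headings = {h.strip() for h in HEADING.findall(doc)}
    return [s for s in REQUIRED_SECTIONS if s not in headings]


def tangle(doc: str) -> str:
    """Concatenate every python code fence of the document into one source string."""
    return "\n\n".join(block.strip() for block in CODE_BLOCK.findall(doc))


# Hypothetical ADS document where the detection logic lives next to its rationale.
ADS_DOC = "\n".join([
    "# Goal",
    "Detect clearing of the Windows Security audit log.",
    "# Strategy Abstract",
    FENCE + "python",
    "def detect(event: dict) -> bool:",
    "    return event.get('EventID') == 1102",
    FENCE,
    "# False Positives",
    "Legitimate log maintenance by administrators.",
    "# Response",
    "Confirm with the asset owner and preserve the remaining logs.",
])

missing = missing_sections(ADS_DOC)
if missing:
    raise SystemExit(f"incomplete ADS document, missing: {missing}")
namespace: dict = {}
exec(tangle(ADS_DOC), namespace)  # "build" step: load the tangled detection logic
print(namespace["detect"]({"EventID": 1102}))  # → True
```

In a real pipeline, the tangled output would rather be written to a `.py` file, unit-tested, and deployed, while the document it came from is exactly what the analyst reads during triage.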
Lately, I discovered Panther, which looks 🤩, but as far as I could google, its Windows event log use case is not ready yet (I wonder what happened to panther#1101). There is also the very promising Grapl, which fulfills all the requirements!
Now, imagine: what if, instead of writing SPL in Splunk, you wrote plain, stateless Python code that an AWS Lambda function executed automatically whenever a new .evtx file was uploaded to an S3 bucket? Instead of learning a specific SIEM’s siloed language and working around its quirks and limitations, you would use ordinary Python and its extensive libraries, interfaced with other tools and services. Suddenly, you could leverage all the progress made by Agile practitioners over the last decade: unit testing, performance profiling, easy refactoring, code deployment, static type systems, code analytics, telemetry, etc.
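Here is a minimal sketch of what such stateless detection code could look like. The entry point follows the usual Lambda `handler(event, context)` convention and reads the bucket and object key from the S3 event notification records. Downloading and parsing the `.evtx` file is deliberately left to a hypothetical `load_events(bucket, key)` callable (in practice, something like `boto3` plus an EVTX parser), so the detection itself stays a pure function that can be unit-tested without AWS. The rule (Security event 1102, "audit log cleared"), the flattened event fields, and the alert format are illustrative assumptions.

```python
import unittest
import urllib.parse
from typing import Any, Callable, Iterable

Event = dict[str, Any]


def detect_log_cleared(events: Iterable[Event]) -> list[Event]:
    """Pure, stateless rule: flag Security events with ID 1102 (audit log cleared)."""
    return [e for e in events if e.get("Channel") == "Security" and e.get("EventID") == 1102]


def make_handler(load_events: Callable[[str, str], Iterable[Event]]) -> Callable[..., dict]:
    """Build a Lambda-style handler; `load_events(bucket, key)` is a hypothetical loader."""

    def handler(event: dict, context: Any = None) -> dict:
        alerts = []
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys are URL-encoded in S3 event notifications.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            for hit in detect_log_cleared(load_events(bucket, key)):
                alerts.append({
                    "rule": "audit_log_cleared",
                    "source": f"s3://{bucket}/{key}",
                    "computer": hit.get("Computer"),
                })
        return {"alerts": alerts}

    return handler


def fake_loader(bucket: str, key: str) -> list[Event]:
    # Stand-in for downloading and parsing the .evtx file; no AWS needed in tests.
    return [
        {"Channel": "Security", "EventID": 4624, "Computer": "WS01"},
        {"Channel": "Security", "EventID": 1102, "Computer": "WS02"},
    ]


class DetectLogClearedTest(unittest.TestCase):
    def test_only_log_clearing_is_flagged(self) -> None:
        s3_event = {"Records": [{"s3": {
            "bucket": {"name": "evtx-drop"},
            "object": {"key": "WS02/Security+2020.evtx"},
        }}]}
        alerts = make_handler(fake_loader)(s3_event)["alerts"]
        self.assertEqual(alerts, [{
            "rule": "audit_log_cleared",
            "source": "s3://evtx-drop/WS02/Security 2020.evtx",
            "computer": "WS02",
        }])
```

Run the test with `python -m unittest`. In the deployed function, the module would expose something like `handler = make_handler(real_loader)`, where `real_loader` is whatever S3 download and EVTX parsing you choose; injecting the loader is what keeps the rule testable in CI.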
Furthermore, coupled with Jupyter, this could unlock amazing capabilities. I haven't had the chance to test Azure Sentinel yet, but its Notebooks seem to be spot on.
Limits
In this utopia, I guess this would be great for detection (or ad hoc hunting), but insufficient for investigations: we would miss the interactivity of a (good) SIEM. And since nothing is indexed, how would you handle an incident investigation? Retrieve the raw .evtx from the S3 bucket and somehow reprocess it into a Splunk instance for further inspection?
At the end of the day, are we doomed to redevelop “Agile enhancers” for each SIEM technology? Or will we move “the SIEM” onto common ground like Python, to benefit from everything that has already been developed and battle-tested?
I hope I am wrong, but unfortunately I am not optimistic about the latter, because “nobody gets fired for buying IBM”. Who will have the courage to tell their management “hey, screw $SIEM_VENDOR, I will do everything with a shiny new way of working used by almost nobody in the industry”?