Friday, February 21, 2014

Ostrich Approach

Although it is far less common than it used to be, there are still some organizations that take an "ostrich" approach to information security. It's no surprise that I much prefer an analytical approach to information security (that is the name of this blog, after all). An analytical approach involves the collection and analysis of network traffic data, using a variety of tactics and techniques, for the purposes of detection, alerting, and forensics.

So what is the "ostrich" approach to information security? There are still some people who believe that they can deploy a few different alerting technologies, configure them with a bunch of different signatures, and then sit back and watch the alert queue. This tactic is a start, and in fact, it's an important part of an enterprise's overall security operations and incident response posture. Why isn't this enough? Because you're essentially looking for "known knowns". In other words, you will only be alerted to what you already know to look for. You may have a great set of signatures, but they are never going to cover everything you need to be concerned about. You're sticking your head in the sand regarding everything else flying across the network (hence the term "ostrich" approach).

What's the issue with this approach? The most interesting and worrisome activity lies in the "unknown unknowns" -- things we don't yet know we need to be concerned with. There will come a time when you will need to analyze data and/or perform network forensics for something that wasn't a "known known". When this time comes, you want to be sure that you have the data and the situational awareness required to be successful. Can you then take what you've learned and incorporate it into your alerting? Absolutely, but only if you have the necessary data to allow you to draw those conclusions.

Don't be an ostrich. It will come back to bite you.

Thursday, February 20, 2014

Virtuous Feedback Loop

Many large enterprises have several different groups contained within their security organization. Often included among these groups are the security architecture, security engineering, and security operations/incident response groups. Let's examine briefly what each group does.

Security architecture: This group's main function is to identify security technologies and design solutions that address operational gaps and needs.

Security engineering: This group's main function is to engineer, deploy, and maintain the solutions that the architecture group has designed.

Security operations/incident response: This group's main function is to use security technologies to perform security operations and incident response, as well as to identify operational gaps and needs.

So, as you can see, these three groups are all tightly coupled and depend on one another in order to function at peak effectiveness. There is a virtuous feedback loop here. Architecture relies on security operations/incident response to identify operational gaps and needs. Engineering relies on architecture to design solutions to be implemented. Security operations/incident response relies on engineering to implement technologies required to perform security operations and incident response.

This may seem obvious, but it has always amazed me how many organizations struggle with this inter-dependency. For security executives, it can be helpful to take a moment and ensure that architecture, engineering, and security operations/incident response are all working smoothly together. Having that virtuous feedback loop in place goes a long way toward helping to deliver the right technology solutions that address the most pressing operational gaps and needs. It is difficult for any organization to meet its full potential without technology that suits its true needs.

Wednesday, February 19, 2014

Intelligent Detection Requires Intelligent Alerting

Many of my recent blog postings have focused on incident response, written from the vantage point of an organization already in the midst of a response. In other words, they assume that an organization has already detected or been notified of a breach. Some people have asked, rightly so, about the detection piece of the picture. This is an excellent question, and one I would like to address in this posting.

At a high level, the modern threat landscape necessitates fairly sophisticated detection techniques. Several years ago, it was often possible to identify malicious activity by looking for traffic to a specific IP address or domain. Those days have left us, and as we've discussed previously, attackers have moved up the stack. Modern detection techniques need to be granular, flexibly leveraging layer 7 meta-data fields deep within the packet. At the same time, alerts, and the rule logic that goes into those alerts, need to be sophisticated and tightened to home in on the specific activity of concern without generating high volumes of false positives.

To illustrate this concept, consider these two different approaches to alert writing. The first approach uses a domain name watch list approach to alerting, which was very popular about a decade ago, and to some extent remains popular today. The second approach uses the domain name watch list as one element of a much more complex logical structure. In my estimation, this is the new frontier of alert writing, and it is already being used in some organizations with more mature incident response functions.

Approach 1: Alert on all traffic to dynamic DNS provider domain names.

Approach 2: Alert on all Office documents leaving the network destined for dynamic DNS provider domain names that currently resolve to public, routable IP addresses, but previously resolved to private, non-routable IP addresses.

As you can see, approach 2 is far more likely to zero in on suspect traffic than approach 1, which is far more likely to generate high volumes of false positives. So why don't more organizations write alerting in the style of approach 2, rather than in the style of approach 1? The answer is often that many organizations do not have technology solutions that allow them to write the incisive alerting they need.
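To make the contrast concrete, here is a minimal sketch of the two approaches expressed as alert predicates over layer 7 meta-data records. The record fields (dst_domain, direction, mime_type), the dynamic DNS domain list, and the resolution history structure are all hypothetical stand-ins for illustration, not any particular product's schema:

```python
import ipaddress

# Illustrative sketch only: the record fields and data structures below are
# hypothetical, not tied to any particular network forensics product.

OFFICE_MIME_TYPES = {
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def approach_1(record, dyndns_domains):
    """Alert on any traffic destined for a dynamic DNS provider domain (noisy)."""
    return record["dst_domain"] in dyndns_domains

def approach_2(record, dyndns_domains, resolution_history):
    """Alert only when an Office document leaves the network for a dynamic DNS
    domain that currently resolves to a public address but previously resolved
    to a private, non-routable one (tight, far fewer false positives)."""
    if record["dst_domain"] not in dyndns_domains:
        return False
    if record["direction"] != "outbound":
        return False
    if record["mime_type"] not in OFFICE_MIME_TYPES:
        return False
    # resolution_history maps domain -> list of IP strings, oldest first
    ips = [ipaddress.ip_address(ip)
           for ip in resolution_history.get(record["dst_domain"], [])]
    if not ips:
        return False
    return (not ips[-1].is_private) and any(ip.is_private for ip in ips[:-1])
```

The particular fields don't matter; what matters is that each additional layer 7 condition removes a large class of benign traffic from the alert queue without losing the activity of concern.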

Based on my experience designing and implementing alerting and workflow over the past decade-plus, I believe that technologies allowing the analyst to design sophisticated, granular, incisive, and intelligent alerts are long overdue. Isn't it about time we allowed our alert writers to move up the stack with the attackers?

Friday, February 14, 2014

Why We Love Databases

Databases are one of the most common, popular, and widely deployed technologies in use today.  Databases support a wide variety of business and technology purposes in almost every organization.  You might ask yourself why I'm talking about databases, rather than a topic more closely related to security operations and incident response.  I believe that taking a look at why we love databases will make the connection to those topics clear.

It seems to me that the reason we love databases so much is because they scratch our burning itch to turn data into information.  It's as easy to get data out of the database as it is to put data into the database.  Furthermore, we can get out precisely the data we are interested in, with little to no data we are not interested in.  Through this process, we create information from data.  Why is this?  Let's examine the process someone might go through when interacting with a database:
  • Understand the business need (i.e., what is the desired outcome)
  • Formulate a human-language question to ask of the data (i.e., what question, when asked, will achieve the desired outcome)
  • Translate the human-language question into SQL (i.e., which data repositories, and what query syntax, will lead to the desired outcome)
  • Receive a timely and accurate answer (i.e., obtain the correct results in seconds and minutes, rather than hours and days)
If we abstract this model more generally, we see that, in fact, the steps described above also fit the network forensics model quite well.  As described in previous blog postings and elsewhere, network forensics is about asking targeted, incisive questions and, in turn, receiving timely and accurate answers.  When we look at it from this angle, we see that a powerful, flexible query language is a must-have for performing network forensics.
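To make that concrete, here is a minimal sketch of the same question-to-query workflow applied to network traffic meta-data.  The table and column names (http_metadata, ts, src_ip, dst_domain, bytes_out, first_seen) are hypothetical, and SQLite simply stands in for whatever repository an enterprise actually uses:

```python
import sqlite3

# Hypothetical schema: http_metadata(ts, src_ip, dst_domain, bytes_out, first_seen)
# Question: "Which internal hosts sent more than 10 MB to domains first seen in
# the last 24 hours?"  Translated into SQL and executed:

conn = sqlite3.connect("network_metadata.db")
rows = conn.execute(
    """
    SELECT src_ip, dst_domain, SUM(bytes_out) AS total_out
    FROM http_metadata
    WHERE ts >= datetime('now', '-1 day')
      AND first_seen >= datetime('now', '-1 day')
    GROUP BY src_ip, dst_domain
    HAVING SUM(bytes_out) > 10 * 1024 * 1024
    ORDER BY total_out DESC
    """
).fetchall()

for src_ip, dst_domain, total_out in rows:
    print(f"{src_ip} -> {dst_domain}: {total_out} bytes out")
```

The particular schema matters far less than the property it illustrates: a precise question, expressed in a precise query language, coming back with only the rows that matter.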

When the next breach hits, will you be able to issue targeted and incisive queries over your network traffic data and receive timely and accurate answers?  If not, then it pays to think about how you will answer the tough questions when they come.

Thursday, February 13, 2014

Whack-a-mole

As a member of several information sharing groups, and as someone who has worked with many different enterprises in the security operations and incident response area, I'd like to discuss the concept of whack-a-mole. Allow me to explain. On any given day, an organization will detect or receive notification regarding multiple infected systems on the network. The organization will then perform incident response accordingly, as we might expect. Those of us who have worked in the field of incident response for a while recognize this as a routine part of our day -- just like drinking our morning coffee. As part of our incident response, we will improve our controls to prevent what happened today from happening tomorrow. Makes sense, right? Yes, absolutely -- except for the fact that tomorrow, the attackers will be onto something else that we probably don't have controls in place for.

If we take a step back, we see that from this perspective, incident response can begin to feel a bit like the arcade game whack-a-mole. Kill 12 infected systems today and their associated infection vectors, and tomorrow, 15 more will pop up. I'm not suggesting that we abandon this -- incident response absolutely needs to be performed for systems we know are infected. Rather, I'm suggesting that we think about treating the cause of the infections, rather than the symptoms. If we can treat the cause of the infections, we will have far fewer symptoms to treat.

One thing I often see missing from discussions in information sharing groups or from within the enterprise incident response function is root cause identification. In other words, what specifically enabled or facilitated the infection? It's important to remember that root cause and infection vector are two different things. Identifying the infection vector allows us to know how the malicious payload was delivered. Identifying the root cause allows us to understand why the malicious payload succeeded in infecting the system. There is a subtle difference there. Consider the all-too-common example of a drive-by re-direct attack delivering an exploit to a vulnerable version of Java. The infection vector tells us that an unsuspecting user (the innocent bystander) was re-directed to a malicious site that delivered an exploit. If we block the malicious site, there will be another one (or another 1,000) tomorrow. The root cause, on the other hand, tells us that the version of Java on the infected system was vulnerable, and it is upon this that the attackers preyed.

So how can we identify the root cause of infection? In order to identify root cause, we need to re-construct exactly what transpired during the infection to fully understand the sequence of events. In order to fully understand the sequence of events, we need to precisely extract only the relevant network traffic. In order to precisely extract only the relevant traffic, we need to issue precise, targeted, and incisive queries across the network traffic data. In other words, we need to perform network forensics to re-construct and understand what occurred.
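As a hedged illustration of the kind of query involved, the sketch below assumes HTTP meta-data records with a user_agent field (the field name and record format are hypothetical); Java's HTTP client has historically identified itself with a user-agent string such as "Java/1.7.0_45", which makes a census of client versions straightforward once the meta-data is queryable:

```python
from collections import Counter

def java_version_census(http_records):
    """Tally Java client versions seen initiating outbound HTTP requests,
    based on the User-Agent string (e.g., "Java/1.7.0_45")."""
    versions = Counter()
    for record in http_records:
        ua = record.get("user_agent", "")
        if ua.startswith("Java/"):
            versions[ua.split("/", 1)[1]] += 1
    return versions

# versions.most_common() then shows, at a glance, how much of the environment
# is still running the build that the exploit kits are targeting.
```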

What can we do once we identify the root cause? We can work to address it. For example, if vulnerable versions of Java are the root cause of 80% of our malicious code infections, we can work with IT to understand why we are running a vulnerable version of Java and correct that. Think of the ramifications here: By performing network forensics to identify root cause and subsequently addressing the root cause, we could potentially achieve a five-fold decrease in malicious code infections. How do I know this? I've seen it happen with my own eyes inside an enterprise.

As an added benefit, when there are fewer commodity malicious code infections to respond to, we can focus on other questions that are often overlooked for lack of time. For example, we might want to analyze our data looking for more sophisticated threats, or perhaps understand if we have particularly unusual traffic on our network that requires additional investigation. There is no shortage of good ways to invest newly liberated human resources.

Root cause analysis is a great thing -- unless you like playing whack-a-mole, that is.

Wednesday, February 12, 2014

Granular Indicators

This week, Kaspersky Lab unveiled its research on the Careto APT malware (aka "The Mask"). The analysis was presented at Kaspersky's Security Analyst Summit, and a detailed, 65-page report entitled "Unveiling 'Careto' - The Masked APT" was also released. There has already been much discussion and analysis of this report, and I will not recycle what has already been discussed in other forums. Instead, I would like to highlight something regarding the indicators of compromise (IOCs) used in the various attacks.

If we look at the IOCs used in the attacks and detailed in the Kaspersky Lab report, we see that many of the IOCs are incredibly granular. For example, the URLs used as part of the exploit, payload delivery, callback, and command and control (C2) phases of the attacks are extremely specific and very detailed. The level of specificity and detail goes right down to the last character of the URL in many cases. This is, in its essence, an example of attackers "Moving up the Stack". In this example, the difference between "routine noise" and a successful exploit/compromise lives deep inside the packet. This subtle difference can only be identified by exploiting layer 7 enriched meta-data, and the ability to differentiate here can result in rapid detection and response versus staying compromised for months on end.
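To illustrate the difference that granularity makes, here is a minimal sketch; the IOC value shown is a placeholder rather than an actual Careto indicator, and the matching functions are illustrative, not any product's detection logic:

```python
from urllib.parse import urlparse

# Placeholder indicator for illustration only -- not an actual Careto IOC.
FULL_URL_IOCS = {
    "http://c2.example.com/cgi-bin/commcgi.cgi",
}

def domain_level_match(url, iocs=FULL_URL_IOCS):
    """Coarse: flags every URL on a listed domain, benign or otherwise."""
    listed_domains = {urlparse(ioc).netloc for ioc in iocs}
    return urlparse(url).netloc in listed_domains

def full_url_match(url, iocs=FULL_URL_IOCS):
    """Granular: flags only the exact exploit/C2 URL, down to the last character."""
    return url in iocs
```

A platform limited to domain- or IP-level matching can only ever emulate the first function; the second requires layer 7 meta-data that preserves the full URL.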

The information security community seems to be in agreement that Careto was written and weaponized by sophisticated attackers. Does your network forensics technology allow you to move up the stack with the attackers? If not, how will you identify sophisticated malicious code intrusions perpetrated by attackers who have expertise in "Moving up the Stack"?

Tuesday, February 11, 2014

The Scarcest Resource

As with any profession, security operations is subject to constraints and limitations. Traditionally, there have been several factors that have posed challenges to enterprise security operations. These include:
  • Data
  • Processing Power
  • Storage
  • Technology
  • Process
  • Analysts
Let's examine these constraints one by one.

Data: In the early days of security operations, it was difficult for enterprises to collect the data necessary for security operations, incident response, and network forensics for a variety of reasons. As we all know, this is no longer the case. In fact, we find ourselves in quite the opposite situation nowadays. The velocity, volume, and variety of data collected in the enterprise's data repository of choice are all higher than ever.

Processing Power: Two decades ago, we had to worry about having too many firewall rules or too many mail filters because of processing power. Per Moore's law -- strictly an observation that transistor counts double approximately every two years, with processing power roughly following suit -- twenty years represents about ten doublings, making today's processors on the order of 1,000 times more powerful than those of two decades ago. The shackles of processing power limitations have been lifted.

Storage: Although storage can be made much more plentiful than was possible years ago, it is still relatively expensive. As such, it is a resource to be used wisely. Lots of storage is a necessity for data retention, but using it wisely (giving preference to data of higher value) can lower cost for the same retention or increase retention for the same cost.

Technology: Whereas years ago, security professionals needed to cobble together network forensics tools with duct tape, chewing gum, and band-aids, it is now possible to purchase commercial tools for many security operations and incident response needs. Does technology address every need that an organization might have? No. Is today's technology perfect? Of course not. But, it is far better than it used to be.

Process: It was once the case that incident response was a new field where it was difficult to find guidance and formalized processes. This is no longer the case. The incident handling life cycle and incident response process are both formalized, and many enterprises have a rigorous and formal incident response process to follow both during the course of normal security operations and during a breach response.

Analysts: Unfortunately, the number of analysts working within an organization's security operations and incident response function has not kept pace with the growing demands of the function. There are several reasons why this is the case, but the difficulty in finding qualified professionals and persistent budget limitations are two of the biggest reasons.

If we look across people, process, and technology, it seems that people -- the analysts -- are the scarcest resource. This should come as no surprise to those of us working day to day in the security operations and incident response field. Given the scarcity of our human resources, don't we owe it to ourselves and our organizations to choose process and technology that streamline workflow, reduce inefficiencies, and optimize the analyst's cycles?

Monday, February 10, 2014

Host Data

Network traffic data provides a wealth of insight into all of the traffic that has crossed the network. It is an excellent data source through which we can understand precisely what is traversing our networks. But what do we do when we need to understand something that may or may not have occurred on a host? For example, consider these questions:
  • Did the exploit that I just saw fly across the network succeed?
  • Did the malicious binary downloaded by host X successfully execute and maintain persistence?
  • What process on host X was responsible for the malicious command and control activity I just observed in the network traffic?
  • What activity took place on the host to stage the data I just saw exfiltrated, before it left the network?
The answers to these and other important questions come from the correlation of host data with network data. The network data is the data of record regarding what crosses the network. Once those bits and bytes disappear "over the hill" and make their way onto the host, we lose sight of them. This is where host data can provide us additional information to complete the picture and help us correlate information. Examples of host data include host-based intrusion detection systems (HIDS), anti-virus (AV), Windows security event logs, and others.

If we revisit the questions posed above, we can imagine answering them as follows:
  • The exploit destined for host X at time T1 (network data) successfully exploited process Z at time T2 on host X (host data).
  • The malicious binary downloaded by host X at time T1 (network data) successfully executed on host X at time T2, runs as process Z, and as of time T3, is still maintaining persistence on host X (host data).
  • Process Z on host X (host data) initiated the malicious command and control traffic to site S observed at time T (network data).
  • Documents A, B, and C were compressed and encrypted by process Z into file F at time T1 on host X (host data). File F was exfiltrated from the network to site S at time T2 (network data).
That is a level of certainty and knowledge that is, unfortunately, all too rare in the security operations and incident response realms. That level of precision can only come from the timely and accurate correlation of host data with network data, since no single vantage point can reconstruct the full sequence of events on its own. Pretty neat stuff.
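As a minimal sketch of the idea, assume normalized event dictionaries from the network sensor and from host sources (HIDS, AV, Windows security event logs), each carrying a timestamp and a host identifier; the field names are hypothetical:

```python
from datetime import timedelta

def correlate(network_events, host_events, window=timedelta(minutes=5)):
    """Pair each network event with host events on the same host that occur
    within `window` afterward -- e.g., an exploit seen on the wire at T1 and
    the process activity it produced on that host at T2."""
    matches = []
    for net in network_events:
        for host in host_events:
            same_host = host["host"] == net["dst_host"]
            in_window = net["ts"] <= host["ts"] <= net["ts"] + window
            if same_host and in_window:
                matches.append((net, host))
    return matches
```

In practice this join happens inside an analysis platform rather than a nested loop, but the principle -- same host, bounded time window -- is exactly what produces the narratives above.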

Wednesday, February 5, 2014

The Old Model

In the early days of security operations and incident response, organizations collected every network traffic log they could get their hands on.  There were many reasons why this was done, but some of those reasons included:
  • It was not clear which data did or did not provide value to security operations and incident response
  • It was not clear when, or how often, the data should be reviewed and analyzed, or how exactly to review and analyze it
  • The velocity at which the data streamed into centralized collection points was far lower than it is today
  • The volumes of data being collected were far lower than they are today, in part because network speeds were lower, and in part because networks were less well instrumented for collection
  • The variety and diversity of the data being collected were far lower than they are today, in part because networks were less well instrumented for collection, and in part because there were fewer specialized technologies collecting data
Organizations followed this model through the years, and for good reason -- there was no reasonable alternative.  Over the years, the incident response community has grown, organizational knowledge has increased, and capabilities have matured.  We are now at the point where we can, with some effort, assess the value of each available data source to security operations and incident response.  The incident response process and the incident handling life cycle are both well documented and well understood.  The velocity, volume, and variety of data have increased tremendously and continue to increase.

All of these factors contribute to the new reality -- that for security operations, incident response, and network forensics purposes, it is no longer practical to collect every available data source.  The operational complexity, workflow inefficiency, storage requirements, and query performance simply do not allow for it.  Rather, each data source should be evaluated based upon its value-add to security operations, incident response, and network forensics, weighed against the volume of data it produces.  This is something we routinely do in other aspects of our lives -- we opt to carry one $20 bill rather than 80 quarters, because it scales better.  Why shouldn't security operations use the same approach?

Critics of this approach will say that if they omit certain types of data from their collection, they run the risk of losing visibility and/or not being able to perform incident response.  To those critics, I would ask two questions: 1) What makes you so certain that you cannot retain the same level of visibility using fewer data sources of higher value?  And, 2) If it takes 8 hours to query 24 hours' worth of non-prioritized log data looking for the few log entries that are relevant, are you really able to perform timely and accurate incident response?  Clearly, there is a balance that needs to be struck.  In these cases, I am a big fan of the Pareto rule (sometimes called the 80/20 rule).  I have seen organizations that don't ever look at 80% (or more) of the log data they collect.  So, with 20% of that data, the same visibility can be retained, and as an added bonus, retention can be increased five-fold (say from 30 days to 150 days) at the same storage cost.  With the need to be ready to perform rapid incident response as critical as ever, it pays to think about how less allows us to do more.
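To make the arithmetic concrete, here is a trivial sketch of that trade-off; the volume figures are made up for illustration:

```python
def retention_days(storage_budget_gb, daily_volume_gb):
    """Days of retention a fixed storage budget buys at a given daily volume."""
    return storage_budget_gb / daily_volume_gb

daily_volume_gb = 1000.0                 # hypothetical: 1 TB of logs per day
budget_gb = 30 * daily_volume_gb         # budget sized for 30 days of everything

print(retention_days(budget_gb, daily_volume_gb))        # 30.0 days, all data
print(retention_days(budget_gb, daily_volume_gb * 0.2))  # 150.0 days, top 20% by value
```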

Tuesday, February 4, 2014

Seconds and Minutes

Richard Bejtlich, an industry thought leader on Incident Response, advises us that our goal in incident response should be one hour from detection of an incident to its containment.  In other words, once we learn of a breach or intrusion, we should seek to have it contained within 60 minutes. That is a noble goal, and I would like to take a moment to discuss this concept.

To better understand what it means to go from detection to containment in one hour, let's begin by reminding ourselves of the incident response/incident handling life cycle:
  • Detection
  • Analysis
  • Containment
  • Remediation
  • Recovery
  • Lessons Learned
For the purposes of this post, let's assume that the detection piece is in place.  In other words, we have either detected an intrusion through our own alerting, or we have been notified by an external entity or third party in a timely manner.  Looking at the incident response life cycle, we see that before we can think about containment, we must perform analysis.  This is actually intuitive, as before we can perform containment, we need to understand what exactly needs to be contained.  The process by which we understand what needs to be contained is called analysis.

In this context, analysis may consist of network forensics, malware forensics, and/or media forensics, depending on the incident.  Let's place malware forensics and media forensics to the side for a moment and think about network forensics.  As has been discussed in previous blog postings and elsewhere, network forensics provides the means through which timely, accurate answers to important questions are uncovered in the enterprise's network data.  If we take a step back, in order to successfully answer the right questions in one hour, a few things need to be in place:
  • The enterprise's network data has been collected by its network forensics platform at all required network points of presence with no data loss
  • The network forensics platform allows the incident response team to ask targeted, incisive questions of the data and receive relevant answers (necessitates a powerful and flexible query language)
  • The network forensics platform provides answers to the relevant questions in seconds and minutes, rather than in hours and days (necessitates performance at enterprise scale)
This is a tall order for most enterprises, mainly because the three points listed above form the modern network forensics frontier.  When an incident hits, can your organization perform incident response in seconds and minutes, or will you need hours and days?  To me, it seems important to answer that question honestly and realistically.  Once we are honest with ourselves, a strategy for getting to seconds and minutes can be designed and implemented -- and only then can we truly be prepared to respond that quickly.

Saturday, February 1, 2014

O&M

When people began moving from the cities to the suburbs in the post-war United States in the 1950s, new infrastructure was built to serve the shifting population.  The infrastructure served its population well for 50 years or so, until the 2000s, when the physical lifetime of water mains, electric power lines, and other infrastructure was reached.  What people quickly realized is that although money and resources had been allocated to build and deploy infrastructure, money and resources had not been allocated to operate and maintain the infrastructure for the long term.  In other words, O&M would be required to repair or replace the aging infrastructure, but the resources for that O&M would have to be found elsewhere.

Similarly, in the information security realm, as new business needs arise, new security technologies are often deployed to address them.  Enterprises often forget to include O&M when calculating total cost.  Another way to think of this is that each new security technology requires people to deploy, operate, and maintain it.  If head count were increased each time a new security technology were deployed, the model would work quite well.  However, as those of us in the security world know, head count seldom grows in parallel with new business needs.  This presents a big challenge to the enterprise.

O&M cost (including the human resources required to deploy, maintain, and operate technology) is an important cost to keep in mind as technologies reach end of life and come up for renewal or replacement.  O&M cost is a large part of the overall cost of technology, but it is one that is often overlooked or underestimated.  In an effort to lower overall O&M costs, it pays to take a moment to think about the purpose of each technology.  Is this a highly specialized technology for a highly specialized purpose?  Could I potentially retain the functionality and visibility provided by several specialized technologies through the use of a more generalized technology?  If the answer to these two questions is yes, it pays to think about consolidating security technologies that are up for renewal or replacement.  This can be a great option provided it doesn't negatively affect security operations.  Fewer specialized security technologies mean fewer resources to deploy, maintain, and operate them.  That, in turn, means lower overall O&M costs.  Lower O&M costs are always a powerful motivating factor to consider.
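As a rough illustration of why this matters, consider the shape of the calculation below; the figures are invented for the example and simply stand in for real acquisition and O&M costs:

```python
def total_cost(acquisition, annual_om, years):
    """Total cost of ownership: purchase price plus O&M over the tool's lifetime."""
    return acquisition + annual_om * years

# Hypothetical figures, for illustration only.
specialized = sum(total_cost(100_000, 40_000, 5) for _ in range(3))  # three niche tools
consolidated = total_cost(250_000, 60_000, 5)                        # one broader platform

print(specialized)    # 900000 -- three sets of deployment and care-and-feeding
print(consolidated)   # 550000 -- one set, lower overall O&M
```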