Saturday, 9 November 2024

Network of Networks

 


When we're searching for analogies to describe the operation of the internet we often fall back on that of posting a letter. Each packet of data we need to send and receive can be compared to a parcel we need to get from A to B.

In order to achieve that you need to identify the best route to take between those two points. For parcels these routes are relatively static, for the internet that isn't the case with routes being much more dynamic.

Border Gateway Protocol (BGP) is a path vector routing protocol that makes routing decisions on how to get data packets to their intended destination. As internet users we may not directly interact with it but without it the internet wouldn't be functional across the globe as it is today.

Autonomous Systems

The internet as the name suggests is a network of networks, it isn't one cohesive systems, it is a large number of individual networks. These individual networks in the context of routing traffic are referred to as Autonomous Systems (AS).

If we continue our postal analogy then each AS is like an individual postal region covering a town or county.

Each AS consists of routers that know how to route traffic internally but need to rely on connections to neighbouring AS to route traffic outside of its region. Because the internet is a dynamic system with routes appearing and disappearing each AS needs to be kept up to date with routes each other AS can help with.

This is achieved via peering sessions where AS connect to each other in order to build up the full picture of how to route traffic to destinations outside of each others boundary. BGP is the mechanism that allows this routing information to be exchanged.

Not Just The Shortest

In order to become a AS you must register with the Internet Assigned Numbers Authority (IANA) which will assign the AS to a Regional Internet Registry (RIR) as well as allocating a 16 or 32 bit identifier. As of the late 2010s there were around 64,000 AS registered worldwide a number which will have continued to grow.

AS tend to be managed and run by large organisations, typically Internet Service Providers (ISP) but also large tech companies, governments or large institutions such as universities.

Often there are multiple possible routes for a packet to reach its destination. In order to allow the best route to be chosen BGP allows AS to apply attributes to each route that can be factored into the decision on which route to take.

These attributes may indicate hop count (how many steps are involved in getting to the destination) as well as weight where an AS can indicate which route it would prefer traffic to take. 

Because some AS are managed by commercial business this creates an interesting quirk in how traffic is routed. It would be assumed that the shortest or fastest route would be chosen. But because some companies will charge for their AS to handle traffic or may not want to help competitors then commercial relationships are sometimes factored into which route to take.

When BGP Goes Wrong

As described earlier the management of routing traffic through the internet is autonomous, it is far too complicated and dynamic to be managed by hand. An AS uses BGP to announce which routes it can provide access to, other AS then use this information to route traffic from within inside their network.

There have been several examples of AS accidentally announcing they can provide access to routes which they can't in fact handle. In 2004 a Turkish ISP accidentally announced it could handle any route on the internet, as this misinformation spread to more and more AS internet access ground to a halt across a large part of the globe. In 2008 a Pakistani ISP which was blocking traffic internally to YouTube accidentally started announcing externally it could handle connections to YouTube and caused an outage in the region to that service.

These are both examples of BGP hijacking, in these cases the announcement of false routes was accidental but this isn't always the case. This can sometimes be done maliciously in order to lure traffic meant for a legitimate site to an imposter, in 2018 hackers announced bad BGP routes for traffic being hosted in Amazon AWS and were able to steal a large amount of cryptocurrency.

In an effort to combat this kind of activity Resource Public Key Infrastructure (RPKI) allows BGP data to be cryptographically signed in order to validate that an AS is authorised to announce a route for a particular resource.

The internet is such an important part of our day to day lives its easy to consider it a slick homogeneous system that always works, and it is true to say that its a miracle of engineering that its been able to scale to the network of networks that we use today. However it is actually a very large collection of individual elements and sometime its fragility is exposed and we see it is all too easy for it to falter. 

Saturday, 19 October 2024

Underpinning Kubernetes

 


Kubernetes is the de facto choice for deploying containerized applications at scale. Because of that we are all now familiar with its building blocks that allow us to build our applications such as deployments, ingresses, services and pods.

But what is it that underpins these entities and how does Kubernetes manage this infrastructure. The answer lies in the Kubernetes control plane and the nodes it deploys our applications too.

The control plane manages and makes decisions related to the management of the cluster, in this sense it acts as the clusters brain. It also provides an interface to allow us to interact with the cluster for monitoring and management.

The nodes are the work horses of the cluster where infrastructure and applications are deployed and run.

Both the control plane and the nodes comprise a number of elements each with their own role in providing us with an environment in which to run our applications.

Control Plane Components

The control plane is made up of a number of components responsible for the management of the cluster, in general these components run on dedicated infrastructure away from the pods running applications.

The kube-apiserver provides the control plane with a front end via a suite of REST APIs. These APIs are resource based allowing for interactions with the various elements of Kubernetes such as deployments, services and pods.

In order to manage the cluster the control plane needs to be able to store data related to its state, this is provided by etcd in the form of a highly available key value store.

The kube-scheduler is responsible for detecting when new pods are required and allocating a node for them to run on. Many factors are taken into account when allocating a node including resource and affinity requirements, software or hardware restrictions and data locality.

The control plane contains a number of controllers responsible for different aspects of the management of the cluster, these controllers are all managed by the kube-controller-manager. In general each controller is responsible for monitoring and managing one or more resources within the clusters, as an example the Node Controller monitors for and responds to nodes failing. 

By far the most common way of standing up a Kubernetes cluster is via a cloud provider. The cloud-controller-manager provides a bridge between the internal concepts of Kubernetes and the cloud provider specific API that is helping to implement them. An example of this would be the Route Controller responsible for configuring routes in the underlying cloud infrastructure.

Node Components

The node components run on every node in the cluster that are running pods providing the runtime environment for applications.

The kubelet is responsible for receiving PodSpecs and ensuring that the pods and containers it describes are running and healthy on the node.

An important element in being able to run containers is the Container Runtime. This runtime provides the mechanism for the node to act as a host for containers. This includes pulling the images from a container registry as well as managing their lifecycle. Kubernetes supports a number of different runtimes with this being a choice to be made when you are constructing your cluster.

An optional component is the kube-proxy that maintains network rules on the node that plays an important role in implementing the services concept within the cluster.

Add Ons

In order to allow the functionality of a cluster to be extended Kubernetes provides the ability to define Add ons.

Add ons cover many different pieces of functionality.

Some relate to networking by providing internal DNS for the cluster allowing for service discovery, or by providing load balancers to distribute traffic among the clusters nodes. Others relate to the provisioning of storage for use by the application running within the cluster. Another important aspect is security with some add ons allowing for security policies to be applied to clusters resources and applications.

Any add ons you choose to use are installed within the cluster with the above examples by no means being an exhaustive list.

As an application developer deploying code into a cluster you don't necessarily need a working knowledge of how this infrastructure is being underpinned. Bit I'm a believer that having an understanding of the environment where your code will run will help you write better code.

That isn't to say that you need to become an expert, but a working knowledge of the building blocks and the roles they play will help you become a more well rounded engineer and enable you to make better decisions. 

Sunday, 13 October 2024

Compiling Knowledge

 


Any software engineer who works with a compiled language will know the almost religious concept of the build. Whether you've broken the build, whether you've declared that it builds on my machine, or whether you've ever felt like you are in a life or death struggle with the compiler. The build is a process that must happen to turn your toil into something useful for users to interact with.

But what is actually happening when your code is being complied? In this post we are certainly not going to take a deep dive into compiler theory, it takes a special breed of engineer to work in that realm, but an understanding of the various processes involved can be helpful on the road to becoming a well rounded developer.

From One to Another

To start at the beginning, why is a compiler needed? Software by the time it runs on a CPU is a set of pretty simple operations knows as the instruction set. These instructions involve simple mathematical and logical operations alongside moving data between registers and areas of memory.

Whilst it is possible to program at this level using assembly language, it would be an impossibly difficult task to write software at scale. As engineers we want to be able to code at a higher level.

Compilers give us the ability to do that by acting as translators, they take the software we've written in a high level language such as C++, C#, Java etc and turn this into a program the CPU can run.

That simple description of what a compiler does belies the complexity of what it takes to achieve that outcome, so it shouldn't be too much of a surprise that implementing it takes several different process and phases.     

Phases

The first phase of compilation is Lexical Analysis, this involves reading the input source code and breaking it up into its constituent parts, usually referred to as tokens. 

The next phase is Syntax Analysis, also know as parsing. This is where the compiler ensures that the tokens representing the input source code conform to the grammer of the programming language. The output of this stage is something called the Abstract Syntax Tree (AST), this represents the structure of the code as described by a series of interconnected nodes in a tree structure representing paths through the code.

Next comes Semantic Analysis, it is at this stage that the compiler checks that the code actually makes sense and obeys the rules of the programming language including its type system. The compiler is checking that variables are declared correctly, that functions are called correctly and any other semantic errors that may exist in the source code. 

Once these analysis phases are complete the compiler can move onto Intermediate Code Generation. At this stage the compiler generates an intermediate representation of what will become the final program that is easier to translate into the machine code the CPU can run.

The compiler will then run an Optimisation stage to apply certain optimisations to the intermediate code to improve overall performance.

Finally the compiler moves onto Code Generation in order to produce the final binary, at this stage the high level language of the input source code has been converted into an executable that can be run on the target CPU.    

Front End, Middle End and Backend

The phases described above are often segregated into front end, middle end and backend. This enables a layered approach to be taken to the architecture of the compiler and allows for a certain degree of independence. This means different teams can work on these areas of the compiler as well as making it possible for different parts of compilers to be re-used and shared.

Front end usually refers to the initial analysis phases and is specific to a particular programming language. Should any of the code fail this analysis errors and warnings will be generated to indicate to the developer which lines of source code are incorrect. In this sense the front end is the most visible part of the compiler that developers will interact with.

The middle end is generally responsible for optimisation. Many compilers will have settings for how aggressive this optimisation is and depending on the target environments may also distinguish between optimising for speed or memory footprint.

The backend represents the final stage where code that can actually run on the CPU is generated.

This layering allows for example for front ends related to different programming languages to be combined with different backends that produce code for particular families of CPUs, with the intermediary representation acting as the glue to bind them together.

As we said at the start of this post understanding exactly how compilers work is a large undertaking. But having any appreciation of the basic architecture and phases will help you deal with those battles you may have when trying to build your software. Compiler messages may sometimes seem unclear or frustrating so this knowledge may save valuable time in figuring out what you need to do to keep the compiler happy.

Saturday, 14 September 2024

Terraforming Your World

 


Software Engineers are very good at managing source code. We have developed effective strategies and tools to allow us to branch, version, review, cherry pick and revert changes. It is for this reason we've been keen to try and control all aspects of our engineering environment in the same way.

One such area is the infrastructure on which our code runs.

Infrastructure as Code (IaC) is the process of managing and deploying compute, network and storage resources via files that can be part of the source control process.

Traditionally these resources may have been managed via manual interactions with a portal or front end application from your cloud provider of choice. But manual processes are prone to error and inconsistency making it difficult and time consuming to manage and operate infrastructure in this way.

Environments might not always be created in exactly the same way, and in the event of a problem there is a lack of an effective change log to enable changes to be reverted back to a known good state.

One of the tools that attempts to allow infrastructure to be developed in the same way as software is Terraform by Hashicorp.

Terraform

Terraform allows required infrastructure to be defined in configuration files where the indicated resources are created within the cloud provider via interaction with their APIs.

These interactions are encapsulated via a Provider which defines the resources that can be created and managed within that cloud. These providers are declared within the configuration files and can be pulled in from the Terraform Registry which acts like a package manager for Terraform.

Working with Terraform follows a three stage process.

Firstly in the Write phase the configuration files which describe the required infrastructure are created. These files can span multiple cloud providers and can include anything from a VPC to compute resource to networking infrastructure.

Next comes the Plan phase. Terraform is a state driven tool, it records the current state of your infrastructure and applies the necessary changes based on the updated desired state in the configuration files. As part of the plan phase Terraform creates a plan of the actions that must be taken to move the current state of the infrastructure to match the desired state, whether this be creating, changing or deleting elements. This plan can then be reviewed to ensure it matches with the intention behind the configuration changes.

Finally in the Apply phase Terraform uses the cloud provider APIs, via the associated provider, to ensure the deployed infrastructure aligns with the new state of the configuration files.

Objects

Terraform supports a number of different objects that can be described in the configuration files, presented below is not an exhaustive list but describes some of the more fundamental elements that are used to manage common infrastructure requirements.

Firstly we have Resources which are used to describe any object that should be created in the cloud provider, this could be anything from a compute instance to a DNS record.

A Data Source provides Terraform with access to data defined outside of Terraform. This might be data from pre-existing infrastructure, configuration held outside of Terraform or a database.

Input Variables allow configuration files to be customised to a specific use case increasing their reusability. Output Variables allow the Terraform process to return certain data about the infrastructure that has been created, this might be to further document the infrastructure that is now in place or act as an input to another Terraform process managing separate but connected infrastructure. 

Modules act like libraries in a software development context to allow infrastructure configuration to be packaged and re-used across many different applications.

Syntax

We have made many references in this article to configuration files but what do these actually consist of?

They are defined using the Hashicrop Configuration Language (HCL) and follow a format similar to JSON, in fact it is also possible for Terraform to work with JSON directly.

All content is defined with structures called blocks:

resource "my_provider_object_type" "my_object" {
  some_argument = "abc123"
}

Presented above is an extremely simple example of a block.

Firstly we define the type of the block, in this case a resource. Then comes a series of labels with the number of labels being dependent on the block type. For a resource block two labels are required, the first describing the type of the resource as defined by the provider, the second being a name that can be used to refer to the resource in other areas of the configuration files.

Inside the block the resource may require a series of arguments to be provided to it in order to configure and control the underlying cloud resource that will be created.

This post hasn't been intended to be a deep dive into Terraform, instead I've been trying to stoke your interest in the ways an IaC approach can help you apply the same rigour and process to your infrastructure management as you do to your source code.

Many of the concepts within Terraform have a close alignment to those in software engineering. Using an IaC approach alongside traditional source code management can help foster a DevOps mentality where the team responsible for writing the software can also be responsible for managing the infrastructure it runs on. Not only will this allow their knowledge of the software to shape the creation of the infrastructure but also in reverse knowing where and on what infrastructure their code will run may well allow them to write better software.


Tuesday, 3 September 2024

Being at the Helm

 


The majority of containerized applications that are being deployed at any reasonable scale will likely be using some flavour of Kubernetes.

As a container orchestration platform Kubernetes allows the deployment of multiple applications to be organised around the concepts of pods, services, ingress, deployments etc defined in YAML configuration files. 

In this post we won't go into detail around these concepts and will assume a familiarity with their purpose and operation.

Whilst Kubernetes makes this process simpler, when it's being used for multiple applications managing the large number of YAML files can come with its own challenges.

This is where Helm comes into the picture. Described as a package manager for Kubernetes Helm provides a way to manage updates to the YAML configuration files and version them to ensure consistency and allow for re-use.

I initially didn't quite understand the notation of Helm being a package manager but as I've used it more I've come to realise why this is how it's described.

Charts and Releases

The Helm architecture consists of two main elements, the client and the library.

The Helm client provides a command line interface (CLI) to indicate what needs to be updated in a cluster via a collection of standard Kubernetes YAML files, the library then contains the functionality to interact with the cluster to make this happen.

The collection of YAML files passed to the client are referred to as Helm Charts, they define the Kubernetes objects such as deployments, ingress, services etc.

The act of the library using these YAML files to update the cluster is referred to as a Release.

So far you maybe thinking that you can achieve the same outcome by applying the same YAML files to Kubernetes directly using the kubectl CLI. Whilst this is true where Helm adds value is where you need to deploy the same application into multiple environments with certain configuration or set-up differences. 

Values, Parametrisation and Repositories

It will be a common practice to need to deploy an application to multiple environments with a differing numbers of instances, servicing requests on different domains or any other  differences between testing and production environments.

Using Kubernetes directly means either maintaining multiple copies of YAML files or having some process to alter them prior to them being applied to the cluster. Both of these approaches have the potential to cause inconsistency and errors.

To avoid this Helm provides a templating engine to allow a single set of YAML files to become parameterised. The syntax of this templating when you first view it can be quite daunting, while we won't go into detail about it here, like with any language over time as you use it more it will eventually click.

Alongside these parameterised YAML files you specify a Values YAML that defines the environment specific values that should be applied to the parameterised YAML defining the Kubernetes objects.

This allows the YAML files to be consistent between all environments in terms of overall structure whilst varying where they need too. This combination of a Values YAML file and the parameterised YAML defining the Kubernetes objects are what we refer to as Helm Charts.

It maybe that your application is something that needs to be deployed in multiple clusters, for these situations your Helm charts can be provided via a repository in a similar way that we might make re-usable Docker images available.

I think it's at this point that describing Helm as a package manager starts to make sense.

When we think about code package managers we think of re-usable libraries where functionality can be shared and customised in multiple use cases. Helm is allowing us to achieve the same thing with Kubernetes. Without needing to necessarily understand all that is required to deploy the application we can pull down Helm charts, specify our custom values and start using the functionality in our cluster.   

When to Use

The benefit you will achieve by using Helm is largely tied to the scale of the Kubernetes cluster you are managing and the number of applications you are deploying.

If you have a simple cluster with a minimal applications deployed maybe the overhead of dealing with Kubernetes directly is manageable. If you have a large cluster with multiple applications or you have many clusters each with different applications deployed you will benefit more due to the ability it offers to ensure consistency.

You will also benefit from using Helm if you need to deploy a lot of 3rd party applications into your cluster, whether this might be to manage databases, ingress controllers, certificate stores, monitoring or any other cross cutting concern you need to have available in he cluster.

The package manager nature of Helm will reduce the overhead in managing all these dependencies in the same way that you manage dependencies at a code level.

As with many tools your need for Helm may grow over time as you estate increases in complexity. If like me you didn't immediately comprehend the nature and purpose of Helm then hopefully this post has helped you recognise what it can offer and how it could benefit your use case.

Monday, 26 August 2024

Avoiding Toiling

 


Site Reliability Engineering (SRE) is the practice of applying software engineering principles to the management of infrastructure and operations.

Originating at Google in the early 2000s the sorts of things an SRE team might work on include system availability, latency and performance, efficiency, monitoring and the ability to deliver change.

Optimising these kinds of system aspects covers many different topics and areas, one of which is the management of toil.

Toil in this context is not work we don't particularly enjoy doing or don't find stimulating, it has a specific meaning defined by aspects other than our enjoyment of the tasks it involves.

What is Toil?

Toil is work that exhibits some or all of the following attributes.

It is Manual in nature, even if a human isn't necessarily doing the work it requires human initiation, monitoring or any other aspect that means a team member has to oversee its operation.

Toil is Repetitive, the times work has to be done may vary and may not necessarily be at regular intervals, but the task needs to be performed multiple times and will never necessarily be deemed finished.

It is Tactical meaning it is generally reactive, it has to be undertaken either in relation to something happening within the system for example when monitoring highlights something is failing or is sub-optimal.

It has No Enduring Value, this means it leaves the system in the same state as before the work happened. It hasn't improved any aspect of the system or eliminated the need for the work to happen again in the future.

It Scales with Service Growth. Some work items need to happen regardless of how much a system is used. This tends to be viewed as overhead and is simply the cost of having the system in the first place. Toil scales with system use meaning the more users you attract the greater the impact of the toil on your team.

Finally toil can be Automated, some tasks will always require human involvement, but for a task to be toil it must be possible for it to be automated.

What is Toils Impact?

It would be wrong to suggest that toil can be totally eliminated, having a production system being used by large numbers of people is always going to incur a certain amount of toil, and it is unlikely that the whole engineering effort of your organisation can be dedicated to removing it.

Also, much like technical debt, even if you do reach a point where you feel its eliminated the chances are a future change in the system will likely re-introduce it.

But also like technical debt the first step is to acknowledge toil exists, develop ways to be able to detect it and have a strategy for managing it and trying to keep it to a reasonable minimum.

Toils impact is that it engages your engineering resource on tasks that don't add to or improve your system. It may keep it up and running but that is a low ambition to have for any system 

It's also important to recognise that large amounts of toil is likely to impact a teams morale, very few engineers will embark on their career looking to spend large amounts of time on repetitive tasks that lead to no overall value.

The Alternative to Toil

The alternative to spending time on toil is to spend time on engineering. Engineering is a broad concept but in this context it means work that improves the system itself or enables to to be managed in a more efficient way.

As we said previously completely eliminating toil is probably an unrealistic aim. But it is possible to measure how much time your team is spending on toil related tasks. Once you are able to estimate this then it is possible both to set a sensible limit on how much time is spent on these tasks but also measure the effectiveness of any engineering activities designed to reduce it.

This engineering activity might relate to software engineering, refactoring code for performance or reliability, automating testing or certain aspects of the build and deployment pipeline. It might also be more aimed at system engineering, analysing the correctness of the infrastructure the system is running on, analysing the nature of system failures or automating the management of infrastructure.

As previously stated we can view toil as a form of technical debt. In the early days of a system we may take certain shortcuts that at the time are manageable but as the system grows come with a bigger and bigger impact. Time spent trying to fix this debt will set you on a path for gradual system improvement, both for your users and the teams that work on the system.

Saturday, 13 July 2024

The Language of Love

 


Software engineers are often polyglots who will learn or be exposed to multiple programming languages over the course of their career. But I think most will always hold a special affection for the first language they learn, most likely because it's the first time they realise they have the ability to write code and achieve an outcome. Once that bug bites it steers you towards a path where you continue to hone that craft.

For me that language is C and its successor C++.

Potentially my view is biased because of the things I've outlined above but I believe C is a very good language for all potential developers to start with. If you learn how to code close to the metal it will develop skills and a way of thinking that will be of benefit to you as you progress onto high level languages with greater levels of abstraction from how your code is actually running.

In The Beginning

In the late 1960s and early 1970s as the Unix operating system was being developed engineers realised that they needed a program language that could be used to write utilities and programs to run on the newly forming platform.

One of the initial protagonists in this field was Ken Thompson.

After dismissing existing programming languages such as Fortran he started to develop a variant of an existing language called BCLP. He concentrated on simplifying the language structures and making it less verbose. He called this new language B with the first version being released around 1969.

In 1971 Dennis Ritchie continued to develop B to utilise features of more modern computers as well as adding new data types. This culminated in the release of New B. Throughout 1972 the development continued adding more data types, arrays and pointers and the language was renamed C.

In 1973 Unix was re-written in C with even more data types being added as C continued to be developed through the 1970s. This eventually resulted in the release of what many consider the be the definitive book on the C programming language, written by Brian Kernighan and Dennis Ritchie The C Programming Language became known as K&R C and became the unofficial specification for the language.

C has continued to be under active development right up until the present with C23 expected to be released in 2024.

C with Classes

In 1979 Bjarne Stroustrup began work on what he deemed "C with Classes".

Adding classes to C turned into an object oriented language, where C had found a home in embedded programming running close to the metal, adding classes made it more suitable for large scale software development.

In 1982 Stroustrup began work on C++ adding new features such as inheritance, polymorphism, virtual functions and operator overloading. In 1985 he released the book The C++ Programming Language which become the unofficial specification for the language with the first commercial version being released later that year.

Much like C, C++ has continued to be developed with new versions being released up until the present day.

Usage Today

Software Engineering is often considered to be a fast moving enterprise, and while many other programming languages have been developed over the lifetime of C and C++ both are still very widely used.

Often being used when performance is critical, the fact they run close to the metal allows for highly optimised code for use cases such as gaming, network appliances and operating systems.

Usage of C and C++ can often strike fear into the heart of developers who aren't experienced in their use. However the skills that usage of C and C++ can develop will prove invaluable even when working with higher level languages so I would encourage all software engineers to spend some time expose themselves to the languages.

Good engineers can apply their skills using any programming language, the principles and practices of good software development don't vary that much between languages or paradigms. But often there are better choices of language for certain situations, and C and C++ are still the correct choice for many applications.

Friday, 28 June 2024

Vulnerable From All Sides

 


Bugs in software engineering are a fact of life, no engineer no matter what he perceives his or her skill level to be has never written a non-trivial piece of software that didn't on some level have some bugs.

These maybe logical flaws, undesirable performance characteristics or unintended consequences when given bad input. As the security of software products has grown ever more important the presence of a particular kind of bug has become something we have to be very vigilant about. 

A vulnerability is a bug that has a detrimental impact on the security of software. Whereas any other bug might cause annoyance or frustration a vulnerability may have more meaningful consequences such as data loss, loss of system access or handing over control to untrusted third parties.

If bugs, and therefore vulnerabilities, are a fact of life then what can be done about them? Well as with anything being aware of them, and making others aware, is a step in the right direction.

Common Vulnerabilities and Exposures 

Run by the Mitre Corporation with funding from the US Division of Homeland Security Common Vulnerabilities and Exposures (CVE) is a glossary that catalogues software vulnerabilities and via the Common Vulnerability Scoring System (CVSS) provides them with a score to indicate their seriousness.

 In order for a vulnerability to be added to the CVE it needs to meet the following criteria.

It must be independent of any other issues, meaning it must be possible to fix or patch the vulnerability without needing to fix issues elsewhere.

The software vendor must be aware of the issue and acknowledge that it represents a security risk. 

It must be a proven risk, when a vulnerability is submitted it must be alongside evidence of the nature of the exploit and the impact it has.

It must exist in a single code base, if multiple vendors are effected by the same or similar issues then each will receive its own CVE identifier.

Common Vulnerability Scoring System (CVSS)

Once a vulnerability is identified it is given a score via the Common Vulnerability Scoring System (CVSS). This score ranges from 0 to 10, with 10 representing the most severe.

CVSS itself is based on a reasonably complicated mathematical formula, I won't here present all the elements that go into this score but the factors outlined below give a flavour of the aspects of a vulnerability that are taken into account, they are sometime referred to as the base factors.

Firstly Access Vector, this relates to the access that an attacker needs to be able to exploit the vulnerability. Do they for example need physical access to a device, or can they exploit it remotely from inside or outside a network. Related to this is Access Complexity, does an attack need certain conditions to exist at the time of the attack or for the system to be in a certain configuration.

Authentication takes into account the level of authentication an attacker needs. This might range from none to admin level.

Confidentially assesses the impact in terms of data loss of an attacker exploiting the vulnerability. This could range from trivial data loss to the mass export of large amounts of data.

In a similar vein Integrity assesses an attackers ability to change or modify data held within the system, and Availability looks at the ability to effect the availability of the system to legitimate users.

Two other important factors are Exploitability and Remediation Level. The first relating to whether code is known to exist that enables the vulnerability to be exploited, and the latter referring to whether the software vendor has a fix or workaround to provide protection.

These and other factors are weighted within the calculation to provide the overall score.

Patching, Zero Days and Day Ones

The main defence against the exploitation of vulnerabilities is via installing patches to the affected software. Provided by the vendor these are software changes that address the underlying flaws causing the vulnerability in order to stop it being exploited.

This leads to important aspects about the lifetime of a vulnerability.

The life of the vulnerability starts when it is first inadvertently entered into the codebase. Eventually it is discovered by the vendor or a so called white hat hacker who notifies the vendor. The vendor will disclose the issue via CVE while alongside this working on a patch to address it.

This leads to a period of time where the vulnerability is known about, by the vendor, white hack hackers and possibly the more nefarious black hat hackers but where a patch isn't yet available. At this point vulnerabilities are referred to as zero days, they may or may not be being exploited and no patch exists to make the software safe again.

It may seem like once the patch is available the danger has passed. However once a patch is released the nature of the patch often provides evidence of the original vulnerability and provide ideas on how it is exploitable. At this point the vulnerability is referred to as a Day One, the circle of those who may have the ability to exploit it has increased, and vulnerable systems are not yet made safe until the patch has been installed.

CVE provides an invaluable resource in documenting vulnerabilities, forewarned is forearmed. Knowing a vulnerability exists means the defensive action can start, and making sure you stay up to date with all available patches means you are as protected as you can be.


Saturday, 15 June 2024

The World Wide Internet

 


Surfing the web, getting online and hitting the net are terms that ubiquitous among the verbs that describe how we live our lives. They stopped being technical terms a long time ago and now simply describe daily activities that all generations understand and take part in.

In becoming such commonly used terms they have lost some of their meaning with the web and the internet being interchangeable in most peoples minds. However this isn't the case, they do represent two different if complementary technologies.

Internet vs Web

The Internet is the set of protocols that has allowed computers all over the world to be connected and to exchange data, the so called "network of networks". It is concerned with how data is sent and received not what the data actually represents.

Protocols such the Internet Protocol and the Transmission Control Protocol allow each computer to be addressable and routable to allow the free flowing transmission of data.

The World Wide Web (WWW or W3) is an information system that builds on top of the interconnection between devices that the Internet provides to define a system for how information should be represented, displayed and linked together.

Defined by the concepts of hypermedia and hypertext it provides the means by which we are able to view data online, how we are provided links to that data and that data is visually depicted.

The History of the Internet

As computer science emerged as an academic discipline in the 1950s access to computing resource was scarce. For this reason scientist wanted to develop a way for access to be time shared such that many different teams could take advantage of the emerging technology.

This effort culminated in the development of the first wide are network Advanced Research Projects Agency Network (ARPANET) built by US Department of Defence in 1969.

Initially this interconnected network connected a number of universities including the University of California, Stanford Research Institute and the University of Utah.

In 1973 ARPANET expanded to become international with Norwegian Seismic Array and University College London being brought onboard. Into the 1980s ARPANET continued to grow and started to be referred to as the Internet as short hand for Internetwork.

In 1982 the TCP/IP protocols were standardised and the foundations of what we now know as the Internet were starting to be put in place.

The History of the Web

The Web was the invention of Sir Tim Berners-Lee as part of his work at CERN in Switzerland. The problems he was trying to solve related to the storing, updating and discoverability of documents in large datasets being worked on by large numbers of distributed teams.

In 1989 he submitted his proposal for a system that could solve these problems and in 1990 a working prototype was completed including and HTTP server and the very first browser named after the project and called WorldWideWeb.

Building on top of the network provided by the Internet the project defined the HTTP protocol, the structure of URLs and HTML as the way that the data in documents could be represented.

In 1993 CERN made the decision to make these protocols and the code behind them available royalty free, a decision that would change the world forever and enabled the number of web sites in the world to steadily grow from tens, to hundreds, to thousands, to the vast numbers that we now take for granted.

Many technologies end up having a profound impact on our lives without its terminology becoming understood by those outside technological circles. But the Internet and the web are different, they are so embedded in our lives that URLs, hyperlinks, web address etc are normal everyday words.

In the early days of ARPANET, and probably also for the WWW project, although those teams may have realised they were working on what could be important technologies I think they wouldn't have anticipated quite where their work would lead. But when an idea is a good one, it can go in many unexpected directions.

Sunday, 9 June 2024

What's In a Name

 


Surfing the web seems like straightforward undertaking, you type the website you want to go to into your browsers address bar, or you click on a result from a search engine and within no time the website you wanted to visit is in front of you.

But how did this happen simply by typing a web address into a browser? How is the connection made between you and a website out there somewhere in the world?

The answer lies in the Domain Name System (DNS).

All servers on the internet that are hosting the websites you want to access are addressable via a unique Internet Protocol (IP) address. For example as I write this article linkedin.com is currently addressable via 13.107.42.14. These addresses aren't practical for human use which is why we give websites names such as linkedin.com.

DNS is the process by which these human readable names are translated into the IP addresses that can be used to actually access the websites content.

DNS Elements 

Four main elements are involved in a DNS lookup.

A DNS Recursor is a server that receives queries from clients to resolve a websites host name into an IP address. The recursor will usually not be able to provide the answer itself but knows how to recursively navigate the phone directory of the internet in order to give the answer back to the client.

A Root Nameserver is usually the first port of call for the recursor, it can be thought as like a directory of phone directories, based on the area of the internet the websites domain points at it directs the recursor at the correct directory that can be used to continue the DNS query.

A Top Level Domain (TLD) Nameserver acts as the phone directory for a specific part of the internet based on the TLD portion of the web address. For example a TLD nameserver will exists for .com addresses, .co.uk addresses and so on.

An Authoritative Nameserver is the final link in the chain, it is the part of the phone directory that can provide the IP addresses for the website you are looking for.

DNS Resolution

To bring this process to life let's look at the path of a DNS query if you were trying to get to mywebsite.com.

The user types mywebsite.com into their browser and hits enter, the browser then asks a DNS recursor to provide the IP address for mywebsite.com.

The recursor first queries a root nameserver to find the TLD nameserver thats appropriate for this request.

In this example the root nameserver will respond with the TLD nameserver for .com addresses.

The TLD nameserver will then respond with the authoritative nameserver for the websites domain, in this example mywebsite.com, the location of this server will be related to where the website is being hosted. The authoritative nameserver then responds with the IP address for the website, the recursor returns this to the users browser and the website can be loaded.

DNS Security

DNS is one of the fundamental technologies that has its origins in the foundation of the web. At this time when the blueprint of the web was being created security was less of a concern to those solving these engineering problems, it was assumed that the authenticity of the links in the chain could be taken on trust.

Unfortunately in the modern web this level of trust in other actors can be misplaced. When a server claims to be the authoritative nameserver for a particular website how can you trust that this is the case and you aren't going to be directed to a rogue impersonation of the website you are trying to reach.

Domain Name System Security Extensions (DNSSEC) is attempting to replace the trust based system with one that is based on provable security. It introduces the signing and validation of the DNS records being returned from the various elements involved in a DNS query so that their authenticity can be determined.

DNS is one of the technologies that is now taken for granted but solves a problem without which the web as we know it wouldn't be able to exist. On the surface it sounds like a simple problem to solve but the scale of the web means even the simplest of solutions has to be able to scale to a world wide scale.

Saturday, 13 April 2024

The Web of Trust

 


Everyday all of us type a web address into a browser or click on a link provided by a search engine and interact with the web sites that are presented.

Whilst we should always be vigilant, on many occasions we will simply trust that the site we are interacting with is genuine and the data flowing between us and it is secure.

Our browsers do a good job of making sure our surfing is safe, but how exactly is that being achieved. How do we create trust between a website and its users?

SSL/TLS

Netscape first attempted to solve this trust problem by introducing the Socket Secure Layer (SSL) protocol in the early 90s. Initial versions of the protocol still had many flaws but by the release of SSLv3.0 in 1996 it had matured into a technology that was able to provide a mechanism for trust on the web.

As SSL became a foundational part of the web, and because security related protocols always have to be under constant evolution to maintain safety, the Internet Engineering Task Force (IETF) developed Transport Layer Security (TLS) in 1999 as an enhancement to SSLv3.0.

TLS has continued to be developed with TLSv1.3 being released in 2018. 

Its primary purpose is to ensure data being exchanged by a server and a client is secured, but also to establish a level of trust such that the two parties can be sure who they are exchanging the data with.

Creating this functionality relies on a few different elements.

Public Key Encryption

Public key encryption is a form of asymmetric encryption that uses a pair of related keys deemed public and private.

The mathematics behind this relationship between the keys is too complex to go into in this post, but the functionality it provides is based on the fact that the public key can be used to encrypt data that only the private key can decrypt.

This means the public key can be freely distributed and used to encrypt data that only the holder of the private key can decrypt.

The keys can also be used to produce and verify digital signatures. This involves the holder of some data using a mathematical process to "sign" this data using their private key.

The receiver of the data can use the public key to verify the signature and therefore prove that the data came from someone who has the corresponding private key.

Public Key Infrastructure (PKI)

Public Key Infrastructure (PKI) builds on top of the functionality provided by public key encryption to provide a system for establishing trust between client and server.

This is achieved via the issuance of digital certificates from a Certificate Authority (CA).

The CA is at the heart of the trust relationship of the web. When two parties, the client and server, are trying to form a trust relationship they must delegate to a 3rd party that they both already trust, this is the CA.

The CA establishes the identity of the organisation the client will interact with via off line means and issues a digital certificate. This certificate establishes the identity of the organisation, its public key and is signed by the CA to prove it was the one that issued the certificate.

A client when it receives the certificate from the server can use the CA's public key to verify the signature and therefore trust the data in the certificate.

It's possible to have various levels of CA's that may delegate trust to other CA's, deemed intermediary CA's. But all certificates should ultimately be able to be traced back to a so called Root CA that all parties on the web have agreed to trust and whose public keys are available to all participants.   

Certificates and Handshakes

All of the systems previously described are combined whenever we visit a web site to establish trust and security.
  • A user types a web address into the browser or clicks a link provided by a search engine.
  • The user's browser issues a request to the web site to establish a secure connection.
  • The server in response sends the browser it's certificate.
  • The browser validates the certificate authenticity by verifying the signature of the Root CA that the certificate is issued from using the public key of the CA that has been pre-installed on the users machine.
  • Once the certificate is validated, the browser creates a symmetric encryption key that will be used to secure future communication between the browser and the web site. It encrypts the symmetric key using the servers public key and sends it to the server.
  • The users browser has now established the identity of the web site, based on the data contained in its validated certificate, and both parties now have a shared symmetric key that can be used to secure the rest of their communication in the session.
There are certain pieces of functionality that are fundamental to allowing the web to operate in the way it does.

Without the functionality provided by SSL/TLS it wouldn't be possible to use the web as freely as we do whilst also trusting that we can do so in a safe and secure manner.   

Monday, 1 April 2024

Imagining the Worst

 


In the modern technological landscape the list of possible security threats can seem endless. The breadth of potential attackers and potential vectors for their attacks has never been so large, does this mean we are all just helpless waiting for an attack and the terrible consequences to befall us?

One way to be proactive in the face of these dangers is to try and anticipate what form these treats might take, what damage they could do and what countermeasures it might be possible to take.

Threat modelling is a technique for enumerating the threats a system might face, identifying whether or not safeguards might exist and analysing the consequences of these attacks succeeding. 

To help developers and engineers with the threat modelling process Microsoft developed the STRIDE mnemonic in 1999 to serve as a checklist of things for teams to consider when analysing the potential impact of threats to their system.

STRIDE

The STRIDE mnemonic attempts to categorise potential threats in terms of the impact they may have, this allows teams to analyse if any part of a system may be susceptible, and if so how this might be mitigated.

Spoofing is the process of falsely identifying yourself within a system. This might be by using stolen user credentials, leaked access tokens or cookies and any other form of session hijacking.

Tampering involves the malicious manipulation of data either at rest, for example altering data within a database, or while in transit, for example by acting as a main in the middle.

Repudiation relates to an attacker being able to cover their tracks by exploiting any lack of logging or ability to trace actions within a system, this might also include an attacker having the ability to falsify an audit trail to hide malicious activity.

Information Disclosure occurs when information is available to users who shouldn't be able to view it. This might cover a system returning database records a user has no entitlement to view, or the ability of an attacker to intercept data in transit, again for example by acting as a man in the middle.

Denial of Service is any attack that denies users the ability to legitimately use a system, of which the most common form of attack is to overwhelm a system with requests or otherwise cause the system to become unresponsive or unusable.

Elevation of Privilege occurs when an attacker is able to elevate their permissions within a system under attack, normally this would mean obtaining administrator privileges or otherwise penetrating a network sufficiently to be trusted more than a normal external user.

Threat Analysis

Many tools and processes exist for implementing threat modelling, but most will revolve around a team of system experts brainstorming potential threats that a system or sub-system might be susceptible too.

This involves using analysis helpers such as STRIDE to put yourself in the mindset of an attacker. For example you may asses if an authentication system could be exploited via spoofing. The answer might be no because of certain mitigations, or yes because of certain flaws.

When applying this style of analysis to all the aspects of STRIDE it is unlikely that you will find the system is completely protected against all possible attacks. Instead you're a looking to demonstrate that it is adequately protected given the likelihood of an attack being successful and the benefit that would be gained by an attacker if they were successful.

Security is not a design activity that is ever truly complete and instead will be something that evolves over time. You can either choose to learn by mistakes when attacker are successful or you can attempt to pro-actively preempt this by performing some self critical internal analysis to ensure security levels are the highest they can be.                

Sunday, 24 March 2024

Having a Distrustful Mindset

 


I've talked previously about the concept of zero trust as it relates to security. When trying to apply an over arching philosophy like this it can feel quite daunting as to where to start.

Rather than having to create a comprehensive design in how to re-engineer your existing applications and network it can help to start with a shift in mindset in how you view the security properties of your system.

Security is an ever evolving discipline with many different facets so I wouldn't want to give the impression that the items I mention below are the only things you need to think about, but I think they do help foster the kind view that you need to have a healthy distrust of the world around you.

Authenticate

Authentication is the process of identifying the actors within a system. Traditionally this means authenticating users via them supplying a username/password or other such shared secret.

But this can also be extended to cover many other links in the chain.

This might include identifying the client or application the user is using to make the requests, the network the requests are coming from or even the physical device that is being used to communicate.

It's possible for all of these aspects of the flow of data to be used to indicate that something malicious maybe happening, and therefore properly identifying all these elements will enable you to asses if you can trust them.

Properly identifying all elements also enables comprehensive logging of any actions performed.

Authorise

Once you've identified these elements you can move onto to authorisation. This is the process of deciding if the requested operation should be allowed. Normally this would mean is this user allowed to view this data or perform this action.

But again the concept can be extended to cover more aspects of the system. Should this client or application be allowed to perform this action? Should these kinds of requests be allowed from this area of the network? Should this physical device be used to perform this action?

Going beyond authorising just the user again increases your ability to both detect malicious actions and prevent them.

Authorisation can also be linked to the method being used for authentication. Some actions a user may want to perform might be linked to needing a higher level of authentication. As an example some admin actions might require multi-factor authentication onto top of a username and password.

Encrypt

Pretty much all applications involve the movement and manipulation of data, and a large number of threats will relate to trying to expose that data to parties that shouldn't be able to view it.

One defence against this is to make sure that at the points where the data doesn't need to be viewed or worked with it is encrypted.

This will broadly come down to encrypting the data at rest and in transit. When the data is being stored and when it is being moved it should always be encrypted, the only time it should be in plain text is when it is being shown to the user. A user that has been properly authentication and authorised to view that data, via that application, from that network using that device.

This mindset should also be applied to other aspects of data sensitivity, identifying which data items are sensitive and how they are being transported. An example of this would be considering what data items are being included in the query string of requests, because this might be logged in various places data within it should be assessed for sensitivity.

Authentication, authorisation and encryption aren't the only security related factors you need to be thinking about. But embracing them and thinking about them in a greater depth will help you dive deeper into the security of your software and system and be aware of a greater breadth of possible threats. 

Saturday, 16 March 2024

Hacking Humanity

 


The modern technological world is a dangerous place, many evil actors are lurking around every dark corner with designs on your data, or simply wishing to impact your business just to show they can.

Many of these potential hackers will look to exploit flaws in your software, we should all be aware of the OWASP top ten and how code can be made to do things it wasn't intended to do. 

However there is a class of attacks that have nothing to do with exploiting flaws in software and instead are about taken advantage of flaws in human nature. Social Engineering is the process of using human psychology in order to manipulate someone into undermining the security of a system and play an unwitting role in an attack.

The nature of these types of attacks are very different to technological based attacks and so therefore are the possible defences against them.

Cognitive Bias

All Social Engineering attacks exploit some aspect of cognitive bias, this is the human tendency to make incorrect decisions based on flaws on how we interpret information being presented to us.

The exploitation of these biases can take many forms but most are trying to persuade us to take an action that will ultimately lead to harm. The below is not an exhaustive list but demonstrate some of the techniques and why they work.

The Halo Effect attempts to get you to concentrate on a particular aspect of a communication whilst ignoring information that should lead you to question what is on offer. An example would be receiving a notification that you've won a prize, the positivity of that news is designed to distract you from thinking about the fact that you never entered any kind of competition or how your contact details were obtained.

Recency bias exploits the tendency to place more importance on recent events over historical ones. Attacks of this nature are timed to appear to align with recent experiences, many examples of this would have been seen during the COVID pandemic.

Authority bias takes advantage of our unwillingness to challenge someone or something that has perceived authority over us. Examples of this would be emails that claim to come from a senior work colleague or a government department.

There are many more examples of cognitive bias all of which use some aspect of human psychology to get us to ignore information that should make us suspicious in favour of information that makes us feel we should act.

Vectors of Attack

Social Engineering can be exploited in many different forms of social interaction and communication. Again the below is not an exhaustive list but acts as examples of the vectors these attacks may use.

By far the biggest vector is phishing and its variations. Whether it be an email, instance message, SMS or phone call all phishing attacks are designed to get the victim to expose information or take an action because they are duped into thinking the instruction is coming from someone it isn't.

This might be someone exposing their credit card information because they think they are communicating with their bank or someone clicking on a link that installs malware because they think it relates to a planned delivery.

Spear phishing attempts to make these attacks even more convincing by crafting the attack to be specific to an individual rather than generic in nature. 

A similar vector is that of impersonation, access may be granted to a building to an individual because they are dressed in an official looking manner or because they meet a stereotype of how we expect certain individuals to present themselves.

Tail gating is another example of a physical social engineering attack where an attacker will follow someone into a secure building looking to exploit a human tendency to avoid conflict or to openly question an individuals actions.

Possible Defences

The first and most affective defence against social engineering is education. Teaching people that these techniques exist and the impact they can have will hopefully foster a natural suspicion of unsolicited communication before taking action.

Combined with this education many organisations undertake regular testing of employees. This will often take the form of them being exposed to emails or other communication that exploits the same aspects of cognitive biases to get them to take a certain action. This represents a safe way for people to realise they have been exploited and learn for next time when the attack may be real.

Some defences are technical in nature, for example employing principles of least privilege and zero trust can help to ensure that the blast radius of any attack is kept to a minimum. An example would be ensuring employees have the minimum level of system access needed to fulfil their roles, meaning if their account is compromised the attack gains limited access and influence.

Social Engineering is at least as bigger cyber threat as attacks that exploit technical flaws. The aims are usually the same, information exposure and taking control over a system. Where code can be scanned and tested on a continual and automated bases to find and rectify technical flaws, it is often a harder problem to solve to make individuals aware of their cognitive biases and how they can be exploited.

But knowing they exist is a pretty good step in the right direction.