In this deep dive we will explore the complex mechanisms that make Amazon VPC work. We will look at what is needed to securely run millions of networks without ever running out of IP space.
The post header displays Pablo Picasso’s ‘Bull’ series. In this study, Picasso started with a realistic graphic. With every subsequent drawing he moved closer to the essence of a bull. The end result is a line drawing; the most minimalistic abstraction of a bull.
People might say the end result is not that special. They might say they could have drawn it themselves. However, this is not the point. What Picasso has shown in his ‘Bull’ series is that he understands the image of the bull so well that he has been able to simplify it to a level that “anyone could have drawn”.
This is what Amazon has done for software defined networking (SDN) with VPC. They have taken the most complex networking mechanisms and kept on abstracting and simplifying them until they were so easy that anyone can use them. This blog post will describe the techniques used to achieve this feat.
This post is the first in a two-part series. Today we will look at the Virtual Router, the Mapping Service and Blackfoot, which is the edge device running Internet Gateway, VPN, Direct Connect and more. In the second part we will look at HyperPlane, the system that enables Network Load Balancer, Transit Gateway, PrivateLink and NAT Gateway.
[Author’s note: although I consider myself well versed in AWS technology and its inner workings, I am not an official AWS employee and do not have more insights than what is available in public resources. Most of what is described in this post is based on these public resources, but some assumptions and inferences have been made. I have tried to indicate these assumptions where applicable.]
The challenge: scale and reliability
Let’s approach VPC from the perspective of a new AWS user. This user creates a new VPC with three subnets; one for every availability zone they are planning to use. They add instances to every subnet, but don’t add an Internet Gateway, NAT Gateway, or VPN yet. Their architecture now looks like this:
Easy enough. Now let’s look at the physical infrastructure these instances run on. Every EC2 instance consumes part or all of a physical server. For example, one physical T3 host might contain two t3.2xlarge instances, 128 t3.nano instances, or any combination in between. The instance with IP 10.0.1.1 might be placed on a physical host like this:
The other instances on this physical host likely belong to other customers, so traffic for those instances should never be accessible to yours. And vice versa; traffic meant for your instances should never end up on their virtual machines. Additionally, these instances should be free to use whichever IP address they like, even if that collides with your IP plan. In an extreme case, every instance on a physical host could belong to a different VPC and they could all have the exact same IP address. In this diagram, the different VPCs are indicated by color:
A classic solution to this problem would be using VLANs. However, there are only about 4,096 VLAN IDs available. This becomes a major constraint with the number of customers and instances AWS hosts. Instead, Amazon needs something that scales at the same rate that AWS grows at.
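The VLAN limit follows from the 12-bit VLAN ID field in the 802.1Q tag. A quick check of the arithmetic is shown below; the number of networks is a made-up figure, used purely for illustration and not an AWS statistic.

```python
# The 802.1Q tag reserves 12 bits for the VLAN ID.
VLAN_ID_BITS = 12

total_ids = 2 ** VLAN_ID_BITS   # every value the field can hold
usable_ids = total_ids - 2      # IDs 0 and 4095 are reserved by the standard

# Hypothetical number of isolated customer networks (assumption, illustration only).
networks_needed = 1_000_000

print(f"VLAN IDs: {total_ids} total, {usable_ids} usable")
print(f"Enough for {networks_needed:,} networks? {usable_ids >= networks_needed}")
```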
A Virtual Router and encapsulation
Because all the instances on a physical server might have conflicting IP addresses, a mechanism that identifies which VPC the traffic originates from is needed. Amazon VPC uses a Virtual Router and encapsulation to achieve this. [Note from the author: in AWS re:Invent 2015 | (NET403) Another Day, Another Billion Packets Eric Brandwine talks about encapsulations and headers. In Packet Pushers Show 387: AWS Networking – A View From The Inside, around the 17:30 minute mark Nick Matthews refuses to exactly specify which technology is used.]
Any inbound or outbound traffic on a physical host goes through the Virtual Router. On older (non-Nitro) instances, this is software running on the hypervisor. On Nitro instances it’s running on an actual piece of hardware called the “Nitro Card for VPC”. This PCIe card is responsible for handling security groups and VPC encapsulation, among other things.
The Virtual Router encapsulates outbound traffic with a VPC header. If you would like to read more about encapsulation, check out my post about Tcpdump, Wireshark and Encapsulation.
Let’s look at encapsulation in our example. Say our instance with IP 10.0.1.1 in the black VPC tries to reach 10.0.1.2. The traffic first passes the Virtual Router, which adds the “black vpc” header to the packet (in reality it will add the unique VPC ID, like vpc-123ad1487ab908e).
By reading this header, a receiving system knows where a packet originates from. This allows it to determine if the traffic is valid and to which EC2 instance to route it. We will look at how that works in the next section.
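To make the idea concrete, here is a minimal sketch of what tagging a packet with a VPC header achieves. The actual header format is not public, so the `Packet` and `VpcFrame` types and the `vpc-black` and `vpc-blue` identifiers are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """The packet as the instance sees it (simplified)."""
    src_ip: str
    dst_ip: str
    payload: bytes


@dataclass(frozen=True)
class VpcFrame:
    """Hypothetical model: the original packet plus a VPC identifier."""
    vpc_id: str
    inner: Packet


def encapsulate(packet: Packet, vpc_id: str) -> VpcFrame:
    """Wrap the packet in a VPC header without touching the packet itself."""
    return VpcFrame(vpc_id=vpc_id, inner=packet)


# Two customers using exactly the same addresses on the same physical host.
black = encapsulate(Packet("10.0.1.1", "10.0.1.2", b"hello"), "vpc-black")
blue = encapsulate(Packet("10.0.1.1", "10.0.1.2", b"hello"), "vpc-blue")

# The inner packets are identical, but the frames can still be told apart.
print(black.inner == blue.inner, black == blue)
```

The point of the sketch is that the inner packets are indistinguishable, while the encapsulated frames are not.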
The Mapping Service
In our example we’ve seen two instances (10.0.1.1 and 10.0.1.2) in the same subnet. That means that they are in the same availability zone, which in turn means that the physical hosts can connect to each other directly. However, the Virtual Router at the source physical host does not yet know to which physical host to send the packet. All it knows is that Black 10.0.1.1 is trying to reach 10.0.1.2. This is where the VPC Mapping Service comes in.
When the Virtual Router is trying to route this traffic, it will first query the Mapping Service: “I’ve got a packet from Black 10.0.1.1 to 10.0.1.2, where do I send it?” (Technically, the Virtual Router has a complete cache of the relevant data available locally, because the central Mapping Service could never handle the number of requests. As said in one of the sources, the AWS networking team has not even implemented the ‘cache missed’ use case, because it can never occur.)
The Mapping Service knows all the routes in your VPC, and in this case it knows that 10.0.1.1 and 10.0.1.2 are in the same subnet. As such it knows that 10.0.1.2 must also be in the Black VPC, and it will answer “Black 10.0.1.2 resides on the physical host with IP address 192.168.0.24.”
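Conceptually, this lookup behaves like a table keyed on the VPC and the instance address together. The sketch below assumes a simple in-memory dictionary; the real data model is not public, and every host address except 192.168.0.24 is invented for the example.

```python
from typing import Dict, Optional, Tuple

# (vpc_id, instance_ip) -> IP address of the physical host (hypothetical data).
MAPPINGS: Dict[Tuple[str, str], str] = {
    ("vpc-black", "10.0.1.1"): "192.168.0.17",
    ("vpc-black", "10.0.1.2"): "192.168.0.24",
    ("vpc-blue", "10.0.1.2"): "192.168.0.31",  # same instance IP, different VPC
}


def resolve(vpc_id: str, dst_ip: str) -> Optional[str]:
    """Return the physical host for an instance, or None if it is unknown."""
    return MAPPINGS.get((vpc_id, dst_ip))


print(resolve("vpc-black", "10.0.1.2"))
print(resolve("vpc-blue", "10.0.1.2"))
```

Because the VPC is part of the key, overlapping IP plans in different VPCs never collide.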
Now the Virtual Router knows where to send the traffic. It will then add another IP header, after which the packet is sent onto the network. The outer IP header will allow the packet to be routed to the physical host at 192.168.0.24. When the packet arrives, that host will strip the outer IP header and lo and behold: there is a VPC header there! The VPC header tells the Virtual Router on the receiving host to send the packet to Black 10.0.1.2, and so the packet arrives at its destination. The full process looks like this:
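As a rough sketch in code, the same flow could be modelled as two functions: one on the sending host and one on the receiving host. All type names, function names and the source host address 192.168.0.17 are assumptions for illustration; the actual wire format has not been disclosed.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

VpcAddress = Tuple[str, str]  # (vpc_id, instance_ip)


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    payload: bytes


@dataclass(frozen=True)
class VpcFrame:
    vpc_id: str
    inner: Packet


@dataclass(frozen=True)
class OuterFrame:
    src_host: str
    dst_host: str
    inner: VpcFrame


def send(packet: Packet, vpc_id: str, src_host: str,
         mappings: Dict[VpcAddress, str]) -> OuterFrame:
    """Sending host: look up the target host, add VPC header and outer IP header."""
    dst_host = mappings[(vpc_id, packet.dst_ip)]
    return OuterFrame(src_host, dst_host, VpcFrame(vpc_id, packet))


def receive(frame: OuterFrame, local_host: str,
            local_instances: Set[VpcAddress]) -> Tuple[VpcAddress, Packet]:
    """Receiving host: strip the outer header, then use the VPC header to deliver."""
    if frame.dst_host != local_host:
        raise ValueError("frame was routed to the wrong host")
    vpc_frame = frame.inner
    target = (vpc_frame.vpc_id, vpc_frame.inner.dst_ip)
    if target not in local_instances:
        raise LookupError(f"no local instance for {target}")
    return target, vpc_frame.inner


mappings = {("vpc-black", "10.0.1.2"): "192.168.0.24"}
frame = send(Packet("10.0.1.1", "10.0.1.2", b"ping"),
             "vpc-black", "192.168.0.17", mappings)
target, delivered = receive(frame, "192.168.0.24", {("vpc-black", "10.0.1.2")})
print(target, delivered.payload)
```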
The process between different subnets and availability zones is not much different but involves a few additional steps on layer 3. If you would like to learn the details, check out AWS re:Invent 2015 | (NET403) Another Day, Another Billion Packets. The link points to the correct timestamp.
The Mapping Service does a lot more than just resolving addresses for physical hosts. Whenever a request comes in, the Mapping Service checks if the requesting host is actually allowed to ask that. For example, if the requesting host asks “Blue 10.0.1.1 wants to know where 10.0.1.2 resides”, but Blue 10.0.1.1 does not exist on that physical host, the traffic will be dropped and an alarm will be raised.
Not depicted in the flow above, the receiving end also validates incoming packets. So even if the request above came through and arrived at the other EC2 host, that host would ask the Mapping Service if it can trust the inbound traffic. If not, the traffic will be dropped and an alarm will be raised.
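Both checks boil down to the same question: does this instance really live on the host that claims to speak for it? The sketch below is a toy model of that idea. The `MappingService` class, its method names and the alarm list are all hypothetical, as are the host addresses other than 192.168.0.24.

```python
from typing import Dict, List, Optional, Tuple

VpcAddress = Tuple[str, str]  # (vpc_id, instance_ip)


class MappingService:
    """Toy model: knows which physical host every instance is placed on."""

    def __init__(self, placements: Dict[VpcAddress, str]) -> None:
        self._placements = dict(placements)
        self.alarms: List[str] = []

    def _hosted_on(self, vpc_id: str, ip: str, host: str) -> bool:
        return self._placements.get((vpc_id, ip)) == host

    def lookup(self, requesting_host: str, vpc_id: str,
               src_ip: str, dst_ip: str) -> Optional[str]:
        """Sender-side check: only answer if the source lives on the asking host."""
        if not self._hosted_on(vpc_id, src_ip, requesting_host):
            self.alarms.append(f"invalid lookup from {requesting_host}")
            return None
        return self._placements.get((vpc_id, dst_ip))

    def validate_inbound(self, sending_host: str, vpc_id: str, src_ip: str) -> bool:
        """Receiver-side check: does the claimed source live on the sending host?"""
        if self._hosted_on(vpc_id, src_ip, sending_host):
            return True
        self.alarms.append(f"invalid inbound traffic from {sending_host}")
        return False


service = MappingService({
    ("vpc-black", "10.0.1.1"): "192.168.0.17",
    ("vpc-black", "10.0.1.2"): "192.168.0.24",
    ("vpc-blue", "10.0.1.2"): "192.168.0.31",
})

# A legitimate lookup, followed by a host pretending to run Blue 10.0.1.1.
print(service.lookup("192.168.0.17", "vpc-black", "10.0.1.1", "10.0.1.2"))
print(service.lookup("192.168.0.17", "vpc-blue", "10.0.1.1", "10.0.1.2"))
print(service.alarms)
```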
Routing to non-VPC destinations with the Blackfoot edge device
In the previous sections we have looked at traffic within a single AWS region. But what happens when your traffic is directed at a non-AWS resource like a public internet address or a private address reachable over VPN or Direct Connect?
To answer this question we need to introduce the AWS Blackfoot edge device. Like the Virtual Router and the Mapping Service, this is a generally invisible but essential component of the AWS infrastructure.
Fun fact about the Blackfoot name: it’s named after a South African penguin with… black feet. The device is named Blackfoot because it’s a Linux based network appliance (think Tux), and a large part of the team that designed it was based in Cape Town.
The Blackfoot edge device provides bidirectional translation of VPC traffic to destinations outside AWS. It does this for a number of use cases:
- Directing traffic to and from public IP addresses with the Internet Gateway.
- Directing traffic over VPN connections using IPSec.
- Directing traffic over Direct Connect using VLAN tagging.
- Sending traffic to S3 and DynamoDB using Gateway VPC Endpoints.
In the next sections we will look at each of these in turn. Before we do, you should be aware that although we will be talking about a singular Blackfoot edge device, there will actually be multiple redundant devices performing the edge functions. This can be derived from the fact that an Internet Gateway, a Virtual Private Gateway and Gateway VPC Endpoints are all highly available, automatically scaling, regional entities. The details of Blackfoot’s high availability architecture have not been publicly disclosed.
[Note from the author: the responsibilities of the Blackfoot edge device have been clearly laid out in AWS re:Invent 2015 | (NET403) Another Day, Another Billion Packets. However, how the device actually functions and what networking technology it uses to achieve those responsibilities is mostly inferred from the available data. I will definitely be wrong in a few places, since Nick Matthews commented on this post ‘There are some assumptions here that are incorrect, but don’t have public-facing explanation (sorry :/ ).’ And he would know. There might be additional systems in the architecture, or responsibilities might be differently assigned than I’ve described them. However, even though some details might be off, the general concept will be correct.]
You may have noticed that when you assign a public IP address to an EC2 instance (either dynamic or an Elastic IP), that IP address is visible in the AWS Web Console, but never in the IP configuration of the operating system of that instance. You can look up the public IP with the metadata service, but ip addr will only display the private IP address.
This is because the public address will actually be assigned to the Blackfoot edge device, which will perform one-to-one NAT between the public IP address and your instance’s private address.
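The metadata lookup mentioned above can be sketched as follows, assuming the token-based flow of the instance metadata service (IMDSv2). The function names are my own. `fetch_public_ipv4` only works from inside an EC2 instance, and the request fails with an HTTP error if the instance has no public address.

```python
import urllib.request

IMDS = "http://169.254.169.254"  # link-local metadata address


def token_request(ttl_seconds: int = 60) -> urllib.request.Request:
    """Build the PUT request that asks the metadata service for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def public_ipv4_request(token: str) -> urllib.request.Request:
    """Build the GET request for the public IPv4 address of this instance."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/public-ipv4",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_public_ipv4(timeout: float = 2.0) -> str:
    """Perform both requests. Only meaningful when run on an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(public_ipv4_request(token), timeout=timeout) as resp:
        return resp.read().decode()
```

Calling `fetch_public_ipv4()` on an instance returns the address that the operating system itself never sees on its network interface.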
Under the hood, the same Virtual Router and Mapping Service are being used, but the Blackfoot edge device provides translation between non-AWS and AWS traffic.
Of course your instances can also be reached on their public IP addresses. When an outside resource connects to your IP, the Blackfoot edge device translates the public IP address to your private IP address and adds VPC encapsulation.
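One-to-one NAT in both directions can be sketched as a pair of lookup tables. This is a simplified model built on my own assumptions: the `OneToOneNat` class is hypothetical and the public address comes from a documentation range. How Blackfoot actually stores these associations is not public.

```python
from typing import Dict, Optional, Tuple

VpcAddress = Tuple[str, str]  # (vpc_id, private_ip)


class OneToOneNat:
    """Toy model of the translation an edge device would perform."""

    def __init__(self) -> None:
        self._by_public: Dict[str, VpcAddress] = {}
        self._by_private: Dict[VpcAddress, str] = {}

    def associate(self, public_ip: str, vpc_id: str, private_ip: str) -> None:
        """Bind one public address to exactly one private address in a VPC."""
        self._by_public[public_ip] = (vpc_id, private_ip)
        self._by_private[(vpc_id, private_ip)] = public_ip

    def translate_outbound(self, vpc_id: str, private_ip: str) -> Optional[str]:
        """VPC to internet: which public source address should be used?"""
        return self._by_private.get((vpc_id, private_ip))

    def translate_inbound(self, public_ip: str) -> Optional[VpcAddress]:
        """Internet to VPC: which VPC and private address does this map to?"""
        return self._by_public.get(public_ip)


nat = OneToOneNat()
nat.associate("203.0.113.10", "vpc-black", "10.0.1.1")

print(nat.translate_outbound("vpc-black", "10.0.1.1"))
print(nat.translate_inbound("203.0.113.10"))
```

The inbound result includes the VPC identifier, which is what the edge device would need in order to add the VPC encapsulation described earlier.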
The Blackfoot edge device also performs VPN endpoint functionality. When you set up a VPN connection, two public IP addresses are assigned to you. Like with the Internet Gateway, these addresses will actually be assigned to the Blackfoot edge device. When your Customer Gateway connects to these IP addresses, the connection is technically set up between your CGW and a Blackfoot edge device. The Blackfoot will then terminate the IPSec connection and forward traffic to and from your VPC.
With Direct Connect, you get a physical connection directly into the Amazon network. There are a number of ways you can route traffic from and to Direct Connect: public interfaces, private interfaces and transit interfaces.
A Direct Connect private interface can be connected to a Virtual Private Gateway in AWS. In this case the connection actually terminates on the Blackfoot edge device, which will apply 802.1ad Q-in-Q VLAN tagging (and stripping), thus converting VLAN traffic to VPC traffic and vice versa.
Gateway VPC Endpoints
There are two types of VPC Endpoints: Gateway and Interface endpoints. There are a number of differences, but for now it’s important to know that Gateway Endpoints live on Blackfoot edge devices, and Interface endpoints live on HyperPlane instances. We will cover HyperPlane in part two of the VPC deep dive.
There are only two Gateway VPC Endpoint types: the S3 gateway endpoint and the DynamoDB gateway endpoint. All other VPC endpoints are Interface endpoints.
The goal of VPC Endpoints is to allow traffic directed at public AWS services (like the S3 and DynamoDB APIs) to stay within AWS’ global network. Without VPC Endpoints, an EC2 instance sending traffic to S3 or DynamoDB treats these services as residing on the public internet, because they have public IP addresses.
VPC Endpoints provide better security and performance by intercepting traffic to S3 and DynamoDB on the Blackfoot edge device and routing it to the services’ endpoints over internal networks instead of the public internet.
Gateway Endpoints allow for an additional layer of security; through a policy on the endpoint you can control which buckets or tables can be accessed through the Gateway Endpoint.
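As an illustration, an endpoint policy that only allows reading from and writing to a single bucket could look like the sketch below. The bucket name is hypothetical and the policy is built as a Python dictionary purely so it can be printed as JSON; treat it as an example, not a recommended policy.

```python
import json

BUCKET = "example-bucket"  # hypothetical bucket name, for illustration only

# Endpoint policies use the same JSON policy language as IAM policies.
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        }
    ],
}

policy_json = json.dumps(endpoint_policy, indent=2)
print(policy_json)
```

With a policy like this attached, requests through the endpoint to any other bucket would be denied, regardless of the permissions of the caller.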
In this post we’ve looked at the Mapping Service, the Virtual Router and the Blackfoot edge device. We’ve seen how these three technologies enable AWS to operate on their unique scale.
The end result might seem simple; you store some addresses in a mapping database, you put a router on your physical server and you create a device that can do NAT, VPN and some other things on the edge.
Saying this would be the same as describing Picasso’s last bull as “anybody could draw that”. The fact is they can’t. Only if you understand every detail of networking protocols, the limitations of existing technology, and have the ability to take into account a million or even a billion virtual networks in the future, could you have come up with a design that scales and performs like VPC.
In this post we’ve mainly looked at the design; how VPC works. We’ve scratched the surface on security, but there is so much more happening there. And we haven’t even talked about performance; it’s extremely difficult to create a system like this and not clog up all the tubes.
When writing this post, I went deep into the rabbit hole. Some of the videos I used for research have only 2,000 or 3,000 views. From these nuggets of information I was able to form a picture that is certainly not complete, but hopefully correct in its limited scope. What I have found has, as so often before, blown me away.
If you would like to learn more about what makes AWS networking tick, check out the resources below, my post about AWS Global Accelerator and the upcoming part 2 of this VPC deep dive.
Resources and copyright
The content of this post is based on publicly available resources provided by Amazon and others. If you have a networking background and would like to get an even deeper understanding of how VPC works, I recommend you watch all sessions below:
- AWS re:Invent 2015 | (NET403) Another Day, Another Billion Packets
- AWS re:Invent 2017 | (NET405) Another Day, Another Billion Flows
- AWS re:Invent 2017 | (CMP332) C5 Instances and the Evolution of Amazon EC2 Virtualization
- AWS re:Invent 2017 | Tuesday Night Live with Peter DeSantis
- AWS re:Invent 2018 | (CMP303) Powering Next-Gen EC2 Instances: Deep Dive into the Nitro System
- AWS re:Invent 2018 | (NET313) Amazon VPC: Security at the Speed Of Light
- AWS re:Invent 2019 | (NET406) AWS Transit Gateway reference architectures for many VPCs
- AWS re:Invent 2019 | (CMP303) Powering next-gen Amazon EC2: Deep dive into the Nitro system
- AWS re:Invent 2019 | Monday Night Live with Peter DeSantis
- Networking @Scale 2018 – Load Balancing at Hyperscale
- Amazon VPC for On-Premises Network Engineers – Part 1
- Amazon VPC for On-Premises Network Engineers – Part 2
Thanks to Nick Matthews, Principal Solutions Architect on the AWS networking team, who provided feedback in the comments below:
- Packet Pushers Show 387: AWS Networking – A View From The Inside
- Packet Pushers Heavy Networking 433: An Insider’s Guide To AWS Transit Gateways
Top image: [Photo: “Pasadena, Norton Simon Museum, Picasso P. The Bull, 1946” by Vahe Martirosyan is licensed under CC BY-SA 2.0]