If you’re looking to use Azure as part of your cloud strategy, you may run into challenges regarding Scale Sets and Azure Load Balancing. Here are some tips on Azure Scale Sets and Load Balancers I’d like to share after roughly three months of use. Below, I’ll discuss some Azure features, their purpose, their shortcomings, and workarounds where applicable. I’m using Terraform, so I’ll include code where it is useful or not intuitive.
Scale Sets
A major building block of the platform is Virtual Machine Scale Sets (VMSS). This concept is almost identical to AWS Auto Scaling groups. The purpose is to set a minimum count of nodes (Azure calls these ‘instances’, but I will stick with ‘node’) providing a particular service (running identical images) and to allow for easy scaling on demand, by alert, or on a schedule. Overall this is a good service, but it has some quirks.
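For orientation, a minimal scale set stanza in the pre-2.0 azurerm provider syntax used throughout this post looks roughly like the sketch below; every name and size here is a placeholder, and the required os_profile, storage_profile, and network_profile blocks are elided:

```hcl
resource "azurerm_virtual_machine_scale_set" "main" {
  name                = "my-vmss"   # placeholder name
  location            = "eastus"
  resource_group_name = "${azurerm_resource_group.main.name}"
  upgrade_policy_mode = "Manual"

  sku {
    name     = "Standard_DS1_v2"   # placeholder VM size
    tier     = "Standard"
    capacity = 3                   # minimum node count
  }

  # os_profile, storage_profile_*, and network_profile blocks omitted
}
```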
Overprovision: a feature you’ll likely want to disable. During scale-up actions, Azure’s default behavior is to build an extra node; once services are up, the extra node is destroyed to match the count you originally requested. While this is an interesting idea, during my usage I found a bug (which has been reported) where the Azure CLI (and thus Terraform) builds too many nodes (2x+1). In the end you’ll get the count you requested, but there can be unforeseen consequences. For example, if you build a service that relies on consensus for leader election, your cluster will not reach quorum because there are too many failed nodes. An easy workaround is to disable overprovisioning in the main stanza of your azurerm_virtual_machine_scale_set resource:
resource "azurerm_virtual_machine_scale_set" "main" {
  ...
  overprovision = false
}
Connecting to a scale set node: at least during development, you’ll want to connect directly to scale set nodes. Azure can expose NAT ports that you can reference for SSH. Add a NAT pool resource that references your azurerm_lb frontend_ip_configuration:
resource "azurerm_lb_nat_pool" "test" {
  resource_group_name            = "${azurerm_resource_group.test.name}"
  loadbalancer_id                = "${azurerm_lb.test.id}"
  name                           = "SampleApplicationPool"
  protocol                       = "Tcp"
  frontend_port_start            = 2000
  frontend_port_end              = "${2000 + var.instance_count}"
  backend_port                   = 22
  frontend_ip_configuration_name = "your-load-balancer-frontend-ip-name"
}
resource "azurerm_virtual_machine_scale_set" "test" {
  network_profile {
    ...
    ip_configuration {
      load_balancer_inbound_nat_rules_ids = ["${azurerm_lb_nat_pool.test.*.id}"]
    }
  }
}
Load Balancers
Azure offers two SKUs of layer-4 load balancer: Basic and Standard. Although you may be tempted to use the default Basic, if you’re planning to use TLS I recommend you start with Standard, since Basic cannot perform HTTPS health checks (although you could do TCP on 443). The Basic load balancers are, well, basic, so I’ll spend my time describing the requirements of a Standard LB. Unless otherwise stated, you can assume ‘LB’ below refers to a ‘Standard LB’.
There is a small, ominous paragraph in the documentation for Standard LBs that I have read at least 10 times. The implications are difficult to grok, but by feeling my way in the dark I have worked out some of them. I believe it’s talking about allowing port traffic via NSG (next section) and requiring a public IP for internet traffic (a Basic LB does not), but it could honestly have other implications:
One key aspect is the scope of the virtual network for the resource. While Basic Load Balancer exists within the scope of an availability set, a Standard Load Balancer is fully integrated with the scope of a virtual network and all virtual network concepts apply.
Network Security Groups: (NSGs) are port allowance rules you can add to a network device. In general, you can attach NSGs to VM NICs or to subnets. However, if you’re using a scale set, you can only attach them to the subnet, because VMSS does not expose a per-node NIC.
NSGs allow you to set many parameters per rule, but the ones I use most are:
- source_port_range - due to ephemeral ports, it’s recommended that you filter traffic on the destination; however, you can set this to the corresponding OS high port range.
- destination_port_range - the service port you’re filtering traffic to.
- source_address_prefix - an IP or CIDR block from which traffic will be allowed.
If you want to use NAT traffic to SSH to your machines, you must allow inbound traffic to destination port 22. I recommend setting source_address_prefix to protect yourself.
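As a sketch, an NSG rule allowing SSH from a known CIDR might look like the following; the rule name, priority, and office CIDR are assumptions, and the NSG itself would then be attached to the scale set’s subnet per the note above:

```hcl
resource "azurerm_network_security_rule" "ssh" {
  name                        = "allow-ssh"        # placeholder rule name
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"                # filter on destination, not source
  destination_port_range      = "22"
  source_address_prefix       = "203.0.113.0/24"   # placeholder office CIDR
  destination_address_prefix  = "*"
  resource_group_name         = "${azurerm_resource_group.test.name}"
  network_security_group_name = "${azurerm_network_security_group.test.name}"
}
```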
External Traffic: if you want your LB to have access to external networks, like internet traffic or public-facing Azure services, you must create a standard-SKU, static public IP and associate it with the azurerm_lb:
resource "azurerm_public_ip" "test" {
  name                         = "my-public-ip"
  location                     = "eastus"
  resource_group_name          = "${azurerm_resource_group.test.name}"
  public_ip_address_allocation = "static"
  sku                          = "standard"
}

resource "azurerm_lb" "test" {
  ...
  sku = "standard"

  frontend_ip_configuration {
    public_ip_address_id = "${azurerm_public_ip.test.id}"
    ...
  }
}
Note: Azure does offer Service Endpoints, which allow you to access some services via private networking: Azure KeyVault, Storage Containers, etc. However, some services are missing, such as querying management.azure.com, so this feature is not useful to me yet.
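When Service Endpoints do cover a service you need, enabling them is a one-line change on the subnet. A sketch, where the subnet name and address prefix are placeholders:

```hcl
resource "azurerm_subnet" "internal" {
  name                 = "internal"            # placeholder subnet name
  resource_group_name  = "${azurerm_resource_group.test.name}"
  virtual_network_name = "${azurerm_virtual_network.test.name}"
  address_prefix       = "10.0.2.0/24"         # placeholder CIDR

  # enable private routing to KeyVault and Storage from this subnet
  service_endpoints = ["Microsoft.KeyVault", "Microsoft.Storage"]
}
```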
Monitoring: probably my biggest critique of Azure’s offering so far is its lack of quick access to LB health probe statuses. Everyone wants to know the status of their health probes, but the only way to find them in Azure is by building a metric. Though this process is not intuitive, thankfully it is simple:
- Monitoring -> Metrics -> select the LB resource -> Add a Metric -> Health Probe Status -> Average

This metric will show you the percentage of successful health probes based on the probes you’ve set up; you want to see the line at 100. You can add filters to narrow down to probes on specific ports.
I also recommend creating another metric for Data Path Availability -> Average. This will show the status of the Azure Load Balancer service itself. Since health probes come from an Azure service, knowing the status of the service providing your health probes may help during an outage.
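If you prefer the CLI, something like the following should pull the same data. This is a sketch: I’m assuming DipAvailability and VipAvailability are the API names behind Health Probe Status and Data Path Availability, and the resource ID below is a placeholder you must substitute:

```shell
# Placeholder LB resource ID - substitute your own
LB_ID="/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Network/loadBalancers/mylbname"

# DipAvailability ~ Health Probe Status; VipAvailability ~ Data Path Availability
az monitor metrics list --resource "$LB_ID" --metric DipAvailability --aggregation Average
az monitor metrics list --resource "$LB_ID" --metric VipAvailability --aggregation Average
```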
Quick Tips
Resource IDs: all resources have an ID whose syntax follows the template below. You can find your specific resource IDs at https://resources.azure.com/
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/{resourceProviderNamespace}/{resourceType}/{resourceName}
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Network/loadBalancers/mylbname
Service Principals: (SPNs) programmatic access in Azure is centered around the concept of application SPNs in Azure AD. An SPN is made up of four values: tenant_id, subscription_id, client_id, and client_secret.
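Those four values are exactly what the Terraform azurerm provider takes. A sketch, with the IDs obviously placeholders and the secret fed in via a variable rather than hardcoded:

```hcl
provider "azurerm" {
  tenant_id       = "00000000-0000-0000-0000-000000000000" # placeholder
  subscription_id = "00000000-0000-0000-0000-000000000000" # placeholder
  client_id       = "00000000-0000-0000-0000-000000000000" # placeholder
  client_secret   = "${var.client_secret}"                 # pass in, don't commit
}
```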
If your team is interested in other tips and best practices when leveraging Azure, Nebulaworks is here to help! Reach out to us with your questions.