If you’re looking to use Azure as part of your cloud strategy, you may run into challenges regarding Scale Sets and Azure Load Balancing. Here are some tips on Azure Scale Sets and Load Balancers I’d like to share after roughly three months of use. Below, I’ll discuss some Azure features, their purpose, their shortcomings, and workarounds where applicable. I’m using Terraform, so I’ll include code where it is useful or not intuitive.
Scale Sets
A major building block of the platform is Virtual Machine Scale Sets (VMSS). This concept is almost identical to AWS Auto Scaling groups. The purpose is to set a minimum count of nodes (Azure calls these ‘instances’, but I will stick with ‘node’) providing a particular service (running identical images) and to allow for easy scaling on demand, by alert, or on a schedule. Overall this is a good service, but it has some quirks.
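For orientation, a minimal scale set stanza in the pre-2.0 azurerm provider syntax used throughout this post looks roughly like the sketch below; every name and size here is a placeholder, and the required os_profile, storage_profile, and network_profile blocks are elided:

```hcl
resource "azurerm_virtual_machine_scale_set" "main" {
  name                = "my-vmss"   # placeholder name
  location            = "eastus"
  resource_group_name = "${azurerm_resource_group.main.name}"
  upgrade_policy_mode = "Manual"

  sku {
    name     = "Standard_DS1_v2"   # placeholder VM size
    tier     = "Standard"
    capacity = 3                   # minimum node count
  }

  # os_profile, storage_profile_*, and network_profile blocks omitted
}
```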
Overprovision: a feature you’ll likely want to disable. During scale-up actions, Azure’s default behavior is to build an extra node; once services are up, the extra node is destroyed to match the count you originally requested. While this is an interesting idea, during my usage I found a bug (which has been reported) where the Azure CLI (and thus Terraform) builds too many nodes (2x+1). In the end you’ll get the count you requested, but there can be unforeseen consequences. For example, if you build a service that relies on consensus for leader election, your cluster will not reach quorum because there are too many failed nodes. An easy workaround is to disable overprovisioning in the main stanza of your azurerm_virtual_machine_scale_set resource:
resource "azurerm_virtual_machine_scale_set" "main" {
  ...
  overprovision = false
}
Connecting to a scale set node: at least during development, you’ll want to connect directly to scale set nodes. Azure can expose NAT ports that you can reference for SSH. Add a NAT pool resource that references your azurerm_lb frontend_ip_configuration:
resource "azurerm_lb_nat_pool" "test" {
  resource_group_name            = "${azurerm_resource_group.test.name}"
  loadbalancer_id                = "${azurerm_lb.test.id}"
  name                           = "SampleApplicationPool"
  protocol                       = "Tcp"
  frontend_port_start            = 2000
  frontend_port_end              = "${2000 + var.instance_count}"
  backend_port                   = 22
  frontend_ip_configuration_name = "your-load-balancer-frontend-ip-name"
}
resource "azurerm_virtual_machine_scale_set" "test" {
  network_profile {
    ...
    ip_configuration {
      load_balancer_inbound_nat_rules_ids = ["${azurerm_lb_nat_pool.test.*.id}"]
    }
  }
}
Load Balancers
Azure offers two SKUs of layer-4 load balancer: Basic and Standard. Although you may be tempted to use the default Basic, if you’re planning to use TLS I recommend you start with Standard, since Basic cannot perform HTTPS health checks (although you could do TCP on 443). The Basic load balancers are, well, basic, so I’ll spend my time describing the requirements of a Standard LB. Unless otherwise stated, you can assume ‘LB’ below refers to a ‘Standard LB’.
There is a small, ominous paragraph in the documentation for Standard LBs that I have read at least 10 times. The implications are difficult to grok, but by feeling my way in the dark I have worked out some of them. I believe it’s talking about allowing port traffic via NSG (next section) and requiring a public IP for internet traffic (a Basic LB does not), but it could honestly have other implications:
One key aspect is the scope of the virtual network for the resource. While Basic Load Balancer exists within the scope of an availability set, a Standard Load Balancer is fully integrated with the scope of a virtual network and all virtual network concepts apply.
Network Security Groups: (NSGs) are port allowance rules you can add to a network device. In general, you can attach NSGs to VM NICs or to subnets. However, if you’re using a scale set, you can only attach them to the subnet, because VMSS does not expose a per-node NIC.
NSGs allow you to set many parameters per rule, but the ones I use most are:
- source_port_range - due to ephemeral ports, it’s recommended that you filter traffic on the destination; however, you can set this to the corresponding OS high port range.
- destination_port_range - the service port you’re filtering traffic to.
- source_address_prefix - an IP or CIDR block from which traffic will be allowed.
If you want to use NAT traffic to SSH to your machines, you must allow inbound traffic to destination port 22. I recommend setting source_address_prefix to protect yourself.
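As a sketch, an NSG rule allowing SSH from a known CIDR might look like the following; the rule name, priority, and office CIDR are assumptions, and the NSG itself would then be attached to the scale set’s subnet per the note above:

```hcl
resource "azurerm_network_security_rule" "ssh" {
  name                        = "allow-ssh"        # placeholder rule name
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"                # filter on destination, not source
  destination_port_range      = "22"
  source_address_prefix       = "203.0.113.0/24"   # placeholder office CIDR
  destination_address_prefix  = "*"
  resource_group_name         = "${azurerm_resource_group.test.name}"
  network_security_group_name = "${azurerm_network_security_group.test.name}"
}
```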
External Traffic: if you want your LB to have access to external networks, like internet traffic or public-facing Azure services, you must create a standard-SKU, static public IP and associate it with the azurerm_lb:
resource "azurerm_public_ip" "test" {
  name                         = "my-public-ip"
  location                     = "eastus"
  resource_group_name          = "${azurerm_resource_group.test.name}"
  public_ip_address_allocation = "static"
  sku                          = "standard"
}

resource "azurerm_lb" "test" {
  ...
  sku = "standard"

  frontend_ip_configuration {
    public_ip_address_id = "${azurerm_public_ip.test.id}"
    ...
  }
}
Note: Azure does offer Service Endpoints, which allow you to access some services via private networking: Azure KeyVault, Storage Containers, etc. However, some services are missing, such as querying management.azure.com, so this feature is not useful to me yet.
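When Service Endpoints do cover a service you need, enabling them is a one-line change on the subnet. A sketch, where the subnet name and address prefix are placeholders:

```hcl
resource "azurerm_subnet" "internal" {
  name                 = "internal"            # placeholder subnet name
  resource_group_name  = "${azurerm_resource_group.test.name}"
  virtual_network_name = "${azurerm_virtual_network.test.name}"
  address_prefix       = "10.0.2.0/24"         # placeholder CIDR

  # enable private routing to KeyVault and Storage from this subnet
  service_endpoints = ["Microsoft.KeyVault", "Microsoft.Storage"]
}
```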
Monitoring: probably my biggest critique of Azure’s offering so far is its lack of quick access to LB health probe statuses. Everyone wants to know the status of their health probes, but the only way to find them in Azure is by building a metric. Though this process is not intuitive, thankfully it is simple:
- Monitoring -> Metrics -> select the LB resource -> Add a Metric -> Health Probe Status -> Average

This metric will show you the percentage of successful health probes based on the probes you’ve set up; you want to see the line at 100. You can add filters to narrow down to probes on specific ports.
I also recommend creating another metric for Data Path Availability -> Average. This will show the status of the Azure Load Balancer service itself. Since health probes come from an Azure service, knowing the status of the service providing your health probes may help during an outage.
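If you prefer the CLI, something like the following should pull the same data. This is a sketch: I’m assuming DipAvailability and VipAvailability are the API names behind Health Probe Status and Data Path Availability, and the resource ID below is a placeholder you must substitute:

```shell
# Placeholder LB resource ID - substitute your own
LB_ID="/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Network/loadBalancers/mylbname"

# DipAvailability ~ Health Probe Status; VipAvailability ~ Data Path Availability
az monitor metrics list --resource "$LB_ID" --metric DipAvailability --aggregation Average
az monitor metrics list --resource "$LB_ID" --metric VipAvailability --aggregation Average
```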
Quick Tips
Resource IDs: all resources have an ID whose syntax follows the template below. You can find your specific resource IDs at https://resources.azure.com/
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/{resourceProviderNamespace}/{resourceType}/{resourceName}
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Network/loadBalancers/mylbname
Service Principals: (SPNs) programmatic access in Azure is centered around the concept of application SPNs in Azure AD. An SPN is made up of four values: tenant_id, subscription_id, client_id, and client_secret.
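Those four values are exactly what the Terraform azurerm provider takes. A sketch, with the IDs obviously placeholders and the secret fed in via a variable rather than hardcoded:

```hcl
provider "azurerm" {
  tenant_id       = "00000000-0000-0000-0000-000000000000" # placeholder
  subscription_id = "00000000-0000-0000-0000-000000000000" # placeholder
  client_id       = "00000000-0000-0000-0000-000000000000" # placeholder
  client_secret   = "${var.client_secret}"                 # pass in, don't commit
}
```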
If your team is interested in other tips and best practices when leveraging Azure, Nebulaworks is here to help! Reach out to us with your questions.