F5 Distributed Cloud

223 Topics

Simplify Network Segmentation for Hybrid Cloud
Introduction Enterprises have always had the need to maintain separate development and production environments. Operational efficiency, reduction of blast radius, security and compliance are generally the common objectives behind separating these environments. By dividing networks into smaller, isolated segments, organizations can enhance security, optimize performance, and ensure regulatory compliance. This article demonstrates a practical strategy for implementing network segmentation in modern multicloud environments that also connect on-prem infrastructure. This uses F5 Distributed Cloud (F5 XC) services to connect and secure network segments in cloud environments like Amazon Web Services (AWS) and on-prem datacenters. Need for Segmentation Network segmentation is critical for managing complex enterprise environments. Traditional methods like Virtual Routing and Forwarding (VRFs) and Multiprotocol Label Switching (MPLS) have long been used to create isolated network segments in on-prem setups. F5 XC ensures segmentation in environments like AWS and it can extend the same segmentation to on-prem environments. These techniques separate traffic, enhance security, and improve network management by preventing unauthorized access and minimizing the attack surface. Scenario Overview Our scenario depicts an enterprise with three different environments (prod, dev, and shared services) extended between on-prem and cloud. A 3rd party entity requires access to a subset of the enterprise's services. This article, covers the following two networking segmentation use-cases: Hybrid Cloud Transit Extranet (servicing external 3 rd party partners/customers) Hybrid Cloud Transit Consider an enterprise with three distinct environments: Production (Prod), Development (Dev), and Shared Services. Each environment requires strict isolation to ensure security and performance. Using F5 XC Cloud Connect, we can assign each VPC a network segment effectively isolating the VPC’s. Segments in multiple locations (or VPC’s) can traverse F5 XC to reach distant locations whether in another cloud environment or on-prem. Network segments are isolated by default, for example, our Prod segment cannot access Shared. A segment connector is needed to allow traffic between Prod and Shared. The following diagram shows the VPC segments, ensuring complete "ships in the night" isolation between environments. In this setup, Prod, Dev, and Shared Services environments operate independently and are completely isolated from one another at the control plane level. This ensures that any issues or attacks in one environment do not affect the others. Customer Requirement: Shared Services Access Many enterprises deploy common services across their organization to support internal workloads and applications. Some examples include DHCP, DNS, NTP, and NFS, services that need to be accessible to both Prod and Dev environments while keeping Prod and Dev separate from each other. Segment Connectors is a method to allow communication between two isolated segments by leaking the routes between the source and destination segments. It is important to note that segment connector can be of type Direct or SNAT. Direct allows bidirectional communication between segments whereas the SNAT option allows unidirectional communication from the source to the destination. Extending Segmentation to On-Premises Enterprises already use segmented networks within their on-premises infrastructure. Extending this segmentation to AWS involves creating similar isolated segments in the cloud and establishing secure communication channels. F5 XC allows you to easily extend this segmentation from on-prem to the cloud regardless of the underlay technology. In this scenario, communication between the on-premises Prod segment and its cloud counterpart is seamless, and the same also applies for the Dev segment. Meanwhile Dev and Prod stay separate ensuring that existing security and isolation is preserved across the hybrid environment. Extranet In this scenario an external entity (customer/partner) needs access to a few applications within our Prod segment. There are two different ways to enable this access, Network-centric and App-centric. Let’s refer to the external entity as Company B. In order to connect Company B we generally need appropriate cloud credentials, but Company B will not share their cloud credentials with us. To solve this problem, F5 XC recommends using AWS STS:AssumeRole functionality whereby Company B creates an AWS IAM Role that trusts F5 XC with the minimum privileges necessary to configure Transit Gateway (TGW) attachments and TGW route table entries to extend access to the F5 XC network or network segments. Section 1 – Network-centric Extranet Many times, partners & customers need to access a unique subset of your enterprise’s applications. This can be achieved with F5 XC’s dedicated network segments and segment connectors. With a segment connector for the external and prod network segments, we can give Company B access to the required HTTP service without gaining broader access to other non-Prod segments. Locking Down with Firewall Policies We can implement a Zero Trust firewall policy to lock down access from the external segment. By refining these policies, we ensure that third-party consumers can only access the services they are authorized to use. Our firewall policy on the CE only allows access from the external segment to the intended application on TCP/80 in Prod. [ec2-user@ip-10-150-10-146 ~]$ curl --head 10.1.10.100 HTTP/1.1 200 OK Server: nginx/1.24.0 (Ubuntu) Date: Thu, 30 May 2024 20:50:30 GMT Content-Type: text/html Content-Length: 615 Last-Modified: Wed, 22 May 2024 21:35:11 GMT Connection: keep-alive ETag: "664e650f-267" Accept-Ranges: bytes [ec2-user@ip-10-150-10-146 ~]$ ping -O 10.1.10.100 PING 10.1.10.100 (10.1.10.100) 56(84) bytes of data. no answer yet for icmp_seq=1 no answer yet for icmp_seq=2 no answer yet for icmp_seq=3 ^C --- 10.1.10.100 ping statistics --- 4 packets transmitted, 0 received, 100% packet loss, time 3153ms After applying the new policies, we confirm that the third-party access is restricted to the intended services only, enhancing security and compliance. This demonstrates how F5 Distributed Cloud services enable networking segmentation across on-prem and cloud environments, with granular control over security policies applied between the segments. Section 2 - App-centric Extranet In the scenario above, Company B can directly access one or more services in Prod with a segment connector and we’ve locked it down with a firewall policy. For the App-centric method, we’ll only publish the intended services that live in Prod to the external segment. App-centric connectivity is made possible without a segment connector by using load balancers within App Connect that target the application within the Prod segment and advertises its VIP address to the external segment. The following illustration shows how to configure each component in the load balancer. Visualization of Traffic Flows The visualization flow analysis tool in the F5 XC Console shows traffic flows between the connected environments. By analyzing these flows, particularly between third-party consumers and the Prod environment, we can identify any unintended access or overreach. The following diagram is for a Network-centric connection flow: This following diagram shows an App-centric connection flow using the load balancer: Product Feature Demo Conclusion Effective network segmentation is a cornerstone of secure and efficient cloud environments. We’ve discussed how F5 XC enables hybrid cloud transit and extranet communication. Extranet can be done with either a network centric or app-centric deployment. F5 XC is an end to end platform that manages and orchestrates end-to-end segmentation and security in hybrid-cloud environments. Enterprises can achieve comprehensive segmentation, ensuring isolation, secure access, and compliance. The strategies and examples provided demonstrate how to implement and manage segmentation across hybrid environments, catering to diverse requirements and enhancing overall network security. Additional Resources More features and guidance are provided in the comprehensive guide below, where showing exactly how you can use the power and flexibility of F5 Distributed Cloud and Cloud Connect to deliver a Network-centric approach with a firewall and an App-centric approach with a load balancer. Create and manage segmented networks inyour own cloud and on-prem environments, and achieve the following benefits: Ability to isolate environments within AWS Ability to extend segmentation to on-prem environments Ability to connect external partners or customers to a specific segment Use Enhanced Firewall Policies to limit access and reduce the blast radius Enhance the compliance and regulatory requirements by isolating sensitive data and systems Visualize and monitor the traffic flows and policies across segments and network domains Workflow Guide - Secure Network Fabric (Multi-Cloud Networking) YouTube: Using network segmentation for hybrid-cloud and extranet with F5 Distributed Cloud Services DevCentral:Secure Multicloud Networking Article Series GitHub: S-MCN Use-case Playbooks (Console, Automation) for F5 Distributed Cloud Customers F5.com: Product Information Product Documentation Network Segmentation Cloud Connect Network Segment Connectors App Security App Networking CE Site Management
Dave_Potter
Jul 23, 2024 Place Technical Articles
150Views
0likes
0Comments
Customer driven Site Deployment Using AWS and F5 Distributed Cloud Terraform Modules
Introduction and Problem Scope F5 Distributed Cloud Mesh’s Secure Networking provides connectivity and security services for your applications running on the Edge, Private Clouds, or Public Clouds. This simplifies the deployment and configuration of connectivity and security services for your Multi-Cloud and Edge Cloud deployment needs across heterogeneous environments. F5 Distributed Cloud Services leverage the“Site” construct to deploy our Secure Mesh or AppStack Site instances to manage workloads. A Site could be a customer location like AWS, Azure, Google Cloud Platform (GCP), private cloud, or an edge site. To run F5 Distributed Cloud Services, the site needs to be deployed with one or more instances ofF5 Distributed Cloud Node, a software appliance that is managed by F5 Distributed Cloud Console. This site is where customer applications and F5 Distributed Cloud services are running. To deploy a Node, different options are available: Use F5 Distributed Cloud Services Console to deploy a site Leverage F5 Distributed Cloud Services Terraform provider to deploy a site following F5 Distributed Cloud Services Console user experience Use F5 Distributed Cloud Services Terraform modules Documentation of all the different deployment patterns found at https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management A customer may not want to leverage the above two options since they rely on using F5 Distributed Cloud Services Console. Reasons not to use the mentioned two options could be: Security and Privacy Concerns Data Security: Reluctance to share sensitive data with another organization. Access Keys: Not willing to share cloud provider access keys or credentials. Compliance: Need to comply with specific regulatory requirements (e.g., GDPR (General Data Protection Regulation), HIPAA) that require control over data. Control and Customization Customization: Need for a highly customized orchestration solution tailored to specific requirements, to create networking and service topologies considering brownfield realities Cost and Resource Management: Resource Allocation: Better control over resource allocation and optimization Operational Considerations Support: Preference for internal support and troubleshooting over relying on external support. Uptime and SLAs (Service Level Agreements): Concerns about meeting service level agreements (SLAs) and uptime requirements To be able to roll out a site despite the points mentioned above, it is possible for the customer to manage the lifecycle of a site outside the F5 Distributed Cloud Services Console. F5 Distributed Cloud Services created a set ofterraform modules to help customers manage the lifecycle of a site outside of F5 Distributed Cloud Services Console. Those modules are available at: AWS module Azure module GCP module The F5 Distributed Cloud Services Site Management documentation provides an overview of all available site types and their documentation on the topic of provisioning. Though many topologies could be deployed via F5 Distributed Cloud Services Console, the following AWS, Azure, GCP topologies can only be realized using Terraform modules: Single Node Single NIC existing VPC / subnet and 3rd party NAT GW Single Node Multi NIC existing VPC / subnet and 3rd party NAT GW Three Node Single NIC existing VPC / subnet and 3rd party NAT GW Three Node Multi NIC existing VPC / subnet and 3rd party NAT GW Any other external resource and its attributes that are to be used, e.g. credentials from Vault systems, IAM policies, SSH keys Deployment Scenario in AWS The F5 DevCentral GitHubproject contains Terraform templates to provision greenfield and / or brownfield Customer Edge (CE) topologies in AWS, GCP, and Azure with multiple use case script templates in respective repositories. To exemplify one of the scenarios, in this article, we walk through the journey a customer would undertake to provision a CE site in AWS using Terraformmodules. High-level Sequence workflow All of the AWS, GCP, and Azure scenarios follow similar high-level steps, as shown in Fig.1. Step 1: F5 Distributed Cloud Services tenant needs to be ready and user to access tenant set up. Step 2: Clone the desired AWS, GCP, or Azure repo from F5 DevCentral GitHub project. For AWS, this is https://github.com/f5devcentral/terraform-xc-aws-ce. Each of these repositories contains multiple deployment scenarios called topologies. Each topology is described by its own readme "readme.md" file. The description includes The resource objects that are created Use instructions and all requirements to be able to create the topology. Especially in brownfield environments Step 3: Customize the “terraform.tfvars” file to the customer’s specific context. These include Distributed Cloud specific parameters. The parameters in this file are described in relation to the function it serves for the specific scenario. Step 4: Run through the Init/Plan/Deploy workflow of Terraform deployment and verify the status of the CE Site using F5 Distributed Cloud Services Console. The Terraform reconciliation functions ensure meeting the intended objectives. Fig. 1: High-Level Sequence Diagram Customer deployment topology description We will explain the above steps in the context of a greenfield deployment, the Terraform scripts of which are available here. The corresponding logical topology view of this deployment is shown in Fig.2. This deployment scenario instantiates the following resources: Single-node CE cluster AWS SLO interface AWS VPC AWS SLO interface subnet AWS route tables AWS Internet Gateway Assign AWS EIP to SLO The objective of this deployment is to create a Site with a single CE node in a new VPC for the provided AWS region and availability zone. The CE will be created as an AWS EC2 instance. An AWS subnet is created within the VPC. The CE Site Local Outside (SLO) interface will be attached to the VPC subnet and the created EC2 instance. SLO is a logical interface of a site (CE node) through which reachability is achieved to external (e.g. Internet or other services outside the public cloud site). To enable reachability to the Internet, the default route of the CE node will point to the AWS Internet gateway. Also, the SLO will be configured with an AWS External IP address (Elastic IP). Fig.2. Customer Deployment Topology in AWS Description of input parameters in Terraform vars file Parameters must be customized to adapt to the customer’s environment. The definition of the parameters in the “terraform.tfvars” is as follows: Parameters Definitions owner Identifies the email of the IT manager used to authenticate to the AWS system project_prefix Prefix that will be used to identify the resource objects in AWS and XC. project_suffix The suffix that will be used to identify the site resources in AWS and XC ssh_public_key_file Local file system’s path to ssh public key file f5xc_tenant Full F5XC tenant name f5xc_api_url F5XC API url f5xc_cluster_name Name of the Cluster f5xc_api_p12_file Local file system path to api_cert_file (downloaded from XC Console) aws_region AWS region for the XC Site aws_existing_vpc_id Existing VPC ID (brownfield) aws_vpc_cidr_block CIDR Block of the VPC aws_availability_zone AWS Availability Zone (a) aws_vpc_slo_subnet_node0 AWS Subnet in the VPC for the SLO subnet Configuring other environmental variables Export the following environment variables in the working shell, setting it to customer’s deployment context. Environment Variables Definitions AWS_ACCESS_KEY AWS Access key for authentication AWS_SECRET_ACCESS_KEY AWS Secret key for authentication VES_P12_PASSWORD XC P12 Password from Console TF_VAR_f5xc_api_p12_cert_password Same as VES_P12_PASSWORD Deploy Topology Deploy the topology with: terraform init terraform plan terraform deploy –auto-approve and monitor the status of the Sites on the F5 Distributed Cloud Services Console. Created site object will be available in Secure Mesh Site section of the F5Distributed CloudServices Console. Video-based description of the deployment Scenario This demonstration video shows the procedure for provisioning the deployment topology described above in three steps. <p><iframe src="https://www.youtube.com/watch?v=8_T3dQSEdhc" width="750" height="422" frameborder="0" allowfullscreen></iframe></p> References https://docs.cloud.f5.com/docs-v2/platform/services/mesh/secure-networking https://docs.cloud.f5.com/docs-v2/platform/concepts/site https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management/deploy-aws-site-terraform https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/troubleshooting/troubleshoot-manual-ce-deployment-registration-issues Note: This project is open source and actively monitored by F5 XC on a best-effort basis. While there is no formal commitment regarding service level agreements (SLA) or support assistance, we encourage the community to report any issues through GitHub. Customers and partners are warmly invited to contribute to the code, fostering a collaborative environment that enhances the project's development and usability.
C__Klewar
Jul 17, 2024 Place Technical Articles
122Views
0likes
0Comments
F5 Distributed Cloud, from manual configuration to Terraform automation

Valentin_Tobi
Jul 11, 2024 Place Technical Articles
109Views
1like
0Comments
F5 Distributed Cloud: Virtual Sites - Regional Edge (RE)

MattHarmon
Jul 10, 2024 Place Technical Articles
405Views
6likes
2Comments
Customer-driven Site Deployment Using AWS and F5 Distributed Cloud Terraform Modules
Introduction and Problem Scope F5 Distributed Cloud Mesh’s Secure Networking provides connectivity and security services for your applications running on the Edge, Private Clouds, or Public Clouds. This simplifies the deployment and configuration of connectivity and security services for your Multi-Cloud and Edge Cloud deployment needs across heterogeneous environments. F5 Distributed Cloud Services leverages the “Site” construct to deploy our Secure Mesh or AppStack Site instances to manage workloads. A Site could be a customer location like AWS, Azure, GCP (Google Cloud Platform), private cloud, or an edge site. To run F5 Distributed Cloud Services, the site needs to be deployed with one or more instances of F5 Distributed Cloud Node, a software appliance that is managed by F5 Distributed Cloud Console. This site is where customer applications and F5 Distributed Cloud services are running. To deploy a Node, different options are available: Customer deployment topology description We will explain the above steps in the context of a greenfield deployment, the Terraform scripts of which are available here. The corresponding logical topology view of this deployment is shown in Fig.2. This deployment scenario instantiates the following resources: Single-node CE cluster AWS SLO interface AWS VPC AWS SLO interface subnet AWS route tables AWS Internet Gateway Assign AWS EIP to SLO The objective of this deployment is to create a Site with a single CE node in a new VPC for the provided AWS region and availability zone. The CE will be created as an AWS EC2 instance. An AWS subnet is created within the VPC. CE Site Local Outside (SLO) interface will be attached to VPC subnet and the created EC2 instance. SLO is a logical interface of a site (CE node) through which reachability is achieved to external (e.g. Internet or other services outside the public cloud site). To enable reachability to the Internet, the default route of the CE node will point to the AWS Internet gateway. Also, the SLO will be configured with an AWS External IP address (Elastic IP). Fig.2. Customer Deployment Topology in AWS List of terraform input parameters provided in vars file Parameters must be customized to adapt to the customer environment. The definition of the parameters in the “terraform.tfvars” show in below table. Parameters Definitions owner Identifies the email of the IT manager used to authenticate to the AWS system project_prefix Prefix that will be used to identify the resource objects in AWS and XC. project_suffix The suffix that will be used to identify the site’s resources in AWS and XC ssh_public_key_file Local file system’s path to ssh public key file f5xc_tenant Full F5XC tenant name f5xc_api_url F5XC API url f5xc_cluster_name Name of the Cluster f5xc_api_p12_file Local file system path to api_cert_file (downloaded from XC Console) aws_region AWS region for the XC Site aws_existing_vpc_id Existing VPC ID (brownfield) aws_vpc_cidr_block CIDR Block of the VPC aws_availability_zone AWS Availability Zone (a) aws_vpc_slo_subnet_node0 AWS Subnet in the VPC for the SLO subnet Configuring other environmental variables Export the following environment variables in the working shell, setting it to customer’s deployment context. Environment Variables Definitions AWS_ACCESS_KEY AWS Access key for authentication AWS_SECRET_ACCESS_KEY AWS Secret key for authentication VES_P12_PASSWORD XC P12 Password from Console TF_VAR_f5xc_api_p12_cert_password Same as VES_P12_PASSWORD Deploy Topology Deploy the topology with: terraform init terraform plan terraform deploy –auto-approve And monitor the status of the Sites on the F5 Distributed Cloud Services Console. Created site object will be available in Secure Mesh Site section of the F5Distributed CloudServices Console. Video-based description of the deployment Scenario This demonstration video shows the procedure for provisioning the deployment topology described above in three steps. References https://docs.cloud.f5.com/docs-v2/platform/services/mesh/secure-networking https://docs.cloud.f5.com/docs-v2/platform/concepts/site https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management/deploy-aws-site-terraform https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/troubleshooting/troubleshoot-manual-ce-deployment-registration-issues
C__Klewar
Jul 08, 2024 Place Technical Articles
139Views
0likes
0Comments
Configure Generic Webhook Alert Receiver using F5 Distributed Cloud Platform
Generic Webhook Alerts feature in F5 Distributed Cloud (F5 XC) gives feasibility to easily configure and send Alert notifications related to Application Infrastructure (IaaS) to specified URL receiver. F5 XC SaaS console platform sends alert messages to web servers to receive as soon as the events gets triggered.
chaithanya_dileep
Jul 02, 2024 Place Technical Articles
117Views
2likes
0Comments
Customer Edge Site High Availability for Application Delivery - Reference Architecture
Purpose This guide describes the reference architecture for deploying a highly available F5 Distributed Cloud (F5XC) Customer Edge (CE) site. It explains the networking options available to deploy a highly available multi-node CE site in an on-premise data center, branch location, or on the public cloud when deployed manually. Audience This guide is for technical readers, including NetOps and Solution Architect teams who want to better understand the various options for deploying a highly available F5 Distributed Cloud Customer Edge (CE) site. The guide assumes the reader is familiar with basic networking concepts like routing protocols, DNS, and data center network architecture. Also, the reader must be aware of various F5XC concepts such as Load Balancing, BGP configuration, Sites and Virtual Sites, and Site Local Inside (SLI) and Site Local Outside (SLO) interfaces. Introduction To create a resilient network architecture, all components on the network must be deployed in a redundant topology to handle device and connectivity failures. A CE acts as an L7 gateway and sits in the path of the network traffic, hence it needs a redundant architecture. For a production setup, it is recommended to deploy the site as a three-node cluster. These three nodes are the control nodes. Additional worker nodes can be added for higher L7 and security performance. Clustering on CE Site A CE can be deployed as a multi-node site for redundancy and scaling performance. CE runs kubernetes on its nodes and inherits k8s HA architecture of having either one or three control nodes and optional worker nodes. Production deployments are recommended to have 3 control nodes for redundancy and additional worker nodes to meet the performance requirements of the site. A multi-node site can tolerate one control node failure as it needs at least 2 nodes to form the quorum for HA. It's important to ensure that multiple control nodes don't fail simultaneously in each site. Worker node failures do not cause the whole site to fail. It only reduces the total throughput the site can handle. Note: The control nodes may also be addressed as master nodes in legacy documentation. Although they are called control nodes, they run both the control plane and data plane functions. Figure: CE Clustering In a multi-node setup, two CE control nodes form tunnels to the two closest REs. If one of the control nodes with a tunnel fails, it gets reassigned to the remaining control node. In a single node site, the same node forms tunnels to two different REs. Worker nodes are not supported for sites with a single control node. Figure: RE – CE connectivity CE Site HA Options In a regular deployment, a multi-node CE site is used to achieve redundancy. A load balancer configured on the CE site uses the IP address of SLI, SLO, or both interfaces as VIP by default. But this means the load balancer domain/hostname will need to resolve to multiple IP addresses across the nodes of the CE. To simplify this, F5XC also allows users to specify a custom IPv4 address as the VIP for each load balancer. An alternative topology is to use multiple single-node sites deployed across different availability zones in the data center or public cloud. In this case, the sites can be grouped into a Virtual Site. A load balancer can be configured with a custom VIP advertised to this Virtual Site. Both of these options are explained in detail in the sections below. High Availability Options for Single CE Site This section describes the deployment options available to direct the traffic across the CE nodes and lists the pros and cons of each option. The feasibility of these options may vary by environment (on-premise or public cloud) and networking tools available. These nuances are also explained for each option. For L4 and L7 load balancer VIPs on the CE site, all nodes (control and worker nodes) can actively receive the traffic. The site bandwidth scales linearly with the number of nodes. So, multiple worker nodes can be deployed based on the performance requirements of the site. For public load balancers, the VIP is on RE, and the bandwidth is limited to the bandwidth of the tunnel connecting the two CE control nodes to the REs. Layer 3 Redundancy Using Static Routing With ECMP This is the simplest way to configure redundancy for Load Balancer VIP on the CE cluster. The application admin can configure the LB with a user-specified VIP and the Network admin can configure equal-cost static routes for this VIP IP addresses, with the SLI/SLO IP addresses of the CE nodes as the next hop. The router uses Equal Cost Multi-Path (ECMP) to spread the traffic across the CE nodes. It is recommended to use consistent hashing ECMP configuration on the router to ensure an active session to a CE node is not rehashed in case another node fails. Figure: Static Routing Pros: VIP IP can be from any valid subnet. It is not restricted to the SLO or SLI subnet where it is advertised. Simple L3 routing configuration. Can scale with worker nodes with minimal route configuration change. All active nodes can receive the traffic. Cons: Needs routing configuration changes external to F5XC, every time a new LB VIP is created/deleted. Traffic will get blackholed when a CE node fails, until the node’s route is removed from the route configuration, or the node is restored. When To Use: When the NetOps team does not have access to routing devices with dynamic routing protocol capabilities like BGP. In use cases where the number of load balancers on the site is small and doesn’t change often, the operational overhead of configuring and managing the routes is less. Layer 3 Redundancy Using BGP Routing With ECMP BGP peering can be configured between F5XC CE and the router. This configuration requires LBs to be created with user-specified VIPs. The CE advertises equal cost, /32 routes to the VIP with the SLO/SLI as the next hop. The router uses Equal Cost Multi-Path (ECMP) to spread the traffic across the CE nodes. It is recommended to use sticky/persistent ECMP configuration on the router to ensure an active session to a CE node is not rehashed to a different node in case of a node failure. Note: Separate BGP peers must be configured for VIPs on SLO and SLI. Users can select the peer interface on CE while configuring the peers. For more information check BGP. Figure:BGP Routing Pros: VIP IP can be from any valid subnet. It is not restricted to the SLO or SLI subnet where it is advertised. Can automatically scale with worker nodes. Automatically revokes the route for failed CE node. Faster failover than any other method. All active nodes can receive the traffic. Cons: Needs advanced network configuration on the router. The router must support BGP. When To Use: The site has a large number of load balancers configured. Load balancers are frequently created and deleted. The application requires fast failover and minimal disruption in case of node failure. With network overlay technologies like Cisco ACI. Layer 2 redundancy using VRRP/GARP A user can enable VRRP on the CE site. This configuration requires LBs to be created with user-specified VIPs. Only control nodes participate in the VRRP redundancy group and one of them is elected as the leader for a VIP. VIPs will be randomly placed on different nodes. Only the leader node for the VIP sends out Gratuitous ARP (GARP) broadcasts for the VIP. If the leader node fails, a new leader is elected, and VIP is placed on it. Figure: VRRP/GARP Pros: No network configuration is required external to F5XC. Automatic failover of VIP when VRRP leader node fails. Cons: Only control nodes can receive the traffic. Only one node is actively receiving traffic at a given time. VIPs will be placed on different control nodes randomly. Equal distribution of VIPs across control nodes is not guaranteed. Failover can be slow depending on the ARP resolution time on the network. When To Use: The application team does not have access to routers, DNS servers, or load balancers on the network. (see other deployment options for details) The application does not require high throughput. Some traffic loss can be tolerated in case of node failure (e.g. for non-critical applications) Does not work in public cloud deployments as the cloud networking blocks GARP requests. External Proxy Load Balancing Network admin can configure an external load balancer (LB), with the CE SLO/SLI IP addresses in its origin pool, to spread the traffic across CE nodes. This can be a TCP or HTTP load balancer. For an external TCP LB, the client IP will be lost as the LB will SNAT the request before forwarding it to the CE nodes. F5XC does not support proxy protocol on the client side so it cannot be used to convey the client IP to the load balancer on the CE. For an external HTTP LB, the traffic will still get SNAT-ed, but the client IP can be persisted to the CE nodes if the external LB can add the X-Forwarded-For header to the request. If the LB on the CE site is a HTTPS LB or TCP LB with TLS enabled, the external LB will have to host the TLS certificate as it will terminate the client TLS sessions. A wild card certificate can be used to simplify this issue, but this may not always be a viable option for the applications. In case of public cloud deployments where L2 ARP and routing protocols may not work, in addition to TCP or HTTP load balancers, users also have the option to use Network Load Balancer on AWS and Google Cloud, Standard Load Balancer on Azure or similar feature on other public clouds, that does not SNAT the traffic but just forwards it to the CE nodes just like a router running ECMP. Note: For multi-node public cloud sites created using the F5XC console, it will automatically create the required cloud native LB. But we also support manual CE deployment in the cloud in which case the user will have to create the LB. Figure: External LB Pros: All active nodes can receive the traffic. Health probes can be configured to get F5XC LB health to avoid traffic blackholing. Can scale with worker nodes. Works for public cloud deployments. Cons: Managing certificates on external LB can be operationally challenging for TLS traffic. No source IP retention in the case of TCP LB Adds additional proxy hop. External LB can become a performance bottleneck even if CE is scaled out using worker nodes. When To Use: There is an existing load balancer (usually in DMZ) in the traffic path, but CE is used for additional services like WAAP, DDoS, etc. In public clouds where cloud LBs can be used to load balance to the nodes. DNS Load Balancing Network admins can use DNS to resolve application hostnames to the SLI/SLO IP addresses on the CE Nodes. The DNS can be configured to respond with one IP at a time in a round robin manner. Alternatively, a private DNS LB or Global Server Load Balancer (GSLB) can be used which can make load-based intelligent decisions to distribute the traffic more evenly. User-specified VIPs must not be used in this case as the hostname must resolve to the individual node’s SLI/SLO IP address for the traffic to get routed to the node. Figure: DNS LB Pros: Can be configured using existing DNS servers. Can scale with worker nodes. Works for public cloud deployments. (Not recommended as better options are available) This does not add a L4/L7 hop to the traffic path. Cons: Needs DNS configuration changes external to F5XC, every time a new LB VIP is created/deleted, or a new worker node is added. Traffic will get blackholed when a CE node fails, until the node’s IP is removed from the DNS configuration, or the node is restored. Intelligent distribution of traffic requires GSLB which can be expensive. Subject to DNS cache and TTL causing clients to resolve to a down CE node. When To Use: The application team only has access to GSLB or DNS server and does not want to limit the traffic to only one node at a time as in the case of the VRRP/GARP option High performance is not a requirement as multiple clients may resolve to the same node even if the site has multiple nodes Can be used in the public cloud if the user does not want to create an external LB. High Availability Using Multiple Single Node Sites Across Availability Zones Instead of deploying a single multi-node site, customers can opt to deploy two (or more) single-node sites and use them together (as individual sites or grouped into a Virtual Site) to advertise a VIP. This can be useful if the data center has two AZs so it’s more logical to deploy a CE on each AZ than deploying a three-node CE with one node in one AZ and two nodes in the other. By upgrading one site at a time, it is guaranteed that at least one site will always be online to serve the traffic, providing resiliency against upgrade failures. This is very useful in case of critical applications demanding zero downtime. All the deployment options above, other than the VRRP/GARP method, can be used in this case. It is recommended to use a consistent hash configuration for ECMP on the router to ensure all packets in a TCP session from a client are always routed to the same site. In this deployment, each CE has two tunnels to the nearest REs. Hence, this method is also beneficial when you want to publish an app to the internet using F5XC Regional Edges (REs) as you can scale throughput by adding CEs and hence more tunnels. Note: This is a big advantage of this topology over a multi-node site as the latter is limited to only two tunnels. For Public load balancers, the VIP is on the Regional Edges (REs) on the F5XC global network. The load balancing happens on the REs and the CEs provide secure connectivity with auto SNAT between REs and private origins. So, to get the most out of the available compute, the CEs in this case can be configured for Enhanced L3 performance mode as all the L7 processing happens on the RE. Figure: RE-CE tunnels for multiple single-node site deployment Conclusion This guide should help the reader learn about the various HA options available in F5XC to make an informed decision on which method to choose based on their requirements and the networking tools available. For a more detailed explanation of the above options with config examples also see:F5 Distributed Cloud – CE High Availability Options: A Comparative Exploration Related Articles F5XC Load Balancing and Distributed Proxy Concepts F5XC Virtual Network Concepts F5XC Site BGP Configurations on F5XC F5 Distributed Cloud - Customer Edge Site - Deployment & Routing Options F5 Distributed Cloud - Listener Logic
bhushanpai
Jun 26, 2024 Place Technical Articles
353Views
3likes
0Comments
F5 Distributed Cloud – CE High Availability Options: A Comparative Exploration
This article explores an alternative approach to achieve HA across single CE nodes, catering for use cases requiring higher performance and granular control over redundancy and failover management. Introduction F5 Distributed Cloud offers different techniques to achieve High Availability (HA) for Customer Edge (CE) nodes in an active-active configuration to provide redundancy, scaling on-demand and simplify management. By default, F5 Distributed Cloud uses a method for clustering CE nodes, in which CEs keep track of peers by sending heartbeats and facilitating traffic exchange among themselves. This method also handles the automatic transfer of traffic, virtual IPs, and services between CE peers —excellent for simplified deployment and running App Stack sites hosting Kubernetes workloads. However, if CE nodes are deployed mainly to manage L3/L7 traffic and application security, this default model might lack the flexibility needed for certain scenarios. Many of our customers tell us that achieving high availability is not so straightforward with the current clustering model. These customers often have a lot of experience in managing redundancy and high availability across traditional network devices. They like to manage everything themselves—from scheduling when to switch over to a redundant pair (planned failover), to choosing how many network paths (tunnels) to use between CEs to REs (Regional Edges) or other CEs. They also want to handle any issues device by device, decide the number of CE nodes in a redundancy group, and be able to direct traffic to different CEs when one is being updated. Their feedback inspired us to write this article, where we explore a different approach to achieve high availability across CEs. The default clustering model is explained in this document: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes Throughout this article, we will dive into several key areas: An overview of the default CE clustering model, highlighting its inherent challenges and advantages. Introduction to an alternative clustering strategy: Single Node Clustering, including: An analysis of its challenges and benefits. Identification of scenarios where this approach is most applicable. A guide to the configuration steps necessary to implement this model. An exploration of failover behavior within this framework. A comparison table showing how this new method differs from the default clustering method. By the end of this article, readers will gain an understanding of both clustering approaches, enabling informed decisions on the optimal strategy for their specific needs. Default CE Clustering Overview In a standard CE clustering setup, a cluster must have at least three Master nodes, with subsequent additions acting as Worker nodes. A CE cluster is configured as a "Site," centralizing operations like pool configuration and software upgrades to simplify management. In this clustering method, frequent communication is required between control plane components of the nodes on a low latency network. When a failover happens, the VIPs and services - including customer’s compute workloads - will transition to the other active nodes. As shown in the picture above, a CE cluster is treated as a single site, regardless of the number of nodes it contains. In a Mesh Group scenario, each mesh link is associated with one single tunnel connected to the cluster. These tunnels are distributed among the master nodes in the cluster, optimizing the total number of tunnels required for a large-scale Mesh Group. It also means that the site will be connected to REs only via 2 tunnels – one to each RE. Design Considerations for Default CE Clustering model: Best suited for: 1- App Stack Sites: Running Kubernetes workloads necessitates the default clustering method for container orchestration across nodes. 2- Large-scale Site-Mesh Groups (SMG) 3- Cluster-wide upgrade preference: Customers who favour managing nodes collectively will find cluster-wide upgrades more convenient, however without control over the upgrade sequence of individual nodes. Challenges: o Network Bottleneck for Ingress Traffic: A cluster connected to two Regional Edge (RE) sites via only 2 tunnels can lead to only two nodes processing external (ingress) traffic, limiting the use of additional nodes to process internal traffic only. o Three-master node requirement: Some customers are accustomed to dual-node HA models and may find the requirement for three master nodes resource-intensive. o Hitless upgrades: Controlled, phased upgrades are preferred by some customers for testing before widespread deployment, which is challenging with cluster-wide upgrades. o Cross-site deployments: High network latency between remote data centers can impact cluster performance due to the latency sensitivity of etcd daemon, the backbone of cluster state management. If the network connection across the nodes gets disconnected, all nodes will most likely stop the operation due to the quorum requirements of etcd. Therefore, F5 recommends deploying separate clusters for different physical sites. o Service Fault Sprawl and limited Node fault tolerance: Default clusters can sometimes experience a cascading effect where a fault in a node spreads throughout the cluster. Additionally, a standard 3-node cluster can generally only tolerate the failure of two nodes. If the cluster was originally configured with three nodes, functionality may be lost if reduced to a single active node. These limitations stem from the underlying clustering design and its dependency on etcd for maintaining cluster state. The Alternative Solution: HA Between Multiple Single Nodes The good news is that we can achieve the key objectives of the clustering – which are streamlined management and high availability - without the dependency on the control plane clustering mechanisms. Streamlined management using “Virtual Site”: F5 Distributed Cloud provides a mechanism called “Virtual Site” to perform operations on a group of sites (site = node or cluster of nodes), reducing the need to repeat the same set of operations for each site. The “Virtual Site” acts as an abstraction layer, grouping nodes tagged with a unique label and allows collectively addressing these nodes as a single entity. Configuration of origin pools and load balancers can reference Virtual Sites instead of individual sites/nodes, to facilitate cluster-like management for two or more nodes and enabling controlled day 2 operations. When a node is disassociated from Virtual Site by removing the label, it's no longer eligible for new connections, and its listeners are simultaneously deactivated. Upgrading nodes is streamlined: simply remove the node's label to exclude it from the Virtual Site, perform the upgrade, and then reapply the label once the node is operational again. This procedure offers you a controlled failover process, ensuring minimal disruption and enhanced manageability by minimizing the blast radius and limiting the cope of downtime. As traffic is rerouted to other CEs, if something goes wrong with an upgrade of a CE node, the services will not be impacted. HA/Redundancy across multiple nodes: Each single node in a Virtual Site connect to dual REs through IPSec or SSL/TLS tunnels, ensuring even load distribution and true active-active redundancy. External (Ingress) Traffic: In the Virtual Site model, the Regional Edges (REs) distribute external traffic evenly across all nodes. This contrasts with the default clustering approach where only two CE nodes are actively connected to the REs. The main Virtual Site advantage lies in its true active/active configuration for CEs, increasing the total ingress traffic capacity. If a node becomes unavailable, the REs will automatically reroute the new connections to another operational node within the Virtual Site, and the services (connection to origin pools) remain uninterrupted. Internal (East-West) Traffic: For managing internal traffic within a single CE node in a Virtual Site (for example, when LB objects are configured to be advertised within the local site), all network techniques applicable to the default clustering model can be employed in this model as well, except for the Layer 2 attachment (VRRP) method. Preferred load distribution method for internal traffic across CEs: Our preferred methods for load balancing across CE nodes are either DNS based load balancing or Equal-Cost Multi-Path (ECMP) routing utilizing BGP for redundancy. DNS Load Balancer Behavior: If a node is detached from a Virtual Site, its associated listeners and Virtual IPs (VIPs) are automatically withdrawn. Consequently, the DNS load balancer's health checks will mark those VIPs as down and prevent them from receiving internal network traffic. Current limitation for custom VIP and BGP: When using BGP, please note a current limitation that prevents configuring a custom VIP address on the Virtual Site. As a workaround, custom VIPs should be advertised on individual sites instead. The F5 product team is actively working to address this gap. For a detailed exploration of traffic routing options to CEs, please refer to the following article here: https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435 Design Considerations for Single Node HA Model: Best suited for: 1- Customers with high throughput requirement: This clustering model ensures that all Customer Edge (CE) nodes are engaged in managing ingress traffic from Regional Edges (REs), which allows for scalable expansion by adding additional CEs as required. In contrast, the default clustering model limits ingress traffic processing to only two CE nodes per cluster, and more precisely, to a single node from each RE, regardless of the number of worker nodes in the cluster. Consequently, this model is more advantageous for customers who have high throughput demands. 2- Customers who prefer to use controlled failover and software upgrades This clustering model enables a sequential upgrade process, where nodes are updated individually to ensure each node upgrades successfully before moving on to the other nodes. The process involves detaching the node from the cluster by removing its site label, which causes redirecting traffic to the remaining nodes during the upgrade. Once upgraded, the label is reapplied, and this process is repeated for each node in turn. This is a model that customers have known for 20+ years for upgrade procedures, with a little wrinkle with the label. 3- Customers who prefer to distribute the load across remote sites Nodes are deployed independently and do not require inter-node heartbeat communication, unlike the default clustering method. This independence allows for their deployment across various data centers and availability zones while being managed as a single entity. They are compatible with both Layer 2 (L2) spanned and Layer 3 (L3) spanned data centers, where nodes in different L3 networks utilize distinct gateways. As long as the nodes can access the origin pools, they can be integrated into the same "Virtual Site". This flexibility caters to customers' traditional preferences, such as deploying two CE nodes per location, which is fully supported by this clustering model. Challenges: Lack of VRRP Support: The primary limitation of this clustering method is the absence of VRRP support for internal VIPs. However, there are some alternative methods to distribute internal traffic across CE nodes. These include DNS based routing, BGP with Equal-Cost Multi-Path (ECMP) routing, or the implementation of CEs behind another Layer 4 (L4) load balancer capable of traffic distribution without source address alteration, such as F5 BIG-IPs or the standard load balancers provided by Azure or AWS. Limitation on Custom VIP IP Support: Currently, the F5 Distributed Cloud Console has a restriction preventing the configuration of custom virtual IPs for load balancer advertisements on Virtual Sites. We anticipate this limitation will be addressed in future updates to the F5 Distributed Cloud platform. As a temporary solution, you can advertise the LB across multiple individual sites within the Virtual Site. This approach enables the configuration of custom VIPs on those sites. Requires extra steps for upgrading nodes Unlike the Default clustering model where upgrades can be performed collectively on a group of nodes, this clustering model requires upgrading nodes on an individual basis. This may introduce more steps, especially in larger clusters, but it remains significantly simpler than traditional network device upgrades. Large-Scale Mesh Group: In F5 Distributed Cloud, the "Mesh Group" feature allows for direct connections between sites (whether individual CE sites or clusters of CEs) and other selected sites through IPSec tunnels. For CE clusters, tunnels are established on a per-cluster basis. However, for single-node sites, each node creates its own tunnels to connect with remote CEs. This setup can lead to an increased number of tunnels needed to establish the mesh. For example, in a network of 10 sites configured with dual-CE Virtual Sites, each CE is required to establish 18 IPSec tunnels to connect with other sites, or 19 for a full mesh configuration. Comparatively, a 10-site network using the default clustering method—with a minimum of 3 CEs per site—would only need up to 9 tunnels from each CE for full connectivity. Opting for Virtual Sites with dual CEs, a common choice, effectively doubles the number of required tunnels from each CE when compared to the default clustering setup. However, despite this increase in tunnels, opting for a Mesh configuration with single-node clusters can offer advantages in terms of performance and load distribution. Note: Use DC Groups as an alternative solution to Secure Mesh Group for CE connectivity: For customers with existing private connectivity between their CE nodes, running Site Mesh Group (SMG) with numerous IPsec tunnels can be less optimal. As a more scalable alternative for these customers, we recommend using DC Cluster Group (DCG). This method utilizes IP-in-IP tunnels over the existing private network, eliminating the need for individual encrypted IPsec tunnels between each node and streamlining communication between CE nodes via IP-n-IP encapsulations. Configuration Steps The configuration for creating single node clusters involves the following steps: Creating a Label Creating a Virtual Site Applying the label to the CE nodes (sites) Review and validate the configuration The detailed configuration guide for the above steps can be found here: https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site Example Configuration: In this example, you can create a label called "my-vsite" to group CE nodes that belong to the same Virtual Site. Within this label, you can then define different values to represent different environments or clusters, such as specific Azure region or an on-premise data center. Then a Virtual Site of “CE” type can be created to represent the CE cluster in “Azure-AustraliaEast-vSite" and tied to any CE that is tagged with the label “my-vsite=Azure-AustraliaEast-vSite”: Now, any CE node that should join the cluster (Virtual Site), should get this label: Verification: To confirm the Virtual Site configuration is functioning as intended, we joined two CEs (k1-azure-ce2 and k1-azure-ce03) into the Virtual Site and evaluated the routing and load balancing behavior. Test 1: Public Load Balancer (Virtual Site referenced in the pool) The diagram shows a public "Load Balancer" advertised on the RE referencing a pool that uses the newly created Virtual Site to access the private application: As shown below, the pool member was configured to be accessed through the Virtual Site: Analysis of the request logs in the Performance dashboard confirmed that all requests to the public website were evenly distributed across both CEs. Test 2: Internal Load Balancer (LB advertised on the Virtual Site) We deployed an internal Load Balancer and advertised it on the newly created Virtual Site, utilizing the pool that also references the same Virtual Site (k1-azure-ce2 and k1-azure-ce03). As shown below, the Load Balancer was configured to be advertised on the Virtual Site. Note: Here we couldn't use a "shared" custom VIP across the Virtual Site due to a current platform constraint. If a custom VIP is required, we should use "site" as opposed to "Virtual Site" and advertise the Load Balancer on all sites, like below picture: Request logs revealed that when traffic reached either CE node within the Virtual Site, the request was processed and forwarded locally to the pool member. In the example below: src_site: Indicates the CE (k1-azure-ce2) that processed the request. src_ip: Represents the client's source IP address (192.168.1.68). dst_site: Indicates the CE (k1-azure-ce2) from which the pool member is accessed. dst_ip: Represents the IP address of the pool member (192.168.1.6). Resilience Testing: To assess the Virtual Site's resilience, we intentionally blocked network access from k1-azure-ce2 CE to the pool member (192.168.1.6). The CE automatically rerouted traffic to the pool member via the other CE (k1-azure-ce03) in the Virtual Site. Note:By default, CEs can communicate with each other via the F5 Global Network. This can be customized to use direct connectivity through tunnels if the CEs are members of the same DC Cluster Group (IP-n-IP tunneling) or Secure Mesh Groups (IPSec tunneling). The following picture shows the traffic flow via F5 Global Network. The following picture shows the traffic flow via the IP-n-IP tunnel when a DC Clustering Group (DCG) is configured across the CE nodes. Failover Behaviour When a CE node is tied to a Virtual Site, all internal Load Balancers (VIPs) advertised on that Virtual Site will be deployed in the CE. Additionally, the Regional Edge (RE) begins to use this node as one of the potential next hops for connections to the origin pool. Should the CE become unavailable, or if it lacks the necessary network access to the origin server, the RE will almost seamlessly reroute connections through the other operational CEs in the Virtual Site. Uncontrolled Failover: During instances of uncontrolled failover, such as when a node is unexpectedly shut down from the hypervisor, we have observed a handful of new connections experiencing timeouts. However, these issues were resolved by implementing health checks within the origin pool, which prevented any subsequent connection drops. Note: Irrespective of the clustering model in use, it's always recommended to configure health checks for the origin pool. This practice enhances failover responsiveness and mitigates any additional latency incurred during traffic rerouting. Controlled Failover: The moment a CE node is disassociated from the Virtual Site — by the removal of its label— the CE node will not be used by RE to connect to origin pools anymore. At the same time, all Load Balancer listeners associated with that Virtual Site are withdrawn from the node. This effectively halts traffic processing for those applications, preventing the node from receiving related traffic. During controlled failover scenarios, we have observed seamless service continuity on externally advertised services (to REs). On-Demand Scaling: F5 Distributed Cloud provides a flexible solution that enables customers to scale the number of active CE nodes according to demand. This allows you to easily add more powerful CE nodes during peak periods (such as promotional events) and then remove them when demand subsides. With the Virtual Sites method, you can even mix and match node sizes within your cluster (Virtual Site), providing granular control over resources. It's advisable to monitor CE node performance and implement node related alerts. These alerts notify you when nodes are operating at high capacity, allowing for timely addition of extra nodes as needed. Moreover, you can monitor node’s health in the dashboard. CPU, Memory and Disk utilizations of nodes can be a good factor in determining if more nodes are needed or not. Furthermore, the use of Virtual Sites makes managing this process even easier, thanks to labels. Node Based Alerts: Node-based alerts are essential for maintaining efficient CE operations. Accessing the alerts in the Console: To view alerts, go to Multi-Cloud Network Connect > Notifications > Alerts. Here, you can see both "Active Alerts" and "All Alerts." Alerts related to node health fall under the "infrastructure" alert group. The following screenshot shows alerts indicating high loads on the nodes. Configuring Alert Policies: Alert policies determine the notification process for raised alerts. To set up an alert policy, navigate to Multi-Cloud Network Connect > Alerts Management > Alert Policies. An alert policy consists of two main elements: the alert receiver configuration and the policy rules. Configuring Alert Receiver: The configuration allows for integration with platforms like Slack and PagerDuty, among others, facilitating notifications through commonly used channels. Configuring Alert Rules: For alert selection, we recommend configuring notifications for alerts with severity of “Major” or “Critical” at a minimum. Alternatively, the “infrastructure” group which includes node-based alerts can be selected. Comparison Table Criteria Default Cluster Single Node HA Minimum number of nodes in HA 3 2 Upgrade operations Per cluster Per Node Network redundancy and client side routing for east-west traffic VRRP, BGP, DNS, L4/7 LB DNS, L4/7 LB, BGP* Tunnels to RE 2 tunnels per cluster 2 tunnels per node Tunnels to other CEs (SMG or DCG) 1 tunnel from each cluster 1 tunnel from each node External traffic processing Limited to 2 nodes All nodes will be active Internal traffic processing All nodes can be active All nodes can be active Scale management in Public Cloud Sites Straightforward, by configuring ingress interfaces in Azure/AWS/GCP sites Straightforward, by adding or removing the labels Scale management in Secure Mesh Sites Requires reconfiguring the cluster (secure mesh site) - may cause interruption Straightforward, by adding or removing the labels Custom VIP IP Available Not Available (Planned to be available in future releases), workaround available. Node sizes All nodes should be same size. Upgrading node size in a cluster is a disruptive operation. Any node sizes or clusters can join the Virtual Site * When using BGP, please note a current limitation that prevents configuring custom VIP address on the Virtual Site. Conclusion: F5 Distributed Cloud offers a flexible approach to High Availability (HA) across CE nodes, allowing customers to select the redundancy model that best fits their specific use cases and requirements. While we continue to advocate for default clustering approach due to their operational simplicity and shared VRRP VIP or, unified network configuration benefits, especially for routine tasks like upgrades, the Virtual Site and single node HA model presents some great use cases. It not only addresses the limitations and challenges of the default clustering model, but also introduces a solution that is both scalable and adaptable. While Virtual Sites offer their own benefits, we recognize they also present trade-offs. The overall benefits, particularly for scenarios demanding high ingress (RE to CE) throughput and controlled failover capabilities cater to specific customer demands. The F5 product and development team remains committed to addressing the limitations of both default clustering and Virtual Sites discussed throughout this article. Their focus is on continuous improvement and finding the solutions that best serve our customers' needs. References and Additional Links: Default Clustering model: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes Configuration guide for Virtual Sites:https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site Routing Options for CEs:https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435 Configuration guide for DC Clustering Group:https://docs.cloud.f5.com/docs/how-to/advanced-networking/configure-dc-cluster-group
Kayvan
Jun 25, 2024 Place Technical Articles
1KViews
4likes
0Comments
Securing the LLM User Experience with an AI Firewall
As artificial intelligence (AI) seeps into the core day-to-day operations of enterprises, a need exists to exert control over the intersection point of AI-infused applications and the actual large language models (LLMs) that answer the generated prompts. This control point should serve to impose security rules to automatically prevent issues such as personally identifiable information (PII) inadvertently exposed to LLMs. The solution must also counteract motivated, intentional misuse such as jailbreak attempts, where the LLM can be manipulated to provide often ridiculous answers with the ensuing screenshotting attempting to discredit the service. Beyond the security aspect and the overwhelming concern of regulated industries, other drivers include basic fiscal prudence 101, ensuring the token consumption of each offered LLM model is not out of hand. This entire discussion around observability and policy enforcement for LLM consumption has given rise to a class of solutions most frequently referred to as AI Firewalls or AI Gateways (AI GW). An AI FW might be leveraged by a browser plugin, or perhaps applying a software development kit (SDK) during the coding process for AI applications. Arguably, the most scalable and most easily deployed approach to inserting AI FW functionality into live traffic to LLMs is to use a reverse proxy. A modern approach includes the F5 Distributed Cloud service, coupled with an AI FW/GW service, cloud-based or self-hosted, that can inspect traffic intended for LLMs like those of OpenAI, Azure OpenAI, or privately operated LLMs like those downloaded from Hugging Face. A key value offered by this topology, a reverse proxy handing off LLM traffic to an AI FW, which in turn can allow traffic to reach target LLMs, stems from the fact that traffic is seen, and thus controllable, in both directions. Should an issue be present in a user’s submitted prompt, also known as an “inference”, it can be flagged: PII (Personally Identifiable Information) leakage is a frequent concern at this point. In addition, any LLM responses to prompts are also seen in the reverse path: consider a corrupted LLM providing toxicity in its generated replies. Not good. To achieve a highly performant reverse proxy approach to secured LLM access, a solution that can span a global set of users, F5 worked with Prompt Security to deploy an end-to-end AI security layer. This article will explore the efficacy and performance of the live solution. Impose LLM Guardrails with the AI Firewall and Distributed Cloud An AI firewall such as the Prompt Security offering can get in-line with AI LLM flows through multiple means. API calls from Curl or Postman can be modified to transmit to Prompt Security when trying to reach targets such as OpenAI or Azure OpenAI Service. Simple firewall rules can prevent employee direct access to these well-known API endpoints, thus making the Prompt Security route the sanctioned method of engaging with LLMs. A number of other methods could be considered but have concerns. Browser plug-ins have the advantage of working outside the encryption of the TLS layer, in a manner similar to how users can use a browser’s developer tools to clearly see targets and HTTP headers of HTTPS transactions encrypted on the wire. Prompt Security supports plugins. A downside, however, of browser plug-ins is the manageability issue, how to enforce and maintain across-the-board usage, simply consider the headache non-corporate assets used in the work environment. Another approach, interesting for non-browser, thick applications on desktops, think of an IDE like VSCode, might be an agent approach, whereby outbound traffic is handled by an on-board local proxy. Again, Prompt can fit in this model however the complexity of enforcement of the agent, like the browser approach, may not always be easy and aligned with complete A-to-Z security of all endpoints. One of the simplest approaches is to ingest LLM traffic through a network-centric approach. An F5 Distributed Cloud HTTPS load balancer, for instance, can ingest LLM-bound traffic, and thoroughly secure the traffic at the API layer, things like WAF policy and DDoS mitigations, as examples. HTTP-based control plane security is the focus here, as opposed to the encapsulated requests a user is sending to an LLM. The HTTPS load balancer can in turn hand off traffic intended for the likes of OpenAI to the AI gateway for prompt-aware inspections. F5 Distributed Cloud (XC) is a good architectural fit for inserting a third-party AI firewall service in-line with an organization’s inferencing requests. Simply project a FQDN for the consumption of AI services; in this article we used the domain name “llmsec.busdevF5.net” into the global DNS, advertising one single IP address mapping to the name. This DNS advertisement can be done with XC. The IP address, through BGP-4 support for anycast, will direct any traffic to this address to the closest of 27 international points of presence of the XC global fabric. Traffic from a user in Asia may be attracted to Singapore or Mumbai F5 sites, whereas a user in Western Europe might enter the F5 network in Paris or Frankfurt. As depicted, a distributed HTTPS load balancer can be configured – “distributed” reflects the fact traffic ingressing in any of the global sites can be intercepted by the load balancer. Normally, the server name indicator (SNI) value in the TLS Client Hello can be easily used to pick the correct load balancer to process this traffic. The first step in AI security is traditional reverse proxy core security features, all imposed by the XC load balancer. These features, to name just a few, might include geo-IP service policies to preclude traffic from regions, automatic malicious user detection, and API rate limiting; there are many capabilities bundled together. Clean traffic can then be selected for forwarding to an origin pool member, which is the standard operation of any load balancer. In this case, the Prompt Security service is the exclusive member of our origin pool. For this article, it is a cloud instantiated service - options exist to forward to Prompt implemented on a Kubernetes cluster or running on a Distributed Cloud AppStack Customer Edge (CE) node. Block Sensitive Data with Prompt Security In-Line AI inferences, upon reaching Prompt’s security service, are subjected to a wide breadth of security inspections. Some of the more important categories would include: Sensitive data leakage, although potentially contained in LLM responses, intuitively the larger proportion of risk is within the requesting prompt, with user perhaps inadvertently disclosing data which should not reach an LLM Source code fragments within submissions to LLMs, various programming languages may be scanned for and blocked, and the code may be enterprise intellectual property OWASP LLM top 10 high risk violations, such as LLM jailbreaking where the intent is to make the LLM behave and generate content that is not aligned with the service intentions; the goal may be embarrassing “screenshots”, such as having a chatbot for automobile vendor A actually recommend a vehicle from vendor B OWASP Prompt Injection detection, considered one of the most dangerous threats as the intention is for rogue users to exfiltrate valuable data from sources the LLM may have privileged access to, such as backend databases Token layer attacks, such as unauthorized and excessive use of tokens for LLM tasks, the so-called “Denial of Wallet” threat Content moderation, ensuring a safe interaction with LLMs devoid of toxicity, racial and gender discriminatory language and overall curated AI experience aligned with those productivity gains that LLMs promise To demonstrate sensitive data leakage protection, a Prompt Security policy was active which blocked LLM requests with, among many PII fields, a mailing address exposed. To reach OpenAI GPT3.5-Turbo, one of the most popular and cost-effective models in the OpenAI model lineup, prompts were sent to an F5 XC HTTPS load balancer at address llmsec.busdevf5.net. Traffic not violating the comprehensive F5 WAF security rules were proxied to the Prompt Security SaaS offering. The prompt below clearly involves a mailing address in the data portion. The ensuing prompt is intercepted by both the F5 and Prompt Security solutions. The first interception, the distributed HTTPS load balancer offered by F5 offers rich details on the transaction, and since no WAF rules or other security policies are violated, the transaction is forwarded to Prompt Security. The following demonstrates some of the interesting details surrounding the transaction, when completed (double-click to enlarge). As highlighted, the transaction was successful at the HTTP layer, producing a 200 Okay outcome. The traffic originated in the municipality of Ashton, in Canada, and was received into Distributed Cloud in F5’s Toronto (tr2-tor) RE site. The full details around the targeted URL path, such as the OpenAI /v1/chat/completions target and the user-agent involved, vscode-restclient, are both provided. Although the HTTP transaction was successful, the actual AI prompt was rejected, as hoped for, by Prompt Security. Drilling into the Activity Monitor in the Prompt UI, one can get a detailed verdict on the transaction (double-click). Following the yellow highlights above, the prompt was blocked, and the violation is “Sensitive Data”. The specific offending content, the New York City street address, is flagged as a precluded entity type of “mailing address”. Other fields that might be potentially blocking candidates with Prompt’s solution include various international passports or driver’s license formats, credit card numbers, emails, and IP addresses, to name but a few. A nice, time saving feature offered by the Prompt Security user interface is to simply choose an individual security framework of interest, such as GDPR or PCI, and the solution will automatically invoke related sensitive data types to detect. An important idea to grasp: The solution from Prompt is much more nuanced and advanced than simple REGEX; it invokes the power of AI itself to secure customer journeys into safe AI usage. Machine learning models, often transformer-based, have been fine-tuned and orchestrated to interpret the overall tone and tenor of prompts, gaining a real semantic understanding of what is being conveyed in the prompt to counteract simple obfuscation attempts. For instance, using printed numbers, such as one, two, three to circumvent Regex rules predicated on numerals being present - this will not succeed. This AI infused ability to interpret context and intent allows for preset industry guidelines for safe LLM enforcement. For instance, simply indicating the business sector is financial will allow the Prompt Security solution to pass judgement, and block if desired, financial reports, investment strategy documents and revenue audits, to name just a few. Similar awareness for sectors such as healthcare or insurance is simply a pull-down menu item away with the policy builder. Source Code Detection A common use case for LLM security solutions is identification and, potentially, blocking submissions of enterprise source code to LLM services. In this scenario, this small snippet of Python is delivered to the Prompt service: def trial(): return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500 sum(trial() for i in range(10_000)) / 10_000 A policy is in place for Python and JavaScript detection and was invoked as hoped for. curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3BlbkFJ5RFOI***********' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "def trial():\n return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500\n\nsum(trial() for i in range(10_000)) / 10_000"}]}' Content Moderation for Interactions with LLMs One common manner of preventing LLM responses from veering into undesirable territory is for the service provider to implement a detailed system prompt, a set of guidelines that the LLM should be governed by when responding to user prompts. For instance, the system prompt might instruct the LLM to serve as polite, helpful and succinct assistant for customers purchasing shoes in an online e-commerce portal. A request for help involving the trafficking of narcotics should, intuitively, be denied. Defense in depth has traditionally meant no single point of failure. In the above scenario, screening both the user prompt and ensuring LLM response for a wide range of topics leads to a more ironclad security outcome. The following demonstrates some of the topics Prompt Security can intelligently seek out; in this simple example, the topic of “News & Politics” has been singled out to block as a demonstration. Testing can be performed with this easy Curl command, asking for a prediction on a possible election result in Canadian politics: curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3Blbk*************' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "Who will win the upcoming Canadian federal election expected in 2025"}],"max_tokens": 250,"temperature": 0.7}' The response, available in the Prompt Security console, is also presented to the user. In this case, a Curl user leveraging the VSCode IDE. The response has been largely truncated for brevity, fields that are of interest is an HTTP “X-header” indicating the transaction utilized the F5 site in Toronto, and the number of tokens consumed in the request and response are also included. Advanced LLM Security Features Many of the AI security concerns are given prominence by the OWASP Top Ten for LLMs, an evolving and curated list of potential concerns around LLM usage from subject matter experts. Among these are prompt injection attacks and malicious instructions often perceived as benign by the LLM. Prompt Security uses a layered approach to thwart prompt injection. For instance, during the uptick in interest in ChatGPT, DAN (Do Anything Now) prompt injection was widespread and a very disruptive force, as discussed here. User prompts will be closely analyzed for the presence of the various DAN templates that have evolved over the past 18 months. More significantly, the use of AI itself allows the Prompt solution to recognize zero-day bespoke prompts attempting to conduct mischief. The interpretative powers of fine-tuned, purpose-built security inspection models are likely the only way to stay one step ahead of bad actors. Another chief concern is protection of the system prompt, the guidelines that reel in unwanted behavior of the offered LLM service, what instructed our LLM earlier in its role as a shoe sales assistant. The system prompt, if somehow manipulated, would be a significant breach in AI security, havoc could be created with an LLM directed astray. As such, Prompt Security offers a policy to compare the user provided prompt, the configured system prompt in the API call, and the response generated by the LLM. In the event that a similarity threshold with the system prompt is exceeded in the other fields, the transaction can be immediately blocked. An interesting advanced safeguard is the support for a “canary” word - a specific value that a well behaved LLM should never present in any response, ever. The detection of the canary word by the Prompt solution will raise an immediate alert. One particularly broad and powerful feature in the AI firewall is the ability to find secrets, meaning tokens or passwords, frequently for cloud-hosted services, that are revealed within user prompts. Prompt Security offers the ability to scour LLM traffic for in excess of 200 meaningful values. Just as a small representative sample of the industry’s breadth of secrets, these can all be detected and acted upon: Azure Storage Keys Detector Artifactory Detector Databricks API tokens GitLab credentials NYTimes Access Tokens Atlassian API Tokens Besides simple blocking, a useful redaction option can be chosen. Rather than risk compromise of credentials and obfuscated value will instead be seen at the LLM. F5 Positive Security Models for AI Endpoints The AI traffic delivered and received from Prompt Security’s AI firewall is both discovered and subjected to API layer policies by the F5 load balancer. Consider the token awareness features of the AI firewall, excessive token consumption can trigger an alert and even transaction blocking. This behavior, a boon when LLMs like the OpenAI premium GPT-4 models may have substantial costs, allows organizations to automatically shut down a malicious actor who illegitimately got hold of an OPENAI_API key value and bombarded the LLM with prompts. This is often referred to as a “Denial of Wallet” situation. F5 Distributed Cloud, with its focus upon the API layer, has congruent safeguards. Each unique user of an API service is tracked to monitor transactional consumption. By setting safeguards for API rate limiting, an excessive load placed upon the API endpoint will result in HTTP 429 “Too Many Request” in response to abusive behavior. A key feature of F5 API Security is the fact that it is actionable in both directions, and also an in-line offering, unlike some API solutions which reside out of band and consume proxy logs for reporting and threat detection. With the automatic discovery of API endpoints, as seen in the following screenshot, the F5 administrator can see the full URL path which in this case exercises the familiar OpenAI /v1/chat/completions endpoint. As highlighted by the arrow, the schema of traffic to API endpoints is fully downloadable as an OpenAPI Specification (OAS), formerly known as a Swagger file. This layer of security means fields in API headers and bodies can be validated for syntax, such that a field whose schema expects a floating-point number can see any different encoding, such as a string, blocked in real-time in either direction. A possible and valuable use case: allow an initial unfettered access to a service such as OpenAI, by means of Prompt Security’s AI firewall service, for a matter of perhaps 48 hours. After a baseline of API endpoints has been observed, the API definition can be loaded from any saved Swagger files at the end of this “observation” period. The loaded version can be fully pruned of undesirable or disallowed endpoints, all future traffic must conform or be dropped. This is an example of a “positive security model”, considered a gold standard by many risk-adverse organizations. Simply put, a positive security model allows what has been agreed upon through and rejects everything else. This ability to learn and review your own traffic, and then only present Prompt Security with LLM endpoints that an organization wants exposed is an interesting example of complementing an AI security solution with rich API layer features. Summary The world of AI and LLMs is rapidly seeing investment, in time and money, from virtually all economic sectors; the promise of rapid dividends in the knowledge economy is hard to resist. As with any rapid deployment of new technology, safe consumption is not guaranteed, and it is not built in. Although LLMs often suggest guardrails are baked into offerings, a 30-second search of the Internet will expose firsthand experiences where unexpected outcomes when invoking AI are real. Brand reputation is at stake and false information can be hallucinated or coerced out of LLMs by determined parties. By combining the ability to ingest globally at high-speed dispersed users and apply a first level of security protections, F5 Distributed Cloud can be leveraged as an onboarding for LLM workloads. As depicted in this article, Prompt Security can in turn handle traffic egressing F5’s distributed HTTPS load balancers and provide state-of-the-art AI safeguards, including sensitive data detection, content moderation and other OWASP-aligned mechanisms like jailbreak and prompt injection mitigation. Other deployment models exist, including deploying Prompt Security’s solution on-premises, self-hosted in cloud tenants, and running the solution on Distributed Cloud CE nodes themselves is supported.
Steve_Gorman
Jun 24, 2024 Place Technical Articles
623Views
2likes
0Comments
The Power of &: F5 Hybrid DNS solution
While some organizations prioritize the advantages of a SaaS solution like scalability, others value the benefits of an on-premises solution, such as data control and migration flexibility. This is why having the option to deploy a hybrid model can be beneficial, not just for redundancy, but also for allowing organizations to blend the best of both worlds. Understanding theArchitecture’scomponents F5 BIG-IP DNS - (formerly BIG-IP GTM) is a well-known on-premise solution for delivering high-performance DNS services such as DNSExpress and DNS Caching. It is also recognized for offering intelligent DNS responses that are based on various factors such as LDNS’ Geolocation (GSLB) and health status of applications. F5 Distributed Cloud DNS (F5 XC DNS) – It is F5’s SaaS-based DNS solution which is built on a global data plane, ensuring automatic scalability to meet high-volume demand. Additionally, it also provides GSLB and security such as DNS DoS protection. On the diagram above, BIG-IP DNS will be the hidden primary DNS, acting as the source of truth for DNS records. This setup ensures centralized control and adds an extra layer of security by reducing exposure to potential attacks. F5XC DNS will function as the secondary DNS server, receiving DNS records from BIG-IP via Zone Transfer. It will be responsible for handling public DNS queries and providing domain name resolution services to clients. In the first part of this article, we will show you how to set up and configure BIG-IP DNS as the hidden primary and F5XC DNS as the authoritative secondary DNS server. For some, this setup is sufficient for their requirements, but for others, there may be additional requirements to consider in this hybrid design. In the later part of this article, we will demonstrate how we can address these challenges by leveraging F5's platform features and capabilities! Steps on Implementing F5 Hybrid DNS Solution Step 1: Configure BIG-IP DNS First, we need to configure BIG-IP DNS to be able to perform a zone transfer to F5XC DNS. For more details on the configuration, you can check this link: https://community.f5.com/kb/technicalarticles/configuring-big-ip-for-zone-transfer-and-dnssec/330359 Step 2: Configure F5XC DNS Now after you've configured BIG-IP DNS, we need to configure F5 XC DNS to be a secondary DNS server. For more details, check the steps below: Log into XC Console, selectDNS Managementoption, clickAdd Zone. In Domain Name field, enter the domain/subdomain. In our example, it will bef5sg.com Zone Type:Secondary DNS Configuration Under the Secondary DNS Configuration field, clickConfigure On the DNS primary server IP field, enter the public IP address of the Primary DNS. In our example it will be the Public IP of BIG-IP DNS. On TSIG key, enter the name we used to generate TSIG earlier in BIG-IP. In our example, usedexample. On the TSIG Key algorithm field and select an algorithm from the drop-down. Selecthmac-sha256. Click Configure in the TSIG key value in base 64 format section, On the Secret Type field, selectClear Secretand paste the secret in the Secret field. Use the same secret we generated earlier in BIG-IP DNS. ClickApply. You should see the DNS records transferred from BIG-IP DNS to F5XC DNS Step 3: Configure Domain Registrar In this example, the domain registrar I'm using is Namecheap. I'll configure it so that the authoritative name server for the domain f5sg.com is set to F5XC (ns1.f5clouddns.com and ns2.f5clouddns.com). The steps will vary depending on which domain registrar you are using. Refer to the documentation of your registrar. See the screenshots below for how I configured it in Namecheap. F5 XC DNS should now be able to answer DNS queries since it is set to be the authoritative DNS. Now, let's do some testing! On my local machine, I will perform a dig on the f5sg.com domain. See below: You can see that on the dig result, the NS for f5sg.com is set to ns1.f5clouddns.com & ns2.f5clouddns.com! I can also resolve sales.f5sg.com! We have successfully implemented BIG-IP as Hidden Primary and F5XC as Authoritative Secondary DNS! Challenges and Considerations Now let's discuss the additional requirements or challenges that we might encounter with this hybrid setup solution: Security: We need to comply with security compliance. Nowadays, there are laws requiring the implementation of DNSSEC (DNS Security Extensions). We need to consider this in the design and implement it without adding complexity. Resiliency: Although F5XC DNS Infrastructure is built to be resilient, we still want a backup plan to failover to the BIG-IP Primary DNS in case of unforeseen events. This process will be manual, as we need to change the NS records at the registrar to promote the hidden BIG-IP Primary DNS as the authoritative NS for the domain once F5XC is unavailable. Synchronization: BIG-IP will not be able to synchronize the GSLB functionality with F5XC because Wide-IP records are non-standard and cannot be transferred as part of zone transfers. Solution to Challenges Now comes the fun part: tackling the challenges we’ve laid out! Fortunately, F5 Distributed Cloud is an API-first platform that enables us to automate configuration. At the same time, we have the power of the BIG-IP platform, where you can run custom scripts that will enable us to integrate it with F5XC through API. Solution to Challenge #1: This is easy. DNSSEC records like RRSIG, DNSKEY, DS, NSEC, and NSEC3 are standardized and can be synchronized as part of a zone transfer. Since BIG-IP DNS is our primary DNS and supports DNSSEC, we can enable it. The records will synchronize to F5XC DNS and still respond with signed records, maintaining the integrity and security of our DNS infrastructure. How do you enable it? Check the last part of the technical article below: https://community.f5.com/kb/technicalarticles/configuring-big-ip-for-zone-transfer-and-dnssec/330359 Solution to Challenge #2: We need to automate failover! But when automating tasks, you need two things: a trigger and an action. In our scenario, our trigger should be the availability of F5XC DNS to resolve DNS queries, and the action should be to change our nameserver to BIG-IP on the domain registrar. If you can create and run a script in BIG-IP, it means you can continuously monitor the health of F5XC DNS, allowing us to determine the trigger. But what about the action to change the domain name server records in the registrar? It's easy—check if it can be configured via API, then the problem is solved! Let's explore using Namecheap as our registrar for this example. We will use the BIG-IP EAV (external) monitor to run the script. If you're unfamiliar with the BIG-IP external monitor and its capabilities, check this out →https://my.f5.com/manage/s/article/K71282813 A dummy pool configured with an external monitor will run at intervals. The attached script is designed to monitor F5XC and check if it can resolve DNS queries. If it cannot, the script will trigger an API call to Namecheap (our domain registrar) to change the nameservers back to BIG-IP DNS. Simultaneously, the script will update the domain's NS records from F5XC to BIG-IP. Step 1: Create an external monitor using the custom script. Refer to article K71282813 how to create the external monitor. See the codeshare link for the sample custom script I used: Namecheap and BIG-IP Integration via API | DevCentral Step 2: Create a dummy pool and attach the custom external monitor Let's do some tests! See the results in the later part of this article. Solution to Challenge #3: We can't use Zone Transfer to synchronize GSLB configurations? No problem! Instead, we'll harness the power of APIs. We can run a custom script in BIG-IP to convert Wide-IP configurations into F5XC DNSLB records via API. Let's see below how we can do this. On BIG-IP DNS, configure the zone records for the domainf5sg.comto delegate the subdomains needed for GSLB. For example, we need to perform GSLB forwww.f5sg.com,we will configure the zone like below: www.f5sg.comCNAMEwww.gslb.f5sg.com gslb.f5sg.com NS ns1.f5clouddns.com On BIG-IP we will create Wide-IP configuration forwww.gslb.f5sg.comwhich should hold the A records. These Wide-IP configurations can be converted by a script to F5XC DNSLB configurations. Check the sample script on this codeshare link: BIG-IP Wide-IP to F5XC DNSLB converter | DevCentral Testing and Result Challenge #2:Failover Testing To simulate the scenario in which F5XC is unable to respond to DNS queries, we designed the script to execute a dig command to F5XC for a TXT record. If F5XC responds with "RESPONSE-OK," no further action is needed. However, if it fails to respond correctly or does not respond at all, the script will trigger a failover action. Scenario 1: When F5XC responds to DNS queries(TXT record value is RESPONSE-OK) Namecheap dashboard shows F5XC nameservers BIG-IP DNS zone records shows F5XC nameservers F5XC zone records shows F5XC nameservers Result when performing dig to resolve sales.f5sg.com -> it shows that F5XC nameservers are Authoritative Scenario: When F5XC doesn't respond to DNS queries (TXT record value is RESPONSE-NOT-OK) We changed the TXT record value to 'RESPONSE-NOT-OK,' which should mark the monitor as down. The dummy pool went down, which means the script inside the monitor detected that the dig result was not what it expected. You can see from the zone records below that the NS records have now changed to GTM (gtm1.f5sg.com and gtm2.f5sg.com) When we check our domain registrar, Namecheap, we can see that the nameservers are now automatically set to BIG-IP GTMs. When I issue a dig command from my workstation, I can see that the nameserver responding to my query isgtm1.f5sg.com Online DNS tools (like MXToolbox) also report that gtm1.f5sg.com is the authoritative NS that responds to the DNS queries for sales.f5sg.com, which resolves to 2.2.2.2 We have now solved one of the challenges by implementing a backup failover plan using custom monitors and automations, made possible by the power of BIG-IP and APIs! Challenge #3:Synchronization Testing Using this script, we can convert and synchronize the BIG-IP Wide-IP configuration to its F5XC equivalent configuration Note: The sample script is limited to handling a Wide-IP with a single GTM pool. Inside the pool is where you will define the IP addresses that you want to load balance. The pool load balancing method is also limited to Round Robin, Ratio, Static Persist, and Global Availability. The script is designed to run at intervals. There are several ways to execute it: you can use external monitors (as we did earlier) or utilize a cronjob, etc. For testing and simplicity, I will use a cronjob set to run every 10 minutes. Let's begin creating our GSLB configuration. If you've configured BIG-IP GTM/DNS before, one of the first objects you need to create is a GTM server. I've configured two Generic Servers representing the application in two different Data Centers. Next is we create aGTM Poolwhich we will associate theVirtual Serverinside the GTM server we created earlier. (i.e. I'm assigning 1.1.1.1 and 2.2.2.2 as the members of the pool) Lastly, we will create theWide-IP recordand attach the GTM Pool we created earlier After this, the script should get triggered and convert this BIG-IP DNS Wide-IP configuration into F5XC DNS configuration. We should see that a new Primary Zone will be created in F5XC (gslb.f5sg.com) When you view the resource records, we should see a DNSLB record which has the record name equivalent to the subdomain of the wide-IP record. (BIG-IP DNS Wide-IP record is www.glsb.f5sg.com, In F5XC DNS zonegslb.f5sg.com, the record name iswwwand pointing toDNSLB object) The load balancing rules should have the DNSLB pool(pool-www) which is the equivalent of theGTM Pool(pool_www) configured in BIG-IP DNS TheDNSLB pool memberswill include the same IP addresses we defined asGTM Pool members in BIG-IP DNS. There are four load balancing methods available in F5XC, and there is an equivalent BIG-IP DNS load balancing method. The script was created to match this methods but if you configure the BIG-IP DNS pool load balancing method to something other than these four, it will default to Round Robin. BIG-IP DNS F5XC DNS Round Robin Round-Robin Ratio Ratio-Member Static Persist Static-Persist Global Availability Priority Based on the results above, we have successfully converted and synchronize BIG-IP DNS Wide-IP configuration into F5XC DNSLB records! Conclusion We have resolved DNS challenges using the power and integration of F5 solutions! By utilizing both BIG-IP and F5XC platforms, which can sign and serve DNSSEC records, we can seamlessly implement DNSSEC in a hybrid setup without complexity. Furthermore, our scalable F5XC Cloud DNS will shield you from myriad DNS DoS attacks, which are continually evolving, especially with the rise of AI. In terms of DNS resiliency, with the power of our API-first platforms and automation, we can create a DNS hybrid solution capable of automatically failing over from Cloud DNS to on-prem DNS. Lastly, we can synchronize the configurations of both platforms using standards like Zone Transfer and APIs. This capability allows us to convert and synchronize GSLB configurations between our on-prem DNS and Cloud DNS, making administration easier, and establishing a single source of truth.
michelangelodorado
Jun 19, 2024 Place Technical Articles
478Views
2likes
0Comments