Using F5 Distributed Cloud AppStack & CE Site Survivability
Introduction

In private and secure network environments, one of the greatest challenges is keeping apps running when connectivity goes down. Whether it’s due to network maintenance or an unintended outage, when apps can’t connect upstream to retrieve data, users are left with little or nothing to do. F5 Distributed Cloud AppStack and Customer Edge (CE) survivability, a new feature in Distributed Cloud, provides a unique advantage in upstream outage situations.

When all connectivity is lost between a CE and its Regional Edge (RE), including to the Global Controller, CE Survivability kicks in by allowing users to continue to access their services local to the site. Before this capability, routing to alternate sites, even when connected through a site mesh group, would fail because the identity management certificate authorities (CAs) held within the Global Controller were unreachable, so verification failed.

With AppStack and CE survivability, local site services are covered with alternative upstreams, both remote and local, in the event of a total loss of external connectivity. When connectivity is lost, the CE stands up a local control plane to take over the functions formerly provided by the Global Controller, and it establishes a local CA under a new root CA trusted by its dependent services. Border Gateway Protocol (BGP) is used to connect each site to the remote nodes that remain reachable within the CE’s site mesh group.
Loss of connectivity is now protected between each of the following:
- Nodes within a multi-node Customer Edge site
- Customer Edge site and Regional Edge (RE) sites
- Customer Edge site and the Global Controller (GC)
- Customer Edge sites within a Full Mesh Site Mesh Group
- Customer Edge sites within a Hub/Spoke Site Mesh Group (future release)
- Customer Edge sites within a DC Cluster Group (future release)

CE offline survivability solves this problem by enhancing each local site's control plane and routing:
- Local Control Plane: A local control plane is set up for the management of certificates and secrets using a local Certificate Authority (CA). This CA will be minted under a new Root CA and will be trusted by services under that tenant.
- Routing: Routes will be exchanged via BGP among nodes of a site and among nodes across sites in a Site Mesh Group.

Note: A maximum of seven days is supported for a site to survive without connectivity to the Global Controller.

How To Enable CE Survivability on AppStack & Customer Edge Sites

Pre-requisites: Two or more deployed CE Sites

Log in to the Distributed Cloud Console and go to Shared Configuration. Navigate to Virtual Sites, then Add Virtual Site. Enter the following data:
Name: full-mesh-ce-sites
Site Type: CE
Selector Expression: ves.io/siteType = ves-io-ce

Now, enable CE Survivability on each of your deployed CE sites. Navigate to Cloud & Edge Sites > Site Management > [your site type] > [Your Site] > Manage Configuration. Scroll to Advanced Configuration, click Show Advanced Fields, then Enable Offline Survivability Mode. Repeat this action for each CE identified by the virtual site Selector Expression entered above.
Now, navigate to Cloud & Edge Sites > Networking > Add Site Mesh Group, and enter the following:
Name: full-mesh-ce-group
Virtual Site: shared/full-mesh-ce-sites
Mesh Choice: Full Mesh
Full Mesh Choice: Control and Data Plane Mesh

To confirm and validate the enablement of CE Survivability, navigate to each CE Site's Dashboard via the Site Mesh Group. This can be found by going to Cloud and Edge Sites > Site Connectivity > Site Mesh Group, selecting the group that was just created, then clicking on each site in the group. Confirm that "Offline Survivability" is "Enabled" on the Detail panel, and that the "Local Control Plane Status" is enabled under the Health panel. An example of a site mesh group with this configured is shown as follows.

Example: Bookinfo distributed app deployment

Bookinfo is a modern K8s distributed app maintained by Istio, and its deployment model demonstrates the power of Distributed Cloud’s Multi-Cloud Networking CE Survivability. For more information about Bookinfo, including how to deploy it, go to https://istio.io/latest/docs/examples/bookinfo/.

For this exercise, each of the CEs in the site mesh group must have the Offline Survivability status Enabled. The status changes from Configured to Enabled after the CE restarts with the setting applied. To confirm the current status, go to the CE Site’s Dashboard. Within the “Software Version” frame, observe the Offline Survivability and its associated status.

To confirm that the tunnel between CEs will route traffic at L3 under the Site Mesh Group Full Mesh configuration, open the Site Dashboard, navigate to the Tools section, select Show routes, and use Virtual Network Type VIRTUAL_NETWORK_SITE_LOCAL_INSIDE. In the following example, the Origin server on the AWS TGW Site has a route entry on the Azure VNET Site, meaning that, if L3 routing is required directly from Azure, traffic can transit directly between both CE sites, bypassing the Global Network.
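To make the route check above more concrete, here is a minimal sketch of how one might reason about a Show routes listing: given a set of prefixes, find the most specific one that covers the origin server's address. This is not how the CE implements routing. The route table, the 10.0.0.0/16 prefix, and the next-hop names are hypothetical values chosen for illustration; only the origin address 10.0.163.106 comes from the request log later in this article.

```python
import ipaddress


def find_route(routes, destination):
    """Return the most specific (longest-prefix) route covering destination, or None."""
    dst = ipaddress.ip_address(destination)
    matches = [r for r in routes if dst in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    # Longest prefix wins, as in ordinary L3 forwarding.
    return max(matches, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)


# Hypothetical route table as it might be read off the Azure VNET site.
routes = [
    {"prefix": "0.0.0.0/0", "next_hop": "global-network"},
    {"prefix": "10.0.0.0/16", "next_hop": "aws-tgw-site"},  # learned over the site mesh
]

best = find_route(routes, "10.0.163.106")
print(best["next_hop"])
```

If a specific entry for the origin's prefix is present, the lookup resolves to the peer CE site rather than the default path, which is the condition the article describes as bypassing the Global Network.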
This feature bolsters CE site connectivity, empowering the site to make L3 routing and L7 HTTP load balancing decisions locally should it lose access to the Global Network and the Global Controller while retaining some connectivity to the CEs in the Site Mesh Group. In the event of connectivity loss, the CE transitions to offline mode. To demonstrate this easily in HTTP load balancing, origin pools configured with a priority of 0 are used only when no other origin pool is available. This makes it possible to use the local origin pool only when absolutely needed.

Additionally, any SSL-backed services with certs that can no longer be validated against the Global Controller's root CA due to loss of connectivity are automatically re-issued certs from the Site’s own internal CA. This preserves the SSL chain of authority while the site is unable to reach the Global Controller and potentially also the Internet, allowing encrypted services to keep running that would otherwise fail when the root CA is unreachable.

When a CE Site is online, this distributed app example is configured for users to connect to the Product Info (details) page locally with an on-prem point-of-access controller. The details service then connects to the reviews service to show product feedback, and the reviews service then connects to the ratings service to pull in more volatile 1–5-star ratings.

In the following Distributed Cloud HTTP Load Balancer configuration sample, each CE site is configured with a Static IP-based Origin Pool. Under normal circumstances, the Global Controller processes HTTP requests to the ratings service received by the On-Prem/Azure CE Site, and it reverse proxies the requests to the destination service running in AWS. However, with CE Survivability, when the HTTP LB service goes offline and is severed from the Global Controller and/or the Global Network, the CE configures itself to make load balancing decisions locally.
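The priority behavior described above can be sketched in a few lines. This is an illustrative model only, not the load balancer's actual selection code: the OriginPool class, the healthy flag, and the rule that lower non-zero numbers are preferred are assumptions made for the example, while the pool names and the "priority 0 is used only as a last resort" rule follow the article.

```python
from dataclasses import dataclass


@dataclass
class OriginPool:
    name: str
    priority: int   # 0 = last resort, per the article
    healthy: bool   # hypothetical reachability flag


def rank(pool):
    # Priority 0 sorts after every other pool; among the rest,
    # a lower number is treated as preferred (assumption).
    return float("inf") if pool.priority == 0 else pool.priority


def select_pool(pools):
    """Pick the preferred reachable pool, or None if nothing is reachable."""
    candidates = [p for p in pools if p.healthy]
    return min(candidates, key=rank) if candidates else None


online = [
    OriginPool("bookinfo-ratings-aws", priority=1, healthy=True),
    OriginPool("bookinfo-ratings-local", priority=0, healthy=True),
]
offline = [
    OriginPool("bookinfo-ratings-aws", priority=1, healthy=False),
    OriginPool("bookinfo-ratings-local", priority=0, healthy=True),
]

print(select_pool(online).name)   # remote pool while connected
print(select_pool(offline).name)  # local pool once AWS is unreachable
```

In the offline case the sketch falls back to the local pool, which mirrors the decision the CE makes on its own once it is cut off.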
This results in requests being sent to the locally cached ratings DB until the site is back online, allowing users to continue to have complete access to the app, albeit with some potentially latent or stale ratings.

The following YAML can be copied to the Distributed Cloud Console to either the Load Balancers or Distributed App services at Manage > Load Balancers > HTTP Load Balancers > Add HTTP Load Balancer. In the configuration section, select JSON view, then change to the YAML config format. The following HTTP LB config works with the app’s Bookinfo ratings service separately deployed to multiple sites. (NOTE: origin pools must be pre-configured before this section.)

metadata:
  name: booking-ratings-multisite
  namespace: default
  labels:
    ves.io/app_type: bookinfo
  annotations: {}
  disable: false
spec:
  domains:
    - ratings.cluster.local
    - ratings
  http:
    dns_volterra_managed: false
    port: 9080
  advertise_custom:
    advertise_where:
      - site:
          network: SITE_NETWORK_INSIDE_AND_OUTSIDE
          site:
            tenant: default
            namespace: system
            name: azure-vnet-wus2
            kind: site
        use_default_port: {}
  default_route_pools:
    - pool:
        tenant: default
        namespace: default
        name: bookinfo-ratings-aws
        kind: origin_pool
      weight: 1
      priority: 1
      endpoint_subsets: {}
    - pool:
        tenant: default
        namespace: default
        name: bookinfo-ratings-local
        kind: origin_pool
      weight: 1
      priority: 0
      endpoint_subsets: {}
  routes: []

While the On-Prem CE site is online, access to the ratings service can be observed as follows:

country: PRIVATE
kubernetes: {}
app_type: bookinfo
timeseries_enabled: false
browser_type: Firefox
device_type: Mac
req_id: 03145f22-ddc6-413a-b6e3-a511e1840a94
path: /
hostname: master-0
original_authority: ratings:9080
rtt_upstream_seconds: "0.015000"
src_instance: UNKNOWN
req_headers: "null"
tenant: tme-lab-works-oeaclgke
longitude: PRIVATE
app: obelix
rtt_downstream_seconds: "0.000000"
policy_hits:
  policy_hits:
    - result: allow
      policy_set: ves-io-active-service-policies-network-security-dpotter
      malicious_user_mitigate_action: MUM_NONE
      policy_namespace: shared
      policy_rule: allow
      policy: ves-io-allow-all
      rate_limiter_action: none
method: GET
as_number: "0"
rsp_body: UNKNOWN
time_to_last_downstream_tx_byte: 0.019969293
dst_instance: STATIC
vh_type: HTTP-LOAD-BALANCER
x_forwarded_for: 172.17.0.5
duration_with_no_data_tx_delay: "0.015985"
rsp_size: "164"
api_endpoint: UNKNOWN
authority: ratings:9080
domain: ratings
region: PRIVATE
time_to_first_downstream_tx_byte: 0.019928893
has_sec_event: true
rsp_code_class: 2xx
rsp_code_details: via_upstream
time_to_last_upstream_rx_byte: 0.019945393
dst: S:10.0.163.106
scheme: http
city: PRIVATE
dst_site: aws-tgw-site
latitude: PRIVATE
messageid: dea91c9a-beed-4561-67af-ab4112426b1f
tls_version: VERSION_UNSPECIFIED
connection_state: CLOSED
dst_ip: NOT-APPLICABLE
network: PRIVATE
src_site: azure-vnet-wus2
terminated_time: 2023-01-25T06:45:04.842657797Z
as_org: PRIVATE
duration_with_data_tx_delay: "0.016025"
src_ip: 172.17.0.5
connected_time: 2023-01-25T06:45:04.822615905Z
stream: svcfw
tls_cipher_suite: VERSION_UNSPECIFIED/TLS_NULL_WITH_NULL_NULL
original_path: /ratings/0
req_size: "391"
user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/109.0
severity: info
cluster_name: azure-vnet-wus2-default
tls_fingerprint: UNKNOWN
src: N:site-local-inside
time_to_first_upstream_rx_byte: 0.019900193
rsp_code: "200"
time_to_first_upstream_tx_byte: 0.003944379
src_port: "42106"
site: azure-vnet-wus2
"@timestamp": 2023-01-25T06:45:05.134969Z
req_body: UNKNOWN
sample_rate: 0
time_to_last_upstream_tx_byte: 0.003949779
dst_port: "80"
namespace: default
req_path: /ratings/0
time: 2023-01-25T06:45:05.134Z
asn: PRIVATE
user: IP-172.17.0.5
vh_name: ves-io-http-loadbalancer-booking-ratings-multisite
node_id: envoy_0
proxy_type: http
total_duration_seconds: 0.02

While the CE is offline and isolated from the Global Network, Global Controller, and the Internet, connection requests aren’t visible in the
Distributed Cloud Console, but using a local client, we can see that requests continue to be delivered using the local ratings service. Below, we can infer that the local CE has assumed the primary role of making load balancing decisions. Depending on the severity of the outage, the load balancing decision can be to use either the Internet site or the direct CE-CE site mesh group tunnel to reach the priority origin pool, or it can be handled entirely locally on site.

Note: To isolate the On-Prem CE artificially in this exercise, security policies can be configured to 1) block connections inbound from the remote CE site(s), and 2) block connections outbound to the F5 Global Network RE Sites and the Global Controller. Because the IPSEC and SSL tunnel connections can be long-lived, you may SSH into the CE site’s CLI and soft-restart the ver, vpm, ike, and openvpn services running on the CE Site(s).

When the CE can no longer reach the Global Controller through the REs or the Site Mesh Group CEs, the CE begins making decisions, including load balancing, from its own local control plane. You can see this happening in the following example. After opening a shell on a container running on a K8s cluster local to the site, curl requests to the ratings app continue to resolve via service discovery with the locally advertised VIP from Distributed Cloud. Because the Origin Pool in AWS is no longer reachable, traffic is sent instead to the backup local ratings service.

% k exec -it jump -- /bin/sh
/ # curl -v ratings:9080/health
* Trying 10.40.1.5:9080...
* Connected to ratings (10.40.1.5) port 9080 (#0)
> GET /health HTTP/1.1
> Host: ratings:9080
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: application/json
< date: Wed, 25 Jan 2023 07:11:36 GMT
< x-envoy-upstream-service-time: 7
< server: volt-adc
< transfer-encoding: chunked
<
* Connection #0 to host ratings left intact
{"status":"Ratings is healthy"}/ #

The following video covers each of the settings in this article, including how to test offline survivability when disconnecting the outside network at the edge site isn't desirable or possible.

Conclusion

Outages can be unpredictable; CE Survivability enables a “best case scenario” suite of services, keeping your business apps available and your users going until connectivity is restored. With Distributed Cloud AppStack and CE Site Survivability enabled at both the data plane and the control plane, services that are distinguished when online and still usable while offline become the level of service your users can expect. CE Survivability on Distributed Cloud delivers this and more.

For more information, please visit the following resources:
https://www.f5.com/cloud/products/multi-cloud-transit
https://docs.cloud.f5.com/docs/how-to/site-management/manage-site-offline-survivability
Video demo: https://youtu.be/InyJKwksbos

Demo Guide: Edge Compute with F5 Distributed Cloud Services (SaaS Console, Automation)
This demo guide provides walk-through steps or Terraform scripts to deploy a sample Edge Compute app infrastructure and connect it with multi-cloud networking (MCN) across multiple cloud providers (Azure and AWS) or a single cloud of your choosing.

Demo Guide: HA for Distributed Apps with F5 Distributed Cloud Services (SaaS Console, Automation)
Modern distributed apps require agility and high availability (HA) in their deployment topologies. This demo guide walks through optimizing the deployment of an HA Kubernetes workload to power high-performing and highly available backends or centralized app services.

Deploy High-Availability and Latency-sensitive workloads with F5 Distributed Cloud
Introduction

F5 Distributed Cloud Services delivers virtual Kubernetes (vK8s) capabilities to simplify deployment and management of distributed workloads across multiple clouds and regions. At the core of this solution is Distributed Cloud's multi-cloud networking service, enabling connectivity between locations. In Distributed Cloud, every location is identified as a site, and K8s clusters running in multiple sites can be managed by the platform. This greatly simplifies the deployment and networking of infrastructure and workloads.

Centralized databases require significant compute and memory resources and need to be configured with High Availability (HA). Meanwhile, latency-sensitive workloads require placement as close to the end users’ region as possible. Distributed Cloud handles each scenario with a consistent approach to the app and infrastructure configuration, using multi-cloud networking with advanced mesh, and with Layer 4 and/or Layer 7 load balancing. It also protects the full application ecosystem with consistent and robust security policies.

While Regional Edge (RE) sites deliver many benefits of time-to-value and agility, there are many instances where customers may find it useful to deploy compute jobs in the location or region of their choice. This may be a cloud region or a physical location in closer proximity to the other app services, and the choice may be driven by regulatory or other requirements such as lower latency. In addition, RE deployments are more constrained in their pre-configured options for memory and compute power; when a workload demands more resources or has specific requirements such as high memory or compute, a Customer Edge (CE) deployment may be a better fit. One of the most common scenarios for such a demanding workload is database deployment in a High-Availability (HA) configuration.
An example would be a PostgreSQL database deployed across several compute nodes running within a Kubernetes environment, which is a perfect fit for a CE deployment. We’ll break down this specific example in the content that follows, with links to other resources useful in such an undertaking.

Deployment architecture

F5 Distributed Cloud Services provide a mechanism to easily deploy Kubernetes apps by using virtual Kubernetes (vK8s), which helps to distribute app services across a global network while making them available closer to users. You can easily combine RE and CE sites in one vK8s deployment to ease application management and securely communicate between regional deployments and backend applications.

Configuration of our CE starts with the deployment of an F5 CE Site, which provides ways to easily connect and manage the multi-cloud infrastructure. The Distributed Cloud CE Site works with other CE and F5-provided RE Sites, which results in a robust distributed app infrastructure with full mesh connectivity and ease of management, as if it were a single K8s cluster.

From an architecture standpoint, a centralized backend or database deployed in a CE Site provides an ideal platform that other sites can connect with. We can provision several nodes in a CE for a high-availability configuration of a PostgreSQL database cluster. The services within this cluster can then be exposed to other app services, such as deployments in RE sites, by way of a TCP load balancer. Thus, the app services that consume database objects can reside close to the end user if they are deployed in an F5 Distributed Cloud Regional Edge, resulting in the following optimized architecture:

Prepare environment for HA Load

F5 Distributed Cloud Services allows creating customer edge sites with worker nodes on a wide variety of cloud providers: AWS, Azure, GCP.
The pre-requisite is a Distributed Cloud CE Site or App Stack, and once deployed, you can expose the services created on these edge sites via a Site mesh and any additional Load Balancers. A single App Stack edge Site may support one or more virtual sites; a virtual site is similar to a logical grouping of site resources. A single virtual site can be deployed across multiple CEs, thus creating a multi-cloud infrastructure. It’s also possible to place several virtual sites into one CE, each with its own policy settings for more granular security and app service management. Several virtual sites can also share the same or different CE sites as underlying resources. During the creation of sites and virtual sites, labels such as site name, site type, and others can be used to organize site resources.

The following diagram shows how vK8s clusters can be deployed across multiple CEs with virtual sites to control distributed cloud infrastructure. Note that this architecture shows four virtual clusters assigned to CE sites in different ways.

In our example, we can start by creating an AWS VPC site with worker nodes, as described here. When the site is created, the label must be assigned. Use the ves.io/siteName label to name the site. Follow these instructions to configure the site.

As soon as the edge site is created and the label is assigned, create a virtual site, as described here. The virtual site should be of the type CE, and the label must be ves.io/siteName with operation == and the name of the AWS VPC site. Note the virtual site name, as it will be required later. At this point, our edge site for the HA Database deployment is ready.

Now create the vK8s cluster. Select both virtual sites (one on CE and one on RE) by using the corresponding label: The all-res one will be used for the deployment of workloads on all REs. The environment for both RE and CE deployments is ready.
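As a rough illustration of how a virtual site's label selector picks out edge sites, the sketch below evaluates a simple equality expression against each site's labels. It is a simplified model rather than the platform's selector engine: it handles only `key == value` (or `key = value`) expressions, and the site names and label values in the sample data are hypothetical.

```python
def matches(labels, expression):
    """Evaluate a 'key == value' (or 'key = value') selector against a label dict."""
    key, op, value = expression.split()
    if op not in ("=", "=="):
        raise ValueError(f"unsupported operator in this sketch: {op}")
    return labels.get(key) == value


def select_sites(sites, expression):
    """Return the names of all sites whose labels satisfy the expression."""
    return [name for name, labels in sites.items() if matches(labels, expression)]


# Hypothetical sites and labels, for illustration only.
sites = {
    "aws-vpc-site-1": {"ves.io/siteName": "aws-vpc-site-1", "ves.io/siteType": "ves-io-ce"},
    "azure-vnet-wus2": {"ves.io/siteName": "azure-vnet-wus2", "ves.io/siteType": "ves-io-ce"},
    "re-site-a": {"ves.io/siteName": "re-site-a", "ves.io/siteType": "ves-io-re"},
}

print(select_sites(sites, "ves.io/siteName == aws-vpc-site-1"))
print(select_sites(sites, "ves.io/siteType = ves-io-ce"))
```

Selecting on ves.io/siteName yields a virtual site backed by exactly one edge site, while a broader label such as the site type groups several sites under one virtual site.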
Deploy HA Postgres to CE

We will use Helm charts to deploy a PostgreSQL cluster configuration with the help of Bitnami, which provides ready-made Helm charts for HA databases (MongoDB, MariaDB, PostgreSQL, etc.) in the following repository: https://charts.bitnami.com/bitnami. In general, these Helm charts work very similarly, so the example used here can be applied to most other databases or services.

An important key in the values for the database is clusterDomain. The value is constructed this way: {sitename}.{tenant_id}.tenant.local. Note that the site name here is that of the Edge site, not the virtual site. You can get this information from the site settings. Open the JSON settings of the site in the AWS VPC Site list. The tenant id and site name are shown as the tenant and name fields of the object:

vK8s supports only non-root containers, so these values must be specified:

containerSecurityContext:
  runAsNonRoot: true

To deploy the load to a predefined virtual site, specify:

commonAnnotations:
  ves.io/virtual-sites: "{namespace}/{virtual site name}"

When deployed, the HA Database exposes its connection via a set of services. For PostgreSQL the service name might look like ha-postgres-postgresql-ha-postgresql, on port 5432. To review the services list of the deployments, select the Services tab of the vK8s cluster.

Even though the RE deployment and the CE deployment are in one vK8s namespace, they are not directly accessible to each other. Services need to be first exposed as Load Balancers.

Expose CE service to RE deployment

To access the HA Database deployed to the CE site, we need to expose the database service via a TCP Load Balancer. A TCP Load Balancer is created by using an Origin Pool. To create the Origin Pool for a vK8s-deployed service, follow these instructions.

As soon as the Origin Pool is ready, the TCP Load Balancer can be created, as described here. This load balancer needs to be accessible only from the RE network, or in other words to be advertised there.
Therefore, when creating the TCP Load Balancer, specify “Advertise Custom” for the “Where to Advertise the VIP” field. Click “Configure” and select “vK8s Service Network on RE” for the “Select Where to Advertise” field, as well as “Virtual Site Reference” and “ves-io-shared/ves-io-all-res” for the subsequent settings. Also, make sure to specify a domain name in the “Domain” field. This makes it possible to access the service via the TCP Load Balancer domain and port. If the domain is specified as re2ce.internal and the port is 5432, the connection to the DB can be made from the RE using these settings.

RE to CE Connectivity

At this point, the HA Database workload is deployed to the CE environment. This workload implements a central data store, which takes advantage of compute-intensive resources provided by the CE. While the CE is an ideal fit for compute-heavy operations, it is typically optimized for the single cloud region where the CE is deployed. This architecture could be complemented by a multi-region design in which some of the data and compute capability moves off the CE to REs close to the end users, reducing latency for users in regions other than the CE's.

Moving services with data access points to the edge raises questions of caching and update propagation. The ideal use cases for such services are workloads that are not overly compute-heavy but rather time- and latency-sensitive: those that require decision-making at the compute edge. These edge services still require secure connectivity back to the core, and in our case we can stand up a mock service in the Regional Edge to consume the data from the centralized Customer Edge and present it to end users.

The NGINX reverse-proxy server is a handy solution for implementing data access decisions on the edge. NGINX has several modules that allow access to backend systems via the HTTP protocol.
PostgreSQL does not provide such an adapter natively, but NGINX has a module just for that: NGINX OpenResty can be compiled with the Postgres HTTP module, allowing GET/POST requests to access and modify data. To enable access to the Postgres database, the upstream block is used this way:

upstream database {
    postgres_server re2ce.internal dbname=haservicesdb user=haservices password=haservicespass;
}

As soon as the upstream is set up, queries can be performed:

location /data {
    postgres_pass database;
    postgres_query "SELECT * FROM articles";
}

Unfortunately, postgres_query and postgres_pass do not support caching, so an additional proxy_pass needs to be configured:

location / {
    rds_json on;
    proxy_buffering on;
    proxy_cache srv;
    proxy_ignore_headers Cache-Control;
    proxy_cache_methods GET HEAD POST;
    proxy_cache_valid 200 302 30s;
    proxy_cache_valid 404 10s;
    proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504 http_429;
    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://localhost:8080/data;
}

Note the additional rds_json directive above; it's used to convert the response from binary to JSON. Now that the data is cached on the Regional Edge, the cached response is returned when the central server is unavailable or inaccessible. This is an ideal situation for how a distributed multi-region app may be designed and connected, where the deployment on the RE creates a service accessible to the end users via a Load Balancer.

Enhanced Security Posture with Distributed Cloud

Of course, we’re using NGINX with the PostgreSQL module for illustration purposes only; exposing databases this way in production is not secure. However, this gives us an opportunity to think through how publicly accessible service endpoints can potentially be open to attacks.
A Web App Firewall (WAF) is provided as part of the Web App & API Protection (WAAP) set of services within F5 Distributed Cloud, and it secures all of the services exposed in our architecture with a consistent set of protections and controls. For example, with just a few clicks, we can protect the Load Balancer that exposes an external web port to end users on the RE using WAF and bot protection services. Similarly, other services on the CE can also be protected with the same consistent security policies.

Monitoring & Visibility

All of the networking, performance, and security data and analytics are readily available to end users within F5 Distributed Cloud dashboards. For our example, this is a list of all connections from RE to CE via the TCP load balancer, detailed for each RE site:

Another useful data point is a chart and detail of HTTP load balancer requests:

Conclusion

In summary, the success of a distributed cloud architecture depends on placing the right types of workloads on the right cloud infrastructure. F5 Distributed Cloud provides and securely connects various types of distributed app-ready infrastructure, such as the Customer Edge and Regional Edge used in our example. A compute-heavy centralized database workload that requires high availability can take advantage of vK8s for ease of deployment and configuration with Helm charts, scalability, and control. The CE workload can then be exposed via Load Balancers to other services deployed in other clouds or regions, such as the Regional Edge service we utilized here. All of the distributed cloud infrastructure, networking, security, and insights are available in one place with F5 Distributed Cloud Services.

Additional Material

Now that you've seen how to build our solution, try it out for yourself!
Product Simulator: A guided simulation in a sandbox environment covering each step in this solution
Demo Guide: A comprehensive package, including a step-by-step guide and the images needed to walk through this solution every step of the way in your own environment. This includes the scripts needed to automate a deployment, including the images that support the sample application.

Links
GitHub: Demo Guide - HA DB with CE and RE
Simulator: High Availability Workloads
Product Page: Distributed Cloud Multi-Cloud Transit
Product Page: Distributed Cloud Web App & API Protection (WAAP)
Tech Doc: Deploying Distributed Cloud in AWS VPC's
Tech Doc: Virtual Kubernetes (vK8s) on Distributed Cloud