Securely Scale RAG - Azure OpenAI Service, F5 Distributed Cloud and NetApp
Arguably, the easiest and most massively scalable approach to harnessing Large Language Models (LLMs) is to consume leading services like OpenAI endpoints, the most well-known of cloud-based offering delivered to enterprises over the general Internet. Access to hardware, such as GPUs, and the significant skillset to run LLMs on your own become non-issues, consumption is simply an API call away. One concern, and a serious one, is that sensitive inferencing (AI prompts, both the requests and responses) travels "in the wild" to these LLMs found through DNS at public endpoints. Retrieval Augmented Generation (RAG) adds potentially very sensitive corporate data to prompts, to leverage AI for internal use cases, thus ratcheting up even further the uneasiness with using the general Internet as a conduit to reach LLMs. RAG is a popular method to greatly increase the accuracy and relevancy of generative AI for a company’s unique set of problems. Finally, to leverage sensitive data with RAG, the source documents must be vectorized with similarly remote “embedding” LLMS; once again sensitive, potentially proprietary sensitive data will leave the corporate premises to leverage the large AI solutions like OpenAI or Azure OpenAI. Unlike purveyors of locally executed models, say a repository like Huggingface.com, which allow downloading of binaries to be harnessed on local compute, industry leading solutions like OpenAI and Azure OpenAI Service are founded on the paradigm of remote compute. Beyond the complexity and resources of quickly and correctly setting up performant on-prem models one time, the choice to consume remote endpoints allows hassle-free management like models perpetually updated to latest revisions and full white-glove support available to enterprise customers consuming SaaS AI models. In this article, an approach will be presented where, using F5 Distributed Cloud (XC) and NetApp, Azure OpenAI Service can be leveraged with privacy, where prompts are carried over secured, encrypted tunnels over XC between on-premises enterprise locations and that enterprise’s Azure VNET. The Azure OpenAI models are then exclusively exposed as private endpoints within that VNET, nowhere else in the world. This means both the embedding LLM activity to vectorize sensitive corporate data, and the actual generative AI prompts to harness value from that data are encrypted in flight. All source data and resultant vector databases remain on-premises in well-known solutions like a NetApp ONTAP storage appliance. Why is the Azure OpenAI Service a Practical Enabler of AI Projects? Some of the items that distinguish Azure OpenAI Service include the following: Prompts sent to Azure OpenAI are not forwarded to OpenAI, the service exists within Microsoft Azure, benefiting from the performance of Microsoft’s enormous cloud computing platform Customer prompts are never used for training data to build new or refine existing models Simplified billing, think of the Azure OpenAI Service as analogous to an “all you can eat buffet”, simply harness the AI service and settle the charge incurred on a regular monthly billing cycle With OpenAI, models are exposed at universal endpoints shared by a global audience, added HTTP headers such as the OPENAI_API_KEY value distinguish users and allow billing to occur in accordance with consumption. Azure OpenAI Service is slightly different. No models actually exist to be used until they are setup under an Azure subscription. At this point, beyond receiving an API key to identify the source user, the other major difference is unique API "base" URL (endpoint) is setup for accessing LLMs an organization wishes to use. Examples would be a truly unique enterprise endpoint for GPT-3.5-Turbo, GPT4 or perhaps an embedding LLM used in vectorization, such as the popular text-embedding-ada-002 LLM. This second feature of Azure OpenAI Service presents a powerful opportunity to F5 Distributed Cloud (XC) customers. This stems from the fact that unlike traditional OpenAI, this per-organization, unique base URL for API communications does not have to be projected into the global DNS, reachable from anywhere on the Internet. Instead, Microsoft Azure allows the OpenAI service to be constrained to a private endpoint, accessible only from where the customer chooses. Leveraging F5 XC Multicloud Networking offers a way to secure and encrypt communications between on-premises locations and Azure subnets only available from within the organization. What does this add up to for the enterprise with generative AI projects? It means huge scalability for AI services and consuming the very much leading-edge modern OpenAI models, all in a simple manner an enterprise can now consume today with limited technical onus on corporate technology services. The sense of certainty that sensitive data is not cavalierly exposed on the Internet is a critical cog in the wheel of good data governance. Tap Into Secure Data from NetApp ONTAP Clusters for Fortified Access to OpenAI Models The F5 Distributed Cloud global fabric consists of points of presence in 26+ metropolitan markets worldwide, such as Paris, New York, Singapore, that are interconnected with high-speed links aggregating to more than 14 Tbps of bandwidth in total, it is growing quarterly. With the F5 multicloud networking (MCN) solution, customers can easily set up dual-active encrypted tunnels (IPSec or SSL) to two points on the global fabric. The instances connected to are referred to as RE’s (Regional Edge nodes) and the customer-side sites are made up of CE’s (Customer Edge nodes, scalable from one to a full cluster). The service is a SaaS solution and setup is turn-key based upon menu click-ops or Terraform. The customer sites, beyond being in bricks-and-mortar customer data centers and office locations, can also exist within cloud locations such as Microsoft Azure Resource Groups or AWS VPCs, among others. Enterprise customers with existing bandwidth solutions may choose to directly interconnect sites as opposed to leveraging the high-speed F5 global fabric. The net result of an F5 XC Layer 3 multicloud network is high-speed, encrypted communications between customer sites. By disabling the default network access provided by Azure OpenAI Service, and only allowing private endpoint access, one can instantiate a private approach to running workloads with well-known OpenAI models. With this deployment in place, customers may tap into years of data acquired and stored on trusted on-premises NetApp storage appliances to inject value into AI use cases, customized and enhanced inference results using well-regarded, industry-leading OpenAI models. A perennial industry leader in storage is ONTAP from NetApp, a solution that can safely expose volumes to file systems, through protocols such as NFS and SMB/CIFS. The ability to also expose LUNs, meaning block-level data that constitutes remote disks, is also available using protocols like iSCSI. In the preceding diagram, one can leverage AI through a standard Python approach, in the case shown harnessing an Ubuntu Linux server, and volumes provided by ONTAP. AI jobs, rather than calling out to an Internet-routed Azure OpenAI public endpoint can instead interact with a private endpoint, one which resolves through private DNS to an address on a subnet behind a customer Azure CE node. This endpoint cannot be reached from the Internet, it is restricted to only communicating with customer subnets (routes) located in the L3 multicloud deployment. In use cases that leverage one’s own data, a leading approach is Retrieval Augmented Generation (RAG) in order to empower Large Language Models (LLMs) to deliver niche, hyper-focused responses pertaining to specialized, sometimes proprietary, documents representing the corporate body of knowledge. Simple examples might include highly detailed, potentially confidential, company-specific information distilled from years of financial internal reporting. Another prominent early use case of RAG is to backstop frontline, customer helpdesk employees. With customers sensitive to delays in handling support requests, and pressure to reduce support staff research delays, the OpenAI LLM can harvest only relevant knowledge base (KB) articles, releases notes, and private engineering documents not normally exposed in their entirety. The net result is a much more effective helpdesk experience, with precise, relevant help provided to the support desk employee in seconds. RAG Using Microsoft Azure OpenAI, F5 and NetApp in Nutshell In the sample deployment, one of the more important items to recognize is that two OpenAI models will be harnessed, an embedding LLM and a generative transformer based GPT family LLM. A simple depiction of RAG would be as follows: Using OpenAI Embedding LLMs The OpenAI embedding modeltext-embedding-ada-002 is used first to vectorize data sourced from the on-premises ONTAP system, via NFS volumes mounted to the server hosting Python. The embedding model consumes “chunks” of text from each sourced document and converts the text to numbers, specifically long sequences of numbers, typically in the range of 700 to 1,500 values. These are known as vectors. The vectors returned in the private OpenAI calls are then stored in a vector database, in this case ChromaDB was used. It is important to note, the ChromaDB itself was directed to install itself within a volume supported by the on-premises ONTAP cluster, as such the content at rest is governed by the same security governance as the source content in its native format. Other common industry solutions for vector storage and searches include Milvus and for those looking to cloud-hosted vectors Pinecone. Vector databases are purpose-built to manage vector embeddings. Conventional databases can, in fact, store vectors but the art of doing a semantic search, finding similarities between vectors, would then require vector indices solutions. One of the best known in FAISS (Facebook AI Similarity Search) which is a library that allows developers to quickly search for embeddings of multimedia documents. These semantic searches would otherwise be inefficient or impossible with standard database engines (SQL). When a prompt is first generated by a client, the text in the prompt is vectorized by the very same OpenAI embedding model, producing a vector on the fly. The key to RAG, the “retriever” function, then compares the newly arrived query with semantically similar text chunks in the database. The actual semantic similarity of the query and previously stored chunks is arrived at through a nearest neighbor search of the vectors, in other words, phrases and sentences that might augment the original prompt can be provided to the OpenAI GPT model. The art of finding semantic similarities relies upon comparing the lengthy vectors. The objective, for instance, to find supportive text around the user query “how to nurture shrub growth” might reasonably align more closely with a previously vectorized paragraph that included “gardening tips for the North American spring of 2024” and less so with vectorized content stemming from a user guide for the departmental photocopy machine. The suspected closeness of vectors, are text samples actually similar topic wise, is a feature of semantic similarity search algorithms, many exist in themarketplace and two approaches commonly leveraged are cosine similarity and Euclidean distance; a brief description for those interested can be found here. The source text chunks corresponding to vectors are retained in the database and it is this source text that augments the prompt after the closest neighbor vectors are calculated. Using OpenAI GPT LLMs Generative Pre-trained Transformer (GPT) refers to a family of LLMs created by OpenAI that are built on atransformer architecture. The specific OpenAI model used in this model is not necessarily the latest, premium model, GPT-4o and GPT-4 Turbo are more recent, however the utilized gpt-35-turbo model is a good intersection of price versus performance and has been used extensively in deployed projects. With the retriever function helping to build an augmented prompt, the default use case documented included three text chunks to buttress the original query. The OpenAI prompt response will not only be infused with the provided content extracted from the customer but unlike normal GPT responses, RAG will have specific attributions to which documents and specific paragraphs led to the response. Brief Overview of Microsoft OpenAI Service Setup Microsoft Azure has a long history of adding innovative new functions as subscribed “opt in” service resources, the Azure OpenAI Service is no different. A thorough, step-by-step guide to setting up the OpenAI service can be found here. This screenshot demonstrates the rich variety of OpenAI models available within Azure, specifically showing the Azure OpenAI Studio interface, highlighting models such as gpt-4, gpt-4o and dall-e-3. In this article, two models are added, one embedding and the other GPT. The following OpenAI Service Resource screen shows the necessary information to actually use our two models. This information consists of the keys (use either KEY1 and KEY2, both can be seen and copied with the Show Keys button) and the unique, per customer endpoint path, frequently referred to as the base URL by OpenAI users. Perhaps the key Azure feature that empowers this article is the ability to disable network access to the configured OpenAI model, as seen below. With traditional network access disabled, we can then enable private endpoint access and set the access point to a network interface on the private subnet connected to the inside interface of our F5 Distributed Cloud CE node. The following re-visits the earlier topology diagram, with focus upon where the Azure OpenAI service interacts with our F5 Distributed Cloud multicloud network. The steps involved in setting up an Azure site in F5 Distributed Cloud are found here. The corresponding steps for configuring an on-premises Distributed Cloud site are found in this location. Many options exist, such as using KVM or a bare metal server, the link provided highlights the VMware ESXi approach to on-premises site creation. Demonstrating RAG in Action using OpenAI Models with a Secure Private Endpoint The RAG setup, in lieu of vectorizing actual private and sensitive documents, utilized the OpenAI embedding LLM to process chunks taken from the classic H.G. Wells 1895 science fiction novel “The Time Machine” in text or markdown format. The novel is one of many in the public domain through the Gutenberg Project. Two NFS folders supported by the NetApp ONTAP appliance in a Redmond, Washington office were used: one for source content and one for supporting the ChromaDB vector database. The NFS mounts are seen below, with the Megabytes consumed and remaining available seen per volume, the ONTAP address can be seen as 10.50.0.220. (Linux Host) #df -h 10.50.0.220:/RAG_Source_Documents_2024 1.9M 511M 1% /mnt/rag_source_files 10.50.0.220:/Vectors 17M 803M 3% /home/sgorman/langchain-rag-tutorial-main/chroma2 The creation of the vector database was handled by one Python script and the actual AI prompts generated against the OpenAI gpt-35-turbo model were housed in another script. This may often make sense, as the vector database creation may be an infrequently run script, only executed when new source content is introduced (/mnt/rag_source_files) whereas the generative AI tasks targeting gpt-3.5-turbo are likely run continuously for imperative business needs like helpdesk or code creations, as example purposes. Creating the vector database first entails preparing the source text, typically remove extraneous formatting or less than valuable text fields, think of boilerplate statements such as repetitive footnotes or perhaps copyright/privacy statements that might be found on every single page of some corporate documents. The next step is to create text chunks for embedding, the tradeoff of using too short chunks will be lack of semantic meaning in any one chunk and a growth in the vector count. Using overly long chunks, on the other hand, could lead to lengthy augmented prompts sent to gpt-35-turbo that significantly grow the token count for requests, although many models now support very large token counts a common value remains a total, for requests and responses, of 4,096 tokens. Token counts are the foundation for most billing formulae of endpoint-based AI models. Finally, it is important to have some degree of overlap of generated chunks such that meanings and themes within documents are not lost; if an idea is fragmented at the demarcation point of adjacent chunks the model may not pickup on its importance. The vectorization script for “The Time Machine” resulted in 978 chunks being created from the source text, with character counts per chunk not to exceed 300 characters. The text splitting function is loaded from LangChain and the pertinent code lines include: from langchain.text_splitter import RecursiveCharacterTextSplitter text_splitter = RecursiveCharacterTextSplitter( chunk_size=300, chunk_overlap=100, } The values of 100 characters of overlaps suggests each chunk will incorporate 200 characters of new text within the total of 300 grabbed. It is important to remember all characters, even white space, count towards totals. As per the following screenshot, the source novel, when split into increments of 200 new characters per chunk does indicate 978 chunks were indeed a correct total (double click to expand). With the source data vectorized and secure on the NetApp appliance, the actual use of the gpt-35-turbo OpenAI model could commence. The following shows an example, where the model is instructed in the system prompt to only respond via information it can glean from the RAG augmented prompt text, the response portions shown in red font. python3 create99.py “What is the palace of green porcelain?” <response highlights below, the response also included the full text chunks RAG suggested would potentially support the LLM in answering the posed question> Answer the question based on the above context: What is the palace of green porcelain? Response: content='The Palace of Green Porcelain is a deserted and ruined structure with remaining glass fragments in its windows and corroded metallic framework.' response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 175, 'total_tokens': 200}, 'model_name': 'gpt-35-turbo', The response of gpt-35-turbo is correct, and we see that the token consumption is heavily slanted towards the request (the “prompt”), with 175 tokens used whereas the response required only 25 tokens. The key takeaways are that the prompt and its response did not travel hop-by-hop over the Internet to a public endpoint, all traffic traveled with VPN-like security from the on-premises server and ONTAP to a private Azure subnet using F5 Distributed Cloud. The OpenAI model was utilized as a private endpoint, corresponding a network interface available only on that private subnet and not found within the global DNS, only the private corporate DNS or /etc/hosts files. Adding Laser Precision to RAG Using the default chunking strategy did lead to sub-optimal results, when ideas, themes and events were lost across chunk boundaries, even when including some degree of overlap. The following is one example: A key moment in the H.G. Wells book involves the protagonist meeting a character Weena, who provides strange white flowers which are pocketed. Upon returning to the present time, the time traveler relies upon the exotic and foreign look of the white flowers to attempt to prove to friends the veracity of his tale. # python3 query99.py “What did Weena give the Time Traveler?” As captured in the response below, the chunks provided by RAG do not provide all the details, only that something of note was pocketed, but gpt-35-turbo can therefore not return a sufficient answer as the full details are not provided in the augmented prompt. The screenshot shows first the three chunks and at the end the best answer the LLM could provide (double click to expand). The takeaway is that some effort will be required to adjust the vectorization process to pick optimally large chunk sizes, and sufficient numbers to properly empower the OpenAI model. In this demonstration, based upon vectors and their corresponding text, only three text chunks were harnessed to augment the user prompt. By increasing this number to 5 or 10, and increasing each of the chunk sizes, all of course at the expense of token consumption, one would expect more accurate results from the LLM. Summary This article demonstrated a more secure approach to using OpenAI models as a programmatic endpoint service in which proprietary company information can be kept secure by not using the general purpose, insecure Internet to provide prompts for vectorization and general AI inquiries. Instead, an approach was followed where the Azure OpenAI service was deployed as a private endpoint, exclusively available at an address on a private subnet within an enterprise’s Azure subscription, a subnet with no external access. By utilizing F5 Distributed Cloud Multicloud Networking, existing corporate locations and data centers can be connected to that enterprise’s Azure resource groups and private, encrypted communications can take place between these networks, the necessary routing and tunneling technologies are deployed in a turn-key manner without requiring advanced network skillsets. When leveraging NetApp ONTAP as the continued enterprise storage solution, RAG deployments based upon Azure OpenAI service can continue to be managed and secured with well-developed storage administration skills. In this example, ONTAP housed both the source, sensitive enterprise content and the actual vector database resulting from interactions with the Azure OpenAI embedding LLM. Subsequent to a discussion on vectors and optimal chunking strategies, RAG was utilized to answer questions on private documents using the well-known OpenAI chat-35-turbo model.87Views1like0CommentsProtect multi-cloud and Edge Generative AI applications with F5 Distributed Cloud
F5 Distributed Cloud capabilities allows customers to use a single platform for connectivity, application delivery and security of GenAI applications in any cloud location and at the Edge, with a consistent and simplified operational model, a game changer for streamlined operational experience for DevOps, NetOps and SecOps.525Views3likes0CommentsF5 Distributed Cloud - Customer Edge Site - Deployment & Routing Options
F5 Distributed Cloud Customer Edge (CE) software deployment models for scale and routing for enterprises deploying multi-cloud infrastructure. Today's service delivery environments are comprised of multiple clouds in a hybrid cloud environment. How your multi-cloud solution attaches to your existing on-prem and cloud networks can be the difference between a successful overlay fabric, and one that leave you wanting more out of your solution. Learn your options with F5 Distributed Cloud Customer Edge software.9.2KViews17likes3CommentsSecure RAG for Safe AI Deployments Using F5 Distributed Cloud and NetApp ONTAP
Retrieval Augmented Generation (RAG) is one of the most discussed techniques to empower Large Language Models (LLM) to deliver niche, hyper-focused responses pertaining to specialized, sometimes proprietary, bodies of knowledge documents. Two simple examples might include highly detailed company-specific information distilled from years of financial internal reporting from financial controllers or helpdesk type queries with the LLM harvesting only relevant knowledge base (KB) articles, releases notes, and private engineering documents not normally exposed in their entirety. RAG is highly bantered about in numerous good articles; the two principal values are: LLM responses to prompts (queries) based upon specific, niche knowledge as opposed to the general, vast pre-training generic LLMs are taught with; in fact, it is common to instruct LLMs not to answer specifically with any pre-trained knowledge. Only the content “augmenting” the prompt. Attribution is a key deliverable with RAG. Generally LLM pre-trained knowledge inquiries are difficult to traceback to a root source of truth. Prompts augmented with specific assistive knowledge normally solicit responses that clearly call out the source of the answers provided. Why is the Security of RAG Source Content Particularly Important? To maximize the efficacy of LLM solutions in the realm of artificial intelligence (AI) an often-repeated adage is “garbage in, garbage out” which succinctly states an obvious fact with RAG: valuable and actionable items must be entered into the model to expect valuable, tactical outcomes. This means exposing key forms of data, examples being data which might include patented knowledge, intellectual property not to be exposed in raw form to competitors. Actual trade secrets, which will infuse the LLM but need to remain confidential in their native form. In one example around trade secrets, the Government of Canada spells out a series of items courts will look at in determining compensation for misuse (theft) of intellectual property. It is notable that the first item listed is not the cost associated with creation of the secret material (“the cost in money or time of creating or developing the information”) but rather the very first item is instead how much effort was made to keep the content secure (“the measures taken to maintain secrecy”). With RAG, incoming queries are augmented with rich, semantically similar enterprise content. The content has already been populated into a vector database by converting documents, they might be pdf or docx as examples, into raw text form and converting chunks of text into vectors. The vectors are long sequences of numbers with similar mathematical attributes for similar content. As a trivial example, one-word chunks such as glass, cup, bucket, jar might be semantically related, meaning similarities can be construed by both human minds and LLMs. On the other hand, empathy, joy, and thoughtfulness maintain similarities of their own. This semantic approach means a phrase/sentence/paragraph (chunk) using bow to mean “to bend in respect” will be highly distinct from chunks referring to the “front end of a ship" or “something to tie one’s hair back with”, even a tool every violinist would need. The list goes on; all semantic meanings of bow are very different in these chunks and would have distinctive embeddings within a vector database. The word embedding is likely derived from “fixing” or “planting” an object. In this case, words are “embedded” into a contextual understanding. The typical length of the number sequence describing the meaning of items has typically been more than 700, but this number of “dimensions” applied is always a matter of research, and the entire vector database is arrived at with an embedding LLM, distinct from the main LLM that will produce generative AI responses to our queries. Incoming queries destined for the main generative AI LLM can, in turn, be converted to vectors themselves by the very same text-embedding “helper” LLM and through retrieval (the “R” in RAG) similar textual content can buttress the prompt presented to the main LLM (double click to expand). Since a critical cog in the wheel of the RAG architecture is the ingestion of valuable and sensitive source documents into the vector database, using the embedding LLM, it is not just prudent but critical that this source content be brought securely over networks to the embedding engine. F5 Distributed Cloud Secure Multicloud Networking and NetApp ONTAP For many practical, time-to-market reasons, modern LLMs, both the main and embedding instances, may not be collocated with the data vaults of modern enterprises. LLMs benefit from cloud compute and GPU access, something often in short supply for on-premises production roll outs. A typical approach assisted by the economies of scale might be to harvest public cloud providers, such as Azure, AWS, and Google Cloud Platform for the compute side of AI projects. Azure, as one example, can turn up virtual machines with GPUs from NVIDIA like A100, A2, and Tesla T4 to name a few. The documents needed to feed an effective RAG solution may well be on-premises, and this is unlikely to change for reasons including governance, regulatory, and the weight of decades of sound security practice. One of the leading on-premises storage solutions of the last 25 years is the NetApp ONTAP storage appliance family, and reflected in this quote from NVIDIA: "Nearly half of the files in the world are stored on-prem on NetApp." — Jensen Huang, CEO of NVIDIA A key deliverable of F5 Distributed Cloud is providing encrypted interconnectivity of disparate physical sites and heterogeneous cloud instances such as Azure VNETs or AWS VPCs. As such, there are two immediate, concurrent F5 features that come to mind: Secure interconnectivity of on-premises NetApp volumes (NAS) or LUNs (Block) containing critical documents for ingestion into RAG. Utilize encrypted L3 connectivity between the enterprise location and the cloud instance where the LLM/RAG are instantiated. TCP load balancers are another alternative for volume sharing NAS protocols like NFS or SMB/CIFS. Secure access to the LLM web interface or RESTful API end points, with HTTPS load balancers including key features like WAF, anti-bot mechanisms, and API automatic rate limiting for abusive prompt sources. The following diagram presents the topology this article set out to create, REs are “regional edge” sites maintained internationally by F5 and harness private RE to RE, high-speed global communication links. DNS names, such as the target name of an LLM service, will leverage mappings to anycast IP addresses, thus users entering the RE network from southeast Asian might, for example, enter the Singapore RE while users in Switzerland might enter via a Paris or Frankfurt RE. Complementing the REs are Customer Edge (CE) nodes. These are virtual or physical appliances which act as security demarcation points. For instance, a CE placed in an Azure VNET can protect access to the server supporting the LLM, removing any need for Internet access to the server, which is now entirely accessible only through a private RFC-1918 type of private address. External access to the LLM for just employees or, maybe employees and contractors, or potentially access for the Internet community is enabled by a distributed HTTPS load balancer. In the example depicted above, oriented towards full Internet access, the FQDN of the LLM is projected by the load balancer into the global DNS and consumers of the service resolve the name to one IP address and are attracted to the closest RE by BGP-4’s support for anycast. As the name “distributed” load balancer suggests, the origin pool can be in an entirely different site than the incoming RE, in this case the origin pool is the LLM behind the CE in the Azure VNET. The LLM requests travel from RE to CE via a highspeed networking underlay. The portion of the solution that securely ties the LLM to the source content required for RAG to embed vectors is, in this case, utilizing layer 3 multicloud networking (MCN). The solution is turnkey, routing table are automatically connected to members of the L3 MCN, in this case the inside interfaces of the Azure CE and Redmond, Washington on-premises CE and traffic flows over an encrypted underlay network. As such, the NetApp ONTAP cluster can securely expose volumes with key file ware via a protocol like Network File System (NFS), no risk of data exposure to third-party prying eyes exists. The following diagram drills into the RE and CE and NetApp interplay (double click to expand). F5 Distributed Cloud App Connect and LLM Setup This article speaks to hands-on experience with web-driven LLM inferencing with augmented prompts derived from a RAG implementation. The AI compute was instantiated on an Azure-hosted Ubuntu 20.04 virtual machine with 4 virtual cores. Installed software included Python 3.10, and libraries such as Langchain, Pypdf (for converting pdf documents to text), FAISS (for similarity searching via a vector database), and other libraries. The actual open source LLM utilized for the generative AI is found here on huggingface.co. The binary, which exceeds 4 GB, is considered effective for CPU-based deployments. The embedding LLM model, critical to seed the vector database with entries derived from secured enterprise documentation, and then used again per incoming query for RAG similarity searches to build augmented prompts, was from Hugging Face: sentence-transformers/all-MiniLM-L6-v2 and can be found here. The AI RAG solution was implemented in Python3, and as such the Azure Ubuntu can be accessed both by SSH or via Jupyter Notebooks. The latter was utilized as this is the preferred final delivery mechanism for standard users, not a web chatbot design or the requirement to use API commands through solutions like Postman or Curl. This design choice, to steer the user experience towards Jupyter Notebook consumption, is in keeping with the fact that it has become a standard in AI LLM usage where the LLM is tactical and vital to an enterprise's lines of business (LOBs). Jupyter Notebooks are web-accessed with a browser like Chrome or Edge and as such, F5’s WAF, anti-bot, and L7 DDoS, all part of the F5 WAAP offering, can easily be laid upon an HTTP load balancer with a few mouse clicks in XC to provide premium security to the user experience. NetApp and F5 Distributed Cloud Secure Multicloud Networking The secure access to files for ingestion into the vector database, for similarity searches when user queries are received, makes use of an encrypted L3 Multicloud Network relationship between the Azure VNET and the LAN on prem in Redmond, Washington hosting the NetApp ONTAP cluster. The specific protocol chosen was NFS and the simplicity is demonstrated by the use of just one Linux command to present key, high-valued documents for the AI to populate the database: #mount -t nfs <IP Address of NetAPP LIF interface on-prem>:/Secure_docs_for_RAG /home/ubuntu_restriced_user/rag_project/docs/Secure_docs_for_RAG. This address is available nowhere else in the world except behind this F5 CE in the Azure VNET. After the pdf files are converted to text, chunked to reasonable sizes with some overlap suggested between the end of one chunk and the start of the next chunk, the embedding LLM will populate the vector database. The files are always only accessed remotely by NFS through the mounted volume, and this mount may be terminated until new documents are ready to be added to the solution. The Objective RAG Implementation - Described In order to have a reasonable facsimile of the real-word use cases this solution will empower today, but not having any sensitive documents to be injected, it was decided to use some seminal “Internet Boom”-era IETF Requests for Comment (RFCs) as source content. With the rise of multi-port routing and switching devices, it became apparent the industry badly needed specific and highly precise definitions around network device (router and switch) performance benchmarking to allow purchasers “apples-to-apples” comparisons. These documents recommend testing parameters, such as what frame or packet sizes to test with, test iteration time lengths, when to use FIFO vs LIFO vs LILO definitions of latency, etc. RFC-1242 (Request for Comment, terminology) and RFC-2544 (methodologies), chaired by Scott Bradner of Harvard University, and the later RFC 2285 (LAN switching terminologies), chaired by Bob Mandeville then of European Network Laboratories are three prominent examples, to which test and measurement solutions aspired to be compliant. Detailed LLM answers for quality assurance engineers in the network equipment manufacturing (NEM) space is the intended use case of the design, answers that must be distilled specifically by generative AI considering queries augmented by RAG and specifically only based upon these industry-approved documents. These documents are, of course, not containing trade secrets or patented engineering designs. They are in fact publicly available from the IETF, however they are nicely representative of the value offered in sensitive environments. Validating RAG – Watching the Context Provided to the LLM To ensure RAG was working, the content being augmented in the prompt was displayed to screen, we would expect to see relevant clauses and sentences from the RFCs being provided to the generative AI LLM. Also, if we were to start by asking questions that were outside the purview of this testing/benchmarking topic, we should see the LLM struggle to provide users a meaningful answer. To achieve this, rather than, say, asking what 802.3/Ethernetv2 frame sizes should be used in throughput measurements, and what precisely is the industry standard definition of the term “throughput” was, the question instead pertained to a recent Netflix release, featuring Lindsay Lohan. Due to the recency of the film, even if the LLM leaned upon its pretrained model, it will come up with nothing meaningful. “Question: Important, only use information provided as context in the prompt, do not use other trained knowledge. Please identify who played Heather in the March 2024 Lindsay Lohan Netflix movie titled Irish Wish?” As seen in the following Jupyter screenshot, the RAG solution can only provide augmented prompts from the database, in this case it has some test and measurement clauses and some rules pertaining to the winter ice sport of curling (double click to enlarge). The supportive context data augmenting the prompt was, as expected, not going to help in this off-the-wall query, as only the RFCs and some sports rules had been introduced to the vector database. What is especially noteworthy is the LLM did not simply admit it could not possibly answer this question but hallucinated a very articulate response that indicated “Heather played for her team against Lindsay Lohan’s Team Ireland”. To validate the RAG architecture for a working scenario, actually pertinent to the movie, a film review in pdf format was added to the ONTAP cluster-housed volume. The supportive documentation was consumed (using Pypdf from the Jupyter Notebook it became text). It was automatically chunked and introduced to the vector database through the Hugging Face embedding LLM. With the RAG vector database populated by the embedding LLM, the responses then became meaningful. The following exchange shows the value of RAG in producing meaningful, thought-out answers: Question: Important, only use information provided as context in the prompt, do not use other pre-trained knowledge. Who does Maddie edit novels for in the Lindsay Lohan March 2024 Netflix movie titled Irish Wish? Answer: Let's think step by step. We know that Maddie Kelly plays a book editor who harbors a big secret, and we have been given information about her job as an editor for Paul Kennedy (played by Alexander Vlahos), which is also part of the question context provided above. Therefore it seems logical to infer that she edits novels in general but specifically those written or co-written by author Paul Kennedy - who happens to be a bestselling romance novelist as stated earlier on, so we can safely conclude Maddie works for Paul himself (as opposed to any other book publishing company). This time we see relevant, meaningful context provided by RAG, with relevant portions of the pdf file review document augmenting the prompt. The augmentations are normally not displayed unless requested by the user or programmer, unlike humans, LLMs do not require superfluous text formatting (advantage: LLMs) and thus the content is packed tightly and efficiently. Fewer characters also mean fewer tokens get used by the targeted AI model and can allow more data before a token context limit is reached. In pay-for-use LLM approaches fewer tokens also help the enterprise’s bottom line financially. Also, note that the answer will likely not always be identical with subsequent asks of the same question as per LLM normal behavior. Features like “temperature setting” can also allow more “creative” ideas in responses, injecting humor and even outlandishness if desired. The RAG workflow is now validated, but the LLMs in question (embedding and main generative LLM) can still be made better with these suggestions: Increase “chunk” sizes so ideas are not lost when excessive breaks make for short chunks. Increase “overlap” so an idea/concept is not lost at the demarcation point of two chunks. Most importantly, provide more context from the vector database as context lengths (maximum tokens in a request/response) are generally increasing in size. Llama2, for instance, typically has a 4,096 context length but can now be used with larger values, such as 32,768. This article used only 3 augmentations to the user query, better results could be attained by increasing this value at a potential cost of more CPU cycles. Using Secure RAG – F5 L3 MCN, HTTPS Load Balancers and NetApp ONTAP Together With the RAG architecture validated to be working, the solution was used to assist the target user entering queries to the Azure server by means of Jupyter Notebooks, with RAG documents ingested over encrypted, private networking to the on-premises ONTAP cluster NFS volumes. The questions posed, which are answerable by reading and understanding key portions spread throughout the Scott Bradner RFCs, was: “Important, only use information provided as context in the prompt, do not use other pre-trained knowledge. Please explain the specific definition of throughput? What 802.3 frame sizes should be used for benchmarking? How long should each test iteration last? If you cannot answer the questions exclusively with the details included in the prompt, simply say you are unable to answer the question accurately. Thank you." The Jupyter Notebook representation of this query, which is made in the Python language and issued from the user’s local browser anywhere in the world and directly against the Azure-hosted LLM, looks like the following (click to expand image): The next screenshot demonstrates the result, based upon the provided secure documents (double click to expand). The response is decent, however, the fact that it is clearly using the provided augmentations to the prompt, that is the key objective of this article. The accuracy of the response can be questionable in some areas, the Bradner RFCs highlighted the importance of 64-byte 802.3/Ethernetv2 frame sizes in testing, as line rate forwarding with this minimum size produces the highest theoretically possible frame per second load. In the era of software driven forwarding in switches and routers this was very demanding. Sixty-four byte frames result in 14,881 fps (frames per second) for 10BaseT, 148,809 fps for 100BaseT, 1.48 million fps for Gigabit Ethernet. These values were frequently more aspirational in earlier times and also a frequent metric used in network equipment purchasing cycles. Suspiciously, the LLM response calls out 64kB in 802.3 testing, not 64B, something which seems to be an error. Again, with this architecture, the actual LLM providing the generative AI responses is increasingly viewed as a commodity, alternative LLMs can be plugged quickly and easily into the RAG approach of this Jupyter Notebook. The end user, and thus the enterprise itself, is empowered to utilize different LLMs, purchased or open-source from sites like Hugging Face, to determine optimal results. The other key change that can affect the overall accuracy of results is to experiment with different embedding models. In fact, there are on-line “leader” boards strictly for embedding LLMs so one can quickly swap in and out various popular embedding LLMs to see the impact on results. Summary and Conclusions on F5 and NetApp as Enablers for Secure RAG This article demonstrated an approach to AI usage that leveraged the compute and GPU availability that can be found today within cloud providers such as Azure. To safely access such an AI platform for a production-grade enterprise requirement, F5 Distributed Cloud (XC) provided HTTPS load balancers to connect worker browsers to a Jupyter Notebook service on the AI platform, this service applies advanced security upon the traffic within the XC, from WAF to anti-bot to L3/L7 DDOS protections. Utilizing secure Multicloud Networking (MCN), F5 provided a private L3 connectivity service between the inside interface on an Azure VNET-based CE (customer edge) node and the inside interface of an on-premises CE node in a building in Redmond, Washington. This secure network facilitated an NFS remote volume, content on spindles/flash in on-premises NetApp ONTAP to be remotely mounted on the Azure server. This secure file access provided peace of mind to exposing potentially critical and private materials from NetApp ONTAP volumes to the AI offering. RAG was configured and files were ingested, populating a vector database within the Azure server, that allowed details, ideas, and recommendations to be harnessed by a generative AI LLM by augmenting user prompts with text gleaned from the vector database. Simple examples were used to first demonstrate that RAG was working by posing queries that should not have been addressed by the loaded secure content; such a query was not suitably answered as expected. The feeding of meaningful content from ONTAP was then demonstrated to unleash the potential of AI to address queries based upon meaningful .pdf files. Opportunities to improve results by swapping in and out the main generative AI model, as well as the embedding model, were also considered.495Views2likes0CommentsDistributed Cloud Support for NAS Migrations from On-Premises Approaches to Azure NetApp Files
F5 Distributed Cloud (XC) Secure Multicloud Networking (MCN) connects and secures distributed applications across offices, data centers, and various cloud platforms. Frequently the technology is web-based, meaning traffic is often carried on ports like TCP port 443, however other traffic types are also prevalent in an enterprise’s traffic mix. Examples include SSH or relational database protocols. One major component of networked traffic is Network-Attached Storage (NAS), a protocol in the past frequently carried over LANs between employees in offices and co-located NAS appliances, perhaps in wiring closets or server rooms. An example of such an appliance would be the ONTAP family from NetApp which can take on physical or virtual form factors. NAS protocols are particularly useful as they integrate file stores into operating systems such as Microsoft Windows or Linux distributions as directories, mounted for easy access to files at any time, often permanently. This contrasts with SSH file transfers, which are often ephemeral actions and not so tightly integral to host operating system health. With the rise of remote work, often the NAS appliances see increasing file reads-and-writes to these directories, traversing wide-area links. In fact, one study analyzing fundamental traffic changes due to the Covid-19 pandemic saw a 22 percent increase in file transfer protocol (FTP) in a single year, suggesting access to files has undergone significant foundational changes in recent years. Distributed Cloud and the Movement towards Centralized Enterprise Storage A traditional concern about serving NAS files to offices from a centralized point, such as a cloud-instantiated file repository, is latency and reliability. With F5’s Distributed Cloud offering a 12 Tbps aggregate backbone and dedicated RE-to-RE links, the behavior of the network component is both highly durable and performant. The efficiencies of a centralized corporate file distribution point, with the required 9’s of guaranteed uptime of modern cloud services, and the logic of moving towards cloud-served NAS solutions makes a lot of sense. With on-premises storage appliances replaced by a secure, networked service eliminates the need to maintain costly spares, which are effectively a shadow NAS appliance infrastructure and onerous RMA procedures. All of this enables accomplishing the goal of shrinking/greening office wiring closets. To demonstrate this centralized model for a NAS architecture, a configuration was created whereby a west coast simulated office was connected by F5 Distributed Cloud to Azure NetApp Files (ANF) instantiated in Azure East-2 region. ANF is Microsoft Azure’s newest native file serving solution, managed by NetApp, with data throughputs that increase in lock step with the amount of reserved storage pool capacity. Different quality of service (QoS) levels are selectable by the consumer. In the streamlined ANF configuration workflow, where various transaction latency thresholds may be requested, even the most demanding relational database operations are typically accommodated. Microsoft offers additional details on ANF here, however, this article should serve to sufficiently demonstrate the ANF and F5 Distributed Cloud Secure MCN solutions for most readers. Distributed Cloud and Azure NetApp Files Deployment Example NAS in the enterprise today largely involves use of either NFS or SMB protocols, both of which can be used within Windows and Linux environments and make remote directories appear and perform as if local to users. In our example, a western US point of presence was leveraged to serve as the simulated remote office and standard Linux hosts to serve as the consumers of NetApp volumes. In the east, a corporate VNET was deployed in an Azure resource group (RG) in US-East-2, with one subnet delegated to provide Azure NetApp Files (ANF). To securely connect the west coast office to the eastern Azure ANF service, F5 Distributed Cloud Secure MCN was utilized to create a Layer 3 multi-cloud network offering. This is achieved by easily dropping an F5 customer edge (CE) virtual appliance into both the office and the Azure VNET in the east. The CE is a 2-port security appliance. The inside interfaces on both CEs were attached to a global virtual network, and exclusive layer-3 associations to allow simple connectivity and fully preserve privacy. In keeping with the promise of SaaS, Distributed Cloud users require no routing protocol setup. The solution takes care of the control plane, including routing and encryption. This concept could be scaled to hundreds of offices, if equipped with CEs, and easily attached to the same global virtual network. CEs, at boot-up, automatically attach via IP Sec (or SSL) tunnels to geographically close F5 backbone nodes, called regional edge (RE) sites. Like tunnel establishment, routing tables are updated under-the-hood to allow for a turn-key security relationship between Azure NetApp File volumes and consuming offices. The setup is depicted as follows: Setup Azure NetApp Files (ANF) Volumes in Minutes To put the centralized approach to offering NAS volumes for remote offices or locations into practice, a series of quick steps are undertaken, which can all be done through the standard Microsoft Azure portal. The four steps are listed below, with screenshots provided for key points in the brief process: If not starting from an existing Resource Group (RG), create a new RG and add an Azure VNET to it. Delegate one subnet in your VNET to support ANF. Under “Delegate Subnet to a Service” select from the pull-down-list the entry “Microsoft.NetApp/volumes”. Within the Resource Group, choose “Create” and make a NetApp account. This will appear in the Azure Marketplace listings as “Azure NetApp Files”. In your NetApp account, under “Storage service” create a capacity pool. The pool should be sized appropriately, larger is typically better, since numerous volumes, supporting your choice of NFS3/4 and SMB protocols, will be created from this single, large disk pool. Create your first volume, select size, NAS protocols to support, and QoS parameters that meet your business requirements. As seen below, when adding a capacity pool simply follow the numerical sequence to add your pool, with a newly created sample 2 TiB pool highlighted; 1,024 TiB (1 PiB) are possible (click image to enlarge). Interestingly, the capacity pool shown is the “Standard” service level, as opposed to “Premium” and “Ultra”. With QoS type of Auto selected, Azure NetApp Files provides increasing throughput in terms of megabytes per second as the number of TiB in the pool increases. The throughput also increases with service levels; for standard, as shown, 8 megabytes per second per TiB will be allocated. Beyond throughput, ANF also provides the lowest latency averages for reads and writes in the Azure portfolio of storage offerings. As such, ANF is a very good fit for database deployments that must see constrained, average latency for mission-critical transactions. Deeper discussion around ANF service levels may be explored through the Microsoft document here. The next screenshot shows the simple click-through sequence for adding a volume to the capacity pool, simply click on volumes and the “+Add volume” button. A resulting sample volume is displayed in the figure with key parameters highlighted. In the above volume (“f5-distributed-cloud-vol-001”) the NAS protocol selected was NFSv3 and the size of the volume (“Quota”) was set to 100GiB. Setup F5 Distributed Cloud Office-to-Azure Connectivity To access the volume in a secured and highly responsive manner, from corporate headquarters, remote offices or existing data centers, three items from F5 Distributed Cloud are required: A customer edge (CE) node, normally with 2-ports, must be deployed in the Azure RG VNET. This establishes the Azure instance as a “site” within the Distributed Cloud dashboard. Hub and spoke architectures may also be used if required, where VNET peering can also allow the secure multi-cloud network (MCN) solution to operate seamlessly. A CE is deployed at a remote office or datacenter, where file storage services are required by various lines of business. The CE is frequently deployed as a virtual appliance or installed on a bare metal server and typically has 2-ports. To instantiate a layer-3 MCN service, the inside ports of the two CEs are “joined” to a virtual global network created by the enterprise in the Distributed Cloud console, although REST API and Terraform are also deployment options. By having each inside port of the Azure and office CE’s joined to the same virtual network, the “inside” subnets can now communicate with each other, securely, with traffic normally exchanged over encrypted high-speed IPSec tunnels into the F5 XC global fabric. The following screenshot demonstrates adding the Azure CE inside interface to a global virtual network, allowing MCN connectivity to remote office clients requiring access to volumes. Further restrictions, to prevent unauthorized clients, are found within NAS protocols themselves, such as simple Export policies in NFS and ACL rules in SMB/CIFS, which can be configured quickly within ANF. Remote Office Access – Establish Read/Write File Access to Azure ANF over F5 Distributed Cloud With both ANF configured and F5 Distributed Cloud now providing a layer-3 muticloud network (MCN) solution, to patch enterprise offices to the centralized storage, some confirmation of the solution working as expected was desired. First off, a choice in protocols was made. When configuring ANF, the normal choices for access are NFSv3/v4 or SMB/CIFS or both protocols concurrently. Historically, Microsoft hosts made use of SMB/CIFS and Linux/Unix hosts preferred NFS, however today both protocols are used throughout enterprises. One example being long-time SAMBA server (SMB/CIFS) support in the world of Linux. Azure NetApp Files will provide all the necessary command samples to get hosts connected without difficulty. For instance, to mount the volume to a folder off the Linux user home directory, such as the sample folder “f5-distributed-cloud-vol-001”, per the ANF suggestion the following one command will connect the office Linux host to the central storage in Azure-East-2: sudo mount -t nfs -o rw,hard,rsize=262144,wsize=262144,vers=3,tcp 10.0.9.4:/f5-distributed-cloud-vol-001 f5-distributed-cloud-vol-001 At this point the volume is available for day-to-day tasks, including read and write operations, as if the NAS solution were local to the office, often literally down the hallway. Remote Office Access - Demonstration of Azure ANF over F5 Distributed Cloud in Action To repeatedly exercise file writes from a west coast US office to an east coast ANF deployment in Azure-East-2 (Richmond, Virgina) a simple shell script was used to perpetually write a file to a volume, delete it, and repeat over time. The following sample wrote a file of 20,000 bytes to the ANF service, waited a few seconds, and then removed the file before beginning another cycle. At the lowest common denominator, packet analysis for the ensuing traffic from the western US office will indicate both network and application latency sample values. As depicted in the following Wireshark trace, the TCP response to a transmitted segment carrying an NFS command, was observed to be just 74.5 milliseconds. This prompt round-trip latency for a cross-continent data plane suggests a performant Distributed Cloud MCN service level. This is easily seen as the offset from the reference timestamp (time equal to zero) of the NFS v3 Create Call. Click on image to expand. Analyzing the NAS response from ANF (packet 185) arrives less than 1 millisecond later, suggesting a very responsive, well-tuned NFS control plane offered by ANF. To measure the actual, write-time of a file from west coast to east coast, the following trace demonstrates the 20,000 byte file write exercise from the shell script. In this case, the TCP segments making up the file, specifically the large packet body lengths called out in the screenshot, are delivered efficiently without TCP retransmissions, TCP zero window events, nor having any indicators of layer 3 and 4 health concerns. The entirety of the write is measured at the packet layer to take only 150.8 milliseconds. Since packet-level analysis is not the most turnkey, easy method to monitor file read and write performance, a set of Linux and Windows utilities can also be leveraged. The Linux utility nfsiostat was concurrently used with the test file writes and produced similar, good latency measurements. Nfsiostat monitoring of the file write testing, from west coast to east coast, for the 20,000-byte file, has indicated an average write time to ANF of 151 milliseconds. The measurements presented here are simply observational, to present rapid, digestible techniques for readers interested in service assurance for running ANF over an XC L3 MCN offering. For more rigorous monitoring treatments, Microsoft provides guidance on performing one’s own measurements of Azure NetApp Files here. Summary As enterprise-class customers continue to rapidly look towards cloud for compute performance, GPU access, and economies-of-scale savings for key workloads, the benefits of a centralized, scalable storage counterpart to this story exists. F5 Distributed Cloud offers the reach and performance levels to securely tie existing offices and data centers to cloud-native storage solutions. One example of this approach to modernize storage was covered in this article, the turn-key ability to begin transitioning from traditionally on-premises NAS appliances to cloud-native scalable volumes. The Azure NetApp Files approach to serving read/write volumes allows modern hosts, including Windows and Linux distributions, to utilize virtually unlimited folder sizes with service levels adjustable to business needs.124Views0likes1CommentThe App Delivery Fabric with Secure Multicloud Networking
This tutorial with accompanying workflow guide deploys customer edge sites and uses Distributed Cloud Multicloud Networking App Connect to establish a Secure MCN App Delivery Fabric, enabling only Layer7 app connectivity between two cloud sites. Manual and automation workflows show how to make this NetOps and DevOps task come to life.186Views2likes0CommentsHow To Run Ollama On F5 AppStack With An NVIDIA GPU In AWS
If you're just getting started with AI, you'll want to watch this one, as Michael Coleman shows Aubrey King, from DevCentral, how to run Ollama on F5 AppStack on an AWS instance with an NVIDIA Tesla T4 GPU. You'll get to see the install, what it looks like when a WAF finds a suspicious conversation and even a quick peek at how Mistral handles a challenge differently than Gemma.141Views2likes0CommentsIntroducing Secure MCN features on F5 Distributed Cloud
Introduction F5 Distributed Cloud Services offers many secure multi-cloud networking features. In the video linked below, I demonstrate how to connect a Secure Mesh Customer Edge (CE) Site running on VMware and using common hardware. This on-prem CE is joined to a site mesh group of three other CE's, two of which are run on the public cloud providers AWS and Azure. Secure Mesh CE is a newly enhanced feature in Distributed Cloud that allows CE's not running in public cloud providers to run on hardware with unique and different configurations. Specifically, it's now possible to deploy site mesh transit networking to all CE's having one, two, or more NIC's, with each CE having its own unique physical configuration for networking. See my article on Secure Mesh Site Networking to learn how to set up and configure secure mesh sites. In addition to secure mesh networking, on-prem CE's can be deployed without app management features, giving organizations the flexibility to conserve deployed resources. Organizations can now choose whether to deploy AppStack CE's, where the CE's can manage and run K8s compute workloads deployed at the site, or use networking-focused CE's freeing up resources that would otherwise be used managing the apps. Whether deploying an AppStack or Secure Mesh CE, both types support Distributed Cloud's comprehensive set of security features, including DDoS, WAF, API protection, Bot, and Risk management. Secure MCN deployment capabilities include the following capabilities: Secure Multi-Cloud Network Fabric (secure connectivity) Discover any app running anywhere across your environments Cloud/On-Prem Customer Edge (CE) Private link connectivity orchestration with F5 XC as-a-service using any transport provider ➡️ Example: AWS PrivateLink, Azure CloudLink, Private transport (IP, MPLS, etc) L3 Network Connect & L7 App Connect capabilities L3/L4 DDoS + Enhanced intent-based firewall policies Security Service insertion w/ support for BIG-IP and Palo Alto Firewalls Application Security Services - WAF, API Protection, L7 DoS, Bot Defense, Client-side defense and more SaaS and Automation for Security, Network, & Edge Compute Powerful monitoring dashboards & troubleshooting tools for the entire secure multi-cloud network fabric Gain visibility into how and which API's are being consumed in workflows ➡️ Monitor and troubleshoot apps including their API's In the following video, I introduce the components that make up a Secure MCN deployment, and then walk through configuring the security features and show how to observe app performance and remediate security related incidents. 0-3:32 - Overview of Secure MCN features 3:32-9:20 - Product Demo Resources Distributed Cloud App Delivery Fabric Workflow Guide (GitHub) Secure MCN Article Series Secure MCN Intro: Introducing Secure MCN features on F5 Distributed Cloud Secure MCN Part 1: Using Distributed Application Security Policies in Secure Multicloud Networking Customer Edge Sites Secure MCN Part 2: The App Delivery Fabric with Secure Multicloud Networking Secure MCN Part 3: Coming Soon: The Secure Network Fabric with Multicloud Network Segmentation & Private Provider Network Connectivity Related Technical Articles 🔥 ➡️ Combining the key aspects of Secure MCN with GenAI apps: Protect multi-cloud and Edge Generative AI applications with F5 Distributed Cloud Secure Mesh Site Networking (DevCentral) A Complete Multi-Cloud Networking Walkthrough (DevCentral) Product Documentation How-To Create Secure Mesh Sites Product Information Distributed Cloud Network Connect Distributed Cloud App Connect1.6KViews1like0CommentsDeploying F5 Distributed Cloud (XC) Services in Cisco ACI - Layer Two Attached Deployment
Introduction F5 Distributed Cloud (XC) Services are SaaS-based security, networking, and application management services that can be deployed across multi-cloud, on-premises, and edge locations. This article will show you how you can deploy F5 Distributed Cloud’s Customer Edge (CE) site in Cisco Application Centric Infrastructure (ACI) so that you can securely connect your application and distribute the application workloads in a Hybrid Multi-Cloud environment. F5 XC Layer Two Attached CE in Cisco ACI Besides Layer Three Attached deployment option, which we discussed in another article, a F5 Distributed Cloud Customer Edge (CE) site can also be deployed with Layer Two Attached in Cisco ACI environment using an ACI Endpoint of an Endpoint Group (EPG). As a reminder, Layer Two Attached is one of the deployment models to get traffic to/from a F5 Distributed Cloud CE site, where the CE can be a single node or a three-nodes cluster. F5 Distributed Cloud supports Virtual Router Redundancy Protocol (VRRP) for virtual IP (VIP) advertisement. When VRRP is enabled for VIPs advertisement, there is a VRRP Master for each of the VIPs and the VRRP Master for each of the VIPs can possibly be distributed across the CE nodes within the cluster. In this article, we will look at how we can deploy a Layer Two Attached CE site in Cisco ACI. F5 XC VRRP Support for VIPs Advertisement F5 XC Secure Mesh Sites are specifically engineered for non-cloud CE deployments, which support additional configurations that are not available using Fleet or regular Site management functionalities such as VRRP for VIPs advertisement. We recommend Secure Mesh Sites for non-cloud CE deployment and specifically, in Layer Two Attached CE deployment model, we recommend deploying CE site as a Secure Mesh Site to take advantage of the VRRPs support for VIPs advertisement. With VRRP enabled for VIPs advertisement, one of the CE nodes within the cluster will become the VRRP Master for a VIP and starts sending gratuitous ARPs (GARPS) while the rest of the CE nodes will become the VRRP Backup. Please note that in CE software, VRRP virtual MAC is not used for the VIP. Instead, the CE node, which is the VRRP Master for the VIP uses its physical MAC address in ARP responses for the VIP. When a failover happens, a VRRP Backup CE will become the new VRRP Master for the VIP and starts sending GARPs to update the ARP table of the devices in the broadcast domain. As of today, there isn't a way to configure the VRRP priority and the VRRP Master assignment is at random. Thus, if there are multiple VIPs, it is possible that a CE node within the cluster can be the VRRP Master for one or more VIPs, or none. F5 XC Layer Two Attached CE in ACI Example In this section, we will use an example to show you how to successfully deploy a Layer Two Attached CE site in Cisco ACI fabric so that you can securely connect your application and distribute the application workloads in a Hybrid Multi-Cloud environment. Topology In our example, CE is a three nodes cluster (Master-0, Master-1 and Master-2) which connects to the ACI fabric using an endpoint of an EPG namedexternal-epg: Example reference - ACI EPG external-epg endpoints table: HTTP load balancersite2-secure-mesh-cluster-app has a Custom VIP of 172.18.188.201/32 epg-xc.f5-demo.com with workloads 10.131.111.66 and 10.131.111.77 in the cloud (Azure) and it advertises the VIP to the CE site: F5 XC Configuration of VRRP for VIPs Advertisement To enable VRRP for VIPs advertisement, go to "Multi-Cloud Network Connect" -> "Manage" -> "Site Management" -> "Secure Mesh Sites" -> "Manage Configuration" from the selected Secure Mesh Site: Next, go to "Network Configuration" and select "Custom Network Configuration" to get to "Advanced Configuration" and make sure "Enable VRRP for VIP(s)" is selected for VIP Advertisement Mode: Validation We can now securely connect to our application: Note from above, after F5 XC is deployed in Cisco ACI, we also use F5 XC DNS as our primary nameserver: To check the requests on the F5 XC Console, go to"Multi-Cloud App Connect" -> "Overview: Applications" to bring out our HTTP load balancer, then go to "Performance Monitoring" -> "Requests": *Note: Make sure you are in the right namespace. As a reminder, VRRP for VIPs advertisement is enabled in our example. From the request shown above, we can see that CE node Master-2 is currently the VRRP Master for VIP 172.18.188.201 and if we go to the APIC, we can see the VIP is learned in the ACI endpoint table for EPG external-epgtoo: Example reference - a sniffer capture of GARP from CE node Master-2 for VIP 172.18.188.201: Summary A F5 Distributed Cloud Customer Edge (CE) site can be deployed with Layer Two Attached deployment model in Cisco ACI environment using an ACI Endpoint of an Endpoint Group (EPG). Layer Two Attached deployment model can be more desirable and easier for CE deployment when compared to Layer Three Attached. It is because Layer Two Attached does not require layer three/routing which means one less layer to take care of and it also brings the applications closer to the edge. With F5 Distributed Cloud Customer Edge (CE) site deployment, you can securely connect your on-premises to the cloud quickly and efficiently. Next Check out this video for some examples of Layer Two Attached CE use cases in Cisco ACI: Related Resources *On-Demand Webinar* Deploying F5 Distributed Cloud Services in Cisco ACI Deploying F5 Distributed Cloud (XC) Services in Cisco ACI - Layer Three Attached Deployment Customer Edge Site - Deployment & Routing Options Cisco ACI Endpoint Learning White Paper353Views0likes0Comments