Why it’s not as simple as the hello-world guides make it look
When building generative AI systems, the flashy aspects, like using the latest GPT model, often get the focus. But the more "boring" underlying components have a greater impact on a system's overall results.
Your generative AI application, like a customer service chatbot, likely relies on some external data from a knowledge base of PDFs, web pages, images, or other sources.
The ingestion pipeline – the system that collects, processes, and loads this data from all your sources into a structured database, ready to be queried in real-time – significantly impacts the quality of your application's output.
This post examines the Enterprise Bot ingestion pipeline and how it helps us power conversations for leading global enterprises through our conversational AI platform, covering why an LLM alone isn't enough, the five stages of an ingestion pipeline and their challenges, and how Enterprise Bot handles each stage.
Modern LLMs like GPT-4o have considerable native general knowledge. They can tell you when the French Revolution occurred, write guides on brewing the perfect cup of tea, and tackle pretty much any general topic you can imagine.
But LLMs are still limited in terms of specific knowledge and recent information. LLMs only "know" about events that occurred before the model was trained, so they don't know about the latest news headlines or current stock prices, for example.
LLMs also don’t know about niche topics that weren’t included in their training data or weren’t given much emphasis. Need help with specific tax laws or details about your personalized health insurance policy? An LLM alone likely won't provide what you're looking for.
Modern conversational AI systems don’t rely only on the LLM they use. Instead, they draw on various sources to overcome the limitations of pre-trained models and accurately respond to user queries with current information.
The ingestion pipeline manages the entire journey of data: From its initial entry into the system to its final storage in a structured database, where it can be accessed by an AI assistant.
In this article, we’ll zoom into the "Continuous polling" arrow that feeds into the Enterprise Bot "DocBrain" component in the diagram below, and explain what happens here to allow us to serve millions of end users with relevant and up-to-date answers to their specific questions.
Let's break down each stage of this pipeline and explore how it keeps our AI system at the cutting edge of knowledge.
In theory, an ingestion pipeline doesn’t need to do much. It takes some inputs such as PDF documents, extracts information from them, turns that information into searchable vectors, and stores those vectors in a vector store.
In reality, there are a ton of nitty-gritty complications. This part of the overall RAG system is the most delicate to change and needs the most customization for excellent results.
For a higher-level overview of RAG systems, read our post on overcoming the limitations of LLMs with RAG.
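To make that flow concrete, here is a minimal sketch of those steps in Python. It assumes pypdf for text extraction and a local sentence-transformers model for embedding; both are illustrative choices rather than the tooling Enterprise Bot uses, and the crude fixed-size chunking is exactly the kind of shortcut the rest of this post argues against.

```python
# Toy ingestion pipeline: extract text from PDFs, chunk it, and embed each chunk.
# Library choices and the naive fixed-size chunking are for illustration only.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; real pipelines respect document structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(pdf_paths: list[str]):
    chunks = []
    for path in pdf_paths:
        # Scrape + extract: pull raw text out of each page of the PDF.
        text = "\n\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
        chunks.extend(chunk_text(text))
    # Index: one embedding vector per chunk, ready to load into a vector store.
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors
```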
Five stages of data handling make up an ingestion pipeline: scraping, extraction, parsing, chunking, and indexing.
Each of these steps has its own challenges and complications, and many platforms focus on solving the challenges of a single stage.
Scraping is a broad category of data collection techniques and isn't specific to generative AI applications. Some common challenges you’ll run into when accessing data on the web are:
To scrape non-web content, you’ll need to find or build custom tooling. At Enterprise Bot, we built a custom low-code integration tool called Blitzico that solves this problem by letting us access content from virtually any platform. For popular platforms like Confluence and Sharepoint, we have native connections, and for any others we can easily build Blitzico connectors using a graphical interface like the one shown below.
Other platforms and tools are available to help you with the content-scraping stage of the pipeline, depending on your technical capability and the level of customization you need. Some to look at include:
Once the initial scrape of your content is complete, you’ll need to figure out how to keep it up to date, either by periodically checking for new and changed content or replacing the entire scrape on a regular basis.
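One simple way to handle incremental updates, sketched below under the assumption that your sources are plain web pages, is to store a content hash per URL and only reprocess pages whose hash has changed. The state file and request handling here are illustrative, not how Blitzico does it.

```python
# Illustration only: track a content hash per URL and flag pages that are new or changed.
import hashlib
import json
import pathlib

import requests

HASH_FILE = pathlib.Path("scrape_hashes.json")  # hypothetical local state file

def changed_urls(urls: list[str]) -> list[str]:
    seen = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    to_reprocess = []
    for url in urls:
        body = requests.get(url, timeout=30).text
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if seen.get(url) != digest:  # new page, or the content has changed
            to_reprocess.append(url)
            seen[url] = digest
    HASH_FILE.write_text(json.dumps(seen, indent=2))
    return to_reprocess
```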
Once the data you need has been scraped into a single location, the next step is to extract the important parts of that data and discard the rest.
For content scraped from web pages, this usually means at least removing extra CSS and JavaScript code, but also identifying repeated uninteresting elements like headers, footers, sidebars, and adverts.
Extracting relevant data is usually a tradeoff. You’re unlikely to perfectly remove all the content you don’t want while keeping everything you do. So you’ll need to err on the side of caution and let some bad data through or choose a stricter approach and cut some potentially useful content out.
An important consideration here is how you'll keep the scraped and extracted data separate. This separation serves two key purposes:
By maintaining this separation, you avoid the need to re-run the entire scraping process for each extraction run, saving time and computational resources.
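As a rough illustration of both points, the sketch below strips scripts, styles, and repeated page furniture with Beautiful Soup and writes the raw HTML and the extracted text to separate directories, so extraction can be re-run without another scrape. The directory names and the list of tags to drop are assumptions you would tune per site.

```python
# Rough extraction pass: save the raw HTML untouched, then store the cleaned text separately.
import pathlib

from bs4 import BeautifulSoup

RAW_DIR = pathlib.Path("data/raw")              # untouched scraped HTML
EXTRACTED_DIR = pathlib.Path("data/extracted")  # cleaned text only

def extract_page(name: str, html: str) -> str:
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    EXTRACTED_DIR.mkdir(parents=True, exist_ok=True)
    (RAW_DIR / f"{name}.html").write_text(html, encoding="utf-8")

    soup = BeautifulSoup(html, "html.parser")
    # Drop code and repeated page elements; tune this list for each site you scrape.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)

    (EXTRACTED_DIR / f"{name}.txt").write_text(text, encoding="utf-8")
    return text
```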
Tools that can help you with extracting data include:
The distinction between parsing and extraction isn't always clear, but once you’ve extracted text from your data, there’ll still be more to do.
The parsing stage focuses on identifying and structuring various elements within the text, such as tables, bullet point lists, links to other documents, and sub-structures like chapters, pages, paragraphs, or sentences.
Adding metadata to your structured data at this stage of the pipeline is crucial for:
Properly parsed data allows generated responses to include specific references, for example, "See table 3 on page 12 of our pricing-2024.pdf document for more information on current pricing for this specific plan."
This level of detail not only enhances the accuracy of the information provided but also increases the transparency and credibility of AI-generated responses.
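In practice, the output of parsing might be represented as structured records along these lines; the field names are illustrative rather than a fixed schema, and the example values echo the pricing reference above.

```python
# Illustrative structure for a parsed element, carrying the metadata needed for
# precise references in generated answers. Field names and values are examples only.
from dataclasses import dataclass, field

@dataclass
class ParsedElement:
    text: str                       # the element's content, e.g. a rendered table
    element_type: str               # "paragraph", "table", "list", ...
    source_document: str            # e.g. "pricing-2024.pdf"
    page: int | None = None         # page number, if the source is paginated
    section: str | None = None      # chapter or heading the element belongs to
    metadata: dict = field(default_factory=dict)

pricing_table = ParsedElement(
    text="Plan | Monthly price | Vision\n...",  # placeholder for the parsed table text
    element_type="table",
    source_document="pricing-2024.pdf",
    page=12,
    section="Current pricing",
    metadata={"table_number": 3},
)
```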
At the parsing stage of the ingestion pipeline, you may want to further augment your data using named entity recognition to identify specific people, places, companies, dates, or other entities and map them to a single concept even if there are variations in how they are written.
Tools like SpaCy, Tabula, and Beautiful Soup can help you with the parsing stage of the pipeline.
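As a small example of what the named entity recognition step adds, the snippet below uses spaCy's small English model and a hand-rolled alias map to normalize variations of the same entity; the model choice and the alias table are illustrative.

```python
# Tag entities with spaCy and map surface variations onto one canonical concept.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

ALIASES = {"u.s.": "United States", "usa": "United States", "united states": "United States"}

def entities(text: str) -> list[tuple[str, str]]:
    results = []
    for ent in nlp(text).ents:
        canonical = ALIASES.get(ent.text.lower(), ent.text)
        results.append((canonical, ent.label_))  # e.g. ("United States", "GPE")
    return results

print(entities("Our USA policies changed on 1 January 2024."))
```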
Chunking is an often-underestimated part of the ingestion pipeline that is critical to the quality of a conversational AI system's output. When relevant sections of your knowledge base are retrieved to include as extra context for the LLM, chunk size is crucial: Chunks need to be small enough to allow for several relevant chunks to be included in the context but large enough to make sense independently. Striking this balance ensures that the LLM has sufficient, coherent information to create a useful response to a user's question.
When deciding on a chunking strategy, you’ll want to consider:
Here's an image to illustrate the difference between effective and ineffective chunking strategies.
The right side of the image demonstrates poor chunking, because actions are separated from their "Do" or "Don't" context. This approach to chunking could result in a generative AI assistant misinterpreting instructions and giving a user dangerous advice, like, "You should take an aspirin" in response to a query about what to do when someone's having a stroke.
By contrast, the left side of the image shows good chunking: It maintains the context of the "Do" and "Don't" lists, ensuring each action is clearly associated with the correct category.
Similarly, information contained in tables is easily chunked incorrectly. Data in the rows and columns of a table do not make sense when taken out of the context of the header row or first column. Consider the pricing table below.
In a table like this one, we would need to treat the whole table as a single chunk. Otherwise, if a user asks, "Is vision included in Plan B?", the response might be, "Yes, with a supplement," because the assistant has to guess what the text "included with supplement" refers to. And a user who asks, "How much is Plan C?" might get the answer "$400" because the assistant lacks the context about which plan that price corresponds to.
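A structure-aware chunker can avoid both failure modes. The sketch below keeps the whole table as one chunk and also emits a self-contained chunk per row, pairing every cell with its column header so that text like "included with supplement" never loses the plan it refers to. The input format and the example values are assumptions about what the parsing stage produced.

```python
# Table-aware chunking: keep the table whole, and optionally add per-row chunks
# that carry their header context with them.
def chunk_table(headers: list[str], rows: list[list[str]], caption: str) -> list[str]:
    # Option 1: the entire table as a single chunk.
    lines = [" | ".join(headers)] + [" | ".join(row) for row in rows]
    whole_table = caption + "\n" + "\n".join(lines)

    # Option 2: one self-contained chunk per row, each cell paired with its column header.
    row_chunks = []
    for row in rows:
        pairs = [f"{header}: {cell}" for header, cell in zip(headers, row)]
        row_chunks.append(caption + " - " + "; ".join(pairs))

    return [whole_table] + row_chunks

chunks = chunk_table(
    headers=["Plan", "Monthly price", "Vision"],
    rows=[["Plan B", "$300", "Included with supplement"]],  # illustrative values only
    caption="Pricing table",
)
```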
Once you’ve scraped and pre-processed all of your data, it’s time to index it. Indexing data involves turning the chunks into vectors, or large arrays of numbers that the system uses to find the most relevant chunks for a given user query.
To index your data, you’ll need to choose an embedding model, generate a vector for each chunk, and load those vectors into a vector database that supports similarity search.
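A minimal version of that, using an open-source embedding model and a brute-force cosine search over normalized vectors, might look like the sketch below; a production system would use a proper vector database, and the sample chunks are made up.

```python
# Minimal indexing and retrieval sketch: embed chunks, then rank them against a
# query by cosine similarity. Model choice and sample chunks are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> np.ndarray:
    # Normalized vectors: the dot product below is then the cosine similarity.
    return model.encode(chunks, normalize_embeddings=True)

def search(query: str, chunks: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

chunks = [
    "Vision is included in Plan B with a supplement.",
    "Claims can be submitted through the customer portal.",
]
index = build_index(chunks)
print(search("Is vision included in Plan B?", chunks, index))
```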
Now you're familiar with the pieces of the ingestion pipeline, but putting them all together poses a few more challenges.
In many cases, we’re dealing with sensitive data and personally identifiable information (PII) at every stage of the pipeline. You’ll want to ensure you have the tools to monitor and audit access to this data.
The myriad cloud services available to help you build each stage of the ingestion pipeline come at the cost of privacy: You'll need to send all your data to their cloud.
At Enterprise Bot, we can run these pipelines completely on-premise and provide tooling to ensure that your data is never accessed inappropriately.
We choose the optimal settings for each stage for you, but our platform also gives you the option to customize the ingestion pipeline if required.
You also get an overview of all your knowledge bases, across all your assistants.
Let’s see how Enterprise Bot handles the different stages of the ingestion pipeline.
You can easily add new data sources through the Enterprise Bot UI, which accepts everything from a single web page to an entire website, as well as content from platforms like Confluence, Topdesk, and Sharepoint.
Once you’ve added your data sources, we’ll run the scrape for you and let you know once it’s complete.
You can configure most aspects of the extraction step, including specifying how to handle headers, images, and links. You can also add specific HTML tags to ignore.
In our DocBrain platform, you can inspect each chunk individually to see what data it contains, what metadata was added, and a link back to the source in case you need to double-check that it was correctly chunked.
All indexing and vectorization processes take place on the Enterprise Bot platform, without relying on third-party tools from OpenAI or Anthropic. This means that even when using a third-party LLM like GPT-4o, your full knowledge base is never shared with third-party providers.
Only the chunk identified as relevant to a specific user conversation gets shared, and only after it goes through our PII anonymization filters to ensure your private data remains private.
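To make the idea of a PII filter concrete, here is a deliberately simple illustration that redacts a couple of common patterns with regular expressions. It is not Enterprise Bot's actual filter; it only shows the general idea of anonymizing a chunk before it leaves your environment.

```python
# Toy redaction pass for two common PII patterns; illustration only.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(anonymize("Contact Jane at jane.doe@example.com or +41 44 123 45 67."))
```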
Your company needs conversational generative AI, but should you buy an off-the-shelf solution or build it yourself?
Our Enterprise Generative AI: Build vs. Buy post examines the pros and cons of each option in detail. But in our experience, it's specifically the challenges of building the ingestion pipeline of a GenAI system that many businesses underestimate.
Whether you choose to build or buy your solution comes down to your timelines, budget, and customization requirements, but don’t assume that it will be cheaper to build yourself. We’ve spent years iterating on the Enterprise Bot ingestion pipeline, and each improvement benefits all our happy customers, delivering efficiencies that are often unattainable when building a custom pipeline for a single platform.
If you want to delight your customers with high-quality conversational automation without having to worry about any of the challenges of building your own, book your demo to find out how we can help you achieve your goals.
We’ve helped some of the world’s biggest brands reinvent customer support with our chatbot, live chat, voice bot, and email bot solutions.