Why it’s not as simple as the hello-world guides make it look
When building generative AI systems, the flashy aspects, like using the latest GPT model, often get the focus. But the more "boring" underlying components have a greater impact on a system's overall results.
Your generative AI application, like a customer service chatbot, likely relies on some external data from a knowledge base of PDFs, web pages, images, or other sources.
The ingestion pipeline – the system that collects, processes, and loads this data from all your sources into a structured database, ready to be queried in real-time – significantly impacts the quality of your application's output.
This post examines the Enterprise Bot ingestion pipeline and how it helps us power conversations for leading global enterprises through our conversational AI platform, covering why an LLM alone isn't enough, the five stages of an ingestion pipeline and their challenges, and how Enterprise Bot handles each stage.
Modern LLMs like GPT-4o have considerable native general knowledge. They can tell you when the French Revolution occurred, write guides on brewing the perfect cup of tea, and tackle pretty much any general topic you can imagine.
But LLMs are still limited in terms of specific knowledge and recent information. LLMs only "know" about events that occurred before the model was trained, so they don't know about the latest news headlines or current stock prices, for example.
LLMs also don’t know about niche topics that weren’t included in their training data or weren’t given much emphasis. Need help with specific tax laws or details about your personalized health insurance policy? An LLM alone likely won't provide what you're looking for.
Modern conversational AI systems don’t rely only on the LLM they use. Instead, they draw on various sources to overcome the limitations of pre-trained models and accurately respond to user queries with current information.
The ingestion pipeline manages the entire journey of data: From its initial entry into the system to its final storage in a structured database, where it can be accessed by an AI assistant.
In this article, we’ll zoom into the "Continuous polling" arrow that feeds into the Enterprise Bot "DocBrain" component in the diagram below, and explain what happens here to allow us to serve millions of end users with relevant and up-to-date answers to their specific questions.
Let's break down each stage of this pipeline and explore how it keeps our AI system at the cutting edge of knowledge.
In theory, an ingestion pipeline doesn’t need to do much. It takes some inputs such as PDF documents, extracts information from them, turns that information into searchable vectors, and stores those vectors in a vector store.
In reality, there are a ton of nitty-gritty complications. This part of the overall RAG system is the most delicate to change and needs the most customization for excellent results.
For a higher-level overview of RAG systems, read our post on overcoming the limitations of LLMs with RAG.
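To make that flow concrete, here is a minimal sketch of those steps in Python. It assumes pypdf for text extraction and a local sentence-transformers model for embedding; both are illustrative choices rather than the tooling Enterprise Bot uses, and the crude fixed-size chunking is exactly the kind of shortcut the rest of this post argues against.

```python
# Toy ingestion pipeline: extract text from PDFs, chunk it, and embed each chunk.
# Library choices and the naive fixed-size chunking are for illustration only.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; real pipelines respect document structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(pdf_paths: list[str]):
    chunks = []
    for path in pdf_paths:
        # Scrape + extract: pull raw text out of each page of the PDF.
        text = "\n\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
        chunks.extend(chunk_text(text))
    # Index: one embedding vector per chunk, ready to load into a vector store.
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors
```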
Five stages of data handling make up an ingestion pipeline: scraping, extraction, parsing, chunking, and indexing.
Each of these steps has its own challenges and complications, and many platforms focus on solving the challenges of a single stage.
Scraping is a broad category of data collection techniques and isn't specific to generative AI applications. Some common challenges you’ll run into when accessing data on the web are:
To scrape non-web content, you’ll need to find or build custom tooling. At Enterprise Bot, we built a custom low-code integration tool called Blitzico that solves this problem by letting us access content from virtually any platform. For popular platforms like Confluence and Sharepoint, we have native connections, and for any others we can easily build Blitzico connectors using a graphical interface like the one shown below.
Other platforms and tools are available to help you with the content-scraping stage of the pipeline, depending on your technical capability and the level of customization you need. Some to look at include:
Once the initial scrape of your content is complete, you’ll need to figure out how to keep it up to date, either by periodically checking for new and changed content or replacing the entire scrape on a regular basis.
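One simple way to handle incremental updates, sketched below under the assumption that your sources are plain web pages, is to store a content hash per URL and only reprocess pages whose hash has changed. The state file and request handling here are illustrative, not how Blitzico does it.

```python
# Illustration only: track a content hash per URL and flag pages that are new or changed.
import hashlib
import json
import pathlib

import requests

HASH_FILE = pathlib.Path("scrape_hashes.json")  # hypothetical local state file

def changed_urls(urls: list[str]) -> list[str]:
    seen = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    to_reprocess = []
    for url in urls:
        body = requests.get(url, timeout=30).text
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if seen.get(url) != digest:  # new page, or the content has changed
            to_reprocess.append(url)
            seen[url] = digest
    HASH_FILE.write_text(json.dumps(seen, indent=2))
    return to_reprocess
```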
Once the data you need has been scraped into a single location, the next step is to extract the important parts of that data and discard the rest.
For content scraped from web pages, this usually means at least removing extra CSS and JavaScript code, but also identifying repeated uninteresting elements like headers, footers, sidebars, and adverts.
Extracting relevant data is usually a tradeoff. You’re unlikely to perfectly remove all the content you don’t want while keeping everything you do. So you’ll need to err on the side of caution and let some bad data through or choose a stricter approach and cut some potentially useful content out.
An important consideration here is how you'll keep the scraped and extracted data separate. This separation serves two key purposes:
By maintaining this separation, you avoid the need to re-run the entire scraping process for each extraction run, saving time and computational resources.
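As a rough illustration of both points, the sketch below strips scripts, styles, and repeated page furniture with Beautiful Soup and writes the raw HTML and the extracted text to separate directories, so extraction can be re-run without another scrape. The directory names and the list of tags to drop are assumptions you would tune per site.

```python
# Rough extraction pass: save the raw HTML untouched, then store the cleaned text separately.
import pathlib

from bs4 import BeautifulSoup

RAW_DIR = pathlib.Path("data/raw")              # untouched scraped HTML
EXTRACTED_DIR = pathlib.Path("data/extracted")  # cleaned text only

def extract_page(name: str, html: str) -> str:
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    EXTRACTED_DIR.mkdir(parents=True, exist_ok=True)
    (RAW_DIR / f"{name}.html").write_text(html, encoding="utf-8")

    soup = BeautifulSoup(html, "html.parser")
    # Drop code and repeated page elements; tune this list for each site you scrape.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)

    (EXTRACTED_DIR / f"{name}.txt").write_text(text, encoding="utf-8")
    return text
```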
Tools that can help you with extracting data include:
The distinction between parsing and extraction isn't always clear, but once you’ve extracted text from your data, there’ll still be more to do.
The parsing stage focuses on identifying and structuring various elements within the text, such as tables, bullet point lists, links to other documents, and sub-structures like chapters, pages, paragraphs, or sentences.
Adding metadata to your structured data at this stage of the pipeline is crucial for:
Properly parsed data allows generated responses to include specific references, for example, "See table 3 on page 12 of our pricing-2024.pdf document for more information on current pricing for this specific plan."
This level of detail not only enhances the accuracy of the information provided but also increases the transparency and credibility of AI-generated responses.
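In practice, the output of parsing might be represented as structured records along these lines; the field names are illustrative rather than a fixed schema, and the example values echo the pricing reference above.

```python
# Illustrative structure for a parsed element, carrying the metadata needed for
# precise references in generated answers. Field names and values are examples only.
from dataclasses import dataclass, field

@dataclass
class ParsedElement:
    text: str                       # the element's content, e.g. a rendered table
    element_type: str               # "paragraph", "table", "list", ...
    source_document: str            # e.g. "pricing-2024.pdf"
    page: int | None = None         # page number, if the source is paginated
    section: str | None = None      # chapter or heading the element belongs to
    metadata: dict = field(default_factory=dict)

pricing_table = ParsedElement(
    text="Plan | Monthly price | Vision\n...",  # placeholder for the parsed table text
    element_type="table",
    source_document="pricing-2024.pdf",
    page=12,
    section="Current pricing",
    metadata={"table_number": 3},
)
```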
At the parsing stage of the ingestion pipeline, you may want to further augment your data using named entity recognition to identify specific people, places, companies, dates, or other entities and map them to a single concept even if there are variations in how they are written.
Tools like SpaCy, Tabula, and Beautiful Soup can help you with the parsing stage of the pipeline.
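As a small example of what the named entity recognition step adds, the snippet below uses spaCy's small English model and a hand-rolled alias map to normalize variations of the same entity; the model choice and the alias table are illustrative.

```python
# Tag entities with spaCy and map surface variations onto one canonical concept.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

ALIASES = {"u.s.": "United States", "usa": "United States", "united states": "United States"}

def entities(text: str) -> list[tuple[str, str]]:
    results = []
    for ent in nlp(text).ents:
        canonical = ALIASES.get(ent.text.lower(), ent.text)
        results.append((canonical, ent.label_))  # e.g. ("United States", "GPE")
    return results

print(entities("Our USA policies changed on 1 January 2024."))
```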
Chunking is an often-underestimated part of the ingestion pipeline that is critical to the quality of a conversational AI system's output. When relevant sections of your knowledge base are retrieved to include as extra context for the LLM, chunk size is crucial: Chunks need to be small enough to allow for several relevant chunks to be included in the context but large enough to make sense independently. Striking this balance ensures that the LLM has sufficient, coherent information to create a useful response to a user's question.
When deciding on a chunking strategy, you’ll want to consider:
Here's an image to illustrate the difference between effective and ineffective chunking strategies.
The right side of the image demonstrates poor chunking, because actions are separated from their "Do" or "Don't" context. This approach to chunking could result in a generative AI assistant misinterpreting instructions and giving a user dangerous advice, like, "You should take an aspirin" in response to a query about what to do when someone's having a stroke.
By contrast, the left side of the image shows good chunking: It maintains the context of the "Do" and "Don't" lists, ensuring each action is clearly associated with the correct category.
Similarly, information contained in tables is easily chunked incorrectly. Data in the rows and columns of a table do not make sense when taken out of the context of the header row or first column. Consider the pricing table below.
In a table like this one, we would need to treat the whole table as a single chunk. Otherwise, if a user asks, "Is vision included in Plan B?", the response might be, "Yes, with a supplement," because the assistant has to guess what the text "included with supplement" refers to. And a user who asks, "How much is Plan C?" might get the answer "$400" because the assistant lacks the context about which plan that price corresponds to.
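A structure-aware chunker can avoid both failure modes. The sketch below keeps the whole table as one chunk and also emits a self-contained chunk per row, pairing every cell with its column header so that text like "included with supplement" never loses the plan it refers to. The input format and the example values are assumptions about what the parsing stage produced.

```python
# Table-aware chunking: keep the table whole, and optionally add per-row chunks
# that carry their header context with them.
def chunk_table(headers: list[str], rows: list[list[str]], caption: str) -> list[str]:
    # Option 1: the entire table as a single chunk.
    lines = [" | ".join(headers)] + [" | ".join(row) for row in rows]
    whole_table = caption + "\n" + "\n".join(lines)

    # Option 2: one self-contained chunk per row, each cell paired with its column header.
    row_chunks = []
    for row in rows:
        pairs = [f"{header}: {cell}" for header, cell in zip(headers, row)]
        row_chunks.append(caption + " - " + "; ".join(pairs))

    return [whole_table] + row_chunks

chunks = chunk_table(
    headers=["Plan", "Monthly price", "Vision"],
    rows=[["Plan B", "$300", "Included with supplement"]],  # illustrative values only
    caption="Pricing table",
)
```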
Once you’ve scraped and pre-processed all of your data, it’s time to index it. Indexing data involves turning the chunks into vectors, or large arrays of numbers that the system uses to find the most relevant chunks for a given user query.
To index your data, you’ll need to choose an embedding model, generate a vector for each chunk, and load those vectors into a vector database that supports similarity search.
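A minimal version of that, using an open-source embedding model and a brute-force cosine search over normalized vectors, might look like the sketch below; a production system would use a proper vector database, and the sample chunks are made up.

```python
# Minimal indexing and retrieval sketch: embed chunks, then rank them against a
# query by cosine similarity. Model choice and sample chunks are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> np.ndarray:
    # Normalized vectors: the dot product below is then the cosine similarity.
    return model.encode(chunks, normalize_embeddings=True)

def search(query: str, chunks: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

chunks = [
    "Vision is included in Plan B with a supplement.",
    "Claims can be submitted through the customer portal.",
]
index = build_index(chunks)
print(search("Is vision included in Plan B?", chunks, index))
```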
Now you're familiar with the pieces of the ingestion pipeline, but putting them all together poses a few more challenges.
In many cases, we’re dealing with sensitive data and personally identifiable information (PII) at every stage of the pipeline. You’ll want to ensure you have the tools to monitor and audit access to this data.
The myriad cloud services available to help you build each stage of the ingestion pipeline come at the cost of privacy: You'll need to send all your data to their cloud.
At Enterprise Bot, we can run these pipelines completely on-premise and provide tooling to ensure that your data is never accessed inappropriately.
We choose the optimal settings for each stage for you, but our platform also gives you the option to customize the ingestion pipeline if required.
You also get an overview of all your knowledge bases, across all your assistants.
Let’s see how Enterprise Bot handles the different stages of the ingestion pipeline.
You can easily add new data sources through the Enterprise Bot UI, which accepts everything from a single web page to an entire website, as well as content from platforms like Confluence, Topdesk, and Sharepoint.
Once you’ve added your data sources, we’ll run the scrape for you and let you know once it’s complete.
You can configure most aspects of the extraction step, including specifying how to handle headers, images, and links. You can also add specific HTML tags to ignore.
In our DocBrain platform, you can inspect each chunk individually to see what data it contains, what metadata was added, and a link back to the source in case you need to double-check that it was correctly chunked.
All indexing and vectorization processes take place on the Enterprise Bot platform, without relying on third-party tools from OpenAI or Anthropic. This means that even when using a third-party LLM like GPT-4o, your full knowledge base is never shared with third-party providers.
Only the chunk identified as relevant to a specific user conversation gets shared, and only after it goes through our PII anonymization filters to ensure your private data remains private.
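To make the idea of a PII filter concrete, here is a deliberately simple illustration that redacts a couple of common patterns with regular expressions. It is not Enterprise Bot's actual filter; it only shows the general idea of anonymizing a chunk before it leaves your environment.

```python
# Toy redaction pass for two common PII patterns; illustration only.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(anonymize("Contact Jane at jane.doe@example.com or +41 44 123 45 67."))
```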
Your company needs conversational generative AI, but should you buy an off-the-shelf solution or build it yourself?
Our Enterprise Generative AI: Build vs. Buy post examines the pros and cons of each option in detail. But in our experience, it's specifically the challenges of building the ingestion pipeline of a GenAI system that many businesses underestimate.
Whether you choose to build or buy your solution comes down to your timelines, budget, and customization requirements, but don’t assume that it will be cheaper to build yourself. We’ve spent years iterating on the Enterprise Bot ingestion pipeline, and each improvement benefits all our happy customers, delivering efficiencies that are often unattainable when building a custom pipeline for a single platform.
If you want to delight your customers with high-quality conversational automation without having to worry about any of the challenges of building your own, book your demo to find out how we can help you achieve your goals.
We’ve helped some of the world’s biggest brands reinvent customer support with our chatbot, live chat, voice bot, and email bot solutions.