Normalizing Enterprise Data for Effective Search and RAG

Dave Cliffe

Head of RAG (Rendering AI Guidance) at Atolio

Introduction

Today we continue our series on the challenges of building RAG systems in the enterprise!

As discussed last time, a given company can draw on a very diverse set of sources. Microsoft SharePoint, Google Docs, Jira, Slack, and others are just a start. This means a lot of connector development work. It also brings a diverse set of data schemas and shapes. Coalescing all of this data into a uniform system is an important part of any enterprise search platform.

If you’re a developer who is new to the search world, it will be tempting to reach for Elasticsearch or OpenSearch and leverage their support for inserting schema-less JSON. This works great for getting off the ground quickly, and for basic levels of plain-text search. Over time, though, you begin to see why enterprise search platforms like Solr and Vespa rely on pre-defined, static schemas. As data scientists and relevance engineers are fond of saying, it’s all about the data!
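To illustrate the schema-less path: with Elasticsearch’s Python client you can index arbitrary JSON and let the engine infer field mappings on the fly. A quick sketch, assuming the 8.x client and a local cluster; the index name and document are made up:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# No mapping defined up front: the engine dynamically infers field types
# from the JSON. Convenient early on, but hard to control as sources grow.
es.index(index="enterprise-docs", document={
    "title": "Q3 planning notes",
    "author": "dana",
    "created": "2024-07-01",
    "body": "Draft agenda for the quarterly planning meeting...",
})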

In the following sections, we’ll cover some of the approaches that help ensure effective search across many sources while enabling high-quality relevance for all use cases.

Uniform Schema Across Sources

Spend time with your data. It’s a lesson dispensed and learned over and over again. When you spend enough time with enterprise sources and APIs, you begin to see interesting patterns appear. These patterns can be drivers for common schema in enterprise search systems.

As you bring new sources on board, you don’t want to treat each of them as unique and special. Instead, you can begin to group them into classes. Some of our preferred classes include documents, tickets, and message bundles. Examples of the document pattern include Microsoft Word, Google Docs, Atlassian Confluence, and loose PDFs. Another common class is the ticketing system, such as Jira, GitHub Issues, Linear, and more. Messaging systems include Slack, Microsoft Teams, and so on.

These aren’t the only classes you’ll find, but they are a good start and illustrate the goal. There’s no perfect search schema, but you can eliminate many downstream problems by grouping your sources by class and driving them toward a common schema. As the classes emerge, you’ll start to see common fields like title, author, date, people involved, and free text. We’ll get into the importance of common and consistent fields in the next sections.
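To make the idea concrete, here’s a minimal sketch of what such a unified schema could look like as a Python dataclass. The class and field names are illustrative assumptions, not our actual schema:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SearchDocument:
    # One normalized record, regardless of where it came from.
    doc_class: str                  # "document" | "ticket" | "message_bundle"
    source: str                     # e.g. "confluence", "jira", "slack"
    title: str
    body: str                       # the normalized free-text field
    author: str | None = None
    created_at: datetime | None = None
    people: list[str] = field(default_factory=list)  # assignees, participants, etc.
    extra: dict = field(default_factory=dict)        # source-specific fields, preserved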

Normalized Text for Relevance Foundations

As we noted in the last section, common fields such as title, subtitle, and body text will emerge from a common schema. It’s important to keep this set of core fields small and their definitions tight, both for ease of implementation and for efficiency. You’ll keep your search queries and business logic from getting unwieldy. Even more important, you’ll constrain the number of variables at play for your ranking algorithms.

Digging a little deeper, you’ll also want to start controlling text length. A two-page Word document versus a single Slack message can present significant hurdles for ranking and relevance. Fundamental lexical search algorithms like BM25 are affected by text length. Newer semantic embeddings also require plenty of investigation into techniques like truncation, concatenation, and chunking of inputs.
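To see why length matters, recall the standard BM25 term score, where |d| is the length of document d and avgdl is the average document length in the corpus:

score(t, d) = IDF(t) · tf(t, d) · (k1 + 1) / (tf(t, d) + k1 · (1 − b + b · |d| / avgdl))

With typical parameters (k1 ≈ 1.2, b ≈ 0.75), the |d| / avgdl factor penalizes documents that are long relative to the corpus average. Mix two-page documents and one-line messages in the same field, and that average stops meaning much.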

The gritty details are beyond the scope of this post, but in the ideal scenario you begin to see the source classes and the common schema converge in pursuit of relevance. Your documents, tickets, and threads usually have titles of similar lengths. The ticket notes may be extracted and combined into a single text field for the search engine, making them more comparable to documents. Messages may be conceptually bundled together so that each source presents similar-sized batches of searchable text.
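As one illustration of the bundling idea, here’s a hedged sketch that groups consecutive chat messages into batches of roughly comparable text length. The target size and function name are assumptions, not a prescription:

def bundle_messages(messages: list[dict], target_chars: int = 2000) -> list[str]:
    """Group consecutive messages into batches of roughly target_chars of
    searchable text, so message bundles are comparable in length to
    documents and ticket bodies."""
    bundles, current, size = [], [], 0
    for msg in messages:
        text = msg.get("text", "")
        if size + len(text) > target_chars and current:
            bundles.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        bundles.append("\n".join(current))
    return bundles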

There’s no one-size-fits-all approach to normalizing fields of text, but it’s a topic you can’t ignore.

Mapping Common Metadata Fields

While text is important, let’s not forget structured metadata either.

We’ve found that the introduction of each new source requires a careful review and cataloging of the incoming fields. Every source treats users, dates, statuses, labels, and other common fields a little differently. It’s real work to map all the incoming fields to valuable, useful fields in your schema. Then there’s always a batch of fields that are truly unique to a source, and it’s better not to leave them on the cutting-room floor.
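Here’s a minimal sketch of what that mapping work might look like for a Jira-style issue, normalized onto the SearchDocument sketch above. The field paths follow Jira’s REST API payloads, but treat the whole mapping as illustrative:

def normalize_jira_issue(issue: dict) -> SearchDocument:
    """Map a raw Jira issue payload onto the unified schema. Anything
    without a first-class home in the schema is preserved under `extra`."""
    f = issue.get("fields", {})
    people = [p["displayName"] for p in (f.get("assignee"), f.get("reporter")) if p]
    return SearchDocument(
        doc_class="ticket",
        source="jira",
        title=f.get("summary", ""),
        body=f.get("description") or "",
        author=(f.get("reporter") or {}).get("displayName"),
        created_at=None,  # parse f.get("created") with your date library of choice
        people=people,
        extra={
            "status": (f.get("status") or {}).get("name"),
            "labels": f.get("labels", []),
        },
    )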

It sounds important, but what’s the value? Fields with names and dates are important for faceted search and analytics. General metadata is great for filtering and sorting. Then, when these are assembled in a unified schema and engine, you can combine such structured queries with full-text search. This combination is what drives a rich set of use cases for employees and AI systems alike.
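As a sketch of what that combination enables, here’s an example query in the Elasticsearch/OpenSearch DSL, reusing the client from the earlier sketch and the hypothetical unified fields; the values are made up:

# Full-text relevance on the normalized body, combined with structured
# filters on class, participants, and date -- all in a single query.
resp = es.search(index="enterprise-docs", query={
    "bool": {
        "must": [{"match": {"body": "quarterly roadmap"}}],
        "filter": [
            {"term": {"doc_class": "ticket"}},
            {"term": {"people": "dana"}},
            {"range": {"created_at": {"gte": "2024-01-01"}}},
        ],
    }
})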

Closing

There are several ways to approach these challenges, but the key is that normalized schema and data across diverse sources are critical. Normalization unlocks relevance for both lexical and semantic search, as well as data-oriented queries and filters. All of this generally takes a lot of work, from data design to processing development to the tuning of matching and ranking.

Maybe you’d like to skip all that work but still address your enterprise search and AI-readiness needs? We’d love to discuss the challenges you’re facing and craft solutions that provide value in your unique environment. Reach out!

Dave Cliffe is the Head of RAG (Rendering AI Guidance) at Atolio. Atolio helps enterprises use Large Language Models (LLMs) to find answers privately and securely.
