
Semantic Search

The semantic matching capability makes it easier for members to find knowledge by providing results that are conceptually related to their search query, even if the exact keywords from the query are not present.

Comparison Between Semantic Search and Keyword Search

Keyword Search

  • Matches the exact words or phrases entered in a search query to the content in a database or document. It relies on literal text matching.
  • Operates by scanning for specific keywords or phrases. It does not account for synonyms, related concepts, or the intent behind the query.
  • High precision for exact matches, but may miss relevant content if the query wording does not match the document's text.
  • Simple and quick, but limited in scope.
  • Example: for the query "improve heart health," it might return articles containing that exact phrase but miss content titled "Tips for Cardiovascular Wellness" or "How to Strengthen Your Heart."

Semantic Search

  • Focuses on the meaning and context behind the query. It uses natural language processing (NLP) and machine learning to understand the intent and deliver conceptually relevant results.
  • Analyzes the context of the query, considers synonyms, related terms, and user intent, and retrieves results that align with the broader meaning.
  • More flexible and likely to capture relevant results, even when the wording differs from the query.
  • Requires advanced algorithms and computational resources to process meaning and context.
  • Example: for the same query, it could return results like "Best Practices for Cardiovascular Care," "The Role of Diet in Heart Health," and "Exercise Benefits for a Healthy Heart," recognizing the query's intent and context without requiring exact phrasing.
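The difference can be illustrated with a minimal similarity computation. The hand-crafted 3-dimensional vectors below stand in for real embeddings (which typically have hundreds or thousands of dimensions); the dimension meanings, document titles, and query vector are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-crafted 3-d vectors standing in for real embeddings; the dimensions
# loosely mean ("heart/cardio", "fitness", "finance").
documents = {
    "Tips for Cardiovascular Wellness": [0.9, 0.4, 0.0],
    "Exercise Benefits for a Healthy Heart": [0.8, 0.6, 0.1],
    "Quarterly Earnings Report": [0.0, 0.1, 0.9],
}
query = [0.9, 0.5, 0.0]  # pretend embedding of "improve heart health"

# Rank documents by similarity to the query vector.
ranked = sorted(documents, key=lambda t: cosine_similarity(query, documents[t]),
                reverse=True)
print(ranked[0])  # prints "Tips for Cardiovascular Wellness"
```

Note that neither top result contains the phrase "improve heart health"; both cardio-related documents still outrank the finance one because their vectors point in a similar direction to the query's.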

Lucene is a very popular library for keyword-based search, and products built on it, such as Apache Solr, OpenSearch, and Elasticsearch, are widely used in the industry. Knowledge graph databases are another common search technique.

Avalanchio semantic search gives you a simple pipeline to build the artifacts needed to make semantic search and knowledge graph queries easy.

Semantic Search Pipeline in Avalanchio

Figure: Diagram of the pipeline

  1. Schema Setup: Collect the structure of the dataset and its metadata, and build the necessary data models and tables in Avalanchio.

  2. Text Extraction: Extract textual content from the source, which may be databases, websites, PDF files, or images. A source may also generate data in real time, for example a ticketing system or social media posts.

  3. Data Cleaning: Remove duplicates, irrelevant text, and noisy elements (e.g., HTML tags, special characters). Standardize text with consistent casing, spelling corrections, and handling of abbreviations. Identify and translate non-English documents as needed. Store the cleaned data in an Avalanchio table.

  4. Generate Embeddings: Use pre-trained NLP models (e.g., GPT, or domain-specific models like ClinicalBERT for healthcare) to convert all text into high-dimensional vectors. You can use the OpenAI API service to generate embeddings; however, if data privacy regulations prevent you from using publicly available services, Avalanchio can help set up in-house machine learning operations inside your own data center. Store the embeddings in Avalanchio tables alongside the text content.

  5. You are now ready to use the embeddings for semantic search. For fast information retrieval, build a vector index. Avalanchio provides several clustering algorithms that are efficient for high-dimensional dense embedding vectors and require limited memory during training, even on large datasets. The aim of indexing is to reduce search time complexity from O(n) to O(log n).

  6. To increase the accuracy of semantic search, we provide additional tuned models to extract relevant metadata. (A)

  7. In addition to semantic search, entity extraction techniques can refine search results by way of knowledge graph traversal. (A)

(A) The exact methods depend on the nature of the data. Get in touch with the Avalanchio team for more details.
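Steps 2 through 5 above can be sketched end to end in plain Python. Everything here is a toy stand-in: the regex-based `clean` function, the hashed bag-of-words "embedder" (a real pipeline would use a pre-trained model), and the hand-seeded two-cluster index (Avalanchio's actual indexing uses its own clustering algorithms). Only the overall flow, clean → embed → index → search, mirrors the pipeline.

```python
import hashlib
import math
import re

def clean(text):
    """Step 3 (sketch): strip HTML tags, drop special characters, normalize case/whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def embed(text, dim=16):
    """Step 4 (toy stand-in): deterministic hashed bag-of-words, unit-normalized."""
    vec = [0.0] * dim
    for word in text.split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors from embed() are unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class ClusterIndex:
    """Step 5 (sketch): one-level cluster index; search scans only the nearest bucket."""
    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest(self, vec):
        return max(range(len(self.centroids)),
                   key=lambda i: cosine(vec, self.centroids[i]))

    def add(self, doc_id, vec):
        self.buckets[self._nearest(vec)].append((doc_id, vec))

    def search(self, vec, k=2):
        bucket = self.buckets[self._nearest(vec)]
        return sorted(bucket, key=lambda item: cosine(vec, item[1]), reverse=True)[:k]

docs = {
    "d1": "<p>Tips for CARDIOVASCULAR wellness!</p>",
    "d2": "How to strengthen your heart",
    "d3": "Quarterly earnings and stock prices",
}
vectors = {doc_id: embed(clean(text)) for doc_id, text in docs.items()}

# Toy centroids: one health-themed seed, one finance-themed seed. A real index
# would learn centroids with a clustering algorithm over the whole corpus.
index = ClusterIndex([vectors["d1"], vectors["d3"]])
for doc_id, vec in vectors.items():
    index.add(doc_id, vec)

hits = index.search(embed(clean("cardiovascular wellness tips")))
print([doc_id for doc_id, _ in hits])  # "d1" ranks first
```

Because the query is routed to a single bucket, only a fraction of the stored vectors are compared, which is the essence of how indexing cuts search cost below a full O(n) scan.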
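Step 7's knowledge-graph refinement can be illustrated with a minimal breadth-first traversal: entities extracted from the query act as seeds, and documents attached to nearby entities are pulled into the result set. The graph, entity names, and document titles below are all invented for illustration.

```python
from collections import deque

# Hypothetical knowledge graph: each entity points to related entities.
graph = {
    "heart": ["cardiovascular system", "exercise"],
    "cardiovascular system": ["blood pressure"],
    "exercise": [],
    "blood pressure": [],
}

# Hypothetical mapping from entities to documents that mention them.
docs_by_entity = {
    "heart": ["How to Strengthen Your Heart"],
    "cardiovascular system": ["Best Practices for Cardiovascular Care"],
    "exercise": ["Exercise Benefits for a Healthy Heart"],
    "blood pressure": ["Managing Blood Pressure Through Diet"],
}

def expand(seed_entities, max_hops=2):
    """Breadth-first traversal: collect documents within max_hops of the seeds."""
    seen = set(seed_entities)
    results = []
    queue = deque((entity, 0) for entity in seed_entities)
    while queue:
        entity, hops = queue.popleft()
        results.extend(docs_by_entity.get(entity, []))
        if hops < max_hops:
            for neighbor in graph.get(entity, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
    return results

related = expand(["heart"])
print(related)  # four related documents, nearest entities first
```

A query that only mentions "heart" thereby surfaces documents about blood pressure and exercise, which a pure vector search might rank lower or miss.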

Example:

The demonstration shows how to:

  • upload data using a CSV file
  • perform pre-processing
  • create embeddings
  • extract metadata
  • execute semantic search without an index
  • build an index
  • execute semantic search using the index