A Guide to Open-Source Embedding Models

Updated on May 19, 2025 • Written By Sherlock Xu

If you are building an AI-powered system for semantic search, recommendation engines, or information retrieval, you’re likely familiar with embedding models. These models transform text, images, and other data types into vectors that capture semantic meaning, which lets systems understand and retrieve relevant content based on similarity in meaning.

In this blog post, we’ll explore some of the top open-source embedding models and answer common questions about them.

NV-Embed-v2

NV-Embed-v2 is the latest generalist embedding model developed by NVIDIA. It delivers state-of-the-art performance across a wide variety of tasks and once ranked No. 1 on the MTEB leaderboard, with an impressive score of 72.31 across 56 tasks spanning retrieval, classification, clustering, STS, and more. It’s worth mentioning that the previous version, NV-Embed-v1, also earned the top spot on the same leaderboard.

Why choose it:

  • Novel design: The model uses latent-attention pooling, enabling the LLM to attend to specific latent vectors for better embedding quality. It also uses a two-stage instruction tuning approach that makes it versatile across both retrieval and non-retrieval tasks.
  • Retrieval excellence: NV-Embed-v2 holds the No. 1 rank in the retrieval sub-category, scoring 62.65 across 15 retrieval tasks, which are essential for applications like RAG.
  • Negative mining: NV-Embed-v2 introduces hard-negative mining techniques to improve the accuracy of embedding tasks by eliminating false negatives more effectively.

Points to be cautious about:

  • Non-commercial license: NV-Embed-v2 is released under the CC-BY-NC-4.0 license, which prohibits commercial usage.
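
To see what using NV-Embed-v2 looks like in practice, here is a minimal sketch of my own (not from NVIDIA’s docs). It assumes the sentence-transformers package; the instruction prefix is illustrative, so check the model card for the exact template and any extra tokenizer settings your setup needs.

```python
from sentence_transformers import SentenceTransformer

# NV-Embed-v2 ships custom modeling code, so trust_remote_code is required
model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)

# Retrieval-style usage: queries carry an instruction prefix, passages do not
query_prefix = "Instruct: Given a question, retrieve passages that answer the question\nQuery: "
queries = ["How do embedding models capture semantic meaning?"]
passages = ["Embedding models map text into dense vectors that encode meaning."]

query_embeddings = model.encode(queries, prompt=query_prefix, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)

# With normalized vectors, a dot product gives cosine similarity
print(query_embeddings @ passage_embeddings.T)
```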

BGE-M3

BGE (BAAI General Embedding) models are a family of text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI). One of the most popular versions in the series is BGE-M3. It stands out due to its versatility in multi-functionality, multi-linguality, and multi-granularity capabilities, also known as M3.

Why choose it:

  • Multi-functionality: BGE-M3 can simultaneously perform the three common retrieval functionalities of embedding models: dense, multi-vector, and sparse retrieval (see the sketch after this list).
  • Multi-linguality: BGE-M3 supports more than 100 working languages. It learns a common semantic space for different languages. This enables both multilingual retrieval within each language and crosslingual retrieval between different languages.
  • Multi-granularity: BGE-M3 is able to process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.
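
Here is a minimal sketch of getting all three output types in one call. It assumes the FlagEmbedding package from BAAI; the sample sentences are placeholders.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 speeds up GPU inference

sentences = [
    "BGE-M3 supports dense, sparse, and multi-vector retrieval.",
    "Embedding models map text into vectors.",
]

# One forward pass can return all three representations
output = model.encode(
    sentences,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)

print(output["dense_vecs"].shape)    # dense embeddings, one vector per sentence
print(output["lexical_weights"][0])  # token-to-weight map used for sparse retrieval
print(len(output["colbert_vecs"]))   # per-token multi-vector (ColBERT-style) output
```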

Points to be cautious about:

  • Generalizability needs further testing: While BGE-M3 performs well on benchmark datasets, the researchers note that more testing is needed to confirm its effectiveness on real-world datasets.
  • Computational demand for long documents: Although BGE-M3 handles inputs up to 8192 tokens, processing very lengthy documents may pose challenges in terms of computational resources and efficiency.
  • Performance across languages: Although the model supports many languages, the researchers acknowledge that performance may vary across different language families and linguistic features.

BGE-M3 is just one part of the broader BGE family. If you're looking for English-only alternatives, you may want to explore bge-base-en-v1.5 or bge-en-icl.

all-mpnet-base-v2

MPNet is a novel pre-training method for language understanding tasks. It addresses limitations of masked language modeling (MLM) in BERT and permuted language modeling (PLM) in XLNet. In the MPNet family, all-mpnet-base-v2 is one of the most popular embedding models. It is designed specifically for sentence and short paragraph encoding. The original developers use a contrastive learning objective: given a sentence from a paired dataset, the model predicts which of a set of randomly sampled sentences is the correct pair.

all-mpnet-base-v2 is also a sentence-transformers model. It is able to map sentences and paragraphs into a 768-dimensional dense vector space, ideal for clustering, semantic search, and other NLP tasks.
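
Since it is a sentence-transformers model, getting embeddings takes only a few lines. Here is a minimal sketch; the sentences are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover your account credentials",
    "Best hiking trails near Seattle",
]

embeddings = model.encode(sentences)  # shape: (3, 768)

# Cosine similarity between every pair of sentences
print(util.cos_sim(embeddings, embeddings))
```

The first two sentences should score noticeably higher against each other than against the third.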

To date, all-mpnet-base-v2 represents one of the most downloaded embedding models on Hugging Face.

Why choose it:

  • Extensive training: The model is trained on over 1 billion sentence pairs to help it capture fine-grained semantic relationships.
  • Fine-tuning: The model is very adaptable and can be further fine-tuned to optimize performance for specific tasks. As of this writing, it has 149 fine-tuned versions on Hugging Face.
  • Flexible licensing: The model is released under the Apache 2.0 license. This means it supports both personal and commercial use in accordance with the license terms.

Points to be cautious about:

  • Input length limitations: By default, the model truncates inputs longer than 384 word pieces, which may lead to a loss of context in longer text.
  • Moderate performance: Compared to other models of similar size, all-mpnet-base-v2 may not perform as well on certain tasks. It does not rank particularly high on the MTEB leaderboard across a range of benchmarks.

If you’re considering all-mpnet-base-v2, I also recommend trying all-MiniLM-L6-v2. Both are sentence-transformers models and easy to set up. For those new to AI models, they make an excellent starting point for exploring embeddings.

gte-multilingual-base

gte-multilingual-base is the latest model in Alibaba Group’s GTE (General Text Embedding) family. It stands out for its strong performance in multilingual retrieval tasks and comprehensive representation evaluations. With 305 million parameters, this model balances high-quality embeddings with efficient resource usage.

Why choose it:

  • Multilingual support: The model covers more than 70 languages, delivering reliable multilingual performance.
  • Elastic dense embedding: gte-multilingual-base supports elastic dense representations, letting you shrink the output dimension to reduce storage and improve efficiency in downstream tasks (see the sketch after this list).
  • Encoder architecture: Built on an encoder-only transformer architecture, gte-multilingual-base is smaller and more resource-efficient than decoder-only models like gte-Qwen2-1.5B-instruct, delivering a 10x increase in inference speed.
  • Sparse vectors: In addition to dense representations, it can also generate sparse vectors.
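
Here is a minimal sketch of the elastic embedding idea: encode at the full 768 dimensions, then slice to a smaller dimension and re-normalize. It assumes sentence-transformers (the model needs trust_remote_code=True), and the target dimension of 256 is just an example.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

texts = ["什么是文本嵌入?", "What is a text embedding?"]
embeddings = model.encode(texts, normalize_embeddings=True)  # shape: (2, 768)

# Elastic embedding: keep only the first N dimensions, then re-normalize
dim = 256
truncated = embeddings[:, :dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

print(truncated.shape)                      # (2, 256) -> roughly 3x less storage
print(float(truncated[0] @ truncated[1]))   # crosslingual similarity between the two texts
```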

Points to be cautious about:

  • Inconsistent performance across languages: In the paper, the researchers note that performance may vary for certain languages, likely due to limited training data for those languages during contrastive pre-training.

Other recommended gte models:

  • gte-Qwen2-7B-instruct: A top-ranking model on the MTEB leaderboard
  • gte-large-en-v1.5: A model optimized for English, with a max sequence length of 8192

GTE-ModernColBERT-v1

GTE-ModernColBERT-v1 is a PyLate-based embedding model built on Alibaba-NLP’s gte-modernbert-base. It produces low-dimensional dense vectors at the token level and scores query-document pairs with the MaxSim operator, making it well suited to semantic textual similarity and retrieval. This model bridges the high performance of ColBERT-style retrieval with the efficiency of compact vector embeddings.

Why choose it:

  • Exceptional long-context performance: Although trained with a default 300-token document length, GTE-ModernColBERT generalizes well to longer inputs. It achieved state-of-the-art results on LongEmbed benchmarks (88.39 average across LEMB tasks with 32K token input, surpassing models like gte-multilingual-base and voyage-multilingual-2).
  • Streamlined retrieval with PyLate: The PyLate library offers native support for this model, making it easy to index documents and retrieve top results using Voyager HNSW indexing (see the sketch after this list). It also supports reranking in a retrieval pipeline without creating an index.
  • Flexible licensing: Released under the Apache 2.0 license, GTE-ModernColBERT is free for both personal and commercial use.
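
Here is a rough sketch of the PyLate indexing and retrieval flow. It assumes the pylate package and uses placeholder documents; it follows PyLate’s documented pattern, but double-check the exact signatures against the library’s docs for your installed version.

```python
from pylate import indexes, models, retrieve

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Build a Voyager HNSW index over token-level document embeddings
index = indexes.Voyager(index_folder="pylate-index", index_name="demo", override=True)

documents_ids = ["doc-1", "doc-2"]
documents = [
    "ColBERT-style retrieval keeps one vector per token and scores with MaxSim.",
    "Dense single-vector models compress a whole passage into one embedding.",
]
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

# Encode the query and retrieve the top matches
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["How does MaxSim scoring work?"], is_query=True)
print(retriever.retrieve(queries_embeddings=queries_embeddings, k=2))
```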

Points to be cautious about:

  • Document length tuning required: While GTE-ModernColBERT supports longer inputs, users should test and set the document_length parameter based on their use case (e.g., up to 8192 tokens).
  • English-only: This version is optimized for English, so multilingual tasks may require alternative models.

Nomic Embed Text V2

Nomic Embed Text V2 is a multilingual embedding model from Nomic AI and the first to use a Mixture-of-Experts (MoE) architecture for text embeddings. Trained on 1.6 billion contrastive pairs across ~100 languages (from mC4 and multilingual CC News), it delivers high-quality embeddings for tasks like semantic search, RAG, and recommendations.

Why choose it:

  • MoE design: Unlike dense models, this architecture dynamically routes each input to only a subset of its parameters. Nomic AI researchers found that alternating MoE layers with 8 experts and top-2 routing delivers the optimal balance between performance and efficiency, activating only 305M of the model’s 475M total parameters during inference.
  • High performance: The model shows strong performance on BEIR and MIRACL benchmarks, competitive even with models twice its size.
  • Flexible dimension: Leveraging Matryoshka representation learning, it supports truncating embeddings from 768 to 256 dimensions while maintaining quality. This is ideal for resource-constrained deployments.
  • Open-source: Pre-training and fine-tuning datasets, training code, and model weights are all open-sourced.

Points to be cautious about:

  • Prompt formatting: For optimal performance, use structured input formats like "search_document: <text>" or "search_query: <text>" (see the sketch after this list).
  • Input length: The model supports a maximum input length of 512 tokens.
  • Language variability: Performance may vary across different languages, especially low-resource ones.
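
Here is a minimal sketch that combines two of the points above: prefixed inputs plus Matryoshka truncation to 256 dimensions. It assumes sentence-transformers (the MoE architecture requires trust_remote_code=True), and the texts are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

query = "search_query: best open-source embedding models"
documents = [
    "search_document: Nomic Embed Text V2 is a multilingual MoE embedding model.",
    "search_document: BGE-M3 supports dense, sparse, and multi-vector retrieval.",
]

embeddings = model.encode([query] + documents, normalize_embeddings=True)

# Matryoshka truncation: keep the first 256 dimensions, then re-normalize
dim = 256
truncated = embeddings[:, :dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

print(truncated[0] @ truncated[1:].T)  # query-to-document similarities
```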

What are the common use cases of embedding models?

Embedding models are important tools that convert text, images, or other data into vector representations, capturing their underlying semantics and structure. This makes them essential in a wide range of AI applications. To name a few:

  • Semantic search: Embeddings allow you to retrieve semantically similar items by encoding content (text, images, etc.) into vector space, where similar items are close to each other. In search engines, this enables users to easily find relevant content.
  • Information retrieval: Embeddings enable AI models to search through large databases for documents or responses relevant to a given query. A typical use case is RAG, where retrieved data helps improve real-time content generation.
  • Clustering and classification: By grouping similar data points in vector space, embeddings make it easy to classify and organize content. For example, you can group customer reviews by sentiment or documents by topic.
  • Recommendation systems: Embeddings help recommendation engines understand user preferences based on the semantic similarities between user interests. This makes it possible to provide more personalized recommendations.

What should I consider when deploying embedding models?

When deploying embedding models, consider these key factors:

  • Performance and accuracy: Choose a model suited to your specific tasks, like retrieval, clustering, or classification. Review benchmarks like MTEB to ensure the model meets the desired accuracy and performance for your use case.
  • Low latency and fast scaling: Real-time applications, such as search engines or chatbots, require fast, low-latency embeddings. If you have diverse traffic patterns, fast autoscaling (especially fast cold start time) is also important. BentoML provides standardized abstractions to build scalable APIs and lets you run any embedding model on BentoCloud. This inference platform provides fast and scalable infrastructure for model inference and advanced AI applications.
  • Integration into compound AI systems: Embedding models can be powerful components within compound AI solutions. A simple example is combining an embedding model with an LLM for a RAG system. BentoML offers a suite of toolkits that simplify building and scaling such AI systems, including multi-model chains, distributed orchestration, and multi-GPU serving.

Here is a minimal sketch of serving all-MiniLM-L6-v2 with BentoML. It assumes the bentoml (1.2+) and sentence-transformers packages; the service and endpoint names are illustrative.

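```python
import bentoml
from sentence_transformers import SentenceTransformer


@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
class SentenceEmbedder:
    def __init__(self) -> None:
        # Load the model once when a service worker starts
        self.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    @bentoml.api
    def encode(self, sentences: list[str]) -> list[list[float]]:
        # Return plain lists so the response is JSON-serializable
        return self.model.encode(sentences, normalize_embeddings=True).tolist()
```

You can test the service locally with bentoml serve and then deploy it to BentoCloud when you need autoscaling.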


How can I improve the quality of embeddings?

Improving the quality of embeddings can enhance the performance of tasks like search, classification, and clustering. Common strategies include:

  1. Fine-tune on domain-specific data. Start by fine-tuning your embedding model on data that closely resembles your target domain. This can greatly improve relevance and semantic accuracy, and is particularly effective for specialized industries like legal, medical, or e-commerce.
  2. Use contrastive learning. This is probably one of the most mentioned terms when people talk about embedding models. It simply means training the model to differentiate between similar (positive) and dissimilar (negative) pairs of samples, which helps it capture subtle semantic differences (see the sketch after this list).
  3. Experiment with different embedding dimensions. Different dimensions can impact both quality and resource usage. Lower dimensions may simplify and speed up computations but could lose detail, while higher dimensions often capture richer information at the cost of more storage.
  4. Use multimodal embedding training. For applications that involve text, images, or other data types, I highly recommend you train the model with multimodal data. This can improve embedding quality by enabling the model to capture cross-modal relationships.
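
Here is a minimal fine-tuning sketch that combines points 1 and 2: training on domain-specific pairs with a contrastive objective. It assumes sentence-transformers; the training pairs are illustrative placeholders, and MultipleNegativesRankingLoss treats the other pairs in each batch as negatives.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Each example is a (query, relevant text) pair from your own domain
train_examples = [
    InputExample(texts=[
        "What is the statute of limitations?",
        "A statute of limitations sets the deadline for filing a claim.",
    ]),
    InputExample(texts=[
        "Side effects of ibuprofen",
        "Common ibuprofen side effects include nausea and heartburn.",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("all-mpnet-base-v2-domain-tuned")
```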

Final thoughts

It is never easy to choose the right embedding model with so many options on the market. I hope this guide has provided clarity on some of the top open-source embedding models. For each model listed, there are often great variants worth exploring. My best advice is to take advantage of the flexibility of open-source models by fine-tuning them with your own data, which can significantly improve embedding accuracy for your specific needs. And finally, remember that choosing the right deployment tool is just as crucial: it can make all the difference in achieving smooth, scalable, and efficient performance.

If you have questions about embedding models, check out the following resources: