Semantic search is used across a wide variety of problem domains to find results from natural language input. It allows us to compare different texts to find ones that are similar based on the “meaning” of those texts, rather than just on matching keywords (as in lexical search).
We’re going to use it to build a demo app for the task of information retrieval, or what you might think of as classic search. However, the same concepts can be applied to tasks like classification or clustering (used for things like recommendations), reranking (reordering search results for relevancy), and getting relevant context to improve the performance of other models.
We’ll assume you’re somewhat comfortable on the command line. These instructions should work on Linux, Mac OS X, and Windows systems using WSL, though some commands may be slightly different depending on your system.
Our data set will be a list of news titles from the Australian Broadcasting Corporation published between 2003 and 2021, which covers major international events.
To implement this, we’re going to use several open source tools:
- the Instructor model to create “embeddings” for our texts
- PostgreSQL with the pgvector extension to store the embeddings, and to calculate similarities when we’re performing searches
- Flask, a Python web framework that we’ll use to interface with the Instructor model and PostgreSQL using the psycopg2 library
What Are Vector Embeddings?
The key part of semantic search is creating “vector embeddings”, which are numerical representations of a string of text. In practice, these are just arrays of numbers, like [0.82, -0.39, …, 0.44]. These arrays of numbers (vectors) are generated by sending text to a language model, which encodes the “meaning” of the text as these values.
Each of these numbers is called a dimension, and each captures some aspect of the text. If we were hand-encoding these, perhaps we would use the first number to represent the country or region of a news item, the second to represent topic area, the third for positive/negative sentiment, etc. In practice, what each dimension really encodes is mostly opaque to us, and is only significant to the model.
We can do a comparison of these vector embeddings using several techniques, including one we’ll use called cosine similarity. This can compare vectors to determine which have the closest “angle” to each other, and are therefore the most similar.
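To make that concrete, here is a minimal sketch of cosine similarity using numpy (the vectors are made up for illustration; in the app, pgvector will do this calculation for us inside the database):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    # dot product divided by the product of the vector magnitudes:
    # 1.0 means the vectors point in the same direction, values near 0 or below mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([0.82, -0.39, 0.44], [0.80, -0.41, 0.47]))  # close to 1.0
print(cosine_similarity([0.82, -0.39, 0.44], [-0.50, 0.90, 0.10]))  # much lower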
Step 1: Project Setup
You’ll need to have PostgreSQL installed as well as the pgvector extension. Once you have those, let’s set up a project directory, and activate a virtual environment so that our dependencies can stay scoped to our project:
mkdir news_search
cd news_search
python -m venv venv
Then, activate it by running:
source venv/bin/activate
Now that it’s active, we’ll install the dependencies:
pip install flask InstructorEmbedding torch numpy tqdm sentence_transformers psycopg2
Finally, download the data set in CSV format from here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL
Because it’s about 1.2M rows, we’ll just take the first 5,000 of them so that it doesn’t take too long to get their embeddings and load them in, but that should still give us enough data to keep it interesting. One way to do that is using the head command (the CSV file has a header, so we take one extra line):
head -n 5001 abcnews-date-text.csv > first_5k_headlines.csv
Step 2: Setting Up the Instructor Model
Instructor is an open-source model that is licensed under Apache 2.0, allowing even commercial usage, and which achieves state of the art performance across many embedding tasks.
We’re going to use the smallest Instructor model, instructor-base, which requires the least amount of hardware to run. Feel free to try the instructor-large or even instructor-xl models if you have the resources to do it.
In addition to the text you’re embedding, Instructor also takes an instruction that it uses when calculating the embedding. This instruction is a big part of what makes the model work so well across different tasks and problem domains: it defines both the problem domain (e.g. “Represent the news title”) and the task (“for retrieval”).
In a file called instructor.py, put the following code:
from InstructorEmbedding import INSTRUCTOR
import numpy

model = INSTRUCTOR('hkunlp/instructor-base')

def calculate_embedding(instruction, text):
    embeddings = model.encode([[instruction, text]])
    # convert from numpy ndarray type to regular list
    list_embeddings = numpy.ravel(embeddings).tolist()
    return list_embeddings
We can test to make sure this works from within a Python REPL. Let’s try it:
python
>>> import instructor
>>> instructor.calculate_embedding('Represent the sentence for retrieval', 'I am a sentence')
The first run may take a while as it downloads the model, but once it’s ready you should see the vectors returned!
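As an optional sanity check, you can compare a query embedding against a couple of sentence embeddings and confirm that the semantically related pair scores higher. The sentences here are made up, and the cosine helper is just the standard formula:

>>> import numpy as np
>>> def cosine(a, b):
...     a, b = np.array(a), np.array(b)
...     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
...
>>> q = instructor.calculate_embedding('Represent the question for retrieval', 'stories about aviation')
>>> d1 = instructor.calculate_embedding('Represent the sentence for retrieval', 'new airline routes announced')
>>> d2 = instructor.calculate_embedding('Represent the sentence for retrieval', 'local bakery wins baking award')
>>> cosine(q, d1), cosine(q, d2)  # the aviation-related pair should come out higher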
Step 3: Database Setup
You can set up a .pgpass file to store your database credentials, which will be used automatically by the psycopg2 library. To set this up, in your home directory (cd ~) create a file called .pgpass with the following contents:
#hostname:port:database:username:password
localhost:5432:news_search:postgres:your_postgres_password
Change any of the values you need if you’ve set non-default values for Postgres. Then, make sure to restrict access to this file to just your user:
chmod 600 ~/.pgpass
Now we can use the psql utility to connect to the database and set up our tables:
psql -U postgres
postgres=# CREATE DATABASE news_search;
postgres=# \c news_search
news_search=# CREATE EXTENSION vector;
news_search=# CREATE TABLE articles (id BIGSERIAL PRIMARY KEY, published_date VARCHAR(40), title TEXT, embedding vector(768));
We created a 768-dimensional vector because that is the dimensionality of the vectors the Instructor model outputs.
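If you want to verify that number on your machine, a quick check from the Python REPL using the module we just wrote:

>>> import instructor
>>> len(instructor.calculate_embedding('Represent the sentence for retrieval', 'dimension check'))
768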
Now let’s set up a module in a file db.py to create a database connection for us:
import psycopg2
import os

def create_conn():
    conn = psycopg2.connect(
        dbname='news_search',
        user='postgres',
        host='localhost',
        port='5432'
    )
    return conn
Again, you may need to fill in your own values for the database connection if they’re different.
We’re now ready to load in our dataset. We could set up an endpoint to save embeddings sent by clients, but for our purposes we can just create an import script, which I’ll call import.py, to read the CSV file, calculate the embeddings, and add them to the database:
import csv
import psycopg2
import sys
import db
import instructor

if len(sys.argv) != 2:
    print("Error: you should provide the file path as an argument to this script")
    sys.exit(1)

data_file = sys.argv[1]

conn = db.create_conn()
cur = conn.cursor()

with open(data_file, 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header
    count = 0
    for row in reader:
        published_date = row[0]
        title = row[1]
        embedding = instructor.calculate_embedding('Represent the news article title for retrieval:', title)
        query = 'INSERT INTO articles (published_date, title, embedding) VALUES (%s, %s, %s)'
        data = (published_date, title, embedding)
        cur.execute(query, data)
        count += 1
        if count % 500 == 0:
            print("On record {0}".format(count))

conn.commit()
cur.close()
conn.close()
Then you can run it with the data file as the argument:
python import.py first_5k_headlines.csv
This could take a little time: it took me about 3 minutes to load these 5k records on a 2022 laptop.
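As an aside, since we installed tqdm earlier, you can swap the manual count/print in import.py for a progress bar. A sketch of the change (the loop body stays the same as above):

from tqdm import tqdm

with open(data_file, 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header
    for row in tqdm(reader, total=5000):  # total matches our 5k subset; adjust for your file
        ...  # same per-row logic as before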
Step 4: Set Up Flask
Flask is a very simple web framework, and will let us tie everything together in a web UI. In a file called app.py, put the following code:
from flask import Flask, request, jsonify, send_file
import psycopg2
import db
import instructor

app = Flask(__name__)

@app.route("/search", methods=['POST', 'GET'])
def search():
    if request.method == 'GET':
        return send_file('search.html')

    data = request.get_json()
    search_string = data.get('query')

    instruction = "Represent the news article question for retrieving relevant article titles:"
    embedding = "{0}".format(instructor.calculate_embedding(instruction, search_string))

    query = "SELECT title, published_date FROM articles ORDER BY embedding <=> %s LIMIT 10;"

    conn = db.create_conn()
    cur = conn.cursor()
    cur.execute(query, (embedding,))
    results = cur.fetchall()
    conn.close()

    return jsonify({'results': results}), 200
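A note on the query: <=> is pgvector’s cosine distance operator, so ordering by it ascending puts the most similar titles first. pgvector also provides <-> for Euclidean (L2) distance and <#> for negative inner product. You can experiment with these directly in psql; for example, a title’s cosine distance to its own embedding is 0:

SELECT title, embedding <=> (SELECT embedding FROM articles LIMIT 1) AS cosine_distance
FROM articles
ORDER BY cosine_distance
LIMIT 5;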
To run the server:
flask run
We haven’t created the search.html page yet, but if you have an HTTP client installed, we can test to make sure the search works. Here is an example using curl:
curl -X POST -H "Content-Type: application/json" -d '{"query":"airplanes"}' localhost:5000/search
You should see article titles get returned!
Step 5: The Frontend
We’ll now put together a simple frontend to make it easier to send queries. In a file called search.html, put the following code:
<html>
<head>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-9ndCyUaIbzAi2FUVXJi0CjmCapSmO7SnpJef0486qhLnuZ2cdeRhO02iuK6FUUVM" crossorigin="anonymous">
<style>
body {
padding: 1em;
}
#results {
margin-top: 1em;
}
</style>
<script>
async function search() {
const query = document.getElementById('search').value;
const resultSection = document.getElementById('results');
resultSection.innerHTML = '';
const response = await fetch('/search', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ query: query }),
});
const json = await response.json();
json.results.forEach(r => {
const resultNode = document.createElement('div');
resultNode.innerHTML = `
<h3>${r[0]}</h3>
<p>Published: ${r[1]}</p>
`;
resultSection.appendChild(resultNode);
});
}
</script>
</head>
<body>
<h1>News Search</h1>
<div>
<label>
Search
<input type="text" id="search" />
</label>
<button onclick="search()">Search</button>
</div>
<div id="results"></div>
</body>
</html>
Now, you should be able to visit http://localhost:5000/search and try searching for different terms. We’ve built a semantic search for news!
Considerations for Improvement
This is of course just a demo app, but it should give you an idea of how to put together your own semantic vector search. If you want to expand on it, there are a few things to consider.
Instructor is designed with an upper threshold of 512 tokens (~380 words) in mind, and beyond that performance will start to drop significantly. If you are embedding long documents, you may want to split them into smaller chunks and get embeddings for the chunks.
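For instance, a very rough way to chunk before embedding is to split on whitespace and group words. This is only a sketch (real chunkers usually respect sentence or token boundaries and overlap chunks), and long_document is a placeholder for your own text:

def chunk_text(text, max_words=300):
    # naive whitespace chunking; keeps each chunk under the model's comfort zone
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

chunks = chunk_text(long_document)
embeddings = [instructor.calculate_embedding('Represent the document for retrieval:', c) for c in chunks]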
Experimentally, I was able to get embeddings for about 1,700 titles per minute, which would be pretty slow on a large corpus. For instance, loading the entire 1.2M rows takes about 12 hours on my laptop. However, it’s possible to split up the dataset and parallelize the embeddings.
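A simple way to do that on one machine is to split the CSV and run several copies of the import script at once. A sketch (each process loads its own copy of the model, so keep an eye on memory; the header contents don’t matter because import.py skips the first line of each file):

tail -n +2 abcnews-date-text.csv > headlines_noheader.csv
split -l 300000 headlines_noheader.csv part_
for f in part_*; do
  (echo "publish_date,headline_text"; cat "$f") > "$f.csv"   # re-add a header line for import.py to skip
done
for f in part_*.csv; do
  python import.py "$f" &
done
wait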
We haven’t created any indexes. While performance can vary widely based on many factors, there will probably be user-noticeable latency once you start approaching 100k rows. You can read up on creating indexes using Approximate Nearest Neighbors search, which trades some accuracy for much better performance, and should allow you to scale to search across millions of records. As with any queries, you can do benchmarking and analysis using EXPLAIN ANALYZE.
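As a starting point, pgvector’s IVFFlat index (and, in newer versions, HNSW) provides approximate nearest neighbor search. Here is a sketch using the cosine operator class, which matches the <=> operator in our query; lists is a tuning parameter worth reading up on:

CREATE INDEX ON articles USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

EXPLAIN ANALYZE
SELECT title FROM articles
ORDER BY embedding <=> (SELECT embedding FROM articles LIMIT 1)
LIMIT 10;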
If your company wants help building out semantic search or any other AI workflows, get in touch with us at Revelry!
We're building an AI-powered Product Operations Cloud, leveraging AI in almost every aspect of the software delivery lifecycle. Want to test drive it with us? Join the ProdOps party at ProdOps.ai.