- We install the Postgres extension called pgvector, which allows tables to have columns of type vector, where a vector is an array of floats. In this example we use a 768-dimensional vector, i.e. a vector of length 768.
- We create a table that will hold the articles for our knowledge base: the text of each article, the title of the article, and a vector embedding of the article text. We name the table articles and the columns title, text, and embedding.
- We pull the content of four Wikipedia URLs and separate the title and content of each.
- We clean up each article body, split the text into 500-character chunks, and use an embedding model to create a 768-dimensional vector from each chunk. The vector is a numerical representation (an array of floats) of the meaning of the chunk of text.
- We save the title, a chunk of the body, and the embedding vector for that chunk in one row of the database. For each article there are as many vectors as there are chunks.
- We index the vector column for similarity search in Part 2.
import psycopg2
from sentence_transformers import SentenceTransformer
import requests
from bs4 import BeautifulSoup
import re
import ollama
# Your connection params here
MY_DB_HOST = 'localhost'
MY_DB_PORT = 5432
MY_DB_NAME = 'nitin'
MY_DB_USER = 'nitin'
MY_DB_PASSWORD = ''
# Set up the database connection
conn = psycopg2.connect(
    host=MY_DB_HOST,
    port=MY_DB_PORT,
    dbname=MY_DB_NAME,
    user=MY_DB_USER,
    password=MY_DB_PASSWORD
)
cur = conn.cursor()
# Create the articles table with the pgvector extension
# If the pgvector extension is not installed on your machine, you will need to install it.
# See https://github.com/pgvector/pgvector, or use a cloud instance with pgvector installed.
# First create the pgvector extension, then a table with a 768 dim vector column for embeddings.
# Note that the title of the article and a chunk of its text are also saved with the embedding.
# This allows vector similarity search on the embedding column, returning matched text
# along with matched embeddings depending on what is needed.
# After this SQL command is executed we will have
# a) a pgvector extension installed if it did not already exist
# b) an empty table with a column of type vector along with two columns,
# one to save the title of the article and one to save a chunk of text.
# pgvector supports vectors of up to 16,000 dimensions (its indexes currently support up to 2,000),
# so there is room to experiment with larger lengths, but note they need to match the length of the embeddings
# created by the model you use. Embeddings of ~1K, 2K, or more dimensions are common among embedding APIs.
cur.execute('''
CREATE EXTENSION IF NOT EXISTS vector;
DROP TABLE IF EXISTS articles;
CREATE TABLE articles (
id SERIAL PRIMARY KEY,
title TEXT,
text TEXT,
embedding VECTOR(768)
);
''')
conn.commit()
# Below are the sources of content for creating embeddings to be inserted in our demo vector db.
# Feel free to add your own links, but note that sources other than Wikipedia may
# have different junk characters and may require different pre-processing.
# As a start try other Wikipedia pages, then expand to other sources.
urls = [
    'https://en.wikipedia.org/wiki/Pentax_K-x',
    'https://en.wikipedia.org/wiki/2008_Tour_de_France',
    'https://en.wikipedia.org/wiki/Onalaska,_Texas',
    'https://en.wikipedia.org/wiki/List_of_extinct_dog_breeds'
]
# Fetch the HTML at a given link and extract only the text, separating title and content.
# We will use this text to extract content from Wikipedia articles to answer queries.
def extract_title_and_content(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:  # success
            # Create a BeautifulSoup object to parse the HTML content
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract the title of the page
            title = soup.title.string.strip() if soup.title else ""
            # Extract the text content from the page
            content = soup.get_text(separator=" ")
            return {"title": title, "text": content}
        else:
            print(f"Failed to retrieve content from {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while retrieving content from {url}: {str(e)}")
        return None
# Create the embedding model
# This is the model we use to generate embeddings, i.e. to encode text chunks into numeric vectors of floats.
# Sentence Transformers (sbert.net) is a collection of transformer models designed for creating embeddings
# from sentences. These are trained on data sets used for different applications. We use one tuned for Q&A,
# hence the 'qa' in the name. There are other embedding models, some tuned for speed, some for breadth, etc.
# The site sbert.net is worth studying for picking the right model for other uses. It's also worth looking
# at the embedding models of providers like OpenAI, Cohere, etc. to learn the differences, but note that
# the use of an online model involves a potential loss of privacy.
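# Note: multi-qa-mpnet-base-dot-v1 produces 768-dimensional embeddings, matching the VECTOR(768) column created above.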
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
articles = []
embeddings = []
# Extract the title and content from each URL and store them in the articles list.
for url in urls:
    article = extract_title_and_content(url)
    if article:
        articles.append(article)
for article in articles:
    raw_text = article["text"]
    # Pre-processing: Replace large chunks of white space with a space, eliminate junk characters.
    # This will vary with each source and will need custom cleanup.
    text = re.sub(r'\s+', ' ', raw_text)
    text = text.replace("]", "").replace("[", "")
    # Chunk into 500-character chunks. This is a typical size; it could be lower if the total size of the article is small.
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    for chunk in chunks:
        # This is where we invoke our model to generate a list of floats.
        # The embedding model returns a numpy ndarray of floats.
        # Psycopg coerces the list into a vector for insertion.
        embedding = embedding_model.encode([chunk])[0].tolist()
        cur.execute('''
            INSERT INTO articles (title, text, embedding)
            VALUES (%s, %s, %s);
        ''', (article["title"], chunk, embedding))
        embeddings.append(embedding)
conn.commit()
# Create an index
# pgvector allows different indexes for similarity search.
# See the docs in the README at https://github.com/pgvector/pgvector for detailed explanations.
# Here we use 'hnsw', an index based on the Hierarchical Navigable Small World (HNSW) algorithm.
# Small-world structure is a pattern seen in network models of language. Hence this is one of the indexes
# that is expected to work well for language embeddings. For this small demo it will probably not
# make much of a difference which index you use, and the others are also worth trying.
# The parameters provided in the 'USING' clause are 'embedding vector_cosine_ops'
# The first, 'embedding' in this case, needs to match the name of the column which holds embeddings.
# The second, 'vector_cosine_ops', is the operator class used for similarity search, i.e. cosine similarity.
# The same README doc on GitHub gives other choices but for most common uses it makes little difference
# hence cosine similarity is used as our default.
cur.execute('''
CREATE INDEX ON articles USING hnsw (embedding vector_cosine_ops);
''')
conn.commit()
cur.close()
conn.close()
# End of file
Part 2. Retrieve context from the vector database and query the LLM
In Part 2, we ask a natural-language question of our knowledge base, using similarity search to find a context and using an LLM (in this case, Meta's Llama 3) to generate an answer to the question in the supplied context. The steps (a code sketch follows the list):
- We encode our natural-language query as a vector, using the same embedding model we used to encode the chunks of text we pulled from the Wikipedia pages.
- We run a similarity search on this vector using a SQL query. Similarity, specifically cosine similarity, is a way of finding the vectors in our database that are closest to the query vector. Once we find the closest vectors, we can use them to retrieve the corresponding text saved with each vector. That is the context for our query to the LLM.
- We append this context to our natural-language query text, telling the LLM explicitly that the supplied text should be treated as the context for answering the query.
- We use a programmatic wrapper around Ollama to pass the natural-language query and the context text to the LLM's request API and fetch the answer. We send three queries and receive the answer in context for each query. An example screenshot for the first query is shown below.
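The listing below is a minimal sketch of these four steps, not the article's exact code. It assumes the database and embedding model from Part 1, a local Ollama server with the llama3 model pulled, and an illustrative question; the example query text, the LIMIT of five chunks, and the prompt wording are arbitrary choices.

import psycopg2
import ollama
from sentence_transformers import SentenceTransformer

# Same embedding model and connection parameters as Part 1.
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
conn = psycopg2.connect(host='localhost', port=5432, dbname='nitin', user='nitin', password='')
cur = conn.cursor()

# An example natural-language question (illustrative only).
query = "Which dog breeds are extinct?"

# Step 1: encode the query with the same model used for the article chunks.
query_embedding = embedding_model.encode([query])[0].tolist()

# Step 2: cosine similarity search with pgvector. The <=> operator computes cosine distance,
# so ordering ascending returns the chunks closest in meaning to the query.
# psycopg2 sends the Python list as a Postgres array, which the ::vector cast converts.
cur.execute('''
    SELECT title, text
    FROM articles
    ORDER BY embedding <=> %s::vector
    LIMIT 5;
''', (query_embedding,))
context = "\n\n".join(row[1] for row in cur.fetchall())

# Step 3: append the retrieved chunks to the query, telling the LLM to treat them as context.
prompt = (
    "Answer the question using only the context provided below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)

# Step 4: pass the prompt to Llama 3 through the Ollama Python wrapper and print the answer.
response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
print(response['message']['content'])

cur.close()
conn.close()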