1. We install the Postgres extension pgvector, which allows tables to have columns of type vector, where a vector is a set of floats. In this example we use a 768-dimensional vector, i.e., a vector of length 768.
  2. We create a table that will save the articles for our knowledge base: the text of each article, the title of the article, and a vector embedding of the article text. We name the table articles and the columns title, text, and embedding.
  3. We extract the content at four Wikipedia URLs and separate the title and content for each.
  4. We clean up each article body, divide the text into chunks of 500 characters, and use an embedding model to create a 768-dimensional vector from each chunk. The vector is a numerical representation (an array of floats) of the meaning of the chunk of text.
  5. We save the title, a chunk from the body, and the embedding vector for the chunk in a row of the database. For each article, there are as many vectors as there are chunks.
  6. We index the vector column for the similarity search we will perform in Part 2.

import psycopg2
from sentence_transformers import SentenceTransformer
import requests
from bs4 import BeautifulSoup
import re
import ollama

# Your connection params here

MY_DB_HOST = 'localhost'
MY_DB_PORT = 5432
MY_DB_NAME = 'nitin'
MY_DB_USER = 'nitin'
MY_DB_PASSWORD = ''

# Set up the database connection

conn = psycopg2.connect(
    host=MY_DB_HOST,
    port=MY_DB_PORT,
    dbname=MY_DB_NAME,
    user=MY_DB_USER,
    password=MY_DB_PASSWORD
)
cur = conn.cursor()

# Create the articles table with the pgvector extension

# If the pgvector extension is not installed on your machine, it will need to be installed.
# See https://github.com/pgvector/pgvector for instructions, or use a cloud instance with pgvector installed.
# First create the pgvector extension, then a table with a 768-dim vector column for embeddings.
# Note that the title and a chunk of the article text are also saved with each embedding.
# This allows vector similarity search on the embedding column, returning matched text 
# along with matched embeddings depending on what is needed.
# After this SQL command is executed we will have 
#   a) a pgvector extension installed if it did not already exist
#   b) an empty table with a column of type vector along with two columns,
# one to save the title of the article and one to save a chunk of text.

# pgvector supports vectors of up to 16,000 dimensions (indexes such as hnsw support up to 2,000).
# It is worth experimenting with larger lengths, but note they need to match the length of embeddings
# created by the model you use. Embeddings of ~1K, 2K, or more dimensions are common among embeddings APIs.

cur.execute('''
    CREATE EXTENSION IF NOT EXISTS vector;
    DROP TABLE IF EXISTS articles;
    CREATE TABLE articles (
        id SERIAL PRIMARY KEY,
        title TEXT,
        text TEXT,
        embedding VECTOR(768)
    );
''')
conn.commit()

# Below are the sources of content for creating embeddings to be inserted in our demo vector db.
# Feel free to add your own links, but note that sources other than Wikipedia may
# have different junk characters and may require different pre-processing.
# As a start try other Wikipedia pages, then expand to other sources.

urls = [
    'https://en.wikipedia.org/wiki/Pentax_K-x',
    'https://en.wikipedia.org/wiki/2008_Tour_de_France',
    'https://en.wikipedia.org/wiki/Onalaska,_Texas',
    'https://en.wikipedia.org/wiki/List_of_extinct_dog_breeds'
]

# Fetch the HTML at a given link and extract only the text, separating title and content.
# This text is the content we will draw on to answer queries about the Wikipedia articles.

def extract_title_and_content(url):
    try:
        response = requests.get(url)
        if response.status_code == 200: # success
            # Create a BeautifulSoup object to parse the HTML content
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract the title of the page
            title = soup.title.string.strip() if soup.title else ""
            # Extract the text content from the page
            content = soup.get_text(separator=" ")
            return {"title": title, "text": content}
        else:
            print(f"Failed to retrieve content from {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while retrieving content from {url}: {str(e)}")
        return None

# Create the embedding model

# This is the model we use to generate embeddings, i.e. to encode text chunks into numeric vectors of floats.
# Sentence Transformers (sbert.net) is a collection of transformer models designed for creating embeddings 
# from sentences. These are trained on data sets used for different applications. We use one tuned for Q&A,
# hence the 'qa' in the name. There are other embedding models, some tuned for speed, some for breadth, etc.
# The site sbert.net is worth studying for picking the right model for other uses. It's also worth looking 
# at the embedding models of providers like OpenAI, Cohere, etc. to learn the differences, but note that
# the use of an online model involves a potential loss of privacy.

embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
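
# Optional sanity check: the model's embedding width must match the VECTOR(768) column
# created above. get_sentence_embedding_dimension() is part of the SentenceTransformer API;
# if you swap in a different model, change the column definition to match its output.
assert embedding_model.get_sentence_embedding_dimension() == 768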

articles = []
embeddings = []

# Extract the title and content from each URL and store them in the articles list.
for url in urls:
    article = extract_title_and_content(url)
    if article:
        articles.append(article)

for article in articles:
    raw_text = article["text"]
    # Pre-processing: Replace large chunks of white space with a space, eliminate junk characters.
    # This will vary with each source and will need custom cleanup. 
    text = re.sub(r'\s+', ' ', raw_text)
    text = text.replace(")", "").replace("(", "")

    # Chunk into 500-character chunks. This is a typical size; it could be smaller if the article is short.
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    for chunk in chunks:
        # This is where we invoke our model to generate a list of floats.
        # The embedding model returns a numpy ndarray of floats.
        # Psycopg coerces the list into a vector for insertion.
        embedding = embedding_model.encode([chunk])[0].tolist()
        cur.execute('''
            INSERT INTO articles (title, text, embedding)
            VALUES (%s, %s, %s);''', (article["title"], chunk, embedding)
        )
        embeddings.append(embedding)

conn.commit()
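
# Optional check: confirm the chunks landed; each article contributes one row per chunk.
cur.execute('SELECT count(*) FROM articles;')
print('chunks stored:', cur.fetchone()[0])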

# Create an index

# pgvector allows different indexes for similarity search.
# See the docs in the README at https://github.com/pgvector/pgvector for detailed explanations.
# Here we use 'hnsw', an index based on Hierarchical Navigable Small World graphs.
# Small-world structure is a pattern seen in network models of language, so this is one of the
# indexes expected to work well for language embeddings. For this small demo it will probably not
# make much of a difference which index you use, and the others are also worth trying.
# The parameters provided in the 'USING' clause are 'embedding vector_cosine_ops'
# The first, 'embedding' in this case, needs to match the name of the column which holds embeddings.
# The second, 'vector_cosine_ops', is the operation used for similarity search i.e. cosine similarity.
# The same README on GitHub gives the other choices, but for most common uses it makes
# little difference, so cosine similarity is used as our default.

cur.execute('''
    CREATE INDEX ON articles USING hnsw (embedding vector_cosine_ops);
''')

conn.commit()
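
# With the index in place, a similarity search (the subject of Part 2) looks like this sketch,
# where <=> is pgvector's cosine distance operator and the parameter is a 768-dim query
# embedding written in pgvector's '[f1, f2, ...]' text form:
#
#   SELECT title, text FROM articles
#   ORDER BY embedding <=> '[0.12, -0.03, ...]'::vector
#   LIMIT 5;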
cur.close()
conn.close()

# End of file

Part 2. Retrieve context from the vector database and query the LLM

In Part 2 we ask a natural language question of our knowledge base, using similarity search to find a context and using an LLM (in this case, Meta's Llama 3) to generate an answer to the question in the supplied context. The steps:

  1. We encode our natural language query as a vector, using the same embedding model we used to encode the chunks of text we extracted from the Wikipedia pages.
  2. We perform a similarity search on this vector using a SQL query. Similarity, or specifically cosine similarity, is a way to find the vectors in our database that are nearest to the vector query. Once we find the nearest vectors, we can use them to retrieve the corresponding text that is saved alongside each vector. That text is the context for our query to the LLM.
  3. We append this context to our natural language query text, explicitly telling the LLM that the supplied text is to be taken as the context for answering the query.
  4. We use a programmatic wrapper around Ollama to pass the natural language query and the contextual text to the LLM's request API and fetch the response. We send three queries and receive the answer in context for each. A sample screenshot for the first query is shown below, followed by a minimal code sketch of these steps.

[Screenshot (IDG): the LLM's in-context answer to the first query]
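
Below is a minimal sketch of those four steps, assuming a psycopg2 cursor cur and the embedding_model from Part 1 are still open. The question text, the model name llama3, and the five-row context window are illustrative choices, not requirements of the setup above.

# 1. Encode the question with the same model used for the article chunks.
question = "What is the Pentax K-x?"
query_embedding = embedding_model.encode(question).tolist()

# 2. Cosine-distance search (<=>) for the five nearest chunks. pgvector accepts the
# '[f1, f2, ...]' text form of a vector, which str() on a Python list produces.
cur.execute('''
    SELECT text FROM articles
    ORDER BY embedding <=> %s::vector
    LIMIT 5;
''', (str(query_embedding),))
context = "\n\n".join(row[0] for row in cur.fetchall())

# 3. Tell the LLM explicitly that the retrieved text is the context for the answer.
prompt = f"Use only this context to answer the question.\nContext:\n{context}\n\nQuestion: {question}"

# 4. Send the prompt to Llama 3 through the Ollama Python wrapper.
response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
print(response['message']['content'])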