[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ironcorelabs/ironcore-alloy/blob/main/examples/python/standalone/pinecone-semantic-search-encrypted.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/ironcorelabs/ironcore-alloy/blob/main/examples/python/standalone/pinecone-semantic-search-encrypted.ipynb)

# Using IronCore Labs Cloaked AI with Pinecone

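In this walkthrough we will see how to use Pinecone for encrypted semantic search: the vectors and their metadata are encrypted with IronCore Labs' `ironcore-alloy` SDK before they are sent to Pinecone. To begin, we must install the prerequisite libraries: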

In [None]:
%pip install -q \
 "pinecone-client[grpc]"==2.2.4 \
 pinecone-datasets==0.6.2 \
 sentence-transformers==2.2.2 \
 ironcore-alloy==0.10.2

## Data Download

We will skip the data preparation steps, which can be very time-consuming, and jump straight in with the prebuilt dataset from *Pinecone Datasets*. If you'd rather see how it's all done, please refer to [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb).

Let's go ahead and download the dataset.

In [None]:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()

## Creating an Index

Now that the data is ready, we can set up our index to store it.
We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

We recommend that you generate a new API key for this example, and delete it once the example is completed.

In [5]:
import pinecone
import ironcore_alloy as alloy
import base64

# get api key from app.pinecone.io
PINECONE_API_KEY = '' # @param {type:"string"}
PINECONE_ENV = 'gcp-starter' # @param {type:"string"}

pinecone.init(
 api_key=PINECONE_API_KEY,
 environment=PINECONE_ENV
)

Now we create a new index called `semantic-search-fast-encrypted`. 

It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [7]:
import time

index_name = 'semantic-search-fast-encrypted'

# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
 pinecone.create_index(
 name=index_name,
 dimension=len(dataset.documents.iloc[0]['values']),
 metric='cosine'
 )
 # wait a moment for the index to be fully initialized
 time.sleep(1)

# now connect to the index
index = pinecone.GRPCIndex(index_name)
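
The `metric='cosine'` setting means matches are ranked by the angle between vectors rather than by their length, and the `dimension` must equal the length of every vector we upsert. As a minimal illustration in plain Python (the `cosine_similarity` helper is our own sketch, not part of the Pinecone client):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # A mismatch like this is what aligning `dimension` avoids.
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is (approximately) 1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```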

Now we initialize IronCore with a dummy key. We give the vector secret a name (`"quora"` here) because different names result in different derived keys, so separate indices can each be encrypted under their own key.

We aren't initializing deterministic secrets here because we don't have any metadata that we want to filter on.

In [8]:
# Note: in practice this must be 32 cryptographically-secure bytes
key_bytes = b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
approximation_factor = 2.0
vector_secrets = {
 "quora":
 alloy.VectorSecret(
 approximation_factor,
 alloy.RotatableSecret(alloy.StandaloneSecret(1, alloy.Secret(key_bytes)), None),
 )
}
standard_secrets = alloy.StandardSecrets(1, [alloy.StandaloneSecret(1, alloy.Secret(key_bytes))])
deterministic_secrets = {}
tenant_id = alloy.AlloyMetadata.new_simple("") # not needed in our case so we'll leave it blank
config = alloy.StandaloneConfiguration(standard_secrets, deterministic_secrets, vector_secrets) # sdk gets set up with required master secrets
sdk = alloy.Standalone(config)
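
The dummy key above is only suitable for a demo. As the code comment notes, a real key must be 32 cryptographically secure bytes. One way to generate such a key is Python's standard `secrets` module. This is a sketch only; where the key is stored and how it is loaded is up to your own key management and is not covered here.

```python
import secrets

KEY_LENGTH = 32  # the length the comment in the cell above calls for

# Draw random bytes from the operating system's CSPRNG.
key_bytes = secrets.token_bytes(KEY_LENGTH)

# Never print or log a real key; we only show its length here.
print(len(key_bytes))
```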

Next we transform the dataset by encrypting the vectors and the metadata (the original sentence which led to the vector).

In [None]:
# encrypt each row's vector and its metadata
for row in dataset.documents.itertuples():
 plaintext_vector = alloy.PlaintextVector(row.values, "quora", "sentence") # each index and set of vectors encrypted with different derived keys
 # first we encrypt the dense vector
 encrypted_vector = await sdk.vector().encrypt(plaintext_vector, tenant_id)
 # then we encrypt the "metadata" -- in this case the source text used to create the vector
 encrypted_metadata = await sdk.standard().encrypt({"text": bytes(row.metadata["text"], "utf-8")}, tenant_id)
 # update those values in place
 dataset.documents.at[row.Index, 'values'] = encrypted_vector.encrypted_vector
 dataset.documents.at[row.Index, 'metadata'] = {"text": base64.b64encode(encrypted_metadata.document["text"]).decode(), "edek": base64.b64encode(encrypted_metadata.edek).decode()}
dataset.head()
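
The encrypted text and the EDEK are raw bytes, while Pinecone metadata values must be strings, numbers, booleans, or lists of strings. That is why the loop above base64-encodes both fields before storing them. The round trip can be sketched on its own with placeholder bytes. The helper names `pack_encrypted_metadata` and `unpack_encrypted_metadata` are hypothetical and not part of `ironcore-alloy`:

```python
import base64

def pack_encrypted_metadata(edek: bytes, encrypted_text: bytes) -> dict:
    """Encode binary fields as ASCII strings that fit in Pinecone metadata."""
    return {
        "text": base64.b64encode(encrypted_text).decode("ascii"),
        "edek": base64.b64encode(edek).decode("ascii"),
    }

def unpack_encrypted_metadata(metadata: dict) -> tuple:
    """Reverse of pack_encrypted_metadata: returns (edek, encrypted_text)."""
    return (
        base64.b64decode(metadata["edek"]),
        base64.b64decode(metadata["text"]),
    )

# Placeholder bytes standing in for real ciphertext (not actual SDK output).
fake_edek = b"\x00\x01\xfe\xff"
fake_ciphertext = b"\x80binary\x00bytes"

packed = pack_encrypted_metadata(fake_edek, fake_ciphertext)
restored_edek, restored_text = unpack_encrypted_metadata(packed)
```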

Upsert the data to Pinecone:

In [11]:
for batch in dataset.iter_documents(batch_size=100):
 index.upsert(batch)
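
Here `iter_documents(batch_size=100)` does the batching for us, so each upsert request carries at most 100 records. For data that does not come from a Pinecone dataset, the same pattern can be sketched with the standard library. `iter_batches` below is a hypothetical helper, not part of `pinecone-datasets`:

```python
from itertools import islice

def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 250 stand-in records split into batches of up to 100.
batch_sizes = [len(batch) for batch in iter_batches(range(250), 100)]
print(batch_sizes)  # [100, 100, 50]
```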

## Making Queries

Now that our index is populated, we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

Now let's query.

In [None]:
query = "which city has the highest population in the world?"
# create the query vector
plaintext_query = model.encode(query).tolist()
xq = alloy.PlaintextVector(plaintext_query, "quora", "sentence")
query_vectors = (await sdk.vector().generate_query_vectors({"vec_1": xq},tenant_id))["vec_1"]

# now query Pinecone
xc = index.query(vector=query_vectors[0].encrypted_vector, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '69331',
 'metadata': {'edek': 'AAAAAYIACiQKIIfMTdKK1OAnq9sLfxSRkAcRjbAMZglb7B6pTAoAYbU2EAESRxJFGkMKDCJv4uq4E78NFE9wzRIwslsIKXpbLh2BfYSCmthybrUFMXy4t/WC7B2AloRnREMYg0zUolNG/AkVg5c7HPb1GgEx',
 'text': 'AElST05kAhZYeiavN9DMSc38Y+7ZOMCYbhYH95tcwZs8XU6ck9kadPvsJQdZonLxhEZ9QvnN0C71z7B4jPmMgqlS'},
 'score': 0.63020325,
 'sparse_values': {'indices': [], 'values': []},
 'values': []},
 {'id': '69332',
 'metadata': {'edek': 'AAAAAYIACiQKIDOq3CAbvB0hAc1/8eMwOUueFGFqAJP9y3liWkrr5kiCEAESRxJFGkMKDOyTdVZeoVTXkgfjCBIwX3ArEmEDPcCqeP3UafCjut6h+CBXDT/FmGwE+cjnNPFXPHkgISwX5Scc1oP6dRZ9GgEx',
 'text': 'AElST075rB5v8U1MO+fVHD3UUXVX01anc1glUn/EcXL4UN4+cCRgbAuPlRhZrefT1YEEhRBFCtTidOE='},
 'score': 0.6295936,
 'sparse_values': {'indices': [], 'values': []},
 'values': []},
 {'id': '84749',
 'metadata': {'edek': 'AAAAAYIACiQKINBpNo5BTuz8BSRDh12i3SK37Pj6OtpQhmtb73iNTO1LEAESRxJFGkMKDCiE277l183LVUMzABIwAxAU3DH+v6BKsJUth5N2K9itA5E0mS7WpuRkkkLATLwefHYwr6cHxl9/oAz8Q6WZGgEx',
 'text': 'AElST06NJYdK

In the returned response `xc` we can see the questions most relevant to our query. There are no exact matches, but the top results score around 0.63 against the input. The metadata comes back encrypted, so we decrypt the matches to see the results.

In [14]:
for result in xc['matches']:
	recreated = alloy.EncryptedDocument(base64.b64decode(result['metadata']['edek']), {"text":base64.b64decode(result['metadata']['text'])})
	decrypted = await sdk.standard().decrypt(recreated, tenant_id) # decrypt the metadata
	# we could also decrypt the vector, but it isn't returned and we don't need it
	print(f"{round(result['score'], 2)}: {decrypted['text'].decode('utf-8')}")

0.63: What's the world's largest city?
0.63: What is the biggest city?
0.6: What are the world's most advanced cities?
0.58: Where is the most beautiful city in the world?
0.53: Which city in India is the best to live?


These are good results; let's try modifying the words being used to see if we still surface similar results.

In [15]:
query = "which metropolis has the highest number of people?" # @param {type:"string"}
# create the query vector
xq = alloy.PlaintextVector(model.encode(query).tolist(), "quora", "sentence")
query_vectors = (await sdk.vector().generate_query_vectors({"vec_1": xq}, tenant_id))["vec_1"]
# now query
xc = index.query(vector=query_vectors[0].encrypted_vector, top_k=5, include_metadata=True)

for result in xc['matches']:
	recreated = alloy.EncryptedDocument(base64.b64decode(result['metadata']['edek']), {"text":base64.b64decode(result['metadata']['text'])})
	decrypted = await sdk.standard().decrypt(recreated, tenant_id)
	print(f"{round(result['score'], 2)}: {decrypted['text'].decode('utf-8')}")

0.56: What is the biggest city?
0.5: Which city in India is the best to live?
0.5: What's the world's largest city?
0.49: What is the most dangerous city in USA? Why?
0.48: What is the greatest, most beautiful city in the world?


Here we used different terms in our query than those in the returned documents. We substituted **"metropolis"** for **"city"** and **"number of people"** for **"population"**.

Despite these very different terms and the *lack* of term overlap between the query and the returned documents, we still get highly relevant results. This is the power of *semantic search*.
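
The lack of term overlap can be checked directly. As a rough sketch, the snippet below computes the Jaccard overlap between the word sets of the second query and its top result. The tokenizer is a deliberately simple assumption (lowercase words, punctuation ignored) and is unrelated to how the model itself processes text.

```python
import re

def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Shared words divided by all distinct words across both texts."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

query = "which metropolis has the highest number of people?"
top_result = "What is the biggest city?"

shared = tokens(query) & tokens(top_result)
overlap = jaccard(query, top_result)
print(shared, round(overlap, 3))  # {'the'} 0.083
```

The only shared word is "the", so a purely keyword-based match would have little to go on here.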

You can go ahead and ask more questions above. When you're done, delete the index to save resources.

In [None]:
pinecone.delete_index(index_name)

We now recommend you delete the API key you used here to prevent misuse.

---