RAG¶
Vector search powers recommendation engines, chatbots, AI agents, and search engines.
Traditional keyword search works by matching exact words. When you need to search through images, audio, video, code, or unstructured text, keyword matching is not effective.
Instead of relying on keywords, vector search uses embeddings to represent data as high-dimensional vectors, capturing semantic meaning and context. This allows for more accurate and relevant search results based on the actual content and meaning of the data.
Open the Qdrant web UI at http://localhost:6333/dashboard
Install packages
Import packages
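A sketch of the setup, assuming the notebook uses `qdrant-client` with its fastembed integration (install once with `pip install -q "qdrant-client[fastembed]"`):

```python
import json

# qdrant-client bundles the fastembed integration used for local embedding
from qdrant_client import QdrantClient, models
```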
Download documents
{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
'section': 'General course-related questions',
'question': 'Course - When will the course start?',
'course': 'data-engineering-zoomcamp'}
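A sketch of the download-and-flatten step. The URL and the raw file layout (a list of courses, each with a `documents` list) are assumptions inferred from the record shown above; adjust to the actual source if it differs.

```python
import json
from urllib.request import urlopen

# assumed location of the FAQ dump
docs_url = (
    "https://raw.githubusercontent.com/alexeygrigorev/"
    "llm-rag-workshop/main/notebooks/documents.json"
)

def flatten(documents_raw):
    """Attach the course name to every FAQ record and return a flat list."""
    documents = []
    for course in documents_raw:
        for doc in course["documents"]:
            doc["course"] = course["course"]
            documents.append(doc)
    return documents

# documents_raw = json.load(urlopen(docs_url))  # network call; run in the notebook
# documents = flatten(documents_raw)
```

Each flattened record then carries `text`, `section`, `question`, and `course` keys, like the sample shown above.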
Create a Qdrant client instance
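A minimal sketch, assuming Qdrant is running locally (e.g. `docker run -p 6333:6333 qdrant/qdrant`):

```python
from qdrant_client import QdrantClient

# connects to the local instance whose dashboard is at http://localhost:6333/dashboard
client = QdrantClient(url="http://localhost:6333")
```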
Verify models compatible with current setup
{
"model": "BAAI/bge-small-zh-v1.5",
"sources": {
"hf": "Qdrant/bge-small-zh-v1.5",
"url": "https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz",
"_deprecated_tar_struct": true
},
"model_file": "model_optimized.onnx",
"description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.",
"license": "mit",
"size_in_GB": 0.09,
"additional_files": [],
"dim": 512,
"tasks": {}
}
{
"model": "Qdrant/clip-ViT-B-32-text",
"sources": {
"hf": "Qdrant/clip-ViT-B-32-text",
"url": null,
"_deprecated_tar_struct": false
},
"model_file": "model.onnx",
"description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation, Prefixes for queries/documents: not necessary, 2021 year",
"license": "mit",
"size_in_GB": 0.25,
"additional_files": [],
"dim": 512,
"tasks": {}
}
{
"model": "jinaai/jina-embeddings-v2-small-en",
"sources": {
"hf": "xenova/jina-embeddings-v2-small-en",
"url": null,
"_deprecated_tar_struct": false
},
"model_file": "onnx/model.onnx",
"description": "Text embeddings, Unimodal (text), English, 8192 input tokens truncation, Prefixes for queries/documents: not necessary, 2023 year.",
"license": "apache-2.0",
"size_in_GB": 0.12,
"additional_files": [],
"dim": 512,
"tasks": {}
}
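The entries above come from fastembed's model registry (`TextEmbedding.list_supported_models()`). A pure-Python sketch of narrowing the list down programmatically; the inline sample mirrors the three entries printed above:

```python
# Inline sample mirroring the listing above; in the notebook this list comes
# from fastembed's TextEmbedding.list_supported_models()
supported = [
    {"model": "BAAI/bge-small-zh-v1.5", "dim": 512, "size_in_GB": 0.09},
    {"model": "Qdrant/clip-ViT-B-32-text", "dim": 512, "size_in_GB": 0.25},
    {"model": "jinaai/jina-embeddings-v2-small-en", "dim": 512, "size_in_GB": 0.12},
]

# keep small (< 0.2 GB) models with 512-dimensional output
candidates = [m["model"] for m in supported
              if m["dim"] == 512 and m["size_in_GB"] < 0.2]
```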
Choose model
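Assuming the jina model shown above is the one chosen (English text, 8192-token truncation, small download), a sketch of pinning it down:

```python
# Used throughout the rest of the notebook; the 512-dimensional output
# must match the collection's vector size below.
model_handle = "jinaai/jina-embeddings-v2-small-en"
EMBEDDING_DIMENSIONALITY = 512
```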
Vector Search¶
Define a collection
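A sketch of the collection definition; the collection name is hypothetical, and a local Qdrant instance is assumed:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

client.create_collection(
    collection_name="zoomcamp-rag",                 # hypothetical name
    vectors_config=models.VectorParams(
        size=512,                                   # must match the model's "dim"
        distance=models.Distance.COSINE,
    ),
)
```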
Create points
Generate embeddings and insert into Qdrant
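A sketch of both steps together, assuming the same hypothetical collection name; `models.Document` defers embedding to the client, which runs the fastembed ONNX model locally when the points are upserted:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")   # assumes a local Qdrant instance
model_handle = "jinaai/jina-embeddings-v2-small-en"

# `documents` is the flattened FAQ list; one record shown for illustration.
documents = [
    {
        "text": "The purpose of this document is to capture frequently asked technical questions",
        "section": "General course-related questions",
        "question": "Course - When will the course start?",
        "course": "data-engineering-zoomcamp",
    },
]

points = [
    models.PointStruct(
        id=i,
        # embedded locally by fastembed at upsert time
        vector=models.Document(text=doc["text"], model=model_handle),
        payload=doc,
    )
    for i, doc in enumerate(documents)
]

client.upsert(collection_name="zoomcamp-rag", points=points)  # hypothetical name
```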
{
"text": "I have faced a problem while reading the large parquet file. I tried some workarounds but they were NOT successful with Jupyter.\nThe error message is:\nIndexError: index 311297 is out of bounds for axis 0 with size 131743\nI solved it by performing the homework directly as a python script.\nAdded by Ibraheem Taha (ibraheemtaha91@gmail.com)\nYou can try using the Pyspark library\nAnswered by kamaldeen (kamaldeen32@gmail.com)",
"section": "Module 1: Introduction",
"question": "Reading large parquet files",
"course": "mlops-zoomcamp"
}
QueryResponse(points=[ScoredPoint(id=237, version=1, score=0.86789715, payload={'text': 'The read_parquet function supports a list of files as an argument. The list of files will be merged into a single result table.', 'section': "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", 'course': 'data-engineering-zoomcamp'}, vector=None, shard_key=None, order_value=None)])
Index fields that will be used as filters
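A sketch of indexing the `course` payload field for exact-match filtering (collection name hypothetical):

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# a "keyword" index enables fast exact-match filters on the course name
client.create_payload_index(
    collection_name="zoomcamp-rag",   # hypothetical name
    field_name="course",
    field_schema="keyword",
)
```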
Search with filters
Apply filters in search
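A sketch of a filtered semantic query; the query text here is illustrative, chosen to match the deadline answers shown below:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

results = client.query_points(
    collection_name="zoomcamp-rag",   # hypothetical name
    query=models.Document(
        text="Can I still submit homework after the deadline?",
        model="jinaai/jina-embeddings-v2-small-en",
    ),
    # restrict the semantic search to one course via the indexed payload field
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="course",
                match=models.MatchValue(value="data-engineering-zoomcamp"),
            )
        ]
    ),
    limit=1,
    with_payload=True,
)
```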
data-engineering-zoomcamp
No, late submissions are not allowed. But if the form is still open after the due date, you can still submit the homework. Confirm your submission by the date-timestamp on the Course page.
Older news:[source1] [source2]
machine-learning-zoomcamp
Depends on whether the form will still be open. If you're lucky and it's open, you can submit your homework and it will be evaluated. If it's closed, it's too late.
(Added by Rileen Sinha, based on answer by Alexey on Slack)
mlops-zoomcamp
Please choose the closest one to your answer. Also do not post your answer in the course slack channel.
RAG with Vector Search¶
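The answer below is presumably produced by a loop of this shape: retrieve relevant FAQ records with vector search, pack them into a prompt, and hand the prompt to an LLM. A minimal sketch; `vector_search` and `llm` are hypothetical wrappers around the Qdrant query and an LLM API call, and only `build_prompt` runs as-is:

```python
def build_prompt(query, search_results):
    """Pack retrieved FAQ records into a grounded prompt for the LLM."""
    context = "\n\n".join(
        f"section: {d['section']}\nquestion: {d['question']}\nanswer: {d['text']}"
        for d in search_results
    )
    return (
        "You're a course teaching assistant. Answer the QUESTION based only on "
        f"the CONTEXT from the FAQ database.\n\nQUESTION: {query}\n\nCONTEXT:\n{context}"
    )

def rag(query):
    hits = vector_search(query)   # hypothetical: wraps client.query_points
    prompt = build_prompt(query, hits)
    return llm(prompt)            # hypothetical: wraps an LLM API call
```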
To run Kafka, you typically need to ensure the Kafka broker is running and then execute your client applications (producers, consumers).
1. **Start the Kafka Broker (if using Docker):**
If you encounter a `kafka.errors.NoBrokersAvailable` error, it likely means your Kafka broker Docker container isn't working. You can confirm this with `docker ps`. To start all instances, navigate to the docker compose yaml file folder and run:
`docker compose up -d`
2. **Run Java Kafka Applications (e.g., Producer/Consumer/KStreams):**
In the project directory, run:
`java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java`
(Note: Replace `JsonProducer.java` with your specific application, like `JsonConsumer.java`, etc.)
3. **Run Python Kafka Applications (e.g., producer.py):**
First, make sure the Docker images (including the Kafka broker) are up and running. If you encounter a "Module `kafka` not found" error, you should create and activate a virtual environment and install the necessary packages.
* To create a virtual environment and install packages (run only once):
`python -m venv env`
`source env/bin/activate` (or `env/Scripts/activate` on Windows)
`pip install -r ../requirements.txt`
* To activate it (you'll need to run it every time you need the virtual env):
`source env/bin/activate` (or `env/Scripts/activate` on Windows)
Then you can run your Python Kafka files within this environment.
Hybrid Search¶
Methods such as Bag of Words, TF-IDF, and BM25 are still widely used in search applications and are sometimes preferred over dense embeddings.
Keyword-based search is also implemented as a vector search, but these vectors are usually sparse, meaning that most of their dimensions are zero. In contrast, dense embeddings have most of their dimensions filled with non-zero values. In sparse vectors, each word/phrase gets a unique position in the vector space.
There are plenty of different options for creating sparse embeddings, but BM25 is one of the most popular ones. It's a statistical model, which makes it really fast and lightweight.
BM25 stands for Best Matching 25, and it's a ranking function that helps determine how relevant a document is to a query by combining Term Frequency (TF), Inverse Document Frequency (IDF), and document length normalization to prevent longer documents from being unfairly favored.
Sparse vectors can return no results if none of the keywords in the query match the keywords in the documents. Dense embeddings, on the other hand, will always return results, even if they are not relevant.
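To make the TF, IDF, and length-normalization pieces concrete, here is a minimal pure-Python BM25 scorer. This is a sketch for illustration only; real systems use optimized implementations (e.g. Qdrant's sparse vectors with the bm25 model in fastembed):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in tokenized) / N   # average document length
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            n_t = sum(1 for t in tokenized if term in t)  # docs containing the term
            if n_t == 0:
                continue  # unseen term contributes nothing (sparse "no results" case)
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
            f = tf[term]
            # length normalization: longer docs are penalized via b
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores
```

Note that a query whose terms appear in no document scores zero everywhere, illustrating why sparse search can come back empty while dense search always returns nearest neighbors.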
If you get an error while running the command python3 stream.py worker
Run pip uninstall kafka-python
Then run pip install kafka-python==1.4.6
What is the use of Redpanda ?
Redpanda: Redpanda is built on top of the Raft consensus algorithm and is designed as a high-performance, low-latency alternative to Kafka. It uses a log-centric architecture similar to Kafka but with different underlying principles.
Redpanda is a powerful, yet simple, and cost-efficient streaming data platform that is compatible with Kafka® APIs while eliminating Kafka complexity.
{
"text": "Even though the upload works using aws cli and boto3 in Jupyter notebook.\nSolution set the AWS_PROFILE environment variable (the default profile is called default)",
"section": "Module 4: Deployment",
"question": "Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\"",
"course": "mlops-zoomcamp"
}
Reranking¶
Fusion¶
Fusion reranking method
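Qdrant exposes fusion reranking as Reciprocal Rank Fusion (`models.Fusion.RRF` in the query API). The idea in pure Python, as a sketch: each document's fused score is the sum of 1/(k + rank) over the ranked lists (e.g. dense and sparse) it appears in, with k = 60 the usual constant:

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # a document gains 1/(k + rank) from every list it appears in
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both the dense and the sparse list float to the top, even if neither list put them first.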
Results
{
"text": "Even though the upload works using aws cli and boto3 in Jupyter notebook.\nSolution set the AWS_PROFILE environment variable (the default profile is called default)",
"section": "Module 4: Deployment",
"question": "Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\"",
"course": "mlops-zoomcamp"
}
When executing an AWS CLI command (e.g., aws s3 ls), you can get the error <botocore.awsrequest.AWSRequest object at 0x7fbaf2666280>.
To fix it, simply set the AWS CLI environment variables:
export AWS_DEFAULT_REGION=eu-west-1
export AWS_ACCESS_KEY_ID=foobar
export AWS_SECRET_ACCESS_KEY=foobar
Their value is not important; any value will work.
Added by Giovanni Pecoraro