feat: Hybrid Retrieval #1398

yiyiyi0817 · 2025-01-05T09:21:08Z

Description

Adds hybrid retrieval, which combines auto retrieval and BM25 retrieval.

Motivation and Context

I have raised an issue to propose this change ([Feature Request] RAG: Hybrid Retrieval with re-rank #1191)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds core functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)
Example (update in the folder of example)

More Tasks

A possible future improvement is to separate the chunking and processing steps from the original vector-based and BM25 retrieval code, so that chunks can be numbered uniformly instead of being deduplicated by string matching, as the current version does.
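
To illustrate the follow-up described above: if both retrievers worked from the same pre-numbered chunks, their results could be merged by chunk ID rather than by comparing text. The following is only a sketch of that idea. `Chunk`, `number_chunks`, and `merge_by_id` are hypothetical names and are not part of this PR.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    # Hypothetical record: an ID assigned once at chunking time, plus the text.
    chunk_id: int
    text: str


def number_chunks(texts: List[str]) -> List[Chunk]:
    """Assign each chunk a stable integer ID in document order."""
    return [Chunk(chunk_id=i, text=t) for i, t in enumerate(texts)]


def merge_by_id(
    vector_hits: List[Chunk], bm25_hits: List[Chunk]
) -> Dict[int, Chunk]:
    """Union two result lists, deduplicating on chunk_id instead of text."""
    merged: Dict[int, Chunk] = {}
    for hit in vector_hits + bm25_hits:
        # setdefault keeps the first occurrence of each chunk_id.
        merged.setdefault(hit.chunk_id, hit)
    return merged


chunks = number_chunks(["alpha", "beta", "gamma"])
merged = merge_by_id([chunks[0], chunks[1]], [chunks[1], chunks[2]])
print(sorted(merged))  # IDs of chunks returned by either retriever
```

Keying on an ID avoids false mismatches when the two retrievers normalize or truncate chunk text differently.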

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

I have read the CONTRIBUTION guide. (required)
My change requires a change to the documentation.
I have updated the tests accordingly. (required for a bug fix or a new feature)
I have updated the documentation accordingly.

liuxukun2000 · 2025-01-06T04:45:22Z

camel/retrievers/hybrid_retrival.py

+from camel.types import EmbeddingModelType, StorageType
+
+
+class HybridRetriever:


Hi! Maybe extending BaseRetriever here would help maintain consistency.

My earlier thinking was that HybridRetriever and AutoRetriever are similar higher-level classes rather than base retrieval components. I noticed that AutoRetriever does not inherit from BaseRetriever either, so I'm not sure whether both should inherit from it. WDYT?

Thank you for pointing this out. What do you think, @Wendong-Fan?

Thanks, @liuxukun2000 and @yiyiyi0817! The BaseRetriever is a minimal, abstract base class, while the HybridRetriever extends both VectorRetriever and BM25Retriever, operating at a higher level. However, we can still inherit from BaseRetriever to implement the process method.
I think it would be better not to include AutoRetriever directly within HybridRetriever. AutoRetriever is a simple implementation designed to allow users to quickly run our RAG pipeline. Its primary purpose is to provide an easy entry point for users to try our RAG functionality, so it should remain at the top level for user interaction. HybridRetriever should depend solely on VectorRetriever and BM25Retriever. Perhaps later, we can consider integrating HybridRetriever into AutoRetriever for an enhanced user experience. WDYT?

@WenDong OK, I agree with you.
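
The structure agreed on in this thread (a hybrid class that implements the base interface and delegates to two sub-retrievers) can be sketched roughly as follows. This is a toy illustration under stated assumptions: every class here is a hypothetical stand-in, the `process`/`query` signatures are assumed, and the real `BaseRetriever`, `VectorRetriever`, and `BM25Retriever` in the repository may differ.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseRetrieverSketch(ABC):
    """Stand-in for a minimal abstract retriever interface (assumed shape)."""

    @abstractmethod
    def process(self, content: str) -> None: ...

    @abstractmethod
    def query(self, query: str, top_k: int) -> List[Dict[str, Any]]: ...


class KeywordRetrieverSketch(BaseRetrieverSketch):
    """Toy sub-retriever: ranks stored lines by shared-word count."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def process(self, content: str) -> None:
        self.lines = content.splitlines()

    def query(self, query: str, top_k: int) -> List[Dict[str, Any]]:
        words = set(query.lower().split())
        ranked = sorted(
            self.lines,
            key=lambda line: -len(words & set(line.lower().split())),
        )
        return [{"text": line} for line in ranked[:top_k]]


class HybridRetrieverSketch(BaseRetrieverSketch):
    """Implements the base interface by delegating to two sub-retrievers."""

    def __init__(
        self, first: BaseRetrieverSketch, second: BaseRetrieverSketch
    ) -> None:
        self.first = first
        self.second = second

    def process(self, content: str) -> None:
        # Both sub-retrievers index the same content.
        self.first.process(content)
        self.second.process(content)

    def query(self, query: str, top_k: int) -> List[Dict[str, Any]]:
        # Naive merge: concatenate, drop repeated texts, keep top_k.
        seen, merged = set(), []
        for hit in self.first.query(query, top_k) + self.second.query(query, top_k):
            if hit["text"] not in seen:
                seen.add(hit["text"])
                merged.append(hit)
        return merged[:top_k]


hybrid = HybridRetrieverSketch(KeywordRetrieverSketch(), KeywordRetrieverSketch())
hybrid.process("red apple\ngreen pear\nred car")
hits = hybrid.query("red apple", top_k=2)
print([hit["text"] for hit in hits])
```

The hybrid class depends only on the two sub-retrievers, which leaves a higher-level entry point such as AutoRetriever free to wrap it later.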

liuxukun2000 · 2025-01-08T20:39:13Z

camel/retrievers/hybrid_retrival.py

+            contents=self.content_input_path,
+            top_k=vector_retriever_top_k,
+            similarity_threshold=vector_retriever_similarity_threshold,
+            return_detailed_info=True,


Should this be return_detailed_info=return_detailed_info?

Thank you for your message. Setting return_detailed_info=True here is intentional and fixed: it is needed to obtain detailed results from auto_retriever. The return_detailed_info parameter of the query function controls whether the final output contains detailed results that include the RRF score. This part may be refactored later to use vector_retriever instead.

liuxukun2000 · 2025-01-08T20:43:27Z

examples/rag/single_agent_with_hybrid_rag.py

+    return assistant_response.msg.content
+
+
+print(single_agent("What is it like to be a visiting student at KAUST?"))


It would be better to include the expected output for single_agent("What is it like to be a visiting student at KAUST?") at the end of the file, consistent with other examples.

liuxukun2000 · 2025-01-08T20:48:12Z

camel/retrievers/hybrid_retrival.py

+            "Original Query": query,
+            "Retrieved Context": text_retrieved_info,
+        }
+        if return_detailed_info:


The logic in this section can be simplified to avoid redundancy.

Suggested change

-        if return_detailed_info:
+        retrieved_info = {
+            "Original Query": query,
+            "Retrieved Context": all_retrieved_info if return_detailed_info else [item['text'] for item in all_retrieved_info],
+        }
+        return retrieved_info
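
As a standalone illustration of the suggested simplification, the sketch below builds the result dictionary in one place and switches only the context payload. `build_result` and the `rrf_score` key are hypothetical names chosen for this example; the keys `"Original Query"` and `"Retrieved Context"` come from the diff above.

```python
from typing import Any, Dict, List


def build_result(
    query: str,
    all_retrieved_info: List[Dict[str, Any]],
    return_detailed_info: bool = False,
) -> Dict[str, Any]:
    """Return full records when detail is requested, otherwise only the text."""
    context = (
        all_retrieved_info
        if return_detailed_info
        else [item["text"] for item in all_retrieved_info]
    )
    return {"Original Query": query, "Retrieved Context": context}


# Illustrative data only; the scores are made up for the example.
hits = [
    {"text": "chunk A", "rrf_score": 0.03},
    {"text": "chunk B", "rrf_score": 0.02},
]
brief = build_result("example query", hits)
detailed = build_result("example query", hits, return_detailed_info=True)
print(brief["Retrieved Context"])
print(detailed["Retrieved Context"])
```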

yiyiyi0817

Thanks for your review. I will push a new version soon.

Aaron617 · 2025-01-22T07:06:50Z

camel/retrievers/hybrid_retrival.py

+            vector_storage (Optional[BaseVectorStorage]): An optional vector
+                storage used by the VectorRetriever. Defaults to None.
+        """
+        self.vr = VectorRetriever(embedding_model, vector_storage)


Do we need error handling for VectorRetriever initialization errors?

If VectorRetriever initialization fails, I think the VectorRetriever class itself will raise a relevant error, which should give users the information they need.

Aaron617 · 2025-01-22T07:09:47Z

camel/retrievers/hybrid_retrival.py

+        vector_retriever_results: List[Dict[str, Any]],
+        bm25_retriever_results: List[Dict[str, Any]],
+        top_k: int,
+        vector_weight: float,


There is no validation that vector_weight + bm25_weight == 1. Another option would be to accept a single weight parameter such as vector_weight and compute bm25_weight as 1 - vector_weight.

For the RRF score, only the ratio between the two weights matters: coefficients of 0.2 and 0.8 give the same ranking as 20 and 80, which is why I did not assert this. That said, checking that the coefficients sum to 1 might make usage more standardized. WDYT?
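
The scale-invariance argument can be checked with a small weighted reciprocal rank fusion sketch. It assumes the common form score = w_vector / (k + vector_rank) + w_bm25 / (k + bm25_rank) with k = 60; the exact formula and constant used in this PR may differ, and `weighted_rrf` is a hypothetical helper written for this example.

```python
from typing import Dict, List

RRF_K = 60  # smoothing constant; 60 is a common choice and is assumed here


def weighted_rrf(
    vector_ranks: Dict[str, int],
    bm25_ranks: Dict[str, int],
    vector_weight: float,
    bm25_weight: float,
) -> List[str]:
    """Return document IDs ordered by weighted RRF score, best first."""
    scores: Dict[str, float] = {}
    for doc_id in set(vector_ranks) | set(bm25_ranks):
        score = 0.0
        # A document missing from one list contributes nothing for that list.
        if doc_id in vector_ranks:
            score += vector_weight / (RRF_K + vector_ranks[doc_id])
        if doc_id in bm25_ranks:
            score += bm25_weight / (RRF_K + bm25_ranks[doc_id])
        scores[doc_id] = score
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_ranks = {"a": 1, "b": 2, "c": 3}
bm25_ranks = {"c": 1, "d": 2, "a": 3}

small = weighted_rrf(vector_ranks, bm25_ranks, 0.2, 0.8)
large = weighted_rrf(vector_ranks, bm25_ranks, 20, 80)

# Single-parameter variant from the review: derive the second weight.
vector_weight = 0.2
single = weighted_rrf(vector_ranks, bm25_ranks, vector_weight, 1 - vector_weight)

print(small, large, single)
```

Multiplying both weights by the same positive constant scales every score by that constant, so the ordering is unchanged. A sum-to-one check or a single weight parameter is therefore an API clarity choice rather than a correctness requirement.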

Aaron617 · 2025-01-22T07:13:20Z

camel/retrievers/hybrid_retrival.py

+
+        vector_ranks = np.array(
+            [
+                info.get('vector_rank', float('inf'))


Will this be problematic? It makes it hard to distinguish between "not found" and "ranked last". Also, shouldn't "rank" be of type int?

For the RRF score, any chunk outside a retriever's top_k results can be treated as ranked last, so its contribution to the score is 0. (ref: https://colab.research.google.com/drive/1iwVJrN96fiyycxN1pBqWlEr_4EPiGdGy#scrollTo=0qh83qGV2dY8)
As for the int type of rank, I'm not sure I understand what you mean. Could you explain more?
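
A minimal sketch of why an infinite default rank behaves as "contributes nothing", assuming the usual reciprocal form 1 / (k + rank) with k = 60. `rrf_contribution` is a hypothetical helper. The PR itself collects ranks into a NumPy array, while this example uses plain Python floats, which follow the same IEEE 754 arithmetic.

```python
import math
from typing import Any, Dict

RRF_K = 60  # assumed smoothing constant


def rrf_contribution(rank: float) -> float:
    """Reciprocal-rank term for one retriever; tends to 0 as rank grows."""
    return 1.0 / (RRF_K + rank)


hit: Dict[str, Any] = {"text": "chunk A", "vector_rank": 1}
miss: Dict[str, Any] = {"text": "chunk B"}  # not returned by the vector retriever

found = rrf_contribution(hit.get("vector_rank", math.inf))
absent = rrf_contribution(miss.get("vector_rank", math.inf))

print(found, absent)
print(type(math.inf).__name__)  # infinity exists only as a float
```

Because infinity has no integer representation, a container that mixes integer ranks with this default ends up holding floats. That is harmless for the score, but it does mean "not found" and "ranked last" are indistinguishable once the default is applied.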

yiyiyi0817

Thanks very much for Mengkang's review. I have answered some of the questions.

Wendong-Fan

Thanks @yiyiyi0817 !

Co-authored-by: Xukun Liu <[email protected]> Co-authored-by: Wendong-Fan <[email protected]>

yiyiyi0817 added 3 commits January 4, 2025 00:58

add docstring

34a3049

fix mypy

6443e6d

remove local folder for vector storage

a6ef0f8

Wendong-Fan requested review from MuggleJinx and liuxukun2000 January 5, 2025 09:22

Wendong-Fan assigned yiyiyi0817 Jan 5, 2025

Wendong-Fan added the New Feature label Jan 5, 2025

Wendong-Fan added this to the Sprint 20 milestone Jan 5, 2025

Wendong-Fan changed the title from "RAG: Hybrid Retrieval" to "feat: Hybrid Retrieval" Jan 5, 2025

Wendong-Fan modified the milestones: Sprint 20, Sprint 19 Jan 5, 2025

remove unalignment parameter

7167533

liuxukun2000 requested changes Jan 6, 2025

liuxukun2000 reviewed Jan 8, 2025

yiyiyi0817 commented Jan 9, 2025

yiyiyi0817 added 5 commits January 14, 2025 16:48

Merge branch 'master' into hybrid_retrival

344b133

change to vector retrival without docstring

d7ae2bf

simplified output

8e9df2a

Merge branch 'master' into hybrid_retrival

c9507b3

fix pytest

f35b4dd

yiyiyi0817 requested review from liuxukun2000 and Wendong-Fan January 14, 2025 16:05

Merge branch 'master' into hybrid_retrival

542e510

liuxukun2000 approved these changes Jan 17, 2025

Aaron617 reviewed Jan 22, 2025

Merge branch 'master' into hybrid_retrival

d1be87d

yiyiyi0817 commented Jan 24, 2025

Merge branch 'master' into hybrid_retrival

8c43d1b

Wendong-Fan approved these changes Feb 16, 2025

Wendong-Fan merged commit e9c14d2 into master Feb 16, 2025
6 checks passed

Wendong-Fan deleted the hybrid_retrival branch February 16, 2025 10:33

apokryphosx pushed a commit that referenced this pull request Feb 16, 2025

feat: Hybrid Retrieval (#1398)

1d2e6e0

Co-authored-by: Xukun Liu <[email protected]> Co-authored-by: Wendong-Fan <[email protected]>
