-
Notifications
You must be signed in to change notification settings - Fork 890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Hybrid Retrieval #1398
feat: Hybrid Retrieval #1398
Conversation
camel/retrievers/hybrid_retrival.py
Outdated
from camel.types import EmbeddingModelType, StorageType | ||
|
||
|
||
class HybridRetriever: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! Maybe extending BaseRetriever
here would help maintain consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous thought was that HybridRetriever
and AutoRetrival
were similar classes, not the base retrival component. I noticed that AutoRetrival
also does not inherit BaseRetrival
, so I'm not sure if they should both inherit BaseRetrival
, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing this out, what do you think @Wendong-Fan ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @liuxukun2000 and @yiyiyi0817! The BaseRetriever
is a minimal, abstract base class, while the HybridRetriever
extends both VectorRetriever
and BM25Retriever
, operating at a higher level. However, we can still inherit from BaseRetriever
to implement the process
method.
I think it would be better not to include AutoRetriever
directly within HybridRetriever
. AutoRetriever
is a simple implementation designed to allow users to quickly run our RAG pipeline. Its primary purpose is to provide an easy entry point for users to try our RAG functionality, so it should remain at the top level for user interaction. HybridRetriever
should depend solely on VectorRetriever
and BM25Retriever
. Perhaps later, we can consider integrating HybridRetriever
into AutoRetriever
for an enhanced user experience. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WenDong OK, I agree with you.
camel/retrievers/hybrid_retrival.py
Outdated
contents=self.content_input_path, | ||
top_k=vector_retriever_top_k, | ||
similarity_threshold=vector_retriever_similarity_threshold, | ||
return_detailed_info=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be return_detailed_info=return_detailed_info
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your message. I think the setting of return_detailed_info=True
here is intended to obtain detailed results from auto_retriever
, and this is fixed. The return_detailed_info
parameter of the query
function is used to specify whether to finally return detialed results that include the rrf score
. However, this part may be refactored later to become vector_retriever
.
return assistant_response.msg.content | ||
|
||
|
||
print(single_agent("What is it like to be a visiting student at KAUST?")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to include the expected output for single_agent("What is it like to be a visiting student at KAUST?")
at the end of the file, consistent with other examples.
camel/retrievers/hybrid_retrival.py
Outdated
"Original Query": query, | ||
"Retrieved Context": text_retrieved_info, | ||
} | ||
if return_detailed_info: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in this section can be simplified to avoid redundancy.
if return_detailed_info: | |
retrieved_info = { | |
"Original Query": query, | |
"Retrieved Context": all_retrieved_info if return_detailed_info else [item['text'] for item in all_retrieved_info], | |
} | |
return retrieved_info |
camel/retrievers/hybrid_retrival.py
Outdated
from camel.types import EmbeddingModelType, StorageType | ||
|
||
|
||
class HybridRetriever: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing this out, what do you think @Wendong-Fan ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review. And I will recommit a new version soon.
camel/retrievers/hybrid_retrival.py
Outdated
from camel.types import EmbeddingModelType, StorageType | ||
|
||
|
||
class HybridRetriever: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WenDong OK, I agree with you.
vector_storage (Optional[BaseVectorStorage]): An optional vector | ||
storage used by the VectorRetriever. Defaults to None. | ||
""" | ||
self.vr = VectorRetriever(embedding_model, vector_storage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need error handling for vectorretriever initialization error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a problem with VectorRetriever
initialization, I think the VectorRetriever
class will raise a related error, which should remind users of the correct information.
vector_retriever_results: List[Dict[str, Any]], | ||
bm25_retriever_results: List[Dict[str, Any]], | ||
top_k: int, | ||
vector_weight: float, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no validation for "vector_weight + bm25_weight == 1", maybe another option would be to just pass one weight parameter like "vector_weight" and calculate bm25_weight as vw - 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the calculation of rrf score, the balance coefficients of 0.2 and 0.8 or 20 and 80 have no effect on the result, which is why I did not assert this, but adding the judgment that the sum of the coefficients is 1 may make the user's use more standardized. WDYT?
|
||
vector_ranks = np.array( | ||
[ | ||
info.get('vector_rank', float('inf')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this be problematic? This makes it hard to distinguish between "not found" and "ranked last". Also shouldn't "rank" be "int" type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the calculation of rrf scores, except for top_k
, all other scores can be regarded as rank last, that is, their scores are all 0. (ref: https://colab.research.google.com/drive/1iwVJrN96fiyycxN1pBqWlEr_4EPiGdGy#scrollTo=0qh83qGV2dY8)
As for the int
type of rank, it seems I am not sure about you meaning, could you exaplain more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much fot mengkang's review. Answer some questions.
vector_storage (Optional[BaseVectorStorage]): An optional vector | ||
storage used by the VectorRetriever. Defaults to None. | ||
""" | ||
self.vr = VectorRetriever(embedding_model, vector_storage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a problem with VectorRetriever
initialization, I think the VectorRetriever
class will raise a related error, which should remind users of the correct information.
vector_retriever_results: List[Dict[str, Any]], | ||
bm25_retriever_results: List[Dict[str, Any]], | ||
top_k: int, | ||
vector_weight: float, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the calculation of rrf score, the balance coefficients of 0.2 and 0.8 or 20 and 80 have no effect on the result, which is why I did not assert this, but adding the judgment that the sum of the coefficients is 1 may make the user's use more standardized. WDYT?
|
||
vector_ranks = np.array( | ||
[ | ||
info.get('vector_rank', float('inf')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the calculation of rrf scores, except for top_k
, all other scores can be regarded as rank last, that is, their scores are all 0. (ref: https://colab.research.google.com/drive/1iwVJrN96fiyycxN1pBqWlEr_4EPiGdGy#scrollTo=0qh83qGV2dY8)
As for the int
type of rank, it seems I am not sure about you meaning, could you exaplain more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yiyiyi0817 !
Co-authored-by: Xukun Liu <[email protected]> Co-authored-by: Wendong-Fan <[email protected]>
Description
Hybrid Retrieval that combines auto retrieval and BM25 retrieval.
Motivation and Context
Types of changes
What types of changes does your code introduce? Put an
x
in all the boxes that apply:More Tasks
In the future, improvements can be made by separating the chunking and processing parts of the original vector-based retrieval and BM25 retrieval code, so that chunks can be uniformly numbered instead of relying on the current version's string matching deduplication operation.
Checklist
Go over all the following points, and put an
x
in all the boxes that apply.If you are unsure about any of these, don't hesitate to ask. We are here to help!