Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Hybrid Retrieval #1398

Merged
merged 12 commits into from
Feb 16, 2025
Merged

feat: Hybrid Retrieval #1398

merged 12 commits into from
Feb 16, 2025

Conversation

yiyiyi0817
Copy link
Member

Description

Hybrid Retrieval that combines auto retrieval and BM25 retrieval.

Motivation and Context

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

More Tasks

In the future, improvements can be made by separating the chunking and processing parts of the original vector-based retrieval and BM25 retrieval code, so that chunks can be uniformly numbered instead of relying on the current version's string matching deduplication operation.

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@Wendong-Fan Wendong-Fan added this to the Sprint 20 milestone Jan 5, 2025
@Wendong-Fan Wendong-Fan changed the title RAG: Hybrid Retrieval feat: Hybrid Retrieval Jan 5, 2025
@Wendong-Fan Wendong-Fan modified the milestones: Sprint 20, Sprint 19 Jan 5, 2025
from camel.types import EmbeddingModelType, StorageType


class HybridRetriever:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Maybe extending BaseRetriever here would help maintain consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous thought was that HybridRetriever and AutoRetrival were similar classes, not the base retrival component. I noticed that AutoRetrival also does not inherit BaseRetrival, so I'm not sure if they should both inherit BaseRetrival, WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out, what do you think @Wendong-Fan ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @liuxukun2000 and @yiyiyi0817! The BaseRetriever is a minimal, abstract base class, while the HybridRetriever extends both VectorRetriever and BM25Retriever, operating at a higher level. However, we can still inherit from BaseRetriever to implement the process method.
I think it would be better not to include AutoRetriever directly within HybridRetriever. AutoRetriever is a simple implementation designed to allow users to quickly run our RAG pipeline. Its primary purpose is to provide an easy entry point for users to try our RAG functionality, so it should remain at the top level for user interaction. HybridRetriever should depend solely on VectorRetriever and BM25Retriever. Perhaps later, we can consider integrating HybridRetriever into AutoRetriever for an enhanced user experience. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WenDong OK, I agree with you.

contents=self.content_input_path,
top_k=vector_retriever_top_k,
similarity_threshold=vector_retriever_similarity_threshold,
return_detailed_info=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be return_detailed_info=return_detailed_info ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your message. I think the setting of return_detailed_info=True here is intended to obtain detailed results from auto_retriever, and this is fixed. The return_detailed_info parameter of the query function is used to specify whether to finally return detialed results that include the rrf score. However, this part may be refactored later to become vector_retriever.

return assistant_response.msg.content


print(single_agent("What is it like to be a visiting student at KAUST?"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to include the expected output for single_agent("What is it like to be a visiting student at KAUST?") at the end of the file, consistent with other examples.

"Original Query": query,
"Retrieved Context": text_retrieved_info,
}
if return_detailed_info:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this section can be simplified to avoid redundancy.

Suggested change
if return_detailed_info:
retrieved_info = {
"Original Query": query,
"Retrieved Context": all_retrieved_info if return_detailed_info else [item['text'] for item in all_retrieved_info],
}
return retrieved_info

from camel.types import EmbeddingModelType, StorageType


class HybridRetriever:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out, what do you think @Wendong-Fan ?

Copy link
Member Author

@yiyiyi0817 yiyiyi0817 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review. And I will recommit a new version soon.

from camel.types import EmbeddingModelType, StorageType


class HybridRetriever:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WenDong OK, I agree with you.

vector_storage (Optional[BaseVectorStorage]): An optional vector
storage used by the VectorRetriever. Defaults to None.
"""
self.vr = VectorRetriever(embedding_model, vector_storage)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need error handling for vectorretriever initialization error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a problem with VectorRetriever initialization, I think the VectorRetriever class will raise a related error, which should remind users of the correct information.

vector_retriever_results: List[Dict[str, Any]],
bm25_retriever_results: List[Dict[str, Any]],
top_k: int,
vector_weight: float,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no validation for "vector_weight + bm25_weight == 1", maybe another option would be to just pass one weight parameter like "vector_weight" and calculate bm25_weight as vw - 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the calculation of rrf score, the balance coefficients of 0.2 and 0.8 or 20 and 80 have no effect on the result, which is why I did not assert this, but adding the judgment that the sum of the coefficients is 1 may make the user's use more standardized. WDYT?


vector_ranks = np.array(
[
info.get('vector_rank', float('inf'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this be problematic? This makes it hard to distinguish between "not found" and "ranked last". Also shouldn't "rank" be "int" type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the calculation of rrf scores, except for top_k, all other scores can be regarded as rank last, that is, their scores are all 0. (ref: https://colab.research.google.com/drive/1iwVJrN96fiyycxN1pBqWlEr_4EPiGdGy#scrollTo=0qh83qGV2dY8)
As for the int type of rank, it seems I am not sure about you meaning, could you exaplain more?

Copy link
Member Author

@yiyiyi0817 yiyiyi0817 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much fot mengkang's review. Answer some questions.

vector_storage (Optional[BaseVectorStorage]): An optional vector
storage used by the VectorRetriever. Defaults to None.
"""
self.vr = VectorRetriever(embedding_model, vector_storage)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a problem with VectorRetriever initialization, I think the VectorRetriever class will raise a related error, which should remind users of the correct information.

vector_retriever_results: List[Dict[str, Any]],
bm25_retriever_results: List[Dict[str, Any]],
top_k: int,
vector_weight: float,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the calculation of rrf score, the balance coefficients of 0.2 and 0.8 or 20 and 80 have no effect on the result, which is why I did not assert this, but adding the judgment that the sum of the coefficients is 1 may make the user's use more standardized. WDYT?


vector_ranks = np.array(
[
info.get('vector_rank', float('inf'))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the calculation of rrf scores, except for top_k, all other scores can be regarded as rank last, that is, their scores are all 0. (ref: https://colab.research.google.com/drive/1iwVJrN96fiyycxN1pBqWlEr_4EPiGdGy#scrollTo=0qh83qGV2dY8)
As for the int type of rank, it seems I am not sure about you meaning, could you exaplain more?

Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yiyiyi0817 !

@Wendong-Fan Wendong-Fan merged commit e9c14d2 into master Feb 16, 2025
6 checks passed
@Wendong-Fan Wendong-Fan deleted the hybrid_retrival branch February 16, 2025 10:33
apokryphosx pushed a commit that referenced this pull request Feb 16, 2025
Co-authored-by: Xukun Liu <[email protected]>
Co-authored-by: Wendong-Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

4 participants