Federated search distributes queries across independent indexes
Federated search is the problem of finding the best results across N independent search indexes without querying all of them. It is a well-studied problem in information retrieval and the core enabler of [[federated]] search engines.
A centralized search engine such as Google maintains one index of the entire web, which requires enormous infrastructure. A federated search engine distributes that burden: Instance A indexes climate science, Instance B indexes local news, Instance C indexes open-source docs. No single instance indexes everything. No single operator controls the whole system.
The Three Stages
1. Resource Selection: which indexes to query
Given 100 federated instances and a query, you cannot query all of them, because that would be too slow. You need to determine which instances are most likely to have relevant results.
This uses collection summaries: compact representations of what each instance contains (e.g., topic vectors). For each remote instance, compute the similarity between the query and the instance’s summary, rank instances by relevance, and select the top K. This is a local operation: all summaries are cached, so no network requests are needed.
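The selection step above can be sketched in a few lines. This is a minimal illustration, assuming each summary is a small topic vector and that cosine similarity is the relevance measure; the instance names, the vectors, and the `select_instances` helper are all hypothetical, not taken from any particular implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_instances(query_vec, summaries, k):
    """Rank cached instance summaries against the query and keep the top k."""
    scored = [(cosine(query_vec, vec), name) for name, vec in summaries.items()]
    # Highest similarity first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical cached summaries: (climate, news, software) topic weights.
summaries = {
    "climate.example": [0.9, 0.1, 0.0],
    "localnews.example": [0.1, 0.9, 0.1],
    "docs.example": [0.0, 0.1, 0.9],
}

query_vec = [0.8, 0.2, 0.0]  # assumed embedding of a climate-heavy query
selected = select_instances(query_vec, summaries, k=2)
print(selected)
```

Everything here runs against the local cache, which is why this stage costs no network round trips.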
The classic algorithm for this is CORI (Collection Retrieval Inference).
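As a rough sketch of how CORI-style scoring works: each collection gets a belief score per query term, built from the term's document frequency in that collection (`df`), the number of collections containing the term (`cf`), and the collection's size in words (`cw`). The constants 50, 150, and `b = 0.4` below are the defaults commonly quoted for CORI; the statistics, the averaging across terms, and the function names are assumptions made for illustration.

```python
import math


def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    """CORI-style belief that a collection is relevant to one query term."""
    if df == 0 or cf == 0:
        return b  # term absent from this collection: default belief only
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * t * i


def rank_collections(query_terms, stats):
    """Order collections by their mean per-term belief, best first."""
    n = len(stats)
    avg_cw = sum(s["cw"] for s in stats.values()) / n
    # cf: in how many collections does each query term occur at all?
    cf = {
        term: sum(1 for s in stats.values() if s["df"].get(term, 0) > 0)
        for term in query_terms
    }
    scores = {}
    for name, s in stats.items():
        beliefs = [
            cori_belief(s["df"].get(term, 0), cf[term], s["cw"], avg_cw, n)
            for term in query_terms
        ]
        scores[name] = sum(beliefs) / len(beliefs)
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical per-collection statistics (cw = total words, df = docs per term).
stats = {
    "climate.example": {"cw": 100_000, "df": {"glacier": 400, "melt": 300}},
    "localnews.example": {"cw": 200_000, "df": {"melt": 20}},
    "docs.example": {"cw": 50_000, "df": {}},
}

ranking = rank_collections(["glacier", "melt"], stats)
print(ranking)
```

The per-collection statistics play the role of the collection summary, so this scoring also stays local.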
2. Query Forwarding: sending the query to selected indexes
The query is sent to the K selected instances. Each executes the query against its local index and returns results.
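One way to forward a query is to fan it out concurrently and cap how long any single instance may take. The sketch below uses `asyncio` with a simulated remote call in place of real network I/O; `fake_remote_search`, the instance names, and the delays are hypothetical stand-ins, and a real system would issue an HTTP or ActivityPub request at that point.

```python
import asyncio


async def fake_remote_search(instance, query, delay):
    """Stand-in for a network call to one remote instance."""
    await asyncio.sleep(delay)  # simulated latency
    return [f"{instance}/{query}/{i}" for i in range(2)]


async def forward_query(query, instances, timeout):
    """Send the query to every selected instance at once, with a per-instance timeout."""

    async def guarded(name, delay):
        try:
            hits = await asyncio.wait_for(
                fake_remote_search(name, query, delay), timeout
            )
            return name, hits
        except asyncio.TimeoutError:
            # A slow or unreachable instance contributes nothing rather than
            # stalling the whole search.
            return name, []

    pairs = await asyncio.gather(*(guarded(n, d) for n, d in instances.items()))
    return dict(pairs)


# Hypothetical selected instances mapped to their simulated latency in seconds.
instances = {
    "climate.example": 0.01,
    "localnews.example": 0.02,
    "slow.example": 1.0,
}

results = asyncio.run(forward_query("glacier", instances, timeout=0.2))
print(results)
```

With this shape, total latency is bounded by the timeout rather than by the sum of all instance response times.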
3. Result Merging: combining results from multiple sources
Results from K instances plus the local instance need to be merged into a single ranked list. The challenge: different instances may use different scoring parameters, making raw scores incomparable.
[[Reciprocal Rank Fusion]] solves this elegantly by ignoring raw scores and using only rank positions.
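A minimal sketch of Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1. The constant `k = 60` is the value commonly used in the literature; the document IDs and list names below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists using rank positions only; raw scores are never consulted."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical ranked result lists from the local index and two remote instances.
local = ["a", "b", "c"]
remote_1 = ["b", "a", "d"]
remote_2 = ["b", "e"]

merged = reciprocal_rank_fusion([local, remote_1, remote_2])
print(merged)
```

Because only positions matter, it makes no difference that each instance scored its results on a different scale.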
Real-World Implementations
BookWyrm (a federated Goodreads alternative) has the most mature pattern: it searches the local database, remote BookWyrm instances via [[ActivityPub]], and external APIs, then merges all results. Its search is sequential and blocking, a known performance issue that asynchronous approaches can address.
Lemmy and Misskey search only locally federated content (content already delivered to the instance). They have no query-forwarding mechanism, a gap that true federated search addresses.
PeerTube’s Sepia Search takes the opposite approach: a centralized crawler indexes all PeerTube instances. This works but reintroduces the central point that federation aims to eliminate.
Related Concepts
- [[Federation]]: the network architecture that federated search operates within
- [[Reciprocal Rank Fusion]]: the algorithm for merging ranked results from multiple sources
- [[A Note on Distributed Computing]]: the fundamental challenges of distributed systems apply here too