Sharding is the idea of splitting load evenly. Some keywords are far more popular than others, and we don't want the nodes responsible for a popular keyword to become hotspots, so we "shard" the keyword to spread its load across more nodes.
Let's assume we have a keyword like "1080p" that we know is extremely popular. We want to avoid 5 nodes taking on a substantial amount of storage and bandwidth just to serve 1 keyword while the other nodes serve only a small amount of data. To do that, we could force each node to accept at most 50 results per keyword.
We would still have to worry about keyword intersections (searching for more than 1 keyword at a time). We would also need to know which nodes are full and which are not, so that we know which nodes should receive a new result.
We should have a distributed tracking method for keywords: the 5 nodes closest to the hash of a keyword act as its tracker and store a count of results for that keyword. The tracker should be notified whenever a keyword is added, with a cache time of 24 hours. For example, the tracker for "1080p" might hold a count of 56 total results.
When a user wishes to add a keyword, they first ask the tracker for a count, for example:
COUNT_CHECK_REQ
1080p
COUNT_CHECK_RES
56
The user is now aware that there are 56 total results for that keyword, so it should hash the keyword in this manner: sha((count/50)+keyword), where count/50 is rounded down (56/50 gives shard 1). It will then send this as a store keyword request to the 5 nodes closest to that hash by XOR distance.
PUT_KEYWORD_REQ
1080p
PUT_KEYWORD_RES
stored
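The store step above can be sketched roughly as follows. This is only an illustration: the text says "sha" without naming a variant, so SHA-1 is assumed here, and joining the shard index and keyword as a plain string is also an assumption. The names shard_key and closest_nodes are hypothetical.

```python
import hashlib

SHARD_SIZE = 50   # results each shard accepts, per the text
REPLICAS = 5      # nodes that store each shard, per the text


def shard_key(keyword: str, count: int) -> bytes:
    """Return sha((count // 50) + keyword) as raw bytes (SHA-1 is assumed)."""
    shard_index = count // SHARD_SIZE
    # Assumption: "+" means string concatenation of the index and the keyword.
    return hashlib.sha1(f"{shard_index}{keyword}".encode("utf-8")).digest()


def closest_nodes(key: bytes, node_ids: list[bytes], k: int = REPLICAS) -> list[bytes]:
    """Pick the k node IDs with the smallest XOR distance to key."""
    target = int.from_bytes(key, "big")
    return sorted(node_ids, key=lambda n: int.from_bytes(n, "big") ^ target)[:k]
```

With a count of 56, shard_key("1080p", 56) hashes "11080p", so results 50 through 99 all land on the same 5 nodes before the next shard is opened.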
The nodes that received this should now notify the tracker that they are storing the keyword.
COUNT_UPDATE_REQ
1080p
COUNT_UPDATE_RES
updated
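A minimal sketch of what a tracker node might keep in memory is shown here. It assumes the 24 hour cache time means each notification expires 24 hours after it arrives; the class and method names are hypothetical, and nothing here is a wire format.

```python
import time

CACHE_SECONDS = 24 * 60 * 60  # 24-hour cache time from the text


class KeywordTracker:
    """Counts live results per keyword; each entry expires after 24 hours."""

    def __init__(self):
        self._expiries = {}  # keyword -> list of expiry timestamps

    def count_update(self, keyword, now=None):
        """Handle COUNT_UPDATE_REQ: record one newly stored result."""
        now = time.time() if now is None else now
        self._expiries.setdefault(keyword, []).append(now + CACHE_SECONDS)
        return "updated"

    def count_check(self, keyword, now=None):
        """Handle COUNT_CHECK_REQ: return the number of unexpired results."""
        now = time.time() if now is None else now
        live = [t for t in self._expiries.get(keyword, []) if t > now]
        self._expiries[keyword] = live  # drop expired entries
        return len(live)
```

The sketch counts one update per stored result. Since 5 nodes store each result and each of them notifies the tracker, a real tracker would need some way to avoid counting the same result 5 times; that is not covered here.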
To search, the user will hash the keyword with sha(keyword), find the tracker, and request a count, for example:
COUNT_CHECK_REQ
1080p
COUNT_CHECK_RES
56
Now that the user knows the total number of results, it will select a random number between 0 and count/50, which we will call x. It will then compute sha(x+keyword) and query the nodes closest to that hash by XOR distance for results.
GET_KEYWORD_REQ
1080p
GET_KEYWORD_RES
list of infohashes
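The shard selection for a lookup could look something like this. As before, SHA-1 and string concatenation are assumptions, and lookup_key is a hypothetical name.

```python
import hashlib
import random

SHARD_SIZE = 50  # results per shard, per the text


def lookup_key(keyword, count, rng=random):
    """Pick a random shard index x in [0, count // 50] and return (x, sha(x + keyword))."""
    x = rng.randint(0, count // SHARD_SIZE)  # randint includes both endpoints
    return x, hashlib.sha1(f"{x}{keyword}".encode("utf-8")).digest()
```

For a count of 56, x is either 0 or 1, so a search hits one of two groups of nodes instead of always hitting the same one.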
Now that we have discussed sharding, we should get into how we can handle keyword intersections. My theory is that we hash each keyword individually and then place the hashes in an index in a way that makes the order of the keywords irrelevant. We can limit each torrent to 10 keywords, which is the same amount used for website keywords within the meta tag; most users only search 1.9 words per query anyway, according to Google. The total number of keywords that would have to be stored every 24 hours would be a lot but not unreasonable: at most 1,023 total keywords stored per torrent, which is the number of non-empty combinations of 10 keywords (2^10 - 1).
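One possible way to build such an index is sketched below. The text only says that each keyword is hashed individually and that order should not matter; sorting the hashes and hashing their concatenation to get one key per combination is an assumption made for illustration, as are SHA-1 and the name index_keys.

```python
import hashlib
from itertools import combinations

MAX_KEYWORDS = 10  # per-torrent limit from the text


def index_keys(keywords):
    """Return one order-independent index key per non-empty keyword combination."""
    unique = set(keywords)
    if len(unique) > MAX_KEYWORDS:
        raise ValueError("at most 10 keywords per torrent")
    # Hash each keyword individually, then sort so input order is irrelevant.
    hashes = sorted(hashlib.sha1(k.encode("utf-8")).digest() for k in unique)
    keys = set()
    for size in range(1, len(hashes) + 1):
        for combo in combinations(hashes, size):
            # Assumption: a combination's key is the hash of its sorted hashes.
            keys.add(hashlib.sha1(b"".join(combo)).digest())
    return keys
```

A torrent with 10 keywords produces 1,023 keys this way, and a search for "x264 1080p" computes the same key as a search for "1080p x264".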