Wabinab

Posted on Jun 14, 2022Read on Mirror.xyz

How To Code An On-Chain Search Engine

When writing smart contracts, we aim for less gas consumption and less storage used. But as far as one know, both cannot be optimized together. Maybe there are special situations that they can, I don’t know. Usually, it’s like the Heisenberg Uncertainty Principle, where after the minima, further optimizing the storage causes gas to increase, and further optimizing gas causes storage to increase. These are trade-offs to make and decide.

Database searches are fast. They are especially tuned for fast search. More than that, it can also search multiple fields at once, plus even matching partial searches. That’s not the same for searching basic programming storage, like hashmap (dictionary/hash) or vector (array/list). It isn’t optimized for searches, so it’s up to us how we design the data for optimizing search speed. To ensure search speed, we certainly need to sacrifice storage space optimization. There will be duplication of data when we try to improve data-access.

What Was Learnt Before

From fast.ai, one used to learn about how to search quickly. In particular, there’s a language machine learning model that requires storing a dictionary of words. Computer can’t understand words, so we need to convert words to an array/list of numbers. Fast.ai did the conversion, and it come two ways. There is a vector of words, stored like this:

dict = ["this", "is", "a", "vector", "of", "words"]

optimized to convert number to words. So if you have this (note vectors start from 0, not 1):

list_of_numbers = [1, 0, 2, 5]
new_words = []

for i in list_of_numbers:
    new_words.append(dict[i])

" ".join(new_words)
=> "is this a words"  # output

So ok this is easy; but how about the other way round? Searching for value in a list is very slow, extremely slow, so that’s why we use a dictionary/hash:

reverse_dict = {
  "this": 0,
  "is": 1,
  "a": 2,
  "vector": 3
  "of": 4
  "words" 5
}

And if we search for it, it’s very fast to search with a dict this way.

word = "is a words"
vector_of_words = word.split(" ")
# ["is", "a", "words"]

list_of_numbers = []

for word in vector_of_words: 
    list_of_numbers.append(reverse_dict[word])

list_of_numbers
=> [1, 2, 5]  # output

This is a good way to search. Though, we already note the repetition: values are stored twice for optimizing search speed. Let’s say if we want to search for NFTs storing on the blockchain. Let’s replace the “words” above with NFTs “titles”. Note that searches requires full title search to find the NFT. A dictionary doesn’t allow partial matches: it must match exactly, and it’s case-sensitive.

# These are different
"this is a word"
"This Is A Word"

So far, the two caveats we noticed are:

  • It must fit whole wordings.
  • It is case-sensitive.

The second one is easy to solve, just lowercase everything before you store it on the database. Though, we also need to check whether it works for other language, because most likely your users won’t just store English Language! If they use other language, we need to check whether our programming language “lowercase” function won’t damage, especially special characters like “é”, “ç”, etc. Other characters like Chinese most probably works, because there’s no such thing as lowercase chinese and the function most probably is a decoration when inputting Chinese characters.

There’s no easy method to solve the first though. In one sense, you could search partial wordings if you permute partial words, like this:

"i am a tree"

# Removing all non-sensical partial matches
{
  "i am a tree": 1,
  "am a tree": 1,
  "a tree": 1, 
  "tree": 1,
}

But consider how much repetition it causes, and how much extra storage used. Instead, we’ll search by tags.

Searching by tags have exist long ago. It doesn’t exist on Mirror, but if you try writing an article on Medium, they asks you to enter up to 5 tags for reader optimization. Reader choose their tags on what they want to read about, and they’ll be served articles that has those tags. Tags are also short, which we could limit its length to 3 or 4 per tag; and we could limit how many tags we have. Let’s also say 5.

Note: Users need SEO skills for this. If users refuse to add tags, or if they don’t know what tags to add, their contents may be difficult to search. This is like what you enter into “meta title” and “meta description” when writing an article on Mirror, except here with tags. One doesn’t know a way to search by title/description on the blockchain as one don’t know how to perform partial matches. On the other hand, tags are short and general, hence multiple tags makes it more describingly specific. We can also add some most-used tags, depending on your use cases.

We use tags as keys and “vector position” as value. We would repeat the “vector position” multiple times for different keys, here six times: five for tags, one for title. Storing the full title ensure someone that already know the title searching for the full title displays your result at the top of the page, for relevancy.

If you don’t already know what are key-value pairs, this is how it works: what you are searching for is the key, and the corresponding value is returned given the key, so:

var my_dict = {
  "key_1": "value_1",
  "key_2": "value_2"
}

searching_for = "key_2"
my_dict[searching_for]
=> "value_2"  # output

Let’s, get back to what we said. So what we mean by storing six times is this:

# assume article id is 1

{
  "title": 1,
  "tag_1": 1,
  "tag_2": 1,
  "tag_3": 1,
  "tag_4": 1, 
  "tag_5": 1
}

Speed is ensured. Storage isn’t.

And certainly, a tag is not just use once for one article; perhaps the title might be unique, but tags certainly are not. And given how human names could have repetition between different person (there are lots of people called “David”, lots of people called “Alice”), we could assume that title aren’t unique too, and will be reused. So, we could upgrade it a little. Instead of storing a number, we could store a list of numbers.

So, assuming we have article id 2, which only have two tags that are same as above (they don’t exhaust all the tags, just like it’s not necessarily for you to use all 5 tags in Medium articles if it doesn’t goes into all relevant categories), which are “tag_2” and “tag_5”, then plus a different title, we see the change below:

# we put the first empty because one is too lazy to change 1 to 0.
# vector position starts from 0, so position [0, 1, 2]
articles = ['empty', "my first book", "my second book"]

{
  "my first book": [1],
  "my second book": [2],
  "tag_1": [1],
  "tag_2": [1, 2],
  "tag_3": [1],
  "tag_4": [1], 
  "tag_5": [1, 2]
}

So now, searching for tag_2 or tag_5 will return both “my first book” and “my second book”. Voilà!

How About Search Bar?

What we have been talking is the backend searching mechanism, we haven’t include how the frontend search. When people search, they usually don’t just search by tags, but normal searching methods. Usually, they would type a few words, separated by spacings.

With Google (or similar) search engine, we could do partial searches. They uses large machine learning models, with “meta title” and “meta description” taken from all the sites to perform relevant searches. Perhaps we could do so with database also, one don’t know. Certainly, with our searching methods, we couldn’t do partial searches that matches say, the first 3 words (out of 5) for searching. This is a drawback of our search mechanism.

But we aren’t targeting something that complicated. Perhaps you could check out how search engine works, with large language ML models plus other methods to make searches relevant. But for us, we just want to be approximately relevant, so we could split the words into single wordings and search if there are relevant tags for that. It won’t be very accurate, but it still works okay-ish.

So let’s say, we are searching for “search engine on the blockchain”, we could split it into:

["search", "engine", "on", "the", "blockchain"]

after removing trailing and leading whitespaces. For each of these words, we pass it to the contract to search for relevant tags, whether it exist or not. We then displays the results on the frontend page.

How to Arrange The Results

We couldn’t display everything, as it would kill the page with too much result returns. We need to paginate our results. Pagination is tricky, because we aren’t searching for one single tag, which we could always, say, take the first 20 of the array and return. If we have multiple tags from all different part, how we paginate is up to us. One didn’t experiment with this, so one can’t give you some solutions. It’s up to the reader for further experimentation, with real users, which works better.

Anyways, there are a range of solutions to try. The easiest is to split pagination equally. Say, we have 5 words, and we only want 20 results. We could take 4 results from each tags and display them back. If in case some tags don’t return value, we could always take 20 results from each of the 5 words, so that totals 100, then we do our sorting we want at the frontend, removing unnecessary results (or caching them and use them when user directs to page 2, 3, etc.)

The easiest sorting is by newest and oldest, so that just depends on your design. If you have a vector and always push newly created items to the back of the vector, you can safely assume that larger “vector position” represents newer items. Likewise, older ones have smaller “vector position”. If you decide for other sorting mechanism, it might be more difficult, like choosing between two datetimes (in seconds since Unix Epoch). Trying to create another hash/dictionary with unix epoch as key and array of “vector position” (one’ll call this “article id” from now on) is can-do, but it might take up more space. You could try random search, storing the datetime on template’s metadata, then target a random article id and see if it is larger or smaller than the value. Then from both ends, slowly tune your way to meet the criteria. This search sounds slower than normal database search; so you may try and verify it’s really slower than usual.

Generally, one recommend eliminating difficult search methods. Remember, we aren’t using the database, and some search can just be slow. To ensure speed of searching, we make sacrifices by restricting what criteria we can search by and what cannot. It won’t give the best user experience, but unless lots of users always use a single method of search, otherwise you could safely ignore it for users to resort to other search method instead. As far as one’s concerned, google searches doesn’t have a time frame choice, nor is one aware of “newest”. It’s all “by relevance”. So unless your app heavily relies on that search criteria, you could safely ignore it.

Not Too Much Criteria

While SQL allows you to search by column A and column B and column C and … and column N, it’s not the same for on-chain searches. It’s almost impossible to design such multiple search criteria that are strong with database. If you ever need more than 2 criteria, prefer to use a database. Alternatively, you could always return results from different criteria in three separate results, then combine them together and do some magic and put the relevant/newest/whatever one on top and display back to user.

Remember, the more criteria you have, the more Hash/Dictionary you need, then the more space it’s gonna take. One recommends just sticking with tags would be best. If you have a music genre, you could add that as a tag too! Then when result returns, for each result, count occurrences, and the more it hits, the more relevant it is (though this is not ensured).

You could even make criteria become tags. Say, if one have a few music genre, one could make a few tags for the music genre. If people search by genre, that genre results will return.

Tags: What to Eliminate

You might prefer to eliminate some tags. Say, tags like “a”, “and”, “or”, that doesn’t give a meaning to what’s being search. Though, one knows there are some “stop words” library in Python, one isn’t aware of whether there’s a dictionary of words that are too relevant. It’s up to the reader to figure that out. These “stop words” doesn’t give a meaning to searches, and the result returned aren’t irrelevant. Best, don’t store it on-chain: if anyone use that tag, prefer to discard it rather than saving it down, plus informing the users that it’s being discarded because it makes no meaning.

And Weakness

Aside from those we mentioned, “relevancy” is difficult. When author creates their item, the item’s “vector position” (ok, it’s not article id anymore, because it could be music not a blog post) is appended to the end of the vector. This just sorts by newest and oldest. How do we sort retrieve items for relevancy? These will be done from the smart contract side. Though, it isn’t easy.

Ignore newest and oldest, that’s easy, that’s how we saved our data. For relevancy, one choice is to compare the “thumbs up” of that item, how many people buy, etc; whatever your formula to calculate popularity and relevancy. Then, we could sacrifice some space again and save another array as a 2D array for each tags, like this:

{
  "tag_1": [
    [1, 2, 3],  # store by newest/oldest.
    [2, 1, 3],  # store by relevancy. 
  ]
}

Just to save space, newest and oldest we could just access array starting from front or back, so saving as one array is fine. Just relevancy needs frequent updates as people updates their votes, etc. One isn’t sure how exactly to do that, and how frequently to do it, because we need to access the element’s metadata to check how their votes compare before moving it front or back. It’s difficult, nevertheless. Perhaps you could find another way to deal with it. If you do, please tell me! One would appreciate your sharing!

Conclusion

It’s not saving on normal program objects are slow. Programming language evolves for a long time, and there are fast way to search for objects. It’s just, given lots of criteria, it isn’t easy to search for something on a programming object compared to on a database. If we are just doing basic search, with some sacrifices we could make a basic search engine that works on-chain.

If you have any feedback, please post on Twitter tagging me! Preferably, tag me on read.cash as I check this more often.