41 changes: 41 additions & 0 deletions search_engine.md
@@ -0,0 +1,41 @@
# Greenstand should create a search engine for finding data and autocompleting user queries.

* Status: proposed

* Deciders: @kparikh9, @dadiorchen

* Date: 2022-05-18

Technical Story:
[Treetracker Query API Issue #90](https://github.com/Greenstand/treetracker-query-api/issues/90)

The search engine would be used to quickly and efficiently return ranked results to a user. Postgres can cover the user stories described in the linked ticket with its full-text search and the `LIKE` operator for pattern matching, but it is a weaker fit for ranking: relevance ordering is limited to what functions such as `ts_rank` provide, and ordering by popularity or any other search-based metric would require recording that metric in a separate column. A `LIKE` query also assumes the typed text appears verbatim somewhere in the column it is querying, so it finds nothing if a character is missing or out of order. A search engine can fuzzy-match what the user types and rank the results based on metrics it collects.
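To make the difference concrete, here is a minimal sketch in plain Python that contrasts verbatim substring matching (roughly what `LIKE '%...%'` does) with a similarity-ranked search. It uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer; this is not how ElasticSearch or Postgres score matches, and the planter list, function names, and the `min_score` threshold are assumptions made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical planter names; only the first one appears in this document.
PLANTERS = ["Onyega Innocent G.", "Amina Okoro", "Innocent Mwangi"]


def substring_search(query, names):
    """Mimic LIKE '%query%': the typed text must appear verbatim."""
    q = query.lower()
    return [name for name in names if q in name.lower()]


def fuzzy_search(query, names, min_score=0.6):
    """Score each name by similarity to the query and return the best first."""
    q = query.lower()
    scored = [(SequenceMatcher(None, q, name.lower()).ratio(), name) for name in names]
    scored.sort(reverse=True)
    return [name for score, name in scored if score >= min_score]


if __name__ == "__main__":
    misspelled = "Onyeag Inocent"  # transposed and missing characters
    print(substring_search(misspelled, PLANTERS))
    print(fuzzy_search(misspelled, PLANTERS))
```

With the misspelled query, the substring search finds nothing, while the similarity-based search still ranks "Onyega Innocent G." first. A real search engine would add tokenization, edit-distance limits, and relevance scoring on top of this idea.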

## Context and Problem Statement

What if an organization would like to look up the tree portfolio/wallet of a competing organization but didn't know the exact name of its wallet? What if a user wanted to search for a planter named 'Onyega Innocent G.' and all the trees he's planted but didn't remember how to spell his name?

## Decision Drivers

* ElasticSearch - integrates well with other products of the Elastic Stack, such as Kibana and Logstash. Easiest to experiment with, since free trials are available for Elastic Cloud (a managed ElasticSearch deployment).
Contributor

So we just got rid of our ELK stack, which we were using for consolidated logging of microservices. It was very difficult for the current cloud team to manage and to keep deployed in our cluster. I presume we would not need the whole ELK stack to achieve what you are looking to do here? Kibana really stressed our cloud resources. However, maybe there is a more stripped-down deployment option that would meet your use case.


## Considered Options
Contributor

Did you consider Sphinx? https://stackshare.io/stackups/lucene-vs-sphinx I know it's great for docs search.

Contributor

Or maybe something from this list? https://www.educba.com/elasticsearch-alternatives/

Author

@kparikh9 May 24, 2022

Thanks for the links, @mckornfield! I'll check these out and see if any fit better than ES


* ElasticSearch
Contributor

Did you spike these with a sample dataset?

Author

Yes, I took about 20 rows from the public.planters and public.trees tables and all the rows from the public.organizations table in the treetracker database. I tested autocomplete/search hinting queries on three separate indexes (1 for each table) and on one single index that contained all three types of data rows (planters, trees, organizations).
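As a rough illustration of the two layouts compared above, the sketch below builds (but does not send) an ElasticSearch prefix query for each one. It is not the exact query used in the spike: the index names `planters`, `trees`, `organizations`, and `treetracker`, the `name` and `type` fields, and the helper `autocomplete_request` are all hypothetical. The `match_phrase_prefix` query, the `terms` aggregation, and comma-separated index lists in the search path are standard ElasticSearch features.

```python
import json


def autocomplete_request(prefix, layout="separate"):
    """Return (path, body) for a prefix search; nothing is sent over the network."""
    # Match documents whose (hypothetical) `name` field starts with the typed text.
    query = {"match_phrase_prefix": {"name": {"query": prefix}}}
    if layout == "separate":
        # One index per table, searched together via a comma-separated index list.
        return "/planters,trees,organizations/_search", {"query": query}
    # One combined index; a hypothetical `type` keyword field distinguishes
    # planters, trees, and organizations, so hits can be grouped by kind.
    body = {"query": query, "aggs": {"by_type": {"terms": {"field": "type"}}}}
    return "/treetracker/_search", body


if __name__ == "__main__":
    for layout in ("separate", "combined"):
        path, body = autocomplete_request("ony", layout)
        print("GET", path)
        print(json.dumps(body, indent=2))
```

The separate-index layout keeps mappings simple per table, while the combined index needs a discriminator field but returns all hit types from a single index.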

* Apache Solr
* Apache Lucene
Contributor

Can you say more about why the Apache projects were not chosen? I don't have experience with either, but I do know that CKAN (our chosen data portal) uses Solr.


## Decision Outcome

Chosen option: ElasticSearch ~ **Nothing determined yet**, but I've been exploring it for the past month. The documentation is very detailed and suggests a complex tool that can suit almost any use case. Open to any input.

### Positive Consequences

* Extensive documentation
* Several examples online (blogs, case studies, etc.)

### Negative Consequences

* Steep learning curve?
* Requires more experimentation on what architecture is the best for Greenstand's use case (i.e. search over multiple indexes vs. one index)
* Heavy memory usage (requires 4.0 GB RAM just for ElasticSearch, probably more for Kibana and Logstash) - can be expensive, since it requires larger compute servers that would need to stay on at all times.
Contributor

Definitely costs a lot in terms of resources. Also, there's no good auth support in the free versions of ELK.

Author

For ELK, I believe we can set up service accounts to request and use tokens for authorization when passing requests to the Elastic cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-authentication-services.html. This doesn't seem to be limited to Elastic Cloud (which is just a managed deployment of the ELK stack).