First draft of 0004-greenstand-search-engine #7
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| # Greenstand should create a search engine for finding data and autocompleting user queries. | ||
|
|
||
| * Status: proposed | ||
|
|
||
| * Deciders: @kparikh9, @dadiorchen | ||
|
|
||
| * Date: 2022-05-18 | ||
|
|
||
| Technical Story: | ||
| [Treetracker Query API Issue #90](https://github.com/Greenstand/treetracker-query-api/issues/90) | ||
|
|
||
| The search engine would be used to quickly and efficiently return ranked results to a user. While Postgres can achieve the user stories described in the linked ticket with its full-text search and the `LIKE` operator for pattern matching, it will not be able to order results based on relevance, popularity, or any other search-based metric (which would have to be recorded in a separate column). A `LIKE` pattern assumes the input text appears verbatim somewhere in the values of the column it's querying, so it won't find results if a character is missing or out of order. A search engine will be able to fuzzy search what the user types and rank the results based on metrics it collects. | ||
|
|
||
| ## Context and Problem Statement | ||
|
|
||
| What if an organization would like to look up the tree portfolio/wallet of a competing organization but didn't know the exact name of its wallet? What if a user wanted to search for a planter named 'Onyega Innocent G.' and all the trees he's planted but didn't remember how to spell his name? | ||
|
|
||
| ## Decision Drivers | ||
|
|
||
| * ElasticSearch - integrates well with other Elastic Stack products such as Kibana and Logstash. Easiest to experiment with, since free trials are available for Elastic Cloud (a managed ElasticSearch deployment) | ||
|
|
||
| ## Considered Options | ||
|
Contributor
Did you consider Sphinx? https://stackshare.io/stackups/lucene-vs-sphinx It's great for docs search, I know.
Contributor
Or maybe something from this list? https://www.educba.com/elasticsearch-alternatives/
Author
Thanks for the links, @mckornfield! I'll check these out and see if any fit better than ES |
||
|
|
||
| * ElasticSearch | ||
|
Contributor
Did you spike these with a sample dataset?
Author
Yes, I took about 20 rows from the |
||
| * Apache Solr | ||
| * Apache Lucene | ||
|
Contributor
Can you say more about why the Apache projects were not chosen? I don't have experience with either, but I do know that CKAN (our chosen data portal) uses Solr. |
||
|
|
||
| ## Decision Outcome | ||
|
|
||
| Chosen option: ElasticSearch ~ **Nothing determined yet**, but I've been exploring it for the past month. The documentation is very specific and suggests that it is a complex library that can suit almost any use case. Open to any input. | ||
|
|
||
| ### Positive Consequences | ||
|
|
||
| * Extensive documentation | ||
| * Several examples online (blogs, case studies, etc.) | ||
|
|
||
| ### Negative Consequences | ||
|
|
||
| * Steep learning curve? | ||
| * Requires more experimentation on what architecture is the best for Greenstand's use case (i.e. search over multiple indexes vs. one index) | ||
| * Heavy memory usage (requires 4.0 GB RAM just for ElasticSearch, probably more for Kibana and Logstash) - can be expensive since it requires larger compute servers and this would need to remain on at all times. | ||
|
Contributor
It definitely costs a lot in terms of resources. Also, there's no good auth support in the free versions of ELK.
Author
For ELK, I believe we can set up service accounts to request and use tokens for authorization to pass requests to the Elastic cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-authentication-services.html This doesn't seem to be limited to Elastic Cloud (which is just a managed deployment of the ELK stack) |
||
So we just got rid of our ELK stack, which we were using for consolidated logging of microservices. It was very difficult for the current cloud team to manage and to keep deployed in our cluster. I presume we would not need the whole ELK stack to achieve what you are looking to do here? Kibana really stressed our cloud resources. However, maybe there is a more stripped-down deployment option that would meet your use case.
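
For reference, the gap the proposal describes between `LIKE`-style matching and fuzzy, ranked search can be sketched without any search engine. The snippet below is only an illustration, and it makes several assumptions: it uses Python's standard-library `difflib` rather than Postgres or ElasticSearch, the function names and the `0.6` threshold are made up for this example, and every planter name except 'Onyega Innocent G.' is hypothetical sample data.

```python
from difflib import SequenceMatcher

# Hypothetical sample data; only 'Onyega Innocent G.' comes from the proposal.
PLANTERS = ["Onyega Innocent G.", "Amina Yusuf", "John Okoro"]


def substring_search(query: str, names: list[str]) -> list[str]:
    """Mimic `LIKE '%query%'`: a name matches only if it contains the query verbatim."""
    q = query.lower()
    return [name for name in names if q in name.lower()]


def fuzzy_search(
    query: str, names: list[str], threshold: float = 0.6
) -> list[tuple[str, float]]:
    """Score each name by similarity to the query; return matches, best first."""
    q = query.lower()
    scored = [(name, SequenceMatcher(None, q, name.lower()).ratio()) for name in names]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Misspelled on purpose: one 'n' is missing from "Innocent".
misspelled = "Onyega Inocent"
print(substring_search(misspelled, PLANTERS))
print(fuzzy_search(misspelled, PLANTERS))
```

With the misspelled query, the substring search returns nothing, while the similarity-based search still ranks 'Onyega Innocent G.' first. A real search engine would use its own scoring and indexing, so this only shows the behavior in question, not how ElasticSearch, Solr, or Lucene implement it.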