This repository has been archived by the owner on Aug 23, 2022. It is now read-only.

Aggregates #22

Open · wants to merge 2 commits into `main`
56 changes: 33 additions & 23 deletions LOG.md
@@ -17,27 +17,37 @@
* Turns out drawing a random number has side effects; I was unaware of this, so I spent some time reading about why. I think I get it now and have a branch with
random number generation working. Now I need to figure out how to wire up a series of events to build a user + pass along to the generator + return to the frontend
* Found some sources of inspiration re: text editors: https://dkodaj.github.io/rte/, https://package.elm-lang.org/packages/mweiss/elm-rte-toolkit/latest/

#### Sep 2021:
Yes, it's been a while. I've built up some Elm / Lamdera chops; now I'm starting to think about how to use Lamdera in the context of data analysis. I have a week off,
and want to start exploring what it'd be like to incorporate Lamdera into an existing ecosystem. My thoughts atm:
* While "pure Lamdera" is my happy place, it's a bit far from most companies' realities. Could a datastore + Lamdera hybrid approach help convince others this paradigm isn't crazy?
    - Pros:
        * might help keep Lamdera costs down as I approach "medium data"; the Pro tier looks like it could get pricey for what I want to achieve. Also, the hobbyist tier stops at 5MB (tho I think that might be negotiable)
        * provides type safety "wrapped around" the data store - need to jam on this one more.
    - Cons:
        * We are intentionally crossing a `semantic boundary`. Reducing such boundaries is a large motivation for Lamdera in the first place. I'm setting myself up for some upstream swimming.
* Options that come to mind:
    - MongoDB
        * Very fast; BSON is basically just JSON
        * Quickly scouring the web, I see no MongoDB protocol implemented in Elm.
    - Elasticsearch
        * HTTP API out of the box
        * The API supports the Elastic query DSL, which has (limited) Elm support.
        * In addition to being a quasi-datastore, ES also has search features. If the data-viz stuff doesn't work out, there still might be fun things to experiment with.
    - FaunaDB also looks interesting, but it's too much of a leap relative to my current skillset. I want to stay focused on the Lamdera aspect; if I'm successful, Fauna deserves a closer look for sure.
    - Data-warehousing products like BigQuery and Snowflake have lots of setup and are higher latency. Punting on these, though they need more consideration.
* I'm going with Elasticsearch for now, with the expectation of running into HTTP-related latency issues.
* starting to piece together a project plan, see `ideas.md` in this directory
* using the idea of fewest semantic boundaries. An example is pulling in presidential data: exporting to JSON and using the Elastic Cloud UI is how I'm going to do it. The current idea is to maintain proper lineage (that this data is from a non-reproducible source) and use evergreen migrations to keep old pipelines up to date.
* oy, JSON decoding/encoding.. But more importantly, it seems the Elastic Cloud API has differences from the one I'm used to. Need to spend time double-checking that I'm barking up the correct tree and that this "engine" (weird choice of wording, IMO) can do the BI functions, so I don't have to use a different Elastic product..
* .. and .. bummer, `aggs` is not supported on their cloud offering.
* .. and .. another setback: `elm-ui` and `elm-charts` don't seem to get along. At this point I can't tell if it's a large issue, but `viewport` seems to be the culprit (it's not inheriting the height/width dimensions of parent `row`s/`column`s)
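For concreteness, the `aggs` feature mentioned above refers to aggregation blocks in the `_search` request body, e.g. a terms aggregation that buckets documents by a field. A minimal Python sketch of such a request body (the index and field names come from this repo's pipeline; the `.keyword` suffix is an assumption about the eventual mapping, not something the log has established yet):

```python
import json

def terms_aggs_body(field: str, size: int = 10) -> dict:
    """Build an Elasticsearch `_search` body that buckets documents by `field`."""
    return {
        "size": 0,  # no hits, only the aggregation buckets
        "aggs": {
            "by_field": {
                "terms": {"field": field, "size": size}
            }
        },
    }

# Hypothetical usage against the presidential-approval-ratings index:
body = terms_aggs_body("president_name.keyword")
print(json.dumps(body))
```

This is the body one would POST to `/<index>/_search`; it is this `aggs` key that Elastic Cloud's offering rejected.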
* Elastic Cloud doesn't support the aggregation stuff I need. I kinda felt this when I noticed the word `engine` being used where I felt `index` was more appropriate. My guess is they're optimizing for the profitable case: dev-friendly product search. For that goal, I think it's pretty cool. Kibana seems to be a package deal, so, Plan B: a VM on GCP: http://34.121.52.200:9200/
* I've again run into a problem I'm setting out to solve:
    * used the Python pipeline to update Elasticsearch with the new index, http://34.121.52.200:9200/presidential-approval-ratings/_search
    * aggregations don't support `text` fields, which is the data type ES guessed `president_name` to be. Trying to change the mapping from `text` -> `keyword` doesn't work (requires too much memory).
      An interesting thing about Elasticsearch is how configurable it is. The error message even says this can be overridden at the cost of (much) higher memory usage. I'm pretty impressed with the open-source version of Elasticsearch so far; this sort of thing makes me trust it.
    * Solution 1: update the pipeline script - been there, done that, no thanks
    * Solution 2: evergreen!
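For reference, the conventional Elasticsearch-side fix for the aggregation-on-`text` error above is to declare an explicit mapping at index creation time instead of letting ES guess `text`: either map the field as `keyword` outright, or as `text` with a `.keyword` multi-field so it stays searchable *and* aggregatable. A sketch of such a mapping body in Python (`president_name` is from this repo's pipeline; `approval_rating` is a hypothetical second field for illustration):

```python
import json

# Mapping body to PUT when creating the index, e.g.
# PUT /presidential-approval-ratings  with this as the request body.
mapping = {
    "mappings": {
        "properties": {
            "president_name": {
                "type": "text",
                # `.keyword` sub-field: aggregatable without enabling fielddata
                "fields": {"keyword": {"type": "keyword"}},
            },
            # hypothetical numeric field for illustration
            "approval_rating": {"type": "float"},
        }
    }
}
print(json.dumps(mapping))
```

Existing fields can't be re-typed in place, which matches the failed `text` -> `keyword` attempt above; the usual route is to create a new index with the right mapping and reindex into it.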
7 changes: 4 additions & 3 deletions README.md
@@ -2,8 +2,6 @@

Check out the [demo](https://fir-lamdera.lamdera.app/)

References:

@@ -19,8 +17,11 @@ The Python scripts in this repo assume the environment variable `ELASTIC_CLOUD_A

Radar:
* Entity relation diagram in Elm: https://github.com/azimuttapp/azimutt
* I think elm-animator has more features, but this seems a little more ready to go. Worth experimenting with both: https://github.com/andrewMacmurray/elm-simple-animation/tree/2.1.0


Ideas / features on the docket:
* regex pre-validation of input data, UI based - ran into issues with ES cloud validator
* simple nesting / un-nesting, UI based - ran into issues with Slides exporter not supporting row-indices, which would've solved this problem, see `presidential_approval_pipeline.py` for an example (the nested `for` loop)
* Lamdera-cron + roll-up jobs, https://www.elastic.co/guide/en/elasticsearch/reference/current/rollup-get-rollup-caps.html
* During development of the twitch_vod pipeline, I stumbled upon this. I'm starting to see an idea emerge where Lamdera can be used to manage metadata https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html
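The "regex pre-validation of input data" idea above can be sketched in a few lines: check raw string fields against patterns before a document is sent to Elasticsearch, instead of relying on the cloud validator. The field names and patterns here are hypothetical placeholders, not the repo's actual schema:

```python
import re

# Hypothetical per-field validation rules.
RULES = {
    "president_name": re.compile(r"[A-Za-z .'-]+"),
    "poll_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def invalid_fields(doc: dict) -> list:
    """Return the names of fields whose values fail their pattern."""
    return [
        field
        for field, pattern in RULES.items()
        if field in doc and not pattern.fullmatch(str(doc[field]))
    ]

print(invalid_fields({"president_name": "George Washington", "poll_date": "1790-01-01"}))  # → []
print(invalid_fields({"president_name": "G3org3!", "poll_date": "not a date"}))  # → ['president_name', 'poll_date']
```

A UI on top of this would just surface `invalid_fields` per row before indexing.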
2 changes: 1 addition & 1 deletion elastic-search/presidential_approval_pipeline.py
@@ -117,7 +117,7 @@ def publish_to_es(validated_data: t.List[ApprovalRatingDoc], config: _Config):
if __name__ == "__main__":
config = _Config(
es_host="http://34.121.52.200:9200/",
destination_index="presidential-approval-ratings-dev",
destination_index="presidential-approval-ratings",
# destination_index="presidential-approval-ratings",
source_dir=Path(Path.home(), "data/presidential_approval"),
dry_run=False,
3 changes: 2 additions & 1 deletion elastic-search/shell-scripts/pipeline.sh
@@ -1,4 +1,5 @@
#!/bin/bash

# intended to run locally
docker-compose run dev_kit python presidential_approval_pipeline.py
3 changes: 0 additions & 3 deletions ideas.md
@@ -45,6 +45,3 @@ Here's a diagram of the architecture I'm pursuing first.
#### Another idea:
An example of the post-schema-change notification panel. When adding new types of data to an index, Elastic Cloud notifies you to confirm the types (it presents its best guess). This is pretty cool; I like this approach better than [Great Expectations](https://greatexpectations.io/). Is there a possible UI that can give us an "evergreen-ed" version of this?
![schema fields](./assets/fig1.png)



69 changes: 69 additions & 0 deletions src/Api/Catalog.elm
@@ -0,0 +1,69 @@
module Api.Catalog exposing (CatalogIndices, catalogIndicesDecoder)

import Json.Decode
import Json.Encode



-- Required packages:
-- * elm/json
--
-- Generated decoder for the `_cat/indices?format=json` response. Field
-- names containing "." are encoded with `U46` (Unicode code point 46,
-- i.e. ".") in the record labels below.


type alias CatalogIndices =
{ docsU46count : String
, docsU46deleted : String
, health : String
, index : String
, pri : String
, priU46storeU46size : String
, rep : String
, status : String
, storeU46size : String
, uuid : String
}


catalogIndicesDecoder : Json.Decode.Decoder (List CatalogIndices)
catalogIndicesDecoder =
Json.Decode.list rootObjectDecoder


rootObjectDecoder : Json.Decode.Decoder CatalogIndices
rootObjectDecoder =
let
fieldSet0 =
Json.Decode.map8 CatalogIndices
(Json.Decode.field "docs.count" Json.Decode.string)
(Json.Decode.field "docs.deleted" Json.Decode.string)
(Json.Decode.field "health" Json.Decode.string)
(Json.Decode.field "index" Json.Decode.string)
(Json.Decode.field "pri" Json.Decode.string)
(Json.Decode.field "pri.store.size" Json.Decode.string)
(Json.Decode.field "rep" Json.Decode.string)
(Json.Decode.field "status" Json.Decode.string)
in
Json.Decode.map3 (<|)
fieldSet0
(Json.Decode.field "store.size" Json.Decode.string)
(Json.Decode.field "uuid" Json.Decode.string)


encodedRoot : List CatalogIndices -> Json.Encode.Value
encodedRoot root =
Json.Encode.list encodedRootObject root


encodedRootObject : CatalogIndices -> Json.Encode.Value
encodedRootObject rootObject =
Json.Encode.object
[ ( "docs.count", Json.Encode.string rootObject.docsU46count )
, ( "docs.deleted", Json.Encode.string rootObject.docsU46deleted )
, ( "health", Json.Encode.string rootObject.health )
, ( "index", Json.Encode.string rootObject.index )
, ( "pri", Json.Encode.string rootObject.pri )
, ( "pri.store.size", Json.Encode.string rootObject.priU46storeU46size )
, ( "rep", Json.Encode.string rootObject.rep )
, ( "status", Json.Encode.string rootObject.status )
, ( "store.size", Json.Encode.string rootObject.storeU46size )
, ( "uuid", Json.Encode.string rootObject.uuid )
]