SNOWFLAKE DATA WAREHOUSE is a fully relational ANSI SQL data warehouse built on patented technology. Some of its features are:
- Cloud based
- Zero Management
- Time Travel
- Any scale (Data, Compute, Users)
- Pay as you go usage
- and much more
AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume - there is no charge when your code is not running.
This write-up is about how to code a simple AWS Lambda function that communicates with Snowflake. It demonstrates a working example and reflects on points to consider. The code is available on GitHub - lambdasnowflake.
Typically, when loading data into Snowflake, the preferred approach is to collect a large amount of data (for example, into an S3 bucket) and load it, e.g. via the COPY command. This also holds for loading mini-batched streaming data in sizes of at least 100 MB.
There are some scenarios where the data obtained is very small in size or the record count is low (say, fewer than 1000 records); in these cases a direct write operation seems a better approach. This write-up demonstrates one such hypothetical scenario.
Kijiji, here in Canada, is a social platform to buy and sell stuff, somewhat similar to the well-known Craigslist. It offers an RSS feed of the ads posted (an example: Kijiji-rss-link), showing ads across Canada and across all ad categories. Typically, on each retrieval you get around 20 or so records, and the feed gets updated frequently.
For this write-up, the AWS Lambda function retrieves the feed, parses the content, extracts some information, and stores the data into Snowflake.
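The repository has the real parsing code; purely as an illustration, here is a rough sketch of how such a feed could be fetched and a few fields pulled out, assuming the scala-xml library is on the classpath. The `AdRecord` case class and the field choices are hypothetical, not the project's actual model.

```scala
import java.net.URL
import scala.xml.XML

// Hypothetical record of the fields we care about from each RSS <item>.
case class AdRecord(title: String, link: String, published: String)

object FeedParserSketch {
  // Fetch the RSS feed and build one simple record per <item> element.
  def fetch(feedUrl: String): Seq[AdRecord] = {
    val rss = XML.load(new URL(feedUrl))
    (rss \\ "item").map { item =>
      AdRecord(
        title     = (item \ "title").text.trim,
        link      = (item \ "link").text.trim,
        published = (item \ "pubDate").text.trim
      )
    }
  }
}
```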
The entry point to the Lambda function is implemented in the class KijijiLoaderFn, method handle. The Lambda is set to be triggered by a CloudWatch scheduler, at the rate of every 5 minutes.
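The actual handler lives in the repository; as a hedged sketch only, an entry point of roughly this shape would match that setup, assuming the standard aws-lambda-java-core and aws-lambda-java-events libraries. The body here is illustrative outline, not the project's code.

```scala
import com.amazonaws.services.lambda.runtime.Context
import com.amazonaws.services.lambda.runtime.events.ScheduledEvent

// Sketch of the entry point; the real KijijiLoaderFn lives in the repository.
class KijijiLoaderFn {

  // Invoked by the CloudWatch (EventBridge) schedule every 5 minutes.
  def handle(event: ScheduledEvent, context: Context): Unit = {
    context.getLogger.log(s"Triggered at ${event.getTime}")
    // 1. fetch the Kijiji RSS feed
    // 2. parse and extract the fields of interest
    // 3. write the records to Snowflake through the pooled connection
  }
}
```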
To communicate with Snowflake, you can typically go about creating an application service account with a user and password. Snowflake also provides authentication using a key-pair methodology, with key-rotation functionality (see the key-pair docs).
In a normal execution environment, the private key is stored in a secure location and the program accesses it from that path. In a Lambda environment, since the container is dynamic, a regular file location cannot be relied upon. One possible way to overcome this is to have the function load the private key file from a specific protected S3 bucket.
A better approach is to use AWS SECRETS MANAGER. We can store all the necessary connection details for a specific Snowflake account here. In this example I have stored the following:
- URL
- Snowflake account
- Private key passphrase
- Private key file content
At run time, a quick API call for this specific key returns the entire record as JSON data. The program can then parse the JSON and go about authenticating with Snowflake. The code for this is demonstrated in the trait DBConnPoolInitializer, method retrieveDBConnInfo; the JSON parsing is in method parseDBSecretInfo.
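Purely as a sketch of that flow (not the project's DBConnPoolInitializer itself), and assuming the AWS SDK for Java v1 Secrets Manager client plus Jackson for the JSON parsing, the retrieval could look like this. The secret's JSON field names below are my own guesses mirroring the list above.

```scala
import com.amazonaws.services.secretsmanager.AWSSecretsManagerClientBuilder
import com.amazonaws.services.secretsmanager.model.GetSecretValueRequest
import com.fasterxml.jackson.databind.ObjectMapper

// Hypothetical holder for the connection details stored in the secret.
case class SnowflakeConnInfo(url: String, account: String,
                             passphrase: String, privateKeyPem: String)

object SecretsSketch {
  private val client = AWSSecretsManagerClientBuilder.defaultClient()
  private val mapper = new ObjectMapper()

  // Fetch the secret record and parse the JSON fields we stored.
  def retrieveDBConnInfo(awsSecretSnowflakeKey: String): SnowflakeConnInfo = {
    val request = new GetSecretValueRequest().withSecretId(awsSecretSnowflakeKey)
    val json    = mapper.readTree(client.getSecretValue(request).getSecretString)
    SnowflakeConnInfo(
      url           = json.get("url").asText(),
      account       = json.get("account").asText(),
      passphrase    = json.get("private_key_passphrase").asText(),
      privateKeyPem = json.get("private_key").asText()
    )
  }
}
```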
NOTE: Though it is hardcoded here in the code, in actual production logic I would recommend retrieving the awsSecretSnowflakeKey via an environment variable. The environment variable can be defined as part of the deployment process.
The code instantiates and sets up a connection pool via the well-known library HikariCP. The commonly adopted pattern of connecting to a data source using user/password works; however, there is currently no implementation for connecting to a data source using the key-pair methodology mentioned above.
To overcome this, I created a simple implementation which can be configured in HikariCP. The data source implementation is the class SnowflakeKeypairDataSource. The parsing of the private key and connecting are present in the methods getEncryptedPrivateKey and getConnection. Configuring the Hikari library is done in the class DBConnPoolInitializer, in the method instantiateDataSource.
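The project's SnowflakeKeypairDataSource is the real implementation; the following is only a hedged sketch of the same idea, assuming BouncyCastle for decrypting an encrypted PKCS#8 key and the documented Snowflake JDBC behaviour of accepting a PrivateKey object under the privateKey connection property. The class, object, and parameter names here are illustrative.

```scala
import java.io.{PrintWriter, StringReader}
import java.security.{PrivateKey, Security}
import java.sql.{Connection, DriverManager}
import java.util.Properties
import java.util.logging.Logger
import javax.sql.DataSource

import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
import org.bouncycastle.jce.provider.BouncyCastleProvider
import org.bouncycastle.openssl.PEMParser
import org.bouncycastle.openssl.jcajce.{JcaPEMKeyConverter, JceOpenSSLPKCS8DecryptorProviderBuilder}
import org.bouncycastle.pkcs.PKCS8EncryptedPrivateKeyInfo

// Sketch of a key-pair-authenticating DataSource that HikariCP can pool.
class SnowflakeKeypairDataSourceSketch(jdbcUrl: String, user: String,
                                       pemContent: String, passphrase: String) extends DataSource {

  // Decrypt the encrypted PKCS#8 private key held as a PEM string (e.g. from Secrets Manager).
  private lazy val privateKey: PrivateKey = {
    if (Security.getProvider(BouncyCastleProvider.PROVIDER_NAME) == null)
      Security.addProvider(new BouncyCastleProvider())
    val parser    = new PEMParser(new StringReader(pemContent))
    val encrypted = parser.readObject().asInstanceOf[PKCS8EncryptedPrivateKeyInfo]
    parser.close()
    val decryptor = new JceOpenSSLPKCS8DecryptorProviderBuilder().build(passphrase.toCharArray)
    new JcaPEMKeyConverter().setProvider("BC").getPrivateKey(encrypted.decryptPrivateKeyInfo(decryptor))
  }

  // The Snowflake JDBC driver accepts a java.security.PrivateKey under the "privateKey" property.
  override def getConnection: Connection = {
    val props = new Properties()
    props.put("user", user)
    props.put("privateKey", privateKey)
    DriverManager.getConnection(jdbcUrl, props)
  }

  // Key-pair auth ignores passwords; delegate to the no-arg variant.
  override def getConnection(username: String, password: String): Connection = getConnection

  // Minimal boilerplate required by javax.sql.DataSource / CommonDataSource.
  override def getLogWriter: PrintWriter = null
  override def setLogWriter(out: PrintWriter): Unit = ()
  override def setLoginTimeout(seconds: Int): Unit = ()
  override def getLoginTimeout: Int = 0
  override def getParentLogger: Logger = Logger.getLogger(getClass.getName)
  override def unwrap[T](iface: Class[T]): T = iface.cast(this)
  override def isWrapperFor(iface: Class[_]): Boolean = iface.isInstance(this)
}

// Wiring the custom DataSource into HikariCP.
object PoolSketch {
  def instantiatePool(jdbcUrl: String, user: String,
                      pemContent: String, passphrase: String): HikariDataSource = {
    val config = new HikariConfig()
    config.setDataSource(new SnowflakeKeypairDataSourceSketch(jdbcUrl, user, pemContent, passphrase))
    config.setMaximumPoolSize(1) // one connection is plenty for a single Lambda container
    new HikariDataSource(config)
  }
}
```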
Lambda executions happen on demand and there is no control over how a Lambda execution container gets allocated. There are no lifecycle events to indicate to your application code that the container has been initialized; by the time your Lambda function executes, the container is already initialized. The article from Yan Cui provided a good explanation when I initially ventured out.
In such a condition, creating and maintaining a connection pool would seem non-deterministic; however, it is well demonstrated by others that AWS tries to reuse an existing Lambda container if there are frequent invocations. Hence a static initialization block in the Lambda class provides a good spot for such one-time initialization. I adopted this approach, and it is done via the object [ContainerInitializer](./KijijiLoaderLambda/src/main/scala/com/hashmap/blog/initializer/ContainerInitializer.scala). The object gets instantiated as part of the main class KijijiLoaderFn.
NOTE: Do not use this approach to store application state, as the container could be destroyed at any random point by AWS.
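To make the pattern concrete, here is a hedged sketch (not the repository's ContainerInitializer): a Scala object body runs exactly once per JVM, i.e. once per Lambda container, so whatever it builds is re-used across warm invocations. It re-uses the hypothetical PoolSketch helper from the Hikari sketch above, and the environment variable names are illustrative.

```scala
import com.zaxxer.hikari.HikariDataSource

// The object body executes the first time the object is referenced and never
// again for the lifetime of the container - a natural place for one-time setup.
object ContainerInitializerSketch {
  println("Cold start: building the Snowflake connection pool")

  // Hypothetical pool construction; see the key-pair DataSource sketch above.
  val pool: HikariDataSource = PoolSketch.instantiatePool(
    jdbcUrl    = sys.env("SNOWFLAKE_JDBC_URL"),   // illustrative env var names
    user       = sys.env("SNOWFLAKE_USER"),
    pemContent = "...",                           // fetched from Secrets Manager in the real flow
    passphrase = "..."
  )
}

class KijijiLoaderFnSketch {
  def handle(): Unit = {
    // Referencing the object triggers the one-time initialization on a cold start only.
    val conn = ContainerInitializerSketch.pool.getConnection
    try {
      // write the parsed feed records to Snowflake here
    } finally conn.close()
  }
}
```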
As part of communicating with Snowflake, for ideal performance you need to set a folder for Snowflake to store its cache. For this, we can use the /tmp folder from the Lambda execution environment. Hence we set the relevant variables to point at this folder at runtime, as demonstrated in DBConnPoolInitializer, method initializeConfigurationToDB.
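As a minimal sketch of the idea only: the property names below are my assumption of the Snowflake JDBC cache settings and should be verified against the driver documentation for the version you use; the method name merely mirrors the write-up.

```scala
// Point the Snowflake JDBC driver's caches at Lambda's writable /tmp folder.
// NOTE: the property names are assumptions - check the Snowflake JDBC docs.
object SnowflakeCacheConfigSketch {
  def initializeConfigurationToDB(): Unit = {
    val cacheDir = "/tmp"
    System.setProperty("net.snowflake.jdbc.ocspResponseCacheDir", cacheDir)
    System.setProperty("net.snowflake.jdbc.temporaryCredentialCacheDir", cacheDir)
  }
}
```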
NOTE: Though it is OK, within limits, to use the /tmp folder, I would generally avoid misusing it to store large amounts of processing data, as the storage size is very limited (currently 512 MB) and could change.
Currently an AWS Lambda function can run for a maximum of 15 minutes; such limits should be considered when implementing the function. The limits are documented here; follow the link. This code, however, executes in 3-5 seconds on average, well below this limit.
NOTE: The functionality demonstrated here does not mean the Lambda is slow (compared to certain use cases); SLA was not a consideration, and the functionality is merely a demonstration of connecting to Snowflake from Lambda.
When developing, it is usually a pain to build the code, deploy to AWS, and run. It is time consuming and could also cost you, with those frequent uploads. I recommend developing using the AWS SAM CLI tool. By using this I could test locally and also deploy to AWS via CloudFormation. Use the template local_samcli.yaml for local testing and the template deploy.yaml when deploying to AWS.
For proper functioning, the Lambda function needs the following policies:
- S3 bucket get
- AWS Secrets Manager read
- Lambda basic execution
Being lazy, for this demo, I used the existing pre-defined managed policies as below. Just ignore the SNS in the definition; it was used for some other demo.
Here is the screenshot from the monitoring after this Lambda ran for a couple of hours:
The Lambda gets triggered every 5 minutes, and each invocation is typically 3 seconds in duration. Even though I have set the max concurrency to 5, since the function finishes within the limit, AWS opts to reuse the same instance on each reschedule. Thus there is normally only 1 execution observed in the Concurrent executions graph.
From the CloudWatch log extract, you can confirm that initialization happened only once, while the normal function that writes to Snowflake ran on every invocation. This demonstrates that AWS tends to re-use the Lambda container between executions.
You might be wondering what the big spike around 11:00 is and why the execution time went up. Well, that's the perils of a budget: I ran out of my Snowflake credits.
This does demonstrate that when you are designing for such mini-batch data loads, you should consider the cost. If in your situation you can delay the loading to a later time (say, every hour) and there is a possibility to buffer the data (for example, in a Kafka queue), it is better to load the data every hour using the COPY command instead of every few minutes.
Though this walk-through is about interfacing with Snowflake and doing mini-batch data loading, you could learn from this exercise and build your Lambda function, coupled with AWS API Gateway, to host a REST endpoint. Then you could invoke the endpoint for real-time data retrieval (doing a massive calculation in the Snowflake warehouse) and display it in your webpage/dashboard.
As explained, connecting to Snowflake from AWS Lambda is possible; I have explained just the basics and there are other considerations when adopting this. At our client engagements we have also developed AWS Lambda functions in Python that communicate with Snowflake and perform similar functionality.
Share this blog if you 👍 it.