Getting Started→
A hello world example that walks through the development of a simple transform is given below. Alternatively, you can follow the embedded YouTube video.
Dated
The version shown in the tutorial is somewhat outdated, but the flow should still work. We are working on an updated tutorial.
Workspace Setup→
-
Background
The platform and its development environment are written in Python, so you need a working Python environment, including any operating-system-specific packages that Python itself requires. There are no known OS-specific dependencies.
-
Set up the working environment.
Now set up a virtual environment in which to install enrichpkg. We normally use virtualenvwrapper. You can use any other virtual environment tool as long as it supports pip.
-
Install SDK
Installation works like that of any other Python package:
# Replace the server with the installed server domain name
$ export ENRICH_ENV="enrichenv"

# Make the virtualenv. This command is specific to virtualenvwrapper.
$ mkvirtualenv $ENRICH_ENV

# Install SDK from your private pypi distribution site. The version could be
# different from the one shown below.
$ pip install -U enrichsdk
...
Building wheels for collected packages: enrichsdk
  Building wheel for enrichsdk (setup.py) ... done
  Created wheel for enrichsdk: filename=enrichsdk-2.7.6-py3-none-any.whl size=161210 sha256=68a386a5b12bab51e3ebcf310c84b8425e1d341ae4ba4da651ed5cc27127fb73
  Stored in directory: /home/pingali/.cache/pip/wheels/f0/1b/0d/6046ca9cf0501aeb76e8134bc7815ab58913c862373147d0a5
Successfully built enrichsdk
Installing collected packages: docutils, python-dateutil, Flask, enrichsdk
....
Successfully installed Flask-0.12.2 docutils-0.15 enrichsdk-2.7.6 python-dateutil-2.8.0
-
Test SDK
First check that the SDK has been installed correctly. Occasionally there are conflicts between Python package dependencies, but the installation should normally work:
$ enrichpkg
Usage: enrichpkg [OPTIONS] COMMAND [ARGS]...

  init/test/install Enrich modules and access server

  Getting started:
     version: Version of this sdk
     start: First time instructions
     env: Setup/check the setup

  Development:
     init: Bootstrap modules including transforms*
     test: Test transforms, manage datasets
     doodle: Access Doodle metadata server
     manage: Manage services such as mongo

  Server:
     api: Access the server API

  Utils:
     sample: Sample data for sharing

  Helpers:
     show-log: Pretty print log output

  *Command used to be called bootstrap

Options:
  --help  Show this message and exit.
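The same installation check can be scripted. The following is a minimal sketch, not part of the SDK: it uses only the Python standard library, and the helper name `sdk_status` is hypothetical. It assumes the distribution is named `enrichsdk` and the command `enrichpkg`, as shown above.

```python
import shutil
from importlib import metadata


def sdk_status(package="enrichsdk", command="enrichpkg"):
    """Report the installed package version and the CLI location, if any.

    sdk_status is a hypothetical helper for illustration only.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    # shutil.which returns None when the command is not on PATH
    return {"version": version, "cli": shutil.which(command)}


if __name__ == "__main__":
    status = sdk_status()
    print("enrichsdk version:", status["version"] or "not installed")
    print("enrichpkg location:", status["cli"] or "not on PATH")
```

If either value comes back empty, the virtual environment is probably not activated or the install step did not complete.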
Local development of transforms requires you to first bootstrap the environment, which includes the workspace, environment variables, and so on. A single command lists the steps in the process. We go through each step below:
$ enrichpkg start --minimal
    ______           _      __       _____ ____  __ __
   / ____/___  _____(_)____/ /_     / ___// __ \/ //_/
  / __/ / __ \/ ___/ / ___/ __ \    \__ \/ / / / ,<
 / /___/ / / / /  / / /__/ / / /   ___/ / /_/ / /| |
/_____/_/ /_/_/  /_/\___/_/ /_/   /____/_____/_/ |_|

Note: Use --commands to see how to implement these steps

[ 1] Understand Enrich and SDK
[ 2] set ENRICH_ROOT and populate
[ 3] Check and update siteconf and versionmap
[ 4] [OPTIONAL] For complex deployments, create an environment/context file
[ 5] [OPTIONAl] Create a simple settings file
[ 6] Change dir to $ENRICH_CUSTOMERS
[ 7] GIT checkout a code repository
[ 8] Change to checked out repository
[ 9] Bootstrap the repo if not already done
[10] Create usecase, say Marketing
[11] Change to usecase
[12] Bootstrap a transform
[13] Install repo-specific requirements
[14] Test the transform
[15] Bootstrap a pipeline
[16] Check pipeline
[17] Bootstrap a Prefect workflow
[18] Check workflow
A more detailed set of instructions can be obtained as well:
$ enrichpkg start
    ______           _      __       _____ ____  __ __
   / ____/___  _____(_)____/ /_     / ___// __ \/ //_/
  / __/ / __ \/ ___/ / ___/ __ \    \__ \/ / / / ,<
 / /___/ / / / /  / / /__/ / / /   ___/ / /_/ / /| |
/_____/_/ /_/_/  /_/\___/_/ /_/   /____/_____/_/ |_|

[ 1] Understand Enrich and SDK
[ 2] set ENRICH_ROOT and populate
     $ export ENRICH_ROOT=/home/pingali/work/enrich
     $ mkdir -p $ENRICH_ROOT
     $ enrichpkg env populate
[ 3] Check and update siteconf and versionmap
     $ enrichpkg env check
     $ cat $ENRICH_ROOT/etc/siteconf.json
[ 4] [OPTIONAL] For complex deployments, create an environment/context file
     $ enrichpkg env sample-context > context.yaml
[ 5] [OPTIONAl] Create a simple settings file
     $ # file to be sourced before you start working. use the appropriate
     $ # environment activation mechanism
     $ echo "#source /home/pingali/temp/sdk/testenv/bin/activate" >$ENRICH_ROOT/env.sh
     $ echo "workon testenv" >$ENRICH_ROOT/env.sh
     $ echo "export ENRICH_ROOT=/home/pingali/work/enrich" >> $ENRICH_ROOT/env.sh
[ 6] Change dir to $ENRICH_CUSTOMERS
     $ cd $ENRICH_ROOT/customers
[ 7] GIT checkout a code repository
     $ #To handle python paths etc. avoid spaces and hyphen in the names
     $ git checkout git@github.com:alphainc/enrich-acme.git acme
[ 8] Change to checked out repository
     $ cd $ENRICH_ROOT/customers/acme
[ 9] Bootstrap the repo if not already done
     $ enrichpkg bootstrap repo -p .
[10] Create usecase, say Marketing
     $ # Typically usecases have first letter in capital
     $ enrichpkg bootstrap usecase -p Marketing
[11] Change to usecase
     $ cd Marketing
[12] Bootstrap a transform
     $ # Will create a python script
     $ enrichpkg bootstrap transform-simple -p transforms/persona.py
     $ # Will create a python module
     $ enrichpkg bootstrap transform-simple -p transforms/persona
     $ # Will create a python package
     $ enrichpkg bootstrap transform-package -p transforms/persona
     $ # Will create a hello world script
     $ enrichpkg bootstrap transform-helloworld -p transforms/helloworld.py
[13] Install repo-specific requirements
     $ # Add any requirements and install them
     $ vi ../requirements.txt
     $ pip install -r ../requirements.txt
[14] Test the transform
     $ enrichpkg test transform transforms/helloworld.py
     $ # To capture all debug logs
     $ enrichpkg test transform --capture transforms/helloworld.py
[15] Bootstrap a pipeline
     $ enrichpkg bootstrap pipeline-conf -p pipelines/conf/persona.py
[16] Check pipeline
     $ enrichpkg test conf --capture pipelines/conf/persona.py
[17] Bootstrap a Prefect workflow
     $ # prefect is the default workflow engine
     $ enrichpkg bootstrap prefectjob -p workflows/prefect/daily.py
[18] Check workflow
     $ workflows/prefect/daily.py
-
Bootstrap the environment
First, use the CLI to check the status of the environment:
$ enrichpkg env
Usage: enrichpkg env [OPTIONS] COMMAND [ARGS]...

  Setup the environment.

  Environment includes:

  (a) Workspace for enrich to work
  (b) Minimal configuration

  This can be specified using environment variables and/or 'context' file.
  The absolute minimal required environment variable is ENRICH_ROOT (path
  to enrich workspace e.g., ~/enrich)

Options:
  --help  Show this message and exit.

Commands:
  check              Check environment
  populate           Populate the directories
  sample-context     Generate sample context file
  sample-siteconf    Generate sample siteconf
  sample-versionmap  Generate sample versionmap

$ enrichpkg env check
Error! ENRICH_ROOT must be specified in context file or environment

# Point to some directory where enrich workspace can be bootstrapped.
$ export ENRICH_ROOT=/home/john/work/enrich

$ enrichpkg env check
Checking Environment
===============
✓ ENRICH_ROOT defined
✓ ENRICH_DATA defined
✓ ENRICH_TEST defined
✓ ENRICH_ETC defined
✓ ENRICH_SHARED defined
✓ ENRICH_VAR defined
✓ ENRICH_LIB defined
✓ ENRICH_OPT defined
✓ ENRICH_LOGS defined
✓ ENRICH_CUSTOMERS defined
✓ ENRICH_RELEASES defined
✓ ENRICH_CUSTOMERS defined
✓ ENRICH_TEST defined

Checking Data Configuration
===============
❌ ENRICH_CUSTOMERS is missing
❌ siteconf missing: /home/john/enrich/etc/siteconf.json
❌ versionmap missing: /home/john/enrich/etc/versionmap.json
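To illustrate the kind of check involved, here is a rough sketch in plain Python that looks for workspace directories under ENRICH_ROOT. It is not how `enrichpkg env check` is implemented. The helper name is hypothetical, and the four subdirectory names are an assumption drawn only from paths that appear in this guide; a real workspace contains more.

```python
import os
from pathlib import Path

# Assumption: only the subdirectories visible in paths shown in this guide.
EXPECTED_SUBDIRS = ("data", "test", "etc", "customers")


def missing_workspace_dirs(root=None):
    """Return the expected subdirectories that do not exist under root.

    Falls back to the ENRICH_ROOT environment variable when root is omitted.
    """
    root = root or os.environ.get("ENRICH_ROOT")
    if not root:
        raise ValueError("ENRICH_ROOT must be set or passed explicitly")
    base = Path(root).expanduser()
    return [name for name in EXPECTED_SUBDIRS if not (base / name).is_dir()]
```

An empty list means all four directories exist; anything else names the directories that still have to be created.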
Populate the workspace:
$ enrichpkg env populate
✓ ENRICH_ROOT created
✓ ENRICH_DATA created
✓ ENRICH_TEST created
✓ ENRICH_ETC created
✓ ENRICH_SHARED created
✓ ENRICH_VAR created
✓ ENRICH_LIB created
✓ ENRICH_OPT created
✓ ENRICH_LOGS created
✓ ENRICH_CUSTOMERS created
✓ ENRICH_RELEASES created
# ENRICH_CUSTOMERS exists
# ENRICH_TEST exists
✓ siteconf initialized
✓ versionmap initialized

$ enrichpkg env check
...
❌ No applications linked in ENRICH_CUSTOMERS /home/john/enrich/customers
✓ Valid siteconf exists
✓ Valid versionmap exists

# Check the siteconf. You have to put your credentials in this
# json. Don't worry. In production deployment, the siteconf is
# obfuscated/encrypted as required
$ cat /home/john/enrich/etc/siteconf.json
{
    "customer": "Acme Inc",
    "dashboard": {
        "title": "Acme Rich Data Platform"
    },
    "credentials": {
        "data-bucket": {
            "nature": "s3",
            "bucket": "acme-datalake",
            "readonly": false,
            "access_key": "AKIAJURXL...Q",
            "secret_key": "tutww...A"
        }
    }
}
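Before relying on the siteconf, it can help to confirm that each credential entry is complete. The sketch below uses only the standard `json` module and assumes the file has the shape shown above. The function names and the choice of `nature` as the default required key are assumptions for illustration, not SDK behavior.

```python
import json
from pathlib import Path


def load_credentials(siteconf_path):
    """Load siteconf.json and return its 'credentials' mapping."""
    with Path(siteconf_path).open() as fd:
        siteconf = json.load(fd)
    return siteconf.get("credentials", {})


def incomplete_credentials(credentials, required=("nature",)):
    """Return the sorted names of entries that lack any required key.

    Which keys are required is an assumption; pass your own tuple.
    """
    return sorted(
        name
        for name, entry in credentials.items()
        if any(key not in entry for key in required)
    )
```

For an S3 entry like the one above, you might call `incomplete_credentials(creds, required=("nature", "bucket", "access_key", "secret_key"))`.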
-
Test Pre-Built Transform
Download and test an existing (trivial) usecase. It consists of two transforms. The first is a sales transform that collects or generates sales data for each make/model combination:
# Activate the environment (See above)
$ workon $ENRICH_ENV

# Make sure the enrich_root is specified. this is important.
$ export ENRICH_ROOT=/home/john/work/enrich

# Now download the dummy organization's code
$ cd $ENRICH_ROOT/customers
$ git clone git@github.com:pingali/enrich-acme.git acme

# Alternatively if you clone it somewhere else, create the symbolic
# link to tell enrich about it.
$ ln -s .../enrich-acme acme

# Test
$ cd acme
$ cd Marketing

# Download datasets...
$ ./bin/install.sh
###############################
Installing Acme Datasets
###############################
Downloaded usedcars dataset into
(1) /home/john/enrich/data/acme/Marketing/shared/acme
(2) /home/john/enrich/test/CarSales/shared/acme
(3) /home/john/enrich/test/CarModel/shared/acme
Now you are ready to test:
$ pwd
/home/john/enrich/customers/acme/Marketing

# Look at various test options
$ enrichpkg test
Usage: enrichpkg test [OPTIONS] COMMAND [ARGS]...

  Test transforms/pipelines etc.

Options:
  --context TEXT  Environment file
  --help          Show this message and exit.

Commands:
  conf       Minimal testing of pipeline/task configuration
  data       Manage test data spec could be transform or a spec file
  task-lib   Unit testing of a task library
  transform  Unit testing of a package module (transform)

$ enrichpkg test transform --help
Usage: enrichpkg test transform [OPTIONS] PKGDIR

  Unit testing of a package module (transform)

Options:
  --capture / --no-capture  Capture output
  --help                    Show this message and exit.

# Test one transform
$ enrichpkg test transform pkg/transforms/sales/
✓ Loaded imported the sales module
✓ Module has a provider attribute
✓ Able to instantiate the module
✓ Module has testdata
# Module's testdata usually has non-trivial 'data' element
✓ Testdata appears valid
✓ Able to load test data
✓ Configured the module
✓ Validated the configuration
✓ Starting process
✓ Executed the process function
✓ Validated the results
✓ Stored the results
Results in /home/pingali/temp/test/enrich/test/CarSales/state

# Test another
$ enrichpkg test transform pkg/transforms/cars
✓ Loaded imported the cars module
✓ Module has a provider attribute
✓ Able to instantiate the module
✓ Module has testdata
✓ Testdata appears valid
✓ Able to load test data
✓ Configured the module
✓ Validated the configuration
✓ Starting process
✓ Executed the process function
✓ Validated the results
✓ Stored the results
Results in /home/pingali/temp/test/enrich/test/CarModel/state
-
Let's recreate the transforms used above
We recommend that you create a repo to store all code; deployment can then happen directly from the repo. Let's say your organization is called Alpha and the enrich repo is called enrich-alpha.
Create an overall configuration file called enrich.json. This tells the Enrich platform to whom these modules belong. Also create an __init__.py, as shown below, to tell Python that this is a valid package:
# Add ability to discover applications. More on this later.
$ cat __init__.py
import os, sys
from enrich.customers import get_customers_in_dir

def get_customers():
    thisdir = os.path.abspath(os.path.dirname(__file__))
    return get_customers_in_dir(thisdir)

# Replace the details as you see fit. Don't worry about the logo
# for now.
$ cat enrich.json
{
    "org": {
        "customer": "alpha",
        "name": "Alpha Inc",
        "description": "Alpha retail corporation",
        "logo": "/static/images/company/logo.png"
    },
    "repository": {
        "author": "John Smith",
        "email": "john.smith@alpha.com",
        "giturl": "https://github.com/alpha/enrich-alpha.git"
    }
}

# Check what is in the directory
$ ls
enrich.json __init__.py
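A quick way to catch typos in enrich.json before bootstrapping is to check it for the keys used in the example above. This is a minimal sketch with the standard library only. Treating every key in the example as required is an assumption, and `missing_enrich_keys` is a hypothetical helper, not an SDK function.

```python
import json

# Keys taken from the enrich.json example in this guide. Whether the
# platform requires all of them is an assumption.
REQUIRED_KEYS = {
    "org": ("customer", "name", "description", "logo"),
    "repository": ("author", "email", "giturl"),
}


def missing_enrich_keys(config):
    """Return dotted names of keys absent from an enrich.json mapping."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            missing.append(section)  # the whole section is absent
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


if __name__ == "__main__":
    example = json.loads('{"org": {"customer": "alpha", "name": "Alpha Inc"}}')
    print(missing_enrich_keys(example))
```

In practice you would load the real file with `json.load(open("enrich.json"))` and expect an empty list.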
Now bootstrap the code for a division such as marketing:
$ enrichpkg bootstrap help
The following components are supported:

repo                : A new git repository
usecase             : A collection of transforms, pipelines, assets and workflows
transform-package   : A complex transform that is structured as a full python package
transform-simple    : A transform structured as a python module
transform-helloworld: A simple hello world transform for study
transform-query     : A transform that uses a pre-existing database querying module
transform-metrics   : A transform that uses a pre-existing metrics module
transform-iris      : Simple Iris post-processing example
asset               : A reusable library
dashboard           : A django app that fits the Enrich Dashboard framework
pipeline            : A pipeline configuration
prefectjob          : Prefect workflow template
datasets            : Datasets specification file
rscript             : R-Script template
pyscript            : A python script template

Use the component name like this

    enrichpkg init repo -p <path>

You can define your own templates as well. Please use the -t option

# Create a usecase (collection of transforms, pipelines, assets etc.)
$ enrichpkg bootstrap usecase -p Marketing
Created Marketing/pipelines
Created Marketing/pipelines/conf
Created Marketing/pipelines/lib
Created Marketing/pipelines/jobs
Created Marketing/dashboard
Created Marketing/bin
Created Marketing/docs
Created Marketing/commands
Created Marketing/transforms
Created Marketing/assets
Created Marketing/tasks
Created Marketing/tasks/conf
Created Marketing/tasks/lib
Created Marketing/tasks/jobs
Created Marketing/workflows
Created Marketing/workflows/prefect
Created Marketing/workflows/spark
Added a readme in Marketing/README.rst
> repo_giturl: https://github.com/alpha/enrich-alpha.git
> org_logourl: /static/images/company/logo.png
> usecase_name: Marketing
> author_name: John Smith
> org_name: alpha
> author_email: john.smith@alpha.com
> org_description: Alpha retail corporation
Store all the code:
First Transform→
-
Create a transform skeleton
enrichpkg has a bootstrap command that creates a skeleton of a transform. The command may optionally ask for global information that will be used to bootstrap future transforms and pipelines.
-
Transform structure
The transform has the following core elements:
- Pre-conditions to meet
- Outputs that the transform provides
- Test data
- Actual processing function
- Housekeeping to create automatic audit logs and documentation
- Validation of the computation
import os, sys
import numpy as np
import pandas as pd
from enrichsdk import Compute
from datetime import datetime
import logging

logger = logging.getLogger("app")

class Mycars(Compute):

    def __init__(self, *args, **kwargs):
        super(Mycars,self).__init__(*args, **kwargs)
        self.name = "cars"

        # This is specifying outputs of this transform. This module
        # generates two output frames outputframe1 and outputframe2,
        # they have the specified columns. This list is used to
        # validate the outputs. The description is used for
        # automatically generating the documentation of this module.
        #
        self.outputs = {
            "outputframe1": {
                "col1": "description",
                "col2": "description"
            },
            "outputframe2": {
                "col1": "description",
                "col2": "description"
            }
        }

        # This is specifying preconditions to running this
        # transform. For example, inputframe1 should exist and should
        # have been touched by transform1 and transform2. inputframe2
        # should be touched by transform3.
        self.dependencies = {
            "inputframe1": ["transform1", "transform2"],
            "inputframe2": "transform3"
        }

        # This is specifying what execution-time parameters
        # are supported by this transform. This is not enabled
        # by default. In the pipeline definition, you have to
        # specify a flag (enable_extra_args)
        self.supported_extra_args = [
            {
                "name": "suffixes",
                "description": "List of suffixes to be used",
                "required": True,
                "default": "0123456789"
            },
        ]

        # Test data used to check the functionality of this module.
        # Put this data in ENRICH_DATA/temp/test-output
        # test-output/
        #       transform2/
        #            inputframe1.csv
        #       transform3/
        #            inputframe2.csv
        #
        self.testdata = {
            'conf': {
                'args': {
                    'path': "%(data_root)s/shared/output.csv"
                }
            },
            'data': {
                'inputframe1': {
                    'filename': 'inputframe1.csv',
                    'transform': 'transform2',
                    'params': {
                        'sep': ',',
                    }
                },
                'inputframe2': {
                    'filename': 'inputframe2.csv',
                    'transform': 'transform3',
                    'params': {
                        'sep': ',',
                    }
                }
            }
        }

    def validate_args(self, what, state):
        """
        Check that the configured arguments are complete
        """
        args = self.args
        fail = False
        msg = ""

        if "input" not in args:
            fail = True
            msg += "Element 'input' is missing\n"

        if fail:
            logger.error("Invalid configuration",
                         extra=self.config.get_extra({
                             'transform': self.name,
                             'data': msg
                         }))
            raise Exception("Invalid configuration")

    def process(self, state):
        """
        Run the computation and update the state
        """
        logger.debug("{} - process".format(self.name),
                     extra=self.config.get_extra({
                         'transform': self.name
                     }))

        ###############################################
        # => Initialize
        ###############################################
        # Dataframe object. This will expose additional functions
        # missing in the underlying dataframe (e.g., pandas)
        frame = self.config.dataframe

        # => Get the frame details
        frame1_detail = state.get_frame('inputframe1')
        frame2_detail = state.get_frame('inputframe2')

        # => Extract the dataframe. This is usually a pandas
        # dataframe.
        df1 = frame1_detail['df']
        df2 = frame2_detail['df']

        ###############################################
        # => Compute
        ###############################################
        # Do the computation. Generate the updated/new pandas
        # dataframe.
        outputdf1 = ...
        outputdf2 = ...

        ###############################################
        # => Update state
        ###############################################
        # Annotate the dataframe with all/some columns that have been
        # introduced. If the output frame is a new one derived from
        # input frames, then first gather the column information of
        # the input frame: self.collapse_columns(frame1_detail).
        # Otherwise gather all the columns.
        columns = {}
        for c in list(outputdf1.columns):
            columns[c] = {
                'touch': self.name, # Who is introducing this column
                'datatype': frame.get_generic_dtype(outputdf1, c), # What is its type
                'description': self.get_column_description('outputframe1', c) # text associated with this column
            }

        # => Gather the update parameters
        updated_detail = {
            'df': outputdf1,
            'transform': self.name,
            'params': [
                {
                    'type': 'compute',
                    'columns': columns
                }
            ],
            'history': [
                # Add a log entry describing the change
                {
                    'transform': self.name,
                    'log': 'your description',
                }
            ]
        }

        # Update the state.
        state.update_frame('outputframe1', updated_detail, create=True)

        # Do the same thing for the second update dataframe

        ###########################################
        # => Return
        ###########################################
        return state

    def validate_results(self, what, state):
        """
        Check to make sure that the execution completed correctly
        """
        frame = self.config.dataframe

        ####################################################
        # => Output Dataframe 1
        ####################################################
        name = 'outputframe1'
        if not state.reached_stage(name, self.name):
            raise Exception("Could not find new frame created for {}".format(name))

        detail = state.get_frame(name)
        df = detail['df']

        # => Make sure it is not empty
        assert frame.shape(df)[0] > 0

        cols = frame.columns(df)
        for c in ['col1', 'col2']:
            if c not in cols:
                logger.error("Missing column: {}".format(c),
                             extra=self.config.get_extra({
                                 'transform': self.name
                             }))
                raise Exception("Invalid output generated")

provider = Mycars
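The `outputs` and `dependencies` declarations in the skeleton lend themselves to simple consistency checks. The following is a standalone sketch of that idea in plain Python, independent of the SDK. The function names and the shapes of `produced_columns` and `touched_by` are hypothetical, chosen only to mirror the two dictionaries in the skeleton.

```python
def missing_output_columns(declared_outputs, produced_columns):
    """Map each declared frame to the declared columns it lacks.

    declared_outputs mirrors self.outputs: {frame: {column: description}}.
    produced_columns is hypothetical: {frame: iterable of column names}.
    A frame that was never produced is reported with all its columns.
    """
    problems = {}
    for frame, columns in declared_outputs.items():
        produced = set(produced_columns.get(frame, ()))
        missing = [c for c in columns if c not in produced]
        if missing:
            problems[frame] = missing
    return problems


def unmet_dependencies(dependencies, touched_by):
    """Map each input frame to the transforms that have not touched it.

    dependencies mirrors self.dependencies; a value may be a single
    transform name or a list of names. touched_by is hypothetical:
    {frame: iterable of transform names}.
    """
    unmet = {}
    for frame, required in dependencies.items():
        required = [required] if isinstance(required, str) else list(required)
        seen = set(touched_by.get(frame, ()))
        pending = [t for t in required if t not in seen]
        if pending:
            unmet[frame] = pending
    return unmet
```

Both functions return an empty dictionary when everything declared is satisfied, which makes them easy to use in an assertion.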
-
Unit testing transform
The test command is intelligent enough to catch syntax errors, load the relevant test data, and drive the unit testing of the transform. In the case of a renderer, the test command runs an application server to check whether the content is rendered correctly.
Sample output for a failed run looks like this:
$ enrichpkg test transform transforms/cars/
❌ Test directory missing
✓ Loaded imported the cars module
✓ Module has a provider attribute
✓ Able to instantiate the module
✓ Module has testdata
✓ Testdata appears valid
No files to load for inputframe1
No files to load for inputframe2
✓ Able to load test data
✓ Configured the module
✓ Validated the configuration
✓ Starting process
❌ Could not execute process
  File "transforms/cars/__init__.py", line 105, in process
    frame1_detail = state.get_frame('inputframe1')
  File "/work/pingali/Code/scribble-enrichsdk/enrichsdk/package/mock.py", line 199, in get_frame
    raise Exception("Cannot find the required frame")
Exception: Cannot find the required frame
In this case the test specification requires a dataframe named 'inputframe1' to be loaded into the execution state from the test path (ENRICH_TEST/transform2/inputframe1.csv). The error shows why the transform could not be executed (the path for inputframe1 does not exist):
'data': {
    'inputframe1': {
        'filename': 'inputframe1.csv',
        'transform': 'transform2',
        'params': {
            'sep': ',',
        }
    },
    ....
Fix this by replacing the placeholder names with the relevant ones. Acme's transform looks like this:
...
'data': {
    "sales": {
        "transform": "CarSales",
        "filename": "state/sales.csv",
        "params": {
            "sep": ","
        }
    }
...
A successful test looks like this:
$ enrichpkg test transform transforms/cars
Checking: pkg/transforms/cars
✓ Loaded imported the cars module
✓ Module has a provider attribute
✓ Able to instantiate the module
✓ Module has testdata
✓ Testdata appears valid
✓ Able to load test data
✓ Configured the module
✓ Executed the process function
✓ Validated the results
✓ Stored the results
Results in ...enrich/test/cars
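To find missing test files before running the test command, the `testdata` specification can be checked against the file system. This is a minimal sketch that assumes each entry resolves to `<test root>/<transform>/<filename>`, following the ENRICH_TEST/transform2/inputframe1.csv example above. The helper name is hypothetical and the SDK's actual lookup rules may differ.

```python
from pathlib import Path


def missing_testdata_files(testdata, test_root):
    """Return (frame, path) pairs for test files that do not exist.

    Assumes each entry in testdata['data'] resolves to
    <test_root>/<transform>/<filename>.
    """
    missing = []
    for frame, spec in testdata.get("data", {}).items():
        path = Path(test_root) / spec["transform"] / spec["filename"]
        if not path.is_file():
            missing.append((frame, str(path)))
    return missing
```

Called with the transform's `testdata` dictionary and the directory that ENRICH_TEST points to, an empty result suggests that every input frame has a file to load.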