
Getting Started

The hello-world example below shows how to develop a simple transform. Alternatively, you can follow the embedded YouTube video.

Dated

The version shown in the tutorial is somewhat outdated, but the overall flow should still work. We are working on an updated tutorial.

Workspace Setup

  1. Background

    The platform and the development environment are written in Python, so you need a working Python environment, including any operating-system packages that Python itself requires. The SDK has no known OS-specific dependencies.

  2. Set up the working environment

    Now set up a virtual environment in which to install the SDK and its enrichpkg command. We normally use virtualenvwrapper, but any virtual environment tool works as long as it supports pip.

  3. Install SDK

    It looks like installing any other python package:

    # Name of the virtual environment to create (any name works)
    $ export ENRICH_ENV="enrichenv"
    
    # Make the virtualenv. This command is specific to virtualenvwrapper.
    $ mkvirtualenv $ENRICH_ENV
    
    # Install SDK from your private pypi distribution site. The version could be
    # different from the one shown below. 
    $ pip install  -U enrichsdk
    ...
    Building wheels for collected packages: enrichsdk
      Building wheel for enrichsdk (setup.py) ... done
      Created wheel for enrichsdk: filename=enrichsdk-2.7.6-py3-none-any.whl size=161210 sha256=68a386a5b12bab51e3ebcf310c84b8425e1d341ae4ba4da651ed5cc27127fb73
      Stored in directory: /home/pingali/.cache/pip/wheels/f0/1b/0d/6046ca9cf0501aeb76e8134bc7815ab58913c862373147d0a5
    Successfully built enrichsdk
    Installing collected packages: docutils, python-dateutil, Flask, enrichsdk
    ....
    Successfully installed Flask-0.12.2 docutils-0.15 enrichsdk-2.7.6 python-dateutil-2.8.0
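
The installed version can also be confirmed from Python. The following is a minimal sketch that uses only the standard library (importlib.metadata, Python 3.8+); the helper name installed_version is hypothetical and is not part of the SDK.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version pip installed, or None when the package is missing
# from the active virtual environment.
print("enrichsdk:", installed_version("enrichsdk"))
```

If this prints None, the virtual environment that is active is probably not the one the SDK was installed into.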
    
  4. Test SDK

    First check that the SDK has been installed correctly. Occasionally there are conflicts between Python package dependencies, but the installation should normally work:

    $ enrichpkg
    Usage: enrichpkg [OPTIONS] COMMAND [ARGS]...
    
      init/test/install Enrich modules and access server
    
      Getting started:
         version:  Version of this sdk
         start:    First time instructions
         env:      Setup/check the setup
    
      Development:
         init:   Bootstrap modules including transforms*
         test:   Test transforms, manage datasets
         doodle: Access Doodle metadata server
         manage: Manage services such as mongo
    
      Server:
         api:       Access the server API
    
      Utils:
         sample:    Sample data for sharing
    
      Helpers:
         show-log:  Pretty print log output
    
      *Command used to be called bootstrap
    
    Options:
      --help  Show this message and exit.
    

    Local development of transforms requires you to first bootstrap the environment, which includes the workspace, environment variables, and so on. A single command lists the steps involved; we go through each of them below:

    $ enrichpkg start --minimal
        ______           _      __       _____ ____  __ __
       / ____/___  _____(_)____/ /_     / ___// __ \/ //_/
      / __/ / __ \/ ___/ / ___/ __ \    \__ \/ / / / ,<   
     / /___/ / / / /  / / /__/ / / /   ___/ / /_/ / /| |  
    /_____/_/ /_/_/  /_/\___/_/ /_/   /____/_____/_/ |_|  
    
    
    Note: Use --commands to see how to implement these steps
    
    [ 1] Understand Enrich and SDK
    [ 2] set ENRICH_ROOT and populate
    [ 3] Check and update siteconf and versionmap
    [ 4] [OPTIONAL] For complex deployments, create an environment/context file
    [ 5] [OPTIONAl] Create a simple settings file
    [ 6] Change dir to $ENRICH_CUSTOMERS
    [ 7] GIT checkout a code repository
    [ 8] Change to checked out repository
    [ 9] Bootstrap the repo if not already done
    [10] Create usecase, say Marketing
    [11] Change to usecase
    [12] Bootstrap a transform
    [13] Install repo-specific requirements
    [14] Test the transform
    [15] Bootstrap a pipeline
    [16] Check pipeline
    [17] Bootstrap a Prefect workflow
    [18] Check workflow
    

    A more detailed set of instructions can be obtained as well:

    $ enrichpkg start 
        ______           _      __       _____ ____  __ __
       / ____/___  _____(_)____/ /_     / ___// __ \/ //_/
      / __/ / __ \/ ___/ / ___/ __ \    \__ \/ / / / ,<   
     / /___/ / / / /  / / /__/ / / /   ___/ / /_/ / /| |  
    /_____/_/ /_/_/  /_/\___/_/ /_/   /____/_____/_/ |_|  
    
    
    [ 1] Understand Enrich and SDK
    
    
    [ 2] set ENRICH_ROOT and populate
    
        $ export ENRICH_ROOT=/home/pingali/work/enrich
        $ mkdir -p $ENRICH_ROOT
        $ enrichpkg env populate
    
    [ 3] Check and update siteconf and versionmap
    
        $ enrichpkg env check
        $ cat $ENRICH_ROOT/etc/siteconf.json
    
    [ 4] [OPTIONAL] For complex deployments, create an environment/context file
    
        $ enrichpkg env sample-context > context.yaml
    
    [ 5] [OPTIONAl] Create a simple settings file
    
        $ # file to be sourced before you start working. use the appropriate
        $ # environment activation mechanism
        $ echo "#source /home/pingali/temp/sdk/testenv/bin/activate" >$ENRICH_ROOT/env.sh
        $ echo "workon testenv" >$ENRICH_ROOT/env.sh
        $ echo "export ENRICH_ROOT=/home/pingali/work/enrich" >> $ENRICH_ROOT/env.sh
    
    [ 6] Change dir to $ENRICH_CUSTOMERS
    
        $ cd $ENRICH_ROOT/customers
    
    [ 7] GIT checkout a code repository
    
        $ #To handle python paths etc. avoid spaces and hyphen in the names
        $ git checkout git@github.com:alphainc/enrich-acme.git acme
    
    [ 8] Change to checked out repository
    
        $ cd $ENRICH_ROOT/customers/acme
    
    [ 9] Bootstrap the repo if not already done
    
        $ enrichpkg bootstrap repo -p .
    
    [10] Create usecase, say Marketing
    
        $ # Typically usecases have first letter in capital
        $ enrichpkg bootstrap usecase -p Marketing
    
    [11] Change to usecase
    
        $ cd Marketing
    
    [12] Bootstrap a transform
    
        $ # Will create a python script
        $ enrichpkg bootstrap transform-simple -p transforms/persona.py
        $ # Will create a python module
        $ enrichpkg bootstrap transform-simple -p transforms/persona
        $ # Will create a python package
        $ enrichpkg bootstrap transform-package -p transforms/persona
        $ # Will create a hello world script
        $ enrichpkg bootstrap transform-helloworld -p transforms/helloworld.py
    
    [13] Install repo-specific requirements
    
        $ # Add any requirements and install them
        $ vi ../requirements.txt
        $ pip install -r ../requirements.txt
    
    [14] Test the transform
    
        $ enrichpkg test transform transforms/helloworld.py
        $ # To capture all debug logs
        $ enrichpkg test transform --capture transforms/helloworld.py
    
    [15] Bootstrap a pipeline
    
        $ enrichpkg bootstrap pipeline-conf -p pipelines/conf/persona.py
    
    [16] Check pipeline
    
        $ enrichpkg test conf --capture pipelines/conf/persona.py
    
    [17] Bootstrap a Prefect workflow
    
        $ # prefect is the default workflow engine
        $ enrichpkg bootstrap prefectjob -p workflows/prefect/daily.py
    
    [18] Check workflow
    
        $ workflows/prefect/daily.py
    
  5. Bootstrap the environment

    First, use the CLI to check the status of the environment:

    $ enrichpkg env
    Usage: enrichpkg env [OPTIONS] COMMAND [ARGS]...
    
      Setup the environment. Environment includes:
    
          (a) Workspace for enrich to work
          (b) Minimal configuration
    
      This can be specified using environment variables and/or 'context' file.
    
      The absolute minimal required environment variable is ENRICH_ROOT (path to
      enrich workspace e.g., ~/enrich)
    
    Options:
      --help  Show this message and exit.
    
    Commands:
      check              Check environment
      populate           Populate the directories
      sample-context     Generate sample context file
      sample-siteconf    Generate sample siteconf
      sample-versionmap  Generate sample versionmap
    
    $ enrichpkg env check
      Error! ENRICH_ROOT must be specified in context file or environment
    
    # Point to some directory where enrich workspace can be bootstrapped.
    $ export ENRICH_ROOT=/home/john/work/enrich
    
    $ enrichpkg env check
    Checking Environment
    ===============
    
     ✓ ENRICH_ROOT defined 
     ✓ ENRICH_DATA defined 
     ✓ ENRICH_TEST defined 
     ✓ ENRICH_ETC defined 
     ✓ ENRICH_SHARED defined 
     ✓ ENRICH_VAR defined 
     ✓ ENRICH_LIB defined 
     ✓ ENRICH_OPT defined 
     ✓ ENRICH_LOGS defined 
     ✓ ENRICH_CUSTOMERS defined 
     ✓ ENRICH_RELEASES defined 
     ✓ ENRICH_CUSTOMERS defined 
     ✓ ENRICH_TEST defined
    
    Checking Data Configuration
    ===============
    
     ❌ ENRICH_CUSTOMERS is missing 
     ❌ siteconf missing: /home/john/enrich/etc/siteconf.json 
     ❌ versionmap missing: /home/john/enrich/etc/versionmap.json
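
The last two checks above can be reproduced by hand. This is a rough sketch of the idea, not the SDK's implementation: it only assumes, based on the paths in the output above, that siteconf.json and versionmap.json live under the etc directory of the workspace. The function name check_workspace is hypothetical.

```python
import tempfile
from pathlib import Path


def check_workspace(root):
    """Report whether the workspace root and its two config files exist."""
    etc = Path(root) / "etc"
    return {
        "root": Path(root).is_dir(),
        "siteconf": (etc / "siteconf.json").is_file(),
        "versionmap": (etc / "versionmap.json").is_file(),
    }


# Demonstrate on a throwaway directory instead of a real workspace.
with tempfile.TemporaryDirectory() as tmp:
    before = check_workspace(tmp)            # fresh directory, no config yet
    etc = Path(tmp) / "etc"
    etc.mkdir()
    (etc / "siteconf.json").write_text("{}")
    (etc / "versionmap.json").write_text("{}")
    after = check_workspace(tmp)             # both config files now present

print(before)
print(after)
```

To inspect a real workspace, pass the value of ENRICH_ROOT, for example check_workspace(os.environ["ENRICH_ROOT"]) after importing os.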
    

    Populate the workspace:

    $ enrichpkg env populate
      ENRICH_ROOT created 
      ENRICH_DATA created 
      ENRICH_TEST created 
      ENRICH_ETC created 
      ENRICH_SHARED created 
      ENRICH_VAR created 
      ENRICH_LIB created 
      ENRICH_OPT created 
      ENRICH_LOGS created 
      ENRICH_CUSTOMERS created 
      ENRICH_RELEASES created 
     # ENRICH_CUSTOMERS exists 
     # ENRICH_TEST exists 
      siteconf initialized 
      versionmap initialized 
    
    $ enrichpkg env check
    ...    
      No applications linked in ENRICH_CUSTOMERS /home/john/enrich/customers 
      Valid siteconf exists 
      Valid versionmap exists 
    
    # Check the siteconf. You have to put your credentials in this
    # JSON file. Don't worry: in a production deployment, the siteconf
    # is obfuscated/encrypted as required.
    $ cat /home/john/enrich/etc/siteconf.json
    {
        "customer": "Acme Inc",
        "dashboard": {
            "title": "Acme Rich Data Platform"
        },
        "credentials": {
            "data-bucket": {
                "nature": "s3",
                "bucket": "acme-datalake",
                "readonly": false,
                "access_key": "AKIAJURXL...Q",
                "secret_key": "tutww...A"
            }
        }
    }  
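
Because the siteconf holds credentials, it helps to mask them before pasting the file into a ticket or a chat. The sketch below is one way to do that with the standard library. The field names come from the sample above; the key values are placeholders, and treating access_key and secret_key as the only secret fields is an assumption.

```python
import copy
import json

# Same shape as the sample siteconf above, with placeholder key values.
SAMPLE = """
{
    "customer": "Acme Inc",
    "dashboard": {"title": "Acme Rich Data Platform"},
    "credentials": {
        "data-bucket": {
            "nature": "s3",
            "bucket": "acme-datalake",
            "readonly": false,
            "access_key": "EXAMPLE-ACCESS-KEY",
            "secret_key": "EXAMPLE-SECRET-KEY"
        }
    }
}
"""

# Assumption: these are the fields worth masking.
SECRET_FIELDS = ("access_key", "secret_key")


def redact(siteconf):
    """Return a copy of the siteconf with secret fields replaced by '***'."""
    safe = copy.deepcopy(siteconf)
    for cred in safe.get("credentials", {}).values():
        for field in SECRET_FIELDS:
            if field in cred:
                cred[field] = "***"
    return safe


conf = json.loads(SAMPLE)
print(json.dumps(redact(conf), indent=4))
```

The original dictionary is left untouched, so the same object can still be used to connect to the bucket.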
    
  6. Test Pre-Built Transform

    Download and test an existing (trivial) usecase. It consists of two transforms, sales and cars. The sales transform collects/generates sales data for each make/model combination:

    # Activate the environment (See above)
    $ workon $ENRICH_ENV 
    
    # Make sure ENRICH_ROOT is set. This is important.
    $ export ENRICH_ROOT=/home/john/work/enrich
    
    # Now download the dummy organization's code
    $ cd $ENRICH_ROOT/customers
    $ git clone git@github.com:pingali/enrich-acme.git acme 
    
    # Alternatively if you clone it somewhere else, create the symbolic
    # link to tell enrich about it. 
    $ ln -s .../enrich-acme acme
    
    # Test
    $ cd acme    
    $ cd Marketing
    
    # Download datasets...
    $ ./bin/install.sh
     ###############################
     Installing Acme Datasets
     ###############################
    
     Downloaded usedcars dataset into 
        (1) /home/john/enrich/data/acme/Marketing/shared/acme
        (2) /home/john/enrich/test/CarSales/shared/acme
        (3) /home/john/enrich/test/CarModel/shared/acme
    

    Now you are ready to test:

    $ pwd
    /home/john/enrich/customers/acme/Marketing
    
    # Look at various test options
    $ enrichpkg test 
    Usage: enrichpkg test [OPTIONS] COMMAND [ARGS]...
    
      Test transforms/pipelines etc.
    
    Options:
      --context TEXT  Environment file
      --help          Show this message and exit.
    
    Commands:
      conf       Minimal testing of pipeline/task configuration
      data       Manage test data spec could be transform or a spec file
      task-lib   Unit testing of a task library
      transform  Unit testing of a package module (transform)
    
    $ enrichpkg test transform --help
    Usage: enrichpkg test transform [OPTIONS] PKGDIR
    
      Unit testing of a package module (transform)
    
    Options:
      --capture / --no-capture  Capture output
      --help                    Show this message and exit.
    
    # Test one transform
    $ enrichpkg test transform pkg/transforms/sales/
      Loaded imported the sales module 
      Module has a provider attribute 
      Able to instantiate the module 
      Module has testdata 
     # Module's testdata usually has non-trivial 'data' element 
      Testdata appears valid 
      Able to load test data 
      Configured the module 
      Validated the configuration 
      Starting process 
      Executed the process function 
      Validated the results 
      Stored the results 
    Results in /home/pingali/temp/test/enrich/test/CarSales/state
    
    # Test another  
    $ enrichpkg test transform pkg/transforms/cars
      Loaded imported the cars module 
      Module has a provider attribute 
      Able to instantiate the module 
      Module has testdata 
      Testdata appears valid 
      Able to load test data 
      Configured the module 
      Validated the configuration 
      Starting process 
      Executed the process function 
      Validated the results 
      Stored the results 
    Results in /home/pingali/temp/test/enrich/test/CarModel/state
    
  7. Let's recreate the transforms used above

    We recommend that you create a repo to store all code; deployment can then happen directly from the repo. Let's say your organization is called Alpha and the enrich repo is called enrich-alpha:

    $ cd $ENRICH_ROOT/customers 
    $ git clone git@github.com:alpha/enrich-alpha.git alpha 
    $ cd alpha
    

    Create an overall configuration file called enrich.json, which tells the Enrich platform to whom these modules belong. Also create an __init__.py, as shown below, to tell Python that this is a valid package:

    # Add ability to discover applications. More on this later.
    $ cat __init__.py
    import os, sys 
    
    from enrich.customers import get_customers_in_dir
    
    def get_customers(): 
    
        thisdir = os.path.abspath(os.path.dirname(__file__))
        return get_customers_in_dir(thisdir) 
    
    # Replace the details as you see fit. Don't worry about the logo
    # for now. 
    $ cat enrich.json 
    {
        "org": {
            "customer": "alpha",
            "name": "Alpha Inc",
            "description": "Alpha retail corporation", 
            "logo": "/static/images/company/logo.png"
        },
        "repository": { 
            "author": "John Smith", 
            "email": "john.smith@alpha.com", 
            "giturl": "https://github.com/alpha/enrich-alpha.git" 
        }
    }
    
    # Check what is in the directory
    $ ls 
    enrich.json __init__.py 
    

    Now bootstrap the code for a division such as marketing:

    $ enrichpkg bootstrap help
    The following components are supported:
    
       repo                : A new git repository
       usecase             : A collection of transforms, pipelines, assets and workflows
       transform-package   : A complex transform that is structured as a full python package
       transform-simple    : A transform structured as a python module
       transform-helloworld: A simple hello world transform for study
       transform-query     : A  transform that uses a pre-existing database querying module
       transform-metrics   : A  transform that uses a pre-existing metrics module
       transform-iris      : Simple Iris post-processing example
       asset               : A reusable library 
       dashboard           : A django app that fits the Enrich Dashboard framework
       pipeline            : A pipeline configuration
       prefectjob          : Prefect workflow template
       datasets            : Datasets specification file
       rscript             : R-Script template
       pyscript            : A python script template
    
    Use the component name like this
        enrichpkg init repo -p <path>
    
    You can define your own templates as well. Please use the -t option
    
    # Create a usecase (collection of transforms, pipelines, assets etc.)
    $ enrichpkg bootstrap usecase -p Marketing
    Created Marketing/pipelines
    Created Marketing/pipelines/conf
    Created Marketing/pipelines/lib
    Created Marketing/pipelines/jobs
    Created Marketing/dashboard
    Created Marketing/bin
    Created Marketing/docs
    Created Marketing/commands
    Created Marketing/transforms
    Created Marketing/assets
    Created Marketing/tasks
    Created Marketing/tasks/conf
    Created Marketing/tasks/lib
    Created Marketing/tasks/jobs
    Created Marketing/workflows
    Created Marketing/workflows/prefect
    Created Marketing/workflows/spark
    Added a readme in Marketing/README.rst
    > repo_giturl: https://github.com/alpha/enrich-alpha.git
    > org_logourl: /static/images/company/logo.png
    > usecase_name: Marketing
    > author_name: John Smith
    > org_name: alpha
    > author_email: john.smith@alpha.com
    > org_description: Alpha retail corporation
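
If a usecase was created some time ago, or by hand, a quick way to compare it against the layout listed above is sketched below. The directory list is copied from the bootstrap output; the function name missing_dirs is hypothetical, and the SDK itself may not require every directory to be present.

```python
import tempfile
from pathlib import Path

# Directories reported by "enrichpkg bootstrap usecase" in the output above.
EXPECTED_DIRS = [
    "pipelines", "pipelines/conf", "pipelines/lib", "pipelines/jobs",
    "dashboard", "bin", "docs", "commands", "transforms", "assets",
    "tasks", "tasks/conf", "tasks/lib", "tasks/jobs",
    "workflows", "workflows/prefect", "workflows/spark",
]


def missing_dirs(usecase_root):
    """Return the expected subdirectories that do not exist under the root."""
    root = Path(usecase_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Demonstrate on a throwaway, deliberately incomplete usecase directory.
with tempfile.TemporaryDirectory() as tmp:
    usecase = Path(tmp) / "Marketing"
    (usecase / "pipelines" / "conf").mkdir(parents=True)
    missing = missing_dirs(usecase)

print(missing)
```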
    

    Store all the code:

    $ git add . 
    $ git commit -a -m "Bootstrapped the repo" 
    

First Transform

  1. Create a transform skeleton

    enrichpkg has a bootstrap command that creates a skeleton of a transform. The command may also ask for global information that will be used to bootstrap future transforms and pipelines:

    $ cd Marketing 
    $ enrichpkg bootstrap transform-simple -p transforms/cars
    ==============================
      Welcome to Scribble Enrich 
    =============================
    
    
    Bootstrapping a simple transform
    
    > name: Cars
    Bootstrapped script in transforms/cars/__init__.py
    
  2. Transform structure

    The transform has the following core elements:

    • Pre-conditions to meet
    • Outputs that the transform provides
    • Test data
    • Actual processing function
    • Housekeeping to create automatic audit logs and documentation
    • Validation of the computation

    import os, sys
    import numpy as np 
    import pandas as pd 
    from enrichsdk import Compute 
    from datetime import datetime 
    import logging 
    
    logger = logging.getLogger("app") 
    
    class Mycars(Compute): 
    
        def __init__(self, *args, **kwargs): 
            super(Mycars,self).__init__(*args, **kwargs) 
            self.name = "cars" 
    
            # This is specifying outputs of this transform. This module
            # generates two output frames outputframe1 and outputframe2,
            # they have the specified columns. This list is used to
            # validate the outputs. The description is used for
            # automatically generating the documentation of this module.
            # 
            self.outputs = { 
                "outputframe1": { 
                    "col1": "description", 
                    "col2": "description"
                },
                "outputframe2": { 
                    "col1": "description", 
                    "col2": "description"
                }
            }
    
            # This is specifying preconditions to running this
            # transform. For example, inputframe1 should exist and should
            # have been touched by transform1 and transform2. inputframe2
            # should be touched by transform3.
            self.dependencies = { 
                "inputframe1": ["transform1", "transform2"],
                "inputframe2": "transform3" 
            }
    
            # This is specifying what execution-time parameters
            # are supported by this transform. This is not enabled
            # by default. In the pipeline definition, you have to
            # specify a flag (enable_extra_args) 
            self.supported_extra_args = [
                {
                    "name": "suffixes",
                    "description": "List of suffixes to be used",
                    "required": True,
                    "default": "0123456789"
                },
            ]
    
            # Test data used to check the functionality of this module. 
            # Put this data in ENRICH_DATA/temp/test-output
            # test-output/
            #     transform2/
            #          inputframe1.csv 
            #     transform3/
            #          inputframe2.csv 
            # 
            self.testdata = {
                'conf': {
                    'args': {
                        'path': "%(data_root)s/shared/output.csv" 
                    }
                },
                'data': { 
                    'inputframe1': {
                        'filename': 'inputframe1.csv', 
                        'transform': 'transform2',
                        'params': {
                            'sep': ',', 
                        }
                    },
                    'inputframe2': {
                        'filename': 'inputframe2.csv', 
                        'transform': 'transform3',
                        'params': {
                            'sep': ',', 
                        }
                    }
                }
            }
    
        def validate_args(self, what, state):
            """
            Check that the required arguments are present in the configuration
            """
            args = self.args
    
            fail = False
            msg = ""
    
            if "input" not in args:
                fail = True
                msg += "Element 'input' is missing\n"
    
            if fail:
                logger.error("Invalid configuration",
                             extra=self.config.get_extra({
                                 'transform': self.name,
                                 'data': msg
                             }))
                raise Exception("Invalid configuration")
    
        def process(self, state): 
            """
            Run the computation and update the state 
            """
            logger.debug("{} - process".format(self.name),
                         extra=self.config.get_extra({
                             'transform': self.name 
                         }))
    
            ###############################################
            # => Initialize 
            ###############################################
            # Dataframe object. This will expose additional functions
            # missing in the underlying dataframe (e.g., pandas)
            frame = self.config.dataframe 
    
            # => Get the frame details 
            frame1_detail = state.get_frame('inputframe1') 
            frame2_detail = state.get_frame('inputframe2') 
    
            # => Extract the dataframe. This is usually a pandas
            # dataframe. 
            df1 = frame1_detail['df']
            df2 = frame2_detail['df']
    
            ###############################################
            # => Compute
            ###############################################
            # Do the computation. Generate the updated/new pandas
            # dataframe.
            outputdf1 = ...
            outputdf2 = ...
    
            ###############################################
            # => Update state 
            ###############################################
    
            # Annotate the dataframe with all/some columns that have been
            # introduced. If the output frame is a new one derived from
            # input frames, first gather the column information of the
            # input frame with self.collapse_columns(frame1_detail).
            # Otherwise gather all the columns.
            columns = {} 
            for c in list(outputdf1.columns): 
                columns[c] = { 
                    'touch': self.name, # Who is introducing this column
                    'datatype': frame.get_generic_dtype(outputdf1, c), # What is its type 
                    'description': self.get_column_description('outputframe1', c) # text associated with this column 
                } 
    
            # => Gather the update parameters 
            updated_detail = { 
                'df': outputdf1, 
                'transform': self.name, 
                'params': [
                    {
                        'type': 'compute',
                        'columns': columns 
                    }
                ], 
                'history': [
                    # Add a log entry describing the change 
                    {
                        'transform': self.name, 
                        'log': 'your description', 
                    }
                ]
            }
    
            # Update the state. 
            state.update_frame('outputframe1', updated_detail, create=True) 
    
            # Do the same thing for the second update dataframe
    
            ###########################################
            # => Return 
            ###########################################
            return state 
    
        def validate_results(self, what, state): 
            """
            Check to make sure that the execution completed correctly
            """
    
            frame = self.config.dataframe 
    
            ####################################################
            # => Output Dataframe 1 
            ####################################################
            name = 'outputframe1'
            if not state.reached_stage(name, self.name): 
                raise Exception("Could not find new frame created for {}".format(name))
    
            detail = state.get_frame(name) 
            df = detail['df'] 
    
            # => Make sure it is not empty 
            assert frame.shape(df)[0] > 0 
    
            cols = frame.columns(df) 
            for c in ['col1', 'col2']: 
                if c not in cols: 
                    logger.error("Missing column: {}".format(c), 
                                 extra=self.config.get_extra({
                                     'transform': self.name 
                                 }))
                    raise Exception("Invalid output generated") 
    
    provider = Mycars 
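
The dependencies attribute in the skeleton expresses preconditions: each input frame must already have been touched by the listed transforms. The sketch below illustrates that rule in plain Python. It is not how the SDK evaluates dependencies internally; the function unmet_dependencies and the touched mapping are hypothetical names introduced for illustration.

```python
def unmet_dependencies(dependencies, touched):
    """
    dependencies: frame name -> transform name, or list of transform names
    touched:      frame name -> transforms that have already processed it

    Returns a dict of frame name -> required transforms that are missing.
    A frame that is absent from `touched` is reported with all of its
    required transforms.
    """
    problems = {}
    for frame, required in dependencies.items():
        # The skeleton allows either a single name or a list of names.
        if isinstance(required, str):
            required = [required]
        seen = touched.get(frame, [])
        missing = [t for t in required if t not in seen]
        if missing:
            problems[frame] = missing
    return problems


# Same dependency declaration as in the skeleton above.
dependencies = {
    "inputframe1": ["transform1", "transform2"],
    "inputframe2": "transform3",
}

# Hypothetical state: only transform1 has run so far.
touched = {"inputframe1": ["transform1"]}

problems = unmet_dependencies(dependencies, touched)
print(problems)
```

An empty result means that every precondition is satisfied and the transform could run.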
    
  3. Unit testing transform

    The test command is intelligent enough to catch syntax errors, load relevant test data and drive the unit testing of transform. In case of a renderer, the test command will run an application server to check whether the content is being rendered correctly.

    Sample output for a failed test looks like this:

    $ enrichpkg test transform transforms/cars/
     Test directory missing 
     Loaded imported the cars module 
     Module has a provider attribute 
     Able to instantiate the module 
     Module has testdata 
     Testdata appears valid 
    No files to load for inputframe1
    No files to load for inputframe2
     Able to load test data 
     Configured the module 
     Validated the configuration 
     Starting process 
     Could not execute process 
    File "transforms/cars/__init__.py", line 105, in process
      frame1_detail = state.get_frame('inputframe1')
    File "/work/pingali/Code/scribble-enrichsdk/enrichsdk/package/mock.py", line 199, in get_frame
      raise Exception("Cannot find the required frame")
    Exception: Cannot find the required frame
    

    In this case, the test specification requires a dataframe named 'inputframe1' to be loaded into the execution state from the test path (ENRICH_TEST/transform2/inputframe1.csv). The error shows why the transform could not be executed (the inputframe1 path doesn't exist):

    'data': {
        'inputframe1': {
            'filename': 'inputframe1.csv',
            'transform': 'transform2',
            'params': {
                'sep': ',',
            }
        },
    ....
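
To see which files a test specification expects, the 'data' section can be expanded into paths by hand. The sketch below assumes the layout described above, that is, test root, then the transform name, then the filename. The helper name resolve_testdata and the test root /tmp/enrich/test are hypothetical.

```python
from pathlib import Path


def resolve_testdata(test_root, data_spec):
    """Map each frame name to the file its spec points at under test_root."""
    return {
        name: Path(test_root) / spec["transform"] / spec["filename"]
        for name, spec in data_spec.items()
    }


# Same 'data' section as in the skeleton transform.
data_spec = {
    "inputframe1": {
        "filename": "inputframe1.csv",
        "transform": "transform2",
        "params": {"sep": ","},
    },
    "inputframe2": {
        "filename": "inputframe2.csv",
        "transform": "transform3",
        "params": {"sep": ","},
    },
}

# Hypothetical test root; in practice this would be the ENRICH_TEST directory.
paths = resolve_testdata("/tmp/enrich/test", data_spec)
for name, path in paths.items():
    status = "found" if path.is_file() else "missing"
    print(name, path, status)
```

Any frame reported as missing here would lead to the "Cannot find the required frame" failure shown earlier.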
    

    Fix this by replacing the placeholder names with the relevant ones. Acme's transform looks like:

    ...
    'data': {
        "sales": {
            "transform": "CarSales",
            "filename": "state/sales.csv",
            "params": {
                "sep": ","
            }
        }
    ...
    

    Successful test looks like this:

    $ enrichpkg test  transform transforms/cars
    
    Checking: pkg/transforms/cars
      Loaded imported the cars module 
      Module has a provider attribute 
      Able to instantiate the module 
      Module has testdata 
      Testdata appears valid 
      Able to load test data 
      Configured the module 
      Executed the process function 
      Validated the results 
      Stored the results 
    Results in ...enrich/test/cars