Testing

There are two mechanisms to test transforms:

  1. enrichpkg - Enrich's own unit testing mechanism
  2. pytest - Python standard testing mechanism

pytest is used for transform testing, and enrichpkg for everything else. The long-term objective is to move all testing to pytest.

Transform Testing

enrichpkg bootstraps each transform with skeleton test code and the fixtures shown below. Note that this is done only when the transform is bootstrapped with the 'transform-package' option, which constructs the transform as a full Python package. The generated test simulates pipeline execution with all paths overridden to point to test directories. At the end of the transform's execution, the state object is dumped as well. All output is cleaned up by default; if you need to study it, comment out the line that removes the temporary output.

The fixtures include (a) test configuration files and (b) datasets:

fixtures/
fixtures/configs
fixtures/configs/1.json
fixtures/configs/2.json
fixtures/data
fixtures/data/CarModel
fixtures/data/CarModel/cars.csv
fixtures/data/CarModel/sales.csv

The test configuration file specifies how the pipeline state should be loaded before executing the transform. It lists one or more frames to be loaded. Each frame is one of: (a) pandas - a table loaded as a pandas dataframe, or (b) dict - a JSON file loaded as a dictionary.

The params element provides additional information needed to load each file; for pandas frames it carries pandas read_csv parameters. The filename is a template resolved against test variables such as inputdir and the transform name:

$ cat fixtures/configs/1.json
{
    "conf": {
        "enable": true
    },
    "data": {
        "cars": {
            "frametype": "pandas",
            "transform": "CarModel",
            "filename": "%(inputdir)s/%(transform)s/cars.csv",
            "params": {
                "sep": ","
            }
        }
    }
}
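The configuration above loads a pandas frame. For a dict frame, the same filename template mechanics would apply to a JSON file; the following is a minimal sketch, assuming a hypothetical profile.json fixture (dict frames need no read_csv params):

{
    "conf": {
        "enable": true
    },
    "data": {
        "profile": {
            "frametype": "dict",
            "transform": "CarModel",
            "filename": "%(inputdir)s/%(transform)s/profile.json"
        }
    }
}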

import os
import shutil
import tempfile

import pytest

# MockPipeline, walk, and the testtransform module are provided by
# the skeleton that enrichpkg generates for the transform package.

@pytest.mark.parametrize('testdata',
                         [
                             'fixtures/configs/1.json',
                             'fixtures/configs/2.json'
                         ],
                         indirect=True)
def test_transform(testdata):
    """
    Run the transform against each testdata fixture.
    """

    cls = testtransform.provider
    pipeline = MockPipeline()

    # Where is the dataframe input and output?
    root = tempfile.mkdtemp(prefix="enrich")
    testdata.update({

        # This is the input data to load 
        'inputdir': os.path.join(os.path.dirname(__file__),
                                 "fixtures", "data"),

        # This is where the output is stored...
        'outputdir': os.path.join(root,'output'),
        'statedir': os.path.join(root,'state'),
    })

    # Now run the pipeline 
    pipeline.execute(cls, testdata, save_state=True)

    # Check the output
    walk(root)

    # Cleanup
    shutil.rmtree(root)
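The walk helper prints the output tree for inspection. Its actual implementation ships with the generated skeleton; a minimal stand-in that reproduces the indented tree shown in the no-capture run below might look like this:

import os

def walk(root):
    """Print the directory tree under root, one dash group per level."""
    print("Output:")
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        print("---" * (depth + 2), os.path.basename(dirpath))
        for name in filenames:
            print("---" * (depth + 3), name)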

A simple run looks like this:

$ cd .../transforms
$ pytest -vv transforms/testtransform/
============================================================= test session starts =============================================================
platform linux -- Python 3.5.2, pytest-3.2.1, py-1.5.3, pluggy-0.4.0 -- ...virtualenvs/dev/bin/python3
cachedir: transforms/testtransform/.cache
spark version -- Spark 2.2.0 built for Hadoop 2.7.3 | Build flags: -Phadoop-2.7 -Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos -DzincPort=3036
rootdir: ...transforms/testtransform, inifile:
plugins: spark-0.4.5, cov-2.5.1
collected 3 items

transforms/testtransform/tests/test_module.py::test_configuration PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/1.json] PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/2.json] PASSED

You can see the output structure by running pytest in no-capture mode (-s):

$ pytest -s -vv testtransform/
============================================================= test session starts =============================================================
platform linux -- Python 3.5.2, pytest-3.2.1, py-1.5.3, pluggy-0.4.0 -- /home/pingali/.virtualenvs/dev/bin/python3
cachedir: transforms/testtransform/.cache
spark version -- Spark 2.2.0 built for Hadoop 2.7.3 | Build flags: -Phadoop-2.7 -Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos -DzincPort=3036
rootdir: /work/pingali/Code/enrich-scribble/Contrib/transforms/testtransform, inifile:
plugins: spark-0.4.5, cov-2.5.1
collected 3 items

transforms/testtransform/tests/test_module.py::test_configuration PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/1.json] 
Output:
------ enrichpc_ths12
--------- output
------------ run-2015021
--------------- outputs
------------------ cars.csv
------------------ cars.sqlite
--------- state
------------ cars.csv
PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/2.json] 
Output:
------ enrichzvx5ukre
--------- output
------------ run-2015021
--------------- outputs
------------------ cars.csv
------------------ cars.sqlite
--------- state
------------ cars.csv
PASSED

Test Data Setup

The configuration of datasets can be done standalone in a datasets.py in the usecase directory, or in the transform as part of the testdata section.

The Dataset Management module has been created to add capabilities over time.

Example standalone specification:

$ cat datasets.py
...
from acmeapp.datasets import get_dataset_config
datasets = get_dataset_config(source='server-name', sourcetype='s3')
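The returned datasets object presumably carries the same fields as the inline testdata specification shown next, with available and used lists describing the datasets.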

Example in testdata section of a transform:

{
    "command": "aws s3 sync s3://%(backuppath)s %(targetpath)s",
    "params": {
        "enrich_data_dir": "/home/ubuntu/enrich/data",
        "backup_root": "some-s3-path",
        "node": "some hostname"
    },
    "available": [
        # Complete list...
        Dataset({
            "name": "inventory_dataset",
            ...
        }),
        ...
    ],
    "used": [
        # Used by this transform
        "inventory_dataset"
    ]
}
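The command string is a %-style template that is expanded once per matched date directory, producing sync commands like those shown at the end of this section. A minimal sketch of that expansion, with illustrative backuppath and targetpath values:

# Illustrative values only; the real ones are derived from the
# dataset definition and the params block above.
values = {
    "backuppath": "enrich-acme/backup/.../orderhistory/v3/2020-01-02",
    "targetpath": "/home/ubuntu/enrich/data/.../orderhistory/v3/2020-01-02",
}

command = "aws s3 sync s3://%(backuppath)s %(targetpath)s" % values
print(command)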

Once the datasets are configured, you can take a number of actions:

$ enrichpkg test-data ./datasets.py --help
Usage: enrichpkg test-data [OPTIONS] SPEC COMMAND [ARGS]...

  Manage test data.

  spec could be transform or a spec file

Options:
  --capture / --no-capture  Capture output
  --help                    Show this message and exit.

Commands:
  download  Download dataset for a specified range
  list      List available test datasets
  show      Show the details of a given dataset

  1. List the available datasets to download:

    $ enrichpkg test-data ./datasets.py list
    [0] SensorEvents
    [1] Orders
    [2] Customer
    [3] Persona
    [5] Adjustments
    [6] Taxonomy
    
  2. Look at the details of each dataset:

    $ enrichpkg test-data ./datasets.py show orderhistory 
    ✓ Found dataset - orderhistory 
    ------
    Description: Per-day history of orders along with events
    ------
    [test] ...enrich/data/_test/shared/datasets/orderhistory/v3
    [remote local] /home/ubuntu/enrich/data/acme/Marketing/shared/datasets/orderhistory/v3
    [host local] ...enrich/data/acme/PLP/shared/datasets/orderhistory/v3
    [match] Pattern: %Y-%m-%d
       : 2020-04-06T00:00:00 => 2020-04-06
       : 2020-04-07T00:00:00 => 2020-04-07
       : 2020-04-08T00:00:00 => 2020-04-08
       : 2020-04-09T00:00:00 => 2020-04-09
       : 2020-04-10T00:00:00 => 2020-04-10
       : 2020-04-11T00:00:00 => 2020-04-11
       : 2020-04-12T00:00:00 => 2020-04-12
       : 2020-04-13T00:00:00 => 2020-04-13
    
  3. Download any dataset into a specified local directory (the date-range expansion is sketched after this list):

    $ enrichpkg test-data ./datasets.py download alpha 2020-01-02 2020-01-10 --target local
    ✓ Found dataset - alpha
    ------
    Description: Per-day history of alpha along with events
    ------
    [test] ...enrich/data/_test/shared/datasets/orderhistory/v3
    [remote local] /home/ubuntu/enrich/data/acme/Marketing/shared/datasets/orderhistory/v3
    [host local] ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3
    [backup] enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3
    [match] Pattern: %Y-%m-%d
    
    [commands] Please run these after checking
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-02 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-02
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-03 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-03
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-04 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-04
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-05 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-05
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-06 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-06
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-07 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-07
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-08 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-08
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-09 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-09
    aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-10 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-10
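The [match] pattern shown by show and download is a Python strftime format: each timestamp in the requested range is rendered with the pattern to locate the matching per-day directory. A minimal sketch of how the 2020-01-02 through 2020-01-10 range above expands:

from datetime import datetime, timedelta

pattern = "%Y-%m-%d"
start, end = datetime(2020, 1, 2), datetime(2020, 1, 10)

# Render each day the way the matcher does,
# e.g. 2020-01-02T00:00:00 => 2020-01-02
day = start
while day <= end:
    print(day.isoformat(), "=>", day.strftime(pattern))
    day += timedelta(days=1)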