Testing→
There are two mechanisms to test transforms:
- enrichpkg - Enrich's own unit testing mechanism
- pytest - Python standard testing mechanism
pytest is used for transform testing, and enrichpkg for everything else; our objective is to move everything to pytest.
Transform Testing→
enrichpkg bootstraps each transform with skeleton code and fixtures, shown below. Note that this happens only when the transform is bootstrapped using the 'transform-package' option, which constructs each transform as a full Python package. The skeleton simulates pipeline execution with paths overridden to point at test directories. At the end of the transform's execution, the state object is also dumped. All output is cleaned up by default; if you need to study it, comment out the line that removes the temporary output.
The fixtures include (a) test configuration files and (b) datasets:
fixtures/
fixtures/configs
fixtures/configs/1.json
fixtures/configs/2.json
fixtures/data
fixtures/data/CarModel
fixtures/data/CarModel/cars.csv
fixtures/data/CarModel/sales.csv
The test configuration file specifies how the pipeline state should be loaded before executing the transform. It lists one or more frames to be loaded. Each frame is one of: (a) pandas - a table loaded as a pandas dataframe (b) dict - a JSON file.
The params element provides additional information needed to load the files; for pandas frames, it specifies pandas read_csv parameters. The filename is a template resolved against the test context (for example, %(inputdir)s and %(transform)s):
$ cat fixtures/configs/1.json
{
    "conf": {
        "enable": true
    },
    "data": {
        "cars": {
            "frametype": "pandas",
            "transform": "CarModel",
            "filename": "%(inputdir)s/%(transform)s/cars.csv",
            "params": {
                "sep": ","
            }
        }
    }
}
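The loading logic described above can be sketched as follows. This is a minimal illustration, not the actual enrichpkg implementation; the `load_frames` helper and any config keys beyond those in the example are assumptions:

```python
import json

import pandas as pd


def load_frames(config_path, inputdir):
    """Load the frames listed in a test configuration file."""
    with open(config_path) as fd:
        config = json.load(fd)

    frames = {}
    for name, spec in config.get("data", {}).items():
        # Resolve the filename template against the test context
        filename = spec["filename"] % {
            "inputdir": inputdir,
            "transform": spec["transform"],
        }
        if spec["frametype"] == "pandas":
            # params are passed straight through to pandas.read_csv
            frames[name] = pd.read_csv(filename, **spec.get("params", {}))
        elif spec["frametype"] == "dict":
            with open(filename) as f:
                frames[name] = json.load(f)
    return frames
```

The `%(...)s` placeholders are standard Python %-style string interpolation, which is why the test harness can redirect `inputdir` to the fixtures directory without touching the config files.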
import os
import shutil
import tempfile

import pytest

@pytest.mark.parametrize('testdata',
                         [
                             'fixtures/configs/1.json',
                             'fixtures/configs/2.json'
                         ],
                         indirect=True)
def test_transform(testdata):
    """
    Testdata fixtures
    """
    cls = testtransform.provider
    pipeline = MockPipeline()

    # Where is the dataframe input and output?
    root = tempfile.mkdtemp(prefix="enrich")
    testdata.update({
        # This is the input data to load
        'inputdir': os.path.join(os.path.dirname(__file__),
                                 "fixtures", "data"),
        # This is where the output is stored...
        'outputdir': os.path.join(root, 'output'),
        'statedir': os.path.join(root, 'state'),
    })

    # Now run the pipeline
    pipeline.execute(cls, testdata, save_state=True)

    # Check the output
    walk(root)

    # Cleanup
    shutil.rmtree(root)
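The skeleton's `walk` helper prints the output tree so you can inspect what the transform produced. Its exact implementation is not shown here; a minimal version producing an indented listing like the sample output further below might look like:

```python
import os


def walk(root):
    """Print the directory tree rooted at `root`, indented by depth."""
    print("Output:")
    for dirpath, dirnames, filenames in os.walk(root):
        # Depth relative to root decides the indentation level
        depth = dirpath[len(root):].count(os.sep)
        print("---" * (depth + 2), os.path.basename(dirpath))
        for name in sorted(filenames):
            print("---" * (depth + 3), name)
```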
A simple run looks like this:
$ cd .../transforms
$ pytest -vv transforms/testtransform/
============================================================= test session starts =============================================================
platform linux -- Python 3.5.2, pytest-3.2.1, py-1.5.3, pluggy-0.4.0 -- ...virtualenvs/dev/bin/python3
cachedir: transforms/testtransform/.cache
spark version -- Spark 2.2.0 built for Hadoop 2.7.3 | Build flags: -Phadoop-2.7 -Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos -DzincPort=3036
rootdir: ...transforms/testtransform, inifile:
plugins: spark-0.4.5, cov-2.5.1
collected 3 items
transforms/testtransform/tests/test_module.py::test_configuration PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/1.json] PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/2.json] PASSED
You can see the output structure in pytest's no-capture mode (-s):
$ pytest -s -vv testtransform/
============================================================= test session starts =============================================================
platform linux -- Python 3.5.2, pytest-3.2.1, py-1.5.3, pluggy-0.4.0 -- /home/pingali/.virtualenvs/dev/bin/python3
cachedir: transforms/testtransform/.cache
spark version -- Spark 2.2.0 built for Hadoop 2.7.3 | Build flags: -Phadoop-2.7 -Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos -DzincPort=3036
rootdir: /work/pingali/Code/enrich-scribble/Contrib/transforms/testtransform, inifile:
plugins: spark-0.4.5, cov-2.5.1
collected 3 items
transforms/testtransform/tests/test_module.py::test_configuration PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/1.json]
Output:
------ enrichpc_ths12
--------- output
------------ run-2015021
--------------- outputs
------------------ cars.csv
------------------ cars.sqlite
--------- state
------------ cars.csv
PASSED
transforms/testtransform/tests/test_module.py::test_transform[fixtures/configs/2.json]
Output:
------ enrichzvx5ukre
--------- output
------------ run-2015021
--------------- outputs
------------------ cars.csv
------------------ cars.sqlite
--------- state
------------ cars.csv
PASSED
Test Data Setup→
The configuration of datasets can be done standalone in a `datasets.py` in the usecase directory, or within the transform as part of the testdata section. The Dataset Management module has been created to add capabilities over time.
Example standalone specification:
$ cat datasets.py
...
from acmeapp.datasets import get_dataset_config
datasets = get_dataset_config(source='server-name', sourcetype='s3')
Example in the testdata section of a transform:
{
    "command": "aws s3 sync s3://%(backuppath)s %(targetpath)s",
    "params": {
        "enrich_data_dir": "/home/ubuntu/enrich/data",
        "backup_root": "some-s3-path",
        "node": "some hostname"
    },
    "available": [
        # Complete list...
        Dataset({
            "name": "inventory_dataset",
            ...
        }),
        ...
    ],
    "used": [
        # Used by this transform
        "inventory_dataset"
    ]
}
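The `command` entry above is a %-style template that is expanded against resolved parameters before being run. The values below are made up for illustration; the actual substitution keys come from the dataset's configuration:

```python
# Hypothetical resolved values for one day of one dataset
params = {
    "backuppath": "some-s3-path/datasets/inventory/2020-01-02",
    "targetpath": "/home/ubuntu/enrich/data/_test/inventory/2020-01-02",
}

command = "aws s3 sync s3://%(backuppath)s %(targetpath)s"
print(command % params)
# aws s3 sync s3://some-s3-path/datasets/inventory/2020-01-02 /home/ubuntu/enrich/data/_test/inventory/2020-01-02
```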
Once the datasets are configured, you can take a number of actions:
$ enrichpkg test-data ./datasets.py --help
Usage: enrichpkg test-data [OPTIONS] SPEC COMMAND [ARGS]...
Manage test data.
spec could be transform or a spec file
Options:
--capture / --no-capture Capture output
--help Show this message and exit.
Commands:
download Download dataset for a specified range
list List available test datasets
show Show the details of a given dataset
- List the available datasets to download:
$ enrichpkg test-data ./datasets.py list
[0] SensorEvents
[1] Orders
[2] Customer
[3] Persona
[5] Adjustments
[6] Taxonomy
- Look at the details of a given dataset:
$ enrichpkg test-data ./datasets.py show orderhistory
✓ Found dataset - orderhistory
------
Description: Per-day history of orders along with events
------
[test] ...enrich/data/_test/shared/datasets/orderhistory/v3
[remote local] /home/ubuntu/enrich/data/acme/Marketing/shared/datasets/orderhistory/v3
[host local] ...enrich/data/acme/PLP/shared/datasets/orderhistory/v3
[match] Pattern: %Y-%m-%d
    : 2020-04-06T00:00:00 => 2020-04-06
    : 2020-04-07T00:00:00 => 2020-04-07
    : 2020-04-08T00:00:00 => 2020-04-08
    : 2020-04-09T00:00:00 => 2020-04-09
    : 2020-04-10T00:00:00 => 2020-04-10
    : 2020-04-11T00:00:00 => 2020-04-11
    : 2020-04-12T00:00:00 => 2020-04-12
    : 2020-04-13T00:00:00 => 2020-04-13
- Download any dataset for a date range into a specified local directory:
$ enrichpkg test-data ./datasets.py download alpha 2020-01-02 2020-01-10 --target local
✓ Found dataset - alpha
------
Description: Per-day history of alpha along with events
------
[test] ...enrich/data/_test/shared/datasets/orderhistory/v3
[remote local] /home/ubuntu/enrich/data/acme/Marketing/shared/datasets/orderhistory/v3
[host local] ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3
[backup] enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3
[match] Pattern: %Y-%m-%d
[commands] Please run these after checking
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-02 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-02
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-03 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-03
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-04 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-04
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-05 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-05
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-06 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-06
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-07 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-07
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-08 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-08
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-09 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-09
aws s3 sync s3://enrich-acme/backup/aip.acme.com/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-10 ...enrich/data/acme/Marketing/shared/datasets/orderhistory/v3/2020-01-10