
Security

Scribble is committed to the security of client data. This is achieved through a combination of industry best practices followed by Scribble and the Enrich platform, and is also governed by the security policies of the client. Scribble operates transparently and in collaboration with clients to achieve shared security and privacy goals.

Deployment

A few observations about the service:

  1. Private deployment. The Enrich data preparation platform operates behind the customer's firewall on machines that are not accessible from the public internet.
  2. Restricted access. It is not a general-purpose system accessible to end-customers or to all employees of the client organization. The client determines who has access to data, how, and why. Typically access is restricted to ML engineers.
  3. The service has a number of security-enhancing features (see the API-key sketch after this list): (a) all access uses SSL/HTTPS; (b) all access requires login, except API access via an API key; (c) all credentials are obfuscated or encrypted as needed on the server; (d) most user activity is logged; (e) downloads of data are encrypted by default, and downloads can be disabled; (f) upgrades are allowed only for users with 'staff' permission, and typically only one person at the client has it; (g) security-enhancing data prep modules, such as libraries and transforms, are provided; (h) the metadata and admin interfaces have elements for handling sensitive data.
  4. Enrich has a compliance 'app' to help with compliance tasks outside Enrich itself.
  5. Scribble is often tasked with generating the datasets as a service. The server is also upgraded regularly with new features, given the fast-moving space. As a result, Scribble usually has access to the server running Enrich. As a security-conscious company, we take a number of precautions to ensure security and compliance:

    1. Only one staff member has access to the SSH keys, usually the CEO. Keys and other configuration files are automatically encrypted using PGP and stored in GitHub. They are unlocked only when machine access is required.
    2. Staff laptops use encrypted filesystems.
    3. Any access follows a protocol developed in conjunction with the client, including the use of jump servers and access over VPN.
    4. Data is often required for local testing of the data prep pipelines. A cap is placed on the data transferred (say, 0.01% of volume) and/or the transfer is coordinated with the client. In most cases we request anonymized data, since most data prep doesn't require PII.
    5. The client, if required, can perform a full code review of the Enrich server at any point in time.
    6. All client-specific data preparation modules are completely separated from the Enrich codebase and stored only in the client's GitHub repository. None of the Enrich modules are client-specific.
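
To make item 3(b) above concrete, here is a minimal sketch of an API call over HTTPS using an API key. The endpoint URL, header name, and environment variable are illustrative assumptions, not the documented Enrich API.

```python
# Hedged sketch of API access per item 3(b): HTTPS plus an API key.
# The URL, header name, and environment variable below are illustrative
# assumptions, not the documented Enrich API surface.
import os
import requests

API_KEY = os.environ["ENRICH_API_KEY"]  # hypothetical variable; never hardcode keys

resp = requests.get(
    "https://enrich.example.internal/api/v1/health",  # hypothetical endpoint
    headers={"X-API-Key": API_KEY},                   # assumed header name
    timeout=10,  # TLS verification stays on by default; all access is SSL/https
)
resp.raise_for_status()
print(resp.status_code)
```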

Depending on the client, the details of compliance vary. A detailed compliance checklist can be found on the resources page of the server.

Sudo Access

| Service | Explanation | Specifics |
| --- | --- | --- |
| OS packages | OS/library dependencies that need to be updated from time to time | Run apt-get, apt-update, apt-search, dpkg |
| Nginx | Reverse proxy to the dashboard, metadata, and other services | Start/stop/reload nginx; write access to /etc/nginx |
| Supervisor | Various services (~10), including LLM microservices and the dashboard | Start/stop/reload supervisord; write permissions to /var/run/supervisor.sock and /etc/supervisor |
| Redis | Store state, background tasks | Start/stop redis |
| Cron | Schedule workflows using crontab | Write to /var/spool/cron/crontabs/<username> |

$ sudo -l

(ALL) NOPASSWD: /usr/bin/systemctl start nginx
(ALL) NOPASSWD: /usr/bin/systemctl stop nginx
(ALL) NOPASSWD: /usr/bin/systemctl restart nginx
(ALL) NOPASSWD: /usr/bin/systemctl start supervisor
(ALL) NOPASSWD: /usr/bin/systemctl stop supervisor
(ALL) NOPASSWD: /usr/bin/systemctl restart supervisor
(ALL) NOPASSWD: /usr/bin/systemctl start redis
(ALL) NOPASSWD: /usr/bin/systemctl stop redis
(ALL) NOPASSWD: /usr/bin/systemctl restart redis
(ALL) NOPASSWD: /usr/sbin/service supervisor *
(ALL) NOPASSWD: /usr/sbin/service nginx *
(ALL) NOPASSWD: /usr/sbin/service redis *
(ALL) NOPASSWD: /usr/bin/apt-get *
(ALL) NOPASSWD: /usr/bin/dpkg *
(ALL) NOPASSWD: /usr/bin/vi /var/spool/cron/crontabs/<username>
(ALL) NOPASSWD: /usr/bin/crontab -u <username> -e
(ALL) NOPASSWD: /usr/bin/vi /etc/nginx/*
(ALL) NOPASSWD: /usr/bin/vi /etc/supervisor/*
(ALL) NOPASSWD: /usr/bin/cat /var/log/syslog*

# supervisorctl needs read/write access to the socket to work
$ sudo chmod o+rw /var/run/supervisor.sock
$ ls -l /var/run/supervisor.sock
srwxrw-rw- 1 root root 0 Aug  7 17:14 /var/run/supervisor.sock

# Make the global Python libraries world-readable
$ sudo chmod -R o+r /usr/lib/python3.9

Credentials

The nature of the credentials required:

| Name | Explanation | Details |
| --- | --- | --- |
| Storage | Ingest data/backup | Read/write credentials to SharePoint/Azure Blob/S3, based on use case |
| DB | Ingest data/write output | Read/write credentials, based on use case |
| API | Ingest data/write output | Read/write credentials, based on use case |
| SMTP | Notifications | Server, username/password, port, SSL |
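
As a hedged illustration of how such credentials can be kept out of displays and logs (per the obfuscation requirement above), the sketch below loads a hypothetical JSON credentials file and masks secret values before printing. The filename and schema are assumptions, not the Enrich credential store format.

```python
# Hedged sketch: load credentials from a hypothetical JSON file and mask
# secrets before they are printed or logged. The filename and schema are
# assumptions for illustration only.
import json

def mask(secret: str, visible: int = 4) -> str:
    """Obfuscate all but the last few characters of a secret."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]

with open("sitecred.json") as f:  # assumed filename
    creds = json.load(f)

# e.g. creds = {"s3": {"access_key": "AKIA...", "secret_key": "..."}}
for name, entry in creds.items():
    print(name, {k: mask(v) for k, v in entry.items()})
```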

Network Access

| Service | Explanation | Specifics |
| --- | --- | --- |
| VPN | Access to the privately deployed server | OpenVPN/other |
| GUI | Scribble's GUI | Port 443, inbound |
| SSH | Maintenance/upgrades | Port 22, inbound, VPN only |
| PyPI | Python package service | pypi.org, could be through a proxy |
| GitHub | Some custom Python packages not on PyPI; self-serve upgrades | github.com, could be through a proxy |
| Ubuntu | OS package distribution service | *.ubuntu.com |

Log Distribution

| Nature | Location | Notes |
| --- | --- | --- |
| System log | enrich/logs/app/app.json.log | JSON format, rotated regularly, most detailed |
| Pipeline & tasks | enrich/data/<owner>/<organization>/outputs/<name>/<run-id>/log.json | Pipeline- and task-specific log for each run, in JSON format |
| Taskblade | enrich/llm-agents/logs/<name>/app.json.log | Microservice request-level log in JSON format |
| Workflows | enrich/logs/workflows/*.log | One for each workflow (scheduled via cron) |
| Doodle | enrich/logs/doodle/doodle.log | Metadata server log |
| Dashboard | enrich/logs/gunicorn/gunicorn_supervisor.log | Minimal log; useful mainly for exceptions |
| Third-party | enrich/logs/netdata, enrich/logs/jupyterlab | Optional third-party services that are deployed |

System

This is the most important log, capturing the health of the system.

| Column | Details |
| --- | --- |
| logid | Unique ID for this log entry |
| message | Main text |
| levelname | Log level (DEBUG/ERROR, etc.) |
| name | Name of the logger ('app' is the default) |
| asctime | Timestamp |
| funcName | Method from which the log entry was generated |
| lineno | Line number of the call |
| pathname | Filename of the method |
| module | Python module (derived from the pathname) |
| created | Timestamp in seconds since the epoch (1970-01-01) |
| data | Extra application context to help with debugging |
| exc_info | Stack trace if there is an exception |
{"message": "Task result", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-30 00:13:45,400", "funcName": "txnsearch_result", "lineno": 407, "pathname": "/home/scribble/enrich/customers/acme/Compliance/dashboard/compapp/views.py", "module": "views", "created": 1690656225.4009845, "data": "\ne3be563f-57fa-4240-924b-27b43f2e0e0b -> PENDING [{}]", "transform": "", "logid": "1690654664.21"}
{"message": "Task result", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-30 00:13:48,102", "funcName": "txnsearch_result", "lineno": 407, "pathname": "/home/scribble/enrich/customers/acme/Compliance/dashboard/compapp/views.py", "module": "views", "created": 1690656228.1029518, "data": "\nf5c8794f-c67e-4cf9-8501-13199b9ffc43 -> SUCCESS [{'start_date': '2023-06-01', 'end_date': '2023-07-01', 'referrer': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+KYCs&table=Search&query=ALPHA1234', 'source': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+KYCs&table=Search&query=ALPHA1234', 'kyc_txn_ids': ['ALPHA1234'], 'name': 'txnsearch-Customer KYCs-Search-ALPHA1234-2023-07-30'}]", "transform": "", "logid": "1690654664.22"}
{"message": "Task result", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-30 00:14:01,821", "funcName": "txnsearch_result", "lineno": 407, "pathname": "/home/scribble/enrich/customers/acme/Compliance/dashboard/compapp/views.py", "module": "views", "created": 1690656241.8213236, "data": "\ne3be563f-57fa-4240-924b-27b43f2e0e0b -> SUCCESS [{'start_date': '2023-06-01', 'end_date': '2023-07-01', 'referrer': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+KYCs&table=Search&query=BETA1234', 'source': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+KYCs&table=Search&query=BETA1234', 'kyc_txn_ids': ['BETA1234'], 'name': 'txnsearch-Customer KYCs-Search-BETA1234-2023-07-30'}]", "transform": "", "logid": "1690654664.27"}
{"message": "Task result", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-30 00:14:03,359", "funcName": "txnsearch_result", "lineno": 407, "pathname": "/home/scribble/enrich/customers/acme/Compliance/dashboard/compapp/views.py", "module": "views", "created": 1690656243.3595865, "data": "\n7e284ef1-6080-441b-87cd-bdfa97c6ac9c -> PENDING [{}]", "transform": "", "logid": "1690654664.28"}
{"message": "Task result", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-30 00:14:06,662", "funcName": "txnsearch_result", "lineno": 407, "pathname": "/home/scribble/enrich/customers/acme/Compliance/dashboard/compapp/views.py", "module": "views", "created": 1690656246.6628153, "data": "\nbd23220e-350b-49f4-97f2-5334d0f1a1d5 -> SUCCESS [{'start_date': '2023-06-01', 'end_date': '2023-07-01', 'referrer': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+KYCs&table=Search&query=GAMMA1234', 'source': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+KYCs&table=Search&query=GAMMA1234', 'kyc_txn_ids': ['GAMMA1234'], 'name': 'txnsearch-Customer KYCs-Search-GAMMA1234-2023-07-30'}]", "transform": "", "logid": "1690654664.29"}

Pipelines & Tasks

The pipelines also use the 'app' logger configuration. In addition to the columns mentioned above, there are columns that help in understanding the performance and correctness of the computation.

| Column | Details |
| --- | --- |
| customer/usecase | Use case group to which this pipeline belongs |
| transform | Transformation module that is the source of this log entry |
| conf | Pipeline filename |
| runid | Run to which this log entry belongs |
| application | Name of this pipeline/task |
{"asctime": "2023-03-09T12:04:48+0530", "name": "app", "levelname": "DEBUG", "message": "FileOperations - process", "transform": "FileOperations", "runid": "daily-20230309-120433", "conf": "test.py", "usecase": "Contrib", "customer": "Contrib", "application": "TestPy", "ts": "2023-03-09T12:04:48+05:30", "data": "", "logid": 125}
{"asctime": "2023-03-09T12:04:48+0530", "name": "app", "levelname": "DEBUG", "message": "FileOperations - Completed", "transform": "FileOperations", "data": "\nCopy: scribble/Contrib/output/TestPy/daily-20230309-120433/cars1.csv => scribble/Contrib/shared/cars/2022-11-02/cars.csv\nCopy: scribble/Contrib/output/TestPy/daily-20230309-120433/hello.json => scribble/Contrib/shared/cars/2022-11-02/hello.json\nCopy: scribble/Contrib/output/TestPy/daily-20230309-120433/searchmeta.json => scribble/Contrib/shared/cars/2022-11-02/searchmeta.json\nCopy: scribble/Contrib/output/TestPy/daily-20230309-120433/viz/cars1.pickle => scribble/Contrib/shared/cars/2022-11-02/cars.pickle\nCopy: scribble/Contrib/output/TestPy/daily-20230309-120433/viz/cars1.pickle => scribble/Contrib/shared/cars/cars.pickle\n", "runid": "daily-20230309-120433", "conf": "test.py", "usecase": "Contrib", "customer": "Contrib", "application": "TestPy", "ts": "2023-03-09T12:04:48+05:30", "logid": 126}
{"asctime": "2023-03-09T12:04:48+0530", "name": "app", "levelname": "DEBUG", "message": "Process completed", "transform": "FileOperations", "runid": "daily-20230309-120433", "conf": "test.py", "usecase": "Contrib", "customer": "Contrib", "application": "TestPy", "ts": "2023-03-09T12:04:48+05:30", "data": "", "logid": 127}
{"asctime": "2023-03-09T12:04:48+0530", "name": "app", "levelname": "DEBUG", "message": "Validated results", "transform": "FileOperations", "runid": "daily-20230309-120433", "conf": "test.py", "usecase": "Contrib", "customer": "Contrib", "application": "TestPy", "ts": "2023-03-09T12:04:48+05:30", "data": "", "logid": 128}

TaskBlade Audit

The taskblades also use the 'app' logger configuration. In addition to the columns mentioned above, there are columns that allow us to trace a request through the entire system.

| Column | Details |
| --- | --- |
| source | Source of the log entry; could be a pipeline or a (micro)service |
| request_id | UUID for each customer request |
| dataset | Dataset being queried |
| user | User who triggered the query |
{"message": "[datagpt] Returning existing instance", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-28 16:54:53,016", "funcName": "get_agent_details", "lineno": 351, "pathname": "/home/pingali/Code/scribble-llmsdk/llmsdk/services/lib.py", "module": "lib", "created": 1690543493.016718, "source": "service", "user": "venkata", "dataset": "acme-retail"}
{"message": "Query Status: pending", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-28 16:54:53,351", "funcName": "qna_status", "lineno": 301, "pathname": "/home/pingali/Code/scribble-llmsdk/llmsdk/services/datagpt.py", "module": "datagpt", "created": 1690543493.3518183, "source": "service", "request_id": "7386fcfe-273e-4a1d-80e8-8b1848114362", "dataset": "acme-retail", "user": "venkata", "data": "{\n    \"query\": \"how many rows are there?\",\n    \"status\": \"pending\",\n    \"user\": \"venkata\",\n    \"dataset\": \"acme-retail\",\n    \"params\": {\n        \"user\": \"venkata\",\n        \"dataset\": \"acme-retail\",\n        \"context\": \"\",\n        \"namespace\": \"datagpt\",\n        \"query\": \"how many rows are there?\",\n        \"policy\": {\n            \"schema\": \"v1\",\n            \"policies\": [],\n            \"runtime\": {\n                \"clear_agent_memory\": false\n            }\n        },\n        \"mode\": \"economy\"\n    }\n}"}
{"message": "Updated Result: success", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-28 16:54:58,399", "funcName": "query_update_result", "lineno": 724, "pathname": "/home/pingali/Code/scribble-llmsdk/llmsdk/services/lib.py", "module": "lib", "created": 1690543498.3995056, "source": "service", "request_id": "7386fcfe-273e-4a1d-80e8-8b1848114362", "user": "venkata", "dataset": "acme-retail", "data": "{\n    \"status\": \"success\",\n    \"result\": {\n        \"intermediate_steps\": [\n            [\n                \"AgentAction(tool='sql_db_list_tables', tool_input='', log='Action: sql_db_list_tables\\\\nAction Input: ')\",\n                \"sales\"\n            ],\n            [\n                \"AgentAction(tool='sql_db_schema', tool_input='sales', log='The only table in the database is \\\"sales\\\". I should query the schema of the \\\"sales\\\" table to see the structure of the data.\\\\nAction: sql_db_schema\\\\nAction Input: sales')\",\n                \"\\nCREATE TABLE sales (\\n\\t\\\"index\\\" INTEGER, \\n\\tinvoice_no TEXT, \\n\\tstock_code TEXT, \\n\\tdescription TEXT, \\n\\tquantity INTEGER, \\n\\tunit_price REAL, \\n\\tcustomer_id INTEGER, \\n\\tcountry TEXT, \\n\\tsales REAL, \\n\\tinvoice_day INTEGER, \\n\\tinvoice_month INTEGER, \\n\\tinvoice_year INTEGER\\n)\\n\\n/*\\n3 rows from sales table:\\nindex\\tinvoice_no\\tstock_code\\tdescription\\tquantity\\tunit_price\\tcustomer_id\\tcountry\\tsales\\tinvoice_day\\tinvoice_month\\tinvoice_year\\n0\\t536365\\t85123A\\tWHITE HANGING HEART T-LIGHT HOLDER\\t6\\t2.55\\t17850\\tUnited Kingdom\\t15.299999999999999\\t1\\t12\\t2010\\n1\\t536365\\t71053\\tWHITE METAL LANTERN\\t6\\t3.39\\t17850\\tUnited Kingdom\\t20.34\\t1\\t12\\t2010\\n2\\t536365\\t84406B\\tCREAM CUPID HEARTS COAT HANGER\\t8\\t2.75\\t17850\\tUnited Kingdom\\t22.0\\t1\\t12\\t2010\\n*/\"\n            ]\n        ],\n        \"cascade\": {\n            \"id\": \"economy\",\n            \"platform\": \"openai\",\n            \"model\": \"gpt-3.5-turbo\"\n        },\n        \"success\": true,\n        \"tries\": [\n            {\n                \"seq\": 0,\n                \"cascade_id\": \"economy\",\n                \"success\": true\n            }\n        ],\n        \"query\": \"how many rows are there?\",\n        \"answer\": 3,\n        \"type\": \"json\",\n        \"raw_thoughts\": [\n            \"\",\n            \"\",\n            \"> Entering new  chain...\",\n            \"Action: sql_db_list_tables\",\n            \"Action Input: \",\n            \"Observation: sales\",\n            \"Thought:The only table in the database is \\\"sales\\\". 
I should query the schema of the \\\"sales\\\" table to see the structure of the data.\",\n            \"Action: sql_db_schema\",\n            \"Action Input: sales\",\n            \"Observation: \",\n            \"CREATE TABLE sales (\",\n            \"\\t\\\"index\\\" INTEGER, \",\n            \"\\tinvoice_no TEXT, \",\n            \"\\tstock_code TEXT, \",\n            \"\\tdescription TEXT, \",\n            \"\\tquantity INTEGER, \",\n            \"\\tunit_price REAL, \",\n            \"\\tcustomer_id INTEGER, \",\n            \"\\tcountry TEXT, \",\n            \"\\tsales REAL, \",\n            \"\\tinvoice_day INTEGER, \",\n            \"\\tinvoice_month INTEGER, \",\n            \"\\tinvoice_year INTEGER\",\n            \")\",\n            \"\",\n            \"/*\",\n            \"3 rows from sales table:\",\n            \"index\\tinvoice_no\\tstock_code\\tdescription\\tquantity\\tunit_price\\tcustomer_id\\tcountry\\tsales\\tinvoice_day\\tinvoice_month\\tinvoice_year\",\n            \"0\\t536365\\t85123A\\tWHITE HANGING HEART T-LIGHT HOLDER\\t6\\t2.55\\t17850\\tUnited Kingdom\\t15.299999999999999\\t1\\t12\\t2010\",\n            \"1\\t536365\\t71053\\tWHITE METAL LANTERN\\t6\\t3.39\\t17850\\tUnited Kingdom\\t20.34\\t1\\t12\\t2010\",\n            \"2\\t536365\\t84406B\\tCREAM CUPID HEARTS COAT HANGER\\t8\\t2.75\\t17850\\tUnited Kingdom\\t22.0\\t1\\t12\\t2010\",\n            \"*/\",\n            \"Thought:There are 3 rows in the \\\"sales\\\" table. \",\n            \"Final Answer: 3\",\n            \"\",\n            \"> Finished chain.\",\n            \"\"\n        ],\n        \"chain_of_thought\": [\n            {\n                \"thought\": \"BEGIN\",\n                \"tool\": \"sql_db_list_tables\",\n                \"tool_input\": \"\",\n                \"observation\": \"sales\"\n            },\n            {\n                \"thought\": \"The only table in the database is \\\"sales\\\". 
I should query the schema of the \\\"sales\\\" table to see the structure of the data.\",\n                \"tool\": \"sql_db_schema\",\n                \"tool_input\": \"sales\",\n                \"observation\": \"\\nCREATE TABLE sales (\\n\\t\\\"index\\\" INTEGER, \\n\\tinvoice_no TEXT, \\n\\tstock_code TEXT, \\n\\tdescription TEXT, \\n\\tquantity INTEGER, \\n\\tunit_price REAL, \\n\\tcustomer_id INTEGER, \\n\\tcountry TEXT, \\n\\tsales REAL, \\n\\tinvoice_day INTEGER, \\n\\tinvoice_month INTEGER, \\n\\tinvoice_year INTEGER\\n)\\n\\n/*\\n3 rows from sales table:\\nindex\\tinvoice_no\\tstock_code\\tdescription\\tquantity\\tunit_price\\tcustomer_id\\tcountry\\tsales\\tinvoice_day\\tinvoice_month\\tinvoice_year\\n0\\t536365\\t85123A\\tWHITE HANGING HEART T-LIGHT HOLDER\\t6\\t2.55\\t17850\\tUnited Kingdom\\t15.299999999999999\\t1\\t12\\t2010\\n1\\t536365\\t71053\\tWHITE METAL LANTERN\\t6\\t3.39\\t17850\\tUnited Kingdom\\t20.34\\t1\\t12\\t2010\\n2\\t536365\\t84406B\\tCREAM CUPID HEARTS COAT HANGER\\t8\\t2.75\\t17850\\tUnited Kingdom\\t22.0\\t1\\t12\\t2010\\n*/\"\n            }\n        ],\n        \"code\": {\n            \"dialect\": \"sql\",\n            \"snippets\": []\n        },\n        \"metadata\": {\n            \"name\": \"Acme Retail\",\n            \"description\": \"E-Commerce transaction data\",\n            \"url\": \"https://www.kaggle.com/datasets/carrie1/ecommerce-data\",\n            \"files\": [\n                {\n                    \"path\": \"sales.sqlite\",\n                    \"url\": \"sales.sqlite\",\n                    \"tables\": [\n                        {\n                            \"name\": \"sales\",\n                            \"desc\": \" Transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail\",\n                            \"context\": [\n                                \"each row in the table has sales for one product per invoice\",\n                                \"there are multiple rows for each invoice\",\n                                \"date column used month/day/year format\"\n                            ],\n                            \"cols\": [\n                                {\n                                    \"invoice_no\": \"Number of the transaction\"\n                                },\n                                {\n                                    \"stock_code\": \"SKU of the product\"\n                                },\n                                {\n                                    \"description\": \"Description of the product\"\n                                },\n                                {\n                                    \"quantity\": \"Number of units of product\"\n                                },\n                                {\n                                    \"invoice_day\": \"Day of the transaction\"\n                                },\n                                {\n                                    \"invoice_month\": \"Month of the transaction\"\n                                },\n                                {\n                                    \"invoice_year\": \"Year of the transaction\"\n                                },\n                                {\n                                    \"unit_price\": \"Price of one unit of product\"\n                                },\n                                {\n                                    \"customer_id\": \"Customer who bought the product\"\n        
                        }\n                            ]\n                        }\n                    ]\n                }\n            ]\n        }\n    }\n}"}
{"message": "Query Status: success", "levelname": "DEBUG", "name": "app", "asctime": "2023-07-28 16:55:08,690", "funcName": "qna_status", "lineno": 301, "pathname": "/home/pingali/Code/scribble-llmsdk/llmsdk/services/datagpt.py", "module": "datagpt", "created": 1690543508.6908808, "source": "service", "request_id": "7386fcfe-273e-4a1d-80e8-8b1848114362", "dataset": "acme-retail", "user": "venkata", "data": "{\n    \"query\": \"how many rows are there?\",\n    \"status\": \"success\",\n    \"user\": \"venkata\",\n    \"dataset\": \"acme-retail\",\n    \"params\": {\n        \"user\": \"venkata\",\n        \"dataset\": \"acme-retail\",\n        \"context\": \"\",\n        \"namespace\": \"datagpt\",\n        \"query\": \"how many rows are there?\",\n        \"policy\": {\n            \"schema\": \"v1\",\n            \"policies\": [],\n            \"runtime\": {\n                \"clear_agent_memory\": false\n            }\n        },\n        \"mode\": \"economy\"\n    },\n    \"result\": {\n        \"intermediate_steps\": [\n            [\n                \"AgentAction(tool='sql_db_list_tables', tool_input='', log='Action: sql_db_list_tables\\\\nAction Input: ')\",\n                \"sales\"\n            ],\n            [\n                \"AgentAction(tool='sql_db_schema', tool_input='sales', log='The only table in the database is \\\"sales\\\". I should query the schema of the \\\"sales\\\" table to see the structure of the data.\\\\nAction: sql_db_schema\\\\nAction Input: sales')\",\n                \"\\nCREATE TABLE sales (\\n\\t\\\"index\\\" INTEGER, \\n\\tinvoice_no TEXT, \\n\\tstock_code TEXT, \\n\\tdescription TEXT, \\n\\tquantity INTEGER, \\n\\tunit_price REAL, \\n\\tcustomer_id INTEGER, \\n\\tcountry TEXT, \\n\\tsales REAL, \\n\\tinvoice_day INTEGER, \\n\\tinvoice_month INTEGER, \\n\\tinvoice_year INTEGER\\n)\\n\\n/*\\n3 rows from sales table:\\nindex\\tinvoice_no\\tstock_code\\tdescription\\tquantity\\tunit_price\\tcustomer_id\\tcountry\\tsales\\tinvoice_day\\tinvoice_month\\tinvoice_year\\n0\\t536365\\t85123A\\tWHITE HANGING HEART T-LIGHT HOLDER\\t6\\t2.55\\t17850\\tUnited Kingdom\\t15.299999999999999\\t1\\t12\\t2010\\n1\\t536365\\t71053\\tWHITE METAL LANTERN\\t6\\t3.39\\t17850\\tUnited Kingdom\\t20.34\\t1\\t12\\t2010\\n2\\t536365\\t84406B\\tCREAM CUPID HEARTS COAT HANGER\\t8\\t2.75\\t17850\\tUnited Kingdom\\t22.0\\t1\\t12\\t2010\\n*/\"\n            ]\n        ],\n        \"cascade\": {\n            \"id\": \"economy\",\n            \"platform\": \"openai\",\n            \"model\": \"gpt-3.5-turbo\"\n        },\n        \"success\": true,\n        \"tries\": [\n            {\n                \"seq\": 0,\n                \"cascade_id\": \"economy\",\n                \"success\": true\n            }\n        ],\n        \"query\": \"how many rows are there?\",\n        \"answer\": 3,\n        \"type\": \"json\",\n        \"raw_thoughts\": [\n            \"\",\n            \"\",\n            \"> Entering new  chain...\",\n            \"Action: sql_db_list_tables\",\n            \"Action Input: \",\n            \"Observation: sales\",\n            \"Thought:The only table in the database is \\\"sales\\\". 
I should query the schema of the \\\"sales\\\" table to see the structure of the data.\",\n            \"Action: sql_db_schema\",\n            \"Action Input: sales\",\n            \"Observation: \",\n            \"CREATE TABLE sales (\",\n            \"\\t\\\"index\\\" INTEGER, \",\n            \"\\tinvoice_no TEXT, \",\n            \"\\tstock_code TEXT, \",\n            \"\\tdescription TEXT, \",\n            \"\\tquantity INTEGER, \",\n            \"\\tunit_price REAL, \",\n            \"\\tcustomer_id INTEGER, \",\n            \"\\tcountry TEXT, \",\n            \"\\tsales REAL, \",\n            \"\\tinvoice_day INTEGER, \",\n            \"\\tinvoice_month INTEGER, \",\n            \"\\tinvoice_year INTEGER\",\n            \")\",\n            \"\",\n            \"/*\",\n            \"3 rows from sales table:\",\n            \"index\\tinvoice_no\\tstock_code\\tdescription\\tquantity\\tunit_price\\tcustomer_id\\tcountry\\tsales\\tinvoice_day\\tinvoice_month\\tinvoice_year\",\n            \"0\\t536365\\t85123A\\tWHITE HANGING HEART T-LIGHT HOLDER\\t6\\t2.55\\t17850\\tUnited Kingdom\\t15.299999999999999\\t1\\t12\\t2010\",\n            \"1\\t536365\\t71053\\tWHITE METAL LANTERN\\t6\\t3.39\\t17850\\tUnited Kingdom\\t20.34\\t1\\t12\\t2010\",\n            \"2\\t536365\\t84406B\\tCREAM CUPID HEARTS COAT HANGER\\t8\\t2.75\\t17850\\tUnited Kingdom\\t22.0\\t1\\t12\\t2010\",\n            \"*/\",\n            \"Thought:There are 3 rows in the \\\"sales\\\" table. \",\n            \"Final Answer: 3\",\n            \"\",\n            \"> Finished chain.\",\n            \"\"\n        ],\n        \"chain_of_thought\": [\n            {\n                \"thought\": \"BEGIN\",\n                \"tool\": \"sql_db_list_tables\",\n                \"tool_input\": \"\",\n                \"observation\": \"sales\"\n            },\n            {\n                \"thought\": \"The only table in the database is \\\"sales\\\". 
I should query the schema of the \\\"sales\\\" table to see the structure of the data.\",\n                \"tool\": \"sql_db_schema\",\n                \"tool_input\": \"sales\",\n                \"observation\": \"\\nCREATE TABLE sales (\\n\\t\\\"index\\\" INTEGER, \\n\\tinvoice_no TEXT, \\n\\tstock_code TEXT, \\n\\tdescription TEXT, \\n\\tquantity INTEGER, \\n\\tunit_price REAL, \\n\\tcustomer_id INTEGER, \\n\\tcountry TEXT, \\n\\tsales REAL, \\n\\tinvoice_day INTEGER, \\n\\tinvoice_month INTEGER, \\n\\tinvoice_year INTEGER\\n)\\n\\n/*\\n3 rows from sales table:\\nindex\\tinvoice_no\\tstock_code\\tdescription\\tquantity\\tunit_price\\tcustomer_id\\tcountry\\tsales\\tinvoice_day\\tinvoice_month\\tinvoice_year\\n0\\t536365\\t85123A\\tWHITE HANGING HEART T-LIGHT HOLDER\\t6\\t2.55\\t17850\\tUnited Kingdom\\t15.299999999999999\\t1\\t12\\t2010\\n1\\t536365\\t71053\\tWHITE METAL LANTERN\\t6\\t3.39\\t17850\\tUnited Kingdom\\t20.34\\t1\\t12\\t2010\\n2\\t536365\\t84406B\\tCREAM CUPID HEARTS COAT HANGER\\t8\\t2.75\\t17850\\tUnited Kingdom\\t22.0\\t1\\t12\\t2010\\n*/\"\n            }\n        ],\n        \"code\": {\n            \"dialect\": \"sql\",\n            \"snippets\": []\n        },\n        \"metadata\": {\n            \"name\": \"Acme Retail\",\n            \"description\": \"E-Commerce transaction data\",\n            \"url\": \"https://www.kaggle.com/datasets/carrie1/ecommerce-data\",\n            \"files\": [\n                {\n                    \"path\": \"sales.sqlite\",\n                    \"url\": \"sales.sqlite\",\n                    \"tables\": [\n                        {\n                            \"name\": \"sales\",\n                            \"desc\": \" Transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail\",\n                            \"context\": [\n                                \"each row in the table has sales for one product per invoice\",\n                                \"there are multiple rows for each invoice\",\n                                \"date column used month/day/year format\"\n                            ],\n                            \"cols\": [\n                                {\n                                    \"invoice_no\": \"Number of the transaction\"\n                                },\n                                {\n                                    \"stock_code\": \"SKU of the product\"\n                                },\n                                {\n                                    \"description\": \"Description of the product\"\n                                },\n                                {\n                                    \"quantity\": \"Number of units of product\"\n                                },\n                                {\n                                    \"invoice_day\": \"Day of the transaction\"\n                                },\n                                {\n                                    \"invoice_month\": \"Month of the transaction\"\n                                },\n                                {\n                                    \"invoice_year\": \"Year of the transaction\"\n                                },\n                                {\n                                    \"unit_price\": \"Price of one unit of product\"\n                                },\n                                {\n                                    \"customer_id\": \"Customer who bought the product\"\n        
                        }\n                            ]\n                        }\n                    ]\n                }\n            ]\n        }\n    }\n}"}
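
Since every entry carries the request_id, a request can be traced end to end by filtering on it. A minimal sketch, using the placeholder path from the Log Distribution table and a request_id from the samples above:

```python
# Minimal sketch: trace one customer request through the taskblade log by
# filtering on request_id, assuming one JSON object per line as above.
import json

REQUEST_ID = "7386fcfe-273e-4a1d-80e8-8b1848114362"  # from the samples above

# Substitute a real service name for the placeholder.
path = "enrich/llm-agents/logs/<name>/app.json.log"

with open(path) as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("request_id") == REQUEST_ID:
            print(entry["asctime"], entry["funcName"], entry["message"])
```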

Workflow

The scheduler (cron) triggers a Prefect workflow, which in turn triggers batch jobs using a DAG embedded in the workflow. The log name is typically <workflow_name><timestamp>.log.

The typical format is

[Timestamp] Level - Method | Log-text

This is the default format from Prefect (a slightly older version).

[2022-12-08 08:12:53+0530] DEBUG - prefect.run_log_dumper | Error: b''
[2022-12-08 08:12:53+0530] DEBUG - prefect.run_log_dumper | Completed: Log resouce dumper
[2022-12-08 08:12:53+0530] DEBUG - prefect.TaskRunner | Task 'run_log_dumper': Handling state change from Running to Success
[2022-12-08 08:12:53+0530] INFO - prefect.TaskRunner | Task 'run_log_dumper': Finished task run for task with final state: 'Success'
[2022-12-08 08:12:53+0530] INFO - prefect.TaskRunner | Task 'run_routeopt': Starting task run...
[2022-12-08 08:12:53+0530] DEBUG - prefect.TaskRunner | Task 'run_routeopt': Handling state change from Pending to Running
[2022-12-08 08:12:53+0530] DEBUG - prefect.TaskRunner | Task 'run_routeopt': Calling task.run() method...
[2022-12-08 08:12:53+0530] DEBUG - prefect.run_routeopt | Starting: route optimizer
[2022-12-08 08:13:19+0530] DEBUG - prefect.run_routeopt | Output: b'Successfully read secure siteconf\nOverriding credentials from sitecred\nSuccessfully read secure siteconf\n....warnings /app/scribble/enrich/.virtualenvs/scribble3.8/lib/python3.8/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n  from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,\n\nEXIT CODE: 0\n'
[2022-12-08 08:13:19+0530] DEBUG - prefect.run_routeopt | Error: b''
[2022-12-08 08:13:19+0530] DEBUG - prefect.run_routeopt | Completed: route optimizer
[2022-12-08 08:13:19+0530] DEBUG - prefect.TaskRunner | Task 'run_routeopt': Handling state change from Running to Success
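
A minimal sketch for parsing these lines with a regular expression; the pattern is inferred from the samples above, not taken from Prefect documentation.

```python
# Minimal sketch: parse the "[Timestamp] Level - Method | Log-text" lines
# shown above. The pattern is inferred from the samples.
import re

PATTERN = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+)\s+-\s+(?P<method>[^|]+?)\s+\|\s+(?P<text>.*)"
)

line = "[2022-12-08 08:13:19+0530] DEBUG - prefect.run_routeopt | Completed: route optimizer"
m = PATTERN.match(line)
if m:
    print(m.group("ts"), m.group("level"), m.group("method"), "->", m.group("text"))
```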

Background Tasks

We use Celery to run background tasks. The log has the format:

[Timestamp: Level/ProcessName] TaskName [TaskID]:Module:FunctionName: Message

This is used for heavier tasks, such as long-running queries triggered from the GUI. Most of that work is usually done by the pipelines, but sometimes this capability is needed.

[2023-07-06 11:21:29,357: INFO/MainProcess] Task search_timeline[9080c216-f7ac-4672-8678-fd3d8f598bcf] received
[2023-07-06 11:21:29,357: DEBUG/MainProcess] TaskPool: Apply <function fast_trace_task at 0x7f55e1ada820> (args:('search_timeline', '9080c216-f7ac-4672-8678-fd3d8f598bcf', {'lang': 'py', 'task': 'search_timeline', 'id': '9080c216-f7ac-4672-8678-fd3d8f598bcf', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '9080c216-f7ac-4672-8678-fd3d8f598bcf', 'parent_id': None, 'argsrepr': "({'txnids': ['TPWU000058115004'], 'nontxnids': [], 'start_date': '2023-06-30', 'end_date': '2023-07-03', 'limit': 10000, 'goal': 'Select', 'importance': 'Select'},)", 'kwargsrepr': '{}', 'origin': 'gen18892@aip.acmeinc.com', 'ignore_result': False, 'sentry-trace': '81feea2325324df384a04b81339c208c-bf79307d4d2d2c39-0', 'baggage': 'sentry-trace_id=81feea2325324df384a04b81339c208c,sentry-environment=production,sentry-public_key=e085c5e165b6441c92f29beacd0be47e,sentry-transaction=/dashboard/usecases/Operations/applications/timeline/%5B/%5D,sentry-sample_rate=0.0', 'headers': {'sentry-trace': '81feea2325324df384a04b81339c208c-bf79307d4d2d2c39-0', 'baggage':... kwargs:{})
[2023-07-06 11:29:32,723: INFO/ForkPoolWorker-4] Task search_timeline[9080c216-f7ac-4672-8678-fd3d8f598bcf] succeeded in 483.36513194441795s: {'log': ['[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/eig_in_classifier/v1/2023/06/30/transactions_TP_4.csv.gz', '[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/eig_out_classifier/v1/2023/06/30/transactions_TP_4.csv.gz', '[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/terra_core_classifier/v1/2023/06/30/transactions_TP_4.csv.gz', '[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/eig_in_classifier/v1/2023/07/01/transactions_TP_4.csv.gz', '[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/eig_out_classifier/v1/2023/07/01/transactions_TP_4.csv.gz', '[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/terra_core_classifier/v1/2023/07/01/transactions_TP_4.csv.gz', '[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/eig_in_classifier/v1/2023/07/02/transactions_TP_4.csv.gz', '[OK] enrich-acme/preprocessed/aip.acmeinc.com/logs/eig_out_classifier/v1/2023/07/02/transactions_TP_4.csv.gz', '[OK]...', '[OK]...', '[OK]...', '[OK]...'], 'files': 12, 'processed': 21917429, 'matched': , ...}
[2023-07-06 12:07:36,774: INFO/MainProcess] Task search_transactions[320bbda6-e758-49fc-8a26-e8dc29f88f75] received
[2023-07-06 12:07:36,774: DEBUG/MainProcess] TaskPool: Apply <function fast_trace_task at 0x7f55e1ada820> (args:('search_transactions', '320bbda6-e758-49fc-8a26-e8dc29f88f75', {'lang': 'py', 'task': 'search_transactions', 'id': '320bbda6-e758-49fc-8a26-e8dc29f88f75', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '320bbda6-e758-49fc-8a26-e8dc29f88f75', 'parent_id': None, 'argsrepr': "({'start_date': '2023-04-06', 'end_date': '2023-07-06', 'referrer': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+ACCs&table=Search&query=GAMMA1234', 'source': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+ACCs&table=Search&query=GAMMA1234', 'acc_txn_ids': ['GAMMA1234'], 'name': 'txnsearch-Customer ACCs-Search-GAMMA1234-2023-07-06'},)", 'kwargsrepr': '{}', 'origin': 'gen18892@aip.acmeinc.com', 'ignore_result': False, 'sentry-trace': 'ad1f51cac0c84c9ab2a00f5e0bced5b6-82c797b46c157901-0', 'baggage':... kwargs:{})
[2023-07-06 12:07:36,776: INFO/ForkPoolWorker-1] search_transactions[320bbda6-e758-49fc-8a26-e8dc29f88f75]: [0] Chunk running SQL: SELECT *
            FROM transactions
            where ((DATE(modified_on) >= '2023-04-06') AND
                        (DATE(modified_on) <= '2023-07-06') AND
                        ((sender_id_no IN ('GAMMA1234')) OR
                        (receiver_id_no IN ('GAMMA1234'))))

[2023-07-06 12:07:36,826: INFO/ForkPoolWorker-1] search_transactions[320bbda6-e758-49fc-8a26-e8dc29f88f75]: [0] CHunk Received 3 records
[2023-07-06 12:07:36,834: INFO/ForkPoolWorker-1] Task search_transactions[320bbda6-e758-49fc-8a26-e8dc29f88f75] succeeded in 0.05865555256605148s: {'records': [['hub_transaction_id', 'qrn', 'mod_id', 'response_code', 'response_message', 'status', 'transaction_date_time_local', 'transaction_date_time_global', 'execution_time', 'source_country_code', 'destination_country_code', 'source_currency_code', 'destination_currency_code', 'quote_time', 'source_sink_type', 'destination_sink_type', 'sender_service_charge', 'receiver_service_charge', 'hub_sender_srvc_chrg', 'hub_receiver_srvc_chrg', 'total_service_charge', 'sender_tax_amount', 'hub_tax_amount', 'reciever_tax_amount', 'exchange_rate', 'total_amount_source', 'total_amount_destination', 'qoute_expire_on', 'transaction_time', 'source_partner_id', 'dest_partner_id', 'quote_response_code', 'quote_response_message', 'sending_party_mob', 'sender_acc_id_type', 'receiving_party_mob', 'receiver_name', 'sender_acc_id_no', 'receiver_message', 'custom_field1', 'custom_field2', 'custom_field3', 'receiver_acc_id_type', 'receiver_acc_id_no', 'credit_transaction_id', 'scnumber', 'quote_status', 'sender_name', 'src_txn_...', ...]]}
[2023-07-06 12:12:32,617: INFO/MainProcess] Task search_transactions[4cc44896-3698-4699-913b-06267909ec9f] received
[2023-07-06 12:12:32,618: DEBUG/MainProcess] TaskPool: Apply <function fast_trace_task at 0x7f55e1ada820> (args:('search_transactions', '4cc44896-3698-4699-913b-06267909ec9f', {'lang': 'py', 'task': 'search_transactions', 'id': '4cc44896-3698-4699-913b-06267909ec9f', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '4cc44896-3698-4699-913b-06267909ec9f', 'parent_id': None, 'argsrepr': "({'start_date': '2023-04-06', 'end_date': '2023-07-06', 'referrer': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+ACCs&table=Search&query=GAMMA1234', 'source': 'https://aip.acmeinc.com/dashboard/usecases/Compliance/applications/persona/details?persona=Customer+ACCs&table=Search&query=GAMMA1234', 'acc_txn_ids': ['GAMMA1234'], 'name': 'txnsearch-Customer ACCs-Search-GAMMA1234-2023-07-06'},)", 'kwargsrepr': '{}', 'origin': 'gen18892@aip.acmeinc.com', 'ignore_result': False, 'sentry-trace': '8d34fb1669954bd7ae2aece3f3e718ee-82f89c3a73728c8b-0', 'baggage':... kwargs:{})
[2023-07-06 12:12:32,620: INFO/ForkPoolWorker-7] search_transactions[4cc44896-3698-4699-913b-06267909ec9f]: [0] Chunk running SQL: SELECT *
            FROM iox_hub_transaction
            where ((DATE(modified_on) >= '2023-04-06') AND
                        (DATE(modified_on) <= '2023-07-06') AND
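
A minimal sketch for pulling task names and IDs out of this log; the regex is inferred from the "Task <name>[<id>] received/succeeded" lines in the samples above and ignores the other line shapes.

```python
# Minimal sketch: extract task names and IDs from the Celery worker log.
# Covers only the "Task <name>[<uuid>] received/succeeded" lines above.
import re

TASK_RE = re.compile(
    r"\[(?P<ts>[^\]]+): (?P<level>\w+)/(?P<proc>[\w-]+)\] "
    r"Task (?P<task>\w+)\[(?P<task_id>[0-9a-f-]+)\] (?P<event>received|succeeded)"
)

line = ("[2023-07-06 11:21:29,357: INFO/MainProcess] "
        "Task search_timeline[9080c216-f7ac-4672-8678-fd3d8f598bcf] received")
m = TASK_RE.match(line)
if m:
    print(m.group("task"), m.group("task_id"), m.group("event"))
```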

Doodle

The metadata server is called Doodle. It is a typical Django server with a REST API. It has multiple functions, including tracking the catalog, pipeline performance, and programmatic uses of computed data.

The log has the format

 [Timestamp] [ProcessID] [Level] Method URL

The UUID in the URL is usually a metadata entity ID (e.g., a table).

[2023-04-06 02:25:01 +0000] [31917] [DEBUG] POST /metadata/api/v1/features/1cae096e-1b0d-4cac-bc5c-dd2e811908ad
[2023-04-06 02:25:01 +0000] [31917] [DEBUG] POST /metadata/api/v1/features/6d7f21ec-86ef-4349-b261-3b05f244cd02
[2023-04-06 02:25:01 +0000] [31917] [DEBUG] POST /metadata/api/v1/sources/dcae5e91-3208-42d8-9778-d95ad9a5c02c
[2023-04-06 02:25:01 +0000] [31917] [DEBUG] GET /metadata/api/v1/features
[2023-04-06 02:25:01 +0000] [31917] [DEBUG] POST /metadata/api/v1/features/f4cb8c99-67a9-441e-93a5-deb1b2521c06
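
A minimal sketch that parses these access lines and extracts the metadata entity UUID from the URL; the pattern is inferred from the samples above.

```python
# Minimal sketch: pull the metadata entity UUIDs out of the Doodle log lines
# above. Format: [Timestamp] [ProcessID] [Level] Method URL.
import re

LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[(?P<pid>\d+)\] \[(?P<level>\w+)\] "
    r"(?P<method>\w+) (?P<url>\S+)"
)
UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

line = ("[2023-04-06 02:25:01 +0000] [31917] [DEBUG] "
        "POST /metadata/api/v1/features/1cae096e-1b0d-4cac-bc5c-dd2e811908ad")
m = LINE_RE.match(line)
if m:
    entity = UUID_RE.search(m.group("url"))
    print(m.group("method"), entity.group(0) if entity else "-")
```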

Dashboard Server

Most application activity is captured in the system log. The dashboard server log captures what the system log misses, such as errors at load time, unhandled exceptions, and basic accesses.

[2023-07-20 10:27:01 +0530] [2501] [DEBUG] GET /accounts/login/
[2023-07-20 10:27:20 +0530] [2501] [DEBUG] GET /accounts/login/
[2023-07-20 10:27:31 +0530] [2501] [DEBUG] GET /accounts/login/

Other Services

These services are optional. When deployed, they use their standard out-of-the-box log formats, which are service-specific.

[I 2021-04-09 11:01:43.064 ServerApp] jupyterlab | extension was successfully linked.
[I 2021-04-09 11:01:43.065 ServerApp] jupyterlab_templates | extension was successfully linked.
[W 2021-04-09 11:01:43.068 NotebookApp] Collisions detected in /home/scribble/.jupyter/jupyter_notebook_config.py and /home/scribble/.jupyter/jupyter_notebook_config.json config files. /home/scribble/.jupyter/jupyter_notebook_config.json has higher priority: {
      "NotebookApp": {
        "nbserver_extensions": "{'jupyterlab_git': True, 'jupyterlab_templates': True} ignored, using {'jupyterlab_templates.extension': True}"
      }
    }
[W 2021-04-09 11:01:43.069 NotebookApp] 'allow_remote_access' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'base_url' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'ip' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'ip' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'notebook_dir' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'password_required' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'port' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'port_retries' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.069 NotebookApp] 'password' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-04-09 11:01:43.074 ServerApp] notebook_dir is deprecated, use root_dir
[W 2021-04-09 11:01:43.074 ServerApp] No such directory: ''/home/ubuntu/enrich/opt/notebooks''
[I 2021-04-09 11:01:43.081 ServerApp] Writing notebook server cookie secret to /home/scribble/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2021-04-09 11:01:43.096 LabApp] JupyterLab extension loaded from /app/scribble/enrich/.virtualenvs/jupyterenv/lib/python3.6/site-packages/jupyterlab