Assets
The Scribble Data Platform's assets, including data and code, are kept simple, predictable, and transparent to support auditing, backup, and other operational activities.
Data
A number of applications, including Scribble's own, are deployed on the Enrich Server. The data for each of these applications is organized by "Customer", a flexible organizational entity defined by the user (e.g., a division). Each Customer can have one or more applications.
enrich
├── data
    ├── customer1
        ├── app1
        ├── app2
        ...
    ├── customer2
        ├── app1
        ├── app2
        ...
    ...
    ├── scribble
        ├── Contrib
        ├── Campaigns
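The customer/application layout described above can be walked with a few lines of Python, for example when auditing or backing up data. This is a minimal sketch: the function name `list_applications` is hypothetical, and `enrich/data` is an assumed location for the data root, so adjust it to the actual deployment.

```python
from pathlib import Path


def list_applications(data_root):
    """Map each customer directory under data_root to its application directories."""
    root = Path(data_root)
    layout = {}
    # Each immediate subdirectory is treated as a Customer (assumption).
    for customer in sorted(p for p in root.iterdir() if p.is_dir()):
        # Each subdirectory of a Customer is treated as an application (assumption).
        layout[customer.name] = sorted(
            app.name for app in customer.iterdir() if app.is_dir()
        )
    return layout


if __name__ == "__main__":
    # "enrich/data" is an assumed path, not a documented default.
    root = Path("enrich/data")
    if root.is_dir():
        for customer, apps in list_applications(root).items():
            print(customer, apps)
```

Plain files at the customer level are ignored, so the result reflects only the directory hierarchy.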
LLM datasets are organized in a similar way.
enrich
├── llm-agents
├── data
└── docs
├── datagpt
│ ├── acme-retail
│ │ ├── cleaned_data.csv
│ │ ├── data.csv
│ │ ├── metadata.json
│ │ └── sales.sqlite
Application Metadata
Each application has its own structure. The structure below is typical, but the directory is a namespace managed by the application, so it can differ between applications.
enrich
├── data
├── customer1
├── output
├── Pipeline
├── Tasks
├── Services
├── Test
├── shared
├── AML
├── aml.sqlite
...
├── ...
Each run of the pipeline or task generates a significant amount of auditing information and datasets in various forms.
├── customer1
    ├── app1
        ├── output
            ├── Pipeline
                ├── pipeline1
                    ├── run-201702910-2019271
                        ├── metadata.json
                        ├── log.json
                        ├── export.json
                        ├── outputs
                            ├── A.csv
                            ├── B.sqlite
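Because every run writes to its own directory, run artifacts can be inspected with ordinary file operations. The sketch below is illustrative only: the helper names are hypothetical, it assumes run directories start with `run-` and that `metadata.json` and `outputs` sit directly inside the run directory, and it makes no assumption about the contents of `metadata.json` beyond it being valid JSON.

```python
import json
from pathlib import Path


def list_runs(pipeline_dir):
    """Return run directories (names starting with 'run-'), sorted by name."""
    return sorted(
        p for p in Path(pipeline_dir).iterdir()
        if p.is_dir() and p.name.startswith("run-")
    )


def load_run_metadata(run_dir):
    """Parse metadata.json in a run directory; return None if it is missing."""
    path = Path(run_dir) / "metadata.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def list_outputs(run_dir):
    """Return the file names under the run's outputs directory, if present."""
    out = Path(run_dir) / "outputs"
    if not out.is_dir():
        return []
    return sorted(f.name for f in out.iterdir() if f.is_file())
```

For example, `list_runs("enrich/data/customer1/app1/output/Pipeline/pipeline1")` would return the run directories for that pipeline, assuming the data root is at `enrich/data`.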
Platform Metadata
Enrich also accesses and manipulates other data that is generated and used for operations. This includes:
- User accounts and preferences data
- Data sources and access credentials
- Platform audit logs
- Background services data from scheduler and performance monitor
Code
Application code is segregated by owner. Each customer's codebase is kept in a separate repository, potentially hosted in the customer's own code management service, such as GitLab.
enrich
├── customers
    ├── acme
        ├── Surveys
            ├── pipelines
            ├── tasks