Assets

The Scribble Data Platform's assets, including data and code, are kept simple, predictable, and transparent to enable auditability, backup, and other activities.

Data

A number of applications, including Scribble's own, are deployed on the Enrich Server. The data for each of these applications is organized by "Customer", a flexible organizational entity defined by the user (e.g., a division). Each Customer can have one or more applications.

enrich
  ├── data 
      ├── customer1
          ├── app1
          ├── app2
      ...
      ├── customer2
          ├── app1
          ├── app2
      ...
      ├── scribble
          ├── Contrib
          ├── Campaigns

LLM datasets are also organized in a similar way.

enrich
  ├── llm-agents
      ├── data
            └── docs
                ├── datagpt
                │   ├── acme-retail
                │   │   ├── cleaned_data.csv
                │   │   ├── data.csv
                │   │   ├── metadata.json
                │   │   └── sales.sqlite
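
As a rough illustration of how this layout can be used, the sketch below lists the files stored for one dataset. It assumes the directory structure shown above; the function name `dataset_files` and the parameter names `group` (for example `datagpt`) and `dataset` (for example `acme-retail`) are hypothetical labels, not platform terminology.

```python
from pathlib import Path


def dataset_files(enrich_root, group, dataset):
    """Return the names of the files stored for one LLM dataset, sorted by name.

    Assumes the llm-agents/data/docs/<group>/<dataset> layout shown above;
    all parameter names are illustrative.
    """
    dataset_dir = Path(enrich_root) / "llm-agents" / "data" / "docs" / group / dataset
    return sorted(p.name for p in dataset_dir.iterdir() if p.is_file())
```

For the example tree, `dataset_files("enrich", "datagpt", "acme-retail")` would list the CSV, JSON, and SQLite files found in that directory.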

Application Metadata

Each application has its own structure. The structure below is typical, but because it is a namespace managed by the application, it can differ between applications.

enrich
  ├── data 
      ├── customer1
          ├── output
              ├── Pipeline
              ├── Tasks
              ├── Services
              ├── Test        
          ├── shared
              ├── AML
                   ├── aml.sqlite
          ...
          ├── ...

Each run of a pipeline or task generates a significant amount of auditing information and datasets in various forms, stored in a directory for that run.

  ├── customer1
      ├── app1
          ├── output
              ├── Pipeline
                  ├── pipeline1
                      ├── run-201702910-2019271
                          ├── metadata.json
                          ├── log.json
                          ├── export.json
                          ├── outputs
                              ├── A.csv
                              ├── B.sqlite
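
The per-run directories can be inspected with a short script. This is a minimal sketch under stated assumptions: the run directories are assumed to sit under `data/<customer>/<app>/output/Pipeline/<pipeline>` and to start with `run-`, as in the example above, and `metadata.json` is assumed to contain valid JSON. The function names `list_runs` and `load_run_metadata` are hypothetical, and no particular metadata fields are assumed.

```python
import json
from pathlib import Path


def list_runs(enrich_root, customer, app, pipeline):
    """Return the run directories of one pipeline, sorted by directory name.

    Assumes runs live under data/<customer>/<app>/output/Pipeline/<pipeline>
    and are named "run-...", as in the tree above.
    """
    runs_dir = (
        Path(enrich_root) / "data" / customer / app / "output" / "Pipeline" / pipeline
    )
    return sorted(p for p in runs_dir.glob("run-*") if p.is_dir())


def load_run_metadata(run_dir):
    """Load metadata.json from a run directory as a Python object."""
    with open(Path(run_dir) / "metadata.json", encoding="utf-8") as fh:
        return json.load(fh)
```

An audit script could, for instance, loop over `list_runs("enrich", "customer1", "app1", "pipeline1")` and pass each result to `load_run_metadata` to collect the recorded details of every run.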

Platform Metadata

Enrich accesses and manipulates other data that is generated and used for operations. These include:

User accounts and preferences data
Data sources and access credentials
Platform audit logs
Background services data from scheduler and performance monitor

Code

Application code is segregated by owner. Each customer's codebase is kept in a separate repository, potentially hosted in the customer's own code management service, such as GitLab.

enrich
  ├── customers
      ├── acme
           ├── Surveys   
                ├── pipelines
                ├── tasks