Problem: When we look at any piece of data, we don't know how we arrived at it or how to reproduce it. Reproduction may be required for testing, creating multiple environments, or validating a piece of data.
Best Practice: Every data element should have associated metadata (how, when, where, etc.) that enables us to trace the data down to the last record.
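As a rough illustration, the "how, when, where" metadata can travel with each record. The sketch below is only one possible shape: the helper `with_provenance` and the field names `source`, `command`, `created_at`, and `host` are assumptions made for this example, not a prescribed schema.

```python
import socket
from datetime import datetime, timezone


def with_provenance(record, source, command):
    """Return a copy of `record` wrapped with how/when/where metadata."""
    return {
        "data": dict(record),
        "meta": {
            "source": source,    # where the input came from
            "command": command,  # how the record was produced
            "created_at": datetime.now(timezone.utc).isoformat(),  # when it was produced
            "host": socket.gethostname(),  # which machine produced it
        },
    }
```

Storing the metadata next to the data, rather than in a separate log, means it cannot be lost when the record is copied elsewhere.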
Problem: Eventually something bad happens. It may not be clear whether the problem was caused by data issues, layout, sequence of commands run, or some other algorithmic failure. One can choose to chase the bugs down and/or junk everything and start from scratch.
Best Practice: Design to create and check intermediate computation output, and start from scratch if necessary.
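One way to make intermediate output checkable is to write the result of each step to disk and reuse it on the next run. The following is a minimal sketch, assuming each step returns JSON-serializable data; the function name `run_step` and the one-file-per-step layout are choices made for this example.

```python
import json
import os


def run_step(name, func, workdir, force=False):
    """Run `func` and keep its JSON-serializable result under `workdir`.

    If a checkpoint already exists and `force` is False, the stored result
    is returned instead of recomputing. Passing force=True starts the step
    from scratch and overwrites the checkpoint.
    """
    path = os.path.join(workdir, name + ".json")
    if os.path.exists(path) and not force:
        with open(path) as fh:
            return json.load(fh)
    result = func()
    with open(path, "w") as fh:
        json.dump(result, fh)
    return result
```

Because every checkpoint is a plain file, it can be inspected by hand when something goes wrong, and deleting the working directory is enough to start over.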
Problem: Over time, the volume of data and the number and complexity of the models increase. The relative cost grows as the number of models/individual grows.
Best Practice: Best practices of software engineering have to be adopted for this purpose, including standardization, interface definitions, separation of concerns, and extensive documentation.
Problem: We have multiple runs of the pipeline. Data and code have evolved over time. We don't know how to compare the outputs of two runs.
Best Practice: Embed versioning everywhere (code, data, metadata). Each element that can change over time should have a version.
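A simple way to version data and parameters is to derive the version from the content itself. The sketch below assumes the data is JSON-serializable and that the code version (for example a commit id) is supplied by the caller; `content_version` and `run_manifest` are hypothetical names used only here.

```python
import hashlib
import json


def content_version(obj):
    """Short, deterministic version string derived from the content itself."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_manifest(code_version, data, params):
    """Collect the versions of everything that can change between runs."""
    return {
        "code_version": code_version,  # supplied by the caller, e.g. a commit id
        "data_version": content_version(data),
        "params_version": content_version(params),
    }
```

Two runs can then be compared by comparing their manifests: any field that differs points at what changed between them.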
Problem: Multiple people work on data. They overwrite each other's data. This is a variant of the evolution problem.
Best Practice: Embed a namespace to isolate datasets. Usually a combination of administrative or other attributes, such as machine, project, and organization, can help.
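For file-based datasets, the namespace can simply be part of the path. This is a minimal sketch; the helper `dataset_path` and the particular ordering of organization, project, and user are assumptions made for illustration.

```python
import os


def dataset_path(root, organization, project, user, dataset):
    """Build an isolated location for a dataset from administrative attributes."""
    parts = [organization, project, user, dataset]
    for part in parts:
        # Reject empty or path-like components so that namespaces cannot
        # collide with each other or escape `root`.
        if not part or "/" in part or "\\" in part or part in (".", ".."):
            raise ValueError("invalid namespace component: %r" % (part,))
    return os.path.join(root, *parts)
```

With this layout, two people writing a dataset with the same name end up in different directories as long as at least one attribute differs.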
Problem: Code runs well on the test dataset but generates gibberish when run on production data.
Best Practice: This is often because incoming production data violated some explicit or implicit assumption made about the data in the code, and the code wasn't designed to handle the violation. So when the code is run, the errors propagate to the output. The solution is to not trust any input given to the code, and to validate all input for integrity.
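The idea of validating every input can be sketched as a function that makes the assumptions explicit and a second function that quarantines rows violating them. The fields `user_id`, `age`, and `email` and the specific rules are invented for this example; real checks depend on the dataset.

```python
def validate_row(row):
    """Return a list of problems found in `row`; an empty list means it is usable."""
    errors = []
    if not isinstance(row.get("user_id"), int):
        errors.append("user_id must be an integer")
    age = row.get("age")
    if not isinstance(age, int) or not 0 <= age <= 130:
        errors.append("age must be an integer between 0 and 130")
    email = row.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append("email must contain '@'")
    return errors


def split_valid(rows):
    """Separate rows that pass validation from rows that must be set aside."""
    good, bad = [], []
    for row in rows:
        errors = validate_row(row)
        if errors:
            bad.append((row, errors))
        else:
            good.append(row)
    return good, bad
```

Keeping the rejected rows together with the reasons for rejection makes it possible to tell a data problem from a code problem later.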
Problem: Semantics is the 'meaning' of the data. The semantics is embedded in the code, data syntax, and in people's minds. Code changes often do not account for semantics.
Best Practice: Turn as much semantics as possible into syntax. Use well-named variables, well-documented code, metadata, and extensive validation checks.
Problem: Code often assumes the timing, scope, purpose, and person executing it.
Best Practice: Parameterize as many critical dimensions as possible, such as the scope of the data, degree of computation, and timing of the execution.
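In a command-line pipeline, this can be as simple as turning each assumption into an argument. The sketch below uses Python's standard `argparse` module; the option names `--start-date`, `--end-date`, `--sample-fraction`, and `--run-as` are hypothetical and stand in for scope, degree of computation, and the identity of the executor.

```python
import argparse


def parse_args(argv=None):
    """Expose scope, degree of computation, and executor as explicit parameters."""
    parser = argparse.ArgumentParser(description="Run one pipeline step.")
    parser.add_argument("--start-date", required=True,
                        help="first day of data to process (YYYY-MM-DD)")
    parser.add_argument("--end-date", required=True,
                        help="last day of data to process (YYYY-MM-DD)")
    parser.add_argument("--sample-fraction", type=float, default=1.0,
                        help="fraction of the data to compute on")
    parser.add_argument("--run-as", default="pipeline",
                        help="name recorded as the executor of this run")
    return parser.parse_args(argv)
```

Nothing in the code then depends on "today" or on who happens to be logged in; a past run can be repeated by passing the same arguments.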
Problem: You generate output files that overwrite previously generated ones. Often multiple files are output from a transform. When the transform breaks after overwriting the first few files, we are left with an inconsistent set.
Best Practice: Leave a success/completion indicator after completing the write, so that a set of output files is known to be either complete or not.
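A minimal sketch of such an indicator, assuming the outputs of one transform live in a single directory. The marker file name `_SUCCESS` and the helper names are conventions chosen for this example.

```python
import os

MARKER = "_SUCCESS"  # marker name is a convention chosen for this sketch


def write_outputs(outdir, files):
    """Write every file in `files` (name -> text), then leave a completion marker."""
    os.makedirs(outdir, exist_ok=True)
    marker = os.path.join(outdir, MARKER)
    # Remove any stale marker first, so that a crash in the middle of the
    # write is never mistaken for a successful run.
    if os.path.exists(marker):
        os.remove(marker)
    for name, text in files.items():
        with open(os.path.join(outdir, name), "w") as fh:
            fh.write(text)
    # The marker is written last, only after every output file is in place.
    with open(marker, "w") as fh:
        fh.write("ok\n")


def is_complete(outdir):
    """True only if the last write to `outdir` ran to completion."""
    return os.path.exists(os.path.join(outdir, MARKER))
```

Downstream steps call `is_complete` before reading and refuse to proceed when the marker is missing.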
Problem: Models and analysis are built assuming that data is accurate. The problem is that a lot of data could be useless because of the way it has been collected and handled.
Best Practice: Discipline in collection. More specifically, it includes:
- Plan for multiple instances of attributes: Records oftentimes have attributes that could have multiple instances. For example, a user can have one email address, but usually has more. In such cases, it is best to denote one primary attribute instance to store the primary attribute value, and one secondary attribute instance to store all remaining attribute values. Consider the case of a user U having email addresses E1, E2, E3, E4. The database can have a primary_email field populated with one value - E1 - for user record U and another additional_emails field populated with the remaining values - E2,E3,E4 - (comma separated) for the same user record.
- Design attributes as granular as possible: Certain record attributes can be consumed at various levels of granularity. For example, a user's address can contain a street number, street name, city, province, country, zip code, etc. It is best to break down such attributes to individual fields that hold the granular components as opposed to rolling up into a single address field.
- Restrict values to a predefined list: Certain attributes can be guaranteed to always belong to a predefined list of values. For example, user interests, hobbies, or behavioural traits. Allowing open data entry will lead to errors such as inconsistency in syntax and semantics, and bias from data entry operators down the line. A special catch-all value (OTHER, or something similar) can be used to denote values that have not been defined on the list yet. This recommendation also holds for attributes like city, country, etc.
- Minimize free-form text: Attributes are usable when they are machine interpretable. It is best to keep attributes where values can be free-form text to an absolute minimum. If free-form text attributes are planned for, it must be assumed that these fields will primarily be consumed by humans. The benefits of scale and automation that come with machine consumption of data must not be expected without considerable work in extracting machine interpretable information from such data. For example, consider an attribute lead_next_step which tracks the next steps for a marketing lead after a call from a customer care agent. Instead of allowing free-form text, there can be a predefined list of next step options from which the value must be selected.
- Document all attributes: As teams grow, educating new members on attribute history, lineage, and nuances is a real challenge. It is best to invest time and effort up front in disciplined attribute documentation to ensure that there are no single-points-of-failure as the data and team sizes grow. This documentation can be done either in the database application itself against each attribute or in separate documentation that must be maintained and kept up-to-date.
- Standardize: Non-standard columns are very hard to process, and data would have to be cleaned before it can be used. It is best if the data is entered in a standardized way. Examples include datetimes, location names, currency, and languages.
- Avoid mixed types: Columns often hold multiple bits of information. For example, the value of a product is a combination of a number (100) and units (USD). If both are combined, the column has to be stored as a string, and validation becomes more complicated because it must handle typos. If we separate the two, then we can mark the number as being numeric and possibly non-zero. The database will automatically ensure the quality of the field.
- Standardize even subjective labels: Student assessment, for example, involves using words like good, average, and learner. Interpreting these semantically heavy words at scale is possible only if there is a consensus around what these words mean and they are used consistently.
- Use hierarchical attributes where possible: Often we have to compare people on attributes such as their interests. There is unlikely to be an exact match. It will help if we are able to compute how close or far apart individuals are. For example, a person may be interested in rhinos and another person may be interested in outdoor activities. If we know that the second person is interested in jungle adventures within the outdoor activities category, we can relate the two people.
- Use different words for different meanings: Data might be unknown, empty, or invalid. For example, N/A can be used for values that are not available, "" for empty values, and NULL for invalid values.
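Several of the collection rules above can be expressed directly in the record definition. The sketch below is illustrative only: the class names, the three-value interest list, and the choice to hold additional emails as a list rather than a comma-separated string are assumptions made for this example.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from enum import Enum
from typing import List


class Interest(Enum):
    """Predefined list of values, with a catch-all for anything not yet listed."""
    OUTDOOR = "OUTDOOR"
    MUSIC = "MUSIC"
    OTHER = "OTHER"


@dataclass
class Address:
    """Granular address fields instead of one rolled-up string."""
    street_number: str
    street_name: str
    city: str
    province: str
    country: str
    zip_code: str


@dataclass
class Price:
    """Number and unit kept apart so that each can be validated on its own."""
    amount: Decimal
    currency: str  # e.g. "USD"

    def __post_init__(self):
        if self.amount == 0:
            raise ValueError("amount must be non-zero")


@dataclass
class User:
    primary_email: str
    address: Address
    interest: Interest
    additional_emails: List[str] = field(default_factory=list)


def parse_interest(raw):
    """Map entered text onto the predefined list, falling back to OTHER."""
    try:
        return Interest(raw.strip().upper())
    except ValueError:
        return Interest.OTHER
```

For instance, `parse_interest(" outdoor ")` yields `Interest.OUTDOOR`, while an unlisted value such as `"rhinos"` is recorded as `Interest.OTHER` rather than as a new, uncontrolled spelling.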