This article was inspired by the IISE webinar “Supply Chain Data: Pains, Pitfalls & Some Rays of Hope” by Dr. Mamnoon Jamil and Dr. Cheranellore (Vasu) Vasudevan, Ph.D. It also draws upon my professional experience and personal observations of what can go wrong with socialization of data, product data management system migrations and general IT crowdsourcing.
What Is the Socialization of Data?
The socialization of data involves pooling data into widely accessible repositories. This reduces the cost of administering and maintaining systems while allowing many people to use the data. It can also mean opening up the work done on that data: in theory, you could rely on volunteers, administrative staff from many organizations or crowdsourced labor via services like Amazon Mechanical Turk (MTurk) to add metadata to your records and categorize things. Both of these forms of “socialization” have their advantages and disadvantages.
What Are the Advantages of Socializing Data?
- Potentially faster categorization and analysis that can keep pace with updates as transactions are processed or data definitions change
- The ability to tap into a broad knowledge base, potentially finding novel solutions and cross-references
- You don’t necessarily need to set up data mining AI or worry you’re losing out by using someone else’s data mining algorithms
- The possibility of keeping up with the masses of data we’re generating
- Language barriers could be broken through by crowdsourcing to bilingual volunteers and crowd-workers, minimizing translation errors
- If someone comes to a realization, they have a clear self-service channel for reporting it and good odds of it filtering up to those who need to know
- Ad hoc reporting and endlessly varied data analytics become possible for anyone with access.
- The data lakes and large repositories allow you to easily combine new data sources and then leverage existing reports and talent to find correlations.
What Are the Disadvantages of Socializing Data?
- The potential scraping and misuse of your data by others is certainly a problem.
- Access control and data protection become simultaneously harder and more important.
- Poor data quality gets in the way of finding useful correlations, or it leads to incorrect decisions on a far wider scale.
- The people correlating it may not know what is considered important, so they waste time finding relationships that are known to experts or unimportant.
- Bad data results in bad analysis, reinforcing incorrect assumptions by more people than would happen if you were simply giving your report to your boss.
- Socialized data can hurt data quality through duplicate entries, blank values and incorrect entries.
- Inconsistent data formats lead to incorrect assumptions or greater margins of error.
- Giving people too much data to analyze overloads them, hurting quality of work.
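Several of the data quality problems listed above (duplicate entries, blank values, inconsistent formats) can be checked for before pooled data reaches the people analyzing it. The following is a minimal sketch, not a definitive implementation. It assumes records arrive as dictionaries, and the field names `part_id`, `supplier` and `ship_date`, the list of date formats and the sample records are all hypothetical and made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical date formats that different contributors might use.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def detect_date_format(value):
    """Return the first known format that parses `value`, or None."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def audit_records(records, key_field="part_id", date_field="ship_date"):
    """Summarize duplicate keys, rows with blank values and date formats in use."""
    key_counts = Counter(r.get(key_field) for r in records)
    duplicate_keys = sorted(k for k, n in key_counts.items() if k and n > 1)

    # A row counts as "blank" if any of its values is missing or empty.
    rows_with_blanks = [
        i for i, r in enumerate(records)
        if any(v is None or str(v).strip() == "" for v in r.values())
    ]

    # More than one format here means dates cannot be compared without cleanup.
    date_formats = Counter(
        detect_date_format(str(r.get(date_field) or "")) for r in records
    )
    return {
        "duplicate_keys": duplicate_keys,
        "rows_with_blanks": rows_with_blanks,
        "date_formats": dict(date_formats),
    }


# Made-up sample records, for illustration only.
sample_records = [
    {"part_id": "A100", "supplier": "Acme", "ship_date": "2023-01-15"},
    {"part_id": "A100", "supplier": "Acme", "ship_date": "01/15/2023"},
    {"part_id": "B200", "supplier": "", "ship_date": "2023-02-01"},
]

report = audit_records(sample_records)
print(report)
```

A report like this does not fix anything by itself, but it gives admins a cheap way to see how much cleanup a shared repository needs before anyone builds conclusions on it.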
What Are the Risks of Socializing Data?
- Quality of work by the crowd tagging and flagging data is highly variable. Maintaining quality adds to the workload of any admins.
- The categorization and analysis of data will be skewed if you don’t seek diversity of thought in the early stages.
- When you store data in data lakes to maximize efficiency and possible correlations, you can lose track of data records or characteristics like metadata needed to properly sort it.
- Data lakes and similar mass repositories of data can hide the fact you’ve lost records or incorrectly tied metadata to records.
- The need to audit, manage and control crowd access to the data is a cost that may not be worth the benefits.
- It can lead to collecting tons of data with no current value in the hope that it will someday be useful, adding to data storage and management costs and worsening the cost-benefit ratio of the whole project.
- Automatic filtering and prioritization of data may remove information and context you need to get real value out of it.
- Automated metadata generation may miss key attributes and contribute to burial of valuable details needed to generate value.
- The value of the data repository or “lake” depends on how well organized it is from the beginning and on how well that structure is maintained. Poor design almost guarantees garbage in, garbage out (GIGO).
- New data may not be better than old data, but failing to record when data was collected may prevent you from identifying relevant trends.
- Replacing old data in favor of new data may cost you key insights.
Example: models that assume temperature isn’t a significant variable in manufacturing, because temperature was so tightly controlled in later production systems that the newer data shows no effect.
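The temperature example can be sketched with a few lines of arithmetic. This is an illustration under stated assumptions, not real production data: the temperatures and defect rates below are invented, and `pearson` is a hand-written helper for the ordinary Pearson correlation coefficient. The newer records cover such a narrow temperature range that temperature looks irrelevant, while the same calculation with the older records included shows a strong relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented numbers: older process, temperature (deg C) varied widely.
old_temps = [15.0, 20.0, 25.0, 30.0, 35.0]
old_defects = [1.0, 2.0, 3.0, 4.0, 5.0]  # defect rate in percent

# Invented numbers: newer process, temperature held near 25 deg C.
new_temps = [24.9, 25.0, 25.1, 24.9, 25.0, 25.1]
new_defects = [3.1, 3.0, 2.9, 2.9, 3.0, 3.1]

# Using only the new data, temperature appears to have no effect.
r_new = pearson(new_temps, new_defects)

# Keeping the old data alongside the new reveals the relationship.
r_all = pearson(old_temps + new_temps, old_defects + new_defects)

print(f"new data only: r = {r_new:.2f}")
print(f"old + new data: r = {r_all:.2f}")
```

The point is not the specific numbers but the pattern: a variable that is held constant cannot show up as significant, so discarding the records from when it varied discards the evidence that it matters.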