Navigating the Data Pipeline Landscape

In today’s data-driven world, we use a variety of data to infer insights and make decisions. Whether it is assessing people’s sentiment on a social media platform, predicting sales of a product, or automatically driving a car, everything involves data and data-driven decision making.

When it comes to data, there are producers and there are consumers. Examples of producers include humans, IoT devices, and monitoring tools. Examples of consumers include AI/ML programs, dashboards, and data warehouses. Consider the following examples:

  • Point-of-sale servers collect sales data, which is used to forecast future sales.
  • Monitoring tools collect performance metrics of applications, which are used to render real-time dashboards of application health.
  • Customers post reviews about a product, which are used to assess how well the product fits their needs.
  • Various sensors in a car collect data, which is used by an AI engine to drive the car automatically.
  • Users search for and view videos, and that history is used by video sharing platforms to provide recommendations.
  • Applications collect biometric data, which is used by criminal investigation bodies to identify victims or criminals.

Producers and consumers often look at data in different ways. Data generated by producers often cannot be consumed as is. It usually needs cleaning, pre-processing, and transformation before consumers can use it. Hence the need for data pipelines.
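
To make the idea of cleaning, pre-processing, and transformation concrete, here is a minimal sketch in Python based on the point-of-sale example. Everything in it is an assumption made for illustration: the record layout, the sample values, and the function names (clean, preprocess, transform, run_pipeline) are hypothetical, and a real pipeline would typically rely on dedicated tooling rather than hand-written functions.

```python
from collections import defaultdict

# Hypothetical raw records, as a producer (a point-of-sale server) might emit them.
RAW_SALES = [
    {"date": "2024-01-01", "item": " Widget ", "amount": "19.99"},
    {"date": "2024-01-01", "item": "widget", "amount": "5.01"},
    {"date": "2024-01-02", "item": "Gadget", "amount": "n/a"},  # unusable amount
    {"date": "2024-01-02", "item": "gadget", "amount": "10.00"},
]


def clean(records):
    """Cleaning: drop records whose amount cannot be parsed as a number."""
    for record in records:
        try:
            float(record["amount"])
        except ValueError:
            continue
        yield record


def preprocess(records):
    """Pre-processing: normalize item names and convert amounts to floats."""
    for record in records:
        yield {
            "date": record["date"],
            "item": record["item"].strip().lower(),
            "amount": float(record["amount"]),
        }


def transform(records):
    """Transformation: aggregate into the daily totals a forecaster might consume."""
    totals = defaultdict(float)
    for record in records:
        totals[record["date"]] += record["amount"]
    return {date: round(total, 2) for date, total in totals.items()}


def run_pipeline(records):
    """Chain the three stages: producer data in, consumer-ready data out."""
    return transform(preprocess(clean(records)))


daily_totals = run_pipeline(RAW_SALES)
print(daily_totals)
```

Each stage takes the output of the previous one, so the consumer sees only per-day totals and never the raw, inconsistent records. Writing the stages as generators is one design choice among many; it keeps each step small and lets records flow through one at a time.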