A pipeline is a series of processing steps applied to data as it flows from a source to a destination. Pipelines are often used to transform, filter, or otherwise manipulate data to prepare it for further processing or storage.
In the context of YAML configuration files, a pipeline typically consists of three main sections: input, actions, and output. The input section specifies the source of the data that will be processed by the pipeline. This could be a file (such as a CSV, JSON, YAML, or XLSX file), a database, or another type of data source. The actions section specifies the series of processing steps that will be applied to the data. These could include operations such as filtering, formatting, or renaming fields. Finally, the output section specifies the destination for the processed data. This could be a file, a database, or another type of data sink.
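At a high level, such a configuration might be shaped as in the minimal sketch below. The section layout (pipeline, input, actions, output, reader, writer) follows Example 1 further down; the step name, field names, and filenames here are hypothetical placeholders, and the set of supported actions and reader/writer types depends on the specific pipeline tool in use.

pipeline:
  input:
    reader:
      type: csv                # source format: csv, json, yaml, xlsx, ...
      filename: input.csv      # hypothetical source file
  actions:
    # One or more named processing steps, applied in order.
    some_step:                 # hypothetical step name
      action: rename           # e.g. filter, format, rename, retain, ...
      from: old_field          # hypothetical field names
      to: new_field
  output:
    writer:
      type: csv                # destination format
      filename: output.csv     # hypothetical destination file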
Overall, pipelines provide a flexible and powerful way to process and transform data in a variety of formats and contexts.
Example 1:
pipeline:
  input:
    reader:
      type: csv
      filename: customer_data.csv
  actions:
    # Retain only the fields "name", "age", and "gender"
    retain_important_fields:
      action: retain
      keys: [name, age, gender]
    # Rename the "name" field to "customer_name"
    rename_name_field:
      action: rename
      from: name
      to: customer_name
    # Format the "age" field as a two-digit integer with leading zeros
    format_age_field:
      action: format
      field: age
      functions: [number]
      format: '%02d'
  output:
    writer:
      type: csv
      filename: processed_customer_data.csv
This pipeline reads in a CSV file with customer data, retains only the name, age, and gender fields, renames the name field to customer_name, and formats the age field as a two-digit integer with leading zeros. The resulting data is then written to a new CSV file.
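For illustration, suppose customer_data.csv contained the following hypothetical rows (invented here only to show the effect of each step):

name,age,gender,email
Alice,7,F,alice@example.com
Bob,34,M,bob@example.com

The pipeline above would then produce processed_customer_data.csv along these lines:

customer_name,age,gender
Alice,07,F
Bob,34,M

Note that the email column is dropped by the retain step, the name column is renamed to customer_name by the rename step, and Alice's age gains a leading zero (07) from the '%02d' format.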