Demo example: Applying Quality Control.#

In this example, we apply Quality Control (QC) on the demo data.

Create your dataset#

We start by creating a dataset.

import metobs_toolkit
your_dataset = metobs_toolkit.Dataset()
your_dataset.update_settings(
    input_data_file=metobs_toolkit.demo_datafile,  # path to the data file
)


A number of quality control methods are available in the toolkit. We can classify them into two groups:

  1. Quality control for missing/duplicated or invalid timestamps. This is applied to the raw data and is not based on the observational value but merely on the presence of a record.

  2. Quality control for bad observations. These checks are not executed automatically. They are applied as a sequence of specific checks, each looking for a signature typical of bad observations.

Quality control for missing/duplicated and invalid timestamps#

Since this is applied to the raw data, the following quality control checks are automatically performed when reading the data:

  • NaN check: Test if the value of an observation can be converted to a numeric value.

  • Missing check: Test if there are missing records. Missing records are labeled either as a missing observation or, when several consecutive records are missing, as a gap.

  • Duplicate check: Test if each observation (station name, timestamp, observation type) is unique.
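The timestamp checks above can be illustrated with a minimal, standard-library sketch. This is not the toolkit's implementation: the function names `find_duplicates` and `find_missing`, the record layout, and the rule "one absent record is a missing observation, a longer run is a gap" are assumptions made for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_duplicates(records):
    """Return the (station, timestamp, obstype) keys that occur more than once."""
    counts = Counter((r["station"], r["timestamp"], r["obstype"]) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


def find_missing(timestamps, freq):
    """Split absent timestamps into single missing records and gaps.

    Illustrative rule (an assumption): a run of exactly one absent timestamp
    is a missing record, a longer run is a gap given as (first, last).
    """
    present = set(timestamps)
    start, end = min(present), max(present)
    missing, gaps, run = [], [], []
    current = start
    while current <= end:
        if current not in present:
            run.append(current)
        elif run:
            if len(run) == 1:
                missing.append(run[0])
            else:
                gaps.append((run[0], run[-1]))
            run = []
        current += freq
    return missing, gaps


t0 = datetime(2022, 9, 1, 0, 0)
step = timedelta(minutes=5)

# A 5-minute series where record 2 is absent (single) and records 5-7 are absent (gap).
timestamps = [t0 + i * step for i in (0, 1, 3, 4, 8, 9)]
missing, gaps = find_missing(timestamps, step)

# Two records share the same station, timestamp and observation type.
records = [
    {"station": "vlinder02", "timestamp": t0, "obstype": "temp", "value": 18.1},
    {"station": "vlinder02", "timestamp": t0, "obstype": "temp", "value": 18.3},
    {"station": "vlinder02", "timestamp": t0 + step, "obstype": "temp", "value": 18.0},
]
duplicates = find_duplicates(records)

print("missing records:", missing)
print("gaps:", gaps)
print("duplicated keys:", duplicates)
```

Note that none of these checks look at the observed value itself, only at which records are present.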

As an example you can see that there is a missing timestamp in the time series of some stations:

[Figure: time series plot titled 'Temperatuur of vlinder02'; x-axis: datetime; y-axis: 'Temperatuur (Celcius)', 2m-temperature]

Quality control for bad observations#

The following checks are available:

  • Gross value check: Test if observations lie between a lower and an upper threshold.

  • Persistence check: Test if observations change over a specific period.

  • Repetitions check: Test if an observation changes after several records.

  • Step (spike) check: Test if observations do not produce spikes in the time series.

  • Window variation check: Test if the variation exceeds the threshold in moving time windows.

  • Toolkit Buddy check: Spatial buddy check.

  • TITAN Buddy check: The Titanlib version of the buddy check.

  • TITAN Spatial consistency test: Apply the Titanlib (robust) Spatial-Consistency-Test (SCT).
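To make the first non-spatial checks in this list more concrete, here is a minimal pure-Python sketch of a gross value, a repetitions and a step check on a single series. It is a simplified illustration, not the toolkit's code: the function names, parameter names and threshold values are all hypothetical, and the toolkit's actual definitions may differ in detail.

```python
def gross_value_check(values, min_value, max_value):
    """Indices of values outside the closed interval [min_value, max_value]."""
    return [i for i, v in enumerate(values) if not min_value <= v <= max_value]


def repetitions_check(values, max_valid_repetitions):
    """Indices in runs of identical values longer than max_valid_repetitions."""
    flagged, start = [], 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start > max_valid_repetitions:
                flagged.extend(range(start, i))
            start = i
    return flagged


def step_check(values, max_step):
    """Indices where the jump from the previous value exceeds max_step."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > max_step
    ]


# Temperatures in degrees Celsius: a stuck sensor (18.4 four times) and one spike (45.0).
values = [18.2, 18.4, 18.4, 18.4, 18.4, 45.0, 18.9, 19.1]

gross = gross_value_check(values, min_value=-15.0, max_value=39.0)
repeated = repetitions_check(values, max_valid_repetitions=3)
steps = step_check(values, max_step=5.0)

print("gross value outliers:", gross)
print("repetition outliers:", repeated)
print("step outliers:", steps)
```

In this sketch the spike at index 5 is caught by the gross value check, and the step check flags both the jump into the spike and the jump back out of it.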

Each check requires a set of specific settings, often stored per observation type. A set of default settings for temperature observations is stored in the settings of each dataset. Use the show() method and scroll to the QC section to see all QC settings.

All settings:


Use the update_qc_settings() method to update the default settings.

your_dataset.update_qc_settings(persis_time_win_to_check='30T')  # 30 minutes
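As a rough illustration of what a time-window setting such as the 30 minutes above could control, the following sketch flags all records in any 30-minute window in which the value never changes. The function `persistence_check`, its `min_records` parameter and the exact flagging rule are assumptions for this example, not the toolkit's definition.

```python
from datetime import datetime, timedelta


def persistence_check(series, window, min_records):
    """Flag records in any time window where the value never changes.

    series: list of (timestamp, value) tuples sorted by time.
    window: timedelta giving the window length (e.g. 30 minutes).
    min_records: windows with fewer records are not judged (hypothetical setting).
    """
    flagged = set()
    for i, (start, _) in enumerate(series):
        # Collect all records from `start` up to and including start + window.
        j = i
        while j < len(series) and series[j][0] - start <= window:
            j += 1
        distinct_values = {value for _, value in series[i:j]}
        if j - i >= min_records and len(distinct_values) == 1:
            flagged.update(range(i, j))
    return sorted(flagged)


t0 = datetime(2022, 9, 1, 0, 0)
values = [18.0, 18.1, 17.9, 17.9, 17.9, 17.9, 17.9, 17.9, 17.9, 18.2]
series = [(t0 + timedelta(minutes=5 * i), v) for i, v in enumerate(values)]

flagged = persistence_check(series, window=timedelta(minutes=30), min_records=5)
print("persistence outliers at indices:", flagged)
```

With 5-minute data, a 30-minute window spans seven records, so only the run of seven identical values fills a complete window and gets flagged.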

To apply quality control to the full dataset, use the apply_quality_control() method. Spatial quality control checks can be applied with the apply_buddy_check(), apply_titan_buddy_check() and apply_titan_sct_resistant_check() methods.

your_dataset.apply_quality_control(
    obstype="temp",  # which observations to check
    gross_value=True,  # apply gross_value check?
    persistance=True,  # apply persistence check?
    step=True,  # apply the step check?
    window_variation=True,  # apply the window variation check?
)

Use the time series plot methods, such as make_plot(), to see the effect of the quality control.

your_dataset.make_plot(obstype='temp', colorby='label')
[Figure: time series plot titled 'Temperatuur for all stations.', colored by QC label; x-axis: datetime; y-axis: 'Temperatuur (Celcius)', 2m-temperature]

If you are interested in the performance of the applied QC, you can use the get_qc_stats() method to get an overview of the frequency statistics.

your_dataset.get_qc_stats(obstype='temp', make_plot=True)
({'ok': 64.28984788359789,
  'QC outliers': 35.707671957671955,
  'missing (gaps)': 0.0,
  'missing (individual)': 0.00248015873015873},
 {'repetitions outlier': 29.658564814814813,
  'gross value outlier': 4.869378306878307,
  'persistance outlier': 1.0085978835978835,
  'in step outlier group': 0.17113095238095238,
  'duplicated timestamp outlier': 0.0,
  'invalid input': 0.0,
  'in window variation outlier group': 0.0,
  'buddy check outlier': 0.0,
  'titan buddy check outlier': 0.0,
  'sct resistant check outlier': 0.0},
 {'duplicated_timestamp': {'not checked': 0.0, 'ok': 100.0, 'outlier': 0.0},
  'invalid_input': {'not checked': 0.0, 'ok': 100.0, 'outlier': 0.0},
  'repetitions': {'not checked': 0.0,
   'ok': 70.34143518518519,
   'outlier': 29.658564814814813},
  'gross_value': {'not checked': 29.658564814814813,
   'ok': 65.47205687830689,
   'outlier': 4.869378306878307},
  'persistance': {'not checked': 34.52794312169312,
   'ok': 64.46345899470899,
   'outlier': 1.0085978835978835},
  'step': {'not checked': 35.53654100529101,
   'ok': 64.29232804232805,
   'outlier': 0.17113095238095238},
  'window_variation': {'not checked': 35.707671957671955,
   'ok': 64.29232804232805,
   'outlier': 0.0},
  'buddy_check': {'not checked': 100.0, 'ok': 0.0, 'outlier': 0.0},
  'titan_buddy_check': {'not checked': 100.0, 'ok': 0.0, 'outlier': 0.0},
  'titan_sct_resistant_check': {'not checked': 100.0,
   'ok': 0.0,
   'outlier': 0.0},
  'is_gap': {'not checked': 0, 'ok': 100.0, 'outlier': 0.0},
  'is_missing_timestamp': {'not checked': 0,
   'ok': 99.99751984126983,
   'outlier': 0.00248015873015873}})
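The percentages in this output can be cross-checked with a little arithmetic. The sketch below copies the relevant values from the output above and verifies three things: the final labels add up to 100 %, the individual outlier labels add up to the 'QC outliers' share, and each check's 'not checked' share equals the share already flagged by earlier checks. The execution order used here is an assumption inferred from the growing 'not checked' values, not something stated by the toolkit.

```python
import math

# Values copied from the get_qc_stats() output above (percentages).
final = {
    "ok": 64.28984788359789,
    "QC outliers": 35.707671957671955,
    "missing (gaps)": 0.0,
    "missing (individual)": 0.00248015873015873,
}
outliers = {
    "repetitions": 29.658564814814813,
    "gross_value": 4.869378306878307,
    "persistance": 1.0085978835978835,
    "step": 0.17113095238095238,
    "window_variation": 0.0,
}
not_checked = {
    "repetitions": 0.0,
    "gross_value": 29.658564814814813,
    "persistance": 34.52794312169312,
    "step": 35.53654100529101,
    "window_variation": 35.707671957671955,
}

total = sum(final.values())
outlier_total = sum(outliers.values())

# Assumed order of execution, inferred from the 'not checked' shares.
order = ["repetitions", "gross_value", "persistance", "step", "window_variation"]
cascade_ok = True
flagged_so_far = 0.0
for check in order:
    # A record flagged by an earlier check is skipped by later checks.
    matches = math.isclose(not_checked[check], flagged_so_far, abs_tol=1e-9)
    cascade_ok = cascade_ok and matches
    flagged_so_far += outliers[check]

print("final labels sum to 100:", math.isclose(total, 100.0, abs_tol=1e-9))
print("outlier labels sum to 'QC outliers':",
      math.isclose(outlier_total, final["QC outliers"], abs_tol=1e-9))
print("'not checked' equals earlier outliers:", cascade_ok)
```

This is consistent with the checks running in sequence, where an observation already labeled as an outlier is not tested again. It also shows that the repetitions check accounts for most of the flagged records in this run.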

Quality control exercise#

For a more detailed reference you can use this Quality control exercise, which was created in the context of the COST FAIRNESS summer school 2023 in Ghent.