Demo example: Applying Quality Control.#

In this example, we apply Quality Control (QC) to the demo data.

Create your dataset#

We start by creating a dataset.

[1]:

import metobs_toolkit
your_dataset = metobs_toolkit.Dataset()
your_dataset.update_settings(
    input_data_file=metobs_toolkit.demo_datafile,  # path to the data file
    input_metadata_file=metobs_toolkit.demo_metadatafile,  # path to the metadata file
    template_file=metobs_toolkit.demo_template,  # path to the template file
)

your_dataset.import_data_from_file()

A number of quality control methods are available in the toolkit. We can classify them into two groups:

Quality control for missing, duplicated or invalid timestamps. These checks are applied to the raw data and depend only on the presence of a record, not on the observed value.
Quality control for bad observations. These checks are not executed automatically. They run as a sequence of specific checks, each looking for a signature that is typical of bad observations.

Quality control for missing/duplicated and invalid timestamps#

Because these checks operate on the raw data, they are performed automatically when the data is read:

NaN check: Test whether the value of an observation can be converted to a numeric value.
Missing check: Test whether records are missing. A missing record is labeled as a missing observation, or as part of a gap when several consecutive records are missing.
Duplicate check: Test whether each observation (station name, timestamp, observation type) is unique.
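
The idea behind the missing and duplicate checks can be sketched in plain Python. This is a minimal illustration and not the toolkit's implementation: the function names, the hourly frequency and the sample timestamps are all assumptions made for the example, and the rule the toolkit uses to decide when a run of missing records becomes a gap is not reproduced here.

```python
from collections import Counter
from datetime import datetime, timedelta


def check_timestamps(timestamps, freq):
    """Return (duplicates, missing) for the timestamps of one station.

    `freq` is the expected spacing between two records (a timedelta).
    """
    counts = Counter(timestamps)
    duplicates = sorted(ts for ts, n in counts.items() if n > 1)

    present = set(timestamps)
    missing = []
    current, end = min(present), max(present)
    while current <= end:
        if current not in present:
            missing.append(current)
        current += freq
    return duplicates, missing


def group_consecutive(missing, freq):
    """Group missing timestamps into runs of consecutive records."""
    runs = []
    for ts in missing:
        if runs and ts - runs[-1][-1] == freq:
            runs[-1].append(ts)
        else:
            runs.append([ts])
    return runs


start = datetime(2022, 9, 1)
hour = timedelta(hours=1)
# Hypothetical hourly series: 02:00 occurs twice, 04:00, 05:00 and 08:00 are absent.
stamps = [start + i * hour for i in (0, 1, 2, 2, 3, 6, 7, 9)]

duplicates, missing = check_timestamps(stamps, hour)
runs = group_consecutive(missing, hour)
```

Here `runs` holds one run of two consecutive missing records and one isolated missing record, which mirrors the distinction between a gap and a missing observation.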

For example, the time series of some stations contain a missing timestamp:

[2]:

your_dataset.get_station('vlinder02').make_plot(colorby='label')

[2]:

<Axes: title={'center': 'Temperatuur of vlinder02'}, xlabel='datetime', ylabel='Temperatuur (Celcius) \n 2m-temperature'>

Quality control for bad observations#

The following checks are available:

Gross value check: Test whether observations lie between a minimum and a maximum threshold.
Persistence check: Test whether observations change over a specific period.
Repetitions check: Test whether an observation changes after several records.
Spike check: Test whether the time series contains spikes. This check is named step in the settings and in apply_quality_control().
Window variation check: Test whether the variation within a moving time window exceeds a threshold.
Toolkit Buddy check: Spatial buddy check, which compares an observation with those of neighboring stations.
TITAN Buddy check: The Titanlib version of the buddy check.
TITAN Spatial consistency test: Apply the Titanlib (robust) Spatial-Consistency-Test (SCT).
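
To make two of these checks concrete, here is a minimal sketch of a gross value check and a repetitions check on a plain list of temperatures. It is an illustration under stated assumptions, not the toolkit's code: the function names are hypothetical, the thresholds are passed in by hand, and flagging a run once it is strictly longer than `max_valid_repetitions` is an assumed convention.

```python
def gross_value_check(values, min_value, max_value):
    """Flag values outside the closed interval [min_value, max_value]."""
    return [not (min_value <= v <= max_value) for v in values]


def repetitions_check(values, max_valid_repetitions):
    """Flag every record in a run of identical values longer than the limit."""
    flags = [False] * len(values)
    run_start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start > max_valid_repetitions:
                for j in range(run_start, i):
                    flags[j] = True
            run_start = i
    return flags


# Hypothetical temperatures: one implausible value and a run of four identical values.
temps = [12.1, 12.3, 41.0, 12.4, 12.4, 12.4, 12.4, 12.6]
gross = gross_value_check(temps, min_value=-15.0, max_value=39.0)
repeated = repetitions_check(temps, max_valid_repetitions=3)
```

Only the value 41.0 fails the gross value check, and only the four records equal to 12.4 fail the repetitions check.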

Each check requires its own settings, usually stored per observation type. A set of default settings for temperature observations is stored in the settings of each dataset. Call settings.show() and scroll to the qc section of the output to see all QC settings.

[3]:

your_dataset.settings.show()

All settings:

 ---------------------------------------

 ---------------- IO (settings) ----------------------

* output_folder:

  -None

* input_data_file:

  -/home/thoverga/Documents/VLINDER_github/MetObs_toolkit/metobs_toolkit/datafiles/demo_datafile.csv

* input_metadata_file:

  -/home/thoverga/Documents/VLINDER_github/MetObs_toolkit/metobs_toolkit/datafiles/demo_metadatafile.csv

 ---------------- db (settings) ----------------------

 ---------------- time_settings (settings) ----------------------

* target_time_res:

  -60T

* resample_method:

  -nearest

* resample_limit:

  -1

* timezone:

  -UTC

* freq_estimation_method:

  -highest

* freq_estimation_simplify:

  -True

* freq_estimation_simplify_error:

  -2T

 ---------------- app (settings) ----------------------

* print_fmt_datetime:

  -%d/%m/%Y %H:%M:%S

* print_max_n:

  -40

* plot_settings:

  - time_series:

    -{'figsize': (15, 5), 'colormap': 'tab20', 'linewidth': 2, 'linestyle_ok': '-', 'linestyle_fill': '--', 'linezorder': 1, 'scattersize': 4, 'scatterzorder': 3, 'dashedzorder': 2, 'legend_n_columns': 5}

  - spatial_geo:

    -{'extent': [2.260609, 49.25, 6.118359, 52.350618], 'cmap': 'inferno_r', 'n_for_categorical': 5, 'figsize': (10, 15), 'fmt': '%d/%m/%Y %H:%M:%S UTC'}

  - pie_charts:

    -{'figsize': (10, 10), 'anchor_legend_big': (-0.25, 0.75), 'anchor_legend_small': (-3.5, 2.2), 'radius_big': 2.0, 'radius_small': 5.0}

  - color_mapper:

    -{'duplicated_timestamp': '#a32a1f', 'invalid_input': '#900357', 'gross_value': '#f1ff2b', 'persistance': '#f0051c', 'repetitions': '#056ff0', 'step': '#05d4f0', 'window_variation': '#05f0c9', 'buddy_check': '#8300c4', 'titan_buddy_check': '#8300c4', 'titan_sct_resistant_check': '#c17fe1', 'gap': '#f00592', 'missing_timestamp': '#f78e0c', 'linear': '#d406c6', 'model_debias': '#6e1868', 'ok': '#07f72b', 'not checked': '#f7cf07', 'outlier': '#f20000'}

  - diurnal:

    -{'figsize': (10, 10), 'alpha_error_bands': 0.3, 'cmap_continious': 'viridis', 'n_cat_max': 20, 'cmap_categorical': 'tab20', 'legend_n_columns': 5}

  - anual:

    -{'figsize': (10, 10), 'alpha_error_bands': 0.3, 'cmap_continious': 'viridis', 'n_cat_max': 20, 'cmap_categorical': 'tab20', 'legend_n_columns': 5}

  - correlation_heatmap:

    -{'figsize': (10, 10), 'vmin': -1, 'vmax': 1, 'cmap': 'cool', 'x_tick_rot': 65, 'y_tick_rot': 0}

  - correlation_scatter:

    -{'figsize': (10, 10), 'p_bins': [0, 0.001, 0.01, 0.05, 999], 'bins_markers': ['*', 's', '^', 'X'], 'scatter_size': 40, 'scatter_edge_col': 'black', 'scatter_edge_line_width': 0.1, 'ymin': -1.1, 'ymax': 1.1, 'cmap': 'tab20', 'legend_ncols': 3, 'legend_text_size': 7}

* world_boundary_map:

  -/home/thoverga/Documents/VLINDER_github/MetObs_toolkit/metobs_toolkit/settings_files/world_boundaries/WB_countries_Admin0_10m.shp

* display_name_mapper:

  - network:

    -network

  - name:

    -station name

  - call_name:

    -pseudo name

  - location:

    -region

  - lat:

    -latitude

  - lon:

    -longtitude

  - temp:

    -temperature

  - radiation_temp:

    -radiation temperature

  - humidity:

    -humidity

  - precip:

    -precipitation intensity

  - precip_sum:

    -cummulated precipitation

  - wind_speed:

    -wind speed

  - wind_gust:

    -wind gust speed

  - wind_direction:

    -wind direction

  - pressure:

    -air pressure

  - pressure_at_sea_level:

    -corrected pressure at sea level

  - lcz:

    -LCZ

* static_fields:

  -['network', 'name', 'lat', 'lon', 'call_name', 'location', 'lcz']

* categorical_fields:

  -['wind_direction', 'lcz']

* location_info:

  -['network', 'lat', 'lon', 'lcz', 'call_name', 'location']

* default_name:

  -unknown_name

 ---------------- qc (settings) ----------------------

* qc_check_settings:

  - duplicated_timestamp:

    -{'keep': False}

  - persistance:

    -{'temp': {'time_window_to_check': '1h', 'min_num_obs': 5}}

  - repetitions:

    -{'temp': {'max_valid_repetitions': 5}}

  - gross_value:

    -{'temp': {'min_value': -15.0, 'max_value': 39.0}}

  - window_variation:

    -{'temp': {'max_increase_per_second': 0.0022222222222222222, 'max_decrease_per_second': 0.002777777777777778, 'time_window_to_check': '1h', 'min_window_members': 3}}

  - step:

    -{'temp': {'max_increase_per_second': 0.0022222222222222222, 'max_decrease_per_second': -0.002777777777777778}}

  - buddy_check:

    -{'temp': {'radius': 15000, 'num_min': 2, 'threshold': 1.5, 'max_elev_diff': 200, 'elev_gradient': -0.0065, 'min_std': 1.0}}

* qc_checks_info:

  - duplicated_timestamp:

    -{'outlier_flag': 'duplicated timestamp outlier', 'numeric_flag': 1, 'apply_on': 'record'}

  - invalid_input:

    -{'outlier_flag': 'invalid input', 'numeric_flag': 2, 'apply_on': 'obstype'}

  - gross_value:

    -{'outlier_flag': 'gross value outlier', 'numeric_flag': 4, 'apply_on': 'obstype'}

  - persistance:

    -{'outlier_flag': 'persistance outlier', 'numeric_flag': 5, 'apply_on': 'obstype'}

  - repetitions:

    -{'outlier_flag': 'repetitions outlier', 'numeric_flag': 6, 'apply_on': 'obstype'}

  - step:

    -{'outlier_flag': 'in step outlier group', 'numeric_flag': 7, 'apply_on': 'obstype'}

  - window_variation:

    -{'outlier_flag': 'in window variation outlier group', 'numeric_flag': 8, 'apply_on': 'obstype'}

  - buddy_check:

    -{'outlier_flag': 'buddy check outlier', 'numeric_flag': 11, 'apply_on': 'obstype'}

  - titan_buddy_check:

    -{'outlier_flag': 'titan buddy check outlier', 'numeric_flag': 9, 'apply_on': 'obstype'}

  - titan_sct_resistant_check:

    -{'outlier_flag': 'sct resistant check outlier', 'numeric_flag': 10, 'apply_on': 'obstype'}

* titan_check_settings:

  - titan_buddy_check:

    -{'temp': {'radius': 50000, 'num_min': 2, 'threshold': 1.5, 'max_elev_diff': 200, 'elev_gradient': -0.0065, 'min_std': 1.0, 'num_iterations': 1}}

  - titan_sct_resistant_check:

    -{'temp': {'num_min_outer': 3, 'num_max_outer': 10, 'inner_radius': 20000, 'outer_radius': 50000, 'num_iterations': 10, 'num_min_prof': 5, 'min_elev_diff': 100, 'min_horizontal_scale': 250, 'max_horizontal_scale': 100000, 'kth_closest_obs_horizontal_scale': 2, 'vertical_scale': 200, 'mina_deviation': 10, 'maxa_deviation': 10, 'minv_deviation': 1, 'maxv_deviation': 1, 'eps2': 0.5, 'tpos': 5, 'tneg': 8, 'basic': True, 'debug': False}}

* titan_specific_labeler:

  - titan_buddy_check:

    -{'ok': [0], 'outl': [1]}

  - titan_sct_resistant_check:

    -{'ok': [0, -999, 11, 12], 'outl': [1]}

 ---------------- gap (settings) ----------------------

* gaps_settings:

  - gaps_finder:

    -{'gapsize_n': 40}

* gaps_info:

  - gap:

    -{'label_columnname': 'is_gap', 'outlier_flag': 'gap', 'negative_flag': 'no gap', 'numeric_flag': 12, 'apply_on': 'record'}

  - missing_timestamp:

    -{'label_columnname': 'is_missing_timestamp', 'outlier_flag': 'missing timestamp', 'negative flag': 'not missing', 'numeric_flag': 13, 'apply_on': 'record'}

* gaps_fill_settings:

  - linear:

    -{'method': 'time', 'max_consec_fill': 100}

  - model_debias:

    -{'debias_period': {'prefered_leading_sample_duration_hours': 48, 'prefered_trailing_sample_duration_hours': 48, 'minimum_leading_sample_duration_hours': 24, 'minimum_trailing_sample_duration_hours': 24}}

  - automatic:

    -{'max_interpolation_duration_str': '5H'}

* gaps_fill_info:

  - label_columnname:

    -final_label

  - label:

    -{'linear': 'gap_interpolation', 'model_debias': 'gap_debiased_era5'}

  - numeric_flag:

    -21

 ---------------- missing_obs (settings) ----------------------

* missing_obs_fill_settings:

  - linear:

    -{'method': 'time'}

* missing_obs_fill_info:

  - label_columnname:

    -final_label

  - label:

    -{'linear': 'missing_obs_interpolation'}

  - numeric_flag:

    -23

 ---------------- templates (settings) ----------------------

* template_file:

  -/home/thoverga/Documents/VLINDER_github/MetObs_toolkit/metobs_toolkit/datafiles/demo_templatefile.csv

 ---------------- gee (settings) ----------------------

* gee_dataset_info:

  - global_lcz_map:

    -{'location': 'RUB/RUBCLIM/LCZ/global_lcz_map/v1', 'usage': 'LCZ', 'band_of_use': 'LCZ_Filter', 'value_type': 'categorical', 'dynamical': False, 'scale': 100, 'is_image': False, 'is_imagecollection': True, 'categorical_mapper': {1: 'Compact highrise', 2: 'Compact midrise', 3: 'Compact lowrise', 4: 'Open highrise', 5: 'Open midrise', 6: 'Open lowrise', 7: 'Lightweight lowrise', 8: 'Large lowrise', 9: 'Sparsely built', 10: 'Heavy industry', 11: 'Dense Trees (LCZ A)', 12: 'Scattered Trees (LCZ B)', 13: 'Bush, scrub (LCZ C)', 14: 'Low plants (LCZ D)', 15: 'Bare rock or paved (LCZ E)', 16: 'Bare soil or sand (LCZ F)', 17: 'Water (LCZ G)'}, 'credentials': 'Demuzere M.; Kittner J.; Martilli A.; Mills, G.; Moede, C.; Stewart, I.D.; van Vliet, J.; Bechtel, B. A global map of local climate zones to support earth system modelling and urban-scale environmental science. Earth System Science Data 2022, 14 Volume 8: 3835-3873. doi:10.5194/essd-14-3835-2022'}

  - DEM:

    -{'location': 'CGIAR/SRTM90_V4', 'usage': 'elevation', 'band_of_use': 'elevation', 'value_type': 'numeric', 'dynamical': False, 'scale': 100, 'is_image': True, 'is_imagecollection': False, 'credentials': 'SRTM Digital Elevation Data Version 4'}

  - ERA5_hourly:

    -{'location': 'ECMWF/ERA5_LAND/HOURLY', 'usage': 'ERA5', 'band_of_use': {'temp': {'name': 'temperature_2m', 'units': 'K'}}, 'value_type': 'numeric', 'dynamical': True, 'scale': 2500, 'is_image': False, 'is_imagecollection': True, 'time_res': '1H', 'credentials': ''}

  - worldcover:

    -{'location': 'ESA/WorldCover/v200', 'usage': 'landcover', 'band_of_use': 'Map', 'value_type': 'categorical', 'dynamical': False, 'scale': 10, 'is_image': False, 'is_imagecollection': True, 'categorical_mapper': {10: 'Tree cover', 20: 'Shrubland', 30: 'Grassland', 40: 'Cropland', 50: 'Built-up', 60: 'Bare / sparse vegetation', 70: 'Snow and ice', 80: 'Permanent water bodies', 90: 'Herbaceous wetland', 95: 'Mangroves', 100: 'Moss and lichen'}, 'aggregation': {'water': [70, 80, 90, 95], 'pervious': [10, 20, 30, 40, 60, 100], 'impervious': [50]}, 'colorscheme': {10: '006400', 20: 'ffbb22', 30: 'ffff4c', 40: 'f096ff', 50: 'fa0000', 60: 'b4b4b4', 70: 'f0f0f0', 80: '0064c8', 90: '0096a0', 95: '00cf75', 100: 'fae6a0'}, 'credentials': 'https://spdx.org/licenses/CC-BY-4.0.html'}
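
The step and window_variation thresholds in the output above are rates per second, which makes them hard to read. The default values correspond to 8 degrees per hour for an increase and 10 degrees per hour for a decrease. Note also that the decrease threshold is negative in the step settings and positive in the window_variation settings. The sketch below converts hourly rates to per-second rates and applies a simple step check to consecutive records. The function names and sample data are assumptions for illustration, and the toolkit's own step check may differ in detail.

```python
from datetime import datetime, timedelta


def per_hour_to_per_second(rate_per_hour):
    """Convert a rate expressed per hour to a rate per second."""
    return rate_per_hour / 3600.0


def step_check(timestamps, values, max_increase_per_second, max_decrease_per_second):
    """Flag records whose rate of change since the previous record is too steep.

    `max_decrease_per_second` is expected to be negative, as in the step settings.
    """
    flags = [False] * len(values)
    for i in range(1, len(values)):
        seconds = (timestamps[i] - timestamps[i - 1]).total_seconds()
        rate = (values[i] - values[i - 1]) / seconds
        flags[i] = rate > max_increase_per_second or rate < max_decrease_per_second
    return flags


max_increase = per_hour_to_per_second(8.0)    # about 0.002222 per second
max_decrease = per_hour_to_per_second(-10.0)  # about -0.002778 per second

start = datetime(2022, 9, 1)
times = [start + timedelta(hours=i) for i in range(5)]
# Hypothetical hourly temperatures with a jump of +9 degrees at the third record.
temps = [15.0, 16.0, 25.0, 15.5, 15.0]

step_flags = step_check(times, temps, max_increase, max_decrease)
```

With these thresholds the jump of +9 degrees in one hour is flagged, while the drop of 9.5 degrees in the next hour stays within the limit of 10 degrees per hour.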

Use the update_qc_settings() method to update the default settings.

[4]:

your_dataset.update_qc_settings(
    obstype='temp',
    gross_value_max_value=26.3,  # new upper threshold for the gross value check
    persis_time_win_to_check='30T',  # 30 minutes
)

To apply quality control to the full dataset, use the apply_quality_control() method. The spatial quality control checks are applied separately, with the apply_buddy_check(), apply_titan_buddy_check() and apply_titan_sct_resistant_check() methods.

[5]:

your_dataset.apply_quality_control(
    obstype="temp",  # which observations to check
    gross_value=True,  # apply the gross value check?
    persistance=True,  # apply the persistence check?
    step=True,  # apply the step check?
    window_variation=True,  # apply the window variation check?
)

Use the show() method of the dataset or the time series plot methods to see the effect of the quality control.

[6]:

your_dataset.make_plot(obstype='temp', colorby='label')

[6]:

<Axes: title={'center': 'Temperatuur for all stations. '}, xlabel='datetime', ylabel='Temperatuur (Celcius) \n 2m-temperature'>

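To see how much data each check removed, use the get_qc_stats() method. In the output below it returns three dictionaries of percentages: the share of each final label, the share of outliers per check, and a per-check breakdown into not checked, ok and outlier.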

[7]:

your_dataset.get_qc_stats(obstype='temp', make_plot=True)

[7]:

({'ok': 64.28984788359789,
  'QC outliers': 35.707671957671955,
  'missing (gaps)': 0.0,
  'missing (individual)': 0.00248015873015873},
 {'repetitions outlier': 29.658564814814813,
  'gross value outlier': 4.869378306878307,
  'persistance outlier': 1.0085978835978835,
  'in step outlier group': 0.17113095238095238,
  'duplicated timestamp outlier': 0.0,
  'invalid input': 0.0,
  'in window variation outlier group': 0.0,
  'buddy check outlier': 0.0,
  'titan buddy check outlier': 0.0,
  'sct resistant check outlier': 0.0},
 {'duplicated_timestamp': {'not checked': 0.0, 'ok': 100.0, 'outlier': 0.0},
  'invalid_input': {'not checked': 0.0, 'ok': 100.0, 'outlier': 0.0},
  'repetitions': {'not checked': 0.0,
   'ok': 70.34143518518519,
   'outlier': 29.658564814814813},
  'gross_value': {'not checked': 29.658564814814813,
   'ok': 65.47205687830689,
   'outlier': 4.869378306878307},
  'persistance': {'not checked': 34.52794312169312,
   'ok': 64.46345899470899,
   'outlier': 1.0085978835978835},
  'step': {'not checked': 35.53654100529101,
   'ok': 64.29232804232805,
   'outlier': 0.17113095238095238},
  'window_variation': {'not checked': 35.707671957671955,
   'ok': 64.29232804232805,
   'outlier': 0.0},
  'buddy_check': {'not checked': 100.0, 'ok': 0.0, 'outlier': 0.0},
  'titan_buddy_check': {'not checked': 100.0, 'ok': 0.0, 'outlier': 0.0},
  'titan_sct_resistant_check': {'not checked': 100.0,
   'ok': 0.0,
   'outlier': 0.0},
  'is_gap': {'not checked': 0, 'ok': 100.0, 'outlier': 0.0},
  'is_missing_timestamp': {'not checked': 0,
   'ok': 99.99751984126983,
   'outlier': 0.00248015873015873}})
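
The 'not checked' percentages in this output can be reproduced by simple arithmetic: for each check, the share of unchecked records equals the cumulative share of outliers found by the checks listed before it. This is consistent with the checks running one after another, each skipping records that were already flagged. That is a reading of the numbers, not a statement about the toolkit's internals. The sketch below copies the values from the output above and verifies the sums.

```python
# Outlier percentages per check, copied from the get_qc_stats() output above,
# in the order in which the 'not checked' shares grow.
outlier_pct = {
    "repetitions": 29.658564814814813,
    "gross_value": 4.869378306878307,
    "persistance": 1.0085978835978835,
    "step": 0.17113095238095238,
    "window_variation": 0.0,
}

# 'not checked' percentages per check, copied from the same output.
not_checked_reported = {
    "repetitions": 0.0,
    "gross_value": 29.658564814814813,
    "persistance": 34.52794312169312,
    "step": 35.53654100529101,
    "window_variation": 35.707671957671955,
}

# Each check skips the records flagged by the checks before it.
running_total = 0.0
not_checked_expected = {}
for check, pct in outlier_pct.items():
    not_checked_expected[check] = running_total
    running_total += pct

# running_total is now the share flagged by any check ('QC outliers').
# Adding the 'ok' and 'missing (individual)' shares should give 100 percent.
total = running_total + 64.28984788359789 + 0.00248015873015873
```

The repetitions check appears in these statistics although the apply_quality_control() call above does not pass it explicitly, which suggests that it is enabled by default.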

Quality control exercise#

For a more detailed reference you can use this Quality control exercise, which was created in the context of the COST FAIRNESS summer school 2023 in Ghent.