Demo example: Using a Dataset#

This introduction shows how to get started with the MetObs toolkit. The examples use the demo data files that come with the toolkit. Once the MetObs toolkit package is installed, you can import its functionality with:

[1]:
import metobs_toolkit

The Dataset#

A dataset is a collection of all observational data. Most of the methods are applied directly to a dataset. Start by creating an empty Dataset object:

[2]:
your_dataset = metobs_toolkit.Dataset()

The most relevant attributes of a Dataset are:

  • .df –> a pandas DataFrame where all the observational data are stored

  • .metadf –> a pandas DataFrame where all the metadata for each station are stored

  • .settings –> a Settings object to store all specific settings

  • .missing_obs and .gaps –> here the missing records and gaps are stored if present

Note that each Dataset will be equipped with the default settings.

We created a dataset and stored it under the variable ‘your_dataset’. The show() method prints an overview of the data in the dataset:

[3]:
your_dataset.show() # or .get_info()
--------  General ---------

Empty instance of a Dataset.

 --------  Observation types ---------

temp observation with:
     * standard unit: Celsius
     * data column as None in None
     * known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
     * description: 2m - temperature
     * conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
     * originates from data column: None with None as native unit.
humidity observation with:
     * standard unit: %
     * data column as None in None
     * known units and aliases: {'%': ['percent', 'percentage']}
     * description: 2m - relative humidity
     * conversions to known units: {}
     * originates from data column: None with None as native unit.
radiation_temp observation with:
     * standard unit: Celsius
     * data column as None in None
     * known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
     * description: 2m - Black globe
     * conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
     * originates from data column: None with None as native unit.
pressure observation with:
     * standard unit: pa
     * data column as None in None
     * known units and aliases: {'pa': ['Pascal', 'pascal', 'Pa'], 'hpa': ['hecto pascal', 'hPa'], 'psi': ['Psi'], 'bar': ['Bar']}
     * description: atmospheric pressure (at station)
     * conversions to known units: {'hpa': ['x * 100'], 'psi': ['x * 6894.7573'], 'bar': ['x * 100000.']}
     * originates from data column: None with None as native unit.
pressure_at_sea_level observation with:
     * standard unit: pa
     * data column as None in None
     * known units and aliases: {'pa': ['Pascal', 'pascal', 'Pa'], 'hpa': ['hecto pascal', 'hPa'], 'psi': ['Psi'], 'bar': ['Bar']}
     * description: atmospheric pressure (at sea level)
     * conversions to known units: {'hpa': ['x * 100'], 'psi': ['x * 6894.7573'], 'bar': ['x * 100000.']}
     * originates from data column: None with None as native unit.
precip observation with:
     * standard unit: mm/m²
     * data column as None in None
     * known units and aliases: {'mm/m²': ['mm', 'liter', 'liters', 'l/m²', 'milimeter']}
     * description: precipitation intensity
     * conversions to known units: {}
     * originates from data column: None with None as native unit.
precip_sum observation with:
     * standard unit: mm/m²
     * data column as None in None
     * known units and aliases: {'mm/m²': ['mm', 'liter', 'liters', 'l/m²', 'milimeter']}
     * description: Cummulated precipitation
     * conversions to known units: {}
     * originates from data column: None with None as native unit.
wind_speed observation with:
     * standard unit: m/s
     * data column as None in None
     * known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
     * description: wind speed
     * conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
     * originates from data column: None with None as native unit.
wind_gust observation with:
     * standard unit: m/s
     * data column as None in None
     * known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
     * description: wind gust
     * conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
     * originates from data column: None with None as native unit.
wind_direction observation with:
     * standard unit: ° from north (CW)
     * data column as None in None
     * known units and aliases: {'° from north (CW)': ['°', 'degrees']}
     * description: wind direction
     * conversions to known units: {}
     * originates from data column: None with None as native unit.

 --------  Settings ---------

(to show all settings use the .show_settings() method, or set show_all_settings = True)

 --------  Outliers ---------

No outliers.

 --------  Meta data ---------

No metadata is found.

TIP: to get an extensive overview of an object, call the .show() method on it.
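
The "conversions to known units" entries in the overview above read as chains of expressions that bring a value x from a known unit into the standard unit. The following is a minimal sketch of that idea, assuming the expressions are applied one after the other in the listed order. The names CONVERSIONS and to_standard_unit are hypothetical and are not part of the toolkit; the unit keys (including the spelling 'Farenheit') are copied from the output.

```python
# Conversion chains copied from the overview above. Each known unit maps to
# the steps that convert a value into the standard unit of its obstype.
# This is an illustration only, not the toolkit's implementation.
CONVERSIONS = {
    "Farenheit": [lambda x: x - 32.0, lambda x: x / 1.8],  # 'x-32.0', 'x/1.8' (to Celsius)
    "Kelvin": [lambda x: x - 273.15],                      # 'x - 273.15' (to Celsius)
    "km/h": [lambda x: x / 3.6],                           # 'x / 3.6' (to m/s)
    "hpa": [lambda x: x * 100],                            # 'x * 100' (to pa)
}


def to_standard_unit(value, unit):
    """Apply the conversion steps of `unit` one after the other.

    Units without an entry are assumed to be the standard unit already.
    """
    for step in CONVERSIONS.get(unit, []):
        value = step(value)
    return value


print(to_standard_unit(212.0, "Farenheit"))      # boiling point of water, in Celsius
print(round(to_standard_unit(5.6, "km/h"), 6))   # wind speed, in m/s
print(to_standard_unit(1013.25, "hpa"))          # pressure, in pa
```

Writing each conversion as a list of steps keeps multi-step conversions, such as the one for 'Farenheit', in the same form as single-step ones.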

Importing data#

To import your data into a Dataset, the following files are required:

  • data file: This is the CSV file containing the observations

  • (optional) metadata file: The CSV file containing metadata for all stations.

  • template file: This is a (JSON) file that is used to interpret your data file and your metadata file (if present).

In practice, you need to start by creating a template file for your data. More information on the creation of the template can be found in the documentation (under Mapping to the toolkit).

TIP: Use the ``build_template_prompt()`` of the toolkit for creating a template file.

[4]:
# metobs_toolkit.build_template_prompt()

To import data, you must specify the paths to your data, metadata and template. For this example, we use the demo data, metadata and template that come with the toolkit.

[5]:
your_dataset.update_settings(
    input_data_file=metobs_toolkit.demo_datafile, # path to the data file
    input_metadata_file=metobs_toolkit.demo_metadatafile,
    template_file=metobs_toolkit.demo_template,
)

The settings of your Dataset are updated with the required paths. Now the data can be imported into your empty Dataset:

[6]:
your_dataset.import_data_from_file()

Inspecting the Template#

The role of the template is to translate your data and metadata files into the standard structure used by the toolkit. It thus describes how your raw data is structured. More information on the creation of the template can be found in the documentation (under Mapping to the toolkit).

As an illustration, you can use the show() method on the template attribute to print out an overview of the template:

[7]:
your_dataset.template.show()
------ Data obstypes map ---------
 * temp            <---> Temperatuur
     (raw data in Celsius)
     (description: 2mT passive)

 * humidity        <---> Vochtigheid
     (raw data in %)
     (description: 2m relative humidity passive)

 * wind_speed      <---> Windsnelheid
     (raw data in km/h)
     (description: Average 2m  10-min windspeed)

 * wind_direction  <---> Windrichting
     (raw data in ° from north (CW))
     (description: Average 2m  10-min windspeed)


------ Data extra mapping info ---------
 * name column (data) <---> Vlinder

------ Data timestamp map ---------
 * datetimecolumn  <---> None
 * time_column     <---> Tijd (UTC)
 * date_column     <---> Datum
 * fmt             <---> %Y-%m-%d %H:%M:%S
 * Timezone        <---> None

------ Metadata map ---------
 * name            <---> Vlinder
 * lat             <---> lat
 * lon             <---> lon
 * school          <---> school
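
To make the mapping printed above concrete, here is a small sketch of what such a translation amounts to: raw column names are renamed to toolkit observation types, and the date and time columns are joined and parsed with the listed format. It uses only the standard library and does not reproduce the toolkit's internals. The raw rows in RAW_CSV are hypothetical sample values, and parse_rows is a name introduced for this illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw rows, laid out with the column names from the template overview.
RAW_CSV = """Vlinder,Datum,Tijd (UTC),Temperatuur,Vochtigheid,Windsnelheid,Windrichting
vlinder01,2022-09-01,00:00:00,18.8,65,5.6,65
vlinder01,2022-09-01,00:05:00,18.8,65,5.5,75
"""

# Raw data column -> observation type, as shown under "Data obstypes map".
OBSTYPE_MAP = {
    "Temperatuur": "temp",
    "Vochtigheid": "humidity",
    "Windsnelheid": "wind_speed",
    "Windrichting": "wind_direction",
}
TIMESTAMP_FMT = "%Y-%m-%d %H:%M:%S"  # the "fmt" entry of the timestamp map


def parse_rows(text):
    """Translate raw CSV rows into records keyed by toolkit-style names."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {
            "name": row["Vlinder"],
            # Join the date and time columns before parsing them.
            "datetime": datetime.strptime(
                f"{row['Datum']} {row['Tijd (UTC)']}", TIMESTAMP_FMT
            ),
        }
        for raw_column, obstype in OBSTYPE_MAP.items():
            # Values stay in their raw units here (e.g. km/h for wind speed).
            record[obstype] = float(row[raw_column])
        records.append(record)
    return records


records = parse_rows(RAW_CSV)
print(records[0])
```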

Inspecting the Data#

To get an overview of the data stored in your Dataset, use the show() method:

[8]:
your_dataset.show()
--------  General ---------

Dataset instance containing:
     *28 stations
     *['temp', 'humidity', 'wind_speed', 'wind_direction'] observation types
     *120957 observation records
     *0 records labeled as outliers
     *0 gaps
     *3 missing observations
     *records range: 2022-09-01 00:00:00+00:00 --> 2022-09-15 23:55:00+00:00 (total duration:  14 days 23:55:00)
     *time zone of the records: UTC
     *Coordinates are available for all stations.

 --------  Observation types ---------

temp observation with:
     * standard unit: Celsius
     * data column as Temperatuur in Celsius
     * known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
     * description: 2mT passive
     * conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
     * originates from data column: Temperatuur with Celsius as native unit.
humidity observation with:
     * standard unit: %
     * data column as Vochtigheid in %
     * known units and aliases: {'%': ['percent', 'percentage']}
     * description: 2m relative humidity passive
     * conversions to known units: {}
     * originates from data column: Vochtigheid with % as native unit.
wind_speed observation with:
     * standard unit: m/s
     * data column as Windsnelheid in km/h
     * known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
     * description: Average 2m  10-min windspeed
     * conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
     * originates from data column: Windsnelheid with km/h as native unit.
wind_direction observation with:
     * standard unit: ° from north (CW)
     * data column as Windrichting in ° from north (CW)
     * known units and aliases: {'° from north (CW)': ['°', 'degrees']}
     * description: Average 2m  10-min windspeed
     * conversions to known units: {}
     * originates from data column: Windrichting with ° from north (CW) as native unit.

 --------  Settings ---------

(to show all settings use the .show_settings() method, or set show_all_settings = True)

 --------  Outliers ---------

No outliers.

 --------  Meta data ---------

The following metadata is found: ['lat', 'lon', 'school', 'geometry', 'assumed_import_frequency', 'dataset_resolution']

 The first rows of the metadf looks like:
                 lat       lon        school                  geometry  \
name
vlinder01  50.980438  3.815763         UGent  POINT (3.81576 50.98044)
vlinder02  51.022379  3.709695         UGent  POINT (3.70969 51.02238)
vlinder03  51.324583  4.952109   Heilig Graf  POINT (4.95211 51.32458)
vlinder04  51.335522  4.934732   Heilig Graf  POINT (4.93473 51.33552)
vlinder05  51.052655  3.675183  Sint-Barbara  POINT (3.67518 51.05266)

          assumed_import_frequency dataset_resolution
name
vlinder01          0 days 00:05:00    0 days 00:05:00
vlinder02          0 days 00:05:00    0 days 00:05:00
vlinder03          0 days 00:05:00    0 days 00:05:00
vlinder04          0 days 00:05:00    0 days 00:05:00
vlinder05          0 days 00:05:00    0 days 00:05:00
-------- Missing observations info --------
(Note: missing observations are defined on the frequency estimation of the native dataset.)
  * 3 missing observations

 name
vlinder02   2022-09-10 17:10:00+00:00
vlinder02   2022-09-10 17:15:00+00:00
vlinder02   2022-09-10 17:45:00+00:00
Name: datetime, dtype: datetime64[ns, UTC]

  * For these stations: ['vlinder02']
  * The missing observations are not filled.
(More details on the missing observation can be found in the .series and .fill_df attributes.)
None

 --------  Gaps ---------

There are no gaps.
None
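
The numbers in the overview above can be cross-checked with a short calculation. The sketch below assumes that every station reports at the 5-minute frequency listed in the metadata and that each observation record is one timestamp of one station; the variable names are chosen for this illustration.

```python
from datetime import datetime, timedelta

# Values read from the overview printed above.
start = datetime(2022, 9, 1, 0, 0)
end = datetime(2022, 9, 15, 23, 55)
frequency = timedelta(minutes=5)
n_stations = 28
n_missing = 3

# Number of timestamps per station, counting both the first and the last one.
expected_per_station = (end - start) // frequency + 1

# Records present in the dataset = all expected records minus the missing ones.
expected_records = n_stations * expected_per_station - n_missing

print(expected_per_station)  # 4320
print(expected_records)      # 120957, the record count reported above
```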

If you want to inspect the data in your Dataset directly, you can take a look at the .df and .metadf attributes:

[9]:
print(your_dataset.df.head())
# equivalent for the metadata
print(your_dataset.metadf.head())

                                     temp  humidity  wind_speed  \
name      datetime
vlinder01 2022-09-01 00:00:00+00:00  18.8        65    1.555556
          2022-09-01 00:05:00+00:00  18.8        65    1.527778
          2022-09-01 00:10:00+00:00  18.8        65    1.416667
          2022-09-01 00:15:00+00:00  18.7        65    1.666667
          2022-09-01 00:20:00+00:00  18.7        65    1.388889

                                     wind_direction
name      datetime
vlinder01 2022-09-01 00:00:00+00:00              65
          2022-09-01 00:05:00+00:00              75
          2022-09-01 00:10:00+00:00              75
          2022-09-01 00:15:00+00:00              85
          2022-09-01 00:20:00+00:00              65
                 lat       lon        school                  geometry  \
name
vlinder01  50.980438  3.815763         UGent  POINT (3.81576 50.98044)
vlinder02  51.022379  3.709695         UGent  POINT (3.70969 51.02238)
vlinder03  51.324583  4.952109   Heilig Graf  POINT (4.95211 51.32458)
vlinder04  51.335522  4.934732   Heilig Graf  POINT (4.93473 51.33552)
vlinder05  51.052655  3.675183  Sint-Barbara  POINT (3.67518 51.05266)

          assumed_import_frequency dataset_resolution
name
vlinder01          0 days 00:05:00    0 days 00:05:00
vlinder02          0 days 00:05:00    0 days 00:05:00
vlinder03          0 days 00:05:00    0 days 00:05:00
vlinder04          0 days 00:05:00    0 days 00:05:00
vlinder05          0 days 00:05:00    0 days 00:05:00
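
As the output shows, the observations are indexed by station name and timestamp. Standard pandas indexing therefore applies. The sketch below builds a small stand-in DataFrame with the same index layout (it does not use the toolkit) and shows a few common selections. The vlinder02 values are placeholders.

```python
import pandas as pd

# Stand-in for your_dataset.df: a (name, datetime) MultiIndex with one column.
timestamps = pd.date_range("2022-09-01 00:00", periods=3, freq="5min", tz="UTC")
index = pd.MultiIndex.from_product(
    [["vlinder01", "vlinder02"], timestamps], names=["name", "datetime"]
)
df = pd.DataFrame(
    {"temp": [18.8, 18.8, 18.8, 17.0, 17.1, 17.2]},  # vlinder02 values are placeholders
    index=index,
)

# All records of one station, indexed by datetime only.
one_station = df.xs("vlinder01", level="name")

# A single value, selected by station name and timestamp.
one_value = df.loc[("vlinder01", timestamps[1]), "temp"]

# One column per station, which is convenient for comparing stations.
temps_wide = df["temp"].unstack(level="name")

print(one_station)
print(one_value)
print(temps_wide)
```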

Inspecting a Station#

If you are interested in one station, you can extract all the info for that one station from the dataset by:

[10]:
favorite_station = your_dataset.get_station(stationname="vlinder02")

The variable favorite_station now contains all the information of that one station. All methods that are applicable to a Dataset are also applicable to a Station. So to inspect your favorite station, you can call its show() method:

[11]:
print(favorite_station.show())
--------  General ---------

Station instance containing:
     *1 stations
     *['temp', 'humidity', 'wind_speed', 'wind_direction'] observation types
     *4317 observation records
     *0 records labeled as outliers
     *0 gaps
     *3 missing observations
     *records range: 2022-09-01 00:00:00+00:00 --> 2022-09-15 23:55:00+00:00 (total duration:  14 days 23:55:00)
     *time zone of the records: UTC
     *Coordinates are available for all stations.

 --------  Observation types ---------

temp observation with:
     * standard unit: Celsius
     * data column as Temperatuur in Celsius
     * known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
     * description: 2mT passive
     * conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
     * originates from data column: Temperatuur with Celsius as native unit.
humidity observation with:
     * standard unit: %
     * data column as Vochtigheid in %
     * known units and aliases: {'%': ['percent', 'percentage']}
     * description: 2m relative humidity passive
     * conversions to known units: {}
     * originates from data column: Vochtigheid with % as native unit.
wind_speed observation with:
     * standard unit: m/s
     * data column as Windsnelheid in km/h
     * known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
     * description: Average 2m  10-min windspeed
     * conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
     * originates from data column: Windsnelheid with km/h as native unit.
wind_direction observation with:
     * standard unit: ° from north (CW)
     * data column as Windrichting in ° from north (CW)
     * known units and aliases: {'° from north (CW)': ['°', 'degrees']}
     * description: Average 2m  10-min windspeed
     * conversions to known units: {}
     * originates from data column: Windrichting with ° from north (CW) as native unit.

 --------  Settings ---------

(to show all settings use the .show_settings() method, or set show_all_settings = True)

 --------  Outliers ---------

No outliers.

 --------  Meta data ---------

The following metadata is found: ['lat', 'lon', 'school', 'geometry', 'assumed_import_frequency', 'dataset_resolution']

 The first rows of the metadf looks like:
                 lat       lon school                    geometry  \
name
vlinder02  51.022379  3.709695  UGent  POINT (3.709695 51.022379)

          assumed_import_frequency dataset_resolution
name
vlinder02          0 days 00:05:00    0 days 00:05:00
-------- Missing observations info --------
(Note: missing observations are defined on the frequency estimation of the native dataset.)
  * 3 missing observations

 name
vlinder02   2022-09-10 17:10:00+00:00
vlinder02   2022-09-10 17:15:00+00:00
vlinder02   2022-09-10 17:45:00+00:00
Name: datetime, dtype: datetime64[ns, UTC]

  * For these stations: ['vlinder02']
  * The missing observations are not filled.
(More details on the missing observation can be found in the .series and .fill_df attributes.)
None

 --------  Gaps ---------

There are no gaps.
None
None

Making timeseries plots#

To make a timeseries plot, use the make_plot() method. The following call plots the temperature observations of the full Dataset:

[12]:
your_dataset.make_plot(obstype='temp')
[12]:
<Axes: title={'center': 'Temperatuur for all stations. '}, ylabel='temp (Celsius)'>
[Figure: timeseries plot of the temperature observations for all stations]

See the documentation of the make_plot() method for more details. Here is an example with commonly used arguments:

[13]:
#Import the standard datetime library to make timestamps from datetime objects
from datetime import datetime

your_dataset.make_plot(
    # specify the names of the stations in a list, or use None to plot all of them.
    stationnames=['vlinder01', 'vlinder03', 'vlinder05'],
    # what obstype to plot (default is 'temp')
    obstype="humidity",
    # choose how to color the timeseries:
    #'name' : a specific color per station
    #'label': a specific color per quality control label
    colorby="label",
    # choose a start and endtime for the series (datetime).
    # Default is None, which uses all available data
    starttime=None,
    endtime=datetime(2022, 9, 9),
    # Specify a title if you do not want the default title
    title='your custom title',
    # Add legend to plot?, by default true
    legend=True,
    # Plot observations that are labeled as outliers.
    show_outliers=True,
)
[13]:
<Axes: title={'center': 'your custom title'}, xlabel='datetime', ylabel='humidity (%)'>
[Figure: timeseries plot of the humidity observations for the selected stations, titled 'your custom title']

As mentioned above, the same methods can be applied to a Station object:

[14]:
favorite_station.make_plot(colorby='label')
[14]:
<Axes: title={'center': 'Temperatuur of vlinder02'}, xlabel='datetime', ylabel='temp (Celsius)'>
[Figure: timeseries plot of the temperature observations of vlinder02]

Resampling the time resolution#

Coarsening the time resolution (i.e. lowering the frequency) of your data can be done with the coarsen_time_resolution() method.

[16]:
your_dataset.coarsen_time_resolution(freq='30min') #'30min' means 30 minutes

your_dataset.df.head()
[16]:
                                     temp  humidity  wind_speed  wind_direction
name      datetime
vlinder01 2022-09-01 00:00:00+00:00  18.8        65    1.555556              65
          2022-09-01 00:30:00+00:00  18.7        65    1.500000              85
          2022-09-01 01:00:00+00:00  18.4        65    1.416667              55
          2022-09-01 01:30:00+00:00  18.0        65    1.972222              55
          2022-09-01 02:00:00+00:00  17.1        68    1.583333              45
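
In the output above, the 00:00 row is identical to the 5-minute data shown earlier, which suggests that records are selected on the coarser time grid rather than averaged. The sketch below illustrates that idea with the standard library on a synthetic series. It is a simplification and not the toolkit's implementation: it assumes the grid starts at the first timestamp and that a record exists exactly on every grid point, whereas the toolkit may match records differently. The function name coarsen is hypothetical.

```python
from datetime import datetime, timedelta

# Synthetic 5-minute series covering two hours: timestamp -> value.
start = datetime(2022, 9, 1)
series = {start + i * timedelta(minutes=5): float(i) for i in range(24)}


def coarsen(series, target):
    """Keep only the records that fall on a grid with spacing `target`.

    The grid origin is the earliest timestamp in the series.
    """
    origin = min(series)
    return {
        timestamp: value
        for timestamp, value in series.items()
        if (timestamp - origin) % target == timedelta(0)
    }


coarse = coarsen(series, timedelta(minutes=30))
for timestamp, value in coarse.items():
    print(timestamp, value)
```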

Introduction exercise#

For a more detailed reference, you can use this introduction exercise, which was created in the context of the COST FAIRNESS summer school 2023 in Ghent.