Demo example: Using a Dataset#
This is an introduction to get you started with the MetObs toolkit. The examples use the demo data files that come with the toolkit. Once the MetObs toolkit package is installed, you can import its functionality with:
[1]:
import metobs_toolkit
The Dataset#
A dataset is a collection of all observational data. Most of the methods are applied directly to a dataset. Start by creating an empty Dataset object:
[2]:
your_dataset = metobs_toolkit.Dataset()
The most relevant attributes of a Dataset are:
* .df –> a pandas DataFrame where all the observational data are stored
* .metadf –> a pandas DataFrame where all the metadata for each station are stored
* .settings –> a Settings object to store all specific settings
* .missing_obs and .gaps –> here the missing records and gaps are stored if present
Note that each Dataset will be equipped with the default settings.
We created a dataset and stored it under the variable ‘your_dataset’. The show method prints out an overview of the data in the dataset:
[3]:
your_dataset.show() # or .get_info()
-------- General ---------
Empty instance of a Dataset.
-------- Observation types ---------
temp observation with:
* standard unit: Celsius
* data column as None in None
* known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
* description: 2m - temperature
* conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
* originates from data column: None with None as native unit.
humidity observation with:
* standard unit: %
* data column as None in None
* known units and aliases: {'%': ['percent', 'percentage']}
* description: 2m - relative humidity
* conversions to known units: {}
* originates from data column: None with None as native unit.
radiation_temp observation with:
* standard unit: Celsius
* data column as None in None
* known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
* description: 2m - Black globe
* conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
* originates from data column: None with None as native unit.
pressure observation with:
* standard unit: pa
* data column as None in None
* known units and aliases: {'pa': ['Pascal', 'pascal', 'Pa'], 'hpa': ['hecto pascal', 'hPa'], 'psi': ['Psi'], 'bar': ['Bar']}
* description: atmospheric pressure (at station)
* conversions to known units: {'hpa': ['x * 100'], 'psi': ['x * 6894.7573'], 'bar': ['x * 100000.']}
* originates from data column: None with None as native unit.
pressure_at_sea_level observation with:
* standard unit: pa
* data column as None in None
* known units and aliases: {'pa': ['Pascal', 'pascal', 'Pa'], 'hpa': ['hecto pascal', 'hPa'], 'psi': ['Psi'], 'bar': ['Bar']}
* description: atmospheric pressure (at sea level)
* conversions to known units: {'hpa': ['x * 100'], 'psi': ['x * 6894.7573'], 'bar': ['x * 100000.']}
* originates from data column: None with None as native unit.
precip observation with:
* standard unit: mm/m²
* data column as None in None
* known units and aliases: {'mm/m²': ['mm', 'liter', 'liters', 'l/m²', 'milimeter']}
* description: precipitation intensity
* conversions to known units: {}
* originates from data column: None with None as native unit.
precip_sum observation with:
* standard unit: mm/m²
* data column as None in None
* known units and aliases: {'mm/m²': ['mm', 'liter', 'liters', 'l/m²', 'milimeter']}
* description: Cummulated precipitation
* conversions to known units: {}
* originates from data column: None with None as native unit.
wind_speed observation with:
* standard unit: m/s
* data column as None in None
* known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
* description: wind speed
* conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
* originates from data column: None with None as native unit.
wind_gust observation with:
* standard unit: m/s
* data column as None in None
* known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
* description: wind gust
* conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
* originates from data column: None with None as native unit.
wind_direction observation with:
* standard unit: ° from north (CW)
* data column as None in None
* known units and aliases: {'° from north (CW)': ['°', 'degrees']}
* description: wind direction
* conversions to known units: {}
* originates from data column: None with None as native unit.
-------- Settings ---------
(to show all settings use the .show_settings() method, or set show_all_settings = True)
-------- Outliers ---------
No outliers.
-------- Meta data ---------
No metadata is found.
TIP: to get an extensive overview of an object, call the .show() method on it.
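The `conversions to known units` entries in the overview above can be read as short chains of expressions that bring a value to the standard unit. The sketch below is a plain-Python illustration of that reading, not the toolkit's implementation. The helper name `to_standard_unit` and the use of callables instead of expression strings are assumptions made for this example.

```python
# Conversion chains copied from the overview above, written as callables
# instead of expression strings (an assumption made for this sketch).
TEMP_CONVERSIONS = {
    "Kelvin": [lambda x: x - 273.15],
    "Farenheit": [lambda x: x - 32.0, lambda x: x / 1.8],  # spelled as in the toolkit output
}
WIND_CONVERSIONS = {
    "km/h": [lambda x: x / 3.6],
    "mph": [lambda x: x * 0.44704],
}


def to_standard_unit(value, unit, conversions):
    """Apply each step of the chain for `unit` in order (hypothetical helper)."""
    # Units without a chain are returned unchanged.
    for step in conversions.get(unit, []):
        value = step(value)
    return value


temp_c = to_standard_unit(50.0, "Farenheit", TEMP_CONVERSIONS)
wind_ms = to_standard_unit(5.6, "km/h", WIND_CONVERSIONS)
print(round(temp_c, 6), round(wind_ms, 6))
```

Read this way, a raw wind speed of 5.6 km/h corresponds to roughly 1.56 m/s in the standard unit.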
Importing data#
To import your data into a Dataset, the following files are required:
data file: This is the CSV file containing the observations
(optional) metadata file: The CSV file containing metadata for all stations.
template file: This is a JSON file that is used to interpret your data file and your metadata file (if present).
In practice, you need to start by creating a template file for your data. More information on the creation of the template can be found in the documentation (under Mapping to the toolkit).
TIP: Use the ``build_template_prompt()`` of the toolkit for creating a template file.
[4]:
# metobs_toolkit.build_template_prompt()
To import data, you must specify the paths to your data, metadata and template. For this example, we use the demo data, metadata and template that come with the toolkit.
[5]:
your_dataset.update_settings(
    input_data_file=metobs_toolkit.demo_datafile,  # path to the data file
    input_metadata_file=metobs_toolkit.demo_metadatafile,  # path to the metadata file
    template_file=metobs_toolkit.demo_template,  # path to the template file
)
The settings of your Dataset are updated with the required paths. Now the data can be imported into your empty Dataset:
[6]:
your_dataset.import_data_from_file()
Inspecting the Template#
In practice, you need to start by creating a template file for your data. The role of the template is to translate your data and metadata files to the standard structure used by the toolkit. It thus describes how your raw data is structured. More information on the creation of the template can be found in the documentation (under Mapping to the toolkit).
As an illustration, you can use the show() method on the template attribute to print out an overview of the template:
[7]:
your_dataset.template.show()
------ Data obstypes map ---------
* temp <---> Temperatuur
(raw data in Celsius)
(description: 2mT passive)
* humidity <---> Vochtigheid
(raw data in %)
(description: 2m relative humidity passive)
* wind_speed <---> Windsnelheid
(raw data in km/h)
(description: Average 2m 10-min windspeed)
* wind_direction <---> Windrichting
(raw data in ° from north (CW))
(description: Average 2m 10-min windspeed)
------ Data extra mapping info ---------
* name column (data) <---> Vlinder
------ Data timestamp map ---------
* datetimecolumn <---> None
* time_column <---> Tijd (UTC)
* date_column <---> Datum
* fmt <---> %Y-%m-%d %H:%M:%S
* Timezone <---> None
------ Metadata map ---------
* name <---> Vlinder
* lat <---> lat
* lon <---> lon
* school <---> school
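To make the role of the template concrete, the sketch below applies the same mapping by hand to one raw record. It is a plain-Python illustration, not the toolkit's import code. The column names and the timestamp format are taken from the template overview above; the record values are placeholders, the helper name `map_record` is hypothetical, and joining the date and time columns with a single space is an assumption. Unit conversion is left out.

```python
from datetime import datetime

# Column mapping as printed by template.show(): toolkit name <- raw column name.
OBSTYPE_COLUMNS = {
    "temp": "Temperatuur",
    "humidity": "Vochtigheid",
    "wind_speed": "Windsnelheid",
    "wind_direction": "Windrichting",
}
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def map_record(raw):
    """Translate one raw record to toolkit-style names (hypothetical helper)."""
    # Assumption: the date and time columns are joined with a single space.
    timestamp = datetime.strptime(f"{raw['Datum']} {raw['Tijd (UTC)']}", TIMESTAMP_FORMAT)
    record = {"name": raw["Vlinder"], "datetime": timestamp}
    for obstype, column in OBSTYPE_COLUMNS.items():
        record[obstype] = float(raw[column])
    return record


# Placeholder record; the values are made up for illustration.
raw_record = {
    "Vlinder": "vlinder01",
    "Datum": "2022-09-01",
    "Tijd (UTC)": "00:00:00",
    "Temperatuur": "18.8",
    "Vochtigheid": "65",
    "Windsnelheid": "5.6",
    "Windrichting": "65",
}
mapped = map_record(raw_record)
print(mapped)
```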
Inspecting the Data#
To get an overview of the data stored in your Dataset, you can use:
[8]:
your_dataset.show()
-------- General ---------
Dataset instance containing:
*28 stations
*['temp', 'humidity', 'wind_speed', 'wind_direction'] observation types
*120957 observation records
*0 records labeled as outliers
*0 gaps
*3 missing observations
*records range: 2022-09-01 00:00:00+00:00 --> 2022-09-15 23:55:00+00:00 (total duration: 14 days 23:55:00)
*time zone of the records: UTC
*Coordinates are available for all stations.
-------- Observation types ---------
temp observation with:
* standard unit: Celsius
* data column as Temperatuur in Celsius
* known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
* description: 2mT passive
* conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
* originates from data column: Temperatuur with Celsius as native unit.
humidity observation with:
* standard unit: %
* data column as Vochtigheid in %
* known units and aliases: {'%': ['percent', 'percentage']}
* description: 2m relative humidity passive
* conversions to known units: {}
* originates from data column: Vochtigheid with % as native unit.
wind_speed observation with:
* standard unit: m/s
* data column as Windsnelheid in km/h
* known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
* description: Average 2m 10-min windspeed
* conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
* originates from data column: Windsnelheid with km/h as native unit.
wind_direction observation with:
* standard unit: ° from north (CW)
* data column as Windrichting in ° from north (CW)
* known units and aliases: {'° from north (CW)': ['°', 'degrees']}
* description: Average 2m 10-min windspeed
* conversions to known units: {}
* originates from data column: Windrichting with ° from north (CW) as native unit.
-------- Settings ---------
(to show all settings use the .show_settings() method, or set show_all_settings = True)
-------- Outliers ---------
No outliers.
-------- Meta data ---------
The following metadata is found: ['lat', 'lon', 'school', 'geometry', 'assumed_import_frequency', 'dataset_resolution']
The first rows of the metadf looks like:
lat lon school geometry \
name
vlinder01 50.980438 3.815763 UGent POINT (3.81576 50.98044)
vlinder02 51.022379 3.709695 UGent POINT (3.70969 51.02238)
vlinder03 51.324583 4.952109 Heilig Graf POINT (4.95211 51.32458)
vlinder04 51.335522 4.934732 Heilig Graf POINT (4.93473 51.33552)
vlinder05 51.052655 3.675183 Sint-Barbara POINT (3.67518 51.05266)
assumed_import_frequency dataset_resolution
name
vlinder01 0 days 00:05:00 0 days 00:05:00
vlinder02 0 days 00:05:00 0 days 00:05:00
vlinder03 0 days 00:05:00 0 days 00:05:00
vlinder04 0 days 00:05:00 0 days 00:05:00
vlinder05 0 days 00:05:00 0 days 00:05:00
-------- Missing observations info --------
(Note: missing observations are defined on the frequency estimation of the native dataset.)
* 3 missing observations
name
vlinder02 2022-09-10 17:10:00+00:00
vlinder02 2022-09-10 17:15:00+00:00
vlinder02 2022-09-10 17:45:00+00:00
Name: datetime, dtype: datetime64[ns, UTC]
* For these stations: ['vlinder02']
* The missing observations are not filled.
(More details on the missing observation can be found in the .series and .fill_df attributes.)
None
-------- Gaps ---------
There are no gaps.
None
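The counts in this overview can be cross-checked by hand. The sketch below assumes that every station reports at the 5-minute resolution listed in the metadata and that the 3 missing observations are the only absent records. Under those assumptions it reproduces the 120957 observation records reported above.

```python
from datetime import datetime, timedelta, timezone

# Records range and resolution as reported in the overview.
start = datetime(2022, 9, 1, 0, 0, tzinfo=timezone.utc)
end = datetime(2022, 9, 15, 23, 55, tzinfo=timezone.utc)
resolution = timedelta(minutes=5)

# Number of timestamps on a regular 5-minute grid, both endpoints included.
expected_per_station = (end - start) // resolution + 1

n_stations = 28
n_missing = 3
expected_records = n_stations * expected_per_station - n_missing

print(expected_per_station)  # timestamps per station
print(expected_records)      # records in the full dataset
```

The same reasoning gives 4320 - 3 = 4317 records for vlinder02, the only station with missing observations.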
If you want to inspect the data in your Dataset directly, you can take a look at the .df and .metadf attributes:
[9]:
print(your_dataset.df.head())
# equivalent for the metadata
print(your_dataset.metadf.head())
temp humidity wind_speed \
name datetime
vlinder01 2022-09-01 00:00:00+00:00 18.8 65 1.555556
2022-09-01 00:05:00+00:00 18.8 65 1.527778
2022-09-01 00:10:00+00:00 18.8 65 1.416667
2022-09-01 00:15:00+00:00 18.7 65 1.666667
2022-09-01 00:20:00+00:00 18.7 65 1.388889
wind_direction
name datetime
vlinder01 2022-09-01 00:00:00+00:00 65
2022-09-01 00:05:00+00:00 75
2022-09-01 00:10:00+00:00 75
2022-09-01 00:15:00+00:00 85
2022-09-01 00:20:00+00:00 65
lat lon school geometry \
name
vlinder01 50.980438 3.815763 UGent POINT (3.81576 50.98044)
vlinder02 51.022379 3.709695 UGent POINT (3.70969 51.02238)
vlinder03 51.324583 4.952109 Heilig Graf POINT (4.95211 51.32458)
vlinder04 51.335522 4.934732 Heilig Graf POINT (4.93473 51.33552)
vlinder05 51.052655 3.675183 Sint-Barbara POINT (3.67518 51.05266)
assumed_import_frequency dataset_resolution
name
vlinder01 0 days 00:05:00 0 days 00:05:00
vlinder02 0 days 00:05:00 0 days 00:05:00
vlinder03 0 days 00:05:00 0 days 00:05:00
vlinder04 0 days 00:05:00 0 days 00:05:00
vlinder05 0 days 00:05:00 0 days 00:05:00
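The printed `.df` is a regular pandas DataFrame with a two-level index (`name`, `datetime`), so standard pandas indexing applies. The sketch below rebuilds the first temperature and humidity rows shown above as a stand-alone DataFrame (it does not use the toolkit) and selects data by station and by observation type. The variable names are chosen for this example.

```python
import pandas as pd

# Rebuild the first five records of vlinder01 as printed above.
timestamps = pd.date_range("2022-09-01 00:00", periods=5, freq="5min", tz="UTC")
index = pd.MultiIndex.from_product(
    [["vlinder01"], timestamps], names=["name", "datetime"]
)
df = pd.DataFrame(
    {
        "temp": [18.8, 18.8, 18.8, 18.7, 18.7],
        "humidity": [65, 65, 65, 65, 65],
    },
    index=index,
)

# All records of one station: the result is indexed by datetime only.
station_df = df.xs("vlinder01", level="name")

# One observation type: a Series that keeps the full (name, datetime) index.
temp_series = df["temp"]

print(station_df)
print(temp_series.groupby(level="name").mean())
```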
Inspecting a Station#
If you are interested in one station, you can extract all the info for that one station from the dataset by:
[10]:
favorite_station = your_dataset.get_station(stationname="vlinder02")
The variable favorite_station now contains all the information of that one station. All methods that are applicable to a Dataset are also applicable to a Station. So to inspect your favorite station, you can:
[11]:
print(favorite_station.show())  # show() prints the overview itself; the outer print() adds the final "None"
-------- General ---------
Station instance containing:
*1 stations
*['temp', 'humidity', 'wind_speed', 'wind_direction'] observation types
*4317 observation records
*0 records labeled as outliers
*0 gaps
*3 missing observations
*records range: 2022-09-01 00:00:00+00:00 --> 2022-09-15 23:55:00+00:00 (total duration: 14 days 23:55:00)
*time zone of the records: UTC
*Coordinates are available for all stations.
-------- Observation types ---------
temp observation with:
* standard unit: Celsius
* data column as Temperatuur in Celsius
* known units and aliases: {'Celsius': ['celsius', '°C', '°c', 'celcius', 'Celcius'], 'Kelvin': ['K', 'kelvin'], 'Farenheit': ['farenheit']}
* description: 2mT passive
* conversions to known units: {'Kelvin': ['x - 273.15'], 'Farenheit': ['x-32.0', 'x/1.8']}
* originates from data column: Temperatuur with Celsius as native unit.
humidity observation with:
* standard unit: %
* data column as Vochtigheid in %
* known units and aliases: {'%': ['percent', 'percentage']}
* description: 2m relative humidity passive
* conversions to known units: {}
* originates from data column: Vochtigheid with % as native unit.
wind_speed observation with:
* standard unit: m/s
* data column as Windsnelheid in km/h
* known units and aliases: {'m/s': ['meters/second', 'm/sec'], 'km/h': ['kilometers/hour', 'kph'], 'mph': ['miles/hour']}
* description: Average 2m 10-min windspeed
* conversions to known units: {'km/h': ['x / 3.6'], 'mph': ['x * 0.44704']}
* originates from data column: Windsnelheid with km/h as native unit.
wind_direction observation with:
* standard unit: ° from north (CW)
* data column as Windrichting in ° from north (CW)
* known units and aliases: {'° from north (CW)': ['°', 'degrees']}
* description: Average 2m 10-min windspeed
* conversions to known units: {}
* originates from data column: Windrichting with ° from north (CW) as native unit.
-------- Settings ---------
(to show all settings use the .show_settings() method, or set show_all_settings = True)
-------- Outliers ---------
No outliers.
-------- Meta data ---------
The following metadata is found: ['lat', 'lon', 'school', 'geometry', 'assumed_import_frequency', 'dataset_resolution']
The first rows of the metadf looks like:
lat lon school geometry \
name
vlinder02 51.022379 3.709695 UGent POINT (3.709695 51.022379)
assumed_import_frequency dataset_resolution
name
vlinder02 0 days 00:05:00 0 days 00:05:00
-------- Missing observations info --------
(Note: missing observations are defined on the frequency estimation of the native dataset.)
* 3 missing observations
name
vlinder02 2022-09-10 17:10:00+00:00
vlinder02 2022-09-10 17:15:00+00:00
vlinder02 2022-09-10 17:45:00+00:00
Name: datetime, dtype: datetime64[ns, UTC]
* For these stations: ['vlinder02']
* The missing observations are not filled.
(More details on the missing observation can be found in the .series and .fill_df attributes.)
None
-------- Gaps ---------
There are no gaps.
None
None
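The overview notes that missing observations are defined relative to the estimated frequency of the dataset. One way to picture this is as timestamps that are expected on the 5-minute grid but absent from the records. The sketch below illustrates that idea in plain Python. It is not the toolkit's detection code, the helper name `find_missing` is hypothetical, and the record timestamps are simulated from the three missing timestamps reported above.

```python
from datetime import datetime, timedelta, timezone


def find_missing(present, start, end, freq):
    """Return grid timestamps from start to end (inclusive) that are not in `present`."""
    present = set(present)
    missing = []
    ts = start
    while ts <= end:
        if ts not in present:
            missing.append(ts)
        ts += freq
    return missing


freq = timedelta(minutes=5)
start = datetime(2022, 9, 10, 17, 0, tzinfo=timezone.utc)
end = datetime(2022, 9, 10, 18, 0, tzinfo=timezone.utc)

# The three timestamps reported as missing for vlinder02 in the overview above.
reported_missing = {start + timedelta(minutes=m) for m in (10, 15, 45)}
# Simulated record timestamps: the full one-hour grid without the reported ones.
present = [start + i * freq for i in range(13) if start + i * freq not in reported_missing]

missing = find_missing(present, start, end, freq)
print([ts.strftime("%H:%M") for ts in missing])
```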
Making timeseries plots#
To make timeseries plots, use the following syntax to plot the temperature observations of the full Dataset:
[12]:
your_dataset.make_plot(obstype='temp')
[12]:
<Axes: title={'center': 'Temperatuur for all stations. '}, ylabel='temp (Celsius)'>
See the documentation of the make_plot() method for more details. Here is an example with commonly used arguments:
[13]:
# Import the standard datetime library to make timestamps from datetime objects
from datetime import datetime

your_dataset.make_plot(
    # specify the names of the stations in a list, or use None to plot all of them.
    stationnames=['vlinder01', 'vlinder03', 'vlinder05'],
    # what obstype to plot (default is 'temp')
    obstype="humidity",
    # choose how to color the timeseries:
    # 'name' : a specific color per station
    # 'label': a specific color per quality control label
    colorby="label",
    # choose a start and endtime for the series (datetime).
    # Default is None, which uses all available data
    starttime=None,
    endtime=datetime(2022, 9, 9),
    # Specify a title if you do not want the default title
    title='your custom title',
    # Add a legend to the plot? True by default
    legend=True,
    # Plot observations that are labeled as outliers.
    show_outliers=True,
)
[13]:
<Axes: title={'center': 'your custom title'}, xlabel='datetime', ylabel='humidity (%)'>
As mentioned above, you can apply the same methods to a Station object:
[14]:
favorite_station.make_plot(colorby='label')
[14]:
<Axes: title={'center': 'Temperatuur of vlinder02'}, xlabel='datetime', ylabel='temp (Celsius)'>
Resampling the time resolution#
Coarsening the time resolution (i.e. the frequency) of your data can be done with the coarsen_time_resolution() method.
[16]:
your_dataset.coarsen_time_resolution(freq='30min') #'30min' means 30 minutes
your_dataset.df.head()
[16]:
name | datetime | temp | humidity | wind_speed | wind_direction
---|---|---|---|---|---
vlinder01 | 2022-09-01 00:00:00+00:00 | 18.8 | 65 | 1.555556 | 65
vlinder01 | 2022-09-01 00:30:00+00:00 | 18.7 | 65 | 1.500000 | 85
vlinder01 | 2022-09-01 01:00:00+00:00 | 18.4 | 65 | 1.416667 | 55
vlinder01 | 2022-09-01 01:30:00+00:00 | 18.0 | 65 | 1.972222 | 55
vlinder01 | 2022-09-01 02:00:00+00:00 | 17.1 | 68 | 1.583333 | 45
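The first coarsened row in the table above equals the original 00:00 record, which suggests that records on the target grid are selected rather than averaged. The sketch below illustrates that idea in plain Python by keeping only the timestamps that lie on a 30-minute grid. It is an assumption about the behaviour, not the toolkit's implementation (which may, for example, tolerate slightly shifted timestamps). The helper name `coarsen` is hypothetical and the values are placeholders.

```python
from datetime import datetime, timedelta, timezone


def coarsen(records, freq, origin):
    """Keep only (timestamp, value) pairs lying on the grid origin + k * freq."""
    return [(ts, value) for ts, value in records if (ts - origin) % freq == timedelta(0)]


origin = datetime(2022, 9, 1, tzinfo=timezone.utc)
# One hour of 5-minute records; each placeholder value is the record's position.
records = [(origin + timedelta(minutes=5 * i), i) for i in range(12)]

coarse = coarsen(records, timedelta(minutes=30), origin)
for ts, value in coarse:
    print(ts.isoformat(), value)
```

With this approach no values are modified: the coarsened series is a subset of the original records.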
Introduction exercise#
For a more detailed reference, you can use this introduction exercise, which was created in the context of the COST FAIRNESS summer school 2023 in Ghent.