Using irregular timestamp datasets
Some datasets have observations at irregular time intervals, which brings some extra challenges. This section explains how to deal with them.
A common problem is that most observations are lost on import and many missing observations (and gaps) are introduced. This is because the toolkit assumes that each station reports at a constant frequency, i.e. a perfectly regular timestamp series. Observations that do not fall on that frequency are ignored and thus lost. In addition, the toolkit looks for observations at perfectly regular time intervals, so whenever an expected timestamp is not present, it is assumed to be missing.
To avoid these problems you can synchronize your observations. Synchronizing converts your irregular dataset to a regular one, and an easy origin is chosen if possible (the origin is the first timestamp of your dataset). The conversion is performed by shifting the timestamps of observations onto a regular grid. For example, if a frequency of 5 minutes is assumed and an observation has a timestamp at 54 minutes and 47 seconds, the timestamp is shifted to 55 minutes. To avoid observations being shifted too far, a maximal threshold must be set. This threshold is called the tolerance, and it defines the maximal time-translation error allowed for a single observation timestamp.
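To illustrate the idea, here is a minimal sketch of such a shift in plain pandas. This is not the toolkit's internal implementation; the frequency, tolerance, and timestamp values are assumptions chosen to mirror the example above.

import pandas as pd

freq = '5min'                     # assumed target frequency
tolerance = pd.Timedelta('3min')  # assumed maximal allowed shift

# An irregular timestamp at 54 minutes and 47 seconds
ts = pd.Timestamp('2021-02-27 08:54:47')

# Shift to the nearest point on the regular 5-minute grid
shifted = ts.round(freq)

# Accept the shift only if it stays within the tolerance
if abs(shifted - ts) <= tolerance:
    print(f'{ts} --> {shifted}')  # ... 08:54:47 --> ... 08:55:00
else:
    print(f'{ts} cannot be synchronized within the tolerance')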
Synchronizing your observations is performed with the sync_observations() method, to which you must provide a tolerance as an argument.
Example
Let’s take an example dataset with Netatmo(*) data. These data are known for their irregular timestamps; on average, the time resolution is 5 minutes. The data file contains 4320 observational records, but when we import it into the toolkit, only 87 observational records remain:
(*) Netatmo is a commercial company that sells automatic weather stations for personal use.
# Code illustration
import metobs_toolkit

# Initialize dataset
dataset = metobs_toolkit.Dataset()

# Specify paths
dataset.update_settings(
    input_data_file=' .. path to netatmo data ..',
    data_template_file=' .. template file .. ',
)

# Import the data
dataset.import_data_from_file()
print(dataset)
Dataset instance containing:
*1 stations
*['temp', 'humidity'] observation types
*87 observation records
*0 records labeled as outliers
*85 gaps
*0 missing observations
*records range: 2021-02-27 08:56:22+00:00 --> 2021-03-13 18:45:56+00:00 (total duration: 14 days 09:49:34)
The toolkit assumes a certain frequency for each station. This can be found in the .metadf attribute:
print(dataset.metadf['dataset_resolution'])
name
netatmo_station 0 days 00:05:00
Name: dataset_resolution, dtype: timedelta64[ns]
We can synchronize the dataset using this code example:
# Code illustration
import metobs_toolkit

# Initialize dataset
dataset = metobs_toolkit.Dataset()

# Specify paths
dataset.update_settings(
    input_data_file=' .. path to netatmo data ..',
    data_template_file=' .. template file .. ',
)

# Import the data
dataset.import_data_from_file()

# Synchronize the data with a tolerance of 3 minutes
dataset.sync_observations(tolerance='3min')
print(dataset)
Dataset instance containing:
*1 stations
*['temp', 'humidity'] observation types
*4059 observation records
*938 records labeled as outliers
*0 gaps
*92 missing observations
*records range: 2021-02-27 08:55:00+00:00 --> 2021-03-13 18:45:00+00:00 (total duration: 14 days 09:50:00)
# Note: the frequency is not changed
print(dataset.metadf['dataset_resolution'])
name
netatmo_station 0 days 00:05:00
Name: dataset_resolution, dtype: timedelta64[ns]
The sync_observations() method can also be used to synchronize the time series of multiple stations. In that case, the method works by grouping stations with similar resolutions, finding an origin that works for all stations in each group, and creating a regular time series for them.
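As a rough illustration of that grouping logic, here is a hedged sketch in plain pandas. The station names, resolutions, and first timestamps are hypothetical, and this is not the toolkit's internal code.

import pandas as pd

# Hypothetical estimated resolution and first timestamp per station
stations = {
    'station_A': {'resolution': '5min', 'first_ts': pd.Timestamp('2021-02-27 08:56:22')},
    'station_B': {'resolution': '5min', 'first_ts': pd.Timestamp('2021-02-27 08:54:10')},
    'station_C': {'resolution': '10min', 'first_ts': pd.Timestamp('2021-02-27 09:03:40')},
}

# Group stations that share the same resolution
groups = {}
for name, info in stations.items():
    groups.setdefault(info['resolution'], []).append(name)

# For each group, pick a common origin that works for all its stations:
# here, the earliest first timestamp floored to the group's resolution
for resolution, members in groups.items():
    earliest = min(stations[name]['first_ts'] for name in members)
    origin = earliest.floor(resolution)
    print(f'{members}: resolution={resolution}, origin={origin}')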