I’m working on a master’s project involving multivariate time series analysis where I’m dealing with missing data in an hourly dataset. My goal is to perform data imputation using external variables to train the model.
Project Overview:
Data Type: Hourly multivariate time series.
Main Challenge: Imputing missing data.
Objective: To create a machine learning model that accurately fills these gaps using external variables.
Seeking Advice on:
Effective imputation methods for multivariate time series.
Experiences or case studies on similar datasets and how missing data was handled.
Recommended Python tools or libraries for this type of analysis.
Background:
I have a solid foundation in machine learning, but this is my first project dealing with time series imputation. I appreciate any guidance, suggestions, or resources you could share.
Could you elaborate on what kind of external variables you're planning to use to fill in the missing data? Also, when you say hourly multivariate time series, could you mention what the data pertains to? That would help with further responses.
Hi, sure!
So, basically, my problem comes from sensors failing to record data (battery problems, etc.); the domain is agriculture. I created a main dataset as the union of three tables: table1 (with some missing values in soil moisture and soil temperature, which I want to fill with imputed data), plus table2 (irrigation) and table3 (Open-Meteo data) as exogenous input variables.
The missing values cover a few stretches of the timeline (hourly data for 2023).
I don’t really see this as a machine learning problem.
Once you have data, you create a model, then you use the model to create predictions.
This only fits the classic machine learning concept if you change the problem so you use everything you have as inputs, and then use the field with the missing data as the output you’re trying to predict.
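To make that reframing concrete, here is a minimal sketch, assuming pandas and scikit-learn are available. The data is synthetic and every column name (air_temp, irrigation_mm, soil_temp) is a hypothetical stand-in for the merged table; the random forest is just one possible regressor, not a recommendation from this thread.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic hourly data standing in for the merged table (hypothetical columns).
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=500, freq="h")
hour = idx.hour.to_numpy()

df = pd.DataFrame(
    {
        "air_temp": 15 + 8 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 1, len(idx)),
        "irrigation_mm": rng.choice([0.0, 5.0], size=len(idx), p=[0.9, 0.1]),
    },
    index=idx,
)
df["soil_temp"] = (
    5 + 0.6 * df["air_temp"] + 0.1 * df["irrigation_mm"] + rng.normal(0, 0.3, len(idx))
)

# Keep the true values so the imputation can be scored, then simulate a sensor outage.
truth = df["soil_temp"].copy()
df.loc[idx[200:230], "soil_temp"] = np.nan

features = ["air_temp", "irrigation_mm"]  # exogenous inputs
target = "soil_temp"                      # field with the missing data
known = df[target].notna()

# Train only on rows where the target was recorded.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df.loc[known, features], df.loc[known, target])

# Predict the target only where it is missing; recorded values stay untouched.
filled = df[target].copy()
filled.loc[~known] = model.predict(df.loc[~known, features])

# Mean absolute error on the simulated gap (only possible because the gap is artificial).
mae = (filled[~known] - truth[~known]).abs().mean()
print(f"Imputed {int((~known).sum())} values, MAE on the simulated gap: {mae:.2f}")
```

On real data the true values in a gap are unknown, so one way to estimate the error is to hide some recorded stretches on purpose, as done here, and score the model on those.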
Have you checked the magnitude of your missing data, to know whether it would affect your other variables? It all depends on how much data you have in the whole/main dataset.
Say the missing values number in the tens and your main dataset has 100,000 rows: you could probably ignore them. If not, you could remove the rows from table 2 and table 3 that correspond to the timestamps in 2023 where table 1 has missing values (but this again will depend on the dataset).
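A quick way to check that magnitude is to compute the share of missing hours and the length of each gap. This is a minimal sketch with pandas on a toy series; the name soil_moisture and the gap positions are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy hourly series with two artificial gaps (hypothetical data).
idx = pd.date_range("2023-01-01", periods=48, freq="h")
s = pd.Series(np.arange(48, dtype=float), index=idx, name="soil_moisture")
s.iloc[5:8] = np.nan    # 3-hour gap
s.iloc[20:30] = np.nan  # 10-hour gap

missing = s.isna()

# Fraction of hours with no reading.
missing_share = missing.mean()

# Give every run of consecutive equal values (missing or not) its own id,
# then count the length of the runs that are missing.
run_id = (missing != missing.shift()).cumsum()
gap_lengths = missing[missing].groupby(run_id[missing]).size()

print(f"Missing share: {missing_share:.1%}")
print("Gap lengths in hours:", gap_lengths.tolist())
```

The gap lengths matter as much as the overall share: many one-hour gaps and one gap of several days can have the same share but call for different handling.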
Also, are you not planning to analyze seasonality? Agriculture data would surely show seasonality.
I found a few time series analysis examples using Python; sharing a link that might help you.
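As a rough first look at seasonality, one option is to average the series by hour of day and by month, and to encode the cycles as sine/cosine features for a model. This is only a sketch on synthetic data, assuming pandas; the series and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic hourly soil temperature for 2023 with a yearly and a daily cycle (hypothetical).
idx = pd.date_range("2023-01-01 00:00", "2023-12-31 23:00", freq="h")
rng = np.random.default_rng(1)
hour = idx.hour.to_numpy()
doy = idx.dayofyear.to_numpy()

soil_temp = pd.Series(
    12
    + 8 * np.sin(2 * np.pi * (doy - 110) / 365)   # yearly cycle, warmest in summer
    + 2 * np.sin(2 * np.pi * (hour - 9) / 24)     # daily cycle
    + rng.normal(0, 0.5, len(idx)),
    index=idx,
)

# Average profile per hour of day and per month: clear patterns here suggest seasonality.
hourly_profile = soil_temp.groupby(soil_temp.index.hour).mean()
monthly_profile = soil_temp.groupby(soil_temp.index.month).mean()

# Cyclic encodings so that hour 23 sits next to hour 0, and late December next to early January.
season_features = pd.DataFrame(
    {
        "hour_sin": np.sin(2 * np.pi * hour / 24),
        "hour_cos": np.cos(2 * np.pi * hour / 24),
        "doy_sin": np.sin(2 * np.pi * doy / 365),
        "doy_cos": np.cos(2 * np.pi * doy / 365),
    },
    index=idx,
)

print(monthly_profile.round(1))
```

Features like these could be added next to the irrigation and weather columns when training an imputation model.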