(Image: Ironhack Logo)

Stranger things on Smart Energy

Íngrid Munné Collado

Data Analytics June 2019

Content

1. Project Description

We are more aware every day of the consequences of climate change. One objective in tackling climate change is to control and lower our electricity consumption. Moreover, we are trying to move away from fossil fuels and run only on renewable energy resources. For this reason, several initiatives have emerged.

Since 2009 there has been a plan in Europe to install Smart Meters in every household and factory to meter energy consumption in near real time, with time periods of 15 to 30 minutes. In this way, end-users can become much more aware of their energy consumption, and utilities can control fraud and provide personalized offers, tariffs and discounts to their portfolio.

Hence the idea of this project. Using data from the London Data Store, covering the smart meters of more than 5,500 households, we want to analyze their energy consumption and understand the behaviour of time-series variables such as electricity consumption. By means of this project we want to understand this kind of data, as well as answer some questions related to it.

2. Hypotheses / Questions

The main questions we want to answer with this project are:

  • Can our electricity consumption be forecast? We would like to answer this question by predicting the electricity consumption of a single household.

  • Can electricity consumption tell us something about ourselves as end-users? Can we be classified or clustered in terms of our electricity consumption?

  • Does the weather have any relationship with electricity consumption? If so, which parameters are the most strongly related?

All these questions are of interest when dealing with the integration of renewable energy sources and the provision of new services to end-users and electricity operators, such as local energy and flexibility services.

3. Dataset

In order to answer the questions presented above, as well as to test the hypotheses, data has been retrieved. The datasets used in this project come from different sources, namely Kaggle, the London Data Store and the Dark Sky API, covering the period between 2011 and 2014. The dataset structure is described below:

  • Smart Meters' measurements: This data has been downloaded directly from the London Data Store. The dataset is available in two different formats; it contains more than 167 million rows and has a total size of 11.5 GB. For this reason, the entire dataset has been split into blocks, so that it can be read and analysed on a normal computer. Link to the dataset.

  • Weather historical data: This dataset has been extracted from Kaggle.com, where the user Jean-Michel D. complemented the Smart Meters data with weather data for the same time period, retrieved from the Dark Sky API. Link to the dataset.

  • Acorn classification: Each end-user is classified according to their smart meter's measurements. This data also comes from the London Data Store; the classification is explained in the CACI Report linked below. CACI is a company specializing in Integrated Marketing, Location Planning Consultancy, Network Services and Technology Solutions. Link to the CACI Report.

The downloaded raw data has a total size of more than 11 GB. For this reason, it is better to first understand what this data describes and how it is structured, in order to know which files will be used for the project and which will not. In the folder 0.Data you will find a Codebook with tables explaining each .csv file and the features included in each one.

    Datasets that will be initially used for the project
    - information_households.csv
    - weather_hourly_darksky.csv
    - hhblock_dataset.zip : Block_12 
    - halfhourly_dataset.zip: Block_12 

4. Cleaning


The cleaning of the dataset has been carried out entirely in one Jupyter Notebook, named 0_Data_Cleaning.ipynb, as described in the Organization section. It is structured as follows:

  1. Dataset overview

    Object columns are transformed into int or float according to the feature being measured.

  2. Handling Missing Values

    When transforming object columns into datetime, we notice that there are missing values. Since the time resolution is 30 minutes, missing values at timescales below 30 minutes are not taken into account and these rows are dropped from the dataset.

  3. DateTime columns transformation

    When dealing with weather information or electricity consumption data, we work with DateTime objects and time periods. However, when importing the csv, this column is sometimes not parsed as a DateTime object, so it has to be transformed explicitly. Furthermore, we make sure that the DateTime column is sorted in descending order.

  4. Exporting tables into GoogleCloud

    A Google Cloud SQL database has been created, and the cleaned tables have been exported to it, so that the data is centralized and accessible through different connections.

This process has been repeated for each csv file used in the project; a minimal sketch of these steps is shown below.
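As a rough illustration of the four steps above, the sketch below uses pandas and SQLAlchemy. The file name, the column names ("tstp", "energy(kWh/hh)") and the connection string are assumptions for illustration, not the project's exact ones.

    import pandas as pd
    from sqlalchemy import create_engine

    # Load one block of the half-hourly dataset (file name is illustrative).
    df = pd.read_csv("raw_data/block_12.csv")

    # 1. Dataset overview: cast the consumption column from object to float;
    #    non-numeric entries such as "Null" become NaN.
    df["energy(kWh/hh)"] = pd.to_numeric(df["energy(kWh/hh)"], errors="coerce")

    # 2.-3. Missing values and DateTime transformation: parse timestamps,
    #    drop rows without a valid timestamp or reading, sort by time
    #    (descending, as in the notebooks).
    df["tstp"] = pd.to_datetime(df["tstp"], errors="coerce")
    df = df.dropna(subset=["tstp", "energy(kWh/hh)"]).sort_values("tstp", ascending=False)

    # 4. Export the cleaned table to a (hypothetical) Cloud SQL instance.
    engine = create_engine("postgresql://user:password@host:5432/energy")
    df.to_sql("halfhourly_block_12", engine, if_exists="replace", index=False)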

5. Analysis

Once the data has been cleaned, and since this project deals with time series, an important step before building an ML model is to understand what the time series looks like. This is also the idea behind the user dashboards, where end-users can see their energy consumption over different timeframes. When starting the project, the main hypotheses or assumptions were:

  1. There are different patterns in energy consumption depending on the time period.

  2. There are differences in energy consumption throughout the year.

(Figure: yearly_cons)

We can see that in the summer months consumption is lower than in other periods. However, this can be difficult to notice in a raw time series, so moving averages are calculated to reveal the trend of the consumption.
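As a minimal sketch, a moving average can be computed with pandas' rolling window; here "daily" is assumed to be a DataFrame indexed by date with a "consumption_kwh" column (illustrative names):

    # 30-day centered moving average to expose the yearly trend.
    daily["ma_30d"] = daily["consumption_kwh"].rolling(window=30, center=True).mean()
    daily[["consumption_kwh", "ma_30d"]].plot(figsize=(12, 4))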

  3. There are differences between users, but in general they follow the same patterns (high consumption during the day and low consumption at night).

The figure below shows the energy consumption of the user with ID MAC004280, where the pattern can be clearly seen.

(Figure: user_dashboard)

  4. There are two types of tariff contracted by end-users, and most users have the Standard tariff rather than the ToU (Time of Use) tariff.

(Figure: user_tariff)

6. Model Training and Evaluation

In this project, two ML models have been developed. First, according to the first objective, a time-series forecast model has been built using SARIMA. Second, a clustering of our clients' portfolio has been developed.

6.1. Time-Series Forecast

A time-series ML model has been developed using time-series decomposition and SARIMA models from the statsmodels library.

The initial step has been to analyse the autocorrelation, seasonality and stationarity of the data. To check stationarity, a Dickey-Fuller test has been performed: most time-series ML models require the data to be stationary before it can be modelled. The time-series signal has then been decomposed into three components: trend, seasonality and residuals. One of the most important challenges in time series is choosing the train/test split, as well as the granularity of the signal used to develop the model.
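A minimal sketch of both checks with statsmodels, assuming "series" is a daily consumption pd.Series with a DatetimeIndex and weekly seasonality (both assumptions):

    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Augmented Dickey-Fuller test: a small p-value rejects the unit-root
    # hypothesis, i.e. the series can be considered stationary.
    adf_stat, p_value, *_ = adfuller(series.dropna())
    print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")

    # Decompose the signal into trend, seasonality and residuals
    # (period=7 assumes weekly seasonality in daily data).
    decomposition = seasonal_decompose(series, model="additive", period=7)
    decomposition.plot()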

A SARIMA model has been used to model the data, working with daily data (total consumption per day) from 2013. The training set has been defined by four months of 2013 and the test set by the following two months (May and June). As can be seen in the image below, the forecast is not bad in the first month, but it performs poorly in the second month. For this reason, further research on time series should be done.

(Figure: SARIMA_Forecast)
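As a sketch, the fit and forecast could look as follows with statsmodels' SARIMAX; the (p, d, q) and seasonal orders are illustrative, not the tuned values:

    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Train on January-April 2013, test on May-June 2013.
    train = series[:"2013-04-30"]
    test = series["2013-05-01":"2013-06-30"]

    # Fit a SARIMA model with assumed weekly seasonality (s=7).
    model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
    result = model.fit(disp=False)

    # Forecast over the test horizon and retrieve confidence intervals.
    forecast = result.get_forecast(steps=len(test))
    predicted = forecast.predicted_mean
    conf_int = forecast.conf_int()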

6.2. Client Clustering

According to the second objective, a new tariff structure is developed in this project. Since we are reformulating the tariff structure and do not know in advance how many tariffs we would create, we treat this as an unsupervised learning (clustering) problem. Two models have been developed: K-Means and DBSCAN.

Thanks to the elbow method and the dendrogram, we have decided on two cases to study, with k=3 and k=6.

(Figure: elbowmethod)
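A minimal sketch of the elbow method with scikit-learn, assuming "X" is the matrix of per-user consumption statistics (an assumption):

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    # Inertia (within-cluster sum of squares) for k = 1..10; the "elbow"
    # in this curve suggests a reasonable number of clusters.
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in range(1, 11)]
    plt.plot(range(1, 11), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()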

According to the plots, the model with 3 clusters performs better than the model with 6 clusters: although the inertia keeps decreasing as k grows, the model does not really improve by increasing the number of clusters. We can also conclude that DBSCAN does not perform well here, since the clusters have different densities. With this in mind, the final model is a K-Means with k=3 clusters.

(Figure: mean_std_k3)

We can see that the model indeed yields distinct clusters and that more tariffs could be created, based on the basic statistics of the users.
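For completeness, a sketch of the final model, again assuming "X" holds per-user statistics and "clients" is the corresponding DataFrame (illustrative names):

    from sklearn.cluster import KMeans

    # Fit the final K-Means with k=3 and attach a cluster label per client.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    clients["cluster"] = labels

    # Per-cluster mean and standard deviation of consumption (column names
    # are illustrative), the basis for defining one tariff per cluster.
    print(clients.groupby("cluster")[["mean_kwh", "std_kwh"]].mean())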

7. Conclusion

Once the project has been completed, some conclusions can be extracted. This project has allowed me to learn more about Machine Learning and time series, and I have been able to create an end-to-end project based on data and energy, a topic that I truly enjoy.

A time-series model has been developed, but the overall accuracy and MSE show that its performance is not good yet. For this reason, further research should be done. Nevertheless, I have been able to appreciate how difficult time-series data can be.

Dashboards can be a nice tool for end-users, and also for utilities to increase their value as a company and provide better services to end-users, as well as to gather more information about them. The energy price could also be included, along with some statistics based on historical data.

Regarding the clustering, by means of this project we have defined a new tariff model based on consumption patterns, which can be the starting point for new business models around electric utilities. Furthermore, we have shown that unsupervised K-Means models can be applied to our data, and that 3 is a good number of clusters for our clients.

8. Future Work

Some questions could not be answered properly. The time-series forecast should be improved by learning more about the hyperparameters to tune in a SARIMA model. Furthermore, a neural network model could be implemented to forecast the energy consumption.

Regarding the client clustering, the tariff structure defined by K-Means is a good starting point for defining new and more personalized tariffs. However, it should be noted that the electricity consumption curve itself is not considered here. Hence, further work could take the demand curve into account as well as the basic statistics.

Last but not least, regarding the dashboard, a fully interactive dashboard could be developed using Plotly and Dash.

9. Workflow

The workflow of the project can be split into three main parts: data acquisition and cleaning, data storing, and machine learning models, always guided by the project's objectives. The figure below shows a workflow diagram highlighting the main steps of the project.

(Figure: Workflow)

10. Organization

Two main tools have been used to organize the project. Trello is a very useful tool to organize and manage all the tasks of the project. GitHub has been the second tool, used to maintain good coding practices, keep version control, and upload the whole project to the cloud. Last but not least, to make the project easy to navigate, the repository has been structured in folders as follows:

  1. Data:

    raw_data → Folder containing the raw data csv files coming from the sources mentioned in Section 3.

    cleaned_data → Folder containing the cleaned csv files obtained from 0_Data_Cleaning.ipynb, which will be used in the 1_Data_Analysis_ML_Model.ipynb notebook.

    This folder is not uploaded to the repository since it contains more than 11 GB of data. If you want to download the data, access to the Google Cloud database can be shared with you; please send me an e-mail at [email protected] or contact me on GitHub at @wobniarin.

  2. Jupyter Notebooks:

    1_Data_Cleaning.ipynb → Data wrangling, cleaning, and exporting to Google Cloud SQL.

    2_TimeSeries_Forecast.ipynb → SARIMA model and Dickey-Fuller test for different time periods.

    3_Dashboards_Clustering.ipynb → Dashboard development using Plotly, and clustering modelling by means of K-Means and DBSCAN.

  3. Resources:

    Folder with all the PDFs containing useful information related to the topic of this project, such as the client classification.

  4. Figures:

    Folder containing all the .png figures used for the presentation.

  5. README markdown file:

    File that you are currently reading, with all the basic information about the project that has been carried out.

  6. Codebook markdown file:

    Markdown file with a detailed explanation of all the databases used, including the meaning of each feature as well as its units of measurement.

Links

Repository
Slides
Trello
