Residential Energy Prediction

Overview

We visualize and predict residential energy usage across the US. Our data source was the Residential Energy Consumption Survey (RECS).

Using machine learning tools and techniques, we aim to predict:

  1. Total energy consumption in residential homes
  2. Total cost in dollars for the energy consumed

RECS data for 2001, 2009, and 2015 were used to train and test our ML models. Data extracted from the US EIA site (https://www.eia.gov/consumption/residential/index.php) was cleaned and merged into one CSV file.
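The cleaning-and-merging step can be sketched with pandas. This is a minimal illustration on small in-memory stand-in frames (the column names such as `DOEID` and `TOTALBTU` are illustrative, and the per-year frames stand in for the real EIA downloads): keep only the columns common to every survey year, tag each row with its survey year, and stack the rows into one file.

```python
import pandas as pd

# Stand-in frames for the per-year RECS extracts (the real data comes from
# the EIA site); each year's file has slightly different columns.
recs_2009 = pd.DataFrame({"DOEID": [1, 2], "TOTALBTU": [90e3, 60e3], "KWH": [11000, 7000]})
recs_2015 = pd.DataFrame({"DOEID": [3, 4], "TOTALBTU": [80e3, 70e3], "CDD30YR": [1400, 900]})

frames = {2009: recs_2009, 2015: recs_2015}
for year, df in frames.items():
    df["SURVEY_YEAR"] = year  # remember which survey each row came from

# Keep only the columns present in every survey year, then stack the rows
common = sorted(set.intersection(*(set(f.columns) for f in frames.values())))
merged = pd.concat([f[common] for f in frames.values()], ignore_index=True)
merged.to_csv("recs_merged.csv", index=False)
```

Restricting to the shared columns is one simple way to reconcile surveys whose variable sets changed between years; the real project may instead have mapped renamed variables onto each other.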

We used a combination of PCA and KMeans to study any obvious clustering or patterns. The results then fed into determining X (the predictors) for the models.

Feature Engineering: Using tools such as Feature Selector and PCA, we determined the number of features necessary for prediction. After identifying the important features, an array of ML modeling techniques was used for prediction.

The flow chart below explains the model development process: Prediction Process

Data Transformation and Merging

The RECS survey data consisted of 500+ variables covering housing characteristics, appliances used, fuel types, annual consumption, and cost of consumption.

Violin plots of the merged variables

PCA and KMeans clustering

Principal Component Analysis (PCA) is a dimension-reduction tool. Feeding the combined data set through PCA produces an elbow curve showing how many components are needed to explain most of the variance.

PCA elbow curve showing the components that explain most of the variance (above 95%): PCA Elbow curve

An example of PCA correlation - variance plot - PCA correlation plt

KMeans clustering

After exporting the PCA components, we re-import the new CSV for KMeans clustering. KMeans identifies clusters in the data; based on those clusters, you can decide which values to exclude to increase accuracy.
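The PCA-then-KMeans step can be sketched as below. The data here is synthetic (in the project it would be the exported PCA-components CSV), and the cluster count of 4 is illustrative rather than taken from the project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in; in the project this would be the re-imported PCA CSV
X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)
components = PCA(n_components=2).fit_transform(X)

# Cluster the low-dimensional components and read off each row's assignment
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(components)
labels = km.labels_  # one cluster label per row
```

Inspecting rows by `labels` is what lets you spot outlying clusters worth excluding.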

KMeans Clusters

Feature Selector

After data merging and transformation, we had 185 features, which were put through feature engineering to reduce the dimensionality.

Feature Selector uses five methods to identify features to remove:

Feature importance and cumulative gain plots from Feature Selector
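The cumulative-gain cutoff behind those plots can be approximated in plain scikit-learn, shown here on synthetic data as a stand-in for the 185-feature set (this mirrors the idea, not Feature Selector's exact API): rank features by tree-based importance and keep the smallest set whose cumulative normalized importance reaches 95%.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the 185-feature merged set
X, y = make_regression(n_samples=400, n_features=50, n_informative=8, random_state=0)

# Rank features by importance, then keep the smallest set whose
# cumulative (normalized) importance reaches 95% of the total gain
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]
cum_gain = np.cumsum(rf.feature_importances_[order])
keep = order[: int(np.argmax(cum_gain >= 0.95)) + 1]
```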

Linear Models

Most supervised learning starts with linear models, which provide a varied set of modeling techniques such as Ridge and Lasso.

To predict RECS price or consumption, we utilize these linear models along with GridSearchCV to fine-tune each model.
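A minimal sketch of that tuning loop, on synthetic data (the alpha grid is illustrative, not the project's actual search space):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the RECS predictors and target
X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=1)

# Tune each linear model's regularization strength with GridSearchCV
searches = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10000))]:
    grid = GridSearchCV(model, {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="r2")
    searches[name] = grid.fit(X, y)

best_ridge_alpha = searches["ridge"].best_params_["alpha"]
```

`best_params_` and `best_score_` on each fitted search give the tuned setting and its cross-validated R².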

Process we followed: Linear Modeling Process

Linear model results: Accuracy and Error results

Linear model residual plot: Accuracy and Error results

RandomForest Regressor

The Random Forest method builds an ensemble of decision trees and averages their individual predictions (regression) to predict Total Dollar.
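A minimal sketch of that averaging regressor on synthetic stand-in data (the hyperparameters are illustrative defaults, not the project's tuned values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the RECS features and the Total Dollar target
X, y = make_regression(n_samples=500, n_features=15, noise=5, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

# Each tree predicts independently; the forest's prediction is their mean
rf = RandomForestRegressor(n_estimators=200, random_state=2).fit(X_tr, y_tr)
score = r2_score(y_te, rf.predict(X_te))
```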

Process we followed: RF Modeling Process

RandomForest model results: Accuracy and Error results

RandomForest model residual plot: Accuracy and Error results

XGBoost Regressor

XGBoost ("Extreme Gradient Boosting") is an ensemble algorithm that can be used as a regressor or classifier. All the previous models yielded an R² of around 0.81, clearly indicating the presence of weak predictors. As XGBoost is known for its ability to build a strong model from weak predictors, we decided to use it for predicting total dollar and consumption.
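A minimal sketch of the boosted regressor on synthetic stand-in data; the hyperparameters are illustrative, and if the xgboost package is unavailable the sketch falls back to scikit-learn's GradientBoostingRegressor, which implements the same gradient-boosting idea.

```python
# Boosting builds trees sequentially, each one correcting the residual
# errors of the ensemble so far -- this is how weak predictors combine
# into a strong model.
try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor as Booster

from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the RECS features and target
X, y = make_regression(n_samples=500, n_features=15, noise=5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)

model = Booster(n_estimators=300, learning_rate=0.05, max_depth=3, random_state=3)
score = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
```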

Process we followed: XGB Modeling Process

XGBoost model results: Accuracy and Error results

XGBoost model regression error plot: Accuracy and Error results

Model Comparison

Using scikit-learn's cross_validate function, we performed a comparison of the 7 models, in both base and tuned versions.
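The comparison loop can be sketched as below, on synthetic data and with a representative subset of models rather than the project's full seven; running every model through the same folds and metrics is what makes the RMSE/R² comparison fair.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the model dict would hold all 7 base/tuned models
X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=4)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=4),
}

# Identical CV folds and metrics for every model keep the comparison fair
results = {
    name: cross_validate(m, X, y, cv=5,
                         scoring=("r2", "neg_root_mean_squared_error"))
    for name, m in models.items()
}
mean_r2 = {name: res["test_r2"].mean() for name, res in results.items()}
```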

Model performances are depicted below: Model Performances

Results of RMSE and R²: Models results

How to Run