Chapter 3 Data transformation
Data cleaning and processing was done separately in Python using pandas. The preprocessing scripts can be found here.
The main dataset used in this project was BP’s Statistical Review of World Energy. This data is a large Excel file containing 73 separate sheets of data on emissions, energy generation, energy consumption, and supplemental energy information. The energy data were split up by energy sources into different sheets. A large part of the preprocessing transformations involved selecting and extracting necessary information from specific sheets and combining the data into a single dataframe.
Furthermore, the BP dataset does not contain any geometric geographical information about countries. Thus, data from the Thematic Mapping World Borders Dataset was joined with the main dataframe to provide country metadata as well as polygon data for generating map plots. A major issue during this process was that country names were not identical between the BP and the World Borders dataset. For instance, the United States was listed as “US” in the BP dataset, but as “United States” in the World Borders dataset. Thus, data cleaning was conducted to match up country names and minimize data loss during the join operation.