Analysis of CO2 Emissions of Passenger Cars in E.U. Countries, Year 2015

Regulation (EC) No 443/2009 of the European Union requires all member states to report annually on the registrations of new passenger cars, including data such as manufacturer, commercial name, CO2 emissions, weight and fuel type. This is an analysis of the data collected in 2015, available at http://www.eea.europa.eu/data-and-maps/data/co2-cars-emission-11.

Overview and treatment of the data

The data is a tab-separated table file with 440,646 rows. There are several columns, but this analysis will focus on just a few of them.

Columns

Field Name, Field Definition:

  • MS: Member state

  • Mh: Manufacturer harmonised

  • Cn: Commercial name (varchar(120))

  • r: Total new registrations

  • m (kg): Mass (integer)

  • e (g/km): Specific CO2 emissions (integer)

  • Ft: Fuel type (varchar(120))

Python modules to be used and common functions

The following scripts mainly use pandas, numpy, matplotlib and bokeh.

Run the code in the following two cells even if you are only going to display the data.

To display data that has already been processed, you only need to run the output cells that follow the processing cells.

In [1]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bokeh.plotting import figure
from bokeh.layouts import row
from bokeh.plotting import ColumnDataSource
from bokeh.models import HoverTool
from bokeh.models import Span, Label
from bokeh.io import save
#bokeh.charts only exists in older Bokeh releases; it was later moved to the separate bkcharts package
from bokeh.charts import output_notebook, show, Bar, output_file, BoxPlot
In [2]:
import warnings
from IPython.core.display import display, HTML

#disable annoying warnings
warnings.filterwarnings('ignore')

#alternative to output_notebook which loads html from file 
def displayHTML(file):
    with open(file, 'r') as myfile:
        data=myfile.read()
        display(HTML(data))

Reducing the original dataset and converting to csv:

Optional, only do this if the '-less' dataset does not exist yet

In [ ]:
def topX(dataFrame, topx, column, ascendingOrder=False):
    #keep the first `topx` fraction of the rows, after sorting them by `column`
    ordered = dataFrame.sort_values(column, ascending=ascendingOrder)
    toRemain = int(dataFrame.shape[0] * topx)
    return ordered.head(toRemain)

data = pd.read_csv("../datasets/CO2_passenger_cars_v12.tsv", sep='\t', header=0)

nRows = data.shape[0] #get count of rows
print("Rows before: ")
print(nRows)

lessData = topX(data, 0.01, 'r')

print("Rows now: ")
print(lessData.shape[0])

lessData.to_csv("../datasets/CO2-passenger-cars-v12-less.csv")

Treating the data

(Optional, only do this if the '-treated' datasets do not exist yet)

First, setting the data type for the columns:

In [ ]:
#data = pd.read_csv("../datasets/CO2_passenger_cars_v12.tsv", sep='\t', header=0)#treating the original data
data = pd.read_csv("../datasets/CO2-passenger-cars-v12-less.csv", header=0)#treating the -less data

data = pd.concat([data[col].astype(str).str.upper() for col in data.columns], axis=1)
data['id'] = data['id'].astype(int)
data['r'] = data['r'].astype(int)
data['e (g/km)'] = data['e (g/km)'].astype(float)
data['m (kg)'] = data['m (kg)'].astype(float)
data['w (mm)'] = data['w (mm)'].astype(float)
data['at1 (mm)'] = data['at1 (mm)'].astype(float)
data['at2 (mm)'] = data['at2 (mm)'].astype(float)
data['ec (cm3)'] = data['ec (cm3)'].astype(float)
data['z (Wh/km)'] = data['z (Wh/km)'].astype(float)
data['Er (g/km)'] = data['Er (g/km)'].astype(float)
data['ep (KW)'] = data['ep (KW)'].astype(float)
data.rename(columns={'e (g/km)': 'e', 'm (kg)': 'm'}, inplace=True)

Now it's time to remove some columns that this analysis does not use:

In [ ]:
#drop all the unused columns in a single call
data = data.drop(['MP', 'MMS', 'T', 'w (mm)', 'at1 (mm)', 'at2 (mm)', 'TAN'], axis=1)

Finally, I add a column with the CO2 emission rate per kg of vehicle mass:

In [ ]:
emission = data['e']
kg = data['m']
ePerKG = emission / kg
data['ePerKG(e/m)'] = ePerKG

#data.to_csv("../datasets/CO2-passenger-cars-v12-treated.csv")
data.to_csv("../datasets/CO2-passenger-cars-v12-treated-less.csv")
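As a small illustration of the step above, here is a minimal sketch of the same division on a toy table. The rows are made up for the example; only the column names 'e', 'm' and 'ePerKG(e/m)' come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values), using the renamed columns 'e' (g/km) and 'm' (kg)
toy = pd.DataFrame({'e': [120.0, 0.0, 95.0], 'm': [1200.0, 1500.0, np.nan]})

# Vectorised division: a missing mass yields NaN instead of raising an error
toy['ePerKG(e/m)'] = toy['e'] / toy['m']
print(toy)
```

Rows without a mass end up with a missing ratio, so they can be filtered out later with `dropna` if needed.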

Select dataset

Processing the full data provided by the E.U. can be very (VERY) slow and requires a powerful machine. If this notebook is not being run on such a machine, select the '-less' dataset:

In [ ]:
#datasetPath = "../datasets/CO2-passenger-cars-v12-treated-less.csv"
datasetPath = "../datasets/CO2-passenger-cars-v12-treated.csv"

euCountriesPath = "../datasets/european-union-countries.csv"
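Another possible way to lighten the load, besides the '-less' dataset, is to read only the columns an analysis needs. The sketch below is an assumption, not part of the original notebook: it uses an in-memory stand-in file with made-up rows, and the column list `wanted` is only an example.

```python
import io
import pandas as pd

# Stand-in for the treated CSV file (made-up rows); in the notebook this would be datasetPath
fakeFile = io.StringIO(
    "id,MS,Mh,Cn,r,e,m,Ft\n"
    "1,FR,RENAULT,ZOE,3,0.0,1500.0,ELECTRIC\n"
    "2,DE,BMW,X1,2,120.0,1600.0,DIESEL\n"
)

# Hypothetical selection of columns: member state, manufacturer, registrations, emissions, fuel type
wanted = ['MS', 'Mh', 'r', 'e', 'Ft']

# usecols makes pandas parse and keep only the listed columns
slim = pd.read_csv(fakeFile, usecols=wanted)
print(slim.columns.tolist())
```

Reading fewer columns reduces memory use, but it does not speed up the row-by-row loops used further below.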

Analysis 1: The biggest and the lowest emitters

The most important column is 'e (g/km)'. In fact, the emission rate is the main reason for the EU to collect all of this data every year. If we looked at the cars with the highest and lowest emissions, what would we see?

In [ ]:
data = pd.read_csv(datasetPath)

def writeEmitters(path, title, ascendingOrder):
    #write the first 13 cars of the ranking, one "maker name: emission" line each
    ranking = data.sort_values('e', ascending=ascendingOrder).head(13)
    with open(path, 'w') as f:
        f.write(title)
        for label, row in ranking.iterrows():
            f.write(row['Mk'] + " " + row['Cn'] + ": " + str(row['e']) + '\n')

writeEmitters('../results/lowestEmitters.txt', "\nThe lowest CO2 emitters: \n", True)
writeEmitters('../results/biggestEmitters.txt', "The biggest CO2 emitters: \n", False)
In [5]:
with open('../results/lowestEmitters.txt', 'r') as f:
    print(f.read())
with open('../results/biggestEmitters.txt', 'r') as f:
    print(f.read())
The lowest CO2 emitters: 
NISSAN NISSAN LEAF: 0.0
RENAULT ZOE: 0.0
RENAULT ZOE: 0.0
RENAULT ZOE: 0.0
RENAULT ZOE: 0.0
TESLA MOTORS MODEL S: 0.0
MERCEDES-BENZ ELECTRIC DRIVE: 0.0
MERCEDES-BENZ ELECTRIC DRIVE: 0.0
MERCEDES-BENZ ELECTRIC DRIVE: 0.0
RENAULT ZOE: 0.0
TESLA MOTORS MODEL S: 0.0
MERCEDES-BENZ B 250 E: 0.0
RENAULT ZOE: 0.0

The biggest CO2 emitters: 
BUGATTI BUGATTIGRANDSPORTVITESSE : 559.0
NAN KW9/21/26/30/33/34/36: 549.0
NAN K-YACHT85: 548.0
BUGATTI UNKNOWN: 539.0
BUGATTI UNKNOWN: 539.0
BUGATTI BUGATTIGR.SPORTVITESSE: 539.0
BUGATTI UNKNOWN: 539.0
  BUGATTIGRANDSPORT: 539.0
FERRARI 612: 475.0
NAN F12: 453.0
LAMBORGHINI GALLARDO: 450.0
NAN FF: 444.0
MASERATI MASERATI QUATTROPORTE: 443.0

Here we have two very different classes of vehicles.

  • Among the lowest emitters, there are many cars with a CO2 emission rate of zero. These are vehicles that use alternative fuel types, like electricity. Some of them are:

Tesla Model S: Car pic 1

Renault Zoe: Car pic 1

  • At the top of the ranking of CO2 emitters are the very expensive sports cars. To reach their high speeds, these machines emit impressive amounts of CO2. But not all of the entries are cars: the K-Yacht 85 is a motorhome. Some others are:

BUGATTI GRAND SPORT VITESSE: Car pic 2

Ferrari 612: Car pic 2

Some of the cars listed are repeated or do not have a manufacturer name. This shows some of the inconsistencies in the data.
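One way to quantify these inconsistencies is to count repeated rows and rows without a manufacturer. The following is a minimal sketch on made-up rows, assuming the 'Mk', 'Cn' and 'e' columns used above; it also assumes that missing manufacturers may appear as the text 'NAN', as in the printed ranking.

```python
import pandas as pd

# Toy rows (made-up), mimicking the 'Mk' (make), 'Cn' (commercial name) and 'e' columns
toy = pd.DataFrame({
    'Mk': ['RENAULT', 'RENAULT', None, 'BUGATTI'],
    'Cn': ['ZOE', 'ZOE', 'F12', 'UNKNOWN'],
    'e':  [0.0, 0.0, 453.0, 539.0],
})

# Rows whose make, name and emission rate repeat an earlier row
repeated = toy.duplicated(subset=['Mk', 'Cn', 'e'])

# Rows with no manufacturer: either a real missing value or the text 'NAN'
missingMaker = toy['Mk'].isna() | (toy['Mk'] == 'NAN')

print("repeated rows:", int(repeated.sum()))
print("rows without manufacturer:", int(missingMaker.sum()))
```

Note that a repeated row is not necessarily an error: the same model can be reported by several member states or in several variants.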

Analysis 2: Fuel Types in E.U. - Use and Impact

Another important factor available in the data is the Fuel Type ("Ft") for each car registered.

But what is the popularity of each one of them?

And how big is their impact on the environment?

In [ ]:
data = pd.read_csv(datasetPath, header=0)

#create set with the existing fuel types
fuelTypes = set(data['Ft'].values)

#sum the registrations of each fuel type
fuelTypesDataframes = dict([])
fuelTypeRegs = pd.Series(dtype=float)
for f in fuelTypes:
    fuelTypesDataframes[f] = data[data.Ft == f]
    fuelTypeRegs[f] = fuelTypesDataframes[f]['r'].sum()
In [ ]:
#Now, let's create the plot
#convert the registration counts to percentages of the total
fuelTypeRegs = (fuelTypeRegs / fuelTypeRegs.sum())*100

ft = dict([])
ft['Usage %'] = []
ft['Fuel Type'] = []
ft['Fuel Type (detail)'] = []
for key,value in fuelTypeRegs.items():
    ft['Usage %'].append(value)
    ft['Fuel Type (detail)'].append(key)
    if(value < 10):
        ft['Fuel Type'].append('Others')
    else:
        ft['Fuel Type'].append(key)

p = Bar(ft, values='Usage %', label='Fuel Type', stack='Fuel Type (detail)', legend='top_center')
p.plot_height=500
p.plot_width=600
output_file("../results/bars_fueltypes.html", title="Use of different fuel types")
show(p)
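The registration totals and percentage shares computed in the two cells above can also be obtained with a pandas groupby. This is a minimal sketch on made-up rows, assuming only the 'Ft' and 'r' columns; it is an alternative formulation, not the code that produced the chart.

```python
import pandas as pd

# Toy rows (made-up): fuel type and number of registrations
toy = pd.DataFrame({
    'Ft': ['DIESEL', 'PETROL', 'DIESEL', 'ELECTRIC'],
    'r':  [50, 40, 5, 5],
})

# Total registrations per fuel type
regsPerFuel = toy.groupby('Ft')['r'].sum()

# Share of each fuel type, as a percentage of all registrations
sharePercent = regsPerFuel / regsPerFuel.sum() * 100
print(sharePercent)
```

On the full dataset this avoids iterating over every row in Python, which is usually the slowest part of the loop-based version.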
In [3]:
displayHTML('../results/bars_fueltypes.html')
Use of different fuel types

The graph shows three bars: Diesel, Petrol and Others.

Petrol and diesel account for most of the cars, with slightly more diesel cars registered in 2015.

The other fuel types add up to less than 5% of the total. They are diverse, but definitely not popular. Each of them is a different approach to the goal of reducing the CO2 emissions of cars. This leads us to the next graph:

In [ ]:
box = BoxPlot(data, values='e', label='Ft',
              color='Ft', plot_width=900, legend=False)
output_file('../results/box.html')
save(box)
In [4]:
displayHTML('../results/box.html')
Bokeh Plot

The boxplot shows the distribution of emission rates per fuel type. The Y axis is the emission rate, in g of CO2 per kilometre. Here we can see how much each of them can impact the environment.

Not surprisingly, the electric and hydrogen motors emit no CO2 at all. Most of the alternative fuel sources tend to have lower CO2 emissions. Exceptions are 85% ethanol (E85) and LPG. There are also some combinations of traditional fuel types, like petrol or diesel, with electric energy.

The boxplot also shows that most of the high emission rates belong to petrol and diesel cars, with the diesel rates slightly lower than the petrol ones. These two fuel types also show a very wide variation in emission rates. For example, there seem to be many petrol cars with emission rates below 50 g/km.
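The numbers behind a boxplot like this one (quartiles and interquartile range per fuel type) can be inspected directly. Here is a minimal sketch on made-up rows, assuming the 'Ft' and 'e' columns; like the plot, it counts each row once and ignores the number of registrations.

```python
import pandas as pd

# Toy rows (made-up): fuel type and emission rate in g/km
toy = pd.DataFrame({
    'Ft': ['PETROL'] * 4 + ['DIESEL'] * 4,
    'e':  [40.0, 110.0, 130.0, 300.0, 90.0, 100.0, 120.0, 200.0],
})

# describe() returns count, mean, std, min, quartiles and max for each group
summary = toy.groupby('Ft')['e'].describe()

# Interquartile range: the height of the box in the boxplot
iqr = summary['75%'] - summary['25%']

print(summary[['25%', '50%', '75%']])
print(iqr)
```

A wide interquartile range for a fuel type corresponds to the tall boxes seen for petrol and diesel.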

Analysis 3: Most green-friendly manufacturers

Is there a difference between the cars produced by different manufacturers when it comes to emission rates? The following code makes a scatter plot that lets us see whether some car manufacturers are actually more "green-friendly" than others.

In [ ]:
data = pd.read_csv(datasetPath, header=0)
manufact = set([])
for i in data['Mh'].values:
    manufact.add(i)

manufactArray = []
registers = []
totalEmission = []
averageE = []
carsUnder95 = []
carsUnder95Percent = []
for m in manufact:
    detectedNaN = False
    mData = data[data.Mh == m]
    manufactArray.append(m)
    regs = 0
    em = 0
    c95 = 0
    for label, row in mData.iterrows():
        r = row['r']
        if(math.isnan(r)):
            r = 0
        regs = regs + r
        e = row['e']
        if(math.isnan(e)):
            e = 0
        em = em + (e * r)
        if(e <= 95):
            c95 = c95 + r
    if(math.isnan(regs)):
        regs = 0
    registers.append(regs)
    if(math.isnan(em)):
        em = 0
    totalEmission.append(em)
    if(math.isnan(c95)):
        c95 = 0
    carsUnder95.append(c95)
    if(math.isnan(em / regs) == False):
        averageE.append(em/regs)
    else:
        averageE.append(0)
    if(math.isnan((c95/regs)*100) == False):
        carsUnder95Percent.append((c95/regs)*100)
    else:
        carsUnder95Percent.append(0)
In [ ]:
#after gathering the data, it's time to build a dataframe and make a plot with it:
manufactFrame = pd.DataFrame({
    'Mh': manufactArray,
    'r': registers,
    'e': totalEmission,
    'averageE': averageE,
    'carsUnder95': carsUnder95,
    'carsUnder95Percent': carsUnder95Percent,
})
#the circle grows with the share of cars at or under 95 g/km
manufactFrame['circleSize'] = (manufactFrame['carsUnder95Percent']/2)+25
#colour: more red for a higher average emission, more green for a higher share under 95 g/km
manufactFrame['circleColor'] = [
    "#%0.2X%0.2X%0.2X" % (int((avg/1000)*255), int((pct/100)*255), 30)
    for avg, pct in zip(averageE, carsUnder95Percent)
]
s = ColumnDataSource(manufactFrame)
p = figure(x_axis_label='Registers', y_axis_label='Average Emission (g/km)', title="Emission on Manufacturers")
p.circle('r', 'averageE', size='circleSize', source=s, alpha=0.6, fill_color='circleColor')

tips=[('Name','@Mh'),
     ('Cars under 95g/km','@carsUnder95')]

hline = Span(location=95, dimension='width', line_color='green', line_width=3, line_dash='dashed')
p.renderers.extend([hline])
p.add_layout(Label(x=40000, y=95, text='95 g/km target'))
hover = HoverTool(tooltips=tips)
p.add_tools(hover)
output_file("../results/emission_manufact.html")
save(p)
In [6]:
displayHTML('../results/emission_manufact.html')
Bokeh Plot

"By 2021, phased in from 2020, the fleet average to be achieved by all new cars is 95 grams of CO2 per kilometre." (https://ec.europa.eu/clima/policies/transport/vehicles/cars_en)

To reduce CO2 emissions, the European Union has set a target of 95 grams per km to be achieved by 2021. In the plot, we draw a line at this value to see how close the manufacturers are to achieving this goal.

Each circle is a manufacturer: the larger its share of cars at or under 95 g/km, the bigger and greener the circle.
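The size and colour encoding described above can be written as a small function. This is a sketch that mirrors the formulas in the plotting cell; the function name `circleStyle` is made up for the example, and it assumes average emissions below 1000 g/km so that the red channel stays within range.

```python
def circleStyle(averageE, under95Percent):
    """Return (size, hex colour) for one manufacturer.

    averageE: average emission in g/km (assumed to be below 1000)
    under95Percent: share of registrations at or under 95 g/km, from 0 to 100
    """
    size = under95Percent / 2 + 25                # between 25 and 75
    red = int((averageE / 1000) * 255)            # more red for higher emissions
    green = int((under95Percent / 100) * 255)     # more green for a higher share
    blue = 30                                     # constant
    return size, "#%02X%02X%02X" % (red, green, blue)

# A zero-emission manufacturer gets the largest, greenest circle
print(circleStyle(0.0, 100.0))
# A manufacturer averaging 130 g/km with 20% of its cars under the target
print(circleStyle(130.0, 20.0))
```

Because the red channel divides by 1000 while typical averages are closer to 100 to 200 g/km, the red component stays small and the colour is dominated by the green channel.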

The plot shows that there are many different car manufacturers in Europe. Most of them registered few cars during the year. Most of the manufacturers with many registered cars, companies like Renault or Fiat, are getting close to the 95 g/km target.

A few manufacturers have not only reached the target, but even have average emission rates close to 0. These are manufacturers that produce electric and hydrogen cars, such as Tesla and Bluecar.
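The per-manufacturer figures behind this plot (registration-weighted average emission and share of cars at or under 95 g/km) can also be sketched with a groupby. The rows below are made up, only the column names 'Mh', 'r' and 'e' come from the notebook, and missing values are not handled here as they are in the loop-based cell.

```python
import pandas as pd

# Toy rows (made-up): manufacturer, registrations, emission rate in g/km
toy = pd.DataFrame({
    'Mh': ['A', 'A', 'B'],
    'r':  [10, 30, 5],
    'e':  [90.0, 130.0, 0.0],
})

# Emission rate weighted by the number of registrations of each row
toy['weighted'] = toy['e'] * toy['r']
# Registrations that count towards the 95 g/km target (0 for rows above it)
toy['rUnder95'] = toy['r'].where(toy['e'] <= 95, 0)

perMaker = toy.groupby('Mh')[['r', 'weighted', 'rUnder95']].sum()
perMaker['averageE'] = perMaker['weighted'] / perMaker['r']
perMaker['carsUnder95Percent'] = perMaker['rUnder95'] / perMaker['r'] * 100
print(perMaker[['r', 'averageE', 'carsUnder95Percent']])
```

Weighting by registrations matters: manufacturer A sells three times as many of its 130 g/km model, so its average is closer to 130 than to 90.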

Analysis 4: Emissions per country

With a little more data from other sources, much more can be done:

If we knew the average distance travelled by a car during a year in Europe, we could calculate the annual emissions of the cars registered in 2015 with the following function:

  • AnnualEmission(Car) = Car['registers'] x Car['gOfCO²perKm'] x DistancePerYear

Next, we only have to sum the annual emissions of each type of car by country.

The average distance travelled in the E.U., in km, can be found at: http://odyssee.enerdata.net/database/ -> Transport -> Kilometers -> Cars

It is 12284.03 km per year. With this value, we can estimate the CO2 emissions of the newly registered cars in each country.
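The formula above can be sketched as a small function. The function name and the example numbers are made up for illustration; only the distance of 12284.03 km per year comes from the text, and the result is converted from grams to tonnes as in the plot further below.

```python
KM_PER_YEAR = 12284.03  # average distance per car per year, as quoted above

def annualEmissionTonnes(registrations, gramsPerKm, kmPerYear=KM_PER_YEAR):
    """Estimated yearly CO2 emission, in tonnes, of all registrations of one car type."""
    grams = registrations * gramsPerKm * kmPerYear
    return grams / 1000 / 1000  # grams -> kilograms -> tonnes

# Made-up example: 1,000 registrations of a model rated at 100 g/km
print(annualEmissionTonnes(1000, 100.0))
```

This treats every car as driving the E.U. average distance, so it is only a rough estimate for any single country.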

In [ ]:
kmPerYear=12284.03
data = pd.read_csv(datasetPath, header=0)

countriesDF = pd.read_csv(euCountriesPath)
countriesDF['totalEmission'] = ''
countriesDF['circleSize'] = ''
countriesDF['circleColor'] = ''
emissions = []
circleSize = []

for label, row in countriesDF.iterrows():
    print("Processing ", row['id'])
    id = row['id']
    countryDF = data[data.MS == id]
    emission = 0
    for label, row2 in countryDF.iterrows():
        e = row2['e']
        r = row2['r']
        if(math.isnan(r)):
            r = 0
        if(math.isnan(e)):
            e = 0
        emission = emission + (r * e)*kmPerYear
    emissions.append(emission)
    circleSize.append(32 + np.random.randint(low=0, high=10))

#convert grams to tonnes and pick a random colour for each country
countriesDF['totalEmission'] = [e/1000/1000 for e in emissions]
countriesDF['circleSize'] = circleSize
countriesDF['circleColor'] = [
    "#%0.2X%0.2X%0.2X" % (np.random.randint(low=30, high=240),
                          np.random.randint(low=30, high=240),
                          np.random.randint(low=0, high=100))
    for _ in emissions
]

print("Making bokeh plot:")
countriesSource = ColumnDataSource(countriesDF)
p = figure(x_axis_label='Emission of CO2 (ton) in 2015', y_axis_label='Population (millions)', title='Emissions per Country', x_axis_type="log", x_range=[9000, 5013000], y_axis_type="log")
p.circle('totalEmission', 'millions', source=countriesSource, size='circleSize', alpha=0.8, color='circleColor')
p.text('totalEmission', 'millions', text='id', source=countriesSource, text_baseline="middle", text_align="center")
p.add_tools(HoverTool(tooltips=[('Name','@COUNTRYNAME'), ('Pop.', '@millions'), ('CO2', '@totalEmission')]))
p.plot_height=450
p.plot_width=600
#p.xaxis[0].formatter.use_scientific = False

output_file('../results/eu-emission.html')
save(p)
In [7]:
displayHTML('../results/eu-emission.html')
Bokeh Plot

With the population on the y axis and the total emissions on the x axis (both logarithmic), the plot shows that CO2 emissions grow roughly in line with population.

  • Germany, France and Italy (among the most populous countries in the E.U.) have the highest CO2 emissions. Meanwhile, the small countries of Malta and Cyprus have the lowest emissions.

But some points fall off the trend: Bulgaria has a relatively large population but emits little CO2 from new cars, while Luxembourg (a very small country) emits a lot for its size, which suggests that people there register many new cars.
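Countries that fall off the trend can be made easier to spot by dividing emissions by population. The following is a minimal sketch with made-up country codes and numbers, assuming the 'id', 'millions' and 'totalEmission' columns used in the plot; the column name 'tonnesPerMillion' is invented for the example.

```python
import pandas as pd

# Toy rows (made-up codes and values): population in millions, emissions in tonnes
toy = pd.DataFrame({
    'id': ['AA', 'BB', 'CC'],
    'millions': [80.0, 7.0, 0.5],
    'totalEmission': [4000000.0, 35000.0, 60000.0],
})

# Emissions of newly registered cars per million inhabitants
toy['tonnesPerMillion'] = toy['totalEmission'] / toy['millions']

# Countries far above or below the others stand out at the ends of the ranking
ranked = toy.sort_values('tonnesPerMillion', ascending=False)
print(ranked[['id', 'tonnesPerMillion']])
```

In this toy table the small country 'CC' ranks first and 'BB' last, which is the kind of pattern described above for Luxembourg and Bulgaria.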
