Pandas dataframe tricks

10 Jan 2019

What is it?

No, we are not talking about the species of cute large asian mammals, we are talking about the (equally cute) Python library.

It’s a Python library with a DataFrame class that’s VERY useful for a lot of things. Everything you can do with a Spreadsheet can be done better with Panda’s DataFrame.

Import

import pandas as pd

You may also need to import numpy:

import numpy as np

Load CSV/TSV Files

df = pd.read_csv(filePath, sep=separatorChar)

Write the modified DataFrame

df.to_csv(filePath, sep=separatorChar, index=writeUselessNewIndexesOnFileAndMakeAHugeMess)

Create a new empty DataFrame and then fill it with new data

rows = []
for index in indexes:
    row = dict()
    row['indexName'] = index
    row['dataAboutIndex] = getDataFrom(index)
    # etc, etc
    rows.append(row)
df = pd.DataFrame(rows)
#and for columns with an specific order:
df = pd.DataFrame(rows, columns['indexName', 'dataAboutIndex'])

Utilities:

Some essential tricks for DataFrames

Create new columns

df['column'] = np.NaN
df['anotherColumn'] = ''

Iterate row by row

for index, row in df.iterrows():
    #do something

Assign values to specific cell

df.set_value(rowIndex, column, newValue)

Call a lambda for every row (faster than a for loop)

df.apply(lambda row: function(row), axis=1) #axis=1 for rows, no columns

Can be used to assign values to a whole column too:

df['column'] = df.apply(lambda row: function(row), axis=1)

Drop columns

df.drop('columnX', axis=1, inplace=True)

Drop rows which are not included in a set of indexes

df_sliced = df.take(list(indexes_to_keep))

Create a column with the mean value of other columns

df['median'] = df[np.array(list(columnsSet))].median(axis=1)
# or
df['median'] = df.median(axis=1)
# also workds for .mean() and .quantile(x)

pentalpha blog