Pandas dataframe tricks

What is it?

No, we are not talking about the species of cute large asian mammals, we are talking about the (equally cute) Python library.

It’s a Python library with a DataFrame class that’s VERY useful for a lot of things. Everything you can do with a Spreadsheet can be done better with Panda’s DataFrame.

Import

import pandas as pd

You may also need to import numpy:

import numpy as np

Load CSV/TSV Files

df = pd.read_csv(filePath, sep=separatorChar)

Write the modified DataFrame

df.to_csv(filePath, sep=separatorChar, index=writeUselessNewIndexesOnFileAndMakeAHugeMess)

Create a new empty DataFrame and then fill it with new data

rows = []
for index in indexes:
    row = dict()
    row['indexName'] = index
    row['dataAboutIndex] = getDataFrom(index)
    # etc, etc
    rows.append(row)
df = pd.DataFrame(rows)
#and for columns with an specific order:
df = pd.DataFrame(rows, columns['indexName', 'dataAboutIndex'])

Utilities:

Some essential tricks for DataFrames

Create new columns

df['column'] = np.NaN
df['anotherColumn'] = ''

Iterate row by row

for index, row in df.iterrows():
    #do something

Assign values to specific cell

df.set_value(rowIndex, column, newValue)

Call a lambda for every row (faster than a for loop)

df.apply(lambda row: function(row), axis=1) #axis=1 for rows, no columns

Can be used to assign values to a whole column too:

df['column'] = df.apply(lambda row: function(row), axis=1)

Drop columns

df.drop('columnX', axis=1, inplace=True)

Drop rows which are not included in a set of indexes

df_sliced = df.take(list(indexes_to_keep))

Create a column with the mean value of other columns

df['median'] = df[np.array(list(columnsSet))].median(axis=1)
# or
df['median'] = df.median(axis=1)
# also workds for .mean() and .quantile(x)