Pandas dataframe tricks
10 Jan 2019What is it?
No, we are not talking about the species of cute large asian mammals, we are talking about the (equally cute) Python library.
It’s a Python library with a DataFrame class that’s VERY useful for a lot of things. Everything you can do with a Spreadsheet can be done better with Panda’s DataFrame.
Import
import pandas as pd
You may also need to import numpy:
import numpy as np
Load CSV/TSV Files
df = pd.read_csv(filePath, sep=separatorChar)
Write the modified DataFrame
df.to_csv(filePath, sep=separatorChar, index=writeUselessNewIndexesOnFileAndMakeAHugeMess)
Create a new empty DataFrame and then fill it with new data
rows = []
for index in indexes:
row = dict()
row['indexName'] = index
row['dataAboutIndex] = getDataFrom(index)
# etc, etc
rows.append(row)
df = pd.DataFrame(rows)
#and for columns with an specific order:
df = pd.DataFrame(rows, columns['indexName', 'dataAboutIndex'])
Utilities:
Some essential tricks for DataFrames
Create new columns
df['column'] = np.NaN
df['anotherColumn'] = ''
Iterate row by row
for index, row in df.iterrows():
#do something
Assign values to specific cell
df.set_value(rowIndex, column, newValue)
Call a lambda for every row (faster than a for loop)
df.apply(lambda row: function(row), axis=1) #axis=1 for rows, no columns
Can be used to assign values to a whole column too:
df['column'] = df.apply(lambda row: function(row), axis=1)
Drop columns
df.drop('columnX', axis=1, inplace=True)
Drop rows which are not included in a set of indexes
df_sliced = df.take(list(indexes_to_keep))
Create a column with the mean value of other columns
df['median'] = df[np.array(list(columnsSet))].median(axis=1)
# or
df['median'] = df.median(axis=1)
# also workds for .mean() and .quantile(x)