Methods
Pivot Table
The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
Signature:
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
data : DataFrame
values : column to aggregate, optional
index : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
columns : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
aggfunc : function, list of functions, dict, default 'mean'
If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions.
fill_value : scalar, default None
Value to replace missing values with (in the resulting pivot table, after aggregation).
margins : bool, default False
Add all rows / columns (e.g. for subtotals / grand totals).
dropna : bool, default True
Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.
margins_name : str, default 'All'
Name of the row / column that will contain the totals when margins is True.
observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
If values is not passed, the aggregation is applied to all numeric columns.
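To make the parameters concrete, here is a minimal sketch using a made-up `sales` table (the column names `region`, `product` and `units` are assumptions for illustration, not part of the notes). It builds one pivot table with the default-style mean and one with sums, `fill_value` and `margins`.

```python
import pandas as pd

# Hypothetical sales records in simple column-wise (long) form.
# Note that there is no row for region "West" and product "B".
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "A"],
    "units": [10, 20, 30, 50],
})

# Rows come from `region`, columns from `product`; each cell is the mean of `units`.
mean_units = pd.pivot_table(sales, values="units", index="region",
                            columns="product", aggfunc="mean")

# Sum instead of mean, fill the empty West/B cell with 0,
# and add an "All" row and column holding the totals.
total_units = pd.pivot_table(sales, values="units", index="region",
                             columns="product", aggfunc="sum",
                             fill_value=0, margins=True)

print(mean_units)
print(total_units)
```

In `mean_units` the West/B cell is missing (NaN) because no such row exists in the input; in `total_units` the same cell is filled with 0.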
Vectorized String Operations
Very useful for textual data.
A vector is a collection of data.
A vectorized operation is performed on every element of the array.
The problem with vanilla Python is that its string methods work on one string at a time and do not handle vectorized operations well.
Vectorized string functions for Series and Index.
NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.
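The difference can be seen in a small sketch. The `names` Series is made up for illustration; it contains one missing value, which breaks the plain Python loop but passes through the `.str` accessor untouched.

```python
import pandas as pd

# Hypothetical data with one missing entry.
names = pd.Series(["alice", "BOB", None, "Carol"])

# Plain Python: the method is called one element at a time
# and fails as soon as it reaches the missing value.
try:
    plain = [name.upper() for name in names]
except AttributeError:
    plain = None  # a missing value has no .upper() method

# Vectorized: one call covers the whole Series and the missing value stays missing.
upper = names.str.upper()
print(upper)
```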
Some important string methods
lower/upper/capitalize/title
len
strip
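A short sketch of the methods listed above, applied through the `.str` accessor. The sample names are invented for illustration.

```python
import pandas as pd

# Hypothetical names with inconsistent case and stray whitespace.
names = pd.Series(["  john SMITH ", "mary ann", "DAVID  "])

cleaned = names.str.strip()           # remove leading/trailing whitespace
lower = cleaned.str.lower()           # every character lower-cased
title = cleaned.str.title()           # first letter of each word upper-cased
capital = cleaned.str.capitalize()    # only the first letter upper-cased
lengths = cleaned.str.len()           # number of characters per entry

print(title.tolist())
print(lengths.tolist())
```

Stripping first matters here: calling `len` on the raw values would also count the surrounding spaces.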
# filtering
# startswith/endswith
df[df['firstname'].str.endswith('A')]
# isdigit/isalpha...
df[df['firstname'].str.isdigit()]
# replace (regex=False so that '.' is matched literally)
df['title'] = df['title'].str.replace('Ms.', 'Miss.', regex=False)
df['title'] = df['title'].str.replace('Mlle.', 'Miss.', regex=False)
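The filtering and replace snippets can be tried on a small stand-in table. The DataFrame below is hypothetical; only the column names `firstname` and `title` come from the notes. Passing `regex=False` is a defensive choice in this sketch, so that the dot in `'Ms.'` is treated as a literal character regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical table reusing the column names from the notes.
df = pd.DataFrame({
    "firstname": ["Anna", "Maria", "John", "42"],
    "title": ["Ms.", "Mlle.", "Mr.", "Ms."],
})

# Boolean masks built from vectorized string tests.
ends_with_a = df[df["firstname"].str.endswith("a")]
all_digits = df[df["firstname"].str.isdigit()]

# Normalize the titles; regex=False matches the text literally.
df["title"] = df["title"].str.replace("Ms.", "Miss.", regex=False)
df["title"] = df["title"].str.replace("Mlle.", "Miss.", regex=False)

print(ends_with_a)
print(all_digits)
print(df)
```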
Applying regex
# applying regex
# contains
# search john -> both cases
df[df['firstname'].str.contains('john', case=False)]
# find last names whose first and last characters are not vowels
df[df['lastname'].str.contains(r'^[^aeiouAEIOU].+[^aeiouAEIOU]$')]
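As a sketch of how the character classes behave, the example below runs the case-insensitive search and two anchored patterns on invented data. Inside square brackets a leading `^` negates the class, so `[^aeiouAEIOU]` matches any character that is not a vowel, while `[aeiouAEIOU]` matches a vowel.

```python
import pandas as pd

# Hypothetical names; only the column names come from the notes.
df = pd.DataFrame({
    "firstname": ["John", "Johnny", "Maria", "JOHN"],
    "lastname": ["Smith", "Adams", "Olsen", "Iga"],
})

# Case-insensitive substring search.
johns = df[df["firstname"].str.contains("john", case=False)]

# ^ and $ anchor the pattern to the start and end of each value.
# First and last characters are NOT vowels:
non_vowel_ends = df[df["lastname"].str.contains(r"^[^aeiouAEIOU].+[^aeiouAEIOU]$")]

# First and last characters ARE vowels:
vowel_ends = df[df["lastname"].str.contains(r"^[aeiouAEIOU].*[aeiouAEIOU]$")]

print(johns["firstname"].tolist())
print(non_vowel_ends["lastname"].tolist())
print(vowel_ends["lastname"].tolist())
```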
Datetime Object
Timestamps reference particular moments in time (e.g., Oct 24th, 2022 at 7:00pm).
type(pd.Timestamp('2023/1/5'))
Variations
# variations
pd.Timestamp('2023-1-5')
pd.Timestamp('2023, 1, 5')
# only year
pd.Timestamp('2023')
# using text
pd.Timestamp('5th January 2023')
# using datetime.datetime object
import datetime as dt
x = pd.Timestamp(dt.datetime(2023,1,5,9,21,56))
x
# fetching attributes
x.year
x.month
x.day
x.hour
x.minute
x.second
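A self-contained sketch of the same idea: the two constructions below should describe the same moment, and the attributes can be read off either one. The variable names are chosen for illustration only.

```python
import datetime as dt
import pandas as pd

# The same moment built from a string and from a datetime.datetime object.
ts_from_string = pd.Timestamp("2023-01-05 09:21:56")
ts_from_datetime = pd.Timestamp(dt.datetime(2023, 1, 5, 9, 21, 56))

same = ts_from_string == ts_from_datetime

# Individual components are plain attributes.
parts = (ts_from_string.year, ts_from_string.month, ts_from_string.day,
         ts_from_string.hour, ts_from_string.minute, ts_from_string.second)

# Timestamp also offers helpers such as the weekday name.
weekday = ts_from_string.day_name()

print(same, parts, weekday)
```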
Why have separate objects to handle date and time when Python already has datetime functionality?
Syntax-wise, datetime is very convenient.
But performance takes a hit when working with huge data, in the same way a Python list is slower than a NumPy array.
The weaknesses of Python's datetime format inspired the NumPy team to add a set of native time series data types to NumPy.
The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.
Because of the uniform type in NumPy datetime64 arrays, vectorized operations on them can be accomplished much more quickly than if we were working directly with Python's datetime objects, especially as arrays get large.
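A minimal sketch of such a vectorized operation, using an arbitrary start date: adding an integer array to a single `datetime64` value produces a whole array of consecutive dates in one step, with each element stored in 8 bytes.

```python
import numpy as np

# A single date with day resolution.
start = np.datetime64("2023-01-05")

# Adding an integer array is applied to every element at once,
# giving seven consecutive days.
week = start + np.arange(7)

print(week)
print(week.dtype)      # the array keeps the day-resolution datetime64 type
print(week.itemsize)   # bytes used per element
```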
The pandas Timestamp object combines the ease of use of Python's datetime with the efficient storage and vectorized interface of numpy.datetime64.
From a group of these Timestamp objects, pandas can construct a DatetimeIndex that can be used to index data in a Series or DataFrame.
A DatetimeIndex is a collection of pandas Timestamp objects.
# from strings
type(pd.DatetimeIndex(['2023/1/1', '2022/1/1', '2021/1/1']))
# from pd.Timestamp objects
dt_index = pd.DatetimeIndex([pd.Timestamp(2023, 1, 1), pd.Timestamp(2022, 1, 1), pd.Timestamp(2021, 1, 1)])
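One way to put the index to use is sketched below: the DatetimeIndex becomes the index of a Series, values are looked up by date, and date parts are extracted for all entries at once. The Series values are made up for illustration.

```python
import pandas as pd

dt_index = pd.DatetimeIndex(["2023/1/1", "2022/1/1", "2021/1/1"])

# Use the DatetimeIndex as the index of a Series (values are hypothetical).
s = pd.Series([1, 2, 3], index=dt_index)

# Look up a value by its timestamp label.
value_2022 = s.loc[pd.Timestamp("2022-01-01")]

# Date components are available for the whole index in one call.
years = s.index.year

print(s)
print(value_2022)
print(list(years))
```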