Vectorized String Operations & DateTime

Methods

  1. Pivot Table

    • The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.

      signature:

        pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
      
    • Parameters

      • data : DataFrame

      • values : column to aggregate, optional. If values is not passed, the aggregation works on all numerical columns.

      • index : column, Grouper, array, or list of the previous. Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data, and it is used in the same manner as column values. The list can contain any of the other types (except list).

      • columns : column, Grouper, array, or list of the previous. Keys to group by on the pivot table column. If an array is passed, it must be the same length as the data, and it is used in the same manner as column values. The list can contain any of the other types (except list).

      • aggfunc : function, list of functions, dict, default 'mean'. If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions.

      • fill_value : scalar, default None. Value to replace missing values with (in the resulting pivot table, after aggregation).

      • margins : bool, default False. Add all rows / columns (e.g. for subtotals / grand totals).

      • dropna : bool, default True. Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.

      • margins_name : str, default 'All'. Name of the row / column that will contain the totals when margins is True.

      • observed : bool, default False. This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
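A minimal sketch of how index, columns, values, aggfunc, fill_value and margins fit together. The DataFrame and its column names (region, product, sales) are made up for illustration and are not from the notes.

```python
import pandas as pd

# Hypothetical sales data, made up for illustration.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "sales": [10, 20, 30, 50, 40],
})

# Rows come from `index`, columns from `columns`,
# and each cell is `aggfunc` applied to the matching `values`.
table = pd.pivot_table(
    df,
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,
)

# margins=True appends a row and a column named "All" (margins_name) with the totals.
with_totals = pd.pivot_table(
    df,
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)

print(table)
print(with_totals)
```

Here the two West/A rows (30 and 50) collapse into a single cell, which is the "summarization" step the pivot table performs.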

Vectorized String Operations

  1. Very useful for textual data

  2. A vector is a collection of data

  3. A vectorized operation is performed on every element of the array

  4. Plain Python string methods work on one string at a time; applying them over a collection needs an explicit loop, which breaks on missing values

  5. pandas provides vectorized string functions for Series and Index through the .str accessor.

    • NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.

    • Some important string methods

      • lower/upper/capitalize/title

      • len

      • strip

          # filtering
          # startswith/endswith
          df[df['firstname'].str.endswith('A')]
          # isdigit/isalpha...
          df[df['firstname'].str.isdigit()]
        
          # replace (regex=False so that '.' is matched literally, not as "any character")
          df['title'] = df['title'].str.replace('Ms.', 'Miss.', regex=False)
          df['title'] = df['title'].str.replace('Mlle.', 'Miss.', regex=False)
        
    • Applying regex

        # applying regex
        # contains
        # search john -> both case
        df[df['firstname'].str.contains('john',case=False)]
        # find lastnames whose first and last characters are both non-vowels
        df[df['lastname'].str.contains(r'^[^aeiouAEIOU].+[^aeiouAEIOU]$')]
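A short sketch of the points above: .str methods run on every element, and missing values stay missing instead of raising an error. The names in the Series are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical names, made up for illustration; one entry is missing.
names = pd.Series(["  john Smith ", "ALICE", np.nan, "Johnny"])

# A plain Python loop such as [s.lower() for s in names] would raise
# AttributeError here, because the missing value is a float NaN.

# Vectorized version: each method is applied element-wise, NaN stays NaN.
cleaned = names.str.strip().str.title()
lengths = cleaned.str.len()

# na=False treats missing values as "no match", so the mask is purely
# boolean and can be used directly for filtering.
has_john = cleaned.str.contains("john", case=False, na=False)
johns = cleaned[has_john]

print(cleaned)
print(johns)
```

Passing na=False is a design choice for filtering: without it the mask would contain a missing value at the NaN position, which cannot be used as a boolean filter.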
      

Datetime Object

  1. Timestamps reference particular moments in time (e.g., Oct 24th, 2022 at 7:00pm)

     type(pd.Timestamp('2023/1/5'))
    
  2. Variations

     # variations
     pd.Timestamp('2023-1-5')
     pd.Timestamp('2023, 1, 5')
     # only year
     pd.Timestamp('2023')
     # using text
     pd.Timestamp('5th January 2023')
     # using datetime.datetime object
     import datetime as dt
    
     x = pd.Timestamp(dt.datetime(2023,1,5,9,21,56))
     x
    
  3. Fetching attributes

     x.year
     x.month
     x.day
     x.hour
     x.minute
     x.second
    
  4. Why have separate objects to handle date and time when Python already has datetime functionality?

    • Syntax-wise, Python's datetime is very convenient

    • But performance takes a hit when working with huge data, much like a Python list versus a NumPy array

    • The weaknesses of Python's datetime format inspired the NumPy team to add a set of native time series data types to NumPy.

    • The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.

    • Because of the uniform type in NumPy datetime64 arrays, vectorized operations on dates can be accomplished much more quickly than if we were working directly with Python's datetime objects, especially as arrays get large

    • The pandas Timestamp object combines the ease of use of Python's datetime with the efficient storage and vectorized interface of numpy.datetime64

    • From a group of these Timestamp objects, pandas can construct a DatetimeIndex that can be used to index data in a Series or DataFrame
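The contrast above can be sketched as follows: Python datetime arithmetic runs one element at a time, while a datetime64 array handles the whole collection in one operation. The dates are arbitrary illustration values.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Python datetime objects: arithmetic happens one element at a time.
py_dates = [dt.datetime(2023, 1, 5) + dt.timedelta(days=i) for i in range(3)]

# NumPy datetime64: one vectorized operation over the whole array.
# Adding integers to a day-resolution datetime64 adds that many days.
np_dates = np.datetime64("2023-01-05") + np.arange(3)

# pandas builds on datetime64 and adds convenient accessors on top.
idx = pd.DatetimeIndex(np_dates)
weekdays = idx.day_name()

print(np_dates)
print(list(weekdays))
```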

  5. A collection of pandas Timestamps is a DatetimeIndex object

     # from strings
     type(pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1']))
    
     # from pd.Timestamp objects
     dt_index = pd.DatetimeIndex([pd.Timestamp(2023,1,1), pd.Timestamp(2022,1,1), pd.Timestamp(2021,1,1)])
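A small sketch of using a DatetimeIndex as the index of a Series. The values 1, 2, 3 are made up for illustration, and sorting the index first is a choice made here to keep date-based selection straightforward.

```python
import pandas as pd

dt_index = pd.DatetimeIndex(["2023/1/1", "2022/1/1", "2021/1/1"])

# Hypothetical values, made up for illustration.
s = pd.Series([1, 2, 3], index=dt_index).sort_index()

# Exact lookup with a Timestamp label.
value_2022 = s.loc[pd.Timestamp("2022-01-01")]

# Partial-string indexing: every entry that falls in the year 2023.
in_2023 = s.loc["2023"]

print(value_2022)
print(in_2023)
```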