Asset Classes

Free investment financial education

Language

Multilingual content from IBKR

Close Navigation
Learn more about IBKR accounts
Dplython…dplyr for Python!

Dplython…dplyr for Python!

Posted July 26, 2023 at 10:32 am

Andrew Treadway
TheAutomatic.net

If you’re an avid R user, you probably use the famous dplyr package. Python has a package meant to be similar to dplyr, called dplython. This article will give an introduction for how to use dplython.

For the examples below, we’ll use a sample dataset that comes with R giving attributes about the US states, including population, area, and income levels. You can see the dataset by clicking here.

Initial setup

dplython can be installed using pip:.

pip install dplython

Once the package is installed, let’s load a few methods from it, and read in our dataset.

# load packages
from dplython import select, DplyFrame, X, arrange, count, sift, head, summarize, group_by, tail, mutate
import pandas as pd
 
# read in data
state_df = pd.read_csv("state_info.txt")

After we’ve read in our data, we need to convert it to an object called a DplyFrame, which we do using a method from dplython. This DplyFrame object will allow us to perform “dplyr-like” operations.

state_info = DplyFrame(state_df)

The Basics – How to select columns and get top rows

With dplython, selecting columns is done like below:

state_info >> select(X.State_Name, X.State_Region, X.Population)

Notice how we append X to each field name, which is a little different from the select command in dplyr. We also use >> instead of %>% as a pipe operator. If you’re not as familiar with dplyr from R, the pipe operator basically allows to you apply a function to the output of a function or object to the left the operator (the >> in dplython). In the case above, the select function is just being applied to state_info.

To select only the top 10 or bottom 10 rows, we can chain the head or tail methods, respectively, to our line of code above. Here the head and tail functions are being applied to the result to the left of each respective pipe operator.

# get top ten rows with selected columns
state_info >> select(X.State_Name, X.State_Region, X.Population) >> head(10)
 
# get bottom ten rows with selected columns
state_info >> select(X.State_Name, X.State_Region, X.Population) >> tail(10)

How to filter rows

Filtering can be done using the sift method. For example, if we want to get only the records where a state has an area less than 100,000 we would do this:

# filter initial data frame to states with areas less than 100,000 square miles
state_info >> sift(X.Area < 100000)
 
# filter rows and columns in one step
state_info >> sift(X.Area < 100000) >> 
              select(X.State_Name, X.State_Region, X.Population, X.Area)

Sorting datasets with dplython

To sort our DplyFrame by a column, we can use the arrange method, like in dplyr:

# sort DplyFrame in ascending order by Area
state_info >> arrange(X.Area)
 
# sort first by region, then by area
state_info >> arrange(X.State_Region, X.Area)

How to create new columns with dplython

Also similar to dplyr, we can use the mutate method to add fields to state_info:

with_extra_field = state_info >> mutate(Name_Region = X.State_Name + '-' + X.State_Region)
 
# create multiple fields in one step
more_fields = state_info >> mutate(Name_Region = X.State_Name + '-' + X.State_Region,
                                   Area_Per_Thousand_Sq_Mile = X.Area / 1000)

with_extra_field is now a DplyFrame containing the same fields as state_info, plus a field called “Name_Region”. more_fields contains two extra fields — one concatenating the state name and region, and one dividing the state area field by 1000.

Summarizing data

Aggregating data can be done using the summarize and group_by methods:

# get average area by region
state_info >> group_by(X.State_Region) >> summarize(average_region_area = X.Area.mean())
 
# or do multiple aggregations in one step
state_info >> group_by(X.State_Region) >> summarize(average_region_area = X.Area.mean(), \
                                                 >> max_region_area = X.Area.max), \
                                                 >> max_region_income = X.Income.max())

The first line of code above will show the average (mean) state area by region. The second line calculates multiple aggregations in one step, including mean and max state areas by region, as well as max state income by region.

Transposing a dataset

dplython can also transpose datasets using X.__T. (period followed by double underscore and second period).

# transpose state_info
state_info >> X._.T()

How to join tables together

dplython, like dplyr, has several “SQL-like” functions for joining datasets together:

from dplython import inner_join, left_join
 
 
right_table = more_fields >> select(X.State_Name, X.Name_Region)
 
# inner join
joined_result = inner_join(state_info, right_table, on = "State_Name")
 
# left_join
left_join_result = left_join(state_info, right_table, on = "State_Name")

In each type of join above, the first and second parameters are the left and right tables being joined. The “on” parameter specifies what column or list of columns should be used as the key to join the tables together.

Counting values in a column

To get the count of the values in a specific column, use the count function:

# get a count of each state region
state_info >> count(X.State_Region)

How to randomly sample rows with dplython

The sample_n function can be used to randomly sample rows from our dataset. You just need to pass whatever number of rows you want to randomly select (e.g. 10 in the below example).

from dplython import sample_n
 
# randomly select ten rows from state_info
state_info >> sample_n(10)

dplython can also sample a random proportion of a dataset. For example, if we want to randomly sample 70% of the rows, we can use the sample_frac function:

from dplython import sample_frac
 
# randomly sample 70% of the rows from state_info
state_info >> sample_frac(0.7)

The first line of code above will generate a DplyFrame with two columns: one showing the State Region, and the other showing the average area across each region. The second line contains this information, plus the max state area for each region, and the max state income for each region.

That’s it for this post! Please check out other Python posts of mine by clicking here.

Originally posted on TheAutomatic.net blog.

Join The Conversation

If you have a general question, it may already be covered in our FAQs. If you have an account-specific question or concern, please reach out to Client Services.

Leave a Reply

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from TheAutomatic.net and is being posted with its permission. The views expressed in this material are solely those of the author and/or TheAutomatic.net and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.

IBKR Campus Newsletters

This website uses cookies to collect usage information in order to offer a better browsing experience. By browsing this site or by clicking on the "ACCEPT COOKIES" button you accept our Cookie Policy.