ibis

"How much do we sell..? Let's take the total sum!"
- pandas
- ibis
"Ah, they wanted it by country..."
- pandas
- ibis
"And I guess I should deduct the discount."
- pandas
- ibis
"Oh, and Maria asked me to remove any outliers."
- pandas
- ibis
"I probably should use the median within each country"
- pandas
- ibis

Code

from IPython.core.interactiveshell import InteractiveShell

# `ast_node_interactivity` is a setting that determines how the return value of the last line in a cell is displayed
# with `last_expr_or_assign`, the return value of the last expression is displayed unless it is assigned to a variable
InteractiveShell.ast_node_interactivity = "last_expr_or_assign"

There's an excellent blog post on why Pandas feels clunky for those coming from R.

However in Python, I've found ibis as an alternative to pandas to be a much more natural fit for those coming from R.

ibis uses duckdb as a backend by default, and its API is a mix between duckdb and dplyr.

import ibis

_ in ibis is a special variable that refers to the last expression evaluated this is useful for chaining operations or for using the result of the last expression in subsequent operations

from ibis import _

By default, ibis defers execution until you call execute(). Using ibis.options.interactive = True will make it so that expressions are immediately executed when displayed. This is useful for interactive exploration.

ibis.options.interactive = True

Let's also import pandas to compare the two libraries.

import pandas as pd

Here's the equivalent code in pandas and ibis for the example provided in the blog post:

pandas_df = pd.read_csv("purchases.csv")
pandas_df.head()

	country	amount	discount
0	USA	2000	10
1	USA	3500	15
2	USA	3000	20
3	Canada	120	12
4	Canada	180	18

df = ibis.read_csv("purchases.csv")
df.head()

┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ country ┃ amount ┃ discount ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ string  │ int64  │ int64    │
├─────────┼────────┼──────────┤
│ USA     │   2000 │       10 │
│ USA     │   3500 │       15 │
│ USA     │   3000 │       20 │
│ Canada  │    120 │       12 │
│ Canada  │    180 │       18 │
└─────────┴────────┴──────────┘

"How much do we sell..? Let's take the total sum!"

pandas

pandas_df.amount.sum()

np.int64(17210)

df.amount.sum().execute()

"Ah, they wanted it by country..."

pandas

(
    pandas_df
    .groupby("country")
    .agg(total=("amount", "sum"))
    .reset_index()
)

	country	total
0	Australia	600
1	Brazil	460
2	Canada	3400
3	France	500
4	Germany	570
5	India	720
6	Italy	630
7	Japan	690
8	Spain	660
9	UK	480
10	USA	8500

ibis

(
    df
    .group_by("country")
    .aggregate(total=_.amount.sum())
    .order_by("country") # optional, to align with the pandas output
)

┏━━━━━━━━━━━┳━━━━━━━┓
┃ country   ┃ total ┃
┡━━━━━━━━━━━╇━━━━━━━┩
│ string    │ int64 │
├───────────┼───────┤
│ Australia │   600 │
│ Brazil    │   460 │
│ Canada    │  3400 │
│ France    │   500 │
│ Germany   │   570 │
│ India     │   720 │
│ Italy     │   630 │
│ Japan     │   690 │
│ Spain     │   660 │
│ UK        │   480 │
│ …         │     … │
└───────────┴───────┘

Calling .execute() will run the query and return the result as a pandas DataFrame.

type(
  (
    df
    .group_by("country")
    .aggregate(total=_.amount.sum())
    .order_by("country") # optional, to align with the pandas output
  ).execute()
)

pandas.core.frame.DataFrame

"And I guess I should deduct the discount."

pandas

(
    pandas_df
    .groupby("country")[["amount", "discount"]]
    .apply(lambda df: (df["amount"] - df["discount"]).sum())
    .reset_index()
    .rename(columns={0: "total"})
)

	country	total
0	Australia	540
1	Brazil	414
2	Canada	3349
3	France	450
4	Germany	513
5	India	648
6	Italy	567
7	Japan	621
8	Spain	594
9	UK	432
10	USA	8455

ibis

(
    df
    .group_by("country")
    .aggregate(total=(_.amount - _.discount).sum())
    .order_by("country")
)

┏━━━━━━━━━━━┳━━━━━━━┓
┃ country   ┃ total ┃
┡━━━━━━━━━━━╇━━━━━━━┩
│ string    │ int64 │
├───────────┼───────┤
│ Australia │   540 │
│ Brazil    │   414 │
│ Canada    │  3349 │
│ France    │   450 │
│ Germany   │   513 │
│ India     │   648 │
│ Italy     │   567 │
│ Japan     │   621 │
│ Spain     │   594 │
│ UK        │   432 │
│ …         │     … │
└───────────┴───────┘

"Oh, and Maria asked me to remove any outliers."

pandas

(
    pandas_df
    .query("amount <= amount.median() * 10")
    .groupby("country")[["amount", "discount"]]
    .apply(lambda df: (df["amount"] - df["discount"]).sum())
    .reset_index()
    .rename(columns={0: "total"})
)

	country	total
0	Australia	540
1	Brazil	414
2	Canada	270
3	France	450
4	Germany	513
5	India	648
6	Italy	567
7	Japan	621
8	Spain	594
9	UK	432
10	USA	1990

ibis

(
    df
    .mutate(median=_.amount.median())
    .filter(_.amount <= _.median * 10)
    .group_by("country")
    .aggregate(total=(_.amount - _.discount).sum())
    .order_by("country")
)

┏━━━━━━━━━━━┳━━━━━━━┓
┃ country   ┃ total ┃
┡━━━━━━━━━━━╇━━━━━━━┩
│ string    │ int64 │
├───────────┼───────┤
│ Australia │   540 │
│ Brazil    │   414 │
│ Canada    │   270 │
│ France    │   450 │
│ Germany   │   513 │
│ India     │   648 │
│ Italy     │   567 │
│ Japan     │   621 │
│ Spain     │   594 │
│ UK        │   432 │
│ …         │     … │
└───────────┴───────┘

"I probably should use the median within each country"

pandas

(
    pandas_df
    .assign(country_median=lambda df:
        df.groupby("country")["amount"].transform("median")
    )
    .query("amount <= country_median * 10")
    .groupby("country")[["amount", "discount"]]
    .apply(lambda df: (df["amount"] - df["discount"]).sum())
    .reset_index()
    .rename(columns={0: "total"})
)

	country	total
0	Australia	540
1	Brazil	414
2	Canada	270
3	France	450
4	Germany	513
5	India	648
6	Italy	567
7	Japan	621
8	Spain	594
9	UK	432
10	USA	8455

ibis

For this last example, we have to resort to calculating the median after a group by operation over each country and then join it back to the original DataFrame to replace the outliers. This is similar to the pandas approach.

df.group_by("country").aggregate(median=_.amount.median())

┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ country   ┃ median  ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ string    │ float64 │
├───────────┼─────────┤
│ USA       │  3000.0 │
│ Germany   │   200.0 │
│ India     │   250.0 │
│ Spain     │   230.0 │
│ Japan     │   240.0 │
│ Italy     │   220.0 │
│ Brazil    │   230.0 │
│ Australia │   210.0 │
│ UK        │   160.0 │
│ Canada    │   180.0 │
│ …         │       … │
└───────────┴─────────┘

Here's how you can do it in a single expression in ibis:

(
    df
    .join(
        df.group_by("country").aggregate(median=_.amount.median()),
        predicates=["country"],
    )
    .filter(_.amount <= _.median * 10)
    .group_by("country")
    .aggregate(total=(_.amount - _.discount).sum())
    .order_by("country")
)

┏━━━━━━━━━━━┳━━━━━━━┓
┃ country   ┃ total ┃
┡━━━━━━━━━━━╇━━━━━━━┩
│ string    │ int64 │
├───────────┼───────┤
│ Australia │   540 │
│ Brazil    │   414 │
│ Canada    │   270 │
│ France    │   450 │
│ Germany   │   513 │
│ India     │   648 │
│ Italy     │   567 │
│ Japan     │   621 │
│ Spain     │   594 │
│ UK        │   432 │
│ …         │     … │
└───────────┴───────┘

ibis - a better pandas

Table of contents

"How much do we sell..? Let's take the total sum!"

pandas

ibis

"Ah, they wanted it by country..."

pandas

ibis

"And I guess I should deduct the discount."

pandas

ibis

"Oh, and Maria asked me to remove any outliers."

pandas

ibis

"I probably should use the median within each country"

pandas

ibis

Citation

ibis - a better pandas

Table of contents

"How much do we sell..? Let's take the total sum!"#

pandas#

ibis#

"Ah, they wanted it by country..."#

pandas#

ibis#

"And I guess I should deduct the discount."#

pandas#

ibis#

"Oh, and Maria asked me to remove any outliers."#

pandas#

ibis#

"I probably should use the median within each country"#

pandas#

ibis#

Citation

"How much do we sell..? Let's take the total sum!"

pandas

ibis

"Ah, they wanted it by country..."

pandas

ibis

"And I guess I should deduct the discount."

pandas

ibis

"Oh, and Maria asked me to remove any outliers."

pandas

ibis

"I probably should use the median within each country"

pandas

ibis