#46:

Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for Django

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight, we bring you an excerpt from the recently published Python Feature Engineering Cookbook - Third Edition, which discusses using boxplots and the inter-quartile range (IQR) proximity rule to visualize outliers in data distributions.

Related Titles


Covers numerous tools for mastering visualization including NumPy, Pandas, SQL, Matplotlib, and Seaborn
Includes an introductory chapter on Python 3 basics
Features companion files with numerous Python code samples and figures

Get the eBook for $37.99 (was $54.99)!


Explores cutting-edge techniques using ChatGPT/GPT-4 in harmony with Python for generating visuals that tell more compelling data stories
Tackles actual data scenarios and builds your expertise as you apply learned concepts to real datasets

Get the eBook for $37.99 (was $54.99)!


Covers Python-based data visualization libraries and techniques
Includes practical examples and Gemini-generated code samples for efficient learning
Integrates Google Gemini for advanced data visualization capabilities

Get the eBook for $35.99 (was $51.99)!

News Highlights: Python 3.13.0rc2 released with new interpreter, free-threaded build, JIT, and incremental garbage collection; Python survey shows pip dominance, rising interest in Conda, Poetry, and uv; and PSF expands CNA role to cover Pallets Projects like Flask and Jinja.

Here are my top 5 picks from our learning resources today:

And today’s Featured Study explores how ChatGPT can automate and streamline Python-based federated learning algorithm development, reducing human effort and improving coding efficiency.

Stay awesome!

Divya Anne Selvaraj

Editor-in-Chief

P.S.: This month’s survey is live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.


🐍 Python in the Tech 💻 Jungle 🌳


🗞️News

Python 3.13.0rc2 released: This version introduces several major features such as a new interactive interpreter, an experimental free-threaded build mode, preliminary JIT for performance, and incremental garbage collection.
Packaging Trends in Python: Highlights from the 2023 Developer Survey: Results show a strong preference for pip, with emerging interest in Conda and Poetry, and a new player, uv.
Python Software Foundation (PSF) Expands CNA Scope to Include Pallets Projects: The PSF has expanded its CVE Numbering Authority role to include Pallets Projects like Flask and Jinja, ensuring better vulnerability management.

💼Case Studies and Experiments🔬

Lessons learnt building a real-time audio application in Python: Key learnings covered include accepting inherent latency issues, leveraging modern operating systems' efficient memory management, and utilizing web browsers as effective interfaces for real-time applications.
Breaking Bell's Inequality with Monte Carlo Simulations in Python: Discusses the use of Monte Carlo simulations in Python to challenge Bell's inequality through a quantum mechanics game.

📊Analysis

Rust for the small things?... but what about Python?: Explores the enduring relevance of Python in data engineering, despite the allure of Rust for performance and safety.
Multiversion Python Thoughts: Delves into the complexities of implementing multi-version package imports in Python, motivated by the desire to handle incompatible library versions concurrently.

🎓Tutorials and Guides🤓


Python QuickStart for People Learning AI: Covers Python fundamentals, including data types, loops, and functions, and provides a concrete AI project example using the OpenAI API for summarizing research papers.
Lists vs Tuples in Python: Explores the characteristics, uses, and differences between lists and tuples in Python, emphasizing their ordered nature, content diversity, mutability, and appropriate usage scenarios.
Layman's Guide to Python Built-in Functions: Simplifies Python's built-in functions for beginners, providing plain English explanations and straightforward examples.
🎥Some tricks with UV: Demonstrates how UV not only facilitates quicker installations but also supports running Python scripts with on-the-fly dependency management.
Python 3 Module of the Week: A series of articles detailing diverse library functionalities ranging from text handling, data structures, and algorithms to more complex areas like cryptography and network communication.
Integrating Stripe Into A One-Product Django Python Shop: Part two of a series on creating a one-product shop using Django, htmx, and Stripe. Covers creating a Stripe account, defining a product, and configuring a webhook for transaction notifications.
Practical Introduction to Polars: Compares Polars' key functionalities with Pandas, offering practical examples to help users transition from Pandas to Polars for more efficient data analysis.

🔑Best Practices and Advice🔏


Understanding Python's __new__ Method Through a Magical Example: Introduces Python's lesser-known .__new__() method, used for creating instances before they're initialized with .__init__().
Some fun with Python Enum: Explores the Enum class introduced in Python 3.4, detailing its benefits over using literal types for type-safety and avoiding errors in code.
A comparison of hosts / providers for Python serverless functions (a.k.a. FaaS): Discusses various providers that support Python, their development experience (DevEx), pricing models, runtime limits, and other platform products.
Python HTTP Clients - Requests vs. HTTPX vs. AIOHTTP: Details each library's strengths and appropriate use cases, helping developers choose the right tool based on project needs.
Shades of testing HTTP requests in Python: Covers different techniques including mocking with AsyncMock and respx, parameterizing HTTP clients for flexible testing setups, and using integration tests with a Starlette server.

🔍Featured Study: Streamlining Federated Learning with Python and ChatGPT💥

In PTB-FLA Development Paradigm Adaptation for ChatGPT, Popovic et al. explore how AI can be used to streamline the development of federated learning algorithms (FLAs). The study adapts a Python-based development paradigm to leverage ChatGPT for improved speed and efficiency in coding for machine learning tasks.

Context

Federated Learning (FL) allows machine learning algorithms to train across decentralized data sources, such as edge devices, without sharing the raw data. PTB-FLA is a Python framework designed to ease this process by providing a structured way for developers to create these algorithms. Traditionally, this has required significant human input. With ChatGPT, the authors of this paper aimed to reduce human effort by automating much of the coding work. This study is important because it shows how LLMs can help build complex systems like FL algorithms, particularly in environments such as edge computing, where efficiency and reduced human oversight are key.
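To make the idea of training without sharing raw data more concrete, here is a minimal sketch of federated averaging for logistic regression with NumPy. It is an illustration under stated assumptions, not the PTB-FLA API or the authors' code: the functions `local_update` and `federated_average`, the synthetic client data, and the hyperparameters are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_update(weights, X, y, lr=0.1, epochs=20):
    """Hypothetical client step: gradient descent on the client's own data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Hypothetical server step: average models, weighted by sample count."""
    return np.average(client_weights, axis=0, weights=client_sizes)


# Synthetic stand-in for three edge devices, each holding private data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 80, 120):
    X = rng.normal(size=(n, 2))
    y = (X @ true_w > 0).astype(float)
    clients.append((X, y))

# Each round, clients train locally and only their weights are aggregated.
global_w = np.zeros(2)
for _ in range(10):
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(updates, [len(y) for _, y in clients])

# Evaluation pools the data only to check the sketch; a real deployment would not.
X_all = np.vstack([X for X, _ in clients])
y_all = np.concatenate([y for _, y in clients])
accuracy = np.mean((sigmoid(X_all @ global_w) > 0.5) == y_all)
print(f"global weights: {global_w}, accuracy: {accuracy:.2f}")
```

In this sketch, the only values that leave a client are its model weights; the feature matrices and labels stay where they were generated.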

Key Findings

The adapted four-phase paradigm reduced human labour by 50%, achieving double the speed of the original development method.
A new two-phase paradigm further streamlined the process, reducing human effort sixfold compared to the original approach.
ChatGPT-generated code was of higher quality, showing fewer errors compared to human-generated versions in comparable tasks.
The study demonstrated a significant reduction in costs by shrinking the size of ChatGPT prompts by a factor of 2.75.
Both adapted paradigms were successfully validated using logistic regression as a case study for federated learning.

What This Means for You

If you work with machine learning, particularly in decentralized systems like IoT or edge computing, this research is highly relevant. Using ChatGPT to develop federated learning algorithms can save you substantial time by automating coding tasks that would otherwise require significant effort. By adopting the two-phase paradigm, developers can expect faster, more efficient development cycles, allowing you to focus on innovation rather than repetitive coding. This also reduces costs when using AI-assisted tools like ChatGPT, as it optimises the prompt size.

Examining the Details

The study's methodology revolves around adapting an existing four-phase development process for federated learning into two paradigms tailored for ChatGPT. The original phases involved creating sequential code, transforming it into federated code, incorporating callbacks, and generating the final PTB-FLA code. The new two-phase paradigm simplifies this further by merging phases, allowing ChatGPT to generate the final federated code directly from the sequential code, bypassing intermediary steps. The team validated both paradigms through a case study using logistic regression. They iteratively refined the ChatGPT prompts to find the minimal context needed to achieve correct outputs, ensuring efficiency while maintaining code accuracy. The final results showed ChatGPT could develop high-quality code faster than humans, with far fewer resources.

You can learn more by reading the entire paper and accessing the PTB-FLA GitHub repository.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 5: Working with Outliers” in the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, published in August 2024.

Visualizing outliers with boxplots and the inter-quartile proximity rule

A common way to visualize outliers is by using boxplots. Boxplots provide a standardized display of the variable’s distribution based on quartiles. The box contains the observations within the first and third quartiles, known as the Inter-Quartile Range (IQR). The first quartile is the value below which 25% of the observations lie (equivalent to the 25th percentile), while the third quartile is the value below which 75% of the observations lie (equivalent to the 75th percentile). The IQR is calculated as follows:

IQR = 3rd quartile - 1st quartile

Boxplots also display whiskers, which are lines that protrude from each end of the box toward the minimum and maximum values and up to a limit. These limits are given by the minimum or maximum value of the distribution or, in the presence of extreme values, by the following equations:

upper limit = 3rd quartile + IQR × 1.5

lower limit = 1st quartile - IQR × 1.5

According to the IQR proximity rule, we can consider a value an outlier if it falls beyond the whisker limits determined by the previous equations. In boxplots, outliers are indicated as dots.
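As a small worked example of the rule, the sketch below computes the IQR and both limits for a toy list using only the standard library. The function name `iqr_limits` and the sample values are made up for illustration; `method="inclusive"` is chosen because it uses the same linear interpolation as the default in NumPy and pandas.

```python
from statistics import quantiles


def iqr_limits(values, fold=1.5):
    """Return the (lower, upper) limits given by the IQR proximity rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Hypothetical toy data with one extreme value on the right.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
lower, upper = iqr_limits(values)
outliers = [v for v in values if v < lower or v > upper]
print(lower, upper, outliers)  # -3.5 14.5 [50]
```

Here the quartiles are 3.25 and 7.75, so the IQR is 4.5 and the limits sit 6.75 below and above the box.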

Note

If the variable has a normal distribution, about 99% of the observations will be located within the interval delimited by the whiskers. Hence, we can treat values beyond the whiskers as outliers. Boxplots are, however, non-parametric, which is why we also use them to visualize outliers in skewed variables.
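The "about 99%" figure can be checked directly for a standard normal distribution. The following sketch uses `statistics.NormalDist` from the standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of the standard normal distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Limits from the IQR proximity rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Probability mass that falls between the two limits.
coverage = std_normal.cdf(upper) - std_normal.cdf(lower)
print(f"limits: {lower:.3f} to {upper:.3f}")
print(f"share of observations inside the limits: {coverage:.4f}")
```

The limits land at roughly 2.7 standard deviations on either side of the mean, which covers about 99.3% of a normal distribution.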

In this recipe, we’ll begin by visualizing the variable distribution with boxplots, and then we’ll calculate the whiskers’ limits manually to identify the points beyond which we could consider a value as an outlier.

How to do it...

We will create boxplots utilizing the seaborn library. Let’s begin by importing the Python libraries and the dataset:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

Modify the default background from seaborn (it makes prettier plots, but that’s subjective, of course):

sns.set(style="darkgrid")

Load the California house prices dataset from scikit-learn:

X, y = fetch_california_housing(
  return_X_y=True, as_frame=True)

Make a boxplot of the MedInc variable to visualize its distribution:

plt.figure(figsize=(8, 3))
sns.boxplot(data=X["MedInc"], orient="y")
plt.title("Boxplot")
plt.show()

In the following boxplot, we identify the box containing the observations within the IQR, that is, the observations between the first and third quartiles. We also see the whiskers. On the left, the whisker extends to the minimum value of MedInc; on the right, the whisker goes up to the third quartile plus 1.5 times the IQR. Values beyond the right whisker are represented as dots and could constitute outliers:

Figure 5.1 – Boxplot of the MedInc variable highlighting potential outliers on the right tail of the distribution

Note

As shown in Figure 5.1, the boxplot returns asymmetric boundaries denoted by the varying lengths of the left and right whiskers. This makes boxplots a suitable method for identifying outliers in highly skewed distributions. As we’ll see in the coming recipes, alternative methods to identify outliers create symmetric boundaries around the center of the distribution, which may not be the best option for asymmetric distributions.
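The numbers behind a boxplot can also be inspected without drawing anything. The sketch below is not part of the recipe: it uses matplotlib's `cbook.boxplot_stats` helper on a hypothetical toy list to show the quartiles, the whisker positions, and the points that would be drawn as dots. Note that matplotlib draws each whisker to the most extreme observation that still falls inside the limit, so a drawn whisker can end short of the limit computed by the rule.

```python
from matplotlib.cbook import boxplot_stats

# Hypothetical toy data with one extreme value on the right.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# boxplot_stats returns one dictionary of statistics per dataset.
stats = boxplot_stats(values, whis=1.5)[0]

# Limit given by the IQR proximity rule.
upper_limit = stats["q3"] + 1.5 * stats["iqr"]

print("box spans:", stats["q1"], "to", stats["q3"])
print("upper limit from the rule:", upper_limit)
print("left whisker ends at:", stats["whislo"])
print("right whisker ends at:", stats["whishi"])
print("points drawn as dots:", list(stats["fliers"]))
```

In this toy example, the left whisker stops at the minimum while the right whisker stops at the largest value below the upper limit, which gives the asymmetric whiskers described in the note.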

Let’s now create a function to plot a boxplot next to a histogram:

def plot_boxplot_and_hist(data, variable):
  f, (ax_box, ax_hist) = plt.subplots(
    2, sharex=True,
    gridspec_kw={"height_ratios": (0.50, 0.85)})
  sns.boxplot(x=data[variable], ax=ax_box)
  sns.histplot(data=data, x=variable, ax=ax_hist)
  plt.show()

Let’s use the previous function to create the plots for the MedInc variable:

plot_boxplot_and_hist(X, "MedInc")

In the following figure, we can see the relationship between the boxplot and the variable’s distribution shown in the histogram. Note how most of MedInc’s observations are located within the IQR box. MedInc’s potential outliers lie on the right tail, corresponding to people with unusually high-income salaries:

Figure 5.2 – Boxplot and histogram – two ways of displaying a variable’s distribution

...

How it works...

In this recipe, we used the boxplot method from Seaborn to create the boxplots and then we calculated the limits beyond which a value could be considered an outlier based on the IQR proximity rule.

In Figure 5.2, we saw that the box in the boxplot for MedInc extended from approximately 2 to 5, corresponding to the first and third quartiles (you can determine these values precisely by executing X["MedInc"].quantile(0.25) and X["MedInc"].quantile(0.75)). We also saw that the whiskers start at MedInc’s minimum on the left and extend up to 8.013 on the right (we know this value exactly because we calculated it in step 8). MedInc showed values greater than 8.013, which were displayed in the boxplot as dots. Those are the values that could be considered outliers...
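The limit calculation itself is not shown in this excerpt. As a rough sketch of how it could be done with the `quantile` calls mentioned above, the code below defines a hypothetical helper, `find_limits`, and applies it to a synthetic right-skewed column that stands in for MedInc so that it runs without downloading the dataset. The column name, the random data, and the helper are assumptions for illustration, not the book's code.

```python
import numpy as np
import pandas as pd


def find_limits(df, variable, fold=1.5):
    """Hypothetical helper: limits from the IQR proximity rule."""
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Synthetic right-skewed data standing in for an income-like variable.
rng = np.random.default_rng(42)
df = pd.DataFrame({"income": rng.lognormal(mean=1.0, sigma=0.5, size=1_000)})

lower, upper = find_limits(df, "income")

# Boolean mask marking values beyond either limit.
is_outlier = (df["income"] < lower) | (df["income"] > upper)
print(f"lower={lower:.3f}, upper={upper:.3f}, flagged={is_outlier.sum()}")
```

With the real dataset loaded, the same helper could be called as `find_limits(X, "MedInc")`.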

Packt library subscribers can continue reading the entire book for free. You can buy the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, here.

Get the eBook for $24.99 (was $35.99)!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!