Hi,
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published Python Feature Engineering Cookbook - Third Edition, which discusses using boxplots and the inter-quartile range (IQR) proximity rule to visualize outliers in data distributions.
News Highlights: Python 3.13.0rc2 released with new interpreter, free-threaded build, JIT, and incremental garbage collection; Python survey shows pip dominance, rising interest in Conda, Poetry, and uv; and PSF expands CNA role to cover Pallets Projects like Flask and Jinja.
Here are my top 5 picks from our learning resources today:
And today’s Featured Study explores how ChatGPT can automate and streamline Python-based federated learning algorithm development, reducing human effort and improving coding efficiency.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: This month’s survey is live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.
__new__ Method Through a Magical Example: Introduces Python's lesser-known .__new__() method, used for creating instances before they're initialized with .__init__().
AsyncMock and respx, parameterizing HTTP clients for flexible testing setups, and using integration tests with a Starlette server.
In PTB-FLA Development Paradigm Adaptation for ChatGPT, Popovic et al. explore how AI can be used to streamline the development of federated learning algorithms (FLAs). The study adapts a Python-based development paradigm to leverage ChatGPT for improved speed and efficiency in coding for machine learning tasks.
Federated Learning (FL) allows machine learning algorithms to train across decentralized data sources, such as edge devices, without sharing the raw data. PTB-FLA is a Python framework designed to ease this process by providing a structured way for developers to create these algorithms. Traditionally, this has required significant human input. With ChatGPT, the authors of this paper aimed to reduce human effort by automating much of the coding work. This study is important because it shows how LLMs can help build complex systems like FL algorithms, particularly in environments such as edge computing, where efficiency and reduced human oversight are key.
If you work with machine learning, particularly in decentralized systems like IoT or edge computing, this research is highly relevant. Using ChatGPT to develop federated learning algorithms can save you substantial time by automating coding tasks that would otherwise require significant effort. By adopting the two-phase paradigm, you can expect faster, more efficient development cycles, leaving you free to focus on innovation rather than repetitive coding. The paradigm also keeps prompts small, which reduces the cost of using AI-assisted tools like ChatGPT.
The study's methodology revolves around adapting an existing four-phase development process for federated learning into two paradigms tailored for ChatGPT. The original phases involved creating sequential code, transforming it into federated code, incorporating callbacks, and generating the final PTB-FLA code. The new two-phase paradigm simplifies this further by merging phases, allowing ChatGPT to generate the final federated code directly from the sequential code, bypassing intermediary steps. The team validated both paradigms through a case study using logistic regression. They iteratively refined the ChatGPT prompts to find the minimal context needed to achieve correct outputs, ensuring efficiency while maintaining code accuracy. The final results showed ChatGPT could develop high-quality code faster than humans, with far fewer resources.
You can learn more by reading the entire paper and accessing the PTB-FLA GitHub repository.
Here’s an excerpt from “Chapter 5: Working with Outliers” in the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, published in August 2024.
Visualizing outliers with boxplots and the inter-quartile proximity rule
A common way to visualize outliers is by using boxplots. Boxplots provide a standardized display of the variable’s distribution based on quartiles. The box contains the observations within the first and third quartiles, known as the Inter-Quartile Range (IQR). The first quartile is the value below which 25% of the observations lie (equivalent to the 25th percentile), while the third quartile is the value below which 75% of the observations lie (equivalent to the 75th percentile). The IQR is calculated as follows:
IQR = 3rd quartile - 1st quartile
Boxplots also display whiskers, which are lines that protrude from each end of the box toward the minimum and maximum values and up to a limit. These limits are given by the minimum or maximum value of the distribution or, in the presence of extreme values, by the following equations:
upper limit = 3rd quartile + IQR × 1.5
lower limit = 1st quartile - IQR × 1.5
According to the IQR proximity rule, we can consider a value an outlier if it falls beyond the whisker limits determined by the previous equations. In boxplots, outliers are indicated as dots.
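As a minimal sketch of the IQR proximity rule, the snippet below computes the quartiles, the IQR, and both limits for a small made-up sample with NumPy, then flags the values that fall beyond the limits. The sample values and the helper name iqr_limits are illustrative assumptions and are not part of the recipe.

```python
import numpy as np


def iqr_limits(values, fold=1.5):
    """Return the (lower, upper) limits given by the IQR proximity rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Made-up sample with one unusually large value.
sample = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 15.0])

lower, upper = iqr_limits(sample)

# Values beyond either limit are the candidates for outliers.
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)
```

The fold argument defaults to 1.5 to match the equations above; a larger value would only flag more extreme observations.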
Note
If the variable has a normal distribution, about 99% of the observations will be located within the interval delimited by the whiskers. Hence, we can treat values beyond the whiskers as outliers. Boxplots are, however, non-parametric, which is why we also use them to visualize outliers in skewed variables.
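The figure of about 99% can be checked for an idealized case. The sketch below assumes an exactly standard normal variable and uses Python's statistics.NormalDist to measure how much probability mass lies between the two limits. Real data will deviate from this ideal.

```python
from statistics import NormalDist

normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles and IQR of the standard normal distribution.
q1 = normal.inv_cdf(0.25)
q3 = normal.inv_cdf(0.75)
iqr = q3 - q1

# Limits from the IQR proximity rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Probability mass that falls between the two limits.
coverage = normal.cdf(upper) - normal.cdf(lower)
print(f"{coverage:.4f}")
```

Because the limits scale with the distribution, the result does not depend on the chosen mean or standard deviation.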
In this recipe, we’ll begin by visualizing the variable distribution with boxplots, and then we’ll calculate the whiskers’ limits manually to identify the points beyond which we could consider a value an outlier.
We will create boxplots utilizing the seaborn library. Let’s begin by importing the Python libraries and loading the dataset:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
Set the seaborn plotting style (it makes prettier plots, but that’s subjective, of course):
sns.set(style="darkgrid")
Load the California housing dataset from scikit-learn:
X, y = fetch_california_housing(
    return_X_y=True, as_frame=True)
Make a boxplot of the MedInc variable to visualize its distribution:
plt.figure(figsize=(8, 3))
sns.boxplot(data=X["MedInc"], orient="y")
plt.title("Boxplot")
plt.show()
In the following boxplot, we identify the box containing the observations within the IQR, that is, the observations between the first and third quartiles. We also see the whiskers. On the left, the whisker extends to the minimum value of MedInc; on the right, the whisker goes up to the third quartile plus 1.5 times the IQR. Values beyond the right whisker are represented as dots and could constitute outliers:
Figure 5.1 – Boxplot of the MedInc variable highlighting potential outliers on the right tail of the distribution
Note
As shown in Figure 5.1, the boxplot returns asymmetric boundaries denoted by the varying lengths of the left and right whiskers. This makes boxplots a suitable method for identifying outliers in highly skewed distributions. As we’ll see in the coming recipes, alternative methods to identify outliers create symmetric boundaries around the center of the distribution, which may not be the best option for asymmetric distributions.
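To see the asymmetry in numbers rather than in a figure, here is a hedged sketch that uses Matplotlib's boxplot_stats helper, which computes the statistics that Matplotlib boxplots are drawn from. The log-normal sample is made up for illustration and merely stands in for a right-skewed variable such as MedInc.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Made-up right-skewed sample (log-normal), seeded for reproducibility.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.75, size=1_000)

# boxplot_stats returns one dict per variable with the values a boxplot draws.
stats = boxplot_stats(skewed, whis=1.5)[0]

# Whisker lengths measured from the edges of the box.
left_whisker = stats["q1"] - stats["whislo"]
right_whisker = stats["whishi"] - stats["q3"]

print(f"left whisker: {left_whisker:.2f}, right whisker: {right_whisker:.2f}")
print(f"points beyond the whiskers: {len(stats['fliers'])}")
```

Each whisker stops at the most extreme observation that still lies within its limit, which is why the left whisker ends at the sample minimum and comes out shorter than the right one.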
def plot_boxplot_and_hist(data, variable):
    f, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True,
        gridspec_kw={"height_ratios": (0.50, 0.85)})
    sns.boxplot(x=data[variable], ax=ax_box)
    sns.histplot(data=data, x=variable, ax=ax_hist)
    plt.show()
Use the function to plot the MedInc variable:
plot_boxplot_and_hist(X, "MedInc")
In the following figure, we can see the relationship between the boxplot and the variable’s distribution shown in the histogram. Note how most of MedInc’s observations are located within the IQR box. MedInc’s potential outliers lie on the right tail, corresponding to people with unusually high incomes:
Figure 5.2 – Boxplot and histogram – two ways of displaying a variable’s distribution
...
In this recipe, we used the boxplot method from Seaborn to create the boxplots and then we calculated the limits beyond which a value could be considered an outlier based on the IQR proximity rule.
In Figure 5.2, we saw that the box in the boxplot for MedInc extended from approximately 2 to 5, corresponding to the first and third quartiles (you can determine these values precisely by executing X["MedInc"].quantile(0.25) and X["MedInc"].quantile(0.75)). We also saw that the whiskers start at MedInc’s minimum on the left and extend up to 8.013 on the right (we know this value exactly because we calculated it in step 8). MedInc showed values greater than 8.013, which were displayed in the boxplot as dots. Those are the values that could be considered outliers...
Packt library subscribers can continue reading the entire book for free. You can buy the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, here.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!