Hi,
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published book Apache Airflow Best Practices, which explains how to build and test a pipeline in Jupyter Notebook to extract daily images from NASA's APOD API, store them locally, and prepare the workflow for automation using Apache Airflow.
News Highlights: PyPI's aiocpa package updated with malicious code that steals private keys via Telegram; AWS Lambda SnapStart now supports Python 3.12+ and .NET 8+ for faster startups; Eel simplifies Python/JS HTML GUI apps with async support; and Marimo raises $5M for its open-source reactive Python notebook.
My top 5 picks from today’s learning resources:
And today’s Featured Study introduces CODECLEANER, an open-source toolkit that employs automated code refactoring to mitigate data contamination in Code Language Models, significantly enhancing evaluation reliability across Python and Java through systematic and scalable techniques.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: Thank you to all who participated in this month's survey. With this issue, we have fulfilled all content requests made this month.
- Covers dictionaries' .items(), .keys(), and .values() methods for accessing keys, values, or key-value pairs (see the short sketch after this list).
- Explains the range() function and its use for generating numerical sequences for loops, defining intervals with start, stop, and step parameters.
- Discusses building strings with the + and += operators, the .join() method for lists, and tools like StringIO for handling large datasets, with best practices for performance and flexibility.
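Here is a quick illustrative sketch (not taken from any of the linked articles) of the features the picks above cover: dict views, range() with start, stop, and step, and str.join().

# Illustrative only: dict views, range(), and joining strings
inventory = {"apples": 3, "pears": 5}

for fruit, count in inventory.items():   # key-value pairs
    print(fruit, count)

print(list(inventory.keys()))            # just the keys
print(list(inventory.values()))          # just the values

for i in range(0, 10, 2):                # start, stop, step
    print(i)

print(", ".join(["fast", "readable", "flexible"]))  # join a list of strings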
In "CODECLEANER: Elevating Standards with a Robust Data Contamination Mitigation Toolkit," Cao et al. address the pervasive issue of data contamination in Code Language Models (CLMs). The study introduces CODECLEANER, an automated code refactoring toolkit designed to mitigate contamination, enabling more reliable performance evaluations for CLMs.
Data contamination occurs when CLMs, trained on vast code repositories, inadvertently include test data, leading to inflated performance metrics. This undermines the credibility of CLMs in real-world applications, posing risks for software companies. Refactoring, a method of restructuring code without altering its functionality, offers a potential solution. However, the lack of automated tools and validated methods has hindered its adoption. CODECLEANER fills this gap by systematically evaluating refactoring operators for Python and Java code, ensuring they reduce contamination without semantic alterations.
This study is particularly valuable for software developers and engineering teams seeking to integrate CLMs into production, researchers aiming to benchmark CLMs accurately, and organisations evaluating AI-based code tools. By addressing data contamination, CODECLEANER enhances the credibility and reliability of CLM-based solutions for real-world applications.
The researchers evaluated CODECLEANER by applying 11 refactoring operators to Python and Java code at method-, class-, and cross-class levels. Effectiveness was measured using metrics like N-gram overlap and perplexity across over 7000 code snippets sampled from The Stack dataset. Four Code Language Models (CLMs), including StarCoder and CodeLlama, were used to assess changes in contamination severity.
Results showed that semantic operators, such as identifier renaming, reduced overlap by up to 39.3%, while applying all operators decreased overlap in Python code by 65%. On larger class-level Python codebases, contamination was reduced by 37%. Application to Java showed modest improvements, with the most effective operator achieving a 17% reduction.
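To make the overlap idea concrete, here is a minimal sketch (not CODECLEANER's actual code) of how token-level n-gram overlap between an original snippet and a refactored one could be estimated; the whitespace tokenisation and function names are simplifying assumptions of ours.

# Minimal sketch of n-gram overlap, the kind of contamination signal the paper reports
def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(original_code: str, refactored_code: str, n: int = 3) -> float:
    original = ngrams(original_code.split(), n)
    refactored = ngrams(refactored_code.split(), n)
    return len(original & refactored) / len(original) if original else 0.0

before = "def load_user(path): return open(path).read()"
after = "def read_profile(file_path): return open(file_path).read()"
print(overlap_ratio(before, after))  # identifier renaming lowers the overlap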
You can learn more by reading the entire paper and accessing the toolkit here.
Here’s an excerpt from “Chapter 4: Basics of Airflow and DAG Authoring” in the book Apache Airflow Best Practices by Dylan Intorf, Dylan Storey, and Kendrick van Doorn, published in October 2024.
This pipeline is designed to extract an image every day, store this information in a folder, and notify you of the completion. This entire process will be orchestrated by Apache Airflow and will take advantage of the scheduler to automate re-running. As stated earlier, it is helpful to spend time practicing this in Jupyter Notebook or another tool to ensure the API calls and connections are operating as expected and to troubleshoot any issues.
For this data pipeline, we will be extracting data from NASA. My favorite API is the Astronomy Picture of the Day (APOD), where a new photo is selected and displayed. You can easily change the API to another of interest, but for this example, I recommend you stick with the APOD and explore others once completed.
A NASA API key is required to start this next step:
Figure 4.3: NASA API Key input screenshot
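Once you have your key, one simple (illustrative) way to make it importable, as the code below assumes, is a small NASA_Keys.py file next to your notebook:

# NASA_Keys.py -- illustrative placeholder; replace with your own key and keep
# this file out of version control.
api_key = "YOUR_NASA_API_KEY"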
With the environment configured and the API set up, we can begin authoring a DAG to automate this process. As a reminder, most Python code can be pre-tested in a system outside of Airflow, such as Jupyter Notebook or locally. If you are running into problems, it is recommended to spend time analyzing what the code is doing and working to debug it.
In Jupyter Notebook, we are going to use the following code block to represent the function of calling the API, accessing the location of the image, and then storing the image locally. We will keep this example as simple as possible and walk through each step:
import requests
import json
from datetime import date
from NASA_Keys import api_key  # local file holding the API key, keeping it masked

# Build the APOD request URL with the API key included
url = f'https://api.nasa.gov/planetary/apod?api_key={api_key}'

# Call the API and parse the JSON response into a dictionary
response = requests.get(url).json()
response

# Grab the high-definition image URL and save the image locally,
# naming the file with today's date
today_image = response['hdurl']
r = requests.get(today_image)
with open(f'todays_image_{date.today()}.png', 'wb') as f:
    f.write(r.content)
The preceding code snippet is normally how we recommend starting any pipeline, ensuring that the API is functional, the API key works, and the current network requirements are in place to perform the procedures. It is best to ensure that the network connections are available and that no troubleshooting alongside the information security or networking teams is required.
Here is how the code looks in our Jupyter Notebook environment:
- requests: A common Python library for making HTTP requests. It is an easy-to-use library that makes working with HTTP requests simple and allows for easy use of GET and POST methods.
- json: This library allows you to parse JSON from strings or files into a dictionary or list.
- datetime: This library provides the current date and time parameters. We will use this later on to title the image file.
- NASA_Keys: This is a local file on our machine holding the api_key parameter. This is used in this example to keep things as simple as possible and also mask the variable.
Figure 4.4: What your current Jupyter cell should look like
Next, we create a variable called url to house the HTTP request call, including our api_key variable. This allows the api_key variable to be included in the URL while hidden by a mask. It calls api_key from the NASA_Keys file:

url = f'https://api.nasa.gov/planetary/apod?api_key={api_key}'
We then use the requests library to perform an HTTP GET method call on the URL that we have created. This calls on the API to send information for our program to interpret. Finally, we convert this information from the GET call into JSON format. For our own understanding and analysis of the information being sent back, we print out the response to get a view of how the dictionary is structured. In this dictionary, it seems that there is only one level with multiple key-value pairs including copyright, date, explanation, hdurl, media_type, service_version, title, and url:

Figure 4.5: Response from the NASA API call
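For readers following along without the screenshot, a hypothetical response dictionary might look like the sketch below; the values are placeholders, and only the keys reflect the response described above.

# Hypothetical shape of the APOD JSON response; values are illustrative only
response = {
    "copyright": "Example Photographer",
    "date": "2024-10-01",
    "explanation": "A short description of today's astronomy picture...",
    "hdurl": "https://apod.nasa.gov/apod/image/example_hd.jpg",
    "media_type": "image",
    "service_version": "v1",
    "title": "An Example Astronomy Picture",
    "url": "https://apod.nasa.gov/apod/image/example.jpg",
}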
In the next step, we will utilize the hdurl key to access the URL associated with the high-definition astronomy image of the day. Since I am an enthusiast and want the highest quality image available, I have decided that the highest definition available meets my user needs. This is a great example of a time to determine whether your users desire or need the highest quality available or whether there is an opportunity to deliver a product that meets their needs at a lower cost or lower memory requirement.
We store response['hdurl'] in the today_image variable for use in the next step, where we save the image. Storing hdurl this way allows the string to be manipulated later on:
Figure 4.6: Saving the hdurl response in a variable
Finally, we download the image from hdurl and append date.today() to the filename to create a new name for the image each day. This is so that an image from yesterday does not have the same name as an image from today, thus reducing the risk of overwrites. There are additional ways to reduce the risk of overwrites, especially when creating an automated system, but this was chosen as the simplest option for our needs:
Figure 4.7: Writing the image content to a local file
Figure 4.8: The image file we saved in the local repository or folder
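As one illustrative alternative to relying on the date-based filename alone (not from the book), you could skip the download entirely when a file for today's date already exists:

# Illustrative alternative: only fetch and write if today's file is missing
from datetime import date
from pathlib import Path
import requests

target = Path(f"todays_image_{date.today()}.png")
if not target.exists():
    target.write_bytes(requests.get(today_image).content)  # today_image from earlier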
This walk-through in Jupyter Notebook may seem excessive, but taking the time to ensure the API is working and thinking through the logic of the common steps that need to be automated or repeated can be extremely beneficial when stepping into creating the Airflow DAG.
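To preview where this is headed, here is a minimal sketch of how the tested notebook logic might be wrapped in an Airflow DAG; the dag_id, schedule, and task body are our own assumptions for illustration, not the book's DAG.

# Minimal illustrative Airflow DAG (assumes Airflow 2.x); names are placeholders
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_apod_image():
    # The requests logic tested in the notebook above would go here.
    ...

with DAG(
    dag_id="nasa_apod_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    fetch_task = PythonOperator(
        task_id="fetch_apod_image",
        python_callable=fetch_apod_image,
    )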
Apache Airflow Best Practices was published in October 2024.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!