3 Days. 25+ AI Experts. 30+ Sessions.
Join the Generative AI In Action conference from Nov 11-13 (LIVE | Virtual) and gain insights from top AI leaders across over 30 sessions. Explore key topics including GenAI tools, AI Agents, Open-Source LLMs, Small Language Models, LLM fine-tuning, and many more! This is your opportunity to dive deep into cutting-edge AI strategies and technologies.
Save 40% with our Early Bird offer using code BIGSAVE40 – don’t miss out!
Hi ,
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published book, Machine Learning and Generative AI for Marketing, which discusses how to create effective prompts for Zero-Shot Learning to generate high-quality marketing content.
And today’s Featured Study examines the performance of open-source models like Mistral and LLaMa, offering insights into the hardware needed to deploy them efficiently using GPUs and optimisation techniques such as quantisation.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: With this issue, we have finished covering all content requests made through the September feedback survey. Stay tuned for next month's survey.
Take this shorter version of the Developer Nation survey, learn about new tools, influence the future of development and share your insights with the world!
A virtual goody bag packed with cool resources
The more questions you answer, the more chances you have to win amazing prizes, including a Samsung Galaxy Watch 7!
The study "Deploying Open-Source Large Language Models: A Performance Analysis", conducted by Bendi-Ouis et al., compares the performance of open-source large language models. The study aims to assist organisations in evaluating the hardware requirements for efficiently deploying models like Mistral and LLaMa.
Since the release of ChatGPT in November 2022, interest in deploying large language models has grown rapidly. Many organisations and institutions are keen to harness LLMs, but the computational demands remain a challenge. While proprietary models require substantial resources, open-source models like Mistral and LLaMa offer alternatives that can be deployed on more modest hardware. This study explores how different hardware configurations and optimisation techniques, such as quantisation, can make these models more accessible to public and private entities.
For organisations and developers seeking to deploy LLMs, the analysis offers practical guidance on the hardware requirements and optimisation techniques needed for efficient deployment. With moderate hardware investment, open-source models can perform competitively, reducing dependence on proprietary systems, lowering costs, and giving organisations greater control over their digital resources and, by extension, their digital sovereignty.
The researchers focused on GPU performance and model quantisation to measure how efficiently LLMs could be deployed. Using vLLM, a Python library designed for inference optimisation, they tested multiple models and configurations. For instance, Mistral-7B, run on two V100 16GB GPUs, showed response times rising as the number of simultaneous requests grew, highlighting the challenge of scaling to larger user bases.
Quantisation emerged as a key method for reducing computational load, letting models use less memory by lowering numerical precision from 16 or 32 bits to 4 or 8 bits. The technique proved effective even for larger models, maintaining performance without significant loss of accuracy.
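To make the numbers concrete: at 16-bit precision, Mistral-7B's roughly seven billion parameters occupy about 14 GB for the weights alone, which is why the study pairs two 16 GB V100s; 4-bit quantisation shrinks that footprint to roughly 3.5 GB. Below is a minimal sketch of the kind of deployment the study describes, using vLLM's offline inference API. The AWQ-quantised checkpoint name is an assumption for illustration; any compatible quantised model would do.

from vllm import LLM, SamplingParams

# Load a 4-bit AWQ-quantised Mistral-7B (checkpoint name is illustrative)
# and split it across two GPUs, mirroring the study's 2x V100 16GB setup.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
)

# Batch several prompts at once to mimic simultaneous user requests;
# vLLM schedules them together, which is the load the study measures.
prompts = [f"Answer briefly: what is request {i} about?" for i in range(8)]
params = SamplingParams(temperature=0.7, max_tokens=100)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)

vLLM's continuous batching is what lets a single deployment absorb many concurrent requests before response times degrade, which is the scaling behaviour the researchers observed.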
The study concluded that, although proprietary solutions like ChatGPT require significant resources, open-weight models like Mistral and LLaMa can deliver strong performance on commercially available GPUs. By deploying these models with vLLM and quantisation, organisations can achieve scalable, efficient AI deployment without excessive hardware costs.
You can learn more by reading the entire paper here.
Here’s an excerpt from “Chapter 9: Creating Compelling Content with Zero-Shot Learning” in the book, Machine Learning and Generative AI for Marketing by Yoon Hyup Hwang and Nicholas C. Burtch, published in August 2024.
Creating an effective prompt
Creating an effective prompt is the most crucial step in leveraging ZSL for marketing copy. In ZSL, the prompt effectively becomes the instruction manual for a model, telling it what kind of content to generate, as well as its style, tone, and substance.
The following are some guidelines for formulating prompts that will elicit the best possible marketing copy from the model:
In the following sections, we will illustrate the impact of prompt quality through examples, with different types of marketing copy. While good prompts elicit detailed, relevant, and engaging content, poor prompts can lead to vague and uninspiring outputs. To generate these responses, we will define the following function:
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def generate_response(prompt, model="gpt-4", max_tokens=100, temperature=0.7, n=1):
    # Send the prompt as a single user message and return the first completion
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        n=n,
        temperature=temperature)
    return response.choices[0].message.content
This function will be used with different prompt types in the examples that follow.
Example 1: Product descriptions
In this example, we will generate product descriptions for our e-commerce brand, which is launching new lines of eco-friendly kitchenware.
The following is an example of a poor prompt:
poor_product_description = "Talk about bamboo cutlery."
generate_response(poor_product_description)
This produces:
Bamboo cutlery is a type of eating utensil made from bamboo, a fast-growing renewable resource. This form of cutlery includes items such as forks, knives, spoons, and chopsticks. They are an eco-friendly alternative to plastic disposable cutlery due to their reusable nature and biodegradable properties. Bamboo cutlery is lightweight and durable, making it a popular choice for camping, picnics, and travel.
Now, let’s look at the following example of a good prompt:
good_product_description = "Write a captivating description for a bamboo cutlery set designed for eco-conscious consumers, emphasizing its sustainability and style."
generate_response(good_product_description)
This prompt produces the following output:
Discover the perfect harmony of sophistication and sustainability with our Bamboo Cutlery Set. Designed for the eco-conscious consumer, this set is not only a statement of your commitment to the environment but also a reflection of your exquisite taste. Each piece is crafted from 100% organic bamboo, a renewable resource that grows significantly faster than traditional hardwoods. This set boasts a sleek, minimalist design that perfectly complements any table setting, from casual picnic to elegant dinner parties.
From a marketing perspective, this example demonstrates the importance of crafting detailed, audience-specific prompts with clear requirements when using ZSL for product descriptions, and shows how doing so yields more specific generated responses. It is worth noting, however, that older consumers may value more straightforward, factual information and may therefore find the generic prompt’s response more engaging. Tailoring GenAI outputs to the individual consumer can be crucial as well, a topic discussed in Chapter 11.
Packt library subscribers can continue reading the entire book for free. You can buy Machine Learning and Generative AI for Marketing here.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!