





















































Hi,
In today’s Expert Insight we bring you an excerpt from the recently published book, Generative AI on Google Cloud with LangChain, which discusses how LLMs generate plausible but sometimes false responses (hallucinations), and demonstrates how structured prompting with LangChain can help mitigate the issue.
News Highlights: Meta's Llama flaw exposes AI servers to remote code execution via Python’s pickle; Codon’s 2025 update brings a faster, open-source NumPy with GPU support; Codegen, a Python library for code refactoring and analysis, is now open source.
My top 5 picks from today’s learning resources:
And, in From the Cutting Edge, we introduce Web2Vec, a Python library that automates website analysis by extracting over 200 structured parameters through web crawling and direct feature extraction, with potential applications in cybersecurity, SEO, and machine learning.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
[...] pickle module.
[...] itertools, collections, and datetime, simulates birthday collisions, analyzes probability with brute-force, and more.
[...] pygame library, covering setting up the environment, storing and playing sounds, representing music in code, and more.
In "Web2Vec: A Python Library for Website-to-Vector Transformation," D. Frąszczak and E. Frąszczak introduce Web2Vec, a Python library that converts websites into structured vector representations. The library automates feature extraction from web pages, integrating over 200 parameters from website content and structure to enable efficient analysis.
Website processing involves extracting and transforming web data for analysis. This includes web crawling, which systematically navigates websites to collect data, and web scraping, which extracts specific information from web pages. Website feature extraction identifies key attributes such as structure, security settings, and external references, while vectorisation converts unstructured data into numerical formats for computational analysis.
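To make the feature extraction and vectorisation steps concrete, here is a minimal sketch using only the Python standard library. It is not Web2Vec's API: the feature names, the `extract_features` and `vectorise` helpers, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href target of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_features(url, html):
    """Return a dict of illustrative (hypothetical) features for one page."""
    parsed = urlparse(url)
    collector = LinkCollector()
    collector.feed(html)
    # A link counts as external when it names a host other than the page's own.
    external = [
        h for h in collector.hrefs
        if urlparse(h).netloc and urlparse(h).netloc != parsed.netloc
    ]
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "num_links": len(collector.hrefs),
        "num_external_links": len(external),
    }


def vectorise(features, order):
    """Turn the feature dict into a fixed-order numeric vector."""
    return [float(features[name]) for name in order]


FEATURE_ORDER = ["url_length", "uses_https", "num_links", "num_external_links"]

html = '<a href="/about">About</a> <a href="https://other.org/x">Other</a>'
features = extract_features("https://example.com/", html)
vector = vectorise(features, FEATURE_ORDER)
print(vector)
```

Fixing the feature order up front is what makes vectors from different pages comparable, which is the property a downstream model would rely on.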
The internet hosts over a billion websites, with millions actively generating data daily. Extracting insights is essential for market research, cybersecurity, and machine learning. While many research papers discuss web data collection, they often rely on custom scripts, leading to inefficiencies. Existing services like WHOIS, SimilarWeb, and Google Search Index provide valuable data but restrict free API access, requiring users to parse raw HTML instead. Web2Vec addresses these challenges with a unified, open-source solution for automated website data extraction and analysis.
Web2Vec offers a structured approach to web data collection and analysis through:
[...] (pip install web2vec), with community support for updates and improvements.
Web2Vec is a valuable tool for professionals and researchers working with web data. Data scientists can automate website feature extraction for large-scale analysis, while cybersecurity professionals can detect phishing and misinformation using structured data and threat intelligence services. SEO and marketing professionals can benefit from its ability to analyse rankings, metadata, and traffic sources. Developers and web scraping practitioners can replace custom scripts with Web2Vec’s unified framework, reducing redundancy and improving efficiency.
Web2Vec processes websites using two primary methods: direct extraction from pre-generated HTML snapshots and active web crawling with Scrapy-based spiders. It extracts 211 parameters across multiple categories, converting website characteristics into structured vector representations. The extracted data, stored in JSON format, includes details such as URL, title, HTML response, and status codes. Users can either crawl entire websites using the Web2VecSpider or extract specific parameters through direct method invocation. Web2Vec also supports visualising website relationships using directed graphs, making it particularly useful for cybersecurity applications such as phishing detection and misinformation analysis.
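As a rough illustration of how crawl records stored as JSON could feed a directed graph of website relationships, the sketch below uses only the standard library. The record layout (in particular the `links` field), the sample domains, and the `build_domain_graph` helper are assumptions for this example and do not reflect Web2Vec's actual output schema or graph tooling.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical crawl output: one JSON object per crawled page.
records_json = """
[
  {"url": "https://a.example/", "title": "A", "status_code": 200,
   "links": ["https://b.example/", "https://a.example/contact"]},
  {"url": "https://b.example/", "title": "B", "status_code": 200,
   "links": ["https://c.example/login"]}
]
"""


def build_domain_graph(records):
    """Map each source domain to the sorted list of other domains it links to."""
    graph = defaultdict(set)
    for record in records:
        source = urlparse(record["url"]).netloc
        for link in record["links"]:
            target = urlparse(link).netloc
            # Skip relative links and links that stay on the same domain.
            if target and target != source:
                graph[source].add(target)
    return {src: sorted(dst) for src, dst in graph.items()}


records = json.loads(records_json)
graph = build_domain_graph(records)
print(graph)
```

An adjacency mapping like this is enough to ask simple questions, such as which domains a suspicious page points to, before reaching for a dedicated graph library.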
You can learn more by reading the entire paper or accessing the library on GitHub.
Here’s an excerpt from “Chapter 3: Grounding Responses” in the book, Generative AI on Google Cloud with LangChain by Leonid Kuligin, Jorge Zaldívar, and Maximilian Tschochohei.
LLMs are trained on large amounts of publicly available data (please take a look at Appendix 1 for more details on how LLMs are trained). By design, they absorb the information they are explicitly given in the prompt and the information they have seen during training.
By default, LLMs don’t have access to any external information (except for what they have already memorized), and in most cases, they’re autoregressive models (they predict output words or tokens one by one by looking at the previous input, and this limits their reasoning capabilities). We’ll see some examples of how we can expand an LLM’s reasoning capabilities with agentic workflows in Chapters 9, 10, and 11.
Simply put, LLMs will use the information from their training to respond to a prompt. LLMs effectively reproduce human language and, therefore, their answers sound very credible even if they are just a probabilistic continuation of the prompt. This can be correct if the training corpus has the right information, or it can just be plausible but factually wrong. If it’s the latter, the LLM is said to hallucinate.
Let’s take a simple example:
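The idea of a "probabilistic continuation" can be sketched with a toy next-word table. This is only an illustration, not how a real LLM is implemented: the vocabulary, the probabilities, and the `generate` helper are invented for the example, and greedy selection stands in for real sampling.

```python
# Toy next-word table: for each word, candidate next words with made-up
# probabilities. A real model learns a far richer distribution from data.
NEXT_WORD = {
    "black": {"cucumbers": 0.6, "pepper": 0.4},
    "cucumbers": {"are": 0.9, "grow": 0.1},
    "are": {"native": 0.7, "green": 0.3},
    "native": {"to": 1.0},
    "to": {"Asia": 0.8, "Mars": 0.2},
}


def generate(prompt_word, max_tokens=5):
    """Extend the prompt one word at a time, always picking the likeliest word."""
    words = [prompt_word]
    for _ in range(max_tokens):
        candidates = NEXT_WORD.get(words[-1])
        if not candidates:
            break  # no known continuation for this word
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)


text = generate("black")
print(text)  # black cucumbers are native to Asia
```

Nothing in the loop checks whether the sentence is true. Each step only asks which word is likely to come next, which is why the output can read fluently while being unsupported.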
Prompt: Tell me about the black cucumbers.
Response:
Black Cucumbers
Black cucumbers, scientifically known as Cucumis sativus var. Nigra, are an intriguing variety of cucumber characterized by their unique dark green to black skin. They are not genetically modified but instead occur naturally due to a higher concentration of chlorophyll.
Origin and Cultivation
Black cucumbers are native to Europe and Asia, particularly Russia and China. They are grown in warm, moist climates and require well-drained soil rich in organic matter. The fruits typically mature within 60-70 days after planting.
Physical Characteristics
Skin: Deep green to black, with occasional lighter... (The response was truncated because it has reached the token limit. Try to increase the token limit if you need a longer response.)
At least to my knowledge, black cucumbers don’t exist. But the LLM has produced a response based on a description of actual cucumbers that looks plausible and real. In the summer of 2023, a US lawyer used an LLM to answer legal questions. The LLM cited non-existent cases, but they looked so trustworthy that the lawyer used them in court and got into trouble [1].
Sometimes hallucinations can be overcome by prompt engineering. Let’s look at the following prompt template:
from langchain.chains import LLMChain
from langchain.prompts.prompt import PromptTemplate
from langchain_google_vertexai import VertexAI

# Model settings as given in the book; the output is capped at 128 tokens.
llm = VertexAI(model_name="gemini-1.0-pro",
               temperature=0.8, max_output_tokens=128)

# The prompt asks the model to check that the plant exists before describing it.
template = """Describe {plant}.
First, think whether {plant} exist.
If {plant} don't exist, answer "I don't have enough information about {plant}".
Otherwise, give their title, a short summary and then talk about origin and cultivation.
After that, describe their physical characteristics.
"""
prompt_template = PromptTemplate(
    input_variables=["plant"],
    template=template,
)

# Fill the template with a plant name and send the prompt to the model.
chain = LLMChain(llm=llm, prompt=prompt_template)
print(chain.run(plant="black cucumbers"))
If we run this chain, we’ll get a relevant answer:
I don't have enough information about black cucumbers.
You can double-check and ask the question about green cucumbers to make sure that the LLM will give a correct answer with this prompt.
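One way to run that double-check without calling a model is to inspect the rendered prompt and to recognise the refusal sentence in a response. The sketch below is based on the template above and uses plain `str.format`, on the assumption that it approximates what the prompt template produces for a single variable. The `render` and `is_refusal` helpers are hypothetical names for this example.

```python
TEMPLATE = """Describe {plant}.
First, think whether {plant} exist.
If {plant} don't exist, answer "I don't have enough information about {plant}".
Otherwise, give their title, a short summary and then talk about origin and cultivation.
After that, describe their physical characteristics.
"""


def render(plant):
    """Fill every {plant} placeholder with the given plant name."""
    return TEMPLATE.format(plant=plant)


def is_refusal(response, plant):
    """True if the response opens with the refusal sentence the prompt asks for."""
    expected = f"I don't have enough information about {plant}"
    return response.strip().startswith(expected)


# Show the exact text the model would receive for each plant.
for plant in ("black cucumbers", "green cucumbers"):
    print(render(plant))

refused = is_refusal(
    "I don't have enough information about black cucumbers.", "black cucumbers"
)
print(refused)  # True
```

Checking for the exact refusal sentence only works because the prompt dictates its wording. A model that paraphrases the refusal would need a looser match.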
Hallucinations
Hallucinations are one of the key problems that the industry is facing at the moment.
The good news: There are ways to significantly reduce hallucination rates, and we’re going to discuss them in this and the next chapters.
The bad news: Any generative AI (GenAI) might produce hallucinations, and you need to evaluate and monitor them during application development. We’ll talk about evaluation in Chapter 14.
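As a very rough sketch of what monitoring could look like, the snippet below measures how often a model describes a made-up subject instead of declining. It is not the evaluation approach from the book: the probe set, the `stub_model` stand-in (which replaces a real LLM call), and the `hallucination_rate` helper are all assumptions for illustration.

```python
REFUSAL = "I don't have enough information"

# Hypothetical probe set: (subject, whether we treat it as real).
PROBES = [
    ("black cucumbers", False),
    ("glass potatoes", False),
    ("green cucumbers", True),
]


def stub_model(prompt):
    """Stand-in for a real LLM call: refuses for one subject, invents otherwise."""
    if "black cucumbers" in prompt:
        return f"{REFUSAL} about black cucumbers."
    return "They are a popular variety grown in warm climates."


def hallucination_rate(model, probes):
    """Share of fictional subjects that the model described instead of refusing."""
    fictional = [subject for subject, exists in probes if not exists]
    if not fictional:
        return 0.0
    misses = sum(
        1 for subject in fictional
        if REFUSAL not in model(f"Describe {subject}.")
    )
    return misses / len(fictional)


rate = hallucination_rate(stub_model, PROBES)
print(rate)  # 0.5
```

Swapping `stub_model` for a function that calls a real model would let the same probe set be rerun whenever the prompt or the model changes.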
Generative AI on Google Cloud with LangChain was published in December 2024. Packt library subscribers can continue reading the entire book for free.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!