community[minor]: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem #24034

TupleType · 2024-07-09T17:58:18Z

Description: The search endpoint has a limit of 500 results; playlistItems doesn't. Also added an exception class to the except clause to catch another common error.
Issue: None
Dependencies: None
Twitter handle: @TupleType

TupleType · 2024-07-09T17:59:06Z

I got a non-descriptive linting error. Can someone please help?
langchain_community/document_loaders/youtube.py:from xml.etree.ElementTree import ParseError
make: *** [Makefile:49: lint] Error 1

eyurtsev · 2024-07-11T15:26:08Z

libs/community/langchain_community/document_loaders/youtube.py

@@ -7,6 +7,7 @@
 from pathlib import Path
 from typing import Any, Dict, Generator, List, Optional, Sequence, Union
 from urllib.parse import parse_qs, urlparse
+from xml.etree.ElementTree import ParseError


You can do this:

from xml.etree.ElementTree import ParseError # OK: user-must-opt-in

The linter is flagging the fact that the underlying library is relying on the built-in XML library.

It's essentially surfacing this:

https://github.com/jdepoix/youtube-transcript-api/blob/master/youtube_transcript_api/_transcripts.py#L10

Depending on the user's environment, the built-in library may or may not have vulnerabilities:
https://docs.python.org/3/library/xml.html#xml-vulnerabilities

Do you know whether the underlying library is reading from a Google-provided XML file? If so, while not ideal, this is probably fine.

Could you add a security note to the GoogleAPIClient saying that it relies on the standard xml library, but we're viewing the input as trusted in this case?

I don't know the library well enough to determine that, but this is the error I'm getting for https://www.youtube.com/watch?v=pM8e0Dbzopk:

  File "C:\Users\user\repos\rag-demo\loader.py", line 71, in _get_document_for_playlist
    transcript = self._get_transcripe_for_video_id(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\repos\rag-demo\.venv\Lib\site-packages\langchain_community\document_loaders\youtube.py", line 410, in _get_transcripe_for_video_id
    transcript_pieces = transcript.fetch()
                        ^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\repos\rag-demo\.venv\Lib\site-packages\youtube_transcript_api\_transcripts.py", line 292, in fetch
    return _TranscriptParser(preserve_formatting=preserve_formatting).parse(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\repos\rag-demo\.venv\Lib\site-packages\youtube_transcript_api\_transcripts.py", line 358, in parse
    for xml_element in ElementTree.fromstring(plain_data)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\xml\etree\ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 57, column 99
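
For readers following the thread, here is a minimal sketch of the failure mode in the traceback above and of what catching ParseError in an except clause buys. It uses only the standard library; the helper name parse_transcript_xml and the choice to return None are hypothetical and are not taken from the loader or from youtube-transcript-api.

```python
from typing import List, Optional
from xml.etree import ElementTree
from xml.etree.ElementTree import ParseError  # OK: user-must-opt-in


def parse_transcript_xml(plain_data: str) -> Optional[List[str]]:
    """Return the text of each child element, or None if the XML is malformed.

    Hypothetical helper for illustration only; the real loader wraps the
    transcript fetch rather than parsing XML itself.
    """
    try:
        root = ElementTree.fromstring(plain_data)
    except ParseError:
        # Same exception type as in the traceback: the payload is not
        # well-formed XML, so signal "no transcript" instead of crashing.
        return None
    return [element.text or "" for element in root]
```

With this shape, a single malformed transcript is skipped instead of aborting the load of a whole playlist or channel.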

TupleType · 2024-07-17T13:33:13Z

The dependency checks failed, but there are no logs.

eyurtsev · 2024-07-16T14:49:41Z

libs/community/langchain_community/document_loaders/youtube.py

@@ -452,34 +463,32 @@ def _get_document_for_channel(self, channel: str, **kwargs: Any) -> List[Documen
            )

        channel_id = self._get_channel_id(channel)
-        request = self.youtube_client.search().list(
+        uploads_playlist_id = self._get_uploads_playlist_id(channel_id)


Is this change backwards compatible? I don't use the youtube API at all. Will this break anything for existing users?

Yes, it's backward compatible.
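
As a rough illustration of the approach in this diff, the sketch below looks up a channel's uploads playlist and then pages through playlistItems with nextPageToken. It assumes a YouTube Data API v3 client built with google-api-python-client; the function names and bodies are hypothetical and are not the PR's actual _get_uploads_playlist_id implementation, which is not shown in this thread.

```python
from typing import Any, Dict, Iterator, Optional


def get_uploads_playlist_id(youtube_client: Any, channel_id: str) -> str:
    """Look up the channel's "uploads" playlist via channels().list."""
    response = (
        youtube_client.channels()
        .list(part="contentDetails", id=channel_id)
        .execute()
    )
    return response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]


def iter_video_ids(youtube_client: Any, playlist_id: str) -> Iterator[str]:
    """Yield every video ID in a playlist, following nextPageToken."""
    page_token: Optional[str] = None
    while True:
        params: Dict[str, Any] = {
            "part": "contentDetails",
            "playlistId": playlist_id,
            "maxResults": 50,  # largest page size the endpoint accepts
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube_client.playlistItems().list(**params).execute()
        for item in response.get("items", []):
            yield item["contentDetails"]["videoId"]
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no further pages
```

Because the function only calls channels().list and playlistItems().list, it can be exercised against a stub client without network access or an API key.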

…ocument_for_channel from search to playlistItem (langchain-ai#24034)

- **Description:** Search has a limit of 500 results, playlistItems doesn't. Added a class in except clause to catch another common error.
- **Issue:** None
- **Dependencies:** None
- **Twitter handle:** @TupleType

---------

Co-authored-by: asi-cider <88270351+asi-cider@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>

Replace API used in _get_document_for_channel from search to playlist…

e688a16

…Items because search has a limit of 500 results. Added a class in except clause to catch another common error.

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jul 9, 2024

dosubot bot added community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Jul 9, 2024

eyurtsev reviewed Jul 11, 2024

eyurtsev self-assigned this Jul 11, 2024

Merge branch 'master' into master

25db0c2

Add security note

913b36e

Add comment for linter

bf27316

TupleType and others added 2 commits July 17, 2024 16:26

formatting

675ec5c

Merge branch 'master' into master

b0b0021

TupleType requested a review from eyurtsev July 17, 2024 13:32

eyurtsev reviewed Jul 18, 2024

eyurtsev added 2 commits July 17, 2024 22:05

Merge branch 'master' into master

3d4d0bd

x

e77c4f2

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jul 18, 2024

eyurtsev reviewed Jul 18, 2024

eyurtsev changed the title ~~community: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem~~ community[minor]: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem Jul 18, 2024

eyurtsev approved these changes Jul 18, 2024

dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jul 18, 2024

Merge branch 'master' into master

eed1d1e

eyurtsev merged commit 372c27f into langchain-ai:master Jul 19, 2024
42 checks passed
