Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community[minor]: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem #24034

Merged
merged 9 commits into from
Jul 19, 2024

Conversation

TupleType
Copy link
Contributor

  • Description: Search has a limit of 500 results, playlistItems doesn't. Added a class in except clause to catch another common error.
  • Issue: None
  • Dependencies: None
  • Twitter handle: @TupleType

…Items because search has a limit of 500 results.

Added a class in except clause to catch another common error.
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jul 9, 2024
Copy link

vercel bot commented Jul 9, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 18, 2024 1:04pm

@dosubot dosubot bot added community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Jul 9, 2024
@TupleType
Copy link
Contributor Author

I got a non descriptive linting error, can someone please help?
langchain_community/document_loaders/youtube.py:from xml.etree.ElementTree import ParseError
make: *** [Makefile:49: lint] Error 1

@@ -7,6 +7,7 @@
from pathlib import Path
from typing import Any, Dict, Generator, List, Optional, Sequence, Union
from urllib.parse import parse_qs, urlparse
from xml.etree.ElementTree import ParseError
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this:

from xml.etree.ElementTree import ParseError # OK: user-must-opt-in

The linter is flagging the fact that the underlying library is relying on the built-in XML library.

It's essentially surfacing this:

https://github.com/jdepoix/youtube-transcript-api/blob/master/youtube_transcript_api/_transcripts.py#L10

Depending on user environment the built in may or may not have vulnerabilities:
https://docs.python.org/3/library/xml.html#xml-vulnerabilities

Do you know if the underlying library reading from a google provided XML file? If so, while not amazing this is probably fine.

Could you add a security note to the GoogleAPIClient saying that it relies on the standard xml library, but we're viewing the input as trusted in this case?

Copy link
Contributor Author

@TupleType TupleType Jul 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the library well enough to determine that but this is the error I'm getting for https://www.youtube.com/watch?v=pM8e0Dbzopk:

File "C:\Users\user\repos\rag-demo\loader.py", line 71, in _get_document_for_playlist
    transcript = self._get_transcripe_for_video_id(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\repos\rag-demo\.venv\Lib\site-packages\langchain_community\document_loaders\youtube.py", line 410, in _get_transcripe_for_video_id
    transcript_pieces = transcript.fetch()
                        ^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\repos\rag-demo\.venv\Lib\site-packages\youtube_transcript_api\_transcripts.py", line 292, in fetch
    return _TranscriptParser(preserve_formatting=preserve_formatting).parse(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\repos\rag-demo\.venv\Lib\site-packages\youtube_transcript_api\_transcripts.py", line 358, in parse
    for xml_element in ElementTree.fromstring(plain_data)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\xml\etree\ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 57, column 99

@eyurtsev eyurtsev self-assigned this Jul 11, 2024
@TupleType TupleType requested a review from eyurtsev July 17, 2024 13:32
@TupleType
Copy link
Contributor Author

dependency checks failed but it doesn't have logs

@@ -452,34 +463,32 @@ def _get_document_for_channel(self, channel: str, **kwargs: Any) -> List[Documen
)

channel_id = self._get_channel_id(channel)
request = self.youtube_client.search().list(
uploads_playlist_id = self._get_uploads_playlist_id(channel_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change backwards compatible? I don't use the youtube API at all. Will this break anything for existing users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it's backward compatible

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jul 18, 2024
@eyurtsev eyurtsev changed the title community: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem community[minor]: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem Jul 18, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jul 18, 2024
@eyurtsev eyurtsev merged commit 372c27f into langchain-ai:master Jul 19, 2024
42 checks passed
olgamurraft pushed a commit to olgamurraft/langchain that referenced this pull request Aug 16, 2024
…ocument_for_channel from search to playlistItem (langchain-ai#24034)

- **Description:** Search has a limit of 500 results, playlistItems
doesn't. Added a class in except clause to catch another common error.
- **Issue:** None
- **Dependencies:** None
- **Twitter handle:** @TupleType

---------

Co-authored-by: asi-cider <88270351+asi-cider@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging. size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants