Content in <script> should be ignored. #13

Closed
jasny opened this issue Dec 27, 2013 · 8 comments

@jasny
Contributor

jasny commented Dec 27, 2013

Embedding this article returns an invalid image URL with an <%= image > tag in it.

If I look at the source, I see that the image tag is actually inside a <script type="text/template"> tag. This is common when a website uses a JavaScript templating engine.

@oscarotero
Collaborator

Thanks for the feedback.
I was trying to solve this, but it seems that PHP's DOMDocument has some trouble with these scripts. Script elements cannot contain child nodes (only a text value), so DOMDocument shouldn't find this img (because it's not a node), but it does. Even after removing all script elements from the document, the image is still there, so I guess DOMDocument's HTML parser closes the script before the img and treats this code as real nodes.
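
A rough way to check the behaviour described above is to count the img elements DOMDocument reports for a page whose only image sits inside a template script. This is only a sketch: `countImagesSeenByDom` and the sample markup are made up for illustration and are not part of the library, and the count for the template case is deliberately not stated because it may depend on the libxml2 version in use.

```php
<?php
// Hypothetical reproduction helper, not code from the library.
function countImagesSeenByDom(string $html): int
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->getElementsByTagName('img')->length;
}

// Invented sample: the only <img> lives inside a template script.
$template = '<html><body><script type="text/template">'
          . '<div><img src="<%= image %>"></div>'
          . '</script></body></html>';

$count = countImagesSeenByDom($template);
// 0 would mean the template stayed plain text; a higher value would mean
// the parser turned the template into real nodes.
```

If the count is above zero for the template case, the parser has ended the script element early, which would match the guess above.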

@jasny
Contributor Author

jasny commented Dec 27, 2013

How about using a regular expression? I think this would work in 99% of the cases.

$html = preg_replace('~<script\b[^>]*>([^<]++|<)*</script>~i', '', $html);
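
To see what this pattern does, here is a small sketch run against invented sample markup (the pattern is the one above; the input strings are assumptions for illustration). It also shows one case that seems worth knowing about: because the repeated group is greedy and also accepts a bare `<`, a document with several script blocks is matched from the first opening tag to the last closing tag.

```php
<?php
// The pattern proposed above; the sample markup below is invented.
$pattern = '~<script\b[^>]*>([^<]++|<)*</script>~i';

// One template block: only the block itself is removed.
$single = '<p>a</p><script type="text/template">'
        . '<img src="<%= image %>"></script><p>b</p>';
$singleResult = preg_replace($pattern, '', $single);
// $singleResult is '<p>a</p><p>b</p>'

// Two blocks: the greedy group runs to the LAST </script>,
// so the paragraph between the blocks is removed as well.
$double = '<script>a()</script><p>keep</p><script>b()</script>';
$doubleResult = preg_replace($pattern, '', $double);
// $doubleResult is '' (an empty string)
```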

@oscarotero
Collaborator

I prefer not to use regular expressions because this can solve this problem but create others. The content of script elements can be anything and can contain other HTML elements (even JavaScript elements) as strings, commented code, templates, etc., so writing a solid regular expression that works in all cases is very difficult.

@jasny
Contributor Author

jasny commented Dec 27, 2013

True, in some cases it might cause trouble. Luckily I'm rather good at writing regular expressions 😄. So here's a fail-safe version:

preg_replace('%<script\b(?:"(?:[^"\\\\\\\\]++|\\\\\\\\.)*+"|\'(?:[^\\\\\\\\]++|\\\\\\\\.)*+\'|[^>"]++)*>(?:"(?:[^"\\\\\\\\]++|\\\\\\\\.)*+"|\'(?:[^\\\\\\\\]++|\\\\\\\\.)*+\'|//.*?\n|/\*(?:[^\*]++|\*)*?\*/|[^<"/]++|/|(?R)|<)*?</\s*script>%si', '', $html);

Breaking down the regex (leaving out the escaping needed for the PHP string):

  • <script\b(?:)*> - Match the script tag
    • "(?:[^"\\\\]++|\\\\.)*+" - Match any value between double quotes, escaped by a backslash
    • '(?:[^'\\\\]++|\\\\.)*+' - Match any value between single quotes, escaped by a backslash
    • [^>"']++ - Possessive match any character without a special meaning
  • (?:)*? - Match contents between begin and end tag. Lazy match finds the first available end tag.
    • "(?:[^"\\\\]++|\\\\.)*+" - Match any value between double quotes, escaped by a backslash
    • '(?:[^'\\\\]++|\\\\.)*+' - Match any value between single quotes, escaped by a backslash
    • //.*?\n - Match a comment starting with 2 slashes till the end of the line
    • /\*(?:[^\*]++|\*)*?\*/ - Match a comment between /* */
    • [^<'"/]++ - Possessive match any character without a special meaning
    • / - Match a slash (which is not the start of a comment)
    • (?R) - Recursive match, so matches sub <script> block. This makes sure the </script> tag is paired with the correct starting <script> tag.
    • < - Match a < character not being used in the recursive match (and not the end tag)
  • </\s*script> - Match the end tag

@oscarotero
Collaborator

Wow, that's pretty good!!
I've created a new branch to implement your regexp (d037640) and it works in this case (with http://www.usatoday.com/story/tech/2013/07/19/microsoft-stock-plummets-12/2569413/), but with other links (for example: http://www.politico.com/story/2013/12/presidents-barack-obama-george-w-bush-second-term-101314.html) there is a timeout.
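
If the problem comes from the regex engine itself (an assumption; the thread does not say where the timeout occurs), it can help to check whether PCRE gave up. `preg_replace()` returns null on error and `preg_last_error()` reports the reason, for example `PREG_BACKTRACK_LIMIT_ERROR`. A minimal sketch, with `stripScriptsOrKeep` as a hypothetical helper name and the simpler pattern from earlier in the thread:

```php
<?php
// Hypothetical helper, not from the thread: strip script blocks, but keep
// the original markup if PCRE fails instead of silently losing the page.
function stripScriptsOrKeep(string $html, string $pattern): string
{
    $result = preg_replace($pattern, '', $html);

    if ($result === null) {
        // preg_last_error() returns a code such as PREG_BACKTRACK_LIMIT_ERROR.
        error_log('Script stripping failed, PCRE error code ' . preg_last_error());
        return $html;
    }

    return $result;
}

$pattern = '~<script\b[^>]*>([^<]++|<)*</script>~i';
$clean = stripScriptsOrKeep('<p>a</p><script>var x = 1;</script>', $pattern);
// $clean is '<p>a</p>'
```

Falling back to the unmodified HTML is a design choice for this sketch; a caller could just as well skip the page or raise an exception.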

@jasny
Contributor Author

jasny commented Dec 28, 2013

I see that the single quote is missing from the possessive match of non-special characters. That might be causing the problem.

@jasny
Contributor Author

jasny commented Dec 28, 2013

My approach is actually wrong. If there is a </script> tag inside a JavaScript string, it's still a valid end tag. You need to either escape the < as &lt;, or put it in a <![CDATA[ ]]> section or an <!-- --> comment.
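
Taken at face value, this means the pattern does not need to understand strings or comments at all: it can simply pair each opening tag with the first closing tag that follows. The sketch below is one way to write that, not necessarily what ended up in the library; `stripScriptBlocks` and the sample markup are invented, and the `[^>]*` part assumes no `>` appears inside an attribute value.

```php
<?php
// Sketch only: pair each opening <script> tag with the FIRST closing tag
// after it, without interpreting strings or comments in between.
function stripScriptBlocks(string $html): string
{
    // s: dot also matches newlines; i: case-insensitive; .*? is lazy.
    $result = preg_replace('~<script\b[^>]*>.*?</script\s*>~is', '', $html);

    // preg_replace() returns null on error; keep the input in that case.
    return $result ?? $html;
}

$html = '<script>a()</script><p>keep</p>'
      . '<script type="text/template"><img src="<%= image %>"></script>';
$clean = stripScriptBlocks($html);
// $clean is '<p>keep</p>'

// A closing tag inside a JavaScript string ends the block early,
// so the rest of the string is left behind as plain text.
$leftover = stripScriptBlocks('<script>var s = "</script>";</script>');
// $leftover is '";</script>'
```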

@oscarotero
Collaborator

Thank you, seems to work now. I'm going to add you as a contributor.
