Content in <script> should be ignored. #13

jasny · 2013-12-27T17:26:55Z

Embedding this article returns an invalid image URL with an <%= image > tag in it.

If I look at the source, I see that the image tag is actually inside a <script type="text/template"> tag. This is common when a website uses a JavaScript templating engine.


oscarotero · 2013-12-27T19:22:56Z

Thanks for the feedback.
I was trying to solve this, but it seems that PHP's DOMDocument object has some trouble with these scripts. Script elements cannot contain nodes (only a text value), so DOMDocument shouldn't find this img (because it's not a node), but it does find it. Even after removing all script elements from the document, the image still exists, so I guess the HTML parser of the DOMDocument class closes the script before the img and treats this code as real nodes.

jasny · 2013-12-27T19:43:31Z

How about using a regular expression? I think this would work in 99% of the cases.

$html = preg_replace('~<script\b[^>]*>([^<]++|<)*</script>~i', '', $html);

oscarotero · 2013-12-27T20:00:31Z

I prefer not to use regular expressions because they can solve this problem but generate others. The content of script elements can be anything and can contain other HTML elements (even JavaScript elements) as strings, commented code, templates, etc., so writing a solid regular expression that works in all cases is very difficult.

jasny · 2013-12-27T22:30:56Z

True, in some cases it might cause trouble. Luckily I'm rather good at writing regular expressions 😄. So here's a fail-safe version:

preg_replace('%<script\b(?:"(?:[^"\\\\\\\\]++|\\\\\\\\.)*+"|\'(?:[^\\\\\\\\]++|\\\\\\\\.)*+\'|[^>"]++)*>(?:"(?:[^"\\\\\\\\]++|\\\\\\\\.)*+"|\'(?:[^\\\\\\\\]++|\\\\\\\\.)*+\'|//.*?\n|/\*(?:[^\*]++|\*)*?\*/|[^<"/]++|/|(?R)|<)*?</\s*script>%si', '', $html);

Breaking down the regex (leaving out the escaping for the PHP string):

<script\b(?:)*> - Match the script tag
- "(?:[^"\\\\]++|\\\\.)*+" - Match any value between double quotes, escaped by a backslash
- '(?:[^'\\\\]++|\\\\.)*+' - Match any value between single quotes, escaped by a backslash
- [^>"']++ - Possessive match any character without a special meaning
(?:)*? - Match contents between begin and end tag. Lazy match finds the first available end tag.
- "(?:[^"\\\\]++|\\\\.)*+" - Match any value between double quotes, escaped by a backslash
- '(?:[^'\\\\]++|\\\\.)*+' - Match any value between single quotes, escaped by a backslash
- //.*?\n - Match a comment starting with 2 slashes till the end of the line
- /\*(?:[^\*]++|\*)*?\*/ - Match a comment between /* */
- [^<'"/]++ - Possessive match any character without a special meaning
- / - Match a slash (which is not the start of a comment)
- (?R) - Recursive match, so matches sub <script> block. This makes sure the </script> tag is paired with the correct starting <script> tag.
- < - Match a < character not being used in the recursive match (and not the end tag)
</\s*script> - Match the end tag
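
To make the quoted-string part of the breakdown concrete, here is a minimal sketch that runs only the double-quoted sub-pattern on its own. The function name findDoubleQuotedStrings and the sample input are made up for illustration and are not part of the code discussed in this issue; the backslash escaping shown is what a PHP single-quoted string needs to produce a single literal backslash in the regex.

```php
<?php
// Hypothetical helper for illustration only; not from the Embed library.
// Returns every double-quoted string literal found in $text, treating a
// backslash followed by any character as an escape sequence.
function findDoubleQuotedStrings(string $text): array
{
    // Regex as seen by PCRE: "(?:[^"\\]++|\\.)*+"
    // In a PHP single-quoted string, each literal regex backslash is written as \\
    // so \\\\ below becomes \\ in the regex, which matches one backslash.
    $pattern = '~"(?:[^"\\\\]++|\\\\.)*+"~s';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

// Sample JavaScript source containing an escaped quote inside a string.
$found = findDoubleQuotedStrings('var a = "say \\"hi\\""; var b = "x";');
print_r($found);
```

The possessive quantifiers (++ and *+) stop the engine from retrying alternative splits of the string body, which matters for performance once this sub-pattern is nested inside a larger alternation.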

oscarotero · 2013-12-28T11:09:43Z

Wow, that's pretty good!!
I've created a new branch to implement your regexp (d037640) and it works in this case (with http://www.usatoday.com/story/tech/2013/07/19/microsoft-stock-plummets-12/2569413/), but with other links (for example: http://www.politico.com/story/2013/12/presidents-barack-obama-george-w-bush-second-term-101314.html) there is a timeout.

jasny · 2013-12-28T11:57:51Z

I see that the single quote is missing from the possessive match of non-special characters. That might be causing the problem.

jasny · 2013-12-28T15:56:06Z

My approach is actually wrong. If there is a </script> tag inside a JavaScript string, it's just a valid end tag. You need to either escape the < as >, or put it in a <![CDATA[ ]]> or <!-- --> (comment).
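
Given that, a much simpler pattern that stops at the first end tag may be enough. The following is only a sketch under that assumption, not necessarily what was merged: the function name stripScriptElements is hypothetical, and the attribute part ([^>]*) assumes no attribute value contains a > character.

```php
<?php
// Hypothetical helper for illustration only; not the library's actual code.
// Removes each script element, ending at the FIRST </script> that follows the
// opening tag, regardless of JavaScript strings or comments in between.
function stripScriptElements(string $html): string
{
    // .*? is lazy, so the match ends at the nearest end tag.
    // The s flag lets . span newlines; the i flag ignores tag case.
    $result = preg_replace('~<script\b[^>]*>.*?</script\s*>~is', '', $html);

    // preg_replace() returns null on failure (e.g. backtrack limit reached);
    // fall back to the unmodified input in that case.
    return $result ?? $html;
}

$clean = stripScriptElements(
    '<p>a</p><script type="text/template"><img src="<%= image %>"></script><p>b</p>'
);
```

Note that an end tag inside a JavaScript string also ends the match, so whatever follows it stays in the output as ordinary markup.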

oscarotero · 2013-12-29T12:39:36Z

Thank you, seems to work now. I'm going to add you as a contributor.

jasny mentioned this issue Dec 28, 2013

Don't match JavaScript comments and strings when removing script tags. #14

Merged

oscarotero closed this as completed Dec 29, 2013

Coornifex mentioned this issue Mar 24, 2024

CURL error when embedding Twitter #531

Closed
