PHP preg_split on spaces, but not inside tags

I'm using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line); and when I run it on phpliveregex.com it produces this array:

 array(10 0=>test 1=>or 2=>oh 3=>yeah 4=>and 5=> 6=>oh 7=>yeah 8=> 9=>"ye we 'hold' it" ) 

That is NOT what I want. The input should be split on spaces only outside HTML tags, like this:

 array(5 0=>test 1=>or 2=>oh yeah 3=>and 4=>oh yeah 5=>"ye we 'hold' it" ) 

With this regex I can only add an exception for "double quotes", but I really need help adding more exceptions, such as for tags.


Any explanation of how that regex works would also be appreciated.

It's easier to use DOMDocument, since you don't need to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, tokenize it with preg_match_all (which is handier here than designing a pattern for preg_split):

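    // Sample input as it survives in this post (any inline markup it contained appears to have been stripped).
    $html = 'spaces in a text node test or oh yeah and oh yeah "ye we \'hold\' it" "unclosed double quotes at the end';

    $dom = new DOMDocument;
    // Wrap the fragment in a single root element so that documentElement has the
    // fragment's nodes as children. The wrapper tag was lost from the post; <div> is an assumption.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

    $nodeList = $dom->documentElement->childNodes;
    $results = [];

    foreach ($nodeList as $childNode) {
        if ($childNode->nodeType == XML_TEXT_NODE) {
            // Text node: collect unquoted words and double-quoted parts.
            // Whitespace-only text nodes produce no match and are skipped.
            if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m)) {
                $results = array_merge($results, $m[0]);
            }
        } else {
            // Any other node (an element, for example) is kept whole as HTML.
            $results[] = $dom->saveHTML($childNode);
        }
    }

    print_r($results);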

Note: I chose a default behavior for the case where a double-quoted part is left unclosed (no closing quote); feel free to change it.
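To make the note concrete, here is a minimal sketch of how the text-node pattern treats an unclosed quote, next to one possible variant. The sample string and the variant pattern (which simply drops the `?` after the closing quote) are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample text (not from the original post).
$text = 'and "ye we \'hold\' it" "unclosed at end';

// Pattern from the answer: the closing quote is optional, so an unclosed
// quoted part runs to the end of the text and is kept as one token.
preg_match_all('~[^\s"]+|"[^"]*"?~', $text, $m);
$keepUnclosed = $m[0];

// Variant (assumption): the closing quote is required, so the stray quote
// is dropped and the words after it are returned one by one.
preg_match_all('~[^\s"]+|"[^"]*"~', $text, $m);
$splitUnclosed = $m[0];

print_r($keepUnclosed);
print_r($splitUnclosed);
```

With the first pattern the last token is `"unclosed at end`; with the variant the tail becomes `unclosed`, `at`, `end`.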

Note 2: sometimes the LIBXML_ constants are not defined. You can work around this by checking for the constant before using it and defining it when needed:

 if (!defined('LIBXML_HTML_NOIMPLIED')) define('LIBXML_HTML_NOIMPLIED', 8192); 

Description

Instead of using a split command, simply match the sections you want:

    <(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*
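As a rough sketch of how this pattern might be used from PHP: the snippet below passes it to preg_match_all, then trims each match and drops empty ones. The pattern is written without any spaces after `<`, which is assumed to be the intended form; the `~` delimiters, the `s` flag, the sample input, and the variable names are also assumptions for illustration.

```php
<?php
// Pattern from the answer, assumed to contain no spaces after "<".
// A nowdoc avoids having to escape the quotes and backslashes in the pattern.
$pattern = <<<'REGEX'
~<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*~s
REGEX;

// Hypothetical input (not from the original post).
$input = 'test <b>oh yeah</b> and "ye we \'hold\' it"';

preg_match_all($pattern, $input, $m);

// The pattern can match the empty string, so trim each match and drop empties.
$chunks = array_values(array_filter(array_map('trim', $m[0]), 'strlen'));

print_r($chunks);
```

Note that the text between tags comes back as one chunk (here `and "ye we 'hold' it"`), so unquoted words would still need to be separated in a second step to get one array entry per word.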

Example

Live demo

https://regex101.com/r/bK8iL3/1

Sample text

Note the difficult edge case in the second paragraph.

 test or  this  oh yeah  and oh yeah Here we are "ye we 'hold' it" somegfsfdroides

Sample matches

 MATCH 1   0. [0-11]     `test`
 MATCH 2   0. [11-15]    ` or `
 MATCH 3   0. [15-38]    ` this `
 MATCH 4   0. [38-56]    ` oh yeah `
 MATCH 5   0. [56-61]    ` and `
 MATCH 6   0. [61-75]    `oh yeah`
 MATCH 7   0. [75-111]   ` Here we are "ye we 'hold' it" some`
 MATCH 8   0. [111-117]  ``
 MATCH 9   0. [117-121]  `gfsf`
 MATCH 10  0. [121-213]  `droides`
 MATCH 11  0. [213-224]  `

`
 MATCH 12  0. [224-237]  ``
 MATCH 13  0. [237-254]  ``
 MATCH 14  0. [254-261]  ``
 MATCH 15  0. [261-270]  ``
 MATCH 16  0. [270-277]  ``

Explanation

    NODE                EXPLANATION
    ----------------------------------------------------------------------
    <                   '<'
    (?:                 group, but do not capture:
      (?:               group, but do not capture:
        img             'img'
      )                 end of grouping
      (?=               look ahead to see if there is:
        [\s>\/]         any character of: whitespace (\n, \r, \t, \f, and " "), '>', '\/'
      )                 end of look-ahead
      (?:               group, but do not capture (0 or more times (matching the most amount possible)):
        [^>=]           any character except: '>', '='
       |                OR
        =               '='
        (?:             group, but do not capture:
          '             '\''
          [^']*         any character except: ''' (0 or more times (matching the most amount possible))
          '             '\''
         |              OR
          "             '"'
          [^"]*         any character except: '"' (0 or more times (matching the most amount possible))
          "             '"'
         |              OR
          [^'"\s>]*     any character except: ''', '"', whitespace (\n, \r, \t, \f, and " "), '>' (0 or more times (matching the most amount possible))
        )               end of grouping
      )*                end of grouping
      \s?               whitespace (\n, \r, \t, \f, and " ") (optional (matching the most amount possible))
      \/?               '/' (optional (matching the most amount possible))
      >                 '>'
     |                  OR
      (                 group and capture to \1:
        a               'a'
       |                OR
        span            'span'
       |                OR
        pre             'pre'
       |                OR
        code            'code'
       |                OR
        strong          'strong'
       |                OR
        b               'b'
       |                OR
        em              'em'
       |                OR
        i               'i'
      )                 end of \1
      (?=               look ahead to see if there is:
        [\s>\\]         any character of: whitespace (\n, \r, \t, \f, and " "), '>', '\\'
      )                 end of look-ahead
      (?:               group, but do not capture (0 or more times (matching the most amount possible)):
        [^>=]           any character except: '>', '='
       |                OR
        =               '='
        (?:             group, but do not capture:
          '             '\''
          [^']*         any character except: ''' (0 or more times (matching the most amount possible))
          '             '\''
         |              OR
          "             '"'
          [^"]*         any character except: '"' (0 or more times (matching the most amount possible))
          "             '"'
         |              OR
          [^'"\s>]*     any character except: ''', '"', whitespace (\n, \r, \t, \f, and " "), '>' (0 or more times (matching the most amount possible))
        )               end of grouping
      )*                end of grouping
      \s?               whitespace (\n, \r, \t, \f, and " ") (optional (matching the most amount possible))
      \/?               '/' (optional (matching the most amount possible))
      >                 '>'
      .*?               any character (0 or more times (matching the least amount possible))
      <                 '<'
      \/                '/'
      \1                what was matched by capture \1
      >                 '>'
    )                   end of grouping
   |                    OR
    (?:                 group, but do not capture (0 or more times (matching the most amount possible)):
      "                 '"'
      [^"]*             any character except: '"' (0 or more times (matching the most amount possible))
      "                 '"'
     |                  OR
      [^"<]*            any character except: '"', '<' (0 or more times (matching the most amount possible))
    )*                  end of grouping