123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657 |
- :ss:"<html><!-- comment --><body/></html>" / Sample xml text with a variety of tag types
- :ds:&'"<>"=\:ss / find the delimiters
- :sp:&'"!?/]-"=\:ss / find indices "helper" characters for (sp)ecial tag types
- / this produces a list of indices for each helper character
- ss@sp+1 / look for the characters following a helper character
- ss@sp-1 / look for the characters preceding a helper character
- (sp:&'"!?/]-"=\:ss)+\:/:-1 1 / Both!!
- ds[0]?sp-1 / See which characters preceding helper chars are opening delimeters
- ~^ds[0]?sp-1 / We don't need to know which particular delimiter
- / Just that there is one
- ~^ds?'(sp:&'"!?/]-"=\:ss)+\:/:-1 1 / Both!
- &''~^ds?'(sp:&'"!?/]-"=\:ss)+\:/:-1 1 / Find the indices into the lists of special characters
- / Find the indices of the actual characters (adjusting back)
- :sp:-1 1+'sp@'/:&''~^ds?'(sp:&'"!?/]-"=\:ss)+\:/:-1 1
- qs:&'"'\""=\:x / Find both single and double quotes
- / Now to tackle comments and CDATA
- sp[1] / closing special delimiters
- sp[1;4 3] / find "->" and "]>"
- ss@cl:sp[1;4 3]-2 / Look back a couple of characters
- "-]"='ss@cl / find "-->" and "]]>"
- cl:2+cl@'&'"-]"='ss@cl / Keep track of where we found this
- / Now look for openers
- ss@(op:sp[0;0])+\:2+!7 / Look for the first seven characters after "<!"
- / See check which of these look like the beginning of a comment or CDATA
- ("--";"[CDATA["){x~(#x)#y}/:\:ss@(op:sp[0;0])+\:2+!7
- / And find the original indices
- op:op@/:&'("--";"[CDATA["){x~(#x)#y}/:\:ss@(op:sp[0;0])+\:2+!7
- / Now we have candidates for opening and closing comments and CDATA
- / Now we have to clean things up
- / What can happen?
- :ss0:"<!-- <!-- --> -->" / Comments don't nest!
- / Actually, this is illegal syntax. "<" must only be used to open a tag
- :ss0:"<!-- >!-- --> -->" / Nothing wrong with this, though
- / The comment is closed by the first closing comment
- :tst:"(open close) close) (open close)" / Let's make it look simpler
- :oc:&'"()"=\:tst / indices of open and closed characters
- (#'oc)#'1 -1 / replace open indices with 1's and close indices with -1's
- (,/(#'oc)#'1 -1)@g:<v:,/oc / flatten and sort by the (flattened) indices
- +\(,/(#'oc)#'1 -1)@g:<v:,/oc / Where the sum scan dips below 0 is when we have too many close chars
- / We want to throw those out, so let's focus on them
- 0&\+\(,/(#'oc)#'1 -1)@g:<v:,/oc / only spots where we're zero or lower
- 0<':0&\+\(,/(#'oc)#'1 -1)@g:<v:,/oc / lower than even the previous one so another redundant close
- / These are the ones we throw out
- fm:{(v@g)@&~0<':0&\+\(,/(#'x)#'1 -1)@g:<v:,/x}
- / This finds the "first match" for each opening character
- @[&#tst;fm@oc;:;1] / For debugging let's mark which indices we've kept
- `0:(tst;" ^"@[&#tst;fm@oc;:;1]) / And line them up with the text
- / Notice that fm returns a list of indices which alternates
- / open and close characters
- ev:#'/|1(2*-2!#:)'\ / ev ensures you have a list of even length
- / technically shouldn't happen, but it couldn't hurt
- / After applying fm to open and closed comments and CDATA sequences,
- / We want to ensure that comments in CDATA and vice versa are just text
- ss:"(easy)([])[()][peasy]([)[)]" / Let's simply by making using parens a braces
- / The idea is to think of each as quoting the other
- / and we're looking for unquoted characters
|