walkthrough.txt 4.7 KB

  1. :ss:"<html><!-- comment --><body/></html>" / Sample xml text with a variety of tag types
  2. :ds:&'"<>"=\:ss / find the delimiters
  3. :sp:&'"!?/]-"=\:ss / find the indices of "helper" characters for (sp)ecial tag types
  4. / this produces a list of indices for each helper character
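
For readers less fluent in K, the "compare against each character, then take the indices of the matches" idiom used for ds and sp can be mirrored in Python. This is only an illustrative sketch: where_each is a hypothetical helper name that does not appear in the walkthrough, and the variable names ds and sp are simply borrowed from the K code.

```python
def where_each(chars, text):
    """For each character in chars, list the indices where it occurs in text."""
    return [[i for i, c in enumerate(text) if c == ch] for ch in chars]


ss = "<html><!-- comment --><body/></html>"
ds = where_each("<>", ss)      # one index list for "<", one for ">"
sp = where_each("!?/]-", ss)   # one index list per helper character
```

Each result is a list of index lists in the same order as the characters searched for, which is the shape the later steps index into.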
  5. ss@sp+1 / look for the characters following a helper character
  6. ss@sp-1 / look for the characters preceding a helper character
  7. (sp:&'"!?/]-"=\:ss)+\:/:-1 1 / Both!!
  8. ds[0]?sp-1 / See which characters preceding helper chars are opening delimiters
  9. ~^ds[0]?sp-1 / We don't need to know which particular delimiter
  10. / Just that there is one
  11. ~^ds?'(sp:&'"!?/]-"=\:ss)+\:/:-1 1 / Both!
  12. &''~^ds?'(sp:&'"!?/]-"=\:ss)+\:/:-1 1 / Find the indices into the lists of special characters
  13. / Find the indices of the actual characters (adjusting back)
  14. :sp:-1 1+'sp@'/:&''~^ds?'(sp:&'"!?/]-"=\:ss)+\:/:-1 1
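
As a cross-check on the adjacency filter just built, here is a rough Python rendering of the same idea: keep a helper character only when a "<" sits directly before it or a ">" directly after it, and report the position of that delimiter. It is a sketch under assumptions, not the author's code; adjacent_delims is a made-up name, and the result shape (a pair of per-helper lists) reflects my reading of the K expression.

```python
def adjacent_delims(text, helpers="!?/]-"):
    """Return (opens, closes). For each helper character:
    opens  holds indices of a "<" directly before an occurrence,
    closes holds indices of a ">" directly after an occurrence."""
    opens, closes = [], []
    for h in helpers:
        hits = [i for i, c in enumerate(text) if c == h]
        opens.append([i - 1 for i in hits if i > 0 and text[i - 1] == "<"])
        closes.append([i + 1 for i in hits
                       if i + 1 < len(text) and text[i + 1] == ">"])
    return opens, closes


ss = "<html><!-- comment --><body/></html>"
sp = adjacent_delims(ss)
```

With this shape, sp[0][0] plays the role of sp[0;0] (the "<" of every "<!") and sp[1][4] the role of sp[1;4] (the ">" of every "->").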
  15. qs:&'"'\""=\:ss / Find both single and double quotes
  16. / Now to tackle comments and CDATA
  17. sp[1] / closing special delimiters
  18. sp[1;4 3] / find "->" and "]>"
  19. ss@cl:sp[1;4 3]-2 / Look back a couple of characters
  20. "-]"='ss@cl / find "-->" and "]]>"
  21. cl:2+cl@'&'"-]"='ss@cl / Keep track of where we found this
  22. / Now look for openers
  23. ss@(op:sp[0;0])+\:2+!7 / Look for the first seven characters after "<!"
  24. / Check which of these look like the beginning of a comment or CDATA section
  25. ("--";"[CDATA["){x~(#x)#y}/:\:ss@(op:sp[0;0])+\:2+!7
  26. / And find the original indices
  27. op:op@/:&'("--";"[CDATA["){x~(#x)#y}/:\:ss@(op:sp[0;0])+\:2+!7
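
The candidate search for comment and CDATA boundaries can be restated in Python with plain substring search. This is a hedged sketch rather than a translation of the K: find_all and comment_cdata_candidates are hypothetical names, and I am assuming that op should hold the index of the "<" of each opener and cl the index of the ">" of each closer, with comments first and CDATA second.

```python
def find_all(text, token):
    """Indices of every (possibly overlapping) occurrence of token in text."""
    hits, start = [], text.find(token)
    while start != -1:
        hits.append(start)
        start = text.find(token, start + 1)
    return hits


def comment_cdata_candidates(text):
    """Return (op, cl), each as [comment indices, CDATA indices]."""
    op = [find_all(text, "<!--"), find_all(text, "<![CDATA[")]
    # Closers are reported at their final ">" character, hence the +2.
    cl = [[i + 2 for i in find_all(text, "-->")],
          [i + 2 for i in find_all(text, "]]>")]]
    return op, cl


ss = "<html><!-- comment --><body/></html>"
op, cl = comment_cdata_candidates(ss)
```

These are still only candidates: nothing here pairs an opener with its closer or rejects closers that have no opener.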
  28. / Now we have candidates for opening and closing comments and CDATA
  29. / Now we have to clean things up
  30. / What can happen?
  31. :ss0:"<!-- <!-- --> -->" / Comments don't nest!
  32. / Strictly speaking, this is not well-formed XML: "--" may not appear inside a comment
  33. :ss0:"<!-- &lt;!-- --> -->" / Escaping the inner "<" doesn't change that, but we still want to cope with such input
  34. / The comment is closed by the first closing comment
  35. :tst:"(open close) close) (open close)" / Let's make it look simpler
  36. :oc:&'"()"=\:tst / indices of open and close characters
  37. (#'oc)#'1 -1 / replace open indices with 1's and close indices with -1's
  38. (,/(#'oc)#'1 -1)@g:<v:,/oc / flatten and sort by the (flattened) indices
  39. +\(,/(#'oc)#'1 -1)@g:<v:,/oc / Where the sum scan dips below 0 is when we have too many close chars
  40. / We want to throw those out, so let's focus on them
  41. 0&\+\(,/(#'oc)#'1 -1)@g:<v:,/oc / running minimum, capped at 0: it only moves when we hit a surplus close
  42. 0<':0&\+\(,/(#'oc)#'1 -1)@g:<v:,/oc / 1 wherever the minimum is lower than the previous one, i.e. at a redundant close
  43. / These are the ones we throw out
  44. fm:{(v@g)@&~0<':0&\+\(,/(#'x)#'1 -1)@g:<v:,/x}
  45. / This finds the "first match" for each opening character
  46. @[&#tst;fm@oc;:;1] / For debugging let's mark which indices we've kept
  47. `0:(tst;" ^"@[&#tst;fm@oc;:;1]) / And line them up with the text
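
The running-sum trick behind fm may be easier to follow step by step. The Python sketch below is my own restatement, with the hypothetical name drop_surplus_closers: it walks the open and close indices in order, tracks the nesting depth, and discards a closer whenever the depth reaches a new low below zero. It also rebuilds the marker line used for debugging.

```python
def drop_surplus_closers(opens, closes):
    """Merge open/close indices in order and drop closers with no opener left."""
    events = sorted([(i, 1) for i in opens] + [(i, -1) for i in closes])
    kept, depth, floor = [], 0, 0
    for index, step in events:
        depth += step          # the sum scan
        if depth < floor:      # a new minimum below zero: surplus closer
            floor = depth
        else:
            kept.append(index)
    return kept


tst = "(open close) close) (open close)"
oc = [[i for i, c in enumerate(tst) if c == ch] for ch in "()"]
kept = drop_surplus_closers(*oc)
marks = "".join("^" if i in kept else " " for i in range(len(tst)))
print(tst)
print(marks)
```

Note that this rule only removes closers that have no opener left to match. A second opener that appears before the first one is closed is kept.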
  48. / Notice that fm returns a list of indices which alternates
  49. / open and close characters
  50. ev:#'/|1(2*-2!#:)'\ / ev ensures you have a list of even length
  51. / technically shouldn't happen, but it couldn't hurt
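
As I read it, ev trims each list to the largest even length that fits. A minimal Python sketch of that reading follows; even_prefix is a hypothetical name, and the assumption that ev is applied to a list of index lists is mine rather than something the walkthrough states.

```python
def even_prefix(lists):
    """Truncate every list to an even number of items (drop a trailing odd one)."""
    return [lst[: 2 * (len(lst) // 2)] for lst in lists]


trimmed = even_prefix([[0, 11, 20], [4, 9], []])
```

Dropping the last item of an odd-length list discards a trailing opener that never found its closer, which is the case the comment calls "technically shouldn't happen".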
  52. / After applying fm to the open and close sequences of comments and CDATA,
  53. / we want to ensure that comments in CDATA and vice versa are just text
  54. ss:"(easy)([])[()][peasy]([)[)]" / Let's simplify by using parens and square brackets
  55. / The idea is to think of each as quoting the other
  56. / and we're looking for unquoted characters