1. Introduction
Tubax is a library that parses raw XML data and converts it into native ClojureScript data structures.
It uses sax.js under the hood to provide a fast way to convert from XML.
2. Rationale
Currently there is no good way to parse XML and other markup languages with ClojureScript. There are no ClojureScript-based libraries, and most of the JavaScript ones require access to the DOM.
This last point is critical because HTML5 Web Workers don’t have access to the DOM APIs, so an alternative is necessary.
Another approach to XML processing is to take a middle ground: some libraries parse XML into a JSON format.
The problem with these is that JSON is not a faithful representation of the XML format. Some XML documents cannot be represented as JSON.
For example, the following XML will lose information when transformed into JSON.
<root>
<field-a>A</field-a>
<field-b>B</field-b>
<field-a>A</field-a>
</root>
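To make the loss concrete, here is a minimal sketch in plain Clojure data, with no library calls. The `node` map follows the `:tag` / `:attributes` / `:content` shape used throughout this document, and `json-like` is a hypothetical naive object-style conversion, not the output of any particular tool.

```clojure
;; The XML above in the :tag / :attributes / :content shape.
;; Children live in a vector, so order and repetition are preserved.
(def node
  {:tag :root
   :attributes {}
   :content [{:tag :field-a :attributes {} :content ["A"]}
             {:tag :field-b :attributes {} :content ["B"]}
             {:tag :field-a :attributes {} :content ["A"]}]})

;; Hypothetical naive JSON-style conversion: one key per tag name.
;; Map keys are unique, so the second <field-a> overwrites the first.
(def json-like
  (into {}
        (map (fn [child] [(:tag child) (first (:content child))]))
        (:content node)))

(map :tag (:content node)) ;; => (:field-a :field-b :field-a)
(count json-like)          ;; => 2
```

Grouping repeated keys into arrays would keep both values, but the fact that `field-b` sits between the two `field-a` elements would still be lost.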
Another main objective of Tubax is to be fully compatible with the clojure.xml format, so that functionality already in the Clojure API, such as zippers, can be used on the result.
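As an illustration of the zipper point, the sketch below walks and edits a tree with `clojure.zip/xml-zip`. It uses a hand-written literal in the shape shown in this document rather than real parser output, which is an assumption made to keep the example self-contained. `xml-zip` only reads and writes the `:content` key, so the example does not depend on how attributes are named.

```clojure
(require '[clojure.zip :as zip])

;; A small literal tree standing in for parsed output (assumption).
(def node
  {:tag :root
   :attributes {}
   :content [{:tag :field-a :attributes {} :content ["A"]}
             {:tag :field-b :attributes {} :content ["B"]}]})

;; xml-zip treats any non-string value as a branch and reads its :content.
(def loc (zip/xml-zip node))

;; Move to the first child, then to its right sibling.
(def second-child (-> loc zip/down zip/right zip/node))

(:tag second-child) ;; => :field-b

;; Edits are functional: zip/root returns a new tree, the original is untouched.
(def edited
  (-> loc zip/down (zip/edit assoc :content ["changed"]) zip/root))

(get-in edited [:content 0 :content]) ;; => ["changed"]
(get-in node [:content 0 :content])   ;; => ["A"]
```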
3. Install
Warning: Not on Clojars yet. I’ll update the information when it’s available.
4. Usage
All examples will use this XML as if it existed in a (def xml-data "…") definition.
<rss version="2.0">
<channel>
<title>RSS Title</title>
<description>This is an example of an RSS feed</description>
<link>http://www.example.com/main.html</link>
<lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000 </lastBuildDate>
<pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
<ttl>1800</ttl>
<item>
<title>Example entry</title>
<description>Here is some text containing an interesting description.</description>
<link>http://www.example.com/blog/post/1</link>
<guid isPermaLink="false">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
<pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
</item>
<item>
<title>Example entry2</title>
<description>Here is some text containing an interesting description.</description>
<link>http://www.example.com/blog/post/1</link>
<guid isPermaLink="true">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
<pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
</item>
</channel>
</rss>
4.1. Basic usage
To parse an XML file you only have to call the xml->clj function:
(require '[tubax.core :refer [xml->clj]])
(xml->clj xml-data)
4.2. Additional options
The library bundles the sax.js library as its main dependency. You can pass the following options to the conversion to customize some of its behaviour.
4.2.1. Strict mode
default true
When strict mode is disabled, the parser is more forgiving about XML structure. In strict mode, a malformed document makes the parser throw an exception.
Warning: Some "loose" formats can cause unexpected behaviour, so disabling strict mode is not recommended.
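;; The option examples call the parser through the `core` alias.
(require '[tubax.core :as core])
(def xml-data "<a><b></a>")
(core/xml->clj xml-data {:strict false})
;; => {:tag :a :attributes {} :content [{:tag :b :attributes {} :content []}]}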
(core/xml->clj xml-data {:strict true})
;; => js/Error #Parse error
4.2.2. Trim whitespaces
default true
This option makes the parser remove all leading and trailing whitespace from text nodes.
(def xml-data "<a> test </a>")
(core/xml->clj xml-data {:trim false})
;; => {:tag :a :attributes {} :content [" test "]}
(core/xml->clj xml-data {:trim true})
;; => {:tag :a :attributes {} :content ["test"]}
4.2.3. Normalize whitespaces
default false
Replaces every whitespace character (tabs, line breaks, etc.) with a plain space.
(def xml-data "<a>normalize\ntest</a>")
(core/xml->clj xml-data {:normalize false})
;; => {:tag :a :attributes {} :content ["normalize\ntest"]}
(core/xml->clj xml-data {:normalize true})
;; => {:tag :a :attributes {} :content ["normalize test"]}
4.2.4. Lowercase (non-strict mode only)
default true
In non-strict mode, tag and attribute names are lower-cased while this option is true. Set it to false to get upper-case names instead.
(def xml-data "<root att1='t1'>test</root>")
(core/xml->clj xml-data {:strict false :lowercase true})
;; => {:tag :root :attributes {:att1 "t1"} :content ["test"]}
(core/xml->clj xml-data {:strict false :lowercase false})
;; => {:tag :ROOT :attributes {:ATT1 "t1"} :content ["test"]}
4.2.5. Support for XML namespaces
default false
By default no additional data is produced when an XML namespace is found.
When the xmlns option is activated, the node elements carry more information about their namespaces.
(def xml-data "<element xmlns='http://foo'>value</element>")
(core/xml->clj xml-data {:xmlns false})
;; => {:tag :element :attributes {:xmlns "http://foo"} :content ["value"]}
(core/xml->clj xml-data {:xmlns true})
;; => {:tag :element :attributes {:xmlns {:name "xmlns" :value "http://foo" :prefix "xmlns" :local "" :uri "http://www.w3.org/2000/xmlns/"}} :content ["value"]}
4.2.6. Strict entities
default false
When activated, the parser fails when it finds an entity that is not predefined.
(def xml-data "<element>&aacute;</element>")
(core/xml->clj xml-data {:strict-entities false})
;; => {:tag :element :attributes {} :content ["á"]}
(core/xml->clj xml-data {:strict-entities true})
;; => js/Error #Parser error
4.3. Utility functions
(require '[tubax.helpers :as th])
For simplicity the following examples suppose:
(require '[tubax.core :refer [xml->clj]])
(def result (xml->clj xml-data))
4.3.1. Access data-structure
(th/tag {:tag :item :attributes {} :content ["Text"]})
;; => :item
(th/attributes {:tag :item :attributes {} :content ["Text"]})
;; => {}
(th/children {:tag :item :attributes {} :content ["Text"]})
;; => ["Text"]
(th/text {:tag :item :attributes {} :content ["Text"]})
;; => "Text"
(th/text {:tag :item :attributes {} :content [{:tag :item :attributes {} :content [...]}]})
;; => nil
4.3.2. Find first node
These functions retrieve the first node that matches the query passed as argument.
(th/find-first result {:tag :item})
;; => {:tag :item :attributes {} :content [{:tag :title :attributes {} :content ["Hello world"]}]}
(th/find-first result {:path [:rss :channel :description]})
;; => {:tag :description :attributes {} :content ["This is an example of an RSS feed"]}
Search for the first element that has the attribute defined:
(th/find-first result {:attribute :isPermaLink})
;; => {:tag :guid :attributes {:isPermaLink "false"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}
Search for the first element that has an attribute with the specified value:
(th/find-first result {:attribute [:isPermaLink "true"]})
;; => {:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}
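One way to picture a `:path` query is as a level-by-level descent through `:content`. The sketch below is a hypothetical helper written in plain Clojure, not tubax's implementation. The name `nodes-by-path` and the assumption that the path starts at the root tag are both illustrative.

```clojure
;; Hypothetical helper, NOT part of tubax.
(defn nodes-by-path
  "Returns the nodes reached by following `path` (a vector of tags),
   where the first tag must match `root` itself."
  [root [tag & more]]
  (when (= tag (:tag root))
    (reduce (fn [nodes t]
              ;; Keep only element children (maps) whose tag matches t.
              (for [n nodes
                    child (:content n)
                    :when (and (map? child) (= t (:tag child)))]
                child))
            [root]
            more)))

;; A trimmed-down literal version of the RSS example (assumption).
(def tree
  {:tag :rss :attributes {:version "2.0"}
   :content [{:tag :channel :attributes {}
              :content [{:tag :title :attributes {} :content ["RSS Title"]}
                        {:tag :item :attributes {}
                         :content [{:tag :title :attributes {} :content ["Example entry"]}]}
                        {:tag :item :attributes {}
                         :content [{:tag :title :attributes {} :content ["Example entry2"]}]}]}]})

(first (nodes-by-path tree [:rss :channel :title]))
;; => {:tag :title :attributes {} :content ["RSS Title"]}

(map (comp first :content) (nodes-by-path tree [:rss :channel :item :title]))
;; => ("Example entry" "Example entry2")
```

Taking `first` of the result gives "find first" behaviour, and the whole sequence gives "find all".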
4.3.3. Find all nodes
These functions return a lazy sequence of the elements that match the query passed as argument.
(th/find-all result {:tag :link})
;; => ({:tag :link :attributes {} :content ["http://www.example.com/main.html"]}
;; {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]}
;; {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]})
(th/find-all result {:path [:rss :channel :item :title]})
;; => ({:tag :title :attributes {} :content ["Example entry"]}
;; {:tag :title :attributes {} :content ["Example entry2"]})
(th/find-all result {:attribute :isPermaLink})
;; => ({:tag :guid :attributes {:isPermaLink "false"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}
;; {:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]})
(th/find-all result {:attribute [:isPermaLink "true"]})
;; => ({:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]})
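Because the parsed result is ordinary nested data, tag and attribute queries like the ones above can also be pictured as a `tree-seq` walk. The functions below are hypothetical helpers in plain Clojure, shown only to illustrate the idea. They are not tubax's implementation, and the names `all-nodes`, `nodes-by-tag` and `nodes-with-attr` are made up for this sketch.

```clojure
;; Hypothetical helpers, NOT part of tubax.
(defn all-nodes
  "Lazy depth-first sequence of every element node, including root.
   Text children (strings) are skipped."
  [root]
  (filter map? (tree-seq map? :content root)))

(defn nodes-by-tag [root tag]
  (filter #(= tag (:tag %)) (all-nodes root)))

(defn nodes-with-attr
  ;; Two-argument form: the attribute only has to be present.
  ([root attr]
   (filter #(contains? (:attributes %) attr) (all-nodes root)))
  ;; Three-argument form: the attribute must have exactly this value.
  ([root attr value]
   (filter #(= value (get (:attributes %) attr)) (all-nodes root))))

;; A small literal tree in the shape shown in this document (assumption).
(def tree
  {:tag :channel :attributes {}
   :content [{:tag :item :attributes {}
              :content [{:tag :guid :attributes {:isPermaLink "false"} :content ["a"]}]}
             {:tag :item :attributes {}
              :content [{:tag :guid :attributes {:isPermaLink "true"} :content ["b"]}]}]})

(count (nodes-by-tag tree :item))                  ;; => 2
(map :content (nodes-with-attr tree :isPermaLink)) ;; => (["a"] ["b"])
(first (nodes-with-attr tree :isPermaLink "true"))
;; => {:tag :guid :attributes {:isPermaLink "true"} :content ["b"]}
```

Results come back in document order because `tree-seq` visits nodes depth-first.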
5. Contribute
Tubax does not have many restrictions for contributions. Just open an issue or pull request.
6. Running tests
lein test
7. License
This library is under the Apache 2.0 License.