Large texts and XML databases

I’ve been doing some experiments with storing and searching large texts in Knora and in an XML database, to explore the advantages and disadvantages of each approach.

My test data is 50 books from Project Gutenberg, with an average length of 300,000 words per book. To simulate a project with very dense markup, I ran each book through a part-of-speech tagger using NLTK, adding <noun>, <verb>, <adj>, and <det> tags. I put every ten words in a <sentence> tag, and every five sentences in a <p> tag. I made a simple ontology for these books using knora-py, adding custom standoff tags for the parts of speech and the sentences.
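
For illustration, a fragment of the resulting markup looks something like this (the words and the exact tag placement here are invented):

<p>
  <sentence>
    <det>The</det> <noun>wrath</noun> <verb>sing</verb>, <noun>goddess</noun>, of <noun>Achilles</noun> ...
  </sentence>
  ...
</p>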

With Knora

I imported a resource for each book into Knora, with the book’s entire text in a single text value. This is slow. One reason is that, after saving data in the triplestore, Knora reads the data back to check that it was stored correctly. I added a configuration option to disable this. Even so, it took 5-10 minutes to import each book.

Thinking about the use case of editing and saving your work in a GUI, I also tried splitting each book into 1000-word fragments, and saving each fragment in a separate BookFragment resource, linked to a Book resource. Then I could import a fragment in about 2 seconds.
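
If we went this route, a Gravsearch query could still reach the text through the link between fragment and book, along these lines (the property names here are made up):

?fragment a books:BookFragment ;
    books:isPartOfBook ?book ;
    books:hasText ?text .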

I was especially interested in the performance of searches that combine text and markup. I tried Gravsearch queries like “find ‘Euchenor’ marked as a noun and ‘full’ marked as an adjective, in the same paragraph but not necessarily in the same sentence.”

Gravsearch does this by first using the Lucene full-text index to determine which text values contain these words. It then iterates over the noun and adjective tags in each of those text values, and compares the search terms to the substrings of text contained in each tag (using the SUBSTR function in SPARQL).
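
As an illustration, the per-tag comparison looks roughly like this (a sketch, not Knora’s exact generated SPARQL, assuming the knora-base properties that record each tag’s 0-based character offsets):

?nounTag knora-base:standoffTagHasStart ?start ;
         knora-base:standoffTagHasEnd ?end .
# SUBSTR is 1-based, so the 0-based start offset is shifted by one
FILTER(SUBSTR(?textStr, ?start + 1, ?end - ?start) = "Euchenor")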

If you write the Gravsearch query so that it finds Euchenor first, and then full, it takes about a second:

?book books:hasText ?text .
?text knora-api:valueAsString ?textStr .

# Find 'Euchenor' marked as a noun
?text knora-api:textValueHasStandoff ?nounTag .
?nounTag a books:StandoffNounTag .
FILTER knora-api:matchInStandoff(?textStr, ?nounTag, "Euchenor")

# Find 'full' marked as an adjective
?text knora-api:textValueHasStandoff ?adjectiveTag .
?adjectiveTag a books:StandoffAdjectiveTag .
FILTER knora-api:matchInStandoff(?textStr, ?adjectiveTag, "full")

This is reasonably fast because ‘Euchenor’ occurs in only one book in my test data (the Iliad). So the first FILTER eliminates all books except one. However, if you reverse the order of the FILTER statements, the query takes about 20 seconds, because ‘full’ occurs in many books.

We could consider optimising the generated SPARQL to do all the Lucene index searches first, before looking at the standoff. But that wouldn’t help if you were only searching for common words.

In short, the performance of queries like this depends on how many tags the query has to look at. The Lucene index helps by reducing the number of text values that need to be searched, but not if you’re searching for a common word. Splitting books into fragments doesn’t have any effect on this.

I wondered whether it would be possible to create a custom Lucene index linking each tag to the words it contains, but this doesn’t seem to be possible in GraphDB. GraphDB automatically creates Lucene index entries for string literals attached to RDF entities, and there doesn’t seem to be a way to intervene in this process to link RDF entities (standoff tags) to substrings of string literals attached to other RDF entities (text values). The same is true if you use GraphDB Enterprise Edition with Apache Solr.

We could store the contents of each tag redundantly, as a string literal attached to the standoff tag itself. But this would greatly increase the amount of data that needs to be written to the triplestore, slowing down updates even more.
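
Concretely, each tag would carry a copy of the substring it covers, something like this (books:tagHasContent is a made-up property, and the tag IRI is a placeholder):

<http://rdfh.ch/0001/iliad/values/tag42> a books:StandoffNounTag ;
    knora-base:standoffTagHasStart 1041 ;
    knora-base:standoffTagHasEnd 1049 ;
    books:tagHasContent "Euchenor" .  # duplicates a substring of the text value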

So, in a use case like this, with large amounts of text, a lot of markup, and queries that search for common words, there doesn’t seem to be a way to make knora-api:matchInStandoff very efficient with GraphDB.

XML databases

An XML database is organised like a filesystem, with a hierarchy of directories called ‘collections’ that contain XML documents.

Many aspects of XML databases are not standardised, but they all support a query language called XQuery, which is like SQL or SPARQL for XML.
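
For example, here is a minimal FLWOR (for-let-where-order-by-return) expression, with a made-up collection path and document structure:

(: list the titles of books whose year attribute is 1600 or later :)
for $book in collection("/db/books")//book
let $title := string($book/title)
where xs:integer($book/@year) ge 1600
order by $title
return $title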

Full-text search in XML databases

In practice, full-text search is not standardised: each XML database provides its own search extensions (eXist-db, for example, uses Lucene-based functions such as ft:query).

This means that to support multiple XML databases, we would need to develop something like Gravsearch for XQuery, and generate the appropriate syntax for each database.

There is a W3C spec, XQuery and XPath Full Text 3.0, but I haven’t found any implementations of it.

With eXist-db

The most popular open-source XML database seems to be eXist-db.

Importing the test data into eXist-db is much faster than importing it into Knora: the average book took 7 seconds to import, and the largest one (the complete works of Shakespeare, a million words, 16 MB of XML) took 21 seconds.

eXist-db allows you to create custom Lucene indexes for the contents of specific XML elements. Each collection can have its own indexes, and you can update this configuration dynamically. We could have a collection per Knora project, each with its own Lucene indexes.
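
For example, an index configuration for the part-of-speech tags might look like this (a sketch in eXist-db’s collection.xconf format):

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <text qname="noun"/>
            <text qname="adj"/>
        </lucene>
    </index>
</collection>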

Here is the XQuery version of “find ‘full’ marked as an adjective and ‘Euchenor’ marked as a noun, in the same paragraph but not necessarily in the same sentence”:

(: find <p> elements containing 'full' in an <adj> and 'Euchenor' in a <noun> :)
for $par in collection("/db/books")//p[.//adj[ft:query(., "full")]
                                       and .//noun[ft:query(., "Euchenor")]]
(: group the matching paragraphs by the document they come from :)
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

It takes 1-2 seconds, which I think is reasonable.

Drawbacks of using an XML database with Knora

Searching text and other data together

If we store text/markup in an XML database and metadata in the triplestore, there is no simple way to search both kinds of data in a single query.

Because Knora stores text markup in the triplestore, you can do Gravsearch queries like this:

  • “Find texts that contain this noun and that were published between these dates.”
  • “Find texts containing references to a person who lived between these dates.”

There are commercial databases that work both as RDF triplestores and as XML databases, and that offer capabilities like this, by combining SPARQL and XQuery in non-standard ways. See, for example, the MarkLogic documentation.
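
For instance, MarkLogic’s XQuery dialect lets you run SPARQL inside an XQuery query; here is a minimal sketch based on my reading of the MarkLogic docs (the IRIs are invented):

xquery version "1.0-ml";

import module namespace sem = "http://marklogic.com/semantics"
    at "/MarkLogic/semantics.xqy";

(: run a SPARQL query over the triples and use the results in XQuery :)
sem:sparql("
    SELECT ?book
    WHERE { ?book <http://example.org/hasAuthor> <http://example.org/Homer> }
")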

Non-hierarchical text structures

Knora’s standoff representation handles non-hierarchical structures, such as overlapping markup, in a straightforward way, but they are a problem for XML. The TEI guidelines have an extensive discussion of different ways of dealing with this problem. One way is to use empty elements, e.g. CLIX boundary markers (which Knora supports as an input-output format):

<lg>
 <l>
  <seg>Scorn not the sonnet;</seg>
  <hr:s sID="s02"/>critic, you have frowned, </l>
 <l>Mindless of its just honours; <hr:s eID="s02"/>
  <hr:s sID="s03"/>with this key </l>
 <l>Shakespeare unlocked his heart; <hr:s eID="s03"/>
  <hr:s sID="s04"/>the melody </l>
 <l>Of this small lute gave ease to Petrarch's wound. <hr:s eID="s04"/>
 </l>
</lg>

But how do you search these structures? Suppose you want to find the <hr:s> containing ‘key’ or ‘Shakespeare’. It is possible to do this in XQuery (see Milestone-chunk.xquery), but eXist-db will not be able to make a Lucene index for <hr:s>, because it is an empty element. So the query will probably be very slow.
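
A brute-force version might look like the following sketch (the hr namespace URI is a placeholder): it gathers the text between matching sID/eID milestones and searches it with contains(), without any index support:

declare namespace hr = "http://example.org/hr";  (: placeholder URI :)

for $start in collection("/db/books")//hr:s[@sID]
let $end := $start/following::hr:s[@eID eq string($start/@sID)][1]
(: every text node between the two milestones, in document order :)
let $text := string-join($start/following::text()[. << $end], "")
where contains($text, "Shakespeare")
return <s id="{$start/@sID}">{$text}</s>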

Next steps

If it’s important for us to support use cases like the one in this experiment, we could add functionality to Knora for storing text in an XML database. In addition to the existing TextValue, which stores markup as standoff/RDF, there would be a new value type, XMLValue, whose content would be stored in the XML database. Each project would then decide how to store its text, depending on its requirements.
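
In ontology terms, the idea might be sketched like this (all of these names are hypothetical until the design is worked out):

books:hasXMLText a owl:ObjectProperty ;
    knora-api:subjectType books:Book ;
    # the XML content itself would live in the XML database, not the triplestore
    knora-api:objectType knora-api:XMLValue .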

We could add a Gravsearch-like feature that would accept XQuery from the client, adapt it to the XML database being used, and deal with permissions, versioning, and so on.

To support combining text/markup searches with metadata searches (at least in a limited way), I think we could probably extend Gravsearch to accept XQuery as part of a WHERE clause. This could first run the XQuery, get back matching resource IRIs from the XML database, then use those IRIs in a SPARQL VALUES block to restrict the results of the rest of the WHERE clause. If there were many matching XML documents, we could consider caching the results of the XQuery, and Knora could maintain a kind of database cursor.
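
For example, if the XQuery step returned two matching resources, the generated SPARQL might restrict the rest of the WHERE clause like this (the IRIs and the date property are invented):

SELECT ?book ?date
WHERE {
    # resource IRIs returned by the XQuery step
    VALUES ?book { <http://rdfh.ch/0001/iliad> <http://rdfh.ch/0001/shakespeare> }
    ?book books:hasPublicationDate ?date .
}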

This would be a major development effort, and we would have to allocate substantial resources to it.

As we communicate with editions projects, it would be good to ask them about the size of their texts, the density of the markup, and the kinds of queries they would like to be able to do, to get an idea of how useful this approach would be.


Something I forgot to mention: there is a limit to the amount of standoff that Knora can update in a single transaction. With my test data, the largest books could not be imported, because Knora’s SPARQL UPDATE (containing the RDF for all the standoff) was too large to fit in a Java String.

@benjamingeer Thanks a lot for this in-depth analysis, which will help us very much in the discussions with both the editions and the SNF. I believe your view of the challenges is correct. The proposed solution seems very reasonable – if we can allocate the resources to it, which will depend on funding.
Again, this is very helpful for the discussions of the next few weeks with the editions and the SNF.

Thanks!!!

XSPARQL Language Specification: “XSPARQL is a query language combining XQuery and SPARQL for transformations between RDF and XML.”

https://www.w3.org/Submission/xsparql-language-specification/