ZIP export is the way to go if you would like to export a larger number of objects, whether transformed or not.
While you can export very large datasets, there is a caveat: in normal mode, the ZIP export needs to collect the metadata of all objects before it starts to deliver anything. This is required because the filenames of all objects must be known in order to rewrite the links between the objects correctly. Thus the ZIP tool (unlike, e.g., the TEIcorpus export) might need quite some time before it delivers the first bytes. If you are unlucky, this startup time exceeds the timeout of your browser or of an intermediate proxy, and you get a timeout instead of the ZIP file.
To still be able to export such large datasets, the ZIP export offers a special streaming mode. When you pass the query parameter stream=true, the Aggregator delivers data as soon as possible, even if it does not yet have enough information to rewrite links correctly. Exported files may then still contain textgrid: URIs, but at least you get files :-)
Each exported ZIP file contains an additional file at the root level called .INDEX.imex. This is an XML file that lists all exported objects and maps their textgrid: URIs to the file names used in the actual export. As long as you don’t rename or move the files, the TextGridLab can use this index to re-import them.
Example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importSpec xmlns="http://textgrid.info/import">
<importObject textgrid-uri="textgrid:k2kp.0" local-data="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml" local-metadata="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
<importObject textgrid-uri="textgrid:k2k1.0" local-data="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml" local-metadata="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
<importObject textgrid-uri="textgrid:k2k7.0" local-data="Romane/Clemens_Brentanos_Fruehlingskranz/Arnim,_Bettina_von-Clemens_Brentanos_Fruehlingskranz.xml" local-metadata="Romane/Clemens_Brentanos_Fruehlingskranz/Arnim,_Bettina_von-Clemens_Brentanos_Fruehlingskranz.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
</importSpec>
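Since the index is plain XML, it can be read with any XML library. The following is a minimal Python sketch that builds a URI-to-filename mapping, assuming the structure shown in the example above; the helper name `read_index` is hypothetical and not part of any TextGrid tooling.

```python
import xml.etree.ElementTree as ET

# Namespace used by the importSpec element in the example above.
NS = {"imex": "http://textgrid.info/import"}


def read_index(xml_data):
    """Map each textgrid-uri to its local-data path (hypothetical helper).

    Accepts str or bytes; pass bytes if the data starts with an XML
    declaration that names an encoding.
    """
    root = ET.fromstring(xml_data)
    return {
        obj.get("textgrid-uri"): obj.get("local-data")
        for obj in root.findall("imex:importObject", NS)
    }


sample = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importSpec xmlns="http://textgrid.info/import">
  <importObject textgrid-uri="textgrid:k2k1.0"
                local-data="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml"
                local-metadata="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml.meta"
                rewrite-method="xml" rewrite-config="internal:tei#tei"/>
</importSpec>"""

index = read_index(sample)
```

For a real export, the bytes could be obtained with `zipfile.ZipFile(path).read(".INDEX.imex")` from the Python standard library.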
Synopsis:
/zip/{objects}?sid&title&filenames&metanames&dirnames&only&meta&transform&query&filter&target&start&stop&stream
parameter | value | description |
---|---|---|
sid | string | Session ID to access protected resources |
stream | boolean Default: false | If true, favor fast results over ideal rewriting |
title | string | (optional) title for the exported data, currently only used for generating the filename. If none is given, the first title of the first object will be used. |
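As an illustration of the synopsis, the sketch below assembles an export URL in Python. It assumes the base URL `https://textgridlab.org/1.0/aggregator/zip/` that appears in the query example in this section; the helper `build_zip_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: base URL taken from the query example in this section.
BASE = "https://textgridlab.org/1.0/aggregator/zip/"


def build_zip_url(objects, **params):
    """Build a ZIP export URL for the given TextGrid URIs (hypothetical helper)."""
    # Multiple objects are separated by commas in the request path.
    path = ",".join(objects)
    pairs = []
    for key, value in params.items():
        # A list or tuple yields one query parameter per element.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            # Booleans are written as lowercase true/false, as in stream=true.
            if isinstance(v, bool):
                v = "true" if v else "false"
            pairs.append((key, v))
    query = urlencode(pairs)
    return BASE + path + ("?" + query if query else "")


url = build_zip_url(["textgrid:k2kp.0", "textgrid:k2k1.0"], stream=True, title="Arnim")
```

Here `stream=True` requests the streaming mode described above, and `title` only influences the name of the resulting ZIP file.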
There are basically two ways to select what to export:
You can export one or more objects or aggregations together with everything that they aggregate. To do that, specify the URI(s) as the objects part of the request path, as with the other exporters:
parameter | value | description |
---|---|---|
objects | string | The TextGridURIs of the TEI documents or aggregations to zip, separated by commas (,) |
Alternatively, specify a query to TG-search. To do so, specify an (unused) object string plus query parameters, so a possible URL may look like <https://textgridlab.org/1.0/aggregator/zip/query?query=waldeinsamkeit>.
You have the full power of the query language, but only a limited set of parameters that will be passed to TG-search:
parameter | value | description |
---|---|---|
query | string | (EXPERIMENTAL) perform the given TGsearch query and use its result as root objects instead of the objects. |
filter | string (repeating) | For query: additional filters |
target | string Default: both | If query is used, the query target (metadata, fulltext or both) |
start | int Default: 0 | For query: start at result no. |
stop | int Default: 65535 | For query: max. number of results |
Please note that you typically will not need to specify the start and stop parameters, but you may want to use stream=true (cf. above).
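A query-based export URL could be assembled as follows. This is only a sketch: it reuses the example URL given above and adds the documented target and stream parameters; no filter is shown because the filter syntax is not described here.

```python
from urllib.parse import urlencode

# The objects part of the path is unused for query exports; "query" is
# the placeholder used in the example URL above.
base = "https://textgridlab.org/1.0/aggregator/zip/query"

params = [
    ("query", "waldeinsamkeit"),  # the TG-search query
    ("target", "fulltext"),       # one of: metadata, fulltext, both
    ("stream", "true"),           # deliver data early for large result sets
]

url = base + "?" + urlencode(params)
```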
In both cases, you can further strip down what to export by specifying one or more content types and by specifying whether metadata and textgrid-specific technical files (i.e. the aggregation files) should be exported:
parameter | value | description |
---|---|---|
only | string (repeating) | If at least one only parameter is given, restrict export to objects with the given MIME types |
meta | boolean Default: true | Include metadata and aggregation files in the ZIP file. |
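Because only is a repeating parameter, it appears once per MIME type in the query string. The sketch below shows one way to produce such a query string in Python; the two MIME types are illustrative assumptions, and the object URI is taken from the index example earlier in this section.

```python
from urllib.parse import urlencode

base = "https://textgridlab.org/1.0/aggregator/zip/textgrid:k2kp.0"

# doseq=True turns the list into one "only" parameter per value.
query = urlencode(
    {
        "only": ["text/xml", "image/jpeg"],  # assumed example MIME types
        "meta": "false",                     # skip metadata and aggregation files
    },
    doseq=True,
)

url = base + "?" + query
```

Note that `urlencode` percent-encodes the slash in each MIME type, which is valid in a query string.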
Sometimes you want the text, but you don’t want it in the original form. Since the aggregator has a built-in XSLT processor, you can use it to convert the documents. This typically does not considerably slow down the export process.
parameter | value | description |
---|---|---|
transform | string | (EXPERIMENTAL) Transform each XML document before zipping. Values currently available are text, html, or the textgrid: URI of an XSLT stylesheet. |
If you specify transform=text, a default plain-text transformation will be applied to each file. We use the to-plain-text transformation of the bundled TEI XSLTs, so expect a result that handles TEI markup sensibly. transform=html will use the built-in HTML transformation instead.
You can also specify a textgrid: URI that points to an XSLT stylesheet – however, keep in mind that this stylesheet must be either public or you need to pass in a valid session ID.
It is possible to modify the filenames used inside the ZIP file (and for rewritten links) by providing file name patterns using three parameters:
parameter | value | description |
---|---|---|
filenames | string Default: {parent|/}{author}-{title}*.{ext} | Pattern for the generated filenames in the ZIP files. |
metanames | string Default: {filename}.meta | Pattern for the filenames for the metadata files in the ZIP files. |
dirnames | string Default: {parent|/}{title}* | Pattern for the directory names generated for aggregations etc. This pattern applied to the parent aggregation is available as {parent} in filenames and metanames. |
The filenames will be generated from the metadata available to the aggregator when it adds the object to its internal list, so it may be that especially the author field is undefined. By default, each metadata field will be transformed to a safe character set containing only ASCII letters and numbers and a limited set of special characters, by running an automatic transcription (so Luſtige Märchen will become Lustige_Maerchen, and ηελλασ will become hellas). A literal * in the pattern will be replaced by either nothing or a disambiguation number if the same name would be generated for different objects otherwise. The filename extension {ext} will depend on the format actually exported, so it is txt if you use transform=text.
A pattern string is a string containing patterns enclosed in curly braces. Each pattern starts with a variable and is optionally followed by one or more options, each introduced by a vertical bar (|). Please note that all whitespace is significant.
As an example, the string {author|fallback|20}-{title|sep=,}.{uri}.{ext} contains the variables author with the options fallback and 20, the variable title with the option sep=,, and the variables uri and ext, each without any option.
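To make the structure of such a pattern string concrete, here is a minimal Python sketch that splits a pattern into its variables and options. It is not the Aggregator's actual parser: the function `parse_pattern` is hypothetical, and the sketch assumes that option values never contain `|`, `{` or `}`.

```python
import re

# One placeholder: everything between a pair of curly braces.
PLACEHOLDER = re.compile(r"\{([^{}]*)\}")


def parse_pattern(pattern):
    """Return a list of (variable, [options]) for every {...} placeholder."""
    result = []
    for match in PLACEHOLDER.finditer(pattern):
        # The first part is the variable, the rest are its options.
        variable, *options = match.group(1).split("|")
        result.append((variable, options))
    return result


parsed = parse_pattern("{author|fallback|20}-{title|sep=,}.{uri}.{ext}")
for variable, options in parsed:
    print(variable, options)
```

Text outside the braces (the `-` and `.` in this example) is literal and is ignored by the sketch.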
The following basic variables are available in all policies:
Variable | Supported Options | Description |
---|---|---|
author | fallback, sep=String, Number, raw | The object’s author. This tries to find the nearest work object in the aggregation tree and extracts its author or authors. If the fallback option is included and the matching work does not include author fields, use all agents regardless of their role instead. |
title | sep=String, Number, raw | The object’s title or titles. |
uri | — | The object’s TextGrid URI. This only includes the scheme-specific part. |
ext | — | A filename extension that is suitable for the object’s MIME type, or dat if none found. This does not include a leading dot. |
* | pre=String (Default: .), post=String | A filename disambiguation pattern, only inserted if required. If filename disambiguation is on (setUniqueFilenames(boolean), <http://dev.digital-humanities.de/ci/job/link-rewriter/site/apidocs/info/textgrid/utils/export/filenames/ConfigurableFilenamePolicy.html#setUniqueFilenames%28boolean%29>), getFilename(IAggregationEntry) (<http://dev.digital-humanities.de/ci/job/link-rewriter/site/apidocs/info/textgrid/utils/export/filenames/ConfigurableFilenamePolicy.html#getFilename%28info.textgrid.utils.export.aggregations.IAggregationEntry%29>) will first generate a filename candidate with this pattern expanding to the empty string. If this filename has already been used for a different entry, it will re-run the filename generation with this pattern expanding to the empty string for the first object resolving to the candidate and to prefix + n-1 + postfix for every other object. For example, for three XML documents by Goethe and the pattern {author}*.{ext} you will get Goethe.xml, Goethe.1.xml and Goethe.2.xml. Instead of {*} without options you can also simply write *. |
If you generate multiple filenames, your pattern should include either {uri} or *, or you risk getting the same filename for different objects!
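The numbering scheme of the * pattern can be sketched in a few lines of Python. This is a simplified illustration of the behaviour described in the table, not the actual ConfigurableFilenamePolicy code: the function `disambiguate` is hypothetical, and it does not check whether a numbered name collides with a name produced by another object.

```python
from collections import defaultdict


def disambiguate(stems, ext, pre=".", post=""):
    """Append pre + counter + post to every repeated stem (hypothetical helper)."""
    seen = defaultdict(int)  # how often each stem has been used so far
    names = []
    for stem in stems:
        n = seen[stem]
        seen[stem] += 1
        # The first object keeps the plain name; later ones get a number.
        marker = "" if n == 0 else f"{pre}{n}{post}"
        names.append(f"{stem}{marker}.{ext}")
    return names


names = disambiguate(["Goethe", "Goethe", "Goethe"], "xml")
```

With the default pre and post values this reproduces the Goethe.xml, Goethe.1.xml, Goethe.2.xml sequence from the table.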
Variable | Description |
---|---|
parent | This is the dirnames pattern applied to the parent aggregation of the current object, if any. In the form {parent|/} it appends / if and only if there is a parent. It is available in all patterns, including dirnames itself. |
filename | The filename of the object whose metadata file is being named. Only available in metanames. |
The options have the following meaning:
Option | Description |
---|---|
Number | If you pass any positive integer as an option, the expanded value of the variable will be trimmed to at most Number characters. Trimming occurs after all other processing steps for the variable. |
raw | Insert the result of this variable as-is, without character sanitization. If you do not include this option, the result of the metadata-based variables will be transcribed from its original characters to a safe subset of US-ASCII characters in order to be safe from all kinds of encoding and filename issues. This tries to do something sensible with, e.g., umlauts and non-latin scripts. |
sep=String | If present and the respective metadata field contains multiple values, use all values, joined together with the given separator String. Otherwise, only use the first value. |
fallback | See the descriptions of the variables that support this option. |
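The interplay of the sep, raw and Number options can be illustrated with the following Python sketch. It is only an approximation under stated assumptions: `expand` and `simple_sanitize` are hypothetical names, the sanitizer merely replaces unsafe characters with underscores instead of performing the real transcription, and applying it to the joined string (rather than to each value) is an assumption.

```python
import re


def simple_sanitize(text):
    """Stand-in sanitizer (assumption): replace unsafe character runs with '_'."""
    return re.sub(r"[^A-Za-z0-9._,-]+", "_", text)


def expand(values, options, sanitize=simple_sanitize):
    """Expand one metadata variable according to its options (hypothetical)."""
    sep, limit, raw = None, None, False
    for opt in options:
        if opt.startswith("sep="):
            sep = opt[len("sep="):]
        elif opt == "raw":
            raw = True
        elif opt.isdigit() and int(opt) > 0:
            limit = int(opt)
    if not values:
        return ""
    # Without sep only the first value is used; with sep all values are joined.
    text = sep.join(values) if sep is not None else values[0]
    if not raw:
        text = sanitize(text)
    # Trimming happens last, after all other processing steps.
    if limit is not None:
        text = text[:limit]
    return text


joined = expand(["Die Guenderode", "Band 2"], ["sep=-"])
trimmed = expand(["Die Guenderode"], ["8"])
untouched = expand(["Die Günderode", "Zweiter Titel"], ["raw"])
```

The first call joins both titles, the second trims the sanitized title to eight characters, and the third returns the first title unchanged because raw disables sanitization.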