Hamburg 2004

DataProviders

A DataProvider supplies its data as name-value pairs, one set of pairs per iteration. This contract is general enough to integrate nearly any kind of data source into TMHarvest.
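The contract can be pictured with a minimal Java sketch. It is not the TMHarvest API: the class name, the use of an Iterator of Maps to stand in for a provider, and the ${name} placeholder syntax are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ProviderContractSketch {

    // Replaces every ${name} in the template with the value stored under that name.
    // The ${name} syntax is an assumption made for this sketch only.
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    // Processes the template once per iteration of the (hypothetical) provider.
    static List<String> processAll(String template, Iterator<Map<String, String>> provider) {
        List<String> output = new ArrayList<>();
        while (provider.hasNext()) {
            output.add(fill(template, provider.next()));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("title", "Kind of Blue");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("title", "Blue Train");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(first);
        rows.add(second);
        for (String line : processAll("Album: ${title}", rows.iterator())) {
            System.out.println(line);
        }
    }
}
```

Each provider described below fits this picture: it only has to decide what counts as one iteration (a row, a node, a file) and which names it exposes.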

TMHarvest currently supports the following data sources:

<sqlProvider>

Performs an SQL query. The template is processed once for each row of the result set. The placeholder names are derived from the column names of the result set.

The body of the element contains the SELECT statement. The JDBC source is defined in the <context> as a <datasource> and is referenced via the datasource attribute of the <sqlProvider>.

<csvProvider>

Reads data from a CSV file.

The first line of the file is assumed to contain the column labels. These labels are used to identify the corresponding placeholders in the template.

The <csvProvider>-element defines the following attributes:

filename

The file that contains the data. This attribute is required. The filename is resolved relative to the location of the model file.

cellDelimiter

The string that separates consecutive fields. This attribute is optional. The default is the tab character.

buffercapacity

The number of rows that are read from the CSV file at a time. Higher values need more memory but yield better performance. There is rarely any need to change this value. This attribute is optional. The default is 800.

transformColumnLabels

Allows you to transform the column labels to upper case or to lower case. This attribute is optional. By default, no transformation is done.
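The header-row convention and the cellDelimiter and transformColumnLabels attributes can be illustrated with a short Java sketch. It is not TMHarvest's implementation: the class name and the "upper"/"lower" values are hypothetical, quoting is not handled, and the whole text is read at once instead of in chunks of buffercapacity rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class CsvRowsSketch {

    // transform: "upper", "lower", or anything else for no change (hypothetical values).
    static List<Map<String, String>> parse(String text, String cellDelimiter, String transform) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] lines = text.split("\r?\n");
        if (lines.length == 0 || lines[0].isEmpty()) {
            return rows;
        }
        // Quote the delimiter so that characters such as "|" are taken literally.
        String separator = Pattern.quote(cellDelimiter);

        // The first line carries the column labels, which become the placeholder names.
        String[] labels = lines[0].split(separator, -1);
        for (int i = 0; i < labels.length; i++) {
            if ("upper".equals(transform)) {
                labels[i] = labels[i].toUpperCase(Locale.ROOT);
            } else if ("lower".equals(transform)) {
                labels[i] = labels[i].toLowerCase(Locale.ROOT);
            }
        }

        // Every following non-empty line yields one set of name-value pairs.
        for (int n = 1; n < lines.length; n++) {
            if (lines[n].isEmpty()) {
                continue;
            }
            String[] cells = lines[n].split(separator, -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < labels.length; i++) {
                row.put(labels[i], i < cells.length ? cells[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }
}
```

With the default tab delimiter, a file whose first line is "Name<TAB>Year" would expose the names Name and Year, or name and year when the labels are transformed to lower case.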

<xpathProvider>

The XPathProvider selects a subset of nodes (the node set) from the XML source. The template is processed once for each of these nodes.

To identify placeholders and substitute concrete values for them, the xpathProvider can contain an arbitrary number of <property>-elements. Each of these properties defines a name and a query. The query is executed on the current node. The result is converted to a string and used as the value in subsequent placeholder processing.

The <xpathProvider>-element defines the following attributes:

filename

The XML file to be processed. This attribute is required.

expression

The XPath query that returns the initial node set. This attribute is required.

The <xpathProvider>-element contains the following children:

<property>

The <property>-element allows you to define a name and a query. It defines two attributes: name and expression.
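The two-stage evaluation (one query for the node set, then one query per property on each node) can be sketched with the standard javax.xml.xpath API. The class and method names are hypothetical, and the sketch takes the XML as a string rather than a filename; it only illustrates the idea and is not how TMHarvest is implemented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathRowsSketch {

    // expression selects the node set; properties maps each name to its per-node query.
    static List<Map<String, String>> select(String xml, String expression,
                                            Map<String, String> properties) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Stage 1: the initial node set.
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            // Stage 2: evaluate every property query relative to the current node.
            List<Map<String, String>> rows = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Map<String, String> row = new LinkedHashMap<>();
                for (Map.Entry<String, String> property : properties.entrySet()) {
                    // The two-argument evaluate returns the result as a string.
                    row.put(property.getKey(), xpath.evaluate(property.getValue(), nodes.item(i)));
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath queries", e);
        }
    }
}
```

For example, with the expression /lib/book and the properties id = @id and title = title, each <book> element would yield one pair of values named id and title.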

<metaDataProvider>

The <metaDataProvider> iterates over a set of files and extracts and returns meta data.

Currently the following file types are supported: mp3 files, Open Office files, MS Office files, and HTML files with Dublin Core metadata.

The <metaDataProvider>-element defines the following attributes:

directory

The base directory from which the files are read. The directory is traversed recursively. Which kinds of files are included depends on the implementation of the finderclass.

finderclass

Fully qualified name of a class that implements the org.tm4j.tmharvest.md.MetaDataFinder interface.

Implementations of the org.tm4j.tmharvest.md.MetaDataFinder interface differ in the kinds of files they are able to process and in the meta data they return. In addition, every implementation defines a set of file extensions that is applied as a filter to all files in the given directory.

TMHarvest comes with the following implementations:

mp3-files

classname: org.tm4j.tmharvest.md.MP3MDFinder

extensions: .mp3

meta data: location, bitrate, album, artist, title, year, genre, comment

Note

The current implementation of the mp3 finder uses a library that opens mp3 files in read/write mode. You therefore cannot build topic maps from mp3 files to which you have no write access (read-only media or missing write permission).

 

Open Office-files

classname: org.tm4j.tmharvest.md.OpenOfficeOMDFinder

extensions: .sxi, .sxd, .sxc, .sxw, .sxm

meta data: location, application, language, title, subject, description, creationDate, keywords (currently only comma separated)

 

MS Office-files

classname: org.tm4j.tmharvest.md.MSOfficeMDFinder

extensions: .mpp, .vsd, .mdb, .ppt, .doc, .xls

meta data: location, title, author, lastAuthor, lastEditDate, keywords, creationDate, application, comments, template, pageCount, charCount, version

 

HTML files with Dublin Core metadata

This finder looks for <meta>-tags whose name begins with DC. (including the dot). For every such tag, it uses the name without the DC. prefix as the key.

classname: org.tm4j.tmharvest.md.HtmlMDFinder

extensions: .htm, .html

meta data: one entry for every <meta>-tag whose name starts with DC.
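The prefix rule can be illustrated with a small Java sketch. It is an assumption-laden stand-in, not the code of HtmlMDFinder: the class name is hypothetical, the tags are matched with regular expressions rather than an HTML parser, only double-quoted attributes are recognised, and the value is assumed to come from the content attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DublinCoreMetaSketch {

    private static final String PREFIX = "DC.";
    private static final Pattern META =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME =
            Pattern.compile("\\bname\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
            Pattern.compile("\\bcontent\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns one entry per <meta> tag whose name starts with "DC.",
    // keyed by the name with the prefix removed.
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META.matcher(html);
        while (tag.find()) {
            Matcher name = NAME.matcher(tag.group());
            Matcher content = CONTENT.matcher(tag.group());
            if (name.find() && name.group(1).startsWith(PREFIX) && content.find()) {
                result.put(name.group(1).substring(PREFIX.length()), content.group(1));
            }
        }
        return result;
    }
}
```

A tag such as <meta name="DC.title" content="Topic Maps"> would thus contribute the key title, while a tag named keywords would be ignored.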

 

<customProvider>

Specifies a Java class that implements the org.tm4j.harvest.data.DataProvider interface.

The <customProvider>-element defines the following attributes:

classname

The fully qualified name of a class that implements the org.tm4j.harvest.data.DataProvider interface.

Note

Keep in mind that your custom class must be on the TMHarvest classpath. For a discussion of how to add custom paths to the TMHarvest classpath, see the TMHarvest Tips section.

The <customProvider>-element contains the following children:

<param>

The <param>-element is used to set properties of the custom class. The class must define JavaBean-style setters (for example setFoo for a property named foo) for all properties that are set this way. The <param>-element defines two attributes: name and value.
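How name/value pairs might be mapped to setters can be sketched with Java reflection. This is an illustration under stated assumptions, not TMHarvest's code: the class names, the DemoProvider class, and the restriction to setters that take a single String are all hypothetical.

```java
import java.lang.reflect.Method;
import java.util.Map;

final class ParamInjectionSketch {

    // Hypothetical custom class with one bean-style property named "url".
    public static final class DemoProvider {
        private String url;

        public void setUrl(String url) {
            this.url = url;
        }

        public String getUrl() {
            return url;
        }
    }

    // For each pair, calls set<Name>(String) on the target object.
    static void applyParams(Object target, Map<String, String> params) {
        for (Map.Entry<String, String> param : params.entrySet()) {
            String name = param.getKey();
            String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            try {
                Method method = target.getClass().getMethod(setter, String.class);
                method.invoke(target, param.getValue());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("No usable setter " + setter, e);
            }
        }
    }
}
```

Under these assumptions, a <param> with name url and value jdbc:demo would result in a call to setUrl("jdbc:demo") on the custom class.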