[[ux_metadata]]

The UNICORE metadata service
----------------------------

UNICORE supports metadata management on a per-storage basis. This means, each storage
instance (for example, the user's home, or  a job working directory) has its own
metadata management service instance.

Metadata management is separated into two parts: a front end (which is a web service) and
a back end.

The front end service allows the user to manipulate and query metadata, as well as manually 
trigger the metadata extraction process. The back end is the actual implementation of the 
metadata management, which is pluggable and can be exchanged by custom implementations. 
The default implementation has the following properties

 - Apache Lucene for indexing,

 - Apache Tika for extracting metadata,

 - metadata is stored as files directly on the storage resource, in files with a special ".metadata" 
   suffix
 
 - the index files are stored on the UNICORE/X server, in a configurable directory

Enabling the metadata service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, UNICORE's service configuration file '<CONF>/wsrflite.xml' needs to be edited and the following
service definition added in the '<services>' section:

-----------------
 <!-- enable the metadata management service -->
 <service name="MetadataManagement" wsrf="true" persistent="true">
  <interface class="de.fzj.unicore.uas.MetadataManagement"/>
  <implementation class="de.fzj.unicore.uas.metadata.MetadataManagementHomeImpl"/>
 </service>
-----------------

You will also need to define which implementation should be used. This is done via properties, which can
be defined either in '<CONF>/wsrflite.xml' or '<CONF>/uas.config'.

In uas.config, set:

-----------------

#
# Metadata manager settings
#

coreServices.metadata.managerClass=eu.unicore.uas.metadata.LuceneMetadataManager

#
# use Tika for extracting metadata 
# (if you do not want this, remove this property)
#
coreServices.metadata.parserClass=org.apache.tika.parser.AutoDetectParser

#
# Lucene index directory:
#
# Configure a directory on the UNICORE/X machine where index
# files should be placed
#
coreServices.metadata.luceneDirectory=/tmp/data/luceneIndexFiles/

-----------------


Controlling metadata extraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a file named +.unicore_metadata_control+ is found in the 
base directory (i.e. where the crawler starts its crawling 
process), it is evaluated to decide which files should be
included or excluded in the metadata extraction process.

By default, all files are included in the extraction process, 
except those matching a fixed set of patterns (".svn", and 
the UNICORE metadata and control files themselves).

The file format is a standard "key=value" properties file.
Currently, the following keys are understood

 * +exclude+ a comma-separated list of string patterns of 
filenames to exclude
 * +include+ a comma-separated list of string patterns 
of filenames to include
 * +useDefaultExcludes+ if set to "false", the predefined 
exclude list will NOT be used

The include/exclude patterns may include wildcards '?' and '*'.

Examples:


To only include pdf and jpg files, you would use

-------
include=*.pdf,*.jpg
-------

To exclude all doc and ppt files,

-------
exclude=*.doc,*.ppt
-------

To include all pdf files except those whose name starts with 2011,

-------
include=*.pdf
exclude=2011*.pdf
-------
