File migration
The file migration-config.xml defines the rules for format migrations and the tools used to perform them. It is used in the step Ingest: migrate files.
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <application id="1"
                 name="ImageMagick"
                 executable="D:\docuteam\apps\ImageMagick\convert.exe"
                 parameter="-compress#none#{[arg1]}#{[arg2]}" />
    ...
</config>
This example defines ImageMagick as the application with id 1. The program to be executed is convert.exe, located in the folder D:\docuteam\apps\ImageMagick. The parameter string -compress#none#{[arg1]}#{[arg2]} is passed to the program call, where {[arg1]} is replaced by the source file and {[arg2]} by the target file.
The second part of the file migration-config.xml consists of instructions for format migrations.
<puid name="fmt/41"
      applicationID="1"
      targetExtension="tif"
      targetPronom="fmt/353" />
The example defines that files with a PUID (PRONOM's Persistent Unique Identifier) fmt/41 (Raw JPEG Stream) should be converted to a file with PUID fmt/353 (Tagged Image File Format). The application defined above with the application number 1 (here ImageMagick) should be used.
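To show how the two parts relate, the following sketch places the application definition and the PUID rule from the examples above in a single file. It assumes that rule elements sit inside the same config root element as the application definitions. The examples here do not show that nesting explicitly, so the layout should be checked against an actual migration-config.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- Part 1: the tool. Its id is the value that rules refer to. -->
    <application id="1"
                 name="ImageMagick"
                 executable="D:\docuteam\apps\ImageMagick\convert.exe"
                 parameter="-compress#none#{[arg1]}#{[arg2]}" />

    <!-- Part 2: the rule. applicationID="1" points at the application above.
         Assumption: rules are direct children of the config element. -->
    <puid name="fmt/41"
          applicationID="1"
          targetExtension="tif"
          targetPronom="fmt/353" />
</config>
```

The only link between the two parts is the number: the applicationID attribute of a rule must match the id attribute of an application.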
In addition to PUIDs, rules can be defined for MIME types and file extensions. Migration according to PUID has first priority. If this is not successful, the migration is attempted according to the MIME type. If this also fails, the file extension is used:
<puid name="fmt/41"
      applicationID="1"
      targetExtension="tif"
      targetPronom="fmt/353" />
<mimeType name="image/jpeg"
          applicationID="1"
          targetExtension="tif"
          targetPronom="fmt/353" />
<extension name="jpg"
           applicationID="1"
           targetExtension="tif"
           targetPronom="fmt/353" />
It is possible to convert a document in multiple steps:
<extension name="msg" premisConverterName="Outlook msg extraction and attachments normalization">
    <step applicationID="9" copy="1"/>
    <step applicationID="100" excludeExtensions="msg" excludeMimeTypes="application/vnd.ms-outlook"/>
    <step applicationID="10" includeExtensions="msg" targetExtension="eml" copy="1"/>
</extension>
In the above example, the first step extracts the attachments from the email message using MsgAttachmentsExtractor, while keeping a copy of the original file (via the copy="1" option). The second step converts the extracted attachments by calling FileConverter, which applies all defined migration rules. The original email message file is excluded from this step to avoid infinite recursion. Finally, the third step converts the MSG file to EML format using an external application.
In the following configuration we use multiple steps to first extract the content of a ZIP container and then migrate the files it contains.
<extension name="zip">
    <step applicationID="9" />
    <step applicationID="100" />
</extension>
The following attributes can be used to configure multi-step behavior:
| Attribute | Description |
|---|---|
| includeExtensions | Only consider files with the specified list of extensions for this step. |
| includeMimeTypes | Only consider files with the specified list of MIME types for this step. |
| includePuids | Only consider files with the specified list of PUIDs for this step. |
| excludeExtensions | Do not consider files with the specified list of extensions for this step. |
| excludeMimeTypes | Do not consider files with the specified list of MIME types for this step. |
| excludePuids | Do not consider files with the specified list of PUIDs for this step. |
| copy | When set to 1, the original file is copied to the destination. This option should be used if a subsequent migration step still needs the original file. For example, the first step might extract attachments from an MSG email, while a later step migrates the MSG file into EML format. Using this setting merely to preserve the original file is considered bad practice; the keepOriginals option of the SIPFileMigrator should be used instead. |
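
As an illustration of the filter attributes, the following sketch restricts the second step of the ZIP example to files identified as fmt/41. It reuses the application IDs 9 and 100 from the examples above. Applying includePuids to this particular rule is an assumption made for illustration. Only a single value is shown, because this page does not state how multiple list entries are separated.

```xml
<extension name="zip">
    <!-- Step 1: extract the container, as in the ZIP example above -->
    <step applicationID="9" />
    <!-- Step 2: apply the migration rules, but only to extracted files
         identified as fmt/41 (hypothetical filter for illustration) -->
    <step applicationID="100" includePuids="fmt/41" />
</extension>
```

Under this sketch, extracted files with any other PUID would be left untouched by the second step.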
The attribute premisConverterName can be used on the puid, mimeType or extension tags to specify the application text in the migration PREMIS event.
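A minimal sketch of premisConverterName on a puid tag, based on the fmt/41 rule shown earlier. The converter text is a hypothetical value chosen for illustration, and combining this attribute with a single-step rule (rather than a rule containing step elements) is assumed to work the same way.

```xml
<!-- premisConverterName value is a made-up example text -->
<puid name="fmt/41"
      applicationID="1"
      targetExtension="tif"
      targetPronom="fmt/353"
      premisConverterName="JPEG to TIFF conversion with ImageMagick" />
```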
File formats that are not listed (whether by PUID, MIME type, or file extension) are not migrated.