Aperture Data Adapter

The data adaptation component implements extract, transform, and load (ETL) functionality to integrate existing data sources with OrganiK. It is based on the open-source Aperture Java content extraction framework.

The OrganiK approach here is to replicate existing data sources into the OrganiK database and index them there. The data is simplified and normalized to fit the OrganiK metaphors of document, blog post, tag, and wiki page. Links to the original data sources are kept, so users can navigate back to the original data. In general, the import process crawls data sources at regular intervals and checks for modifications. Aperture is integrated with OrganiK in two ways: first, it can be used to extract the full-text content of uploaded files; second, it can be used to crawl existing data sources and create OrganiK nodes for the resources found.

Features

  • Conversion of file attachments to plaintext for text indexing.
  • Extraction of plaintext for recommendations.
  • Mass import of data from external data sources.

Limitations

  • Live integration by listening to change events is not planned.
  • Dynamic integration of legacy data is also not planned. Dynamic integration is an alternative to crawling: it would integrate external sources via a service interface (such as SPARQL or other semantic web standards). As a consequence, the Semantic Search Component answers queries only by analysing data stored in OrganiK, not by querying auxiliary data sources live. This is conceivable in theory but will probably not be realized within the project's duration. Not supporting dynamic data integration improves performance and system stability, at the cost of requiring data replication, as shown in Sauermann, L. & Schwarz, S. (2005).

Aperture File Upload Fulltext Extractor

This builds on the core file-upload module. When nodes that have files attached are indexed for searching, the full-text content of the files is extracted and indexed as well. This allows finding nodes based on their attached files using the normal OrganiK search.
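To make the mechanism concrete, here is a minimal sketch in Drupal 6 style. The function names and the "/extract" endpoint path are assumptions for illustration, not the module's actual code:

  // Sketch: during search indexing, post each attached file to the
  // aperture-webserver and append the extracted plaintext to the index.
  function aperture_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
    if ($op == 'update index' && !empty($node->files)) {
      $text = '';
      foreach ($node->files as $file) {
        $text .= ' ' . aperture_extract_plaintext($file->filepath);
      }
      return $text;  // becomes part of the node's search index entry
    }
  }

  // Hypothetical helper: send the file content to the aperture-webserver
  // and return the plaintext from the response.
  function aperture_extract_plaintext($filepath) {
    $url = 'http://www.example.com:8080/aperture-webserver/extract';
    $headers = array('Content-Type' => 'application/octet-stream');
    $response = drupal_http_request($url, $headers, 'POST', file_get_contents($filepath));
    return $response->code == 200 ? $response->data : '';
  }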

How to use it?

After completing ApertureDataAdapter/Installation, you can verify that it works by going to the URL http://www.example.com:8080/aperture-webserver/ where you can extract an XML representation from files.

After a file is uploaded (and cron.php has run), its plaintext will be indexed and searchable.

Aperture External Datasource Crawler

How to use it?

ApertureDataAdapter is a service that crawls information from external sources into OrganiK; its main use is importing files. Aperture runs as a separate service (outside OrganiK), so it is somewhat more complex to configure and use.

Essentially, you have to configure a data source, start it crawling, and check the reports. The following sections walk through these steps.

Configuring Data Sources

In the ApertureDataAdapter Drupal module, you can add new data sources. Press the "datasources" button to see a list of possible data sources. Press Filesystem Data Source if you want to crawl files located on the same server where Aperture is installed. (A Windows file server crawler is under development; once it is available, you can press Windows File Server Data Source to crawl files located on a Windows network share.) In both cases, you need to have some configuration options at hand.

  • Note: this user interface is an IFRAME managed by Aperture and embedded in Drupal. A username and password for aperture-webserver authentication are automatically passed to this frame.
  • Name your data source. If it is a file server, something like "file server acme.com, projects folder" is a good name.
  • Enter the specific configuration of the data source. Read the notes carefully; this step is error-prone.
  • Select how often the data source should be crawled. You are asked for the number of seconds to wait between crawls. Hint: a day has 86400 seconds, a week 604800. If you enter a value, crawling starts immediately after you press save. If you enter no number at all, you have to run it manually (see below).
  • Press "save" to store this configuration. Note that it is stored inside the Aperture Java server, not in Drupal/OrganiK.
  • Crawling now starts automatically. You will not be informed about configuration problems at this point; you have to read the reports (see below).

Starting and stopping crawling

Your new data source should now show up in the list of data sources. If it is not crawling already, press start crawling on the data source. If you reload the data source list page frequently, you will notice the numbers and messages changing.

If all goes well, the newly crawled documents will now show up in your OrganiK. They are added under the user name of the person who initially configured ApertureDataAdapter and are tagged with the configured tag.

Checking that everything works

For each data source, once you click on its name, there is a detailed report button. Pressing it shows a page with:

  • Overall statistics of the data source: when it was last crawled, how many elements were crawled in the last crawl, and how many documents were crawled in total.
  • Log messages showing severe and warning messages reported by the data source crawler. These are a good help if the data source is not configured correctly and may explain why the software is not working.
  • If you click list all resources in the report, you get a list of all documents that were crawled and added to OrganiK. This list can be very long and take a long time to load. The change dates may be shown as cryptic numbers; their format depends on the specific crawler and data source used.
  • Note: problems may occur when ApertureDataAdapter tries to store crawled information within OrganiK. In that case, information is crawled but not stored in OrganiK. This is problematic, and you will find error messages in the report; the code still needs to be improved to handle these errors gracefully.

Changing a data source

If the password of a data source must be changed, or you want to change the crawl interval, click on the name of the data source in the list of data sources.

Note that if you change important settings, such as the folder whose files should be crawled, the data source is re-crawled under the assumption that nothing important has changed. If you changed the folder, the crawler will suddenly no longer find the already crawled files there and will delete all previously crawled documents, one by one. This can take days. In that case, rather delete the data source and add a new one; then all documents are deleted in one operation.

Deleting a data source

Press delete data source to delete a data source. It will ask you whether you are sure, and whether you also want to delete all crawled data. Decide and press delete.

Opening External Documents

External documents are crawled and added to OrganiK. They are represented using the externaldocument node type. The data includes the full text as the body, the title, the URL identifier of the document, the data source identifier, and a complete RDF description of the file in a further field.
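As a rough sketch of what storing a crawled resource could look like in Drupal 6 style (all field names below are assumptions, not the module's actual schema):

  // Sketch: store a crawled resource as an externaldocument node.
  // (uid, status, and other required node properties omitted for brevity)
  $node = new stdClass();
  $node->type  = 'externaldocument';
  $node->title = $title;                          // e.g. the file name
  $node->body  = $plaintext;                      // full text extracted by Aperture
  $node->field_url[0]['value']        = $uri;     // URL identifier of the document
  $node->field_datasource[0]['value'] = $ds_id;   // data source identifier
  $node->field_rdf[0]['value']        = $rdf;     // complete RDF description
  node_save($node);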

To open the original document, the externaldocument module adds a hook_menu entry to the node menu. The link shown for the URL does not point directly at the URL to be retrieved by the user's browser; instead it redirects through the server to allow special handling of certain external URI types (such as forwarding to a server or streaming the document from another service through Drupal to the end user). Clicking the document URL forwards the user to the URI /node/%nid%/openexternaldocument, which internally decides what to do.

Developers can extend this mechanism in the externaldocument.module function externaldocument_open($node).
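A simplified sketch of this mechanism in Drupal 6 style (the access check and the field name holding the URL are assumptions; see externaldocument.module for the actual code):

  // Sketch: register the open path and redirect to the original document.
  function externaldocument_menu() {
    $items['node/%node/openexternaldocument'] = array(
      'page callback'    => 'externaldocument_open',
      'page arguments'   => array(1),              // the loaded node
      'access arguments' => array('access content'),
      'type'             => MENU_CALLBACK,
    );
    return $items;
  }

  function externaldocument_open($node) {
    // Default behaviour: redirect the browser to the original document.
    // Extensions could instead stream the document through Drupal.
    $url = $node->field_url[0]['value'];           // field name is an assumption
    drupal_goto($url);
  }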

How to install it?

See ApertureDataAdapter/Installation.

How does it work?

Aperture is a crawling and extraction framework (if you wish, the "extract" part of data-warehousing ETL). It consists of many extractors that convert binary files to RDF (typically serialized as XML), and of many crawlers that can retrieve large numbers of objects from structured data sources, such as file directories, Microsoft Outlook, or IMAP email accounts.

Configuration of aperture-webserver

The Aperture webserver is configured via a very simple REST interface: a servlet at the address "config/api" accepts GET and PUT requests. The webserver is protected by HTTP authentication with a username and password. Note that the passwords and all other data configured in aperture-webserver can be retrieved via GET.
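For illustration, reading and writing the configuration from PHP could look as follows; this is a sketch only, and the payload format expected by config/api is not documented here:

  // Sketch: query and update the aperture-webserver configuration servlet.
  $base = 'http://www.example.com:8080/aperture-webserver/config/api';
  $auth = array('Authorization' => 'Basic ' . base64_encode('user:password'));

  // GET returns the current configuration (including stored passwords).
  $response = drupal_http_request($base, $auth, 'GET');
  print $response->data;

  // PUT uploads a new configuration ($config is a placeholder document).
  $response = drupal_http_request($base, $auth, 'PUT', $config);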

Database Modifications

This module adds

  • the content type externaldocument

People

LeoSauermann created the module, with contributions by GunnarGrimnes.

Sourcecode

Subversion

For read-only Subversion access, use http instead.

Alternatives

If we had not used Aperture, we would have used:
